Wrong encoding detection #25

Closed
artemklevtsov opened this issue Nov 24, 2014 · 4 comments

Comments

@artemklevtsov artemklevtsov commented Nov 24, 2014

Hi.

rvest::html("http://winrus.com/cpage_r.htm")

returns garbled characters, but

XML::htmlParse("http://winrus.com/cpage_r.htm")

works without any additional steps.

Member

@hadley hadley commented Nov 24, 2014

Hmmm, that's a weird one - I can't even figure out what the encoding is supposed to be.

Author

@artemklevtsov artemklevtsov commented Nov 24, 2014

It's cp1251. I have another example: http://psytests.org/.

Member

@hadley hadley commented Nov 25, 2014

A few more diagnostics:

library("httr")
library("rvest")
url <- "http://psytests.org"

# No charset is declared in the Content-Type header of the HTTP response
r <- GET(url)
headers(r)$`Content-Type`

# So the default text decoding in httr gives garbled output
content(r, "text")

# stringi thinks encoding is ISO-8859-1
as.data.frame(stringi::stri_enc_detect(content(r, "raw"))[[1]])

# But it's not
stringi::stri_encode(content(r, "raw"), "ISO-8859-1")

# It's actually cp1251
stringi::stri_encode(content(r, "raw"), "cp1251")

# Which also works when we give it to content
content(r, "text", encoding = "cp1251")

# But not when we give it to rvest::html
rvest::html("http://psytests.org", encoding = "cp1251")
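
For anyone who wants to see the mismatch without hitting the network, here is a minimal base R sketch. The four bytes are my own illustration (the Russian word "тест" in cp1251), not data taken from either site, and it assumes the local iconv knows the name "CP1251".

```r
# Illustrative bytes only: the word "тест" encoded as cp1251 (Windows-1251).
raw_bytes <- as.raw(c(0xF2, 0xE5, 0xF1, 0xF2))
bytes_as_string <- rawToChar(raw_bytes)

# Decoding as ISO-8859-1 maps each byte to a Latin-1 character (mojibake).
wrong <- iconv(bytes_as_string, from = "ISO-8859-1", to = "UTF-8")

# Decoding as cp1251 recovers the Cyrillic letters.
right <- iconv(bytes_as_string, from = "CP1251", to = "UTF-8")

# Compare code points instead of printed text, so the check does not
# depend on the console locale.
utf8ToInt(wrong)  # Latin-1 range: 0xF2 0xE5 0xF1 0xF2
utf8ToInt(right)  # Cyrillic range: U+0442 U+0435 U+0441 U+0442
```

The same bytes are valid in both encodings, which is probably why automatic detection can guess ISO-8859-1 here.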

@artemklevtsov
Copy link
Author

@artemklevtsov artemklevtsov commented Nov 26, 2014

rvest::html("http://psytests.org", encoding = "cp1251")

Translates to

rvest:::html.response(httr::GET("http://psytests.org"), encoding = "cp1251")

In rvest:::html.response we can see that text <- httr::content(x, "text") is called without the encoding argument, so XML::htmlParse is then applied to the already broken string.
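
Until that is fixed, a possible workaround is to decode the body explicitly and only then parse it. This is a rough sketch, not rvest code: parse_html_with_encoding is a hypothetical helper name, and it assumes httr::content() returns UTF-8 text once it is given the right source encoding.

```r
# Hypothetical helper, not part of rvest: decode explicitly, then parse.
parse_html_with_encoding <- function(url, encoding) {
  response <- httr::GET(url)
  # Decode the raw body with the caller-supplied encoding instead of
  # relying on the (missing) charset in the Content-Type header.
  text <- httr::content(response, as = "text", encoding = encoding)
  # asText = TRUE tells htmlParse the input is HTML content, not a file name.
  # Assumption: the decoded string is UTF-8 at this point.
  XML::htmlParse(text, asText = TRUE, encoding = "UTF-8")
}

# Usage (needs network access, so left commented out):
# doc <- parse_html_with_encoding("http://psytests.org", "cp1251")
```

I have not checked this against every page; a conflicting meta charset tag in the document itself might still need separate handling.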
