
Should be able to supply encoding #40

Closed
hadley opened this issue Jun 19, 2014 · 8 comments

Comments

@hadley (Member) commented Jun 19, 2014

Output should always be utf-8

@hadley (Member, Author) commented Jun 23, 2014

@romainfrancois can you look into this? We need to be able to accept arbitrary encoding and convert to utf-8 for R.

@romainfrancois (Member) commented Jun 23, 2014

I'll have a look at how it's done in R. Encoding is something I don't quite understand yet, so it has been ignored in Rcpp, etc.

So would this be an argument to the function, or would we have to detect the encoding somehow?

@hadley (Member, Author) commented Jun 23, 2014

It would be an argument to a function. Detecting encoding automatically is difficult; the stringi package has some code using ICU (see the bottom of http://docs.rexamine.com/stringi/compat_tab_conversion.html).
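For context, the ICU detector that stringi wraps can be called like this — a minimal sketch, assuming a hypothetical file name; `stri_enc_detect()` returns candidate encodings ranked by confidence:

```r
# Sketch: stringi wraps ICU's charset detector.
library(stringi)

raw <- readBin("imports.csv", what = "raw", n = 10000)  # hypothetical file
guess <- stri_enc_detect(raw)[[1]]
guess$Encoding    # candidate encodings, best guess first
guess$Confidence  # confidence scores in [0, 1]
```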

@hadley (Member, Author) commented Mar 9, 2015

I think I have a handle on how to do this now: we need to use iconv.
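For illustration, a minimal sketch of what iconv-based conversion looks like from the R side, assuming latin1 input:

```r
# Convert latin1 bytes to UTF-8 with iconv().
x <- "caf\xe9"                    # "café" encoded as latin1 bytes
Encoding(x) <- "latin1"           # declare the source encoding
y <- iconv(x, from = "latin1", to = "UTF-8")
Encoding(y)                       # now marked "UTF-8"
```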

@pachevalier commented Apr 10, 2015

Actually, there is no automatic conversion to UTF-8. I think we could detect the encoding of a file automatically using the chardet command-line tool.

> system("chardet sources/DE_PF_et_FA.txt")
sources/DE_PF_et_FA.txt: windows-1252 with confidence 0.73

It seems to work pretty well. The fileEncoding option would still be useful to me.

@hadley (Member, Author) commented Apr 10, 2015

@blaquans right, there's no automatic conversion because this is an open issue. Character-encoding detection is difficult to do well automatically, and I think it is dangerous to turn on by default.

@okumuralab commented Apr 11, 2015

Base R functions accept an encoding argument, e.g. read.csv(..., fileEncoding = "SJIS").

@hadley (Member, Author) commented Sep 9, 2015

The interface will probably get nicer, but this now works :)

x <- c("こんにちは")
x
#> [1] "こんにちは"
Encoding(x)
#> [1] "UTF-8"

y <- iconv(x, "UTF-8", "shift-jis")
y
#> [1] "\x82\xb1\x82\xf1\x82\u0242\xbf\x82\xcd"
Encoding(y)
#> [1] "unknown"

ja <- locale("ja", encoding = "shift-jis")
z <- parse_character(y, locale = ja)
z
#> [1] "こんにちは"
Encoding(z)
#> [1] "UTF-8"
lock bot locked and limited conversation to collaborators on Sep 25, 2018