
Should be able to supply encoding #40

Closed
hadley opened this Issue Jun 19, 2014 · 8 comments

@hadley
Member

hadley commented Jun 19, 2014

Output should always be UTF-8.

@hadley


Member

hadley commented Jun 23, 2014

@romainfrancois can you look into this? We need to be able to accept arbitrary encodings and convert to UTF-8 for R.

@romainfrancois


Member

romainfrancois commented Jun 23, 2014

I'll have a look at how it is done in R. Encoding is something I don't quite understand yet, so it has been ignored in Rcpp, etc.

So would this be an argument to the function, or would we have to detect the encoding somehow?

@hadley


Member

hadley commented Jun 23, 2014

It would be an argument to the function. Detecting encoding automatically is difficult; the stringi package has some code using ICU (see the bottom of http://docs.rexamine.com/stringi/compat_tab_conversion.html).
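For reference, a minimal sketch of ICU-based detection via stringi (assuming the package is installed; the string here is just an illustration):

library(stringi)

# Encode a UTF-8 string as latin1 bytes, then ask ICU to guess the encoding.
# stri_enc_detect() returns candidate encodings ranked by confidence.
raw_bytes <- stri_encode("déjà vu, naïveté", to = "latin1", to_raw = TRUE)[[1]]
stri_enc_detect(raw_bytes)

The confidence scores make clear why this is hard to turn on by default: short inputs often yield several plausible candidates.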

@hadley


Member

hadley commented Mar 9, 2015

I think I have a handle on how to do this now: we need to use iconv.

@pachevalier


pachevalier commented Apr 10, 2015

Currently there is no automatic conversion to UTF-8. I think we could automatically detect the encoding of a file using the chardet command-line tool:

> system("chardet sources/DE_PF_et_FA.txt")
sources/DE_PF_et_FA.txt: windows-1252 with confidence 0.73

It seems to work pretty well. The fileEncoding option would still be useful to me.

@hadley


Member

hadley commented Apr 10, 2015

@blaquans Right, there's no automatic conversion because this is an open issue. Character-encoding detection is difficult to do well automatically, and I think it is dangerous to turn on by default.

@okumuralab


okumuralab commented Apr 11, 2015

Base functions accept an encoding argument, e.g. read.csv(..., fileEncoding = "SJIS").
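For comparison, the base-R pattern looks like this (the filename is hypothetical):

# fileEncoding names the encoding of the file on disk; base R
# re-encodes the connection while reading.
df <- read.csv("data-sjis.csv", fileEncoding = "SJIS", stringsAsFactors = FALSE)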

@hadley


Member

hadley commented Sep 9, 2015

The interface will probably get nicer, but this now works :)

x <- c("こんにちは")
x
#> [1] "こんにちは"
Encoding(x)
#> [1] "UTF-8"

y <- iconv(x, "UTF-8", "shift-jis")
y
#> [1] "\x82\xb1\x82\xf1\x82\u0242\xbf\x82\xcd"
Encoding(y)
#> [1] "unknown"

ja <- locale("ja", encoding = "shift-jis")
z <- parse_character(y, locale = ja)
z
#> [1] "こんにちは"
Encoding(z)
#> [1] "UTF-8"

@lock lock bot locked and limited conversation to collaborators Sep 25, 2018
