Error in reading Unicode text from Excel file to R #125
Comments
Hi, I am Vietnamese too, and I had exactly the same issue as you. Surprisingly, the solution is simple and straightforward. I updated the system parameter as below. After that, Vietnamese characters are read correctly by read_excel. Please note that if you print to the console, you can see the Vietnamese characters, but if you use the View command to view the dataset, R may not display them properly. Please double-check whether my solution works for you. |
Hi nortonle, |
Hi, it's weird. I am running on Windows too. I executed the following command on my work laptop (Windows 7), and it worked. I just tried executing it on my personal laptop (Windows 10). Although it gave a warning, it worked perfectly fine. Would you please share your error message? |
Hi nortonle, Here is a warning message (like yours): Warning message: When I read data to an object aaa using this command: aaa=read_excel("C:/Book1.xls") The result is: Hố khoan instead of "Hố khoan" I run in RStudio 0.99.467 ; R version 3.2.2 ; Windows 7 32bit. |
Hi, |
I was facing a problem like yours. I think the problem comes from viewing your whole data frame/data table directly. Try viewing only one field/column.
|
It would be helpful to get some example sheets from the people in this thread. Also: please clarify how you are inspecting the imported data frame (for example, printing the data frame in the Console versus using
How to provide a readxl reprex
We're in a much better position to address your issue if you can provide a reprex (reproducible example). Provide as much of this as you can:
How to provide your own xls/xlsx file? In order of preference:
|
Hi jennybc,
library(readxl)
read_excel("D:/samples.xlsx")
#> Mô t<U+1EA3>
#> 1 BÙN SÉT
#> 2 SÉT l<U+1EAB>n s<U+1EA1>n laterite
#> 3 C�T THÔ l<U+1EAB>n s<U+1EA1>n th<U+1EA1>ch anh
#> 4 C�T M<U+1ECA>N
#> 5 C�T TRUNG
#> 6 SÉT l<U+1EAB>n k<U+1EBF>t vón silic
Session info
sessionInfo()
#> R version 3.3.2 (2016-10-31)
#> Platform: i386-w64-mingw32/i386 (32-bit)
#> Running under: Windows 7 (build 7601) Service Pack 1
#>
#> locale:
#> [1] LC_COLLATE=English_United States.1252
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] readxl_0.1.1
#>
#> loaded via a namespace (and not attached):
#> [1] backports_1.0.5 magrittr_1.5 rprojroot_1.2 tools_3.3.2
#> [5] htmltools_0.3.5 yaml_2.1.14 Rcpp_0.12.9 stringi_1.1.2
#> [9] rmarkdown_1.3 knitr_1.15.1 stringr_1.2.0 digest_0.6.12
#> [13] evaluate_0.10 |
@leminhson can you please confirm that you're on Windows? @jennybc to diagnose on the Mac, you'll need
Sys.setlocale(, "en_US.ISO8859-1")
read_excel("~/Desktop/samples.xlsx")
This is almost certainly caused by assigning |
@hadley |
@leminhson changing locales will not fix the problem because you are on windows. |
Here's what I see on my Mac in a branch where I am (print) debugging: devtools::load_all(".")
#> Loading readxl
#> Re-compiling readxl
#> <output suppressed>
df <- read_excel(test_sheet("vietnamese-characters.xlsx"))
#> storing a column name: Mô tả
#> storing cell contents: BÙN SÉT
#> storing cell contents: SÉT lẫn sạn laterite
#> storing cell contents: CÁT THÔ lẫn sạn thạch anh
#> storing cell contents: CÁT MỊN
#> storing cell contents: CÁT TRUNG
#> storing cell contents: SÉT lẫn kết vón silic
df
#> # A tibble: 6 × 1
#> `Mô tả`
#> <chr>
#> 1 BÙN SÉT
#> 2 SÉT lẫn sạn laterite
#> 3 CÁT THÔ lẫn sạn thạch anh
#> 4 CÁT MỊN
#> 5 CÁT TRUNG
#> 6 SÉT lẫn kết vón silic
Encoding(df[[1]])
#> [1] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8"
df[[1]][1]
#> [1] "BÙN SÉT"
as.matrix(df)
#> Mô tả
#> [1,] "BÙN SÉT"
#> [2,] "SÉT lẫn sạn laterite"
#> [3,] "CÁT THÔ lẫn sạn thạch anh"
#> [4,] "CÁT MỊN"
#> [5,] "CÁT TRUNG"
#> [6,] "SÉT lẫn kết vón silic"
Then I do as you say and change the locale:
Sys.setlocale(locale = "en_US.ISO8859-1")
#> [1] "en_US.ISO8859-1/en_US.ISO8859-1/en_US.ISO8859-1/C/en_US.ISO8859-1/en_CA.UTF-8"
df <- read_excel(test_sheet("vietnamese-characters.xlsx"))
#> storing a column name: Mô tả
#> storing cell contents: BÙN SÉT
#> storing cell contents: SÉT lẫn sạn laterite
#> storing cell contents: CÁT THÔ lẫn sạn thạch anh
#> storing cell contents: CÁT MỊN
#> storing cell contents: CÁT TRUNG
#> storing cell contents: SÉT lẫn kết vón silic
df
#> # A tibble: 6 � 1
#> `M� t<U+1EA3>`
#> <chr>
#> 1 B�N S�T
#> 2 S�T l<U+1EAB>n s<U+1EA1>n laterite
#> 3 C�T TH� l<U+1EAB>n s<U+1EA1>n th<U+1EA1>ch anh
#> 4 C�T M<U+1ECA>N
#> 5 C�T TRUNG
#> 6 S�T l<U+1EAB>n k<U+1EBF>t v�n silic
Encoding(df[[1]])
#> [1] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8"
df[[1]][1]
#> [1] "B�N S�T"
as.matrix(df)
#> M� t<U+1EA3>
#> [1,] "B�N S�T"
#> [2,] "S�T l<U+1EAB>n s<U+1EA1>n laterite"
#> [3,] "C�T TH� l<U+1EAB>n s<U+1EA1>n th<U+1EA1>ch anh"
#> [4,] "C�T M<U+1ECA>N"
#> [5,] "C�T TRUNG"
#> [6,] "S�T l<U+1EAB>n k<U+1EBF>t v�n silic"
On the C++ side, it seems like the encoding is correctly specified. And, if I'm interpreting the above correctly, the character vector arrives in R with UTF-8 encoding. And yet something is clearly not right. I can work on other things for now and we could discuss on Friday. |
How are you doing the print debugging? In C++ or R? C++ doesn't have a notion of string encoding. I think this suggests rapidxml isn't converting to UTF-8. You'll probably need to print the binary representation as hex to debug. |
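For anyone following along from the R side, the hex check suggested above can be sketched roughly as follows. This is only an illustration using base R's `charToRaw()`; the variable names are made up for the example, and the string is written with `\u` escapes so the result does not depend on the encoding of the source file.

```r
# Illustrative only: inspect the raw bytes of a string as hex.
# "\u00d9" is U+00D9 (capital U with grave), "\u00c9" is U+00C9 (capital E with acute).
x <- "B\u00d9N S\u00c9T"

# enc2utf8() makes sure we look at the UTF-8 byte sequence.
bytes <- charToRaw(enc2utf8(x))

# as.character() on a raw vector yields two-digit lowercase hex strings.
hex_dump <- paste(as.character(bytes), collapse = " ")
print(hex_dump)
```

If the bytes match the expected UTF-8 sequence for each character, the string itself is intact and any garbling is happening at display time.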
@leminhson what do you see if you run this code?
x <- "BÙN SÉT"
x |
I think changing the locale is the only way to fix the problem ;) @leminhson should set his locale to a Windows codepage with support for Vietnamese characters, then everything will display properly. |
@leminhson Does this solve the problem? Sys.setlocale("LC_CTYPE", "English_United States.1258") |
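A small sketch of how one might check whether a locale request like the one above was actually honored. `try_ctype` is a hypothetical helper written for this illustration (it is not part of readxl or base R); it relies on `Sys.setlocale()` returning an empty string, with a warning, when the operating system refuses the request. The locale name is the Windows-style one from the comment and will most likely be refused on other platforms.

```r
# Hypothetical helper (not part of readxl or base R): try to switch LC_CTYPE
# and report whether the operating system accepted the request.
try_ctype <- function(locale) {
  # Sys.setlocale() returns "" (with a warning) when the request is refused.
  result <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  nzchar(result)
}

old_ctype <- Sys.getlocale("LC_CTYPE")

# Windows-style locale name taken from the thread.
accepted <- try_ctype("English_United States.1258")
message("Locale accepted: ", accepted)

# Restore the previous setting so the rest of the session is unaffected.
invisible(Sys.setlocale("LC_CTYPE", old_ctype))
```

Restoring the old value afterwards keeps the experiment from leaking into later printing tests.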
Here's some output from a branch that prints details on strings. It suggests that the strings are read and stored correctly and encoded as UTF-8. So, as the conversation above indicates, this appears to be a printing problem and presumably one that is not specific to readxl and the tibbles it produces. I hope that @lionel-'s suggestion to switch locales is fruitful.
> devtools::load_all()
Loading readxl
> Sys.getlocale()
[1] "en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8"
> Sys.setlocale(locale = "en_US.ISO8859-1")
[1] "en_US.ISO8859-1/en_US.ISO8859-1/en_US.ISO8859-1/C/en_US.ISO8859-1/en_CA.UTF-8"
> Sys.getlocale()
[1] "en_US.ISO8859-1/en_US.ISO8859-1/en_US.ISO8859-1/C/en_US.ISO8859-1/en_CA.UTF-8"
> df <- read_excel(test_sheet("vietnamese-characters.xlsx"))
Mô tả
str.size() = 8
M: 4d
�: c3
�: b4
: 20
t: 74
�: e1
�: ba
�: a3
BÙN SÉT
str.size() = 9
B: 42
�: c3
�: 99
N: 4e
: 20
S: 53
�: c3
�: 89
T: 54
SÉT lẫn sạn laterite
str.size() = 25
S: 53
�: c3
�: 89
T: 54
: 20
l: 6c
�: e1
�: ba
�: ab
n: 6e
: 20
s: 73
�: e1
�: ba
�: a1
n: 6e
: 20
l: 6c
a: 61
t: 74
e: 65
r: 72
i: 69
t: 74
e: 65
CÁT THÔ lẫn sạn thạch anh
str.size() = 33
C: 43
�: c3
�: 81
T: 54
: 20
T: 54
H: 48
�: c3
�: 94
: 20
l: 6c
�: e1
�: ba
�: ab
n: 6e
: 20
s: 73
�: e1
�: ba
�: a1
n: 6e
: 20
t: 74
h: 68
�: e1
�: ba
�: a1
c: 63
h: 68
: 20
a: 61
n: 6e
h: 68
CÁT MỊN
str.size() = 10
C: 43
�: c3
�: 81
T: 54
: 20
M: 4d
�: e1
�: bb
�: 8a
N: 4e
CÁT TRUNG
str.size() = 10
C: 43
�: c3
�: 81
T: 54
: 20
T: 54
R: 52
U: 55
N: 4e
G: 47
SÉT lẫn kết vón silic
str.size() = 27
S: 53
�: c3
�: 89
T: 54
: 20
l: 6c
�: e1
�: ba
�: ab
n: 6e
: 20
k: 6b
�: e1
�: ba
�: bf
t: 74
: 20
v: 76
�: c3
�: b3
n: 6e
: 20
s: 73
i: 69
l: 6c
i: 69
c: 63
> df
# A tibble: 6 � 1
`M� t<U+1EA3>`
<chr>
1 B�N S�T
2 S�T l<U+1EAB>n s<U+1EA1>n laterite
3 C�T TH� l<U+1EAB>n s<U+1EA1>n th<U+1EA1>ch anh
4 C�T M<U+1ECA>N
5 C�T TRUNG
6 S�T l<U+1EAB>n k<U+1EBF>t v�n silic
## these are the correct bytes for all of these characters
> (z <- names(df)[1])
[1] "M� t<U+1EA3>"
> charToRaw(z)
[1] 4d c3 b4 20 74 e1 ba a3
> (z <- df[[1]][1])
[1] "B�N S�T"
> charToRaw(z)
[1] 42 c3 99 4e 20 53 c3 89 54 |
@hadley Yes. If I type directly in R command window, Vietnamese characters are displayed correctly
|
@leminhson What does Sys.setlocale("LC_CTYPE", "English_United States.1258") |
After switching the locale to 1258, what do you get when you do this on your sample data frame:
|
Your current default locale
Note I asked to see output of In your default locale, where the string below prints correctly, what encoding is reported? I.e. run this
Sys.getlocale()
#> [1] "en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8"
x <- 'BÙN SÉT'
x
#> [1] "BÙN SÉT"
Encoding(x)
#> [1] "UTF-8"
Re: changing your locale:
You may not be able to change the locale in a running RStudio session on Windows, so let's not give up on @lionel-'s suggestion just yet. You might need to put Also, will you run these tests (reading your example sheet and the above) in R in the Console, i.e. not in RStudio, just to make sure that has nothing to do with it? |
might be relevant: https://stat.ethz.ch/pipermail/r-devel/2015-May/071250.html |
@jennybc Here is the result from your instruction:
|
@lionel- Yeah, that is why I included |
Also relevant 😐: Improve UTF-8 support on Windows, RConsortium/wishlist#2 by @kevinushey |
so @leminhson, does If not, what happens when you do this:
Sys.setlocale("LC_CTYPE", "English_United States.1258")
df[[1]]
enc2native(df[[1]]) |
@jennybc Here is the result when running from R console (not in RStudio):
These Unicode character codes are correct. However, the characters themselves are not displayed; only their codes are shown. |
@leminhson are you running the latest GitHub version of readxl? |
@hadley Yes. The version of readxl is 0.1.1 |
@leminhson did you miss my comment? #125 (comment) |
@leminhson that's the current CRAN version, not the current github version. Please run |
@hadley The current github version of readxl is 0.1.1.9000
However, if I follow the steps from @jennybc without using the reprex package, the result in the R console is perfect. But if we view the data frame df, the result shows only Unicode character codes.
|
Ok, I'm happy that the problem is not on the readxl end, but instead lies somewhere else. |
@hadley Yes. But we do not know which part is the cause yet. If I type Vietnamese characters directly in R or in RStudio, the result is always correct. That means the problem is caused neither by the locale setting nor by the display function... |
Collectively, we've now spent a lot of time on this issue. It is getting very close to the point where I don't think we can afford to spend more. It's a bummer that we might not be able to fully resolve your issue, but we don't have unlimited resources and this problem is clearly only affecting a very small number of people, and it's highly likely that readxl is already doing all that it can. |
And it now seems clear that it's a data frame printing bug in R; I don't think this is related to readxl. @leminhson |
The ultimate problem is likely just that R's This implies that, if you have UTF-8 characters that are not representable in the current locale, you're hosed. |
I'm going to close this. We've established it's not a readxl-specific issue, but an example of general printing difficulty with Windows + R data frames + Unicode characters. Thanks for all the help everyone! This thread will still be a useful reference going forward. |
@leminhson I might also add: as we've said, these strings are being read and stored just fine, this is "only" a printing problem. So if you can tolerate the ugly, you can work with the data frame as it is. But if you really want nice printing, you might explicitly convert these strings from UTF-8 to Latin-1 by using |
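The function name is cut off in the comment above, so the following is only a guess at what such a conversion might look like, using base R's `iconv()` as an assumption rather than the commenter's stated method. It only helps for characters that exist in Latin-1, such as Ù and É. Characters like ẫ (U+1EAB) are not representable there, and how `iconv()` handles them is platform-dependent, so that case is deliberately left out.

```r
# Assumption: base R's iconv() stands in for the unnamed conversion function.
x <- "B\u00d9N S\u00c9T"   # every character here exists in Latin-1

# Convert from UTF-8 to Latin-1; each accented letter becomes a single byte.
x_latin1 <- iconv(x, from = "UTF-8", to = "latin1")

latin1_bytes <- charToRaw(x_latin1)
print(latin1_bytes)
```

For text that mixes in characters outside Latin-1, as much of the Vietnamese sample does, this approach would not give a faithful result, which is why the locale-based suggestions earlier in the thread may still be needed.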
The first suggestion in the very top - just type in console: Sys.setlocale("LC_ALL", 'en_US.UTF-8') |
Now the readxl package can read Vietnamese characters without any error. I do not know the reason: is it the new version of R (3.5.0) or the new version of readxl (1.1.0)? |
I had updated R and all packages but still needed to apply the fix to see the names of cities in Poland as proper text rather than strange codes that did not make sense.
|
I had the same issue, years later. I just saved the excel file as a csv, didn't have any problems after that. |
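For readers trying the CSV route, a minimal sketch of reading UTF-8 text from a CSV in base R might look like the following. The file, the column name `description`, and the sample values are all made up for the illustration; the `encoding = "UTF-8"` argument to `read.csv()` marks the strings that are read as UTF-8.

```r
# Illustrative data: two of the sample strings, written with \u escapes.
x <- c("B\u00d9N S\u00c9T", "C\u00c1T M\u1ecaN")

# Write a small UTF-8 CSV to a temporary file (hypothetical column name).
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeLines(c("description", enc2utf8(x)), con, useBytes = TRUE)
close(con)

# Read it back, marking character data as UTF-8.
df <- read.csv(path, encoding = "UTF-8", stringsAsFactors = FALSE)
print(Encoding(df$description))
```

Whether the values then print nicely still depends on the locale, as discussed above, but the bytes should survive the round trip.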
Thank you for your solution.
|
# set local encoding for Serbian language
# load the packages
# load the dataset
# look at the dataset
# Take only the columns needed
# See what we got
# Get a library for translating Cyrillic to Latin
# take all the string data from dositej
# Calculate how much are there math free norm in Novi Sad
Note: Thanks R for solving the issues! :) |
The command read_excel reads Unicode string from Excel to R and returns a string with non-Unicode characters.
Ex: A string "Sét lẫn laterite" is converted to "Sét l<U+1EAB>n laterite"
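To make the `<U+1EAB>` notation in the example concrete, here is a small sketch showing how that escape corresponds to the character's Unicode code point. It uses only base R's `utf8ToInt()` and `sprintf()`; the variable names are illustrative.

```r
# The character from the example, written as an escape so the source stays ASCII.
ch <- "\u1eab"   # Latin small letter a with circumflex and tilde

# Look up the code point and format it the way R's escaped output does.
code_point <- utf8ToInt(ch)
escaped <- sprintf("<U+%04X>", code_point)
print(escaped)
```

So the escaped form is just the code point of a character that is present in the data but not being displayed as a glyph.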