Error in reading Unicode text from Excel file to R #125

Closed
leminhson opened this issue Sep 3, 2015 · 46 comments
Labels
bug an unexpected problem or unintended behavior

@leminhson

read_excel() reads Unicode strings from an Excel file into R, but some characters in the returned strings are replaced with escape codes.

Example: the string "Sét lẫn laterite" is returned as "Sét l<U+1EAB>n laterite".

@nortonle

Hi,

I am Vietnamese too, and I had exactly the same issue. Surprisingly, the solution is simple: I changed the locale as below.
Sys.setlocale("LC_ALL", 'en_US.UTF-8')

After that, read_excel reads Vietnamese characters correctly. Please note that the Vietnamese text displays correctly when printed in the console, but View() may not display it properly. Please check whether this works for you.

@leminhson
Author

Hi nortonle,
Thank you for your solution. However, Sys.setlocale("LC_ALL", 'en_US.UTF-8') does not work on Windows.

@leminhson leminhson reopened this Dec 19, 2015
@nortonle

Hi,

That's odd. I am running Windows too. I executed the command on my work laptop (Windows 7) and it worked. I just tried it on my personal laptop (Windows 10); it gave a warning but still worked fine. Would you please share your error message?
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
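
The `[1] ""` in the output above matters: per R's documentation, `Sys.setlocale()` returns an empty string when the OS cannot honor the request, in which case the locale is left unchanged. A minimal sketch of checking the return value before relying on it (the helper name `try_locale` is made up for illustration):

```r
# Hypothetical helper: try to set a locale category and report whether
# the OS accepted the request.
try_locale <- function(locale, category = "LC_CTYPE") {
  old <- Sys.getlocale(category)
  res <- suppressWarnings(Sys.setlocale(category, locale))
  if (identical(res, "")) {
    # An empty string means the request was not honored; nothing changed.
    message("Locale '", locale, "' not available; still using: ", old)
    return(invisible(FALSE))
  }
  invisible(TRUE)
}

ok <- try_locale("en_US.UTF-8")
Sys.getlocale("LC_CTYPE")   # shows what is actually in effect
```

If `ok` is `FALSE`, any improvement seen afterwards must come from something other than the locale change.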

@leminhson
Author

Hi nortonle,

Here is a warning message (like yours):

Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF8") :
OS reports request to set locale to "en_US.UTF8" cannot be honored

When I read data to an object aaa using this command: aaa=read_excel("C:/Book1.xls")

The result is: Hố khoan instead of "Hố khoan"

I run in RStudio 0.99.467 ; R version 3.2.2 ; Windows 7 32bit.

@adpgithub

Hi,
I am also facing the same issue. My Excel file contains a name
O’Donnell
which changes to
O’Donnell
when loaded using read_excel() (readxl Version: 0.1.0)
Does anyone know any workaround or fix for this issue?
Thanks.

@kyotin

kyotin commented Jun 27, 2016

I had the same problem. I think it comes from printing the whole data frame/data table directly. Try viewing a single column instead.

> df[1:2,]
name price
Ford EcoSport 2014 621.000.000
camry nh<U+1EAD>p kh<U+1EA9>u 2.0e 770.000.000
> df[1:2,"name"]
[1] "Ford EcoSport 2014" "camry nhập khẩu 2.0e"
You can also use '$' e.g df$name.

@jennybc
Member

jennybc commented Feb 5, 2017

It would be helpful to get some example sheets from the people in this thread. Also: please clarify how you are inspecting the imported data frame (for example, printing the data frame in the Console versus using View() in RStudio).

How to provide a readxl reprex

We're in a much better position to address your issue if you can provide a reprex (reproducible example). Provide as much of this as you can:

  • An actual xls or xlsx file. Pick one:
    • Your personal xls or xlsx: try to strip it down to the minimal size and complexity to demonstrate your point. And, obviously, remove any sensitive data.
    • A publicly available xls or xlsx: provide URL and the code you used to download.
  • A small bit of R code that uses readxl on the provided xls or xlsx file and demonstrates your point.
    • Consider using the reprex package to prepare this. In addition to nice formatting, this ensures your reprex is self-contained.
  • Any details about your environment that seem clearly relevant, such as operating system.
    reprex(..., si = TRUE)
    will append a standard summary, folded neatly away, at the bottom of your reprex.

How to provide your own xls/xlsx file? In order of preference:

  1. Attach the file directly to your issue. Instructions are always at the bottom of the issue or comment box.
  2. Share via DropBox or Google Drive and provide the link in your issue.
  3. Explain you absolutely cannot provide a relevant file via github.com and offer to provide privately.
  4. Don't share a file and realize you're hoping for, e.g., a bug fix with no concrete example to go on.

@leminhson
Author

Hi jennybc,
I followed your instructions and used the reprex package.

  1. The data file "samples.xlsx" is attached directly to this comment.

  2. The code used to read the data from Excel into R:
    library(readxl)
    read_excel("D:/samples.xlsx")

  3. Run the command reprex(si=TRUE) in RStudio. Here is the result (not the expected output):

library(readxl)
read_excel("D:/samples.xlsx")
#>                                     Mô t<U+1EA3>
#> 1                                        BÙN SÉT
#> 2             SÉT l<U+1EAB>n s<U+1EA1>n laterite
#> 3 C�T THÔ l<U+1EAB>n s<U+1EA1>n th<U+1EA1>ch anh
#> 4                                 C�T M<U+1ECA>N
#> 5                                      C�T TRUNG
#> 6            SÉT l<U+1EAB>n k<U+1EBF>t vón silic
Session info
sessionInfo()
#> R version 3.3.2 (2016-10-31)
#> Platform: i386-w64-mingw32/i386 (32-bit)
#> Running under: Windows 7 (build 7601) Service Pack 1
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] readxl_0.1.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] backports_1.0.5 magrittr_1.5    rprojroot_1.2   tools_3.3.2    
#>  [5] htmltools_0.3.5 yaml_2.1.14     Rcpp_0.12.9     stringi_1.1.2  
#>  [9] rmarkdown_1.3   knitr_1.15.1    stringr_1.2.0   digest_0.6.12  
#> [13] evaluate_0.10
  4. The correct result should look like this:
    correct data

samples.xlsx

@jennybc jennybc added this to TODO in jennybc Feb 26, 2017
@hadley
Member

hadley commented Mar 9, 2017

@leminhson can you please confirm that you're on windows?

@jennybc to diagnose on the mac, you'll need

Sys.setlocale(, "en_US.ISO8859-1")
read_excel("~/Desktop/samples.xlsx")

This is almost certainly caused by assigning std::string into a CharacterVector somewhere, because that loses the UTF-8 encoding information that should be applied. I'm reasonably certain that RapidXml always returns UTF-8 encoded strings (although you should double-check that). The problem is not that the data stored in the string is incorrect; it's that R isn't correctly informed about the encoding.
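
The difference between correct bytes and a correct encoding mark can be imitated from the R side without any C++. The sketch below only illustrates the idea of the mark; it is not readxl's code, and the byte values are simply the UTF-8 encoding of "BÙN":

```r
# "BÙN" as UTF-8 bytes (0xC3 0x99 encodes U+00D9).
bytes <- as.raw(c(0x42, 0xc3, 0x99, 0x4e))

x <- rawToChar(bytes)   # the bytes arrive with no encoding mark
Encoding(x)             # "unknown": R will assume the native encoding

y <- x
Encoding(y) <- "UTF-8"  # declare the encoding; the bytes are untouched
Encoding(y)             # "UTF-8"

identical(charToRaw(x), charToRaw(y))  # TRUE: same bytes, different mark
```

In a UTF-8 locale both strings print the same; in a non-UTF-8 locale only the marked one can be interpreted correctly.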

@leminhson
Author

leminhson commented Mar 10, 2017

@hadley
Yes, I am using Windows 7.
After applying Sys.setlocale(, "English_United States.1252"), the result is still incorrect.
I think the problem is that code page 1252 does not contain the Vietnamese characters. We need UTF-8 to display Vietnamese characters properly.

@hadley
Member

hadley commented Mar 10, 2017

@leminhson changing locales will not fix the problem because you are on windows.

@jennybc
Member

jennybc commented Mar 21, 2017

@hadley

Here's what I see on my Mac in a branch where I am (print) debugging:

devtools::load_all(".")
#> Loading readxl
#> Re-compiling readxl
#> <output suppressed>

df <- read_excel(test_sheet("vietnamese-characters.xlsx"))
#> storing a column name:  Mô tả
#> storing cell contents:  BÙN SÉT
#> storing cell contents:  SÉT lẫn sạn laterite
#> storing cell contents:  CÁT THÔ lẫn sạn thạch anh
#> storing cell contents:  CÁT MỊN
#> storing cell contents:  CÁT TRUNG
#> storing cell contents:  SÉT lẫn kết vón silic

df
#> # A tibble: 6 × 1
#>                     `Mô tả`
#>                       <chr>
#> 1                   BÙN SÉT
#> 2      SÉT lẫn sạn laterite
#> 3 CÁT THÔ lẫn sạn thạch anh
#> 4                   CÁT MỊN
#> 5                 CÁT TRUNG
#> 6     SÉT lẫn kết vón silic

Encoding(df[[1]])
#> [1] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8"
df[[1]][1]
#> [1] "BÙN SÉT"
as.matrix(df)
#>      Mô tả                      
#> [1,] "BÙN SÉT"                  
#> [2,] "SÉT lẫn sạn laterite"     
#> [3,] "CÁT THÔ lẫn sạn thạch anh"
#> [4,] "CÁT MỊN"                  
#> [5,] "CÁT TRUNG"                
#> [6,] "SÉT lẫn kết vón silic" 

Then I do as you say and change the locale:

Sys.setlocale(locale = "en_US.ISO8859-1")
#> [1] "en_US.ISO8859-1/en_US.ISO8859-1/en_US.ISO8859-1/C/en_US.ISO8859-1/en_CA.UTF-8"

df <- read_excel(test_sheet("vietnamese-characters.xlsx"))
#> storing a column name:  Mô tả
#> storing cell contents:  BÙN SÉT
#> storing cell contents:  SÉT lẫn sạn laterite
#> storing cell contents:  CÁT THÔ lẫn sạn thạch anh
#> storing cell contents:  CÁT MỊN
#> storing cell contents:  CÁT TRUNG
#> storing cell contents:  SÉT lẫn kết vón silic

df
#> # A tibble: 6 � 1
#>                                      `M� t<U+1EA3>`
#>                                                  <chr>
#> 1                                        B�N S�T
#> 2                S�T l<U+1EAB>n s<U+1EA1>n laterite
#> 3 C�T TH� l<U+1EAB>n s<U+1EA1>n th<U+1EA1>ch anh
#> 4                                    C�T M<U+1ECA>N
#> 5                                         C�T TRUNG
#> 6            S�T l<U+1EAB>n k<U+1EBF>t v�n silic

Encoding(df[[1]])
#> [1] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8"

df[[1]][1]
#> [1] "B�N S�T"

as.matrix(df)
#>      M� t<U+1EA3>                                
#> [1,] "B�N S�T"                           
#> [2,] "S�T l<U+1EAB>n s<U+1EA1>n laterite"   
#> [3,] "C�T TH� l<U+1EAB>n s<U+1EA1>n th<U+1EA1>ch anh"
#> [4,] "C�T M<U+1ECA>N"                       
#> [5,] "C�T TRUNG"                            
#> [6,] "S�T l<U+1EAB>n k<U+1EBF>t v�n silica"

On the C++ side, it seems like the encoding is correctly specified. And, if I'm interpreting the above correctly, the character vector arrives in R with UTF-8 encoding. And yet something is clearly not right.

I can work on other things for now and we could discuss on Friday.

@jennybc jennybc added the bug an unexpected problem or unintended behavior label Mar 21, 2017
@hadley
Member

hadley commented Mar 22, 2017

How are you doing the print debugging? In C++ or R? C++ doesn't have a notion of string encoding.

I think this suggests rapidxml isn't converting to UTF-8. You'll probably need to print the binary representation as hex to debug.

@hadley
Member

hadley commented Mar 23, 2017

@leminhson what do you see if you run this code?

x <- "BÙN SÉT"
x

@lionel-
Member

lionel- commented Mar 23, 2017

> changing locales will not fix the problem because you are on windows.

I think changing the locale is the only way to fix the problem ;)

@leminhson should set his locale to a Windows codepage with support for Vietnamese characters, then everything will display properly.

@lionel-
Member

lionel- commented Mar 23, 2017

@leminhson Does this solve the problem?

Sys.setlocale("LC_CTYPE", "English_United States.1258")

@jennybc
Member

jennybc commented Mar 23, 2017

Here's some output from a branch that prints details on strings. Suggests that the strings are read and stored correctly and encoded as UTF-8. So, as conversation above indicates, this appears to be a printing problem and presumably one that is not specific to readxl and the tibbles it produces.

I hope that @lionel-'s suggestion to switch locales is fruitful.

> devtools::load_all()
Loading readxl
> Sys.getlocale()
[1] "en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8"
> Sys.setlocale(locale = "en_US.ISO8859-1")
[1] "en_US.ISO8859-1/en_US.ISO8859-1/en_US.ISO8859-1/C/en_US.ISO8859-1/en_CA.UTF-8"
> Sys.getlocale()
[1] "en_US.ISO8859-1/en_US.ISO8859-1/en_US.ISO8859-1/C/en_US.ISO8859-1/en_CA.UTF-8"

> df <- read_excel(test_sheet("vietnamese-characters.xlsx"))
Mô tả
  str.size() = 8
  M: 4d: c3: b4
   : 20
  t: 74: e1: ba: a3
BÙN SÉT
  str.size() = 9
  B: 42: c3: 99
  N: 4e
   : 20
  S: 53: c3: 89
  T: 54
SÉT ln sn laterite
  str.size() = 25
  S: 53: c3: 89
  T: 54
   : 20
  l: 6c: e1: ba: ab
  n: 6e
   : 20
  s: 73: e1: ba: a1
  n: 6e
   : 20
  l: 6c
  a: 61
  t: 74
  e: 65
  r: 72
  i: 69
  t: 74
  e: 65
CÁT THÔ ln sn thch anh
  str.size() = 33
  C: 43: c3: 81
  T: 54
   : 20
  T: 54
  H: 48: c3: 94
   : 20
  l: 6c: e1: ba: ab
  n: 6e
   : 20
  s: 73: e1: ba: a1
  n: 6e
   : 20
  t: 74
  h: 68: e1: ba: a1
  c: 63
  h: 68
   : 20
  a: 61
  n: 6e
  h: 68
CÁT MN
  str.size() = 10
  C: 43: c3: 81
  T: 54
   : 20
  M: 4d: e1: bb: 8a
  N: 4e
CÁT TRUNG
  str.size() = 10
  C: 43: c3: 81
  T: 54
   : 20
  T: 54
  R: 52
  U: 55
  N: 4e
  G: 47
SÉT ln kết vón silic
  str.size() = 27
  S: 53: c3: 89
  T: 54
   : 20
  l: 6c: e1: ba: ab
  n: 6e
   : 20
  k: 6b: e1: ba: bf
  t: 74
   : 20
  v: 76: c3: b3
  n: 6e
   : 20
  s: 73
  i: 69
  l: 6c
  i: 69
  c: 63

> df
# A tibble: 6 � 1
                                     `M� t<U+1EA3>`
                                                 <chr>
1                                        BN ST
2                ST l<U+1EAB>n s<U+1EA1>n laterite
3 CT THl<U+1EAB>n s<U+1EA1>n th<U+1EA1>ch anh
4                                    CT M<U+1ECA>N
5                                         CT TRUNG
6            ST l<U+1EAB>n k<U+1EBF>t vn silica

## these are the correct bytes for all of these characters
> (z <- names(df)[1])
[1] "M� t<U+1EA3>"

> charToRaw(z)
[1] 4d c3 b4 20 74 e1 ba a3

> (z <- df[[1]][1])
[1] "B�N S�T"

> charToRaw(z)
[1] 42 c3 99 4e 20 53 c3 89 54
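
The byte values above can be checked independently of any file. A small sketch that builds the column name "Mô tả" from its Unicode code points and compares the UTF-8 bytes (the variable names are arbitrary):

```r
# Code points for "Mô tả": M, U+00F4, space, t, U+1EA3
cp <- c(0x4D, 0xF4, 0x20, 0x74, 0x1EA3)
s <- intToUtf8(cp)     # one UTF-8 string built from the code points

charToRaw(s)           # 4d c3 b4 20 74 e1 ba a3
utf8ToInt(s) == cp     # decoding gives the code points back
```

These are the same eight bytes that `charToRaw(z)` reports for the column name, which supports the reading that the stored data is valid UTF-8.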

@leminhson
Author

leminhson commented Mar 24, 2017

@hadley Yes. If I type directly in R command window, Vietnamese characters are displayed correctly

x <- 'BÙN SÉT'
x
[1] "BÙN SÉT"

@jennybc
Member

jennybc commented Mar 24, 2017

@leminhson What does Sys.getlocale() report? Did you try @lionel-'s suggestion to change your locale?

Sys.setlocale("LC_CTYPE", "English_United States.1258")

@leminhson
Author

leminhson commented Mar 24, 2017

@lionel- Sys.setlocale("LC_CTYPE", "English_United States.1258") could not solve this problem.

@jennybc Sys.getlocale( ) answers:
[1] "LC_COLLATE=English_United States.1258;LC_CTYPE=English_United States.1258;LC_MONETARY=English_United States.1258;LC_NUMERIC=C;LC_TIME=English_United States.1258"

@lionel-
Member

lionel- commented Mar 24, 2017

After switching the locale to 1258, what do you get when you do this on your sample data frame:

enc2native(df[[1]])

@jennybc
Member

jennybc commented Mar 24, 2017

@leminhson

Your current default locale

Note I asked to see output of Sys.getlocale() for your default locale: "get" not "set".

In your default locale, where the string below prints correctly, what encoding is reported? I.e. run this

Sys.getlocale()
#> [1] "en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8"
x <- 'BÙN SÉT'
x
#> [1] "BÙN SÉT"
Encoding(x)
#> [1] "UTF-8"

Re: changing your locale:

You may not be able to change the locale in a running RStudio session on Windows, so let's not give up on @lionel- 's suggestion just yet. You might need to put Sys.setlocale("LC_CTYPE", "English_United States.1258") in a startup file such as ~/.Rprofile and restart. Then run Sys.getlocale() to confirm the change took and try readxl::read_excel("samples.xlsx") again.

Also, will you run these tests (reading your example sheet and the above) in R in the Console, i.e. not in RStudio, just to make sure that has nothing to do with it?
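
For the startup-file route, something along these lines could go in ~/.Rprofile. This is a sketch only: the locale string is the one suggested in this thread, and whether Windows accepts it depends on the machine.

```r
# ~/.Rprofile (sketch): runs at the start of every R session.
# The locale string below is Windows-specific, so other platforms skip it.
if (.Platform$OS.type == "windows") {
  res <- suppressWarnings(Sys.setlocale("LC_CTYPE", "English_United States.1258"))
  if (identical(res, "")) {
    # An empty return value means the OS refused the request.
    message("Could not switch LC_CTYPE; keeping ", Sys.getlocale("LC_CTYPE"))
  }
}
```

After restarting R, `Sys.getlocale()` shows whether the change took effect.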

@lionel-
Member

lionel- commented Mar 24, 2017

@leminhson
Author

leminhson commented Mar 24, 2017

@jennybc Here is the result from your instruction:

Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
x <- 'BÙN SÉT'
x
[1] "BÙN SÉT"
Encoding(x)
[1] "latin1"

@jennybc
Member

jennybc commented Mar 24, 2017

@lionel- Yeah, that is why I included as.matrix(df) in my investigation far above, but it's true we've never had @leminhson do same. And clearly I'm not really emulating this user's problem very well.

@jennybc
Member

jennybc commented Mar 24, 2017

Also relevant 😐: Improve UTF-8 support on Windows, RConsortium/wishlist#2 by @kevinushey

@lionel-
Member

lionel- commented Mar 24, 2017

so @leminhson, does df[[1]] display any better than df?

If not, what happens when you do this:

Sys.setlocale("LC_CTYPE", "English_United States.1258")

df[[1]]
enc2native(df[[1]])

@leminhson
Author

leminhson commented Mar 24, 2017

@jennybc Here is the result when running from R console (not in RStudio):

Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1258;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

library(readxl)

x <- read_excel("D:/Samples.xlsx")

x
Mô t<U+1EA3>
1 BÙN SÉT
2 SÉT l<U+1EAB>n s<U+1EA1>n laterite
3 CÁT THÔ l<U+1EAB>n s<U+1EA1>n th<U+1EA1>ch anh
4 CÁT M<U+1ECA>N
5 CÁT TRUNG
6 SÉT l<U+1EAB>n k<U+1EBF>t vón silic

These Unicode code points are correct. However, the characters themselves are not displayed; only their escape codes are shown.

@hadley
Member

hadley commented Mar 24, 2017

@leminhson are you running the latest GitHub version of readxl?

@leminhson
Author

@hadley Yes. The version of readxl is 0.1.1

@lionel-
Member

lionel- commented Mar 24, 2017

@leminhson did you miss my comment? #125 (comment)

@hadley
Member

hadley commented Mar 24, 2017

@leminhson that's the current CRAN version, not the current github version. Please run install_github("tidyverse/readxl")

@leminhson
Author

leminhson commented Mar 24, 2017

@hadley The current github version of readxl is 0.1.1.9000
@lionel- The result of enc2native(df[[1]]) after changing the locale to English_United States.1258 is no better:

enc2native(df[[1]])
[1] "BÙN SÉT"
[2] "SÉT l<U+1EAB>n s<U+1EA1>n laterite"
[3] "CÁT THÔ l<U+1EAB>n s<U+1EA1>n th<U+1EA1>ch anh"
[4] "CÁT M<U+1ECA>N"
[5] "CÁT TRUNG"
[6] "SÉT l<U+1EAB>n k<U+1EBF>t vón silic"

However, if I follow the steps from @jennybc without using the reprex package, the result in the R console is perfect. But when I view the data frame df with View(), only the Unicode escape codes are shown.

library(readxl)
df <- read_excel("D:/samples.xlsx")
Encoding(df[[1]])
[1] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8"

as.matrix(df)
Mô tả
[1,] "BÙN SÉT"
[2,] "SÉT lẫn sạn laterite"
[3,] "CÁT THÔ lẫn sạn thạch anh"
[4,] "CÁT MỊN"
[5,] "CÁT TRUNG"
[6,] "SÉT lẫn kết vón silic"

View(df)
1 BÙN SÉT
2 SÉT l<U+1EAB>n s<U+1EA1>n laterite
3 CÁT THÔ l<U+1EAB>n s<U+1EA1>n th<U+1EA1>ch anh
4 CÁT M<U+1ECA>N
5 CÁT TRUNG
6 SÉT l<U+1EAB>n k<U+1EBF>t vón silic

@hadley
Member

hadley commented Mar 24, 2017

Ok, I'm happy that the problem is not on the readxl end, but instead lies somewhere else.

@leminhson
Author

@hadley Yes. But we do not yet know which part is the cause. If I type Vietnamese characters directly in R or in RStudio, the result is always correct. That suggests the problem is caused neither by the locale setting nor by the display function...

@hadley
Member

hadley commented Mar 24, 2017

Collectively, we've now spent a lot of time on this issue. It is getting very close to the point where I don't think we can afford to spend more. It's a bummer that we might not be able to fully resolve your issue, but we don't have unlimited resources and this problem is clearly only affecting a very small number of people, and it's highly likely that readxl is already doing all that it can.

@lionel-
Member

lionel- commented Mar 24, 2017

And it now seems clear that this is the data frame printing bug in R; I don't think it is related to readxl.

@leminhson df[[1]] prints just as well as as.matrix(df) right?

@kevinushey
Contributor

The ultimate problem is likely just that R's print() method on data.frames tries to round-trip characters through the active encoding, which is obviously lossy when converting UTF-8-encoded characters.

screen shot 2017-03-24 at 9 15 15 am

This implies that, if you have UTF-8 characters that are not representable in the current locale, you're hosed.
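
A rough way to see which strings survive such a round trip is to look at their code points. The helper below is illustrative only (it is not part of base R or readxl), and it uses ISO-8859-1 (Latin-1) as a stand-in for Windows code page 1252, which differs from Latin-1 in the 0x80 to 0x9F range:

```r
# Illustrative helper: TRUE when every character of a string has a code
# point <= 0xFF, i.e. exists in ISO-8859-1 (Latin-1).
fits_latin1 <- function(s) {
  vapply(enc2utf8(s), function(el) all(utf8ToInt(el) <= 0xFF), logical(1),
         USE.NAMES = FALSE)
}

fits_latin1("B\u00d9N S\u00c9T")   # TRUE:  Ù and É are in Latin-1
fits_latin1("l\u1eabn")            # FALSE: U+1EAB has no Latin-1 equivalent
```

This matches what the thread shows: "BÙN SÉT" prints intact under a 1252 locale, while "lẫn" comes out as "l<U+1EAB>n".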

jennybc added a commit that referenced this issue Mar 24, 2017
Inspired by investigations re: #125
@jennybc
Member

jennybc commented Mar 24, 2017

I'm going to close this. We've established it's not a readxl-specific issue, but an example of general printing difficulty with Windows + R data frames + Unicode characters. Thanks for all the help everyone! This thread will still be a useful reference going forward.

@jennybc jennybc closed this as completed Mar 24, 2017
@jennybc
Member

jennybc commented Mar 24, 2017

@leminhson I might also add: as we've said, these strings are being read and stored just fine, this is "only" a printing problem. So if you can tolerate the ugly, you can work with the data frame as it is. But if you really want nice printing, you might explicitly convert these strings from UTF-8 to Latin-1 by using iconv() on the affected variables. But then you will have lost the UTF-8 encoding, which is superior in the long run.
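
A sketch of what the iconv() route looks like on a plain character vector (the sample strings are taken from this thread). One caveat: Latin-1 has no equivalent for characters such as U+1EAB, so the conversion can only help with strings whose characters all exist in the target encoding.

```r
x <- enc2utf8(c("B\u00d9N S\u00c9T", "l\u1eabn"))

# Convert from UTF-8 to Latin-1. What happens to characters with no Latin-1
# equivalent (such as U+1EAB) depends on the platform's iconv; on many
# systems the whole element comes back as NA.
y <- iconv(x, from = "UTF-8", to = "latin1")

charToRaw(y[1])   # 42 d9 4e 20 53 c9 54: one byte per character now
is.na(y[2])       # check this before overwriting a column with the result
```

Keeping the original UTF-8 column and storing the converted copy under a new name avoids losing data if some elements cannot be converted.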

@cjens

cjens commented Aug 14, 2018

The first suggestion at the very top of this thread worked for me. I just typed this in the console: Sys.setlocale("LC_ALL", 'en_US.UTF-8')
It immediately solved all my issues with reading Polish text strings into R!

@leminhson
Author

The readxl package can now read Vietnamese characters without any error. I do not know whether this is due to the new version of R (3.5.0) or the new version of readxl (1.1.0).

@cjens

cjens commented Aug 14, 2018 via email

@bpbraun

bpbraun commented Aug 13, 2019

I had the same issue, years later. I just saved the Excel file as a CSV and didn't have any problems after that.

@leminhson
Author

leminhson commented Aug 14, 2019 via email

@KaticaR

KaticaR commented Sep 23, 2019

# set the locale for the Serbian language
Sys.getlocale("LC_ALL")
Sys.setlocale(locale = 'Serbian (Cyrillic)')

# load the packages
library("readxl")
library("dplyr")

# load the dataset
dositej <- read_excel("slobodna_radna_mesta-17.05.2019.xls")

# look at the dataset
head(dositej)

# take only the columns needed
dositej <- dositej[,c(1,2,3,8,10,16)]

# see what we got
View(dositej)

# get a library for transliterating Cyrillic to Latin
library(stringi)

# take the string columns from dositej
i <- 1:5
# and transliterate them to Latin
dositej[ , i] <- apply(dositej[ , i], 2, function(x) stri_trans_general(x, "Serbian-Latin/BGN"))
# now view it again to see what comes out :)
Sys.setlocale(locale = 'Serbian (Latin)')
View(dositej)

# calculate the total free teaching norm for math in Novi Sad
math_norma <- dositej %>%
  filter(opstina == "Novi Sad", predmet == "Matematika") %>%
  select(norma_slobodno) %>%
  sum()

Note:
The data looks fine in View(), in both Cyrillic and Latin,
but filtering it in base R gave me NAs
for all data types in the data frame. Using dplyr solved the problem,
so math_norma at the end is a number, not NA.

Thanks R for solving the issues! :)
