I'm trying to parse factors containing Scandinavian letters from a file in ISO-8859-1 encoding.
I made a gist of the problem with sample files in UTF-8 and iso-8859-1.
    library("readr")
    library("dplyr")

    #####################################################################
    # works as expected when column types are chr and file is iso-8859-1
    read_delim(
      "scandinavian.csv", ";",
      locale = locale("fi", encoding = "iso-8859-1"),
      escape_double = FALSE,
      col_names = FALSE,
      trim_ws = TRUE
    ) %>% glimpse() %>% select(X11,X12)
    #   X11   X12
    #   <chr> <chr>
    # 1 VAS   VÄNST

    #####################################################################
    # works as expected when column types are fctr and file is UTF-8
    read_delim(
      "scandinavian-utf.csv", ";",
      locale = locale("fi", encoding = "UTF-8"),
      escape_double = FALSE,
      col_names = FALSE,
      col_types = cols(X11 = col_factor(c("VAS")), X12 = col_factor(c("VÄNST"))),
      trim_ws = TRUE
    ) %>% glimpse() %>% select(X11,X12)
    #   X11   X12
    #   <chr> <chr>
    # 1 VAS   VÄNST

    #####################################################################
    # works unexpectedly when column types are fctr and file is iso-8859-1
    read_delim(
      "scandinavian.csv", ";",
      locale = locale("fi", encoding = "iso-8859-1"),
      escape_double = FALSE,
      col_names = FALSE,
      col_types = cols(X11 = col_factor(c("VAS")), X12 = col_factor(c("VÄNST"))),
      trim_ws = TRUE
    ) %>% glimpse() %>% select(X11,X12)
    #   X11   X12
    #   <chr> <chr>
    # 1 VAS   NA

    # should see:
    #   X11   X12
    #   <chr> <chr>
    # 1 VAS   VÄNST
scandinavian-utf.csv
K ;01;091;A;001A;HEL;HEL;06;01;01;VAS ;VÄNST ;0002;Kruununhaka A ;Kronohagen A ;Riku ;Ahola ;1;031;valtiotieteiden kandidaatti
scandinavian.csv
The data comes from the Finnish Ministry of Justice election results. The test files used in the code above can be generated this way:
    curl -O http://tulospalvelu.vaalit.fi/K2012/data/k-2012_ehd_maa.csv.zip
    unzip k-2012_ehd_maa.csv.zip
    head -n 1 kv-2012_teat_maa.csv > scandinavian.csv
    iconv -f ISO-8859-1 -t UTF-8 scandinavian.csv > scandinavian-utf.csv

    $ file -I scandinavian.csv
    scandinavian.csv: text/plain; charset=iso-8859-1
    $ file -I scandinavian-utf.csv
    scandinavian-utf.csv: text/plain; charset=utf-8
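Since the character-column read above returns the expected values, one possible workaround is to read the affected columns as character and build the factors afterwards. The snippet below is only a sketch under assumptions: it writes its own two-column ISO-8859-1 sample to a temporary file instead of using the real election data, so the column names `X1` and `X2` and the file contents are illustrative, not taken from the files in this issue.

```r
library(readr)

# Hypothetical sample: one semicolon-delimited row written as ISO-8859-1
# bytes (0xC4 is "Ä" in that encoding).
path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("VAS;V"), as.raw(0xC4), charToRaw("NST\n")), path)

# Read every column as character; the locale's encoding is applied here,
# so the strings arrive as UTF-8.
dat <- read_delim(
  path, ";",
  locale = locale(encoding = "iso-8859-1"),
  col_names = FALSE,
  col_types = cols(.default = col_character())
)

# Convert to factors after reading, using UTF-8 level strings.
dat$X1 <- factor(dat$X1, levels = "VAS")
dat$X2 <- factor(dat$X2, levels = "V\u00C4NST")
```

This sidesteps `col_factor()` entirely, at the cost of not getting a parse warning for values outside the level set; such values silently become `NA` in `factor()`.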
I can confirm this issue. The problem is that we are not converting the data to UTF-8. Reprex:
    library(readr)

    encoded <- function(x, encoding) {
      Encoding(x) <- encoding
      x
    }

    read_csv(encoded("test\nA\n\xC4\n", "latin1"),
      col_types = cols(col_factor(c("A", "Ä"))),
      locale = locale(encoding = "latin1"))
    #> Warning: 1 parsing failure.
    #> row col           expected actual         file
    #>   2 test value in level set      � literal data
    #>
    #> # A tibble: 2 × 1
    #>     test
    #>   <fctr>
    #> 1      A
    #> 2     NA
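To illustrate why an unconverted value cannot match its level, here is a minimal base R sketch (no readr involved) comparing the raw bytes. It only demonstrates the byte mismatch and is not a description of readr's internals; the variable names are illustrative.

```r
# "Ä" as it appears in an ISO-8859-1 file: the single byte 0xC4.
raw_value <- rawToChar(as.raw(0xC4))

# "Ä" as a UTF-8 string, which is what a level such as "Ä" looks like
# in a UTF-8 session: the two bytes 0xC3 0x84.
level <- "\u00C4"

# Compared byte for byte, the unconverted value does not match the level.
print(identical(charToRaw(raw_value), charToRaw(level)))   # FALSE

# After converting from latin1 to UTF-8 the bytes agree.
converted <- iconv(raw_value, from = "latin1", to = "UTF-8")
print(identical(charToRaw(converted), charToRaw(level)))   # TRUE
```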
Encode factor data
41fe6cb
Fixes #615
d3cdcd2