Possible bug when parsing col_factor from iso-8859-1 file #615

Closed
pe3 opened this Issue Feb 14, 2017 · 1 comment


pe3 commented Feb 14, 2017

I'm trying to parse factors containing Scandinavian letters from a file in ISO-8859-1 encoding.

I made a gist of the problem with sample files in UTF-8 and ISO-8859-1.

library("readr")
library("dplyr")

#####################################################################
# works as expected when column types are chr and file is iso-8859-1

read_delim(
  "scandinavian.csv",
  ";",
  locale = locale("fi", encoding = "iso-8859-1"),
  escape_double = FALSE,
  col_names = FALSE,
  trim_ws = TRUE
) %>% glimpse() %>% select(X11,X12)

#      X11   X12
#    <chr> <chr>
#  1   VAS VÄNST

#####################################################################
# works as expected when column types are fctr and file is UTF-8

read_delim(
  "scandinavian-utf.csv",
  ";",
  locale = locale("fi", encoding = "UTF-8"),
  escape_double = FALSE,
  col_names = FALSE,
  col_types = cols(X11 = col_factor(c("VAS")), X12 = col_factor(c("VÄNST"))),
  trim_ws = TRUE
) %>% glimpse() %>% select(X11,X12)

#       X11    X12
#    <fctr> <fctr>
#  1    VAS  VÄNST

#####################################################################
# does NOT work as expected when column types are fctr and file is iso-8859-1

read_delim(
  "scandinavian.csv",
  ";",
  locale = locale("fi", encoding = "iso-8859-1"),
  escape_double = FALSE,
  col_names = FALSE,
  col_types = cols(X11 = col_factor(c("VAS")), X12 = col_factor(c("VÄNST"))),
  trim_ws = TRUE
) %>% glimpse() %>% select(X11,X12)

#       X11    X12
#    <fctr> <fctr>
#  1    VAS     NA

# should see:
#       X11    X12
#    <fctr> <fctr>
#  1    VAS  VÄNST
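
Since the first call above shows that character columns are read correctly from the ISO-8859-1 file, one possible workaround is to read the column as character and convert it to a factor afterwards. The following is only a sketch under assumptions: it writes a small two-column stand-in file to a temporary path instead of using scandinavian.csv, and the names `path`, `level` and `df` are made up for illustration.

```r
library(readr)

# Build a tiny ISO-8859-1 file: "VAS ;VÄNST " with Ä as the single byte 0xC4.
# This stands in for scandinavian.csv and is not the real election data.
path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("VAS ;V"), as.raw(0xC4), charToRaw("NST \n")), path)

# Read every column as character; readr re-encodes character data to UTF-8.
df <- read_delim(
  path,
  delim = ";",
  locale = locale(encoding = "ISO-8859-1"),
  col_names = FALSE,
  col_types = cols(.default = col_character()),
  trim_ws = TRUE
)

# Build the level "VÄNST" from code points so that it is UTF-8
# regardless of how this script itself is encoded.
level <- intToUtf8(c(0x56, 0xC4, 0x4E, 0x53, 0x54))

# Convert to a factor after reading, once the data is already UTF-8.
df$X2 <- factor(df$X2, levels = level)
```

The conversion happens in R after the text has been re-encoded, so the level lookup no longer depends on how `col_factor()` handles the file encoding.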

scandinavian-utf.csv

K ;01;091;A;001A;HEL;HEL;06;01;01;VAS ;VÄNST ;0002;Kruununhaka A ;Kronohagen A ;Riku ;Ahola ;1;031;valtiotieteiden kandidaatti

scandinavian.csv

K ;01;091;A;001A;HEL;HEL;06;01;01;VAS ;VÄNST ;0002;Kruununhaka A ;Kronohagen A ;Riku ;Ahola ;1;031;valtiotieteiden kandidaatti

The data comes from the Finnish Ministry of Justice election results. The test files used in the code above can be generated as follows:

$ curl -O http://tulospalvelu.vaalit.fi/K2012/data/k-2012_ehd_maa.csv.zip
$ unzip k-2012_ehd_maa.csv.zip
$ head -n 1 kv-2012_teat_maa.csv > scandinavian.csv
$ iconv -f ISO-8859-1 -t UTF-8 scandinavian.csv > scandinavian-utf.csv
$ file -I scandinavian.csv
scandinavian.csv: text/plain; charset=iso-8859-1
$ file -I scandinavian-utf.csv
scandinavian-utf.csv: text/plain; charset=utf-8


jimhester commented Feb 21, 2017

I can confirm this issue. The problem is that we are not converting the data to UTF-8.
Reprex

library(readr)
encoded <- function(x, encoding) {
  Encoding(x) <- encoding
  x
}

read_csv(
  encoded("test\nA\n\xC4\n", "latin1"),
  col_types = cols(col_factor(c("A", "Ä"))),
  locale = locale(encoding = "latin1")
)
#> Warning: 1 parsing failure.
#> row  col           expected actual         file
#>   2 test value in level set   � literal data
#> 
#> # A tibble: 2 × 1
#>     test
#>   <fctr>
#> 1      A
#> 2     NA
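
The parsing failure above is consistent with a byte-level mismatch: the value in the file is the single Latin-1 byte 0xC4, while a UTF-8 level "Ä" is two bytes. The following base R sketch illustrates only that mismatch and is not readr's internal code; the variable names are illustrative, and it assumes the factor level is held as UTF-8.

```r
# The same letter "Ä" in two encodings, built without literal non-ASCII
# characters so the example does not depend on this script's encoding.
latin1_value <- "\xC4"                  # byte as stored in an ISO-8859-1 file
Encoding(latin1_value) <- "latin1"      # declare what that byte means
utf8_level <- intToUtf8(0xC4)           # the level as a UTF-8 string

# The raw bytes differ: one byte in Latin-1, two bytes in UTF-8.
latin1_bytes <- charToRaw(latin1_value)
utf8_bytes <- charToRaw(utf8_level)
bytes_match_before <- identical(latin1_bytes, utf8_bytes)

# Converting the value to UTF-8 first makes the bytes line up.
converted <- iconv(latin1_value, from = "latin1", to = "UTF-8")
bytes_match_after <- identical(charToRaw(converted), utf8_bytes)
```

Any comparison that works on raw bytes will treat the two strings as different until the value has been converted, which matches the "value in level set" failure reported for row 2.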

jimhester added a commit that referenced this issue Feb 21, 2017

jimhester added a commit that referenced this issue Feb 22, 2017

@jimhester jimhester closed this in d3cdcd2 Feb 22, 2017

@lock lock bot locked and limited conversation to collaborators Sep 24, 2018
