Support multibyte encodings (e.g. UTF-16LE) #306
Comments
Can you please provide a reproducible example?
Hi, please see the data file and code in the folder below: https://www.dropbox.com/sh/6vw6irfe0m3f5bm/AACmdssmln4iZjxkpzbE1MEaa?dl=0 Thanks! Longyi Bi
Hi, any follow-up on this issue? Thanks!
I'm having a similar problem. Even when I try loading the file using a file connection with an explicit encoding, the read fails.
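The original snippet was not preserved in this thread; a minimal sketch of that kind of attempt, with a hypothetical file name, might look like this:

```r
# Hypothetical reconstruction: even a connection opened with an explicit
# encoding fails once readr starts reading the multibyte stream.
con <- file("data.txt", encoding = "UTF-16LE")  # file name assumed
readr::read_tsv(con)
#> Error: Incomplete multibyte sequence
```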
I simplified the example so it only needs R to work (no external files required).
Narrowing down this issue: I suspect the tokenizer in readr does not account for multibyte line feeds.

```r
library(readr)

check_encodings <- function(text, encoding) {
  # Convert the UTF-8 input to the target encoding as a raw vector
  text_utf16 <- iconv(text, from = "UTF-8", to = encoding, toRaw = TRUE)[[1]]

  # Read the UTF-8 text through readr's internal string datasource
  ds <- readr:::datasource_string(text, skip = 0)
  lines <- readr:::read_lines_(ds, locale_ = locale())

  # Read the re-encoded bytes through the raw datasource, with a matching locale
  ds_raw <- readr:::datasource_raw(text_utf16, skip = 0, comment = "")
  lines_raw <- readr:::read_lines_(ds_raw, locale_ = locale(encoding = encoding))

  # Both paths should yield the same lines
  assertthat::are_equal(lines, lines_raw)
}

check_encodings("3\t2", "UTF-16LE")   # works
check_encodings("3\t2\n", "UTF-16LE") # fails: Error: incomplete multibyte sequence
check_encodings("3\t2", "UTF-16BE")   # works
check_encodings("3\t2\n", "UTF-16BE") # fails: Error: incomplete multibyte sequence
```
@hadley, this issue may require pushing the locale information into the tokenizer (at least the file encoding, so that it looks for the right byte sequence for '\n'). I don't know if that is fine with you, or if you wanted the tokenizer to be locale-independent. This all assumes my suspicion about multibyte line feeds is correct. I feel I can write a small C++ patch, but if I am right then several parts of the tokenizer need fixing. Could you please take a look at this issue and give some advice?
@zeehio Yeah, this is a big issue that will need some thought. In general, readr currently assumes that it can read byte by byte, and anything else will require quite a lot of work/thought.
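Until that lands, one practical workaround (my sketch, not readr's API; the file names are assumptions) is to re-encode the file to UTF-8 with base R first, so that byte-by-byte reading is safe:

```r
# Workaround sketch: convert the file to UTF-8, then hand it to readr.
# "input.csv" and "input-utf8.csv" are hypothetical file names.
con <- file("input.csv", encoding = "UTF-16LE")
lines <- readLines(con)
close(con)
writeLines(enc2utf8(lines), "input-utf8.csv", useBytes = TRUE)
df <- readr::read_csv("input-utf8.csv")
```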
I'm currently faced with this issue as well. Are there any plans to fix it? It's a real pity not being able to use my beloved readr functions for the current project :-( BR Daniel
If it is of help, I could read a UTF-16 encoded CSV with:

```r
library(magrittr)  # for %>%
library(tibble)    # for as_tibble()

read.delim("data-raw/swiss_2020-06-01.csv",
           sep = ";",
           stringsAsFactors = FALSE,
           fileEncoding = "UTF-16") %>%
  as_tibble()
```
I had the same problem; I could read a UTF-16LE encoded file with the base `read.csv()` function. This dataset could be a great example: https://sisa.msal.gov.ar/datos/descargas/covid-19/files/Covid19Casos.csv (source: http://datos.salud.gob.ar/dataset/covid-19-casos-registrados-en-la-republica-argentina). The dataset size is around 300 MB and it contains information about COVID-19 cases in Argentina.

Output of the `file` command in Linux:

```
$ file Covid19Casos.csv
Covid19Casos.csv: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators
```

Output of `guess_encoding()`:

```r
library(readr)
guess_encoding("Covid19Casos.csv")
#> # A tibble: 3 x 2
#>   encoding   confidence
#> 1 UTF-16LE         1
#> 2 ISO-8859-1       0.68
#> 3 ISO-8859-2       0.48
```

The function call that works is:

```r
dataset <- read.csv("Covid19Casos.csv", fileEncoding = "UTF-16LE")
```
Closed by tidyverse/vroom@7d54cda |
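With that fix in vroom (the engine behind readr's second edition), passing the encoding through `locale()` should be enough; a sketch, assuming the Argentine dataset above:

```r
# After the fix, readr should handle multibyte encodings directly
# when told about them via the locale.
library(readr)
dataset <- read_csv("Covid19Casos.csv",
                    locale = locale(encoding = "UTF-16LE"))
```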
Hi, I am trying to read in a file with UTF-16LE encoding, which can be done with base package code, but when I try to use readr to do the same I get the error `Error: Incomplete multibyte sequence`. Can you please help fix it? Thanks for your advice!