Support multibyte encodings (e.g. UTF-16LE) #306
Comments
Can you please provide a reproducible example?
Hi, please see the data file and code in the folder below: https://www.dropbox.com/sh/6vw6irfe0m3f5bm/AACmdssmln4iZjxkpzbE1MEaa?dl=0 Thanks!
Hi, any follow-up on this issue? Thanks!
PedramNavid commented Dec 16, 2015
Having a similar problem. Even when I try loading the file using a file connection with an explicit encoding, the read fails:
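The code that followed appears to have been lost when this page was archived. A hypothetical sketch of the pattern described, with an invented file path and invented sample data:

```r
# Create a small UTF-16LE file for illustration (hypothetical data)
tmp <- tempfile(fileext = ".csv")
bytes <- unlist(iconv("x,y\n1,2\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE))
writeBin(bytes, tmp)

# Even with the encoding made explicit on the connection,
# readr reported the error at the time of this thread:
# con <- file(tmp, encoding = "UTF-16LE")
# readr::read_csv(con)
# Error: Incomplete multibyte sequence
```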
I simplified the example so it only needs R to work (no external files required).
Narrowing down this issue: I suspect the tokenizer in readr does not account for multibyte line feeds.

```r
check_encodings <- function(text, encoding) {
  text_utf16 <- iconv(text, from = "UTF-8", to = encoding, toRaw = TRUE)[[1]]
  ds <- readr:::datasource_string(text, skip = 0)
  lines <- readr:::read_lines_(ds, locale_ = locale())
  ds_raw <- readr:::datasource_raw(text_utf16, skip = 0, comment = "")
  lines_raw <- readr:::read_lines_(ds_raw, locale_ = locale(encoding = encoding))
  assertthat::are_equal(lines, lines_raw)
}

check_encodings("3\t2", "UTF-16LE")    # works
check_encodings("3\t2\n", "UTF-16LE")  # fails: Error: incomplete multibyte sequence
check_encodings("3\t2", "UTF-16BE")    # works
check_encodings("3\t2\n", "UTF-16BE")  # fails: Error: incomplete multibyte sequence
```
@hadley, this issue may require pushing the locale information into the tokenizer (at least the file encoding, so it looks for the right byte sequence for '\n'). I don't know whether that is acceptable to you, or whether you wanted the tokenizer to be locale-independent. This assumes my suspicion about multibyte line feeds is correct. I feel I can write a small C++ patch, but if I am right, then several parts of the tokenizer need fixing. Could you please take a look at this issue and give some advice?
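For context (a quick check, not from the original thread), this is why a byte-at-a-time scan for '\n' breaks on UTF-16: the line feed is a two-byte code unit, 0a 00 in UTF-16LE and 00 0a in UTF-16BE, so comparing single bytes against 0x0a splits in the middle of a code unit:

```r
# A newline is one byte in UTF-8 but two bytes in UTF-16LE/BE
nl_utf8 <- iconv("\n", from = "UTF-8", to = "UTF-8",    toRaw = TRUE)[[1]]
nl_le   <- iconv("\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
nl_be   <- iconv("\n", from = "UTF-8", to = "UTF-16BE", toRaw = TRUE)[[1]]

nl_utf8  # 0a
nl_le    # 0a 00
nl_be    # 00 0a
```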
@zeehio Yeah, this is a big issue that will need some thought. In general, readr currently assumes that it can read byte-by-byte, and anything else will require quite a lot of work/thought.
bilydr commented Nov 3, 2015
Hi, I am trying to read in a file with UTF-16LE encoding.
This can be done with base package code,
but when I try to use readr to do the same,
I get the error: Error: Incomplete multibyte sequence
Can you please help fix it? Thanks for your advice!
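The original data file was not attached here, so the following is a self-contained sketch of the contrast being reported, with invented sample data: base R reads a UTF-16LE file once the encoding is declared on the connection, while readr's `locale(encoding = )` route raised the error at the time of this report.

```r
# Write sample tab-separated data as UTF-16LE (invented data)
tmp <- tempfile(fileext = ".txt")
bytes <- unlist(iconv("a\tb\n1\t2\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE))
writeBin(bytes, tmp)

# Base R: works once the encoding is set on the connection
con <- file(tmp, encoding = "UTF-16LE")
base_result <- read.table(con, header = TRUE, sep = "\t")
base_result

# readr: failed at the time of this report
# readr::read_tsv(tmp, locale = readr::locale(encoding = "UTF-16LE"))
# Error: Incomplete multibyte sequence
```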