Support multibyte encodings (e.g. UTF-16LE) #306

Open
bilydr opened this Issue Nov 3, 2015 · 7 comments

@bilydr

bilydr commented Nov 3, 2015

Hi, I am trying to read a file with UTF-16LE encoding.

This works with base R:

df <- read.delim(file1, stringsAsFactors = FALSE, fileEncoding = 'UTF-16LE')

but when I try to do the same with readr:

df <- read_tsv(file1, locale = locale(encoding = 'UTF-16LE'))

I get the error: Error: Incomplete multibyte sequence

Can you please help fix this? Thanks for your advice!
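Until readr handles multibyte encodings, one possible workaround is to re-encode the file to UTF-8 with base R first and then point readr (or any UTF-8 reader) at the copy. A minimal sketch; the sample file stands in for file1 from the post above, and the final read_tsv call is only indicated in a comment:

```r
# Create a small UTF-16LE sample file to re-encode (stands in for file1)
src <- tempfile(fileext = ".tsv")
con_out <- file(src, open = "wb")
writeBin(iconv("a\tb\n1\t2\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], con_out)
close(con_out)

# Re-encode to UTF-8: read through a connection that decodes UTF-16LE,
# then write the decoded lines back out
utf8_copy <- tempfile(fileext = ".tsv")
con_in <- file(src, encoding = "UTF-16LE")
lines <- readLines(con_in)
close(con_in)
writeLines(lines, utf8_copy)

# Now any UTF-8-only reader works on the copy, e.g.:
# df <- readr::read_tsv(utf8_copy)
```

This only sidesteps the bug; large files pay the cost of a full extra pass and a second copy on disk.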

@hadley

Member

hadley commented Nov 3, 2015

Can you please provide a reproducible example?

@bilydr

bilydr commented Nov 4, 2015

Hi, please see the data file and code in the folder below:

https://www.dropbox.com/sh/6vw6irfe0m3f5bm/AACmdssmln4iZjxkpzbE1MEaa?dl=0

Thanks!


@bilydr

bilydr commented Nov 27, 2015

Hi, any follow-up on this issue? Thanks!

@PedramNavid

PedramNavid commented Dec 16, 2015

Having a similar problem. Even when I try loading the file using a file connection with an explicit encoding, the read fails. For example,

data.table(read_csv(file("data/Parking_Tags_data_2008.csv", encoding = "UCS-2LE")))

reads garbage, while read.csv works fine.

@zeehio

Contributor

zeehio commented Feb 29, 2016

I simplified the example so it only needs R to work (no external files required).

tmp_file_name <- tempfile()
# This is the Byte Order Mark for UTF-16LE:
rawbom <- as.raw(c(255, 254))
# This is the text, converted to UTF-16LE:
text <- "1\t2\n"
text_utf16 <- iconv(text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)
# Write the BOM and the text to a file
fd <- file(tmp_file_name, "wb")
writeBin(rawbom, fd)
writeBin(text_utf16[[1]], fd)
close(fd)
# The expected result, for comparison:
df_test <- data.frame(V1 = 1, V2 = 2)
# Using read.delim, no problem:
works <- read.delim(tmp_file_name, header = FALSE, fileEncoding = "UTF-16LE")
assertthat::are_equal(works[1, ], df_test)
# Using readr, error: Incomplete multibyte sequence
fails <- readr::read_table(tmp_file_name, col_names = FALSE,
                           locale = readr::locale(encoding = "UTF-16LE"))
assertthat::are_equal(fails, df_test)

@zeehio

Contributor

zeehio commented Mar 22, 2016

Narrowing down this issue. I suspect the tokenizer in readr does not account for multibyte line feeds.

check_encodings <- function(text, encoding) {
  text_conv <- iconv(text, from = "UTF-8", to = encoding, toRaw = TRUE)[[1]]
  ds <- readr:::datasource_string(text, skip = 0)
  lines <- readr:::read_lines_(ds, locale_ = readr::locale())
  ds_raw <- readr:::datasource_raw(text_conv, skip = 0, comment = "")
  lines_raw <- readr:::read_lines_(ds_raw, locale_ = readr::locale(encoding = encoding))
  assertthat::are_equal(lines, lines_raw)
}

check_encodings("3\t2", "UTF-16LE")    # works
check_encodings("3\t2\n", "UTF-16LE")  # fails: Error: incomplete multibyte sequence

check_encodings("3\t2", "UTF-16BE")    # works
check_encodings("3\t2\n", "UTF-16BE")  # fails: Error: incomplete multibyte sequence

@hadley, this issue may require pushing the locale information into the tokenizer (at least the file encoding, so it knows the right byte sequence for '\n'). I don't know whether that is acceptable to you, or if you wanted the tokenizer to be locale-independent. This assumes my suspicion about multibyte line feeds is correct. I could write a small C++ patch, but if I am right then several parts of the tokenizer need fixing. Could you please take a look at this issue and give some advice?
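The multibyte-line-feed suspicion is easy to check with base R alone: in UTF-16LE a line feed is the two bytes 0a 00 (00 0a in UTF-16BE), so a tokenizer that scans for the single byte 0x0a will stop in the middle of a code unit and leave an incomplete sequence behind:

```r
# '\n' is one byte in UTF-8 ...
utf8_nl <- charToRaw("\n")  # 0a
# ... but two bytes in the UTF-16 encodings
utf16le_nl <- iconv("\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]  # 0a 00
utf16be_nl <- iconv("\n", from = "UTF-8", to = "UTF-16BE", toRaw = TRUE)[[1]]  # 00 0a
```

This also matches the observed behavior above: input without a trailing '\n' round-trips fine, and only input containing a line feed triggers the error.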

@hadley

Member

hadley commented Jun 2, 2016

@zeehio Yeah, this is a big issue that will need some thought. In general, readr currently assumes that it can read byte-by-byte, and anything else will require quite a lot of work/thought.

@hadley hadley changed the title from Error: Incomplete multibyte sequence to Support multibyte encodings (e.g. UTF-16LE) Jun 2, 2016

@jimhester jimhester added this to Multi-byte in jimhester Feb 7, 2017

@hadley hadley removed the ready label Feb 10, 2017

@jimhester jimhester removed this from Multi-byte in jimhester Feb 10, 2017

@jimhester jimhester added this to the backlog milestone Nov 19, 2018
