
Support multibyte encodings (e.g. UTF-16LE) #306

bilydr opened this issue Nov 3, 2015 · 15 comments


@bilydr bilydr commented Nov 3, 2015

Hi, I am trying to read a file with UTF-16LE encoding.

This works with base R:

df <- read.delim(file1, stringsAsFactors = FALSE, fileEncoding = 'UTF-16LE')

but when I try to do the same with readr

df <- read_tsv(file1, locale = locale(encoding = 'UTF-16LE'))

I get the error: Error: Incomplete multibyte sequence

Can you please help fix it? Thanks for your advice!
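Until readr itself supports this, one workaround (a sketch, assuming a small tab-separated UTF-16LE file like the one described above; the sample file here is a stand-in for `file1`) is to let a base R connection do the transcoding and then hand the decoded text to readr:

```r
# Create a small UTF-16LE sample file standing in for `file1`:
file1 <- tempfile(fileext = ".tsv")
con <- file(file1, open = "wb")
writeBin(as.raw(c(0xff, 0xfe)), con)  # UTF-16LE byte order mark
writeBin(iconv("a\tb\n1\t2\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], con)
close(con)

# Workaround: read through a transcoding connection, then parse with readr.
con <- file(file1, encoding = "UTF-16LE")
txt <- paste(readLines(con), collapse = "\n")
close(con)
txt <- sub("^\ufeff", "", txt)   # drop the BOM character if it survived decoding
df <- readr::read_tsv(I(txt))    # I() marks literal data (readr >= 2.0)
```

This sidesteps the tokenizer entirely, so it only helps for files that fit comfortably in memory.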


@hadley hadley commented Nov 3, 2015

Can you please provide a reproducible example?


@bilydr bilydr commented Nov 4, 2015

Hi, please see the data file and code in the folder below.




@bilydr bilydr commented Nov 27, 2015

Hi, any follow-up on this issue? Thanks!


@PedramNavid PedramNavid commented Dec 16, 2015

Having a similar problem. Even when I try loading the file through a file connection with an explicit encoding, the read fails:
data.table(read_csv(file("data/Parking_Tags_data_2008.csv", encoding = "UCS-2LE"))), for example, reads garbage, while read.csv works fine.


@zeehio zeehio commented Feb 29, 2016

I simplified the example so it only needs R to work (no external files required).

tmp_file_name <- tempfile()
# This is the UTF-16LE Byte Order Mark (FF FE):
rawbom <- as.raw(c(255, 254))
# This is the text, converted to UTF-16LE:
text <- "1\t2\n"
text_utf16 <- iconv(text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)
# Write the BOM and the text to a file
fd <- file(tmp_file_name, "wb")
writeBin(rawbom, fd)
writeBin(text_utf16[[1]], fd)
close(fd)
# The expected result, to compare against:
df_test <- data.frame(V1 = 1, V2 = 2)
# Using read.delim: no problem
works <- read.delim(tmp_file_name, header = FALSE, fileEncoding = "UTF-16LE")
assertthat::are_equal(works[1, ], df_test)
# Using readr: error
fails <- readr::read_table(tmp_file_name, col_names = FALSE,
                           locale = readr::locale(encoding = "UTF-16LE"))
assertthat::are_equal(fails, df_test)

@zeehio zeehio commented Mar 22, 2016

Narrowing down this issue. I suspect the tokenizer in readr does not account for multibyte line feeds.

check_encodings <- function(text, encoding) {
  text_utf16 <- iconv(text, from = "UTF-8", to = encoding, toRaw = TRUE)[[1]]
  ds <- readr:::datasource_string(text, skip = 0)
  lines <- readr:::read_lines_(ds, locale_ = readr::locale())
  ds_raw <- readr:::datasource_raw(text_utf16, skip = 0, comment = "")
  lines_raw <- readr:::read_lines_(ds_raw, locale_ = readr::locale(encoding = encoding))
  assertthat::are_equal(lines, lines_raw)
}

check_encodings("3\t2", "UTF-16LE")    # works
check_encodings("3\t2\n", "UTF-16LE")  # fails: Error: incomplete multibyte sequence

check_encodings("3\t2", "UTF-16BE")    # works
check_encodings("3\t2\n", "UTF-16BE")  # fails: Error: incomplete multibyte sequence
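The suspicion is easy to check from R: in UTF-16 every code unit is two bytes, so even ASCII characters such as '\n' do not appear as a single 0x0A byte on disk. A tokenizer that scans the raw byte stream for a lone 0x0A will split the stream mid-character and leave a trailing odd byte, which is consistent with the "incomplete multibyte sequence" error:

```r
# '\n' encoded in UTF-16: two bytes per code unit, not one
iconv("\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
#> [1] 0a 00
iconv("\n", from = "UTF-8", to = "UTF-16BE", toRaw = TRUE)[[1]]
#> [1] 00 0a
```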

@hadley, fixing this may require pushing the locale information into the tokenizer (at the very least the file encoding, so it knows the right byte sequence for '\n'). I don't know whether that is acceptable to you, or whether you wanted the tokenizer to be locale-independent. All of this assumes my suspicion about multibyte line feeds is correct. I feel I could write a small C++ patch, but if I am right, several parts of the tokenizer need fixing. Could you please take a look at this issue and give some advice?


@hadley hadley commented Jun 2, 2016

@zeehio Yeah, this is a big issue that will need some thought. In general, readr currently assumes that it can read byte-by-byte, and anything else will require quite a lot of work/thought.

@hadley hadley changed the title Error: Incomplete multibyte sequence Support multibyte encodings (e.g. UTF-16LE) Jun 2, 2016
@hadley hadley removed the ready label Feb 10, 2017
@jimhester jimhester added this to the backlog milestone Nov 19, 2018


@dhofstetter dhofstetter commented Dec 5, 2019

I'm currently facing this issue as well.

Are there any plans to fix it? It's a real pity not being able to use my beloved readr functions for the current project :-(

BR, Daniel


@espinielli espinielli commented Jun 13, 2020

If it helps, I could read a UTF-16 encoded CSV with base read.csv:

read.csv(..., sep = ";",
         stringsAsFactors = FALSE,
         fileEncoding = 'UTF-16')

readr::read_csv2 and data.table::fread did not work.


@anadiedrichs anadiedrichs commented Sep 14, 2020

I had the same problem. I could read a UTF-16LE encoded file with the read.csv function.

This dataset could be a great example, link:

The dataset is around 300 MB and contains information about COVID-19 cases in Argentina.

Output of the file command on Linux:

$ file Covid19Casos.csv
Covid19Casos.csv: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators

Output of the guess_encoding function:

# A tibble: 3 x 2
  encoding   confidence
1 UTF-16LE         1
2 ISO-8859-1       0.68
3 ISO-8859-2       0.48

The function read.csv worked in this case:

dataset <- read.csv("Covid19Casos.csv", fileEncoding = "UTF-16LE")
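The two steps above can be combined: let readr::guess_encoding() detect the encoding and feed the result straight into base read.csv(), which handles multibyte encodings via fileEncoding. A sketch, using a small generated UTF-16LE file as a stand-in for Covid19Casos.csv:

```r
# Build a tiny UTF-16LE CSV (BOM + content) standing in for the real dataset:
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeBin(as.raw(c(0xff, 0xfe)), con)  # UTF-16LE byte order mark
writeBin(iconv("x,y\n1,2\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], con)
close(con)

# Detect the encoding, then read with base R:
enc <- readr::guess_encoding(path)$encoding[1]
dataset <- read.csv(path, fileEncoding = enc)
```

guess_encoding() returns candidates sorted by confidence, so taking the first row picks the most likely encoding; for files with a BOM, as here, the detection is essentially certain.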
