
Support multibyte encodings (e.g. UTF-16LE) #306

Closed · bilydr opened this issue Nov 3, 2015 · 16 comments
Labels: feature (a feature request or enhancement), multibyte 🦋

Comments


bilydr commented Nov 3, 2015

Hi, I am trying to read in a file with UTF-16LE encoding.

This works with base R:

df <- read.delim(file1, stringsAsFactors = FALSE, fileEncoding = 'UTF-16LE')

But when I try to do the same with readr:

df <- read_tsv(file1, locale = locale(encoding = 'UTF-16LE'))

I get the error: Error: Incomplete multibyte sequence

Can you please help fix it? Thanks for your advice!

hadley (Member) commented Nov 3, 2015

Can you please provide a reproducible example?

bilydr (Author) commented Nov 4, 2015

Hi, please see the data file and code in the folder below:

https://www.dropbox.com/sh/6vw6irfe0m3f5bm/AACmdssmln4iZjxkpzbE1MEaa?dl=0

Thanks!


bilydr (Author) commented Nov 27, 2015

Hi, any follow-up on this issue? Thanks!

PedramNavid commented:

I'm having a similar problem. Even when I try loading the file using a file connection with an explicit encoding, the read fails:

data.table(read_csv(file("data/Parking_Tags_data_2008.csv", encoding = "UCS-2LE")))

for example, reads garbage, while read.csv works fine.

zeehio (Contributor) commented Feb 29, 2016

I simplified the example so that it only needs R (no external files required).

tmp_file_name <- tempfile()
# The UTF-16LE Byte Order Mark (0xFF 0xFE):
rawbom <- as.raw(c(255, 254))
# The text, converted to UTF-16LE:
text <- "1\t2\n"
text_utf16 <- iconv(text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)
# Write the BOM and the text to a file:
fd <- file(tmp_file_name, "wb")
writeBin(rawbom, fd)
writeBin(text_utf16[[1]], fd)
close(fd)
# The expected result, to compare against:
df_test <- data.frame(V1 = 1, V2 = 2)
# Using read.delim, no problem:
works <- read.delim(tmp_file_name, header = FALSE, fileEncoding = "UTF-16LE")
assertthat::are_equal(works[1, ], df_test)
# Using readr, error: "Incomplete multibyte sequence"
fails <- readr::read_table(tmp_file_name, col_names = FALSE,
                           locale = readr::locale(encoding = "UTF-16LE"))
assertthat::are_equal(fails, df_test)

zeehio (Contributor) commented Mar 22, 2016

Narrowing down this issue: I suspect the tokenizer in readr does not account for multibyte line feeds (see the byte-level check after the example below).

check_encodings <- function(text, encoding) {
  text_enc <- iconv(text, from = "UTF-8", to = encoding, toRaw = TRUE)[[1]]
  ds <- readr:::datasource_string(text, skip = 0)
  lines <- readr:::read_lines_(ds, locale_ = readr::locale())
  ds_raw <- readr:::datasource_raw(text_enc, skip = 0, comment = "")
  lines_raw <- readr:::read_lines_(ds_raw, locale_ = readr::locale(encoding = encoding))
  assertthat::are_equal(lines, lines_raw)
}

check_encodings("3\t2", "UTF-16LE")     # works
check_encodings("3\t2\n", "UTF-16LE")   # fails: Error: incomplete multibyte sequence

check_encodings("3\t2", "UTF-16BE")     # works
check_encodings("3\t2\n", "UTF-16BE")   # fails: Error: incomplete multibyte sequence
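To make the suspicion concrete, here is a quick byte-level check (a minimal sketch added for illustration): in UTF-16 the line feed occupies two bytes, so a tokenizer that scans for the single byte 0x0A would stop mid-character.

# '\n' is the two-byte sequence 0x0A 0x00 in UTF-16LE and 0x00 0x0A in
# UTF-16BE; a single-byte scan for 0x0A cannot delimit lines correctly.
iconv("\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
#> [1] 0a 00
iconv("\n", from = "UTF-8", to = "UTF-16BE", toRaw = TRUE)[[1]]
#> [1] 00 0a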

@hadley, this issue may require pushing the locale information into the tokenizer (at least the file encoding, so that it knows the right byte sequence for '\n'). I don't know whether that is acceptable to you, or whether you wanted the tokenizer to be locale-independent. Assuming my suspicion about multibyte line feeds is correct, I could write a small C++ patch, but several parts of the tokenizer would need fixing. Could you please take a look at this issue and give some advice?

hadley (Member) commented Jun 2, 2016

@zeehio Yeah, this is a big issue that will need some thought. In general, readr currently assumes that it can read byte-by-byte, and anything else will require quite a lot of work/thought.
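An added illustration (not from the thread) of why byte-by-byte reading breaks down: in UTF-16 the byte 0x0A can also occur inside an ordinary character, so even searching for newlines byte-by-byte is unsafe.

# U+260A ("☊") encodes to the bytes 0x0A 0x26 in UTF-16LE, so a
# byte-oriented newline search would find a spurious 0x0A inside it.
iconv("\u260A", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
#> [1] 0a 26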

@hadley changed the title from "Error: Incomplete multibyte sequence" to "Support multibyte encodings (e.g. UTF-16LE)" on Jun 2, 2016
@hadley added the "feature" and "ready" labels on Jun 2, 2016
@hadley removed the "ready" label on Feb 10, 2017
@jimhester added this to the backlog milestone on Nov 19, 2018
(Five comments from @jvschoen, @batpigandme, and @zeehio were minimized.)

dhofstetter commented:

I'm currently facing this issue as well.

Are there any plans to fix it? It's really a pity not being able to use my beloved readr functions for the current project :-(

BR, Daniel

espinielli commented:

If it helps: I could read a UTF-16 encoded CSV with

library(magrittr)  # for %>%
library(tibble)    # for as_tibble()

read.delim("data-raw/swiss_2020-06-01.csv",
           sep = ";",
           stringsAsFactors = FALSE,
           fileEncoding = 'UTF-16') %>%
  as_tibble()

readr::read_csv2 and data.table::fread did not work.
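A workaround along the same lines (a sketch added here, reusing the file path from the snippet above): let base R's connection layer re-encode the file, then hand the text to readr, which treats a string containing a newline as literal data (readr 1.x; readr 2.x would want I(txt)).

# Sketch only: re-encode via a base R connection, then parse with readr.
# Assumes the file fits comfortably in memory.
con <- file("data-raw/swiss_2020-06-01.csv", encoding = "UTF-16")
txt <- paste(readLines(con, warn = FALSE), collapse = "\n")
close(con)
df <- readr::read_csv2(txt)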

anadiedrichs commented:

I had the same problem. I could read a UTF-16LE encoded file with the read.csv function.

This dataset could be a great example: https://sisa.msal.gov.ar/datos/descargas/covid-19/files/Covid19Casos.csv

Source: http://datos.salud.gob.ar/dataset/covid-19-casos-registrados-en-la-republica-argentina

The dataset size is around 300 MB and contains information about COVID19 cases in Argentina.

Output of the file command on Linux:

$ file Covid19Casos.csv
Covid19Casos.csv: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators

Output of the guess_encoding function:

library(readr)

guess_encoding("Covid19Casos.csv")

# A tibble: 3 x 2
  encoding   confidence
  <chr>           <dbl>
1 UTF-16LE         1
2 ISO-8859-1       0.68
3 ISO-8859-2       0.48

The function read.csv worked in this case.

dataset <- read.csv("Covid19Casos.csv", fileEncoding="UTF-16LE")

@jimhester removed this from the backlog milestone on May 11, 2021
jimhester (Collaborator) commented:

Closed by tidyverse/vroom@7d54cda
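Added note: with a vroom-backed readr (2.0.0 or later), the originally reported call pattern should work. A minimal check, reusing the UTF-16LE file generated in zeehio's example above:

# Minimal check, assuming readr >= 2.0.0: the encoding is taken from the
# locale, exactly as originally attempted.
df <- readr::read_tsv(tmp_file_name, col_names = FALSE,
                      locale = readr::locale(encoding = "UTF-16LE"))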
