Files encoded with UTF-8 BOM are (still) not supported #500

huftis · 2016-08-08T18:58:31Z

I was encouraged to file this bug report by https://blog.rstudio.org/2016/08/05/readr-1-0-0/#comments. Basically, bug #263 is not fixed in readr 1.0.0, despite being closed as fixed. Files encoded with UTF-8 with BOM (e.g. CSV files saved by Excel) are not correctly read:

The BOM character is not being discarded.
If the column names are quoted, the quotes are not being removed from the first column.
The character encoding name UTF-8-BOM, introduced in R 3.0.0, is not recognised.

Here’s an example (ZIP file containing a UTF-8-BOM-encoded CSV file). The file looks like this (with an embedded BOM):

"foo","bar","b☺z"
1,3,5
2,4,6

Reproducible example:

library(readr)
d = read_csv("utf8-bom.csv")
d

This is shown as

Source: local data frame [2 x 3]

  "foo"   bar   b☺z
  <int> <int> <int>
1     1     3     5
2     2     4     6

when run under a UTF-8 locale. On other locales (e.g. Windows), the name of the first column is instead shown as <U+FEFF>"foo" (though some other ways of showing the column hide the U+FEFF character reference). In any case, it contains an embedded BOM character, as witnessed by

> charToRaw(names(d)[1])
[1] ef bb bf 22 66 6f 6f 22

> nchar(names(d))
[1] 6 3 3

under both Windows and Linux.

Expected results: The BOM and the two quote characters should be discarded, so the above commands should result in:

> charToRaw(names(d)[1])
[1] 66 6f 6f

> nchar(names(d))
[1] 3 3 3
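Until the reader handles this itself, the stray BOM and quotes can be stripped from the column names after reading. The following is only a minimal sketch in base R: `strip_bom_names` is a hypothetical helper name (not part of readr), and it assumes the BOM shows up as a leading U+FEFF character with at most one pair of surrounding double quotes, as in the output above.

```r
# Hypothetical helper (not part of readr): strip a leading BOM (U+FEFF)
# and one pair of surrounding double quotes from each column name.
strip_bom_names <- function(x) {
  x <- sub("^\ufeff", "", x)       # drop a leading byte order mark, if any
  x <- sub('^"(.*)"$', "\\1", x)   # unwrap names that are still quoted
  x
}

# Names as reported above: BOM + quoted "foo", then two clean names
bad <- c("\ufeff\"foo\"", "bar", "b\u263az")
good <- strip_bom_names(bad)
nchar(good)
```

Applied to the data frame from the example, this would be `names(d) <- strip_bom_names(names(d))`. It is a workaround for the symptom only; the file is still being parsed with the BOM in place.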

Also, trying to explicitly use the UTF-8-BOM encoding results in an error message:

> d = read_csv("utf8-bom.csv", locale = locale(encoding = "UTF-8-BOM"))
Error: Unknown encoding UTF-8-BOM

While the base read.csv() function works fine:

> d = read.csv("utf8-bom.csv", fileEncoding="UTF-8-BOM")
> nchar(names(d))
[1] 3 3 3

(Omitting the fileEncoding argument here may or may not work, depending on which platform you’re on. It works on my UTF-8-based Linux box.)
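Since the behaviour depends on whether the file really starts with a BOM, it can help to check the first three bytes directly. The sketch below uses only base R; `has_utf8_bom` is a hypothetical helper name, and the two temporary files are made up for the demonstration.

```r
# Hypothetical helper: TRUE if the file begins with the UTF-8 BOM bytes EF BB BF.
has_utf8_bom <- function(path) {
  bom <- as.raw(c(0xef, 0xbb, 0xbf))
  head_bytes <- readBin(path, what = "raw", n = 3L)  # read at most 3 bytes
  identical(head_bytes, bom)
}

# Build two small demo files: one with a BOM, one without
csv_body <- charToRaw("\"foo\",\"bar\"\n1,3\n2,4\n")
with_bom <- tempfile(fileext = ".csv")
without_bom <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xef, 0xbb, 0xbf)), csv_body), with_bom)
writeBin(csv_body, without_bom)

has_utf8_bom(with_bom)
has_utf8_bom(without_bom)
```

With such a check, the fileEncoding passed to read.csv() could be chosen per file, for example "UTF-8-BOM" when the check is TRUE and "UTF-8" otherwise.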


DavisDaddy · 2016-11-04T01:00:24Z

I've run into this problem as well. Google Finance evidently uses BOM encoding. As an example, try to use read_csv on the "Download to spreadsheet" link at:

https://www.google.com/finance/historical?cid=13772865&startdate=Dec+1%2C+2011&enddate=Jan+31%2C+2012&num=30&ei=8sgWWIDZDKiLiwKhpJbwCg

I presume that any other Google Finance link would behave the same way. Just as noted above, the base-R version does deal appropriately with this situation:

URLg <- "<the URL above>"  # placeholder: substitute the download link
read.csv(URLg,
    stringsAsFactors=FALSE,
    encoding="UTF-8-BOM")

testlnord · 2016-11-28T11:34:18Z

It seems a skipBom(...) function was added to the Source.h file, but it is never called in the master branch.
The lines:

    // Skip byte order mark, if needed
    begin_ = skipBom(begin_, end_);

exist in the SourceFile.h, SourceRaw.h and SourceString.h files in the original PR by @jimhester. It seems these changes were reverted.
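For readers who have not looked at the C++ sources, the idea of such a step is simply to advance past the three BOM bytes before parsing starts. The following is an R analogue written for illustration only; `skip_bom` is a hypothetical name, and this is not the readr code from the PR.

```r
# Illustrative only: an R analogue of a "skip BOM" step, not readr's C++ code.
skip_bom <- function(bytes) {
  bom <- as.raw(c(0xef, 0xbb, 0xbf))
  if (length(bytes) >= 3L && identical(bytes[1:3], bom)) {
    bytes[-(1:3)]   # start reading after the three BOM bytes
  } else {
    bytes           # no BOM: leave the input untouched
  }
}

# Demo input: BOM followed by a tiny CSV payload
input <- c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw("foo,bar\n1,3\n"))
rawToChar(skip_bom(input))
```

If the call is left out, as described above, the parser sees the BOM bytes as part of the first field, which matches the symptoms in the original report.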

jimhester · 2016-11-28T16:51:49Z

Sorry, as noted the change was inadvertently reverted. It should now be available in devel readr as of 70b2a3c:

devtools::install_github("tidyverse/readr")

jimhester closed this as completed Nov 28, 2016

lock bot locked and limited conversation to collaborators Sep 25, 2018
