Files encoded with UTF-8 BOM are (still) not supported #500

Closed
huftis opened this issue Aug 8, 2016 · 3 comments

Comments

@huftis

huftis commented Aug 8, 2016

I was encouraged to file this bug report by https://blog.rstudio.org/2016/08/05/readr-1-0-0/#comments. Basically, bug #263 is not fixed in readr 1.0.0, despite being closed as fixed. Files encoded as UTF-8 with a BOM (e.g. CSV files saved by Excel) are not read correctly:

  1. The BOM character is not being discarded.
  2. If the column names are quoted, the quotes are not being removed from the first column.
  3. The character encoding name UTF-8-BOM, introduced in R 3.0.0, is not recognised.

Here’s an example (ZIP file containing a UTF-8-BOM-encoded CSV file). The file looks like this (with an embedded BOM):

"foo","bar","b☺z"
1,3,5
2,4,6

Reproducible example:

library(readr)
d = read_csv("utf8-bom.csv")
d

This is shown as

Source: local data frame [2 x 3]

  "foo"   bar   b☺z
  <int> <int> <int>
1     1     3     5
2     2     4     6

when run under a UTF-8 locale. On other locales (e.g. Windows), the name of the first column is instead shown as <U+FEFF>"foo" (though some other ways of showing the column hide the U+FEFF character reference). In any case, it contains an embedded BOM character, as witnessed by

> charToRaw(names(d)[1])
[1] ef bb bf 22 66 6f 6f 22

> nchar(names(d))
[1] 6 3 3

under both Windows and Linux.
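To check whether a given file starts with a UTF-8 BOM at all, one can look at its first three bytes. The following is a minimal base-R sketch; the helper name has_utf8_bom and the temporary demo file are made up for illustration and are not part of readr:

```r
# Hypothetical helper: TRUE if the file begins with the UTF-8 BOM (EF BB BF).
has_utf8_bom <- function(path) {
  first <- readBin(path, what = "raw", n = 3L)
  identical(first, as.raw(c(0xef, 0xbb, 0xbf)))
}

# Demo on a temporary file written with a leading BOM.
tmp <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xef, 0xbb, 0xbf)),
           charToRaw('"foo","bar"\n1,3\n2,4\n')),
         tmp)
has_utf8_bom(tmp)
```

A file shorter than three bytes simply yields FALSE, since readBin() then returns fewer bytes than the BOM has.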

Expected results: The BOM and the two quote characters should be discarded, so the above commands should result in:

> charToRaw(names(d)[1])
[1] 66 6f 6f

> nchar(names(d))
[1] 3 3 3

Also, trying to explicitly use the UTF-8-BOM encoding results in an error message:

> d = read_csv("utf8-bom.csv", locale = locale(encoding = "UTF-8-BOM"))
Error: Unknown encoding UTF-8-BOM

The base read.csv() function, by contrast, works fine:

> d = read.csv("utf8-bom.csv", fileEncoding="UTF-8-BOM")
> nchar(names(d))
[1] 3 3 3

(Omitting the fileEncoding argument may or may not work here, depending on which platform you’re on. It works on my UTF-8-based Linux box.)
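Until this is fixed, the first column name could be cleaned up by hand after reading. This is only a sketch of a possible workaround, working on the raw bytes so that it does not depend on the locale; the function name strip_bom_and_quotes is made up for illustration:

```r
# Hypothetical workaround: drop a leading UTF-8 BOM and surrounding
# double quotes from a single column name.
strip_bom_and_quotes <- function(name) {
  bom <- as.raw(c(0xef, 0xbb, 0xbf))
  bytes <- charToRaw(name)
  if (length(bytes) >= 3 && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]
  }
  out <- rawToChar(bytes)
  Encoding(out) <- "UTF-8"  # assumes the name is UTF-8 encoded
  gsub('^"|"$', "", out)
}

# The first column name from the example above, rebuilt from its bytes.
bad_name <- rawToChar(as.raw(c(0xef, 0xbb, 0xbf, 0x22, 0x66, 0x6f, 0x6f, 0x22)))
charToRaw(strip_bom_and_quotes(bad_name))
```

With a data frame d read as in the reproducible example, this would be applied as names(d)[1] <- strip_bom_and_quotes(names(d)[1]).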

@DavisDaddy

I've run into this problem as well. Google Finance evidently uses BOM encoding. As an example, try to use read_csv on the "Download to spreadsheet" link at:

https://www.google.com/finance/historical?cid=13772865&startdate=Dec+1%2C+2011&enddate=Jan+31%2C+2012&num=30&ei=8sgWWIDZDKiLiwKhpJbwCg

I presume that any other Google Finance link would behave the same way. Just as noted above, the base-R version does deal appropriately with this situation:

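URLg <- "https://www.google.com/finance/historical?cid=13772865&startdate=Dec+1%2C+2011&enddate=Jan+31%2C+2012&num=30&ei=8sgWWIDZDKiLiwKhpJbwCg"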
read.csv(URLg,
    stringsAsFactors=FALSE,
    encoding="UTF-8-BOM")

@testlnord

It seems a skipBom(...) function was added to the Source.h file, but it is never called in the master branch.
Lines:

    // Skip byte order mark, if needed
    begin_ = skipBom(begin_, end_);

exist in the SourceFile.h, SourceRaw.h and SourceString.h files in the original PR by @jimhester. It seems these changes were reverted.

@jimhester
Collaborator

Sorry, as noted, the change was inadvertently reverted. It should now be available in devel readr as of 70b2a3c:

devtools::install_github("tidyverse/readr")

@lock lock bot locked and limited conversation to collaborators Sep 25, 2018