I was encouraged to file this bug report by https://blog.rstudio.org/2016/08/05/readr-1-0-0/#comments. Basically, bug #263 is not fixed in readr 1.0.0, despite being closed as fixed. Files encoded with UTF-8 with BOM (e.g. CSV files saved by Excel) are not correctly read:
The BOM character is not being discarded.
If the column names are quoted, the quotes are not being removed from the first column.
The character encoding name UTF-8-BOM, introduced in R 3.0.0, is not recognised.
Here’s an example (ZIP file containing a UTF-8-BOM-encoded CSV file). The file looks like this (with an embedded BOM):
"foo","bar","b☺z"
1,3,5
2,4,6
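For anyone without the ZIP attachment, a file like this can probably be recreated by hand. The following is only a sketch in base R, assuming a temporary path and that the BOM is the usual three bytes EF BB BF. The file name and variable names are my own choices, not part of the original attachment.

```r
# Sketch: write a UTF-8 CSV with a leading BOM (bytes EF BB BF), base R only.
# The path and variable names are illustrative assumptions.
path <- file.path(tempdir(), "utf8-bom.csv")
bom  <- as.raw(c(0xEF, 0xBB, 0xBF))

# \u263a is the smiley in the third column name; enc2utf8() makes sure the
# bytes written below are UTF-8 regardless of the session locale.
body <- enc2utf8("\"foo\",\"bar\",\"b\u263az\"\n1,3,5\n2,4,6\n")

# Write the raw bytes so that nothing gets re-encoded on the way out.
writeBin(c(bom, charToRaw(body)), path)

# Confirm that the file really starts with the BOM.
first_bytes <- readBin(path, what = "raw", n = 3L)
print(first_bytes)
```

Writing raw bytes rather than going through writeLines() avoids any locale-dependent conversion, which matters when the point is to control the exact byte sequence.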
Reproducible example:
library(readr)
d = read_csv("utf8-bom.csv")
d
This is shown as
Source: local data frame [2 x 3]

  "foo"   bar   b☺z
  <int> <int> <int>
1     1     3     5
2     2     4     6
when run under a UTF-8 locale. On other locales (e.g. Windows), the name of the first column is instead shown as <U+FEFF>"foo" (though some other ways of showing the column hide the U+FEFF character reference). In any case, it contains an embedded BOM character, as witnessed by
I've run into this problem as well. Google Finance evidently uses BOM encoding. As an example, try to use read_csv on the "Download to spreadsheet" link at:
I presume that any other Google Finance link would behave the same way. Just as noted above, the base-R version does deal appropriately with this situation:
The embedded BOM character in the first column name is present under both Windows and Linux.
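One way to check a column name for a leading BOM, and to clean it up as a stopgap, is to look at its code points. This is a sketch only: has_bom() and strip_bom_and_quotes() are hypothetical helpers I made up for illustration, not readr functions, and the name below is built by hand rather than read from the file.

```r
# Hypothetical helpers (not part of readr) for inspecting a single column name.

# TRUE if the string starts with U+FEFF (the BOM / zero-width no-break space).
has_bom <- function(x) {
  cp <- utf8ToInt(enc2utf8(x))
  length(cp) > 0 && cp[1] == 0xFEFF
}

# Drop a leading U+FEFF, then remove any double quotes left in the name.
strip_bom_and_quotes <- function(x) {
  cp <- utf8ToInt(enc2utf8(x))
  if (length(cp) > 0 && cp[1] == 0xFEFF) cp <- cp[-1]
  gsub("\"", "", intToUtf8(cp), fixed = TRUE)
}

# A name shaped like the one reported above: BOM, then "foo" in quotes.
first_name <- "\ufeff\"foo\""

bom_found  <- has_bom(first_name)
clean_name <- strip_bom_and_quotes(first_name)
print(bom_found)
print(clean_name)
```

Something like names(d)[1] <- strip_bom_and_quotes(names(d)[1]) could then serve as a workaround after reading, though it obviously does not fix the underlying parsing problem.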
Expected results: The BOM and the two quote characters should be discarded, so the above commands should result in:
Also, trying to explicitly use the UTF-8-BOM encoding results in an error message:
While the base read.csv() function works fine: (Here, not specifying a fileEncoding argument may or may not work, depending on which platform you’re on. It works on my UTF-8-based Linux box.)
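To illustrate the base R behaviour described here, the following sketch writes a small BOM-prefixed file and reads it back with read.csv() and fileEncoding = "UTF-8-BOM". It is not the exact command from this report. The temporary file and the ASCII-only header (baz instead of b☺z) are simplifying assumptions, made so that the result does not depend on the session locale.

```r
# Sketch: base R reading a BOM-prefixed CSV via the "UTF-8-BOM" encoding name.
# The temp file and ASCII-only header are assumptions made for portability.
path <- tempfile(fileext = ".csv")
writeBin(
  c(as.raw(c(0xEF, 0xBB, 0xBF)),                          # the BOM itself
    charToRaw("\"foo\",\"bar\",\"baz\"\n1,3,5\n2,4,6\n")), # quoted header + data
  path
)

# "UTF-8-BOM" asks the connection to skip a leading BOM before decoding.
d <- read.csv(path, fileEncoding = "UTF-8-BOM")

# The first name should come back without the BOM and without the quotes.
print(names(d))
print(d)
```

With the non-ASCII b☺z header from the real file, check.names = FALSE would probably also be wanted, so that the third column name is not rewritten by make.names().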