Accident_Index is not identical to Accident_Index #83

layik · 2019-01-11T11:01:09Z

The identical() function in R behaves differently on macOS and Linux than on Windows for a particular, reproducible case involving strings in stats19.

Links & notes:
https://www.r-project.org/bugs.html

The first column name of the 2017 crashes file, available at the URL below, triggers the behaviour:
http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Accidents_2017.zip

codecov-io · 2019-01-11T11:13:11Z

Codecov Report

Merging #83 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master      #83   +/-   ##
=======================================
  Coverage   90.36%   90.36%           
=======================================
  Files           6        6           
  Lines         249      249           
=======================================
  Hits          225      225           
  Misses         24       24

Impacted Files	Coverage Δ
R/read.R	`98.64% <100%> (ø)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 276e5d1...c3d44cc.

layik · 2019-01-14T21:05:35Z

devtools::install_github("ropensi/stats19")
#> Error: HTTP error 404.
#>   Not Found
#> 
#>   Rate limit remaining: 56/60
#>   Rate limit reset at: 2019-01-14 22:00:47 UTC
#> 
#> 
library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
acc_2017 = stats19::file_names$dftRoadSafetyData_Accidents_2017.zip
dl_stats19(year = 2017, type = 'acc')
#> Files identified: dftRoadSafetyData_Accidents_2017.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Accidents_2017.zip
#> Attempt downloading from:
#> Data saved at /var/folders/z7/l4z5fwqs2ksfv22ghh2n9smh0000gp/T//RtmpUJVRWT/dftRoadSafetyData_Accidents_2017/Acc.csv
r = read.csv(file.path(tempdir(), 
                       sub(".zip", "", acc_2017),
                       "Acc.csv"),
             nrows = 1)
n = names(r)[1]
nchar(n)
#> [1] 14

Created on 2019-01-14 by the reprex package (v0.2.0).

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14

layik · 2019-01-14T21:06:11Z

devtools::install_github("ropensci/stats19")
library(stats19)
acc_2017 = stats19::file_names$dftRoadSafetyData_Accidents_2017.zip
dl_stats19(year = 2017, type = 'acc')
r = read.csv(file.path(tempdir(), 
                       sub(".zip", "", acc_2017),
                       "Acc.csv"),
             nrows = 1)
n = names(r)[1]
nchar(n)

layik · 2019-01-14T21:18:16Z

The reprex for Windows below returns 17 characters instead of the 14 returned by the macOS version above.

layik · 2019-01-14T21:20:45Z

The bash way of checking the string in that csv file could be as follows:

curl -o acc2017.zip http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Accidents_2017.zip
unzip acc2017.zip
head -n 1 Acc.csv | file -i -
head -n 1 Acc.csv | cut -c 1-15
Accident_Index
head -n 1 Acc.csv | cut -c 1-16
Accident_Index,

So we are sure it's 14 characters.

Robinlovelace · 2019-01-15T10:39:39Z

devtools::install_github("ropensi/stats19")
#> Downloading GitHub repo ropensi/stats19@master
#> from URL https://api.github.com/repos/ropensi/stats19/zipball/master
#> Installation failed: Not Found (404)
library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
acc_2017 = stats19::file_names$dftRoadSafetyData_Accidents_2017.zip
dl_stats19(year = 2017, type = 'acc')
#> Files identified: dftRoadSafetyData_Accidents_2017.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Accidents_2017.zip
#> Attempt downloading from:
#> Data saved at C:\Users\georl\AppData\Local\Temp\RtmpuuS6fK/dftRoadSafetyData_Accidents_2017/Acc.csv
r = read.csv(file.path(tempdir(), 
                       sub(".zip", "", acc_2017),
                       "Acc.csv"),
             nrows = 1)
n = names(r)[1]
nchar(n)
#> [1] 17

Created on 2019-01-15 by the reprex package (v0.2.0).

Session info

devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.5.0 (2018-04-23)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United Kingdom.1252 
#>  tz       Europe/London               
#>  date     2019-01-15
#> Packages -----------------------------------------------------------------
#>  package   * version date       source                           
#>  backports   1.1.2   2017-12-13 CRAN (R 3.5.0)                   
#>  base      * 3.5.0   2018-04-23 local                            
#>  compiler    3.5.0   2018-04-23 local                            
#>  curl        3.2     2018-03-28 CRAN (R 3.5.0)                   
#>  datasets  * 3.5.0   2018-04-23 local                            
#>  devtools    1.13.6  2018-06-27 CRAN (R 3.5.1)                   
#>  digest      0.6.15  2018-01-28 CRAN (R 3.5.0)                   
#>  evaluate    0.11    2018-07-17 CRAN (R 3.5.1)                   
#>  git2r       0.23.0  2018-07-17 CRAN (R 3.5.1)                   
#>  graphics  * 3.5.0   2018-04-23 local                            
#>  grDevices * 3.5.0   2018-04-23 local                            
#>  htmltools   0.3.6   2017-04-28 CRAN (R 3.5.0)                   
#>  httr        1.3.1   2017-08-20 CRAN (R 3.5.0)                   
#>  jsonlite    1.5     2017-06-01 CRAN (R 3.5.0)                   
#>  knitr       1.20    2018-02-20 CRAN (R 3.5.0)                   
#>  magrittr    1.5     2014-11-22 CRAN (R 3.5.0)                   
#>  memoise     1.1.0   2017-04-21 CRAN (R 3.5.0)                   
#>  methods   * 3.5.0   2018-04-23 local                            
#>  R6          2.2.2   2017-06-17 CRAN (R 3.5.0)                   
#>  Rcpp        0.12.18 2018-07-23 CRAN (R 3.5.1)                   
#>  rmarkdown   1.10    2018-06-11 CRAN (R 3.5.1)                   
#>  rprojroot   1.3-2   2018-01-03 CRAN (R 3.5.0)                   
#>  stats     * 3.5.0   2018-04-23 local                            
#>  stats19   * 0.1.1   2019-01-15 Github (ropensci/stats19@dc5da5e)
#>  stringi     1.1.7   2018-03-12 CRAN (R 3.5.0)                   
#>  stringr     1.3.1   2018-05-10 CRAN (R 3.5.1)                   
#>  tools       3.5.0   2018-04-23 local                            
#>  utils     * 3.5.0   2018-04-23 local                            
#>  withr       2.1.2   2018-03-15 CRAN (R 3.5.0)                   
#>  yaml        2.2.0   2018-07-25 CRAN (R 3.5.1)

Robinlovelace · 2019-01-15T10:41:02Z

Heads-up @layik the above reprex shows the behaviour on Windows. Hope that is useful. May be worth asking the question here if you're finding consistently strange behaviour: https://stat.ethz.ch/mailman/listinfo/r-devel

layik · 2019-01-17T21:19:49Z

From an email to the r-help mailing list, with a contribution from Ivan Krylov:

The bash code shows that the column name actually starts with what is called a Byte Order Mark (BOM, U+FEFF), encoded in UTF-8 as the bytes 0xEF 0xBB 0xBF, which can be seen below:

$ head -n 1 Acc.csv | cut -c 1-15 | hexdump
0000000 ef bb bf 41 63 63 69 64 65 6e 74 5f 49 6e 64 65
0000010 78 0a                                          
0000012
$ echo Accident_Index |hexdump
0000000 41 63 63 69 64 65 6e 74 5f 49 6e 64 65 78 0a   
000000f

The question, then, is why nchar or identical behave differently on Windows. In other words, do these base R functions handle Byte Order Marks differently on Windows?
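As a minimal sketch of what the hexdump implies for R, the string can be rebuilt in memory from the same bytes rather than read from the DfT file. The names `bom_bytes`, `with_bom` and `strip_bom` are illustrative only and are not part of stats19 or base R.

```r
# The three bytes seen at the start of the hexdump output.
bom_bytes <- as.raw(c(0xEF, 0xBB, 0xBF))

# Rebuild the header string as it sits in the file: BOM + "Accident_Index".
with_bom <- rawToChar(c(bom_bytes, charToRaw("Accident_Index")))
Encoding(with_bom) <- "UTF-8"  # declare that the bytes are UTF-8

plain <- "Accident_Index"

nchar(plain, type = "bytes")     # 14
nchar(with_bom, type = "bytes")  # 17: 14 ASCII bytes plus the 3-byte BOM
nchar(with_bom, type = "chars")  # 15: the BOM counts as one UTF-8 character
identical(with_bom, plain)       # FALSE: the strings look alike when printed

# Hypothetical helper: drop the BOM at the byte level if it is present.
strip_bom <- function(x) {
  b <- charToRaw(x)
  if (length(b) >= 3L && identical(b[1:3], bom_bytes)) {
    rawToChar(b[-(1:3)])
  } else {
    x
  }
}

identical(strip_bom(with_bom), plain)  # TRUE
```

Working on raw bytes keeps the check independent of the session locale, which is the variable that differs between the macOS and Windows reprexes.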

layik · 2019-01-17T22:07:50Z

> Why would `identical(str, "Accident_Index", ignore.case = TRUE)`
> behave differently on Linux/MacOS vs Windows?

Because str is different from "Accident_Index" on Windows: it was
decoded from bytes to characters according to different rules when the
file was read.

The default encoding for files being read is specified by the
'encoding' option. On both Windows and Linux I get:

> options('encoding')
$encoding
[1] "native.enc"

For which ?file says (in section "Encoding"):

>> ‘""’ and ‘"native.enc"’ both mean the ‘native’ encoding, that is the
>> internal encoding of the current locale and hence no translation is
>> done.

Linux version of R has a UTF-8 locale (AFAIK, macOS does too) and
decodes the files as UTF-8 by default:

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

locale:
 [1] LC_CTYPE=ru_RU.utf8       LC_NUMERIC=C             
 [3] LC_TIME=ru_RU.utf8        LC_COLLATE=ru_RU.utf8    
 [5] LC_MONETARY=ru_RU.utf8    LC_MESSAGES=ru_RU.utf8   
 [7] LC_PAPER=ru_RU.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=ru_RU.utf8 LC_IDENTIFICATION=C    

While on Windows R uses a single-byte encoding dependent on the locale:

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Russian_Russia.1251  LC_CTYPE=Russian_Russia.1251   
[3] LC_MONETARY=Russian_Russia.1251 LC_NUMERIC=C                   
[5] LC_TIME=Russian_Russia.1251    

> readLines('test.txt')[1]
[1] "п»їAccident_Index"
> nchar(readLines('test.txt')[1])
[1] 17

R on Windows can be explicitly told to decode the file as UTF-8:

> nchar(readLines(file('test.txt',encoding='UTF-8'))[1])
[1] 15

The first character of the string is the invisible byte order mark.
Thankfully, there is an easy fix for that, too. ?file additionally
says:

>> As from R 3.0.0 the encoding ‘"UTF-8-BOM"’ is accepted for
>> reading and will remove a Byte Order Mark if present (which it
>> often is for files and webpages generated by Microsoft applications).

So this is how we get the 14-character column name we'd wanted:

> nchar(readLines(file('test.txt',encoding='UTF-8-BOM'))[1])
[1] 14

For our original task, this means:

> names(read.csv('Acc.csv'))[1] # might produce incorrect results
[1] "п.їAccident_Index"
> names(read.csv('Acc.csv', fileEncoding='UTF-8-BOM'))[1] # correct
[1] "Accident_Index"

-- 
Best regards,
Ivan
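The decoding difference described in the email can be imitated on any platform by interpreting the same bytes under two assumptions. This is only a sketch: it assumes the local iconv implementation accepts the encoding name "CP1251" (check `iconvlist()` if in doubt), and the variable names are illustrative.

```r
# Bytes at the start of the file: UTF-8 BOM followed by "Accident_Index".
header_bytes <- c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("Accident_Index"))
header <- rawToChar(header_bytes)

# Interpretation 1: the bytes are UTF-8, as in a UTF-8 locale.
as_utf8 <- header
Encoding(as_utf8) <- "UTF-8"

# Interpretation 2: each byte is one character in a single-byte code page,
# as in the Russian Windows locale shown in the email (code page 1251).
as_cp1251 <- iconv(header, from = "CP1251", to = "UTF-8")

nchar(as_utf8, type = "chars")    # 15: the BOM is a single character
nchar(as_cp1251, type = "chars")  # 17: each BOM byte becomes its own character
```

The same reasoning applies to code page 1252 in the English Windows session above: the three BOM bytes map to three visible characters, which accounts for the 17 reported there.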

Robinlovelace · 2019-01-17T22:50:57Z

Link to conversation: http://r.789695.n4.nabble.com/Potential-R-bug-in-identical-td4754898.html

Interesting stuff. I imagine there's a reason why fileEncoding='UTF-8-BOM' isn't the default on Windows.
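Since it is not the default, one option is to pass the encoding explicitly wherever the file is read. The sketch below is self-contained: it writes a tiny dummy CSV with a leading BOM instead of downloading the DfT file, and `read_csv_no_bom` is a hypothetical wrapper, not a function in stats19. The second column name and the data row are made up for illustration.

```r
# Write a small CSV that starts with a UTF-8 BOM, mimicking the DfT file.
csv_path <- tempfile(fileext = ".csv")
con <- file(csv_path, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("Accident_Index,Number_of_Vehicles\n2017010001708,2\n"), con)
close(con)

# Hypothetical wrapper: always ask R to drop a BOM if one is present.
read_csv_no_bom <- function(path, ...) {
  utils::read.csv(path, fileEncoding = "UTF-8-BOM", ...)
}

first_name <- names(read_csv_no_bom(csv_path, nrows = 1))[1]
identical(first_name, "Accident_Index")  # TRUE, regardless of locale

unlink(csv_path)
```

Per ?file, "UTF-8-BOM" is accepted for reading from R 3.0.0 onwards, so the wrapper should behave the same on Windows, macOS and Linux.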

mpadge · 2019-01-18T07:52:35Z

Is this an endianness issue? sf has lots of interesting code for dealing with different byte orders on different OSs. Or is it just UTF?

Robinlovelace · 2019-01-18T09:17:13Z

One for the R-help list, or maybe rOpenSci/RStudio discourse. Or maybe even stackoverflow!

layik · 2019-06-11T13:20:29Z

#96

Accident_Index is not identical to Accident_Index

c3d44cc

layik requested a review from Robinlovelace January 11, 2019 11:01

Robinlovelace approved these changes Jan 11, 2019

Robinlovelace merged commit 585abc3 into master Jan 11, 2019

Robinlovelace deleted the r-bug branch January 11, 2019 11:14

Robinlovelace mentioned this pull request Jul 27, 2019

Casualities - Acc_Index error for non English locale #96

Closed
