Accident_Index is not identical to Accident_Index #83

Merged
merged 1 commit into from
Jan 11, 2019

layik
Member

@layik layik commented Jan 11, 2019

The identical() function in R behaves differently on macOS and Linux than on Windows for one reproducible case involving a stats19 column name.

Links & notes:
https://www.r-project.org/bugs.html

The first column name of the 2017 crashes file, linked here, triggers the problem:
http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Accidents_2017.zip

@codecov-io

codecov-io commented Jan 11, 2019

Codecov Report

Merging #83 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master      #83   +/-   ##
=======================================
  Coverage   90.36%   90.36%           
=======================================
  Files           6        6           
  Lines         249      249           
=======================================
  Hits          225      225           
  Misses         24       24
Impacted Files Coverage Δ
R/read.R 98.64% <100%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 276e5d1...c3d44cc.

@Robinlovelace Robinlovelace merged commit 585abc3 into master Jan 11, 2019
@Robinlovelace Robinlovelace deleted the r-bug branch January 11, 2019 11:14
@layik
Member Author

layik commented Jan 14, 2019

devtools::install_github("ropensi/stats19") # "ropensi" is a typo for "ropensci", hence the 404 below
#> Error: HTTP error 404.
#>   Not Found
#> 
#>   Rate limit remaining: 56/60
#>   Rate limit reset at: 2019-01-14 22:00:47 UTC
#> 
#> 
library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
acc_2017 = stats19::file_names$dftRoadSafetyData_Accidents_2017.zip
dl_stats19(year = 2017, type = 'acc')
#> Files identified: dftRoadSafetyData_Accidents_2017.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Accidents_2017.zip
#> Attempt downloading from:
#> Data saved at /var/folders/z7/l4z5fwqs2ksfv22ghh2n9smh0000gp/T//RtmpUJVRWT/dftRoadSafetyData_Accidents_2017/Acc.csv
r = read.csv(file.path(tempdir(), 
                       sub(".zip", "", acc_2017),
                       "Acc.csv"),
             nrows = 1)
n = names(r)[1]
nchar(n)
#> [1] 14

Created on 2019-01-14 by the reprex package (v0.2.0).

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14

@layik
Member Author

layik commented Jan 14, 2019

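# Install from GitHub (the organisation is "ropensci", not "ropensi")
devtools::install_github("ropensci/stats19")
library(stats19)
acc_2017 = stats19::file_names$dftRoadSafetyData_Accidents_2017.zip
dl_stats19(year = 2017, type = 'acc')
# Read only the header and the first row of the unzipped CSV
r = read.csv(file.path(tempdir(), 
                       sub(".zip", "", acc_2017, fixed = TRUE),
                       "Acc.csv"),
             nrows = 1)
# Length of the first column name
n = names(r)[1]
nchar(n)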

@layik
Member Author

layik commented Jan 14, 2019

The reprex for Windows below returns 17 characters instead of the 14 returned by the macOS version above.

@layik
Member Author

layik commented Jan 14, 2019

The bash way of checking the string in that csv file could be as follows:

curl -o acc2017.zip http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Accidents_2017.zip
unzip acc2017.zip
head -n 1 Acc.csv | file -i -
head -n 1 Acc.csv | cut -c 1-15
Accident_Index
head -n 1 Acc.csv | cut -c 1-16
Accident_Index,

So we are sure it's 14 characters.

@Robinlovelace
Member

devtools::install_github("ropensi/stats19") # "ropensi" is a typo for "ropensci", hence the 404 below
#> Downloading GitHub repo ropensi/stats19@master
#> from URL https://api.github.com/repos/ropensi/stats19/zipball/master
#> Installation failed: Not Found (404)
library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
acc_2017 = stats19::file_names$dftRoadSafetyData_Accidents_2017.zip
dl_stats19(year = 2017, type = 'acc')
#> Files identified: dftRoadSafetyData_Accidents_2017.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Accidents_2017.zip
#> Attempt downloading from:
#> Data saved at C:\Users\georl\AppData\Local\Temp\RtmpuuS6fK/dftRoadSafetyData_Accidents_2017/Acc.csv
r = read.csv(file.path(tempdir(), 
                       sub(".zip", "", acc_2017),
                       "Acc.csv"),
             nrows = 1)
n = names(r)[1]
nchar(n)
#> [1] 17

Created on 2019-01-15 by the reprex package (v0.2.0).

Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.5.0 (2018-04-23)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United Kingdom.1252 
#>  tz       Europe/London               
#>  date     2019-01-15
#> Packages -----------------------------------------------------------------
#>  package   * version date       source                           
#>  backports   1.1.2   2017-12-13 CRAN (R 3.5.0)                   
#>  base      * 3.5.0   2018-04-23 local                            
#>  compiler    3.5.0   2018-04-23 local                            
#>  curl        3.2     2018-03-28 CRAN (R 3.5.0)                   
#>  datasets  * 3.5.0   2018-04-23 local                            
#>  devtools    1.13.6  2018-06-27 CRAN (R 3.5.1)                   
#>  digest      0.6.15  2018-01-28 CRAN (R 3.5.0)                   
#>  evaluate    0.11    2018-07-17 CRAN (R 3.5.1)                   
#>  git2r       0.23.0  2018-07-17 CRAN (R 3.5.1)                   
#>  graphics  * 3.5.0   2018-04-23 local                            
#>  grDevices * 3.5.0   2018-04-23 local                            
#>  htmltools   0.3.6   2017-04-28 CRAN (R 3.5.0)                   
#>  httr        1.3.1   2017-08-20 CRAN (R 3.5.0)                   
#>  jsonlite    1.5     2017-06-01 CRAN (R 3.5.0)                   
#>  knitr       1.20    2018-02-20 CRAN (R 3.5.0)                   
#>  magrittr    1.5     2014-11-22 CRAN (R 3.5.0)                   
#>  memoise     1.1.0   2017-04-21 CRAN (R 3.5.0)                   
#>  methods   * 3.5.0   2018-04-23 local                            
#>  R6          2.2.2   2017-06-17 CRAN (R 3.5.0)                   
#>  Rcpp        0.12.18 2018-07-23 CRAN (R 3.5.1)                   
#>  rmarkdown   1.10    2018-06-11 CRAN (R 3.5.1)                   
#>  rprojroot   1.3-2   2018-01-03 CRAN (R 3.5.0)                   
#>  stats     * 3.5.0   2018-04-23 local                            
#>  stats19   * 0.1.1   2019-01-15 Github (ropensci/stats19@dc5da5e)
#>  stringi     1.1.7   2018-03-12 CRAN (R 3.5.0)                   
#>  stringr     1.3.1   2018-05-10 CRAN (R 3.5.1)                   
#>  tools       3.5.0   2018-04-23 local                            
#>  utils     * 3.5.0   2018-04-23 local                            
#>  withr       2.1.2   2018-03-15 CRAN (R 3.5.0)                   
#>  yaml        2.2.0   2018-07-25 CRAN (R 3.5.1)

@Robinlovelace
Member

Heads-up @layik the above reprex shows the behaviour on Windows. Hope that is useful. May be worth asking the question here if you're finding consistently strange behaviour: https://stat.ethz.ch/mailman/listinfo/r-devel

@layik
Member Author

layik commented Jan 17, 2019

From an email to the R-help mailing list, with a contribution from Ivan Krylov:

The bash code below shows that the column name actually starts with a Byte Order Mark (BOM): the bytes 0xEF, 0xBB, 0xBF, which are the UTF-8 encoding of U+FEFF.

$ head -n 1 Acc.csv | cut -c 1-15 | hexdump
0000000 ef bb bf 41 63 63 69 64 65 6e 74 5f 49 6e 64 65
0000010 78 0a                                          
0000012
$ echo Accident_Index |hexdump
0000000 41 63 63 69 64 65 6e 74 5f 49 6e 64 65 78 0a   
000000f
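
The same check can be done from R without leaving the session. This is only a sketch: has_utf8_bom() is a hypothetical helper, not part of stats19 or base R, and the two temporary files are stand-ins for Acc.csv, written byte for byte so the result does not depend on the locale.

```r
# Hypothetical helper (not part of stats19): TRUE if the file starts with
# the three-byte UTF-8 BOM, EF BB BF.
has_utf8_bom <- function(path) {
  first_bytes <- readBin(path, what = "raw", n = 3L)
  identical(first_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Stand-in files; the second column name is made up for illustration.
with_bom <- tempfile(fileext = ".csv")
without_bom <- tempfile(fileext = ".csv")
header <- charToRaw("Accident_Index,example_col\n")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), header), with_bom)
writeBin(header, without_bom)

has_utf8_bom(with_bom)     # TRUE
has_utf8_bom(without_bom)  # FALSE
```

Reading raw bytes with readBin() sidesteps any decoding, which is why it gives the same answer on every platform.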

The question is then: why do nchar and identical behave differently on Windows? In other words, do these base R functions read Byte Order Marks differently on Windows?
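
A minimal sketch of how the same 17 bytes can produce different character counts depending on how they are decoded. It builds the string from raw bytes rather than reading Acc.csv, and it uses latin1 as an assumed stand-in for a single-byte Windows code page, so treat it as an illustration rather than a reproduction of any particular machine.

```r
# The 17 bytes seen in the hexdump: EF BB BF followed by "Accident_Index".
bytes <- c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("Accident_Index"))
x <- rawToChar(bytes)
n_bytes <- nchar(x, type = "bytes")

# Decoded with a single-byte encoding, every byte becomes one character.
# latin1 is an assumption standing in for the Windows code page.
as_single_byte <- iconv(x, from = "latin1", to = "UTF-8")
n_single_byte <- nchar(as_single_byte, type = "chars")

# Decoded as UTF-8, the three BOM bytes collapse into one character, U+FEFF.
as_utf8 <- x
Encoding(as_utf8) <- "UTF-8"
n_utf8 <- nchar(as_utf8, type = "chars")

# With the BOM stripped, the name matches the literal exactly.
no_bom <- sub("^\ufeff", "", as_utf8)
n_no_bom <- nchar(no_bom)

c(bytes = n_bytes, single_byte = n_single_byte, utf8 = n_utf8, no_bom = n_no_bom)
identical(no_bom, "Accident_Index")
```

The counts come out as 17, 17, 15 and 14, so nchar and identical are consistent with each other; what changes is the string they are given.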

@layik
Member Author

layik commented Jan 17, 2019

> Why would `identical(str, "Accident_Index", ignore.case = TRUE)`
> behave differently on Linux/MacOS vs Windows?

Because str is different from "Accident_Index" on Windows: it was
decoded from bytes to characters according to different rules when the
file was read.

Default encoding for files being read is specified by the 'encoding'
option. On both Windows and Linux I get:

> options('encoding')
$encoding
[1] "native.enc"

For which ?file says (in section "Encoding"):

>> ‘""’ and ‘"native.enc"’ both mean the ‘native’ encoding, that is the
>> internal encoding of the current locale and hence no translation is
>> done.

Linux version of R has a UTF-8 locale (AFAIK, macOS does too) and
decodes the files as UTF-8 by default:

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

locale:
 [1] LC_CTYPE=ru_RU.utf8       LC_NUMERIC=C             
 [3] LC_TIME=ru_RU.utf8        LC_COLLATE=ru_RU.utf8    
 [5] LC_MONETARY=ru_RU.utf8    LC_MESSAGES=ru_RU.utf8   
 [7] LC_PAPER=ru_RU.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=ru_RU.utf8 LC_IDENTIFICATION=C    

While on Windows R uses a single-byte encoding dependent on the locale:

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Russian_Russia.1251  LC_CTYPE=Russian_Russia.1251   
[3] LC_MONETARY=Russian_Russia.1251 LC_NUMERIC=C                   
[5] LC_TIME=Russian_Russia.1251    

> readLines('test.txt')[1]
[1] "п»їAccident_Index"
> nchar(readLines('test.txt')[1])
[1] 17

R on Windows can be explicitly told to decode the file as UTF-8:

> nchar(readLines(file('test.txt',encoding='UTF-8'))[1])
[1] 15

The first character of the string is the invisible byte order mark.
Thankfully, there is an easy fix for that, too. ?file additionally
says:

>> As from R 3.0.0 the encoding ‘"UTF-8-BOM"’ is accepted for
>> reading and will remove a Byte Order Mark if present (which it
>> often is for files and webpages generated by Microsoft applications).

So this is how we get the 14-character column name we'd wanted:

> nchar(readLines(file('test.txt',encoding='UTF-8-BOM'))[1])
[1] 14

For our original task, this means:

> names(read.csv('Acc.csv'))[1] # might produce incorrect results
[1] "п.їAccident_Index"
> names(read.csv('Acc.csv', fileEncoding='UTF-8-BOM'))[1] # correct
[1] "Accident_Index"

-- 
Best regards,
Ivan
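
For anyone who wants to try the suggested fix without downloading the data, here is a small self-contained sketch. The temporary file is a stand-in for Acc.csv; its second column and data row are made up, and only the leading BOM and the first column name mirror the real file.

```r
# Stand-in for Acc.csv: a tiny CSV written byte for byte with a leading UTF-8 BOM.
# The second column and the data row are made up for illustration.
path <- tempfile(fileext = ".csv")
csv_bytes <- c(as.raw(c(0xEF, 0xBB, 0xBF)),
               charToRaw("Accident_Index,example_col\nA001,1\n"))
writeBin(csv_bytes, path)

# Declaring the encoding makes the result independent of the session locale.
crashes <- read.csv(path, fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)
first_name <- names(crashes)[1]

nchar(first_name)
identical(first_name, "Accident_Index")
```

With fileEncoding = "UTF-8-BOM" the first name should be the plain 14-character "Accident_Index" regardless of platform.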

@Robinlovelace
Member

Link to conversation: http://r.789695.n4.nabble.com/Potential-R-bug-in-identical-td4754898.html

Interesting stuff. I imagine there's a reason why fileEncoding='UTF-8-BOM' isn't the default on Windows.

@mpadge
Member

mpadge commented Jan 18, 2019

Is this an endian issue? sf has lots of interesting code for dealing with different endians on different OS's. Or is it just UTF?

@Robinlovelace
Member

One for the R-help list, or maybe rOpenSci/RStudio discourse. Or maybe even stackoverflow!

@layik
Member Author

layik commented Jun 11, 2019

#96
