Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

origin in guess_dates #73

Closed
scottyaz opened this issue May 14, 2019 · 24 comments
Closed

origin in guess_dates #73

scottyaz opened this issue May 14, 2019 · 24 comments
Assignees
Labels
bug Something isn't working

Comments

@scottyaz
Copy link

It seems like the origin for dates from Excel is dependent on the OS with origin in windows being "1899-12-30" rather than "1900-01-01". Current version of guess_dates() assumes "1900-01-01" if modern excel flag is on.

@zkamvar
Copy link
Member

zkamvar commented May 14, 2019

Hi @scottyaz,

Thank you for the info! I completely forgot about that really frustrating curse, which is absolutely NOT documented in their help page on the issue: https://support.office.com/en-us/article/date-systems-in-excel-e7fe7167-48a9-4b96-bb53-5612a800b487

Can you give the output of R.version, so that I can see what version of windows you use?

R.version
#>                _                           
#> platform       x86_64-pc-linux-gnu         
#> arch           x86_64                      
#> os             linux-gnu                   
#> system         x86_64, linux-gnu           
#> status                                     
#> major          3                           
#> minor          6.0                         
#> year           2019                        
#> month          04                          
#> day            26                          
#> svn rev        76424                       
#> language       R                           
#> version.string R version 3.6.0 (2019-04-26)
#> nickname       Planting of a Tree

Created on 2019-05-14 by the reprex package (v0.2.1)

@zkamvar zkamvar self-assigned this May 14, 2019
@zkamvar zkamvar added the bug Something isn't working label May 14, 2019
@scottyaz
Copy link
Author

scottyaz commented May 14, 2019 via email

@scottyaz
Copy link
Author

I wonder if there is useful meta-data in an excel file (recognising that you expect a data.frame or tidy variant, not an excel file) that can be used to figure out the OS.

xmlInternalTreeParse(unzip('my_linelist.xlsx','xl/workbook.xml')) 

## or
xmlInternalTreeParse(unzip('my_linelist.xlsx','docProps/app.xml')) 

Doubt you want to go here but...

@zkamvar
Copy link
Member

zkamvar commented May 15, 2019

Hi @scottyaz, I definitely do not want to go down the route of attempting to peer into excel files themselves. I've proposed a fix (#74) by changing the modern_excel flag to excel and having three options:

(modern <- as.character(as.integer(as.Date("2019-05-15") - as.Date("1900-01-01"))))
#> [1] "43598"
(oldmac <- as.character(as.integer(as.Date("2019-05-15") - as.Date("1904-01-01"))))
#> [1] "42138"
(oldwin <- as.character(as.integer(as.Date("2019-05-15") - as.Date("1899-12-30"))))
#> [1] "43600"
guess_dates(modern)
#> [1] "2019-05-15"
guess_dates(oldmac, excel = "old-mac")
#> [1] "2019-05-15"
guess_dates(oldwin, excel = "old-win")
#> [1] "2019-05-15"

Created on 2019-05-15 by the reprex package (v0.2.1)

Ultimately, it will be up to the user to know which version of excel their data was created with and include some positive checks to make sure things are working smoothly (especially since these dates can be off by only a few days).

@zkamvar
Copy link
Member

zkamvar commented May 15, 2019

Let me know if you think this will work, and I'll merge it.

@scottyaz
Copy link
Author

That is totally reasonable!

@zkamvar
Copy link
Member

zkamvar commented May 15, 2019

Huh.... I think actually there are only two dates and 1900-01-01 ain't one of them 🤔

Both readxl and janitor use 1899-12-30: https://github.com/tidyverse/readxl/blob/15fde898887a8941bb3839892bf30fa9866290ee/src/utils.h#L7-L40, which has to do with the leap year bug, but it's absolutely not clear to me why it need 12-30 and not 12-31 😩 I'm going to change it back to modern_excel and just change the date to 1899-12-30

@scottyaz
Copy link
Author

documentation still says 1900-01-01...probably want to update.

@scottyaz
Copy link
Author

Also, unless I am missing something, still doesn't work properly when the 1899 origin is being used:

old <- as.character(as.integer(as.Date("2019-05-20") - as.Date("1899-12-30"))))
[1] "43605"
guess_dates(oldwin,modern_excel = TRUE) # obviously doesn't give correct answer
[1] "2019-05-22"
guess_dates(old,modern_excel = FALSE)
[1] "43605"
Warning message:
In guess_dates(old, modern_excel = FALSE) : 
The following 1 dates were not in the correct timeframe (1969-05-24 -- 2019-05-24):

  original  |  parsed    
  --------  |  ------    
  43605     |  2023-05-21

@zkamvar
Copy link
Member

zkamvar commented May 24, 2019

Also, unless I am missing something, still doesn't work properly when the 1899 origin is being used:

This is because it's a contrived example. As I explained in #73 (comment), both readxl and janitor use this system and they both have EXTENSIVE tests. It has something to do with the leap year bug that was introduced in the 90's.

I think I have an excel file lying around somewhere that needed some of this cleaning.... let me check

@scottyaz
Copy link
Author

I was trying this on real data and couldn't get the correct date and this is why I posted...See what you come up with.

@zkamvar
Copy link
Member

zkamvar commented May 24, 2019

I hate everything

@zkamvar
Copy link
Member

zkamvar commented May 24, 2019

Would you mind trying the janitor function and see if it gives you the same shitty result?

@zkamvar
Copy link
Member

zkamvar commented May 24, 2019

Also, check to make sure you have the latest version of linelist

@scottyaz
Copy link
Author

scottyaz commented May 24, 2019 via email

@zkamvar
Copy link
Member

zkamvar commented May 24, 2019

@zkamvar
Copy link
Member

zkamvar commented May 24, 2019

For clarity, here is how guess_dates works on a wild excel sheet:

x <- "https://osf.io/2yfre/download"
tmp <- tempdir()
f <- file.path(tmp, "test.xlsx")
download.file(x, f)
ex <- readxl::read_excel(f)
ex
#> # A tibble: 95 x 10
#>    MCG   `AP-GenoID` InventoryID `in-JRS-collect… `AP-Continent_C…
#>    <chr> <chr>       <chr>       <chr>            <chr>           
#>  1 A     143         143         TRUE             North America_U…
#>  2 K     202         202         TRUE             North America_U…
#>  3 K     264         264         TRUE             North America_U…
#>  4 H     265         265         TRUE             North America_U…
#>  5 A     266         266         TRUE             North America_U…
#>  6 I     267         267         TRUE             North America_U…
#>  7 J     268         268         TRUE             North America_U…
#>  8 A     276         276         TRUE             North America_U…
#>  9 K     289         289         TRUE             North America_U…
#> 10 C     293A        293         FALSE            South America_A…
#> # … with 85 more rows, and 5 more variables: `JRS-Isolate #` <chr>,
#> #   `JRS-Collection Date` <chr>, `JRS-Source (Host)` <chr>,
#> #   `JRS-Geographical Location` <chr>, `JRS-Notes` <chr>
ex[["JRS-Collection Date"]]
#>  [1] "1977"     "33117"    "34516"    "34547"    "34547"    "34578"   
#>  [7] "34578"    "35309"    "1996"     "1996"     "1996"     "1996"    
#> [13] "1996"     "1996"     "36373"    "36373"    "41175"    "41175"   
#> [19] "8/1/2009" "8/1/2009" "2012"     "2014"     "2014"     "2014"    
#> [25] "2014"     "2014"     "2012"     "2012"     "2012"     "2012"    
#> [31] "2012"     "2012"     "2012"     "2012"     "2012"     "2010"    
#> [37] "2010"     "2010"     "2010"     "2010"     "2010"     "2010"    
#> [43] "2010"     "2010"     "2010"     "2009"     "2009"     "2009"    
#> [49] "2009"     "2009"     "2009"     "2009"     "2010"     "2010"    
#> [55] "2010"     "2010"     "2010"     "2010"     "NA"       "2012"    
#> [61] "2012"     "2012"     "2012"     "2012"     "2012"     "2012"    
#> [67] "2012"     "2012"     "2012"     "2012"     "2012"     "2012"    
#> [73] "2012"     "2012"     "2012"     "2012"     "2012"     "2012"    
#> [79] "2012"     "2012"     "2012"     "2012"     "2012"     "2012"    
#> [85] "2012"     "2012"     "2012"     "2012"     "2012"     "2012"    
#> [91] "2012"     "2012"     "2012"     "2012"     "2012"
linelist::guess_dates(ex[["JRS-Collection Date"]], err = 1, orders = list("Y", "mdy"))
#>  [1] "1977-01-01" "1990-09-01" "1994-07-01" "1994-08-01" "1994-08-01"
#>  [6] "1994-09-01" "1994-09-01" "1996-09-01" "1996-01-01" "1996-01-01"
#> [11] "1996-01-01" "1996-01-01" "1996-01-01" "1996-01-01" "1999-08-01"
#> [16] "1999-08-01" "2012-09-23" "2012-09-23" "2009-08-01" "2009-08-01"
#> [21] "2012-01-01" "2014-01-01" "2014-01-01" "2014-01-01" "2014-01-01"
#> [26] "2014-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [31] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [36] "2010-01-01" "2010-01-01" "2010-01-01" "2010-01-01" "2010-01-01"
#> [41] "2010-01-01" "2010-01-01" "2010-01-01" "2010-01-01" "2010-01-01"
#> [46] "2009-01-01" "2009-01-01" "2009-01-01" "2009-01-01" "2009-01-01"
#> [51] "2009-01-01" "2009-01-01" "2010-01-01" "2010-01-01" "2010-01-01"
#> [56] "2010-01-01" "2010-01-01" "2010-01-01" NA           "2012-01-01"
#> [61] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [66] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [71] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [76] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [81] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [86] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"
#> [91] "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01"

Created on 2019-05-24 by the reprex package (v0.3.0)

@zkamvar
Copy link
Member

zkamvar commented May 24, 2019

If you compare this to the original sheet, you'll see that the dates have been rendered correctly: https://osf.io/2yfre/

@zkamvar
Copy link
Member

zkamvar commented May 24, 2019

Also, unless I am missing something, still doesn't work properly when the 1899 origin is being used:

old <- as.character(as.integer(as.Date("2019-05-20") - as.Date("1899-12-30"))))
[1] "43605"
guess_dates(oldwin,modern_excel = TRUE) # obviously doesn't give correct answer
[1] "2019-05-22"
guess_dates(old,modern_excel = FALSE)
[1] "43605"
Warning message:
In guess_dates(old, modern_excel = FALSE) : 
The following 1 dates were not in the correct timeframe (1969-05-24 -- 2019-05-24):

  original  |  parsed    
  --------  |  ------    
  43605     |  2023-05-21

I just noticed, the first iteration you show uses "oldwin" and not "old" as the input. When I try this, I get the following:

old <- as.character(as.integer(as.Date("2019-05-20") - as.Date("1899-12-30")))
linelist::guess_dates(old, modern_excel = TRUE)
#> [1] "2019-05-20"

Created on 2019-05-24 by the reprex package (v0.3.0)

It's an understandable mistake, but please use reprex in the future when making examples.

@scottyaz
Copy link
Author

scottyaz commented May 24, 2019 via email

@zkamvar
Copy link
Member

zkamvar commented May 24, 2019

Can you post the package description?

packageDescription("linelist")
#> Package: linelist
#> Title: Tools to Import and Tidy Case Linelist Data
#> Version: 0.0.33.9000
#> Authors@R: c(person("Thibaut", "Jombart", email =
#>        "thibautjombart@gmail.com", role = c("aut", "cre")),
#>        person("Zhian N.", "Kamvar", email = "zkamvar@gmail.com",
#>        role = c("aut")))
#> Description: A collection of wrappers for importing case linelist
#>        data from usual formats, and tools for cleaning data,
#>        detecting dates, and storing meta-information on the
#>        content of the resulting data frame.
#> Depends: R (>= 3.0.0)
#> License: MIT + file LICENSE
#> Encoding: UTF-8
#> LazyData: true
#> Suggests: outbreaks, incidence, testthat, knitr, roxygen2, covr,
#>        tibble, dplyr, magrittr
#> RoxygenNote: 6.1.1
#> Imports: lubridate, rio, epitrix, tidyselect, rlang, forcats,
#>        utils, stats, crayon
#> URL: https://github.com/reconhub/linelist
#> BugReports: https://github.com/reconhub/linelist/issues
#> VignetteBuilder: knitr
#> Roxygen: list(markdown = TRUE)
#> RemoteType: github
#> RemoteHost: api.github.com
#> RemoteRepo: linelist
#> RemoteUsername: reconhub
#> RemoteRef: master
#> RemoteSha: fda9e18b02f5853cd311ddcc513c427244b21dd7
#> GithubRepo: linelist
#> GithubUsername: reconhub
#> GithubRef: master
#> GithubSHA1: fda9e18b02f5853cd311ddcc513c427244b21dd7
#> NeedsCompilation: no
#> Packaged: 2019-05-24 13:12:12 UTC; zkamvar
#> Author: Thibaut Jombart [aut, cre], Zhian N. Kamvar [aut]
#> Maintainer: Thibaut Jombart <thibautjombart@gmail.com>
#> Built: R 3.6.0; ; 2019-05-24 13:12:13 UTC; unix
#> 
#> -- File: /home/zkamvar/Documents/RLibrary/linelist/Meta/package.rds

Created on 2019-05-24 by the reprex package (v0.3.0)

@scottyaz
Copy link
Author

scottyaz commented May 24, 2019 via email

@zkamvar
Copy link
Member

zkamvar commented May 24, 2019

The only think I can think of why there would be a discrepancy is that you may have an unclean R session. Can you restart R and try again?

@scottyaz
Copy link
Author

arggggg. i'll shut up now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants