New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: read_excel fails to read dates #134

Closed
MichaelChirico opened this Issue Oct 10, 2015 · 17 comments

Comments

Projects
None yet
8 participants
@MichaelChirico

MichaelChirico commented Oct 10, 2015

read_excel has failed to read dates on a simple excel file.

Here is a Google Drive link to a reproducible example.

Here is the code I run on that file:

read_excel("test.xlsx", sheet = 1, col_names = TRUE,
           col_types = c("text", "date", "date", "date"), na = "-")

With warnings:

Warning messages:
1: In read_xlsx_(path, sheet, col_names = col_names, col_types = col_types, ...:
[2, 2]: expecting date: got '42369'

And so on. Here is the output data.frame:

   ID EARLIEST YEAR OWED LATEST YEAR OWED EARLIEST PAYMENT
1   A               <NA>             <NA>             <NA>
2   B               <NA>             <NA>             <NA>
3   C               <NA>             <NA>             <NA>
4   D               <NA>             <NA>             <NA>
5   E               <NA>             <NA>             <NA>
6   F               <NA>             <NA>             <NA>
7   G               <NA>             <NA>             <NA>
8   H               <NA>             <NA>             <NA>
9   I               <NA>             <NA>             <NA>
10  J               <NA>             <NA>             <NA>

(i.e., read_excel has failed miserably).

xlsx::read.xlsx2 works perfectly fine:

library(xlsx)
read.xlsx2("~/Desktop/test.xlsx",sheetIndex=1,
           colClasses=c("character","Date","Date","Date"))

   ID EARLIEST.YEAR.OWED LATEST.YEAR.OWED EARLIEST.PAYMENT
1   A         2015-12-31       2015-12-31       2015-07-15
2   B         2015-12-31       2015-12-31             <NA>
3   C         2015-12-31       2015-12-31             <NA>
4   D         2015-12-31       2015-12-31             <NA>
5   E         2015-12-31       2015-12-31             <NA>
6   F         2015-12-31       2015-12-31             <NA>
7   G         2015-12-31       2015-12-31             <NA>
8   H         2015-12-31       2015-12-31             <NA>
9   I         2015-12-31       2015-12-31             <NA>
10  J               <NA>             <NA>       2015-06-25

Here is my session info in case it's platform-specific:

R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readxl_0.1.0

loaded via a namespace (and not attached):
[1] tools_3.2.2 Rcpp_0.12.1

@MichaelChirico MichaelChirico changed the title from Bug: error reading dates to Bug: read_excel fails to read dates Oct 10, 2015

@shrektan

This comment has been minimized.

shrektan commented Oct 11, 2015

You would get the right data if you change the format of the date column to Date, in Excel.

But I think xlsx::read.xlsx2 has a better behavior that is having more respect on col_types settings.

@MichaelChirico

This comment has been minimized.

MichaelChirico commented Oct 11, 2015

I prefer not to do anything to my raw Excel files, otherwise I'd just save it as a .csv of course :)

The question is what read.xlsx2 is doing that read_excel can't seem to handle :\

@shrektan

This comment has been minimized.

shrektan commented Oct 11, 2015

@MichaelChirico Yes, I have to agree with you. +1

@MichaelChirico

This comment has been minimized.

MichaelChirico commented Oct 11, 2015

OK, so here's as far as I've gotten with this.

It may have to do with cacheDateStyles in the XlsxWorkBook class definition.

I note that in the file above that doesn't work, I have the following under the numFmts tag in xl/styles.xml:

<numFmts count="3">
    <numFmt numFmtId="164" formatCode="GENERAL"/>
    <numFmt numFmtId="165" formatCode="\$#,##0"/>
    <numFmt numFmtId="166" formatCode="M/D/YYYY"/>
</numFmts>

I compare this to the same .xml file for an .xlsx file for which read_excel detects the date successfully:

<numFmts count="3">
    <numFmt numFmtId="43" formatCode="_(* #,##0.00_);_(* \(#,##0.00\);_(* &quot;-&quot;??_);_(@_)"/>
    <numFmt numFmtId="166" formatCode="&quot;$&quot;#,##0"/>
    <numFmt numFmtId="168" formatCode="_(* #,##0_);_(* \(#,##0\);_(* &quot;-&quot;??_);_(@_)"/>
</numFmts>

They're clearly quite different; I've personally no idea how to parse the latter, and it seems from the isDateFormat code that indeed the former should have been detected as a date.

(I believe that cacheDateStyles is essential for the type method for XlsxCell objects to detect a date; if this returns a string, we get the warning described above)

@richardsc

This comment has been minimized.

richardsc commented Oct 20, 2015

I'm having this same issue with the current CRAN version of readxl. I can provide a sample file if necessary, but it's not public so I'd rather not put it in the cloud. Can confirm that the column is set to Date in the xlsx file, but I'm getting the following:

> str(log)
Classes 'tbl_df', 'tbl' and 'data.frame':   69 obs. of  20 variables:
 $ Cast                : num  1 2 3 4 5 6 7 8 9 10 ...
 $ Glacier             : chr  NA NA "Verhoeff" "Verhoeff" ...
 $ ID                  : chr  "B" "B0" "V4" "V3" ...
 $ Date                : num  42224 42224 42225 42225 42225 ...
@shrektan

This comment has been minimized.

shrektan commented Oct 20, 2015

Please try the Devon version I think it's improved a lot.

发自我的 iPhone

在 2015年10月21日,03:49,Clark Richards notifications@github.com 写道:

I'm having this same issue with the current CRAN version of readxl. I can provide a sample file if necessary, but it's not public so I'd rather not put it in the cloud. Can confirm that the column is set to Date in the xlsx file, but I'm getting the following:

str(log)
Classes 'tbl_df', 'tbl' and 'data.frame': 69 obs. of 20 variables:
$ Cast : num 1 2 3 4 5 6 7 8 9 10 ...
$ Glacier : chr NA NA "Verhoeff" "Verhoeff" ...
$ ID : chr "B" "B0" "V4" "V3" ...
$ Date : num 42224 42224 42225 42225 42225 ...

Reply to this email directly or view it on GitHub.

@f34739

This comment has been minimized.

f34739 commented Apr 3, 2016

Any progress on this? The most recent dev version still displays the exact same buggy behavior.

@jhn316

This comment has been minimized.

jhn316 commented Oct 20, 2016

Still unresolved, I am switching to xlsx::read.xlsx2.

@MichaelChirico

This comment has been minimized.

MichaelChirico commented Oct 20, 2016

@richardsc can you paste the numFmts field in the unzipped version of your file? See my post above. This could help establish a common thread.

@MichaelChirico

This comment has been minimized.

MichaelChirico commented Oct 21, 2016

Also for the record, the comment in CellType.h should note the specific edition / part to which the URL is referring. Currently there are about 20 different possibilities at the URL noted, and the first I tried got me to the incorrect part of the pdf.

@MichaelChirico

This comment has been minimized.

MichaelChirico commented Oct 21, 2016

@richardsc @f34739 @jhn316 I believe I may have found the source of the issue.

Could any of you try reading your troublesome file with my patch and report whether it worked? It works on this file for me.

Should be able to install with:

devtools::install_github("MichaelChirico/readxl@date_fmts")
@ngnikhilgoyal

This comment has been minimized.

ngnikhilgoyal commented Feb 8, 2017

@MichaelChirico don't specify the col type for the date column. Let readxl read it as an integer. Then just try:

as.Date(df$Date, origin = "1899-12-30")
@MichaelChirico

This comment has been minimized.

MichaelChirico commented Feb 8, 2017

@ngnikhilgoyal yes, that is the workaround, but it's not a fix.

origin dates may be hard to determine automatically -- it should be readxl's job to do that for the user

@jennybc

This comment has been minimized.

Member

jennybc commented Feb 9, 2017

@MichaelChirico does the current development version read the file in question correctly now? This issue has been closed because a PR that (claimed to) fix this was merged. If not, we'll re-open it.

@MichaelChirico

This comment has been minimized.

MichaelChirico commented Feb 9, 2017

Yes, it does, thanks for checking! Sorry for the confusion.

@N1h1l1sT

This comment has been minimized.

N1h1l1sT commented Nov 25, 2018

I'm having the same problem. readxl version ‘1.1.0’.
I get an integer returned (instead of the date that I should), and a warning:

50: In read_fun(path = path, sheet_i = sheet, limits = limits, ... :
Expecting numeric in CS1 / R1C97: got a date

I can provide a snippet to demonstrate it if you want me to, but since this is the exact same problem as described on this thread, my initial thought is it's redundant.

@jennybc

This comment has been minimized.

Member

jennybc commented Nov 26, 2018

@N1h1l1sT yes since this has been closed for 1.5 years, it would be better to open a new issue with a reproducible example

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment