Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pandas.ExcelFile ignore parse_dates=False #10001

Closed
a59 opened this issue Apr 27, 2015 · 11 comments
Closed

pandas.ExcelFile ignore parse_dates=False #10001

a59 opened this issue Apr 27, 2015 · 11 comments
Labels
Bug IO Excel read_excel, to_excel
Milestone

Comments

@a59
Copy link

a59 commented Apr 27, 2015

I am trying to read an Excel file which someone else created and the wrongly formatted a column as "date" when it is not. It has a large integer in it, which triggers an error

OverflowError: normalized days too large to fit in a C int

But I have "parse_dates=False" so I thought pandas.ExcelFile would not try to parse the dates and return a string instead. Is this a bug?

@jreback
Copy link
Contributor

jreback commented Apr 30, 2015

you would have to show an example and pd.show_versions()

@jreback jreback added the IO Excel read_excel, to_excel label Apr 30, 2015
@a59
Copy link
Author

a59 commented May 3, 2015

In Excel (2013 Windows 7), I created a new Workbook. In Sheet1, I entered A in A1 and 10000000 in B1. I then formatted B1 to be a Short Date which displays the cell as #################. I saved the files as 'test.xlsx'.

I then ran the following python code

import pandas as pd
pd.show_versions()
xl_file = pd.ExcelFile('test.xlsx')
sheet = xl_file.parse('Sheet1',parse_dates=False)

which gives me the following output

INSTALLED VERSIONS

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 14.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.0
nose: 1.3.6
Cython: 0.22
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.1.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.2
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 2.0.2
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.2
lxml: 3.4.3
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.1
pymysql: None
psycopg2: None
Traceback (most recent call last):
File "test.py", line 5, in
sheet = xl_file.parse('Sheet1',parse_dates=False)
File "/Users/myname/anaconda/lib/python2.7/site-packages/pandas/io/excel.py", line 287, in parse
**kwds)
File "/Users/myname/anaconda/lib/python2.7/site-packages/pandas/io/excel.py", line 418, in _parse_excel
row.append(_parse_cell(value,typ))
File "/Users/myname/anaconda/lib/python2.7/site-packages/pandas/io/excel.py", line 342, in _parse_cell
epoch1904)
File "/Users/myname/anaconda/lib/python2.7/site-packages/xlrd/xldate.py", line 130, in xldate_as_datetime
return epoch + datetime.timedelta(days, seconds, 0, milliseconds)
OverflowError: date value out of range

@jreback jreback added the Bug label May 4, 2015
@jreback jreback added this to the Next Major Release milestone May 4, 2015
@jreback
Copy link
Contributor

jreback commented May 4, 2015

parse_dates is not implemented. But I think could be in the _should_parse function. pull-requests are welcome.

@kwantopia
Copy link

Hi, I am adding this patch here, in case it's useful for those who do not want to parse dates from excel file by setting parse_dates=False. Please review. I had trouble parsing the following excel file from Crunchbase Excel Export which had really old dates which gave OverflowError.

From aae19c65e2a4b3a965f91bdffa5bd4595b0b7d7b Mon Sep 17 00:00:00 2001
From: Kwan Lee <kwan@openviewpartners.com>
Date: Tue, 29 Mar 2016 14:52:03 +0300
Subject: [PATCH] parse_dates=False in read_excel should prevent from parsing
 dates.  It was throwing overflowerror when the dates were too old. update
 #10001

---
 pandas/io/excel.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/pandas/io/excel.py b/pandas/io/excel.py
index 5656c36..b4b8996 100644
--- a/pandas/io/excel.py
+++ b/pandas/io/excel.py
@@ -321,11 +321,11 @@ class ExcelFile(object):

         epoch1904 = self.book.datemode

-        def _parse_cell(cell_contents, cell_typ):
+        def _parse_cell(cell_contents, cell_typ, parse_dates=True):
             """converts the contents of the cell into a pandas
                appropriate object"""

-            if cell_typ == XL_CELL_DATE:
+            if cell_typ == XL_CELL_DATE and parse_dates:
                 if xlrd_0_9_3:
                     # Use the newer xlrd datetime handling.
                     cell_contents = xldate.xldate_as_datetime(cell_contents,
@@ -405,7 +405,7 @@ class ExcelFile(object):
                         should_parse[j] = self._should_parse(j, parse_cols)

                     if parse_cols is None or should_parse[j]:
-                        row.append(_parse_cell(value, typ))
+                        row.append(_parse_cell(value, typ, parse_dates))
                 data.append(row)

             if sheet.nrows == 0:
-- 
2.7.2

@TomAugspurger
Copy link
Contributor

@kwantopia It'll be easier to review that if you put it up as a pull request. Then we can comment inline.

What's the desired behavior here? read_csv seems to silently "fail" to parse the columns that can't be represented as datetime64s.

In [5]: !cat foo.csv
date,val
1500-01-01,1
1600-01-02,2
1700-01-01,3
1800-01-01,4
1900-01-01,5
2000-01-01,6

In [1]: pd.read_csv('foo.csv', parse_dates='date')
Out[1]:
         date  val
0  1500-01-01    1
1  1600-01-02    2
2  1700-01-01    3
3  1800-01-01    4
4  1900-01-01    5
5  2000-01-01    6

@jreback
Copy link
Contributor

jreback commented Mar 29, 2016

actually maybe @jorisvandenbossche can comment here. IIRC the parse_dates kw only matters if in excel its NOT a date already (and in your case these are out-of-range, so they are strings and will be parsed to object even in pandas).

@jreback jreback removed this from the Next Major Release milestone Mar 29, 2016
@jreback
Copy link
Contributor

jreback commented Mar 29, 2016

this is related to the issue in #11544 and looks to be a dupe of these (there is a somewhat convoluted chain as to what the original issues actually though). maybe someone can figure this chain out and we can create a master issue so its more clear.

@kwantopia
Copy link

@TomAugspurger it's a problem in read_excel, but I guess I was also misunderstanding parse_dates field. I was assuming that parse_dates=True means parse the dates and parse_dates=False means do not parse the dates for pandas.read_excel

@jreback jreback added this to the Next Major Release milestone Apr 30, 2016
@jreback jreback modified the milestones: 0.18.2, Next Major Release May 6, 2016
@jreback jreback closed this as completed in 1296ab3 May 6, 2016
@ffranchina
Copy link

Hello, sorry for writing here but it seems a quite common wish the ability of disabling date parsing from the function pd.read_excel(). Yet another stackoverflow question without a proper answer.
As @kwantopia underlined, I also would expect that setting the flag parse_dates=False could help me to achieve the desired result but, instead, seems to be ignored or to work in a different (and hence obscure) way.
Can anyone tell me which is the situation of this issue and if actually it's an issue? And, in case everything it's regular, can I have a definitive answer that can allow me to read an .xls without having the dates parsed?
If there's need for changing or fixing some code, I'm very willing and ready to help, I'd just like to be pointed in the right direction.
Thank you in advance!

@maratbakiev2
Copy link

Hi guys, I've also run into this issue. One workaround for this (given that the name of columns that you don't want to convert or their column positions are fixed) is that you can provide dictionary of names/column numbers as keys and desired type as value. So, if you set "str" as a value, columns in dataframe will have "object" type and won't get parsed.
pd.read_excel(**you_kwargs, dtype={"col_name1": "str", "col_name2": "str"})
btw, you may provide desired column types only for some columns, it's not required for all of them

@maratbakiev2
Copy link

@a59 @ffranchina

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants