Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: pd.read_csv not correctly parsing date/month field when set parse_date = ['column name'] #13063

Closed
itsmesaranya opened this issue May 3, 2016 · 1 comment

Comments

@itsmesaranya
Copy link

I ran in to this bug while trying to parse the few dates through parse_dates of pandas.read_csv(). In the following code snippet, I'm trying to parse dates that have format dd/mm/yy which is resulting me an improper conversion. For some cases, the date field is considered as month and vice versa.

To keep it simple, for some cases dd/mm/yy get converted to yyyy-dd-mm instead of yyyy-mm-dd.

Case 1:

      04/10/96 is parsed as 1996-04-10, which is wrong.

Case 2:

      15/07/97 is parsed as 1997-07-15, which is correct.

Case 3:

      10/12/97 is parsed as 1997-10-12, which is wrong.

Code Sample

import pandas as pd

df = pd.read_csv('date_time.csv') 
print 'Data in csv:'
print df
print df['start_date'].dtypes

print '----------------------------------------------'

df = pd.read_csv('date_time.csv', parse_dates = ['start_date'])
print 'Data after parsing:'
print df
print df['start_date'].dtypes

Current Output

----------------------
Data in csv:
----------------------
  start_date
0   04/10/96
1   15/07/97
2   10/12/97
3   06/03/99
4     //1994
5   /02/1967
object
----------------------
Data after parsing:
----------------------
  start_date
0 1996-04-10
1 1997-07-15
2 1997-10-12
3 1999-06-03
4 1994-01-01
5 1967-02-01
datetime64[ns]

Expected Output

----------------------
Data in csv:
----------------------
  start_date
0   04/10/96
1   15/07/97
2   10/12/97
3   06/03/99
4     //1994
5   /02/1967
object
----------------------
Data after parsing:
----------------------
  start_date

0 1996-10-04
1 1997-07-15
2 1997-12-10
3 1999-03-06
4 1994-01-01
5 1967-02-01
datetime64[ns]

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 8.1
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US

pandas: 0.17.1
nose: 1.3.7
pip: 7.1.2
setuptools: 18.5
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: None
IPython: 4.0.1
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.5.0
openpyxl: 2.2.6
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.7.7
lxml: 3.4.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: None
Jinja2: None

More Comments:

I could use date_parser or pandas.to_datetime() to specify the proper format for date. But in my case, I have few date fields like ['//1997', '/02/1967'] for which I need to convert ['01/01/1997','01/02/1967']. The parse_dates helps me converting those type of date fields to the expected format without making me to write extra line of code.

parse_date_bug.zip

@itsmesaranya itsmesaranya changed the title read_csv not correctly parsing date/month field when set parse_date = ['column name'] pd.read_csv not correctly parsing date/month field when set parse_date = ['column name'] May 3, 2016
@itsmesaranya itsmesaranya changed the title pd.read_csv not correctly parsing date/month field when set parse_date = ['column name'] BUG: pd.read_csv not correctly parsing date/month field when set parse_date = ['column name'] May 3, 2016
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented May 3, 2016

I suppose this related to the SO question: http://stackoverflow.com/questions/36998540/pd-read-csv-not-correctly-parsing-date-month-field-when-set-parse-date-colum

As was commented over there by ajcr, this is not a bug. The date format you are using ("dd/mm/yy") is not an international standard format (but a European one), and the default of the parser is to try to parse such dates as "mm/dd/yy" (more an American format I think). However, if this gives an error, it will try the other format. This is the reason you get inconsistent results between the rows in the csv.

You can pass dayfirst=True to to_datetime, and then the preference will be given to "dd/mm/yy" instead of to "mm/dd/yy".

Typically you can also use format="%d/%m/%y", however (as you note) this will not work with the "//1994" dates (since this is another format).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants