BUG/ENH: to_datetime dayfirst ought to be strict #3341

Open
hayd opened this Issue Apr 13, 2013 · 12 comments

5 participants

@hayd
Python for Data member

xref #12585

I think this ought to raise a ValueErrror:

In [1]: pd.to_datetime('01-13-2012', dayfirst=True)
Out[1]: datetime.datetime(2012, 1, 13, 0, 0)

This means if my data is bad (i.e. somehow I have a mixture of american and british string-dates), I won't be told there was an issue when parsing! I can't think of a usecase where this kind of precedence would be appropriate, and it means you'll end up with incorrect dates.

The precedence behaviour appears to be a feature from dateutil (which tslib.array_to_datetime calls):

In [2]: from dateutil.parser import parse as parse_date

In [3]: parse_date('1/12/2012', dayfirst=True)
Out[3]: datetime.datetime(2012, 12, 1, 0, 0)

In [4]: parse_date('1/13/2012', dayfirst=True)
Out[4]: datetime.datetime(2012, 1, 13, 0, 0)

It should raising like this:

In [5]: parse_date('14/14/2012', dayfirst=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-ca14fc16e421> in <module>()
----> 1 parse_date('14/14/2012', dayfirst=True)

/Library/Python/2.7/site-packages/dateutil/parser.pyc in parse(timestr, parserinfo, **kwargs)
    718         return parser(parserinfo).parse(timestr, **kwargs)
    719     else:
--> 720         return DEFAULTPARSER.parse(timestr, **kwargs)
    721
    722

/Library/Python/2.7/site-packages/dateutil/parser.pyc in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    315             if value is not None:
    316                 repl[attr] = value
--> 317         ret = default.replace(**repl)
    318         if res.weekday is not None and not res.day:
    319             ret = ret+relativedelta.relativedelta(weekday=res.weekday)

ValueError: month must be in 1..12

... :(

@hayd
Python for Data member

Another usecase, is if you call to_datetime (without setting dayfirst) on a list of british date-strings, atm it parses without error but does so "incorrectly" (using a mixture of dayfirst and not, depending). Which is confusing.

@y-p

I'm shocked that dateutil does that.

@hayd
Python for Data member

cc @jreback.

@jreback

I think we ought to drop this parameter in favor of just using format; dateutil just does the wrong thing. as an alternative, could just have dayfirst imply format="%d/%m/%Y"....or some such (but not nearly as flexible)

until/unless we actually do more date parsing internally

thoughts?

@hayd
Python for Data member

but but but dayfirst is the best argument ever, it's very very useful for munging (it's great if you don't know the format or worse the format is mixed (!)).

It being not strict is sad, but not having it at all would be much worse. please please can we keep it!

I've looked before where in their code it's not being strict, but it's pretty messy and not sure how actively maintained it is...

I'm sure @wesm suggested another (C?) library fairly recently but can't find the issue.

@jreback

how about a big warning in the docstring/docs? that you really ought to specify format instead?

I see the warning you have, but EVEN bigger ? (and then in docstrings too), http://pandas.pydata.org/pandas-docs/dev/timeseries.html#converting-to-timestamps

@hayd
Python for Data member

+1 on that, definitely users should know what they are letting themselves in for.

@jreback

alright...so let's just make this a doc enhancement for now?

@hayd
Python for Data member

How's about (pr coming):

dayfirst : boolean, default False
    If True parses dates with the day first, eg 20/01/2005
    Warning: dayfirst=True is not strict, but will prefer to parse
    with day first (this is a known bug).
@janschulz

Switching from #11725, where this happened, e.g. daysfirst=False would parse month as days

>>> pd.to_datetime(["29.01.1945","1.3.1945", "02.03.1945"])
DatetimeIndex(['1945-01-29', '1945-01-03', '1945-02-03'], dtype='datetime64[ns]', freq=None)

right now, the doc only describes the problem for the other way:

Warning: dayfirst=True is not strict, but will prefer to parse with day first (this is a known bug, based on dateutil behavior).

As dayfirst is False, this makes no mention that the other way can also happen

@jorisvandenbossche

@JanSchulz Indeed .. Maybe also worth mentioning that if you want strict behaviour, you can use format=... if possible?

@jorisvandenbossche

I opened an issue for this at the dateutil tracker, as I couldn't find one: dateutil/dateutil#214

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment