Improved datetime format inference #543
Thanks for the PR, a few tweaks to merge and it's good for me!
dirty_cat/_utils.py
Outdated
Maybe a remark on the `_infer_date_format` function: since it's only used in the `TableVectorizer`, I think it should be moved over to its file.
Messed up something in my previous review, this is a fix
LGTM, thanks again!
Fix #540
The bug in #540 was related to the pandas 2.0 update: as datetime format inference became stricter, a bug that had been silent (an inconsistent date format inferred for one column in the test) became loud.
This PR fixes this and improves date format detection (and thus datetime column detection in `TableVectorizer`). This replaces using `pd.to_datetime` directly for datetime format inference.
To do this, we try pandas's `guess_datetime_format` (on a subset) with `dayfirst` being both `True` and `False`, and see if one of these options finds a single format for all rows. If both work, we return the month-first format and raise a warning. If both return multiple formats, we give up.
This finds the right %d-%m-%Y format for the failing test example, instead of failing or returning a mixed format (see below).
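The decision logic described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `toy_guess_format` is a hypothetical stand-in for pandas's `guess_datetime_format` that only knows two dash-separated formats, and `infer_date_format` is an illustrative name. Warning only when the two candidate formats differ is an assumption of the sketch.

```python
import warnings
from datetime import datetime


def toy_guess_format(value, dayfirst):
    """Hypothetical stand-in for pandas's guess_datetime_format.

    Tries the preferred order first, then the other one, and returns
    the first format that parses the value (or None).
    """
    candidates = ["%m-%d-%Y", "%d-%m-%Y"]
    if dayfirst:
        candidates.reverse()
    for fmt in candidates:
        try:
            datetime.strptime(value, fmt)
            return fmt
        except ValueError:
            continue
    return None


def infer_date_format(values, guess=toy_guess_format):
    """Return a single format shared by all values, or None."""
    # Guess a format for every row, once per dayfirst setting.
    formats = {
        dayfirst: {guess(value, dayfirst) for value in values}
        for dayfirst in (False, True)
    }

    def single(found):
        # A setting "works" if it yields exactly one non-None format.
        if len(found) == 1 and None not in found:
            return next(iter(found))
        return None

    month_first = single(formats[False])
    day_first = single(formats[True])
    if month_first and day_first:
        # Assumption of this sketch: warn only if the two results differ.
        if month_first != day_first:
            warnings.warn(
                f"Both {month_first} and {day_first} fit; using {month_first}."
            )
        return month_first
    # Only one setting worked, or neither did (returns None).
    return month_first or day_first


# One row (13-02-2000) can only be day-first, so day-first wins.
inferred = infer_date_format(["01-02-2000", "13-02-2000", "03-04-2000"])
```

With the rows above, the month-first pass yields two different formats while the day-first pass yields a single one, so the day-first format is returned.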
`pd.to_datetime` format inference
If I understand correctly, pandas datetime format inference (in `pd.to_datetime`) works like this:
Versions < 2.0
- If not `infer_datetime_format`, the format of each row is inferred independently.
- If `infer_datetime_format`, the format is inferred from the first non-missing example, and pandas tries to apply it to the other examples (but can use another format otherwise).

This can easily create issues, for instance in our tests:
For all rows, the inferred format was %m-%d-%Y, except for 13-02-2000, for which the format was %d-%m-%Y.
Versions >= 2.0
The format is inferred from the first non-missing example, and pandas tries to apply it to the other examples (and raises an error otherwise).
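To make the difference between the two behaviors concrete, here is a small sketch that mimics them with `datetime.strptime` only. It is not pandas's implementation: `parse_lenient` and `parse_strict` are hypothetical helpers, and the first-row format is passed in by hand rather than inferred.

```python
from datetime import datetime


def parse_lenient(values, first_format, fallback_formats):
    """Mimics the < 2.0 behavior: prefer first_format, fall back per row."""
    parsed = []
    for value in values:
        for fmt in (first_format, *fallback_formats):
            try:
                parsed.append(datetime.strptime(value, fmt))
                break
            except ValueError:
                continue
        else:
            raise ValueError(f"no format matches {value!r}")
    return parsed


def parse_strict(values, first_format):
    """Mimics the >= 2.0 behavior: every row must match first_format."""
    return [datetime.strptime(value, first_format) for value in values]


rows = ["01-02-2000", "13-02-2000"]

# Lenient: the first row is read month-first, the second silently day-first.
lenient = parse_lenient(rows, "%m-%d-%Y", ["%d-%m-%Y"])

# Strict: the second row does not fit the first row's format, so it raises.
try:
    parse_strict(rows, "%m-%d-%Y")
    strict_error = None
except ValueError as exc:
    strict_error = exc
```

The lenient path returns a column whose rows were read with two different conventions, which is the silent inconsistency mentioned above; the strict path surfaces it as an error.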