Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved datetime format inference #543

Merged
merged 8 commits into from
Apr 6, 2023

Conversation

LeoGrin
Copy link
Contributor

@LeoGrin LeoGrin commented Apr 5, 2023

Fix #540
The bug in #540 was related to pandas 2.0 update: as datetime format inference became stricter, a bug which was silent (inconsistent date format inferred for one column in the test) became loud.

This PR fixes this and improves date format detection (and thus datetime column detection in TableVectorizer). This replaces using pd.datetime directly for datetime format inference.

To do this, we try pandas's guess_datetime_format (on a subset) both with dayfirst being True and False, and see if one of these options finds a single format for all rows. If both work, we return the monthfirst format and raise a warning. If both return multiple formats, we give up.

This finds the right %d-%m-%Y format for the failing test example, instead of failing or returning mixed format (see below).

  "11-12-2029",
  "02-12-2012",
  "11-09-2012",
  "13-02-2000",
  "10-11-2001"

pd.to_datetime format inference

If I understand correctly, pandas datetime format inference (in pd.to_datetime) works like this:

Versions < 2.0

If not infer_datetime_format, the format of each row is inferred independently.
If infer_datetime_format, the format is inferred from the first non-missing example, and pandas tries to apply it to the other example (but can use another format otherwise).
This can easily create issues, for instance in our tests:

    "11-12-2029",
    "02-12-2012",
    "11-09-2012",
    "13-02-2000",
    "10-11-2001"

For all rows, the inferred format was %m-%d-%Y expect 13-02-2000, for which the format was %d-%m-%Y.

Version >= 2.0

The format is inferred from the first non-missing example, and pandas tries to apply it to the other example (and raise an error otherwise).

Copy link
Member

@LilianBoulard LilianBoulard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, a few tweaks to merge and it's good for me!

dirty_cat/_utils.py Outdated Show resolved Hide resolved
dirty_cat/_utils.py Outdated Show resolved Hide resolved
dirty_cat/_utils.py Outdated Show resolved Hide resolved
dirty_cat/_utils.py Outdated Show resolved Hide resolved
dirty_cat/_utils.py Outdated Show resolved Hide resolved
dirty_cat/_utils.py Outdated Show resolved Hide resolved
dirty_cat/_utils.py Outdated Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a remark on the _infer_date_format function: since it's only used in the TableVectorizer, I think it should be moved over to its file.

Copy link
Member

@LilianBoulard LilianBoulard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Messed up something in my previous review, this is a fix

dirty_cat/_utils.py Outdated Show resolved Hide resolved
Copy link
Member

@LilianBoulard LilianBoulard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks again!

@LilianBoulard LilianBoulard merged commit 6b9cb63 into skrub-data:main Apr 6, 2023
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TableVectorizer test failing with latest dependencies
2 participants