Improved datetime format inference #543

LeoGrin · 2023-04-05T21:02:31Z

Fix #540
The bug in #540 was related to the pandas 2.0 update: as datetime format inference became stricter, a bug that had been silent (an inconsistent date format inferred for one column in the test) became loud.

This PR fixes this and improves date format detection (and thus datetime column detection in TableVectorizer). It replaces the direct use of pd.to_datetime for datetime format inference.

To do this, we try pandas's guess_datetime_format (on a subset) both with dayfirst being True and False, and see if one of these options finds a single format for all rows. If both work, we return the monthfirst format and raise a warning. If both return multiple formats, we give up.

This finds the right %d-%m-%Y format for the failing test example, instead of failing or returning a mixed format (see below).

  "11-12-2029",
  "02-12-2012",
  "11-09-2012",
  "13-02-2000",
  "10-11-2001"

`pd.to_datetime` format inference

If I understand correctly, pandas datetime format inference (in pd.to_datetime) works like this:

Versions < 2.0

If infer_datetime_format is not set, the format of each row is inferred independently.
If infer_datetime_format is set, the format is inferred from the first non-missing example, and pandas tries to apply it to the other examples (but can fall back to another format otherwise).
This can easily create issues, for instance in our tests:

    "11-12-2029",
    "02-12-2012",
    "11-09-2012",
    "13-02-2000",
    "10-11-2001"

For all rows, the inferred format was %m-%d-%Y, except for 13-02-2000, for which the format was %d-%m-%Y.
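The inconsistency can be checked without pandas. The sketch below uses only the standard library's `datetime.strptime` (a stand-in for illustration, not what pandas does internally) to list which rows fail under each candidate format; `parses` is a hypothetical helper.

```python
from datetime import datetime

rows = ["11-12-2029", "02-12-2012", "11-09-2012", "13-02-2000", "10-11-2001"]


def parses(value, fmt):
    """Return True if `value` matches `fmt` exactly."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


for fmt in ("%m-%d-%Y", "%d-%m-%Y"):
    failing = [r for r in rows if not parses(r, fmt)]
    print(fmt, "fails on", failing)
```

Only the day-first format is valid for every row, since 13 cannot be a month.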

Versions >= 2.0

The format is inferred from the first non-missing example, and pandas tries to apply it to the other examples (and raises an error otherwise).
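A rough illustration of this strict behaviour, under one assumption: the format is passed explicitly here as a stand-in for the one pandas would infer from the first row, so that the sketch does not depend on the installed pandas version.

```python
import pandas as pd

rows = ["11-12-2029", "02-12-2012", "11-09-2012", "13-02-2000", "10-11-2001"]

# Month-first is what the first row suggests; "13-02-2000" cannot match it.
try:
    pd.to_datetime(rows, format="%m-%d-%Y")
    strict_parse_failed = False
except ValueError:
    strict_parse_failed = True

# The day-first format fits every row.
parsed = pd.to_datetime(rows, format="%d-%m-%Y")
print(strict_parse_failed, parsed[3])
```

With a single format applied to all rows, the mismatch surfaces as an error instead of a silently mixed column.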

LilianBoulard

Thanks for the PR, a few tweaks to merge and it's good for me!

LilianBoulard · 2023-04-06T08:42:59Z

dirty_cat/_utils.py

Maybe a remark on the _infer_date_format function: since it's only used in the TableVectorizer, I think it should be moved over to its file.

LilianBoulard

Messed up something in my previous review, this is a fix

dirty_cat/_utils.py

LilianBoulard

LGTM, thanks again!

LeoGrin added 4 commits April 5, 2023 20:01

Change datetime format inference and datetime column detection

d056756

Cleaning

0c49ee6

Simplify _infer_date_format logic

2d03823

Add pr number

a65800a

LilianBoulard reviewed Apr 6, 2023

Apply suggestions from code review

4489383

LilianBoulard reviewed Apr 6, 2023

Update dirty_cat/_utils.py

ef2315d

LilianBoulard mentioned this pull request Apr 6, 2023

DOC simplify print_worst_matches for 04_fuzzy_joining_and_FeatureAugmenter #535

Merged

LeoGrin added 2 commits April 6, 2023 11:14

Move _infer_date_format to _table_vectorizer

c094c33

Remove warnings when monthfirst and dayfirst find the same format

c21cd6e

LilianBoulard approved these changes Apr 6, 2023

LilianBoulard merged commit 6b9cb63 into skrub-data:main Apr 6, 2023

LilianBoulard mentioned this pull request Apr 6, 2023

Example 3 does not render #544

Closed
