Datetime encoder #239

LeoGrin · 2022-02-19T17:41:28Z

Creates a new encoder which transforms datetime columns into several numerical features (year, month, day...). Solves the second part of #233.


LilianBoulard

Here's a bunch of small adjustments, but overall this is very good! Thank you :)

dirty_cat/datetime_encoder.py

CHANGES.rst

dirty_cat/datetime_encoder.py


LeoGrin · 2022-03-02T17:05:42Z

Thank you for the review, @LilianBoulard!

GaelVaroquaux

Nice!

I left a few comments, but none of them are major.

The new object should be added to the list of encoders on the first page (doc/index.rst).

We also need an example demonstrating the new object.

CHANGES.rst

dirty_cat/datetime_encoder.py

GaelVaroquaux · 2022-03-07T17:34:34Z

dirty_cat/datetime_encoder.py

+    "millisecond", "microsecond", "nanosecond"}, default="hour"
+        Extract up to this granularity, and gather the rest into the "other" feature.
+        For instance, if you specify "day", only "year", "month", "day" and "other" features will be created.
+        The "other" feature will be a numerical value expressed in the "extract_until" unit.


My gut feeling is that the "other" feature would make learning easier if it were the full time to epoch. Otherwise, the model may need to learn the weird algebra: 24 hours in a day, roughly 30 days in a month, 365 days in a year but not always.
Probably, having both features would be useful...

Hmm, interesting. I get your point, but here's why I chose to do it this way:

I fear that full_time_to_epoch would be very collinear with the highest-level feature, hurting interpretability. As a worst-case scenario, we can imagine a dataset of different hours during the same day for different years.

It's true that with my choice, the learner has to learn the weird algebra. But with your proposal, the learner would still need to learn this weird algebra if it wants to use my "other" feature (which may be more important for prediction, since it carries information that isn't contained in the other variables), because other = full_time_to_epoch - (year - 1970) * 365 * 24 * 3600 + …

Maybe using both could indeed make the learning easier, but wouldn't you worry about collinearity?
All in all I'm not sure.
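The worst case described above can be checked numerically. The following is only a rough sketch using the standard library: the dataset (three hours of the same calendar day over twenty years) and the `pearson` helper are made up for illustration and are not part of the PR.

```python
from datetime import datetime, timezone
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Illustrative worst case: same day of the year, a few different hours,
# repeated over many years (assumed data, not taken from the PR).
stamps = [
    datetime(year, 3, 8, hour, tzinfo=timezone.utc)
    for year in range(2000, 2020)
    for hour in (6, 12, 18)
]
years = [s.year for s in stamps]
hours = [s.hour for s in stamps]
epoch_seconds = [s.timestamp() for s in stamps]

# The time to epoch is dominated by the year, so it is almost perfectly
# correlated with it, and almost uncorrelated with the hour.
r_year = pearson(years, epoch_seconds)
r_hour = pearson(hours, epoch_seconds)
print(f"corr(year, epoch) = {r_year:.6f}")
print(f"corr(hour, epoch) = {r_hour:.6f}")
```

On such data the time to epoch adds almost nothing beyond the year feature, which is the interpretability concern raised here; it says nothing about predictive performance.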

I do machine learning. I don't worry about collinearity :).

It does not hurt prediction. It can hurt interpretability.

My reason for favoring the full_time_to_epoch is that I think that, in general, it is more likely to be useful than the other end, and we are more likely to have to reconstruct it.
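To make the two options concrete, here is a minimal sketch of what each feature could look like when extracting up to "day". It uses only the standard library; the function name `encode_until_day`, the dictionary layout, and the choice of seconds as the unit for the time to epoch are assumptions for illustration, not the code of the PR.

```python
from datetime import datetime, timezone


def encode_until_day(ts):
    """Hypothetical helper: split a timezone-aware datetime into features.

    "other" is the remainder below the day, expressed in days (the
    extract_until unit); "full" is the full time to epoch, assumed here
    to be in seconds.
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    remainder_days = (ts - midnight).total_seconds() / 86400
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "other": remainder_days,
        "full": ts.timestamp(),
    }


features = encode_until_day(datetime(2022, 3, 8, 18, 0, tzinfo=timezone.utc))
print(features["other"])  # 18 hours out of 24 -> 0.75
```

With "full", a model sees a single monotonic time axis directly; with "other", it would have to combine year, month, and day to recover that axis.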

dirty_cat/datetime_encoder.py

dirty_cat/test/test_datetime_encoder.py

GaelVaroquaux · 2022-03-08T11:54:52Z

Do you think a new dependency is worth it?

No, I don't think that a new dependency is worth it. I'm wondering if we should simply drop this feature (less features = less code = less problems :D )


LeoGrin · 2022-06-28T18:28:20Z

Additions:

Replaced the "other" feature with the "full" feature, which contains the full time to epoch (see the discussion above with @GaelVaroquaux)
Made the DatetimeEncoder the default for datetime columns in the SuperVectorizer
Added a simple example
Small fixes

LeoGrin · 2022-06-28T21:04:29Z

Apparently, using fetch_traffic_violations() in the example takes too much memory for CircleCI.

LeoGrin · 2022-06-28T23:33:19Z

I've changed the example to use another dataset. I think the PR is ready for review @LilianBoulard @GaelVaroquaux @jovan-stojanovic

GaelVaroquaux

A few changes requested. Most are minor. I hope that the prediction example won't ask for too much work.

examples/06_datetime_encoder.py

GaelVaroquaux · 2022-06-29T15:36:33Z

examples/06_datetime_encoder.py

+breaks down each datetime features into several numerical features, by extracting relevant information from the
+datetime features, such as the month, the day of the week, the hour of the day, etc. Used in
+the SuperVectorizer, which automatically detects the datetime features, the DatetimeEncoder allows
+to handle datetime features easily.


Would it be possible to add a section at the end with a small prediction model, to showcase this in a prediction pipeline?

For didactic purposes it would be important to have not only the cross-validation (using https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html ) but also the plot of the extrapolated time series (compared to the real one)
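The point of a time-series split is that every training fold lies strictly before its test fold. The sketch below is a hand-rolled expanding-window splitter written to illustrate that idea with the standard library only; it is not scikit-learn's implementation, and the name `expanding_window_splits` is made up here. In the actual example one would presumably use the linked TimeSeriesSplit class.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs for time-ordered samples.

    Each test fold directly follows its training fold, and the training
    window grows from one split to the next.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        train = list(range(start))
        test = list(range(start, start + test_size))
        yield train, test


splits = list(expanding_window_splits(n_samples=6, n_splits=3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Unlike a shuffled K-fold, no split lets the model train on observations that come after the ones it is evaluated on.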

dirty_cat/datetime_encoder.py


LeoGrin · 2022-06-29T21:31:39Z

Thank you for the comments! I've added a new section to the example

GaelVaroquaux · 2022-06-30T05:44:30Z

Oops, I just saw that there is an "examples/.DS_Store" that needs to be removed. You can add it to the .gitignore if it's helpful.

GaelVaroquaux

Still cosmetic comments on the example :)

examples/06_datetime_encoder.py


GaelVaroquaux

LGTM. Thanks a lot!

GaelVaroquaux · 2022-07-01T09:14:30Z

Hum, I merged this, but I hadn't seen that the lines are too long (I'll fix it, I wanted to do cosmetics on the example anyhow).

It does have a practical consequence: it forces horizontal scrolling in the example, which is bad for readability.

GaelVaroquaux · 2022-10-11T06:55:47Z

This actually made me wonder: in which case should I use a class attribute vs a module-level constant?

Use a class attribute if you might want to override it in a subclass.
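A small sketch of that rule of thumb, with made-up names (`DEFAULT_UNITS`, `BaseEncoder`, `HourlyEncoder`) that do not come from the PR: a value read through `self` can be overridden by a subclass, while a module-level constant is shared by every class that refers to it.

```python
# Module-level constant: the same value for every class in the module.
DEFAULT_UNITS = ("year", "month", "day")


class BaseEncoder:
    # Class attribute: looked up through the instance, so a subclass
    # can replace it without touching any method.
    units = ("year", "month", "day")

    def feature_names(self, column):
        return [f"{column}_{unit}" for unit in self.units]

    def default_feature_names(self, column):
        # Ignores any subclass override, since it reads the module constant.
        return [f"{column}_{unit}" for unit in DEFAULT_UNITS]


class HourlyEncoder(BaseEncoder):
    units = BaseEncoder.units + ("hour",)


names = HourlyEncoder().feature_names("date")
default_names = HourlyEncoder().default_feature_names("date")
print(names)
print(default_names)
```

Only the method that reads `self.units` picks up the extra "hour" unit in the subclass.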

LeoGrin added 11 commits February 18, 2022 12:01

encoder

41f6404

First working version

371e1c8

add holidays option + gather time below extract_until into one numerical feature

a7e05e7

handle NaNs

3168fff

handling timezone-aware dates

fee04db

add get_features_name

02d57dd

doc + change default for add_day_of_the_week

97dab4a

CHANGES.rst

d965e95

doc + change of variable name

b5a81cd

doc

5bb73da

change array creation in test for numpy 1.16 compatibility

7020412

LilianBoulard requested changes Mar 2, 2022


LeoGrin and others added 2 commits March 2, 2022 17:56

Apply suggestions from code review

b835cf8

Applying @Lilian suggested changes Co-authored-by: Lilian <lilian@boulard.fr>

taking into account Lilian's review

b81ddb0

GaelVaroquaux requested changes Mar 7, 2022


LeoGrin and others added 13 commits March 8, 2022 14:46

reorder import (Gael review)

ffe088e

error in docstring of transform (Gael review)

9bf78df

Update dirty_cat/datetime_encoder.py

21b2094

Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>

Sphynx :class: (Gael review)

d939740

instance attribute to module variable

29de9e5

make instance attribute either private (leading underscore) or correctly written (trailing underscore)

db82a50

invert get_feature_names and get_feature_names_out (Gael's review)

59ab3fa

formatting

0cfa1ec

docstring formatting

4886b10

fix doctrings (Gael review)

c693e85

doc

1ff4aff

Remove add_holidays option for DatetimeEncoder

99da5fc

ok

1d78e11

Merge branch 'datetime_encoder' of https://github.com/LeoGrin/dirty_cat into datetime_encoder

typo

b622302

LeoGrin added 7 commits June 28, 2022 20:43

reduce requirements

063f751

less memory usage in example

1c8e073

test

2797790

gitignore

19eaa28

less memory

3a1a6dd

test

d8f4a07

example without test

79c8c95

new example + bug fix in supervectorizer

2265b1a

GaelVaroquaux requested changes Jun 29, 2022


LeoGrin and others added 5 commits June 29, 2022 20:08

Apply suggestions from code review

0722127

Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>

replace the feature by

c582f9e

adding prediction example

42bbc87

new section on prediction

c04f129

Merge branch 'master' into datetime_encoder

e782c60

GaelVaroquaux reviewed Jun 30, 2022


LeoGrin and others added 4 commits June 30, 2022 13:30

Apply suggestions from code review

2691538

Subsections Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>

Merge branch 'master' into datetime_encoder

66ea058

remove DS_store

13bfe04

change example order + small improvements

a0a7723

GaelVaroquaux approved these changes Jul 1, 2022


GaelVaroquaux merged commit c71498e into skrub-data:master Jul 1, 2022

LeoGrin mentioned this pull request Jul 1, 2022

Handling date columns in SuperVectorizer #233

Closed
