Datetime encoder #239

Merged
merged 52 commits into from
Jul 1, 2022

LeoGrin
Contributor

@LeoGrin LeoGrin commented Feb 19, 2022

Creates a new encoder that transforms datetime columns into several numerical features (year, month, day...). Solves the second part of #233.
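
As a rough illustration of the idea described above, the sketch below splits datetimes into integer components using only the Python standard library. It is not the DatetimeEncoder code from this PR; `extract_features` and `FIELDS` are hypothetical names chosen for the example.

```python
from datetime import datetime

# Hypothetical field list for illustration; the encoder in this PR
# decides which components to extract on its own terms.
FIELDS = ("year", "month", "day", "hour")


def extract_features(values, fields=FIELDS):
    """Turn each datetime into a list of integer components."""
    return [[getattr(value, field) for field in fields] for value in values]


dates = [datetime(2022, 2, 19, 14, 30), datetime(2022, 7, 1, 9, 0)]
features = extract_features(dates)
print(features)
```

Each row holds one numerical value per extracted component, which is the shape a downstream estimator expects.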

Member

@LilianBoulard LilianBoulard left a comment

Here's a bunch of small adjustments, but overall this is very good! Thank you :)

dirty_cat/datetime_encoder.py Outdated Show resolved Hide resolved
CHANGES.rst Outdated Show resolved Hide resolved
dirty_cat/datetime_encoder.py Outdated Show resolved Hide resolved
dirty_cat/datetime_encoder.py Outdated Show resolved Hide resolved
dirty_cat/datetime_encoder.py Outdated Show resolved Hide resolved
dirty_cat/datetime_encoder.py Show resolved Hide resolved
dirty_cat/datetime_encoder.py Show resolved Hide resolved
dirty_cat/datetime_encoder.py Outdated Show resolved Hide resolved
dirty_cat/datetime_encoder.py Outdated Show resolved Hide resolved
dirty_cat/datetime_encoder.py Outdated Show resolved Hide resolved
LeoGrin and others added 2 commits March 2, 2022 17:56
Applying @Lilian suggested changes

Co-authored-by: Lilian <lilian@boulard.fr>
@LeoGrin
Contributor Author

LeoGrin commented Mar 2, 2022

Thank you for the review, @LilianBoulard!

Member

@GaelVaroquaux GaelVaroquaux left a comment

Nice!

I left a few comments, but none of them are major.

The new object should be added to the list of encoders on the first page (doc/index.rst).

We also need an example demonstrating the new object.

CHANGES.rst Outdated Show resolved Hide resolved
dirty_cat/datetime_encoder.py Show resolved Hide resolved
dirty_cat/datetime_encoder.py Outdated Show resolved Hide resolved
"millisecond", "microsecond", "nanosecond"}, default="hour"
Extract up to this granularity, and gather the rest into the "other" feature.
For instance, if you specify "day", only "year", "month", "day" and "other" features will be created.
The "other" feature will be a numerical value expressed in the "extract_until" unit.
Member

My gut feeling is that the "other" feature would make learning easier if it were the full time to epoch. Otherwise, the model may need to learn the weird algebra: 24 hours in a day, roughly 30 days in a month (but not quite), 365 days in a year (and sometimes not).
Having both features would probably be useful.

Contributor Author

@LeoGrin LeoGrin Mar 8, 2022

Hmm, interesting. I get your point, but here's why I chose to do it this way:

  • I fear that full_time_to_epoch would be very collinear with the highest-level feature, hurting interpretability. As a worst-case scenario, imagine a dataset containing different hours of the same day across different years.
  • It’s true that with my choice, the learner has to learn the weird algebra. But with your proposal, the learner would still need to learn this weird algebra if it wants to recover my "other" feature (which may be more important for prediction, since it carries information that isn’t contained in the other variables), because other = full_time_to_epoch - (year - 1970) * 365 * 24 * 3600 + …

Maybe using both could indeed make the learning easier, but wouldn't you worry about collinearity?
All in all I'm not sure.
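
To make the two candidates in this discussion concrete, here is a minimal standard-library sketch that computes both a remainder-style "other" value and the full time to epoch for one timestamp. It assumes extraction stops at "day" and expresses "other" as a fraction of a day; `encode_until_day` is a hypothetical helper, not code from this PR.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
SECONDS_PER_DAY = 24 * 3600


def encode_until_day(ts):
    """Return (year, month, day, other, total) for a timezone-aware datetime.

    other: time elapsed since midnight, as a fraction of a day (assumption)
    total: seconds elapsed since the Unix epoch
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    other = (ts - midnight).total_seconds() / SECONDS_PER_DAY
    total = (ts - EPOCH).total_seconds()
    return ts.year, ts.month, ts.day, other, total


row = encode_until_day(datetime(2022, 3, 8, 12, 0, tzinfo=timezone.utc))
print(row)
```

The "other" value only carries what the extracted components leave out, while the total grows monotonically with time and therefore moves together with the year. That is the collinearity trade-off being debated here.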

Member

I do machine learning. I don't worry about collinearity :).

It does not hurt prediction. It can hurt interpretability.

Member

My reason for favoring the full_time_to_epoch is that I think that, in general, it is more likely to be useful than the other end, and we are more likely to have to reconstruct it.

dirty_cat/datetime_encoder.py Outdated Show resolved Hide resolved
dirty_cat/datetime_encoder.py Outdated Show resolved Hide resolved
dirty_cat/datetime_encoder.py Outdated Show resolved Hide resolved
dirty_cat/datetime_encoder.py Outdated Show resolved Hide resolved
dirty_cat/test/test_datetime_encoder.py Outdated Show resolved Hide resolved
dirty_cat/test/test_datetime_encoder.py Outdated Show resolved Hide resolved
@GaelVaroquaux
Member

GaelVaroquaux commented Mar 8, 2022 via email

@LeoGrin
Contributor Author

LeoGrin commented Jun 28, 2022

Additions:

  • Replaced the "other" feature with the "full" feature, which contains the full time to epoch (see the discussion above with @GaelVaroquaux)
  • Made the DatetimeEncoder the default for datetime columns in the SuperVectorizer
  • Added a simple example
  • Small fixes

@LeoGrin
Contributor Author

LeoGrin commented Jun 28, 2022

Apparently, using fetch_traffic_violations() in the example takes too much memory for CircleCI.

@LeoGrin
Contributor Author

LeoGrin commented Jun 28, 2022

I've changed the example to use another dataset. I think the PR is ready for review @LilianBoulard @GaelVaroquaux @jovan-stojanovic

Member

@GaelVaroquaux GaelVaroquaux left a comment

A few changes requested. Most are minor. I hope that the prediction example won't ask for too much work.

examples/06_datetime_encoder.py Outdated Show resolved Hide resolved
examples/06_datetime_encoder.py Outdated Show resolved Hide resolved
breaks down each datetime features into several numerical features, by extracting relevant information from the
datetime features, such as the month, the day of the week, the hour of the day, etc. Used in
the SuperVectorizer, which automatically detects the datetime features, the DatetimeEncoder allows
to handle datetime features easily.
Member

Would it be possible to add a section at the end with a small prediction model, to showcase this in a prediction pipeline?

For didactic purposes, it would be important to have not only the cross-validation (using https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html ) but also a plot of the extrapolated time series compared to the real one.
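
The reason a time-series split matters for this kind of example can be sketched in a few lines: every training fold must end before its test fold begins, so the model is always evaluated on dates later than those it was trained on. The generator below is a hand-written illustration of that expanding-window idea, not scikit-learn's TimeSeriesSplit code; `expanding_window_splits` is a hypothetical name.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs in chronological order.

    The training window grows at each step and always precedes the
    test window, so no future sample leaks into training.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for this many splits")
    first_test = n_samples - n_splits * test_size
    for start in range(first_test, n_samples, test_size):
        yield list(range(start)), list(range(start, start + test_size))


# Samples are assumed to be sorted by date before splitting.
splits = list(expanding_window_splits(n_samples=6, n_splits=3))
for train, test in splits:
    print(train, test)
```

A random shuffle split would mix past and future samples, which tends to give optimistic scores for models that rely on time-derived features.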

dirty_cat/datetime_encoder.py Outdated Show resolved Hide resolved
@LeoGrin
Contributor Author

LeoGrin commented Jun 29, 2022

Thank you for the comments! I've added a new section to the example

@GaelVaroquaux
Member

Oops, I just saw that there is an "examples/.DS_Store" file that needs to be removed. You can add it to the .gitignore if that's helpful.

Member

@GaelVaroquaux GaelVaroquaux left a comment

Still cosmetic comments on the example :)

examples/06_datetime_encoder.py Outdated Show resolved Hide resolved
examples/06_datetime_encoder.py Outdated Show resolved Hide resolved
examples/06_datetime_encoder.py Outdated Show resolved Hide resolved
examples/06_datetime_encoder.py Outdated Show resolved Hide resolved
Member

@GaelVaroquaux GaelVaroquaux left a comment

LGTM. Thanks a lot!

@GaelVaroquaux GaelVaroquaux merged commit c71498e into skrub-data:master Jul 1, 2022
@GaelVaroquaux
Member

Hum, I merged this, but I hadn't seen that the lines are too long (I'll fix it, I wanted to do cosmetics on the example anyhow).

It does have a practical consequence: it forces horizontal scrolling in the example, which is bad for readability.

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 11, 2022 via email

Labels
enhancement New feature or request