-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Datetime encoder #239
Datetime encoder #239
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's bunch of small adjustments, but overall this is very good ! Thank you :)
Applying @Lilian suggested changes Co-authored-by: Lilian <lilian@boulard.fr>
Thank you for the review @LilianBoulard ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
I left a few comments, but none of them are major.
The new object should be added to the list of encoders on the first page (doc/index.rst).
We also need an example demonstrating the new object.
dirty_cat/datetime_encoder.py
Outdated
"millisecond", "microsecond", "nanosecond"}, default="hour" | ||
Extract up to this granularity, and gather the rest into the "other" feature. | ||
For instance, if you specify "day", only "year", "month", "day" and "other" features will be created. | ||
The "other" feature will be a numerical value expressed in the "extract_until" unit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My gut feeling is that the "other" feature would make the learning easier if it were the full time to epoch. Else, the model may need to learning the weird algebra: 24H, 30 days a months but not quite, 365 days in a year and sometimes not.
Probably, having both features would be useful....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm interesting. I get you point, but here's why I chose to do it this way:
- I fear that full_time_to_epoch would be very collinear with the highest-level feature, hurting interpretability. For a worst case scenario, we can imagine a dataset of different hours during the same day for different years.
- It’s true that with my choice, the learner has to learn the weird algebra. But with your proposition, the learner would still need to learn this weird algebra if it wants to use my
other
feature (which may be more important for prediction since it’s information which isn’t contained in the other variables), becauseother
= full_time_to_epoch - (year - 1970) * 365 * 24 * 3600 + …
Maybe using both could indeed make the learning easier, but wouldn't you worry about collinearity?
All in all I'm not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do machine learning. I don't worry about collinearity :).
It does not hurt for prediction. I can hurt for interpretability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reason for favoring the full_time_to_epoch is that I think that, in general, it is more likely to be useful than the other end, and we are more likely to have to reconstruct it.
Do you think a new dependency is worth it?
No, I don't think that a new dependency is worth it.
I'm wondering if we should simply drop this feature (less features = less code = less problems :D )
|
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
…tly written (trailing underscore)
Merge branch 'datetime_encoder' of https://github.com/LeoGrin/dirty_cat into datetime_encoder
Additions:
|
Apparently using fetch_traffic_violations() in the example takes too much memory for circleCI |
I've changed the example to use another dataset, I think the PR is ready for review @LilianBoulard @GaelVaroquaux @jovan-stojanovic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few changes requested. Most are minor. I hope that the prediction example won't ask for too much work.
breaks down each datetime features into several numerical features, by extracting relevant information from the | ||
datetime features, such as the month, the day of the week, the hour of the day, etc. Used in | ||
the SuperVectorizer, which automatically detects the datetime features, the DatetimeEncoder allows | ||
to handle datetime features easily. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to do a small prediction model at the end, add a section, to showcase this in a prediction pipeline?
For didactic purposes it would be important to have not only the cross-validation (using https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html ) but also the plot of the extrapolated time series (compared to the real one)
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Thank you for the comments! I've added a new section to the example |
Oups, I just saw that there is a "examples/.DS_Store" that needs to be removed. You can add it in the gitignore if it's helpful. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still cosmetic comments on the example :)
Subsections Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks a lot!
Hum, I merged this, but I hadn't seen that the lines are too long (I'll fix it, I wanted to do cosmetics on the example anyhow). It does have a practical consequence: it forces horizontal scrolling in the example, which is bad for readability. |
This actually made me wonder: in which case should I use a class attribute vs a module-level constant?
Use a class attribute if you might want to override it in a subclass.
|
Creates a new encoder which transform datetime columns into several numerical features (year, month, day...). Solves the second part of #233.