DOC new example on feature engineering for cyclic time features #20281

ogrisel · 2021-06-16T17:16:52Z

Here is a prototype example to explore some cyclic date-related feature engineering strategies being discussed in #20259.

This is not meant to be reviewed or merged in its current state. In particular, I have not written any narrative yet, but once we have converged on which models we want to highlight, we can turn this into a full-fledged tutorial.

Update: I think this example is worth considering for merging irrespective of the outcome of the discussion in #20259.

ogrisel · 2021-06-17T22:45:57Z

Here is the rendered HTML:

https://141704-843222-gh.circle-artifacts.com/0/doc/auto_examples/applications/plot_cyclical_feature_engineering.html

rth

A very nice and clearly written ~~example~~ tutorial!

cc @lorentzenchr

rth · 2021-06-18T11:09:29Z

examples/applications/plot_cyclical_feature_engineering.py

+# %%
+# We observe that this model performance can almost rival the performance of
+# the gradient boosted trees with an average error around 6% of the maximum
+# demand.


Not asking to add it here, since it's already quite long, but how do polynomial features without cyclic_spline_transformer perform?

Also I wonder if having KBinsDiscretizer would produce mostly equivalent results in terms of score, though with fewer artifacts in the prediction.

Intuitively I think I would agree with you. Let me try on my local copy what you suggest out of curiosity.

PolynomialFeatures or a polynomial kernel approximation on the raw time features does not work any better than the linear model on the raw time features.

Binning, on the other hand, is only slightly worse than spline features (one or two percent worse than the matching model, with or without the polynomial kernel approximation). Here are the plots when binning:

Interesting, thanks for checking it!
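
For readers who want to try the binning variant discussed above, here is a minimal sketch on synthetic data. It is not the code used for the plots in this thread: the toy `hour`/`demand` arrays, the choice of 8 uniform bins and the `Ridge` penalty are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-in for an hourly demand signal (assumption, not the real data).
rng = np.random.default_rng(0)
hour = rng.integers(0, 24, size=500).reshape(-1, 1).astype(float)
demand = np.sin(2 * np.pi * hour.ravel() / 24) + 0.1 * rng.normal(size=500)

# One-hot encoded bins followed by a linear model: the prediction is
# piecewise constant, with one level per bin.
binned_model = make_pipeline(
    KBinsDiscretizer(n_bins=8, encode="onehot", strategy="uniform"),
    Ridge(alpha=1e-3),
)
binned_model.fit(hour, demand)

grid = np.arange(24, dtype=float).reshape(-1, 1)
pred = binned_model.predict(grid)
n_levels = len(np.unique(np.round(pred, 8)))
print(f"{n_levels} distinct predicted levels over 24 hours")
```

The step-shaped prediction is what distinguishes binning from splines, which vary smoothly between knots.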

rth · 2021-06-18T11:18:03Z

It's probably too late to change it now in sphinx-gallery, but plot_cyclical_feature_engineering doesn't really make much sense as a name for something like this. It's not even an example, more a tutorial.

ogrisel · 2021-06-18T14:28:17Z

> It's probably too late to change it now in sphinx-gallery, but plot_cyclical_feature_engineering doesn't really make much sense as a name for something like this. It's not even an example, more a tutorial.

It's not too late at all. Where should we put this? What name for the file and the title do you suggest?

ogrisel · 2021-06-20T22:32:54Z

I added one-hot encoding of the time features because it's a natural strong baseline in this case and it makes for interesting analysis. I also started to reorganize the order a bit. I want to do it further (move the plot for the features + linear model before introducing the Nystroem kernel models).
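
As an illustration of the one-hot baseline mentioned above, here is a minimal sketch on synthetic data. The column layout (hour, weekday, temperature), the toy target and the `Ridge` penalty are assumptions; the actual example works on the bike-sharing dataset with its own pipeline.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Two weeks of hourly records (synthetic): columns are hour, weekday, temperature.
n = 24 * 7 * 2
rng = np.random.default_rng(0)
hour = np.arange(n) % 24
weekday = (np.arange(n) // 24) % 7
temp = rng.normal(size=n)
X = np.column_stack([hour, weekday, temp])
y = np.sin(2 * np.pi * hour / 24) + 0.5 * (weekday >= 5) + 0.1 * temp

# Time features are treated as unordered categories; the remaining column
# is passed through unchanged.
one_hot_linear = make_pipeline(
    ColumnTransformer(
        [("one_hot_time", OneHotEncoder(handle_unknown="ignore"), [0, 1])],
        remainder="passthrough",
    ),
    Ridge(alpha=1.0),
)
one_hot_linear.fit(X, y)

n_encoded = one_hot_linear[0].transform(X).shape[1]
print(n_encoded)  # 24 hour columns + 7 weekday columns + 1 passthrough column
```

With one column per hour and per weekday the model can fit any per-level effect, which is why it is a strong baseline here, at the cost of ignoring the ordering of the levels.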

thomasjpfan

This example only requiring ~ 15 seconds is very impressive given the scope.

thomasjpfan · 2021-06-21T20:36:53Z

examples/applications/plot_cyclical_feature_engineering.py

+# %%
+# We visualize those predictions by zooming on the last 96 hours (4 days) of
+# the test set to get some qualitative insights:
+plt.figure(figsize=(12, 4))


May we use the OO interface of matplotlib? (especially now that we have so many plots)

thomasjpfan · 2021-06-21T20:38:41Z

examples/applications/plot_cyclical_feature_engineering.py

+# Again we zoom on the last 4 days of the test set:
+
+last_hours = slice(-96, None)
+plt.figure(figsize=(12, 4))


mpl OO interface here as well?
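
A minimal sketch of what the object-oriented matplotlib style could look like for this kind of plot. The `y_true` and `y_pred` arrays are synthetic placeholders, not the predictions of the example, and the labels are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data standing in for the test-set targets and predictions.
rng = np.random.default_rng(0)
y_true = rng.uniform(0, 1, size=500)
y_pred = y_true + 0.05 * rng.normal(size=500)

last_hours = slice(-96, None)

# OO interface: create the figure and axes explicitly, then call methods
# on them instead of relying on the implicit pyplot state.
fig, ax = plt.subplots(figsize=(12, 4))
fig.suptitle("Predictions on the last 96 hours (synthetic data)")
ax.plot(y_true[last_hours], "x-", alpha=0.2, color="black", label="Actual demand")
ax.plot(y_pred[last_hours], "x-", label="Model predictions")
ax.set_xlabel("Hours since start of window")
_ = ax.legend()
```

Holding explicit `fig` and `ax` handles makes it harder for one of the many plots in the example to draw on another plot's axes by accident.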

TomDLT

Awesome example!

TomDLT · 2021-06-21T21:00:26Z

examples/applications/plot_cyclical_feature_engineering.py

+    np.linspace(0, 26, 1000).reshape(-1, 1),
+    columns=["hour"],
+)
+splines = periodic_spline_transformer(24, n_knots=12).fit_transform(hour_df)


I find it non-intuitive to have 11 splines in the figure. I know this number is arbitrary, but to relate splines with bins, wouldn't it make more sense to have 12 splines (and 13 knots)?
Then in periodic_spline_transformer, the default would be n_knots = period + 1.

I'll see what I can do. It makes me think that maybe the SplineTransformer should allow for n_splines and a period argument...

But that could make the parameters docstring very complex to understand. /cc @lorentzenchr.

When writing the SplineTransformer, I thought the number of knots was more intuitive than the number of splines/dof. But I documented the numbers very clearly (and the output size is accessible via n_features_out_). I did not, however, think of periodic splines or a period. That came only a little later with @mlondschien.

A period argument, however, would make sense in my opinion.

I was only suggesting to use periodic_spline_transformer(24, n_knots=13) in this example, to get 12 splines instead of 11. I agree the numbers are well documented in SplineTransformer.

Nice writeup! Contrary to what you wrote above, @ogrisel, the numbers of knots you chose are not natural; they are arbitrary. I would vary the period but keep the number of knots fixed for month / weekday / hour (e.g. 5?). If you use period + 1 knots, resulting in period splines, the resulting features are equivalent to one-hot-encoded features (assuming integer-valued features). This is why the performance of splines is so similar to that of one-hot-encoded features. To benefit from the additional "smoothness" of splines, you need to reduce the number of splines. Note that you could use non-evenly spaced knots, e.g. via quantiles.

If you want to display the strengths of periodic splines, I would suggest including interactions between periodic transformations of the time variables. With e.g. 4 knots this could be manageable, whereas it would explode for one-hot-encoded features.

I updated the example to control for the number of splines and made the number of knots a technical detail.

lorentzenchr

Just excellent and a lot of fun to read!

lorentzenchr · 2021-06-24T16:20:03Z

Already now: the best tutorial/example of the year. And we still have a year to go🥳 🍻

rth · 2021-06-24T16:26:11Z

> It's not too late at all. Where should we put this? What name for the file and the title do you suggest?

Maybe something like cyclical_feature_engineering_tutorial.html or cyclical_feature_engineering_example.html would be a better name? And then we would need to change sphinx-gallery pattern matching to include those patterns.

ogrisel · 2021-06-25T08:01:36Z

> And then we would need to change sphinx-gallery pattern matching to include those patterns.

Agreed, but I would rather not change the sphinx gallery as part of this PR but instead coordinate via #18257.

ogrisel · 2021-06-25T09:02:10Z

I think I addressed all the comments. Thanks for the reviews!

glemaitre · 2021-06-25T13:05:53Z

examples/applications/plot_cyclical_feature_engineering.py

+#
+# Here, we do minimal ordinal encoding for the categorical variables and then
+# let the model know that it should treat those as categorical variables by
+# using a dedicated tree splitting rule.


It might be nice to mention that we explicitly provide the order of the categories to avoid automatic ordering based on lexicography.
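
To illustrate the point about explicit category order, here is a small sketch comparing the default ordering of `OrdinalEncoder` with an explicitly provided one. The season names are used as an illustrative assumption about the categorical values.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

seasons = np.array([["spring"], ["summer"], ["fall"], ["winter"]], dtype=object)

# Default behaviour: categories are sorted lexicographically
# (fall, spring, summer, winter), which scrambles the natural order.
auto_codes = OrdinalEncoder().fit_transform(seasons).ravel()

# Explicit categories keep the order in which the seasons follow each other.
explicit_codes = (
    OrdinalEncoder(categories=[["spring", "summer", "fall", "winter"]])
    .fit_transform(seasons)
    .ravel()
)

print(auto_codes)      # [1. 2. 0. 3.]
print(explicit_codes)  # [0. 1. 2. 3.]
```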

glemaitre · 2021-06-25T13:08:29Z

examples/applications/plot_cyclical_feature_engineering.py

+# %%
+# This model has an average error around 4 to 5% of the maximum demand. This is
+# quite good for a first trial without any hyper-parameter tuning! We just had
+# to make the categorical variables explicit. Note that the time related


I see that you mentioned this point now. I was expecting to see it a bit earlier :)

I find it more lightweight to do it this way.

glemaitre · 2021-06-25T13:22:59Z

examples/applications/plot_cyclical_feature_engineering.py

+
+"""
+==========================
+Cyclic feature engineering


I don't know if we should have something more related to "date-time encoding". I think it might be easier to find than "cyclic", even if the title is correct.

I changed the title. The title and the filename no longer match though. I think it's ok but not 100% sure.

mlondschien · 2021-06-25T16:05:28Z

This is a very nice tutorial!

Since I added the periodic feature to the SplineTransformer I thought I would weigh in:

- As mentioned above, there is no benefit in using any transformer that produces k - 1 features or more on a variable with k distinct values in a linear model. This also holds for the SplineTransformer, e.g. with include_bias=False and n_knots=8 for weekday. The difference here is due to regularisation.
- weekday and month do not take enough values to justify using splines. A better use of splines would be on the day of the year, where I would expect splines to outperform one-hot-encoded month. However, this is not included in the data (and engineering it is probably out of the scope of this tutorial).
- (Periodic) splines allow for a smooth reduction of the number of features compared with one-hot encoding. This is not necessary for month, weekday and hour, but might be valuable to construct interactions of these features. E.g. adding an hour and workday interaction with (only) 8 features reduces the MAE and RMSE of the one-hot and cyclic splines pipelines by ~25%:

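from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures

# periodic_spline_transformer is the helper defined in the example itself.
hour_workday_interaction = make_pipeline(
    ColumnTransformer(
        [
            ("cyclic_hour", periodic_spline_transformer(24, n_splines=8), ["hour"]),
            # "workingday" holds the strings "True"/"False"; turn it into a boolean.
            ("workingday", FunctionTransformer(lambda x: x == "True"), ["workingday"]),
        ]
    ),
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
)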

I don't think that OLS (or ridge) is a good fit here. A log-link GLM (e.g. GridSearchCV(TweedieRegressor(power=2), param_grid={"alpha": alphas})) is probably a better fit (the gamma performs better). Again I assume that this is out of scope for this tutorial.
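
A minimal runnable version of the log-link GLM suggestion, on synthetic data. The feature matrix, the gamma-distributed target, the `alphas` grid, `max_iter` and `cv=3` are all assumptions made for illustration; only the `TweedieRegressor(power=2)` inside `GridSearchCV` comes from the comment above.

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic strictly positive target with a multiplicative (log-linear) mean.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
mu = np.exp(1.0 + X @ np.array([0.5, -0.3, 0.8]))
y = rng.gamma(shape=2.0, scale=mu / 2.0)

# power=2 selects the gamma deviance; the log link is spelled out here
# even though it is what link="auto" would pick for this power.
alphas = np.logspace(-4, 0, 5)
search = GridSearchCV(
    TweedieRegressor(power=2, link="log", max_iter=1000),
    param_grid={"alpha": alphas},
    cv=3,
)
search.fit(X, y)

predictions = search.predict(X)
print(search.best_params_)
```

With a log link the predictions are always positive, which a ridge model on raw demand counts cannot guarantee.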

ogrisel · 2021-06-25T17:05:58Z

> I don't think that OLS (or ridge) is a good fit here. A log-link GLM (e.g. GridSearchCV(TweedieRegressor(power=2), param_grid={"alpha": alphas})) is probably a better fit (the gamma performs better). Again I assume that this is out of scope for this tutorial.

That's a good point. But the execution speed won't be the same. Maybe I will add a note.

I will think a bit how to take your other insightful remarks into account.

lorentzenchr

some nitpicks

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

ogrisel · 2021-07-18T21:38:23Z

@mlondschien I have updated the notebook to take your remarks into account in the last few commits.

ogrisel · 2021-07-19T11:39:10Z

@mlondschien we have a problem with the periodic splines:

On the figure there are 12 splines for a period of 24, as expected. The periodic signal seems to start again at the right location to ensure continuity. But it seems that we have a missing spline near the end of the period, even though the number of splines (i.e. of output features) is correct.

This seems to be caused by using include_bias=False. Using include_bias=True and setting n_knots = n_splines + 1 seems to fix the problem:

See the commit below:

mlondschien · 2021-07-19T14:47:27Z

What is the use-case / expected behaviour here? Are you interested in producing pretty plots or features for modelling?

I see three mutually incompatible requirements here (e.g. for a period of 24):

- knots that are spaced two hours apart
- 12 splines
- include_bias=False

If you want 12 splines with knots that are spaced two hours apart, you need to pass n_knots=13, knots="uniform" (or equivalently knots=np.linspace(0, 24, 13).reshape(-1, 1)) and include_bias=True. The 12 splines will sum to one (include_bias=True), so using them in a non-regularized model is discouraged. This is what you have implemented in 26051c4 via periodic_spline_transformer(24, 12). If you plot the resulting splines, you should get the second figure you posted.

If you want splines with knots that are spaced two hours apart without an intercept, you need to pass n_knots=13, knots="uniform" (or equivalently knots=np.linspace(0, 24, 13).reshape(-1, 1)) and include_bias=False. If you plot the resulting splines, it will appear as if one was missing. To get this you have to revert 26051c4 and pass periodic_spline_transformer(24, 11). I imagine that this is what you might want.

If you want 12 splines without an intercept, you need to pass n_knots=14, knots="uniform" (or equivalently knots=np.linspace(0, 24, 14).reshape(-1, 1)) and include_bias=False. This is what you got before 26051c4 with periodic_spline_transformer(24, 12). The knots will be 24 / 13 ~ 1.85 apart. The result is the first figure you posted. I doubt that this is what you intended.

lorentzenchr · 2021-07-19T20:50:21Z

As we always use an L2 penalty (Ridge regression), we can set include_bias=True.
As @mlondschien already pointed out, include_bias=True means that the sum over all spline basis functions gives one at every point. In other words, a linear combination of the splines gives an intercept column.

lorentzenchr · 2021-07-19T20:51:16Z

BTW, I reeeeeeeally like the addition of interactions in 8d066f3!

ogrisel · 2021-07-19T21:20:50Z

> If you want 12 splines with knots that are spaced two hours apart, you need to pass n_knots=13, knots="uniform" (or equivalently knots=np.linspace(0, 24, 13).reshape(-1, 1)) and include_bias=True.

I agree. Since we use regularization, I think this is what makes most sense for this example: we want a symmetric handling of all the hours. I prefer to not use knots="uniform" to control the period explicitly.

ogrisel · 2021-07-26T09:03:13Z

I merged this. Thank you everyone for the detailed reviews.

apachaves · 2021-07-26T09:17:39Z

This looks awesome! Thank you for this.

glemaitre · 2021-07-26T09:42:36Z

Yeah indeed 🥇

…it-learn#20281) Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

ogrisel added Documentation module:preprocessing labels Jun 16, 2021

ogrisel mentioned this pull request Jun 16, 2021

[WIP] New Feature - CyclicalEncorder (cosine/sine) in preprocessing #20259

Closed

ogrisel changed the title ~~WIP cyclic feature engineering example~~ DOC cyclic feature engineering example Jun 17, 2021

ogrisel marked this pull request as ready for review June 17, 2021 18:51

ogrisel force-pushed the cyclic_feature_engineering branch from 4f69c63 to e620c5f Compare June 17, 2021 21:27

ogrisel changed the title ~~DOC cyclic feature engineering example~~ DOC new example on feature engineering for cyclic time features Jun 18, 2021

rth approved these changes Jun 18, 2021

thomasjpfan reviewed Jun 21, 2021

TomDLT reviewed Jun 21, 2021

lorentzenchr approved these changes Jun 22, 2021

lorentzenchr reviewed Jun 23, 2021

TomDLT reviewed Jun 23, 2021

glemaitre reviewed Jun 25, 2021

glemaitre reviewed Jun 25, 2021

glemaitre reviewed Jun 25, 2021

glemaitre self-requested a review June 25, 2021 13:19

glemaitre reviewed Jun 25, 2021

glemaitre reviewed Jun 25, 2021

ogrisel added 5 commits July 18, 2021 18:04

Make degrees of freedom of each pipeline more explicit

e84aee5

Various improvements

dcb65ac

Cosmetics

de16e81

Better labels on the first plot

c788eca

More compressed spline model

a51a667

ogrisel force-pushed the cyclic_feature_engineering branch from 26c7b6f to a51a667 Compare July 18, 2021 16:04

lorentzenchr reviewed Jul 18, 2021

ogrisel and others added 3 commits July 18, 2021 22:28

Apply suggestions from code review

65b84a2

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

Demo PolynomialFeatures for workingday/hours interactions

8d066f3

Expand final remarks

df3800d

Use tab20b doe the splines plot

3e069af

Move spline peak locations to get balanced coverage of the feature range

26051c4

ogrisel commented Jul 19, 2021

Apply suggestions from code review

163d29e

ogrisel merged commit 45bb9ab into scikit-learn:main Jul 26, 2021

ogrisel deleted the cyclic_feature_engineering branch July 26, 2021 09:02

This was referenced Jul 26, 2021

Formatting fixes in time features example #20605

Merged

Missing labels on a plot for the time features example #20606

Merged

lorentzenchr mentioned this pull request Aug 16, 2021

DOC improve wording of time-related feature engineering example #20759

Merged
