Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Add keep_original_columns option to FourierFeatures trafo #4008

Merged
merged 2 commits into from Jan 29, 2023

Conversation

KishManani
Copy link
Contributor

@KishManani KishManani commented Dec 28, 2022

Reference Issues/PRs

See discussion here: #3996 (comment)

What does this implement/fix? Explain your changes.

This PR makes the following changes:

  1. Adds a keep_original_columns option and sets the default to True with a deprecation warning that in the future this should be set to False. This will make it easier to use FourierFeatures in the context of a pipeline (or FeatureUnion) where we do not wish to passthrough the other columns of X.
  2. Adds tests and modifies the existing tests now that the class can take a keep_original_columns as an argument.
  3. Removes the use of deep copies of X to make this more memory efficient.

What should a reviewer concentrate their feedback on?

  • Are there any concerns about removing the deep copies of X here? @ltsaprounis - would be great to hear your thoughts!

Any other comments?

For additional context around the impact of removing copies of X see the discussion here: #3996 (comment).

PR checklist

For all contributions
  • I've added myself to the list of contributors.
  • I've added unit tests and made sure they pass locally.
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG] indicating whether the PR topic is related to enhancement, maintenance, documentation, or bug.

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, really nice! Deprecation by the book.

@ltsaprounis is listed as owner so would have to approve before we merge.

@KishManani
Copy link
Contributor Author

KishManani commented Dec 28, 2022

Whilst we're on this topic, I have a question regarding what the behaviour of this transformer should be for panel data. The t that goes inside the sine and cosine functions is currently defined by the start of the time series. For panel data, the current implementation would look at the start index of each time series individually when getting the time indices to pass to the sine and cosine functions. This means that at time t, if two different time series have different start times, then the two time series will have different values for their fourier features. I think this could be problematic when fitting global models (e.g., linear regression via reduction). Consider a linear model, whilst the sum of the fourier features (i.e., a sum along axis=1) still represents a fourier series; the model might struggle to find the optimal coefficients because a given fourier feature (e.g., the column sin_7_1) contains not just a single sine wave (i.e.,sin(2*pi*t/7)) but instead a sine wave at multiple different phases (i.e.,sin(2*pi*(t-t_i)/7) where t_i is the start time of time series i).

Here is an example:

import numpy as np
import pandas as pd
from sktime.transformations.series.fourier import FourierFeatures

# Panel dataframe with two timeseries
# Note there are entries in the two time series which have the same date.
df = pd.DataFrame(data={'c0': 
{('h0_0', pd.to_datetime('2000-01-01 00:00:00')): 2.1482432004268444,
  ('h0_0', pd.to_datetime('2000-01-02 00:00:00')): 1.5974533778936917,
  ('h0_0', pd.to_datetime('2000-02-01 00:00:00')): 2.5374903855394737,
  ('h0_1', pd.to_datetime('2000-02-01 00:00:00')): 1.2991497904155045,
  ('h0_1', pd.to_datetime('2000-02-02 00:00:00')): 1.920059660128214,
  ('h0_1', pd.to_datetime('2000-02-03 00:00:00')): 1.0}})
display(df)
                        c0		
h0_0	2000-01-01	2.1482432004268444
h0_0	2000-01-02	1.5974533778936917
h0_0	2000-02-01	2.5374903855394737
h0_1	2000-02-01	1.2991497904155045
h0_1	2000-02-02	1.920059660128214
h0_1	2000-02-03	1.0

# Fit the transformer and create fourier features
trafo = FourierFeatures(sp_list=[7], fourier_terms_list=[1], freq="D")
trafo.fit(df)
result = trafo.transform(df)
display(result)
		        c0	sin_7_1	cos_7_1
h0_0	2000-01-01	2.148	0.0	1.0
h0_0	2000-01-02	1.597	0.782	0.623
h0_0	2000-02-01 *	2.537	0.434	-0.901 
h0_1	2000-02-01 *	1.299	0.0	1.0 
h0_1	2000-02-02	1.92	0.782	0.623
h0_1	2000-02-03	1.0	0.975	-0.223

The rows with the same time points (marked with an *) have different feature values.

@ltsaprounis - do you know if this is problematic or not? If it is, a simple solution would be to use a single reference time (i.e., the earliest time of all the time series) when computing the integer time index to pass to the sine and cosine functions. However, this would be modelling a shared seasonal component across all the multiple time series. What do you think?

@fkiraly
Copy link
Collaborator

fkiraly commented Dec 28, 2022

@KishManani, it could simply be an option: shared time axis (reference point possibly being the earliest time stamp occurring), or individual time axis, which is also the current (and recommended future?) default

@KishManani
Copy link
Contributor Author

@KishManani, it could simply be an option: shared time axis (reference point possibly being the earliest time stamp occurring), or individual time axis, which is also the current (and recommended future?) default

This is a very pragmatic way forward. I'll wait for @ltsaprounis to comment and leave this for a separate PR.

@KishManani
Copy link
Contributor Author

Hi @fkiraly! Any idea of whether this can be merged?

@fkiraly
Copy link
Collaborator

fkiraly commented Jan 29, 2023

it's been a month now without reaction, and this moves an existing transformer towards an interface desideratum (do not keep columns). There is also deprecation which makes this easily reversible should discussion come up.
I'm therefore making the call to merge it for 0.16.0.

@fkiraly fkiraly merged commit 24aab6e into sktime:main Jan 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants