-
-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tabularize dtype draft #316
Conversation
@mloning This is a simplified change. There are some behaviours I noticed and wanted your opinion on it: Starting with a nested df: from sktime.utils.data_container import tabularize, from_nested_to_long, detabularize
from sktime.utils._testing import generate_df_from_array
n_obs_X = 20
n_cols_X = 3
X = generate_df_from_array(np.random.normal(size=n_obs_X), n_rows=10,n_cols=n_cols_X)
X.head() # <- Returns 3 columns with nested series
# Tabularize (new implementation
Xt = tabularize(X)
Xt.head() # <- column is now is a tuple of (column name, time index)
Remarks:
Further remarks: Xd = detabularize(Xt)
Xd.head() # <- single column where each element is a series of series. Behaviour is the same in old implementation of tabularize
tabularize(detabularize(tabularize(X))) # <- The time index for 3 columns is merged (we lose information about the time index) A few questions I have about this behavior:
|
I tried one implementation with a MultiIndex setup for tabulzarize that would return consistent transformations (more extensive testing that visual checks needed): def explode_series(s):
s = s if isinstance(s, pd.Series) else pd.Series(s)
df = s.explode()
idx = lambda x: x.index if hasattr(x, "index") else np.arange(x.shape[0])
time = s.apply(lambda x: idx(x.index)).explode()
df_idx = pd.DataFrame([df.index, time]).T
df.index = pd.MultiIndex.from_frame(df_idx)
return df
# Need to add return_array
def tabularize(X):
Xt=(
X
.apply(explode_series)
.unstack(-1)
)
Xt.columns.names = ["column", "time_index"]
Xt.index.name = "index"
return Xt def detabularize(Xt):
def _detabularize_series(s):
df = pd.Series(s.values, index=s.index.values)
return pd.Series([df,], name=s.name)
def _detabularize_group(df):
temp = (
df.droplevel("column")
.apply(_detabularize_series).T
)
return temp
Xd = (
Xt.T
.groupby("column")
.apply(lambda x: _detabularize_group(x))
.unstack(0)
.droplevel(level=0, axis=1)
)
Xd.index.name = None
Xd.columns.name = None
return Xd Test everything is same: _assert_almost_equal(X, detabularize(tabularize(X))) # passes Note: This implementation will (potentially) allow consistency between transformations ([de]tabularize - nested_to_long - long_to_nested) but it requires all indexes to be unique. |
Thanks @hiqbal2 for your work on this!
Thanks again @hiqbal2 I really appreciate your work on this! 👍 |
@mloning, will try to answer some of the questions you raised.
Given the multiple ways the data can be transformed, I would go for a method that does not transform data types too much.
Does this mean that our TS models accept it too? As a simple example, if I run a simple regression model with duplicate column names, wouldn't it be confusing to return betas with common name but different values? Shouldn't we raise a warning or error. Similarly, how do we go from One potential method could be 1) raise a warning, and 2) rename with distinguishing names
Definitely. I can benchmark the computations. I also need to provide an ability to return numpy arrays too.
It should be possible to add in checks for dtypes of cells. However, since I'm new, if the tabular format mostly used within the sktime api? From how i understand, a user will likely provide data in a normal pandas dataframe with a time column and a group/stock id/id/observation column. Another request: all tests should be added to test_data_container.py and existing tests moved there from test_utils_forecasting.py Looks like I will be finally getting a crash course in testing :) Some general points/questions I had in mind:
Completely off-topic but:
|
Hi @hiqbal2 thanks for your comments. We've had a more in-depth discussion about our choice of data container, for more details see our wiki entry and speak to @prockenschaub who prototyped an extension of pandas for our use case. Another idea we're currently considering is adding support for 3d numpy arrays in addition to the nested pandas DataFrame.
I'd suggest to start with these two changes and then see what else we may have to change. |
@hamzahiqb please feel free to reopen if you start working on this again! |
Reference Issues/PRs
Initial draft for #311
What does this implement/fix? Explain your changes.
Saves the column as a tuple with column name and time index instead of the string implementation.
Does your contribution introduce a new dependency? If yes, which one?
Yes. Currently need to ensure that it does not break implementation. There are other secondary issues to consider.
PR checklist for new estimators
Any other comments?