Problems interpreting variable importances in multivariate time series forest #736

flying-scotsman · 2021-03-23T09:05:00Z

Hi sktime community!

I'm trying to understand variable importances in a multivariate time series forest. The attached diagram shows the 3 standard feature importances for a time series forest fitted on a dataset with 103 instances, 2 variables (here 1 & 2) and 28 time points. What confuses me is that the curves appear continuous at the boundary between the two variables: after concatenation, the importance for the last time point of variable 1 sits directly before the importance for the first time point of variable 2. I've also observed that the order of the variables in the concatenated series affects the feature importances.

Can anyone explain what's happening here? What I would expect: discontinuous (but normalized) variable importance curves, one per variable.
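To make the layout concrete, here is a small sketch in plain NumPy (not sktime code) of how positions in the concatenated series map back to a variable and a time point, and how many candidate intervals would contain points from both variables. It assumes intervals can start and end anywhere in the full length-56 series; whether the forest samples intervals this way is my assumption, and `min_length = 3` is only an illustrative value.

```python
import numpy as np

n_vars, n_timepoints = 2, 28            # shape described in the post
series_length = n_vars * n_timepoints   # 56 points after concatenation

# Map each position in the concatenated series back to (variable, time point).
positions = np.arange(series_length)
variable = positions // n_timepoints + 1   # 1-based, matching the plot labels
timepoint = positions % n_timepoints

# Enumerate every half-open interval [start, end) with at least min_length
# points (min_length is an illustrative assumption) and keep those that
# contain points from both variables.
min_length = 3
boundary = n_timepoints
intervals = [
    (start, end)
    for start in range(series_length - min_length + 1)
    for end in range(start + min_length, series_length + 1)
]
crossing = [(s, e) for s, e in intervals if s < boundary < e]
crossing_fraction = len(crossing) / len(intervals)

print(f"{len(crossing)} of {len(intervals)} intervals cross the boundary")
```

If importance is credited to every time point inside an interval, intervals like these would spread it across both sides of the boundary, which might be one source of the smooth transition.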

Here's the code I'm using to extract the importances (df is a pandas.DataFrame with pandas.Series as cells and labels is a pandas.Series):

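# Import paths assumed for sktime 0.5.x; adjust them for your version.
import numpy
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sktime.classification.interval_based import TimeSeriesForestClassifier
from sktime.transformations.panel.compose import ColumnConcatenator

X_train, X_test, y_train, y_test = train_test_split(df, labels, random_state=42)

steps = [
    ("concatenate", ColumnConcatenator()),  # join all variables into one long series
    ("classify", TimeSeriesForestClassifier(n_estimators=200)),
]

clf = Pipeline(steps)
clf.fit(X_train, y_train)

# One row per time point of the concatenated series
importance = clf.named_steps["classify"].feature_importances_
# Label each row with the variable it came from
importance['feature'] = numpy.repeat(list(X_train.columns), len(df.iat[0, 0]))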
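For reference, this is roughly how the labelled table can be split back into one curve per variable. It is a self-contained sketch: the random `importance` table is a synthetic stand-in for `feature_importances_`, and the column names `mean`, `std`, `slope` as well as the variable names `var_1`, `var_2` are assumptions for illustration, not taken from the real output.

```python
import numpy as np
import pandas as pd

n_vars, n_timepoints = 2, 28
columns = ["var_1", "var_2"]  # hypothetical stand-in for X_train.columns

# Synthetic stand-in for feature_importances_: one row per position in the
# concatenated series, one column per importance type (names are assumed).
rng = np.random.default_rng(0)
importance = pd.DataFrame(
    rng.random((n_vars * n_timepoints, 3)), columns=["mean", "std", "slope"]
)

# Label each row with its source variable and its original time point.
importance["feature"] = np.repeat(columns, n_timepoints)
importance["timepoint"] = np.tile(np.arange(n_timepoints), n_vars)

# One curve per (importance type, variable), indexed by original time point.
per_variable = importance.pivot(index="timepoint", columns="feature")

print(per_variable["mean"].head())
```

Plotting `per_variable["mean"]` then gives two separate curves over the 28 time points rather than one curve over 56 positions.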

Replies: 0 comments
