# Random Forest with TS-Fresh Feature Selection

The tsfresh Python package is designed for extracting and selecting features from time series data. Using the features it selects, a random forest classifier achieves **79% accuracy** in predicting which students will ultimately succeed or fail, based on data from the first third of the course timeline. Importantly, it also achieves **87% precision on class 0 (failing students)**. Depending on the nature of the intervention this model would support, it may be worth sacrificing some precision in a real-world setting to improve recall on class 0. (For example, if the intervention were polite and optional, reaching students in danger of failing would probably matter more than avoiding contact with students who are actually fine.)

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

In [213]:
from tsfresh import extract_features, select_features, extract_relevant_features
from tsfresh.utilities.dataframe_functions import impute
from tsfresh.feature_extraction.settings import MinimalFCParameters, ComprehensiveFCParameters
from tsfresh.transformers import RelevantFeatureAugmenter

In [71]:
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
import xgboost as xgb

In [4]:
stu_vle = pd.read_csv('data/stvl_ccc14b.csv')
stu_as = pd.read_csv('data/stas_ccc14b.csv')
ass = pd.read_csv('data/ass_ccc14b.csv')
vle = pd.read_csv('data/vle_ccc14b.csv')
stu_info = pd.read_csv('data/stuinfo_ccc14b.csv')
stu_reg = pd.read_csv('data/stureg_ccc14b.csv')

# Preprocessing

## NOTE: This is a demonstration of the preprocessing required by the tsfresh transformer. All of this preprocessing is encapsulated in the `ts_preprocess` function in `src.util`, for easier use of the pipeline in the future.

### Drop early withdrawal students

Or rather, determine which students to keep.

In [5]:
stu_reg.head()

Unnamed: 0,code_module,code_presentation,id_student,date_registration,date_unregistration
0,CCC,2014B,28418,-37.0,
1,CCC,2014B,29764,-34.0,
2,CCC,2014B,29820,-57.0,
3,CCC,2014B,40333,-30.0,17.0
4,CCC,2014B,40604,-17.0,


In [6]:
pop_of_interest = stu_reg.drop(stu_reg[stu_reg.date_unregistration <= 67].index)['id_student'].unique()

In [7]:
len(stu_vle.drop(stu_vle[~stu_vle.id_student.isin(pop_of_interest)].index)['id_student'].unique()) # num students left in course who clicked

1300

In [12]:
clkd_pop_of_interest = stu_vle.drop(stu_vle[~stu_vle.id_student.isin(pop_of_interest)].index)['id_student'].unique()
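
The `drop(...index)` pattern used above can also be written as a boolean mask. The sketch below runs on invented toy rows (not the course data) and shows that students with no unregistration date are kept, because the comparison `NaN <= 67` evaluates to False.

```python
import numpy as np
import pandas as pd

# Toy registration table; the values are invented for illustration only.
toy_reg = pd.DataFrame({
    "id_student": [1, 2, 3, 4],
    "date_unregistration": [np.nan, 17.0, 67.0, 120.0],
})

# True for students who unregistered on or before day 67; NaN compares as False.
early_withdrawal = toy_reg["date_unregistration"] <= 67

# Mask-based form: keep everyone who is not an early withdrawal.
kept_ids = toy_reg.loc[~early_withdrawal, "id_student"].unique()

# Drop-based form, as in the notebook cell above; it yields the same IDs.
dropped_form = toy_reg.drop(toy_reg[early_withdrawal].index)["id_student"].unique()
```

Both forms keep students 1 and 4 in this toy table; the mask version avoids building an intermediate index to drop.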

### Make y (targets) column

In [10]:
stu_info.head()

Unnamed: 0,code_module,code_presentation,id_student,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result
0,CCC,2014B,28418,F,West Midlands Region,A Level or Equivalent,20-30%,0-35,0,30,N,Fail
1,CCC,2014B,29764,M,East Anglian Region,A Level or Equivalent,50-60%,0-35,0,90,N,Distinction
2,CCC,2014B,29820,M,East Anglian Region,HE Qualification,40-50%,0-35,0,60,N,Pass
3,CCC,2014B,40333,M,North Region,HE Qualification,0-10%,35-55,0,30,N,Withdrawn
4,CCC,2014B,40604,M,Ireland,A Level or Equivalent,,35-55,0,30,N,Pass


In [13]:
y_col = stu_info.drop(stu_info[~stu_info.id_student.isin(clkd_pop_of_interest)].index)

In [14]:
y_col = y_col.drop(['code_module', 'code_presentation', 'gender', 'region', 'highest_education', 'imd_band', 'age_band', 'num_of_prev_attempts', 'studied_credits', 'disability'], axis=1)
# map outcomes to binary labels: 1 = success (Pass/Distinction), 0 = failure (Fail/Withdrawn)
y_col['final_result'] = y_col['final_result'].map(dict(Pass=1, Distinction=1, Fail=0, Withdrawn=0))
y_col = y_col.sort_values(by=['id_student'])

In [15]:
y_col.head()

Unnamed: 0,id_student,final_result
0,28418,0
1,29764,1
2,29820,1
4,40604,1
5,42638,1


In [79]:
y_col = y_col.set_index('id_student')

In [82]:
y = y_col['final_result']
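
The target construction above can be condensed into a single chain. This is a minimal sketch on an invented four-row table; `toy_info` and `label_map` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy student-info table; rows are invented for illustration only.
toy_info = pd.DataFrame({
    "id_student": [30, 10, 20, 40],
    "final_result": ["Pass", "Fail", "Distinction", "Withdrawn"],
})

# 1 = success (Pass/Distinction), 0 = failure (Fail/Withdrawn).
label_map = {"Pass": 1, "Distinction": 1, "Fail": 0, "Withdrawn": 0}

# Sort by student, index by student ID, and map outcomes to binary labels.
toy_y = (
    toy_info.sort_values("id_student")
    .set_index("id_student")["final_result"]
    .map(label_map)
)
```

One property of `.map` worth knowing here: any outcome missing from the dictionary becomes NaN, which makes an unexpected category easy to spot.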

### Train-test split BEFORE feature-extraction

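This is very important: tsfresh selects relevant features using the target labels, so selecting features before the split would leak test-set information, and the model would perform significantly worse in production than its test score suggests.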

In [83]:
y_train, y_test = train_test_split(y, test_size=0.2)

In [88]:
len(y_train), len(y_test)

(1040, 260)
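
As a sketch of the split-by-student idea, the example below splits toy labels first and then routes each student's click rows to the matching side. The `stratify` and `random_state` arguments are assumptions added here for balance and reproducibility; the notebook cell above uses neither.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels indexed by student ID (invented data: 10 students per class).
toy_y = pd.Series([0, 1] * 10, index=pd.Index(range(100, 120), name="id_student"))

# Toy click stream: three daily rows per student.
toy_clicks = pd.DataFrame({
    "id_student": [i for i in toy_y.index for _ in range(3)],
    "date": [0, 1, 2] * len(toy_y),
    "sum_click": 1,
})

# Split the labels (one row per student) BEFORE touching the time series.
y_tr, y_te = train_test_split(toy_y, test_size=0.2, stratify=toy_y, random_state=0)

# Route each student's rows to the side their label landed on.
clicks_tr = toy_clicks[toy_clicks["id_student"].isin(y_tr.index)]
clicks_te = toy_clicks[toy_clicks["id_student"].isin(y_te.index)]
```

Because the split is made on student IDs, no student contributes rows to both sides.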

### Make students click stream data (from which to generate features)

In [91]:
df_train = stu_vle.loc[stu_vle.id_student.isin(y_train.index)]
df_test = stu_vle.loc[stu_vle.id_student.isin(y_test.index)]

In [92]:
df_train = df_train.drop(['code_module', 'code_presentation', 'id_site'], axis=1)
df_train = df_train.groupby(['id_student', 'date']).sum() #sum daily clicks per student
df_train = df_train.reset_index()

In [94]:
df_test = df_test.drop(['code_module', 'code_presentation', 'id_site'], axis=1)
df_test = df_test.groupby(['id_student', 'date']).sum() #sum daily clicks per student
df_test = df_test.reset_index()

In [99]:
df_train.shape, df_test.shape

((66932, 3), (15815, 3))
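
To make the daily aggregation concrete, here is a small sketch on invented rows: a student who clicks on two different sites on the same day ends up with a single row holding the total.

```python
import pandas as pd

# Toy VLE log; rows are invented for illustration only.
toy_vle = pd.DataFrame({
    "id_student": [1, 1, 1, 2],
    "id_site": [10, 11, 10, 10],
    "date": [0, 0, 1, 0],
    "sum_click": [2, 3, 4, 5],
})

# Drop the site identifier, then total the clicks per student per day.
daily = (
    toy_vle.drop(columns=["id_site"])
    .groupby(["id_student", "date"], as_index=False)["sum_click"]
    .sum()
)
```

Student 1's two day-0 rows (2 and 3 clicks) collapse into one row with 5 clicks, giving the long `id_student`, `date`, `sum_click` layout that the feature augmenter expects.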

### Make X input for tsfresh feature augmenter (just student IDs of each group)

In [100]:
X_train = pd.DataFrame(index=y_train.index)
X_test = pd.DataFrame(index=y_test.index)

In [103]:
X_train.head()

147675
601949
251385
2016517
383158


# Create and run pipeline with feature extraction and classifier

This may look strange, but setting the parameters this way is precisely what the tsfresh docs outline for the use case of selecting relevant features on the train set and then computing only those features for the test set (that is, NOT running feature selection on the test set). See the docs at https://github.com/blue-yonder/tsfresh/blob/master/notebooks/pipeline_with_two_datasets.ipynb

In [104]:
ppl = Pipeline([('fresh', RelevantFeatureAugmenter(column_id='id_student', column_sort='date', 
                                                   default_fc_parameters=MinimalFCParameters())),
                ('clf', RandomForestClassifier())])

In [105]:
# for the fit on the train set, we set fresh__timeseries_container to `df_train`
ppl.set_params(fresh__timeseries_container=df_train)
ppl.fit(X_train, y_train)

Feature Extraction: 100%|██████████| 20/20 [00:00<00:00, 97.19it/s]
Feature Extraction: 100%|██████████| 20/20 [00:00<00:00, 257.30it/s]


Pipeline(memory=None,
         steps=[('fresh',
                 RelevantFeatureAugmenter(chunksize=None,
                                          column_id='id_student',
                                          column_kind=None, column_sort='date',
                                          column_value=None,
                                          default_fc_parameters={'length': None,
                                                                 'maximum': None,
                                                                 'mean': None,
                                                                 'median': None,
                                                                 'minimum': None,
                                                                 'standard_deviation': None,
                                                                 'sum_values': None,
                                                                 'variance': None},
                    

In [106]:
# for the predict on the test set, we set fresh__timeseries_container to `df_test`
ppl.set_params(fresh__timeseries_container=df_test)
y_pred = ppl.predict(X_test)

Feature Extraction: 100%|██████████| 20/20 [00:00<00:00, 782.50it/s]


In [107]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.74      0.76       131
           1       0.75      0.80      0.77       129

    accuracy                           0.77       260
   macro avg       0.77      0.77      0.77       260
weighted avg       0.77      0.77      0.77       260
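
The container-swapping pattern used in this pipeline can be illustrated without tsfresh. The sketch below uses a hypothetical `ToyAugmenter` class that only computes a few per-student aggregates; it is a simplified stand-in that performs no relevance filtering, and it is not how `RelevantFeatureAugmenter` is implemented. All data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline


class ToyAugmenter(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: turns a long time-series table into per-ID features."""

    def __init__(self, column_id="id_student", column_value="sum_click",
                 timeseries_container=None):
        self.column_id = column_id
        self.column_value = column_value
        self.timeseries_container = timeseries_container

    def fit(self, X, y=None):
        return self  # nothing to learn in this simplified version

    def transform(self, X):
        grouped = self.timeseries_container.groupby(self.column_id)[self.column_value]
        feats = grouped.agg(["mean", "max", "sum", "count"])
        # X carries only the IDs (its index); align the features to it.
        return feats.reindex(X.index).fillna(0)


# Synthetic data: class-1 students click more on average.
rng = np.random.default_rng(0)
ids = np.arange(40)
y_all = pd.Series(ids % 2, index=pd.Index(ids, name="id_student"))
rows = [(sid, day, rng.poisson(2 + 6 * y_all[sid])) for sid in ids for day in range(10)]
clicks = pd.DataFrame(rows, columns=["id_student", "date", "sum_click"])

train_ids, test_ids = ids[:30], ids[30:]
y_tr = y_all.loc[train_ids]
X_tr = pd.DataFrame(index=y_tr.index)                  # IDs only, no columns
X_te = pd.DataFrame(index=y_all.loc[test_ids].index)

pipe = Pipeline([("fresh", ToyAugmenter()),
                 ("clf", RandomForestClassifier(n_estimators=50, random_state=0))])

# Point the augmenter at the train rows, then fit.
pipe.set_params(fresh__timeseries_container=clicks[clicks["id_student"].isin(train_ids)])
pipe.fit(X_tr, y_tr)

# Swap in the test rows before predicting.
pipe.set_params(fresh__timeseries_container=clicks[clicks["id_student"].isin(test_ids)])
toy_pred = pipe.predict(X_te)
```

The point of the design is that `X` only identifies which students to score, while the raw series travel separately through `set_params`, so the same fitted pipeline can be reused on any new batch of students.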



# Again but with diffs

In [112]:
df_trainpiv = df_train.pivot(index='id_student', columns='date', values='sum_click')
df_trainpiv.head()

id_student,-18,-17,-16,-15,-14,-13,-12,-11,-10,-9,...,232,233,234,235,236,237,238,239,240,241
28418,,,,,,,,,,,...,,,,,,,,,,
29820,2.0,,,,,,,,,,...,,,,,,,,,,
40604,,,4.0,9.0,,,9.0,,24.0,10.0,...,,,,,1.0,,,,,
42638,20.0,22.0,7.0,9.0,,5.0,,5.0,2.0,3.0,...,,7.0,32.0,9.0,,5.0,,18.0,9.0,
45664,,,,,,,,2.0,1.0,,...,,,,,,,,,,


In [114]:
df_trainpiv = df_trainpiv.fillna(0)

In [116]:
df_trpivdif = df_trainpiv.diff(axis=1)
df_trpivdif.drop(-18, axis=1, inplace=True) #drop first column which becomes all NaNs after diff

In [117]:
df_trpivdif.head()

id_student,-17,-16,-15,-14,-13,-12,-11,-10,-9,-8,...,232,233,234,235,236,237,238,239,240,241
28418,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
29820,-2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
40604,0.0,4.0,5.0,-9.0,0.0,9.0,-9.0,24.0,-14.0,-3.0,...,0.0,0.0,0.0,0.0,1.0,-1.0,0.0,0.0,0.0,0.0
42638,2.0,-15.0,2.0,-9.0,5.0,-5.0,5.0,-3.0,1.0,-3.0,...,-6.0,7.0,25.0,-23.0,-9.0,5.0,-5.0,18.0,-9.0,-9.0
45664,0.0,0.0,0.0,0.0,0.0,0.0,2.0,-1.0,-1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [130]:
df_trpivdif = df_trpivdif.stack() #back to tsfresh format

In [133]:
df_trpivdif = df_trpivdif.reset_index()

In [134]:
df_testpiv = df_test.pivot(index='id_student', columns='date', values='sum_click')
df_testpiv = df_testpiv.fillna(0)
df_tspivdif = df_testpiv.diff(axis=1)
df_tspivdif.drop(-18, axis=1, inplace=True) #drop first column which becomes all NaNs after diff
df_tspivdif = df_tspivdif.stack() #back to tsfresh format
df_tspivdif = df_tspivdif.reset_index()

In [135]:
df_tspivdif.head()

Unnamed: 0,id_student,date,0
0,29764,-17,2.0
1,29764,-16,-14.0
2,29764,-15,-1.0
3,29764,-14,0.0
4,29764,-13,0.0
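
The pivot, fill, diff, and stack steps above are easier to follow on a tiny example. This sketch uses invented rows, and the output column name `click_diff` is an illustrative choice (the notebook leaves the column named `0`).

```python
import pandas as pd

# Toy daily clicks; student 2 has no row for day 1.
long_df = pd.DataFrame({
    "id_student": [1, 1, 1, 2, 2],
    "date": [0, 1, 2, 0, 2],
    "sum_click": [3, 5, 4, 2, 6],
})

# Wide layout: one row per student, one column per day, missing days as 0 clicks.
wide = long_df.pivot(index="id_student", columns="date", values="sum_click").fillna(0)

# Day-over-day change; the first column is all NaN after diff, so skip it.
diffs = wide.diff(axis=1).iloc[:, 1:]

# Back to the long layout: one row per (student, day).
long_diffs = diffs.stack().rename("click_diff").reset_index()
```

Student 2's missing day is treated as zero clicks, so the diffs for that student are -2 and then +6. Selecting `iloc[:, 1:]` drops the first column by position rather than by a hard-coded day label.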


### Pipeline

In [136]:
ppl2 = Pipeline([('fresh', RelevantFeatureAugmenter(column_id='id_student', column_sort='date', 
                                                   default_fc_parameters=MinimalFCParameters())),
                ('clf', RandomForestClassifier())])

In [137]:
# for the fit on the train set, we set fresh__timeseries_container to `df_trpivdif` (differenced train data)
ppl2.set_params(fresh__timeseries_container=df_trpivdif)
ppl2.fit(X_train, y_train)

Feature Extraction: 100%|██████████| 20/20 [00:00<00:00, 209.00it/s]
Feature Extraction: 100%|██████████| 20/20 [00:00<00:00, 229.00it/s]


Pipeline(memory=None,
         steps=[('fresh',
                 RelevantFeatureAugmenter(chunksize=None,
                                          column_id='id_student',
                                          column_kind=None, column_sort='date',
                                          column_value=None,
                                          default_fc_parameters={'length': None,
                                                                 'maximum': None,
                                                                 'mean': None,
                                                                 'median': None,
                                                                 'minimum': None,
                                                                 'standard_deviation': None,
                                                                 'sum_values': None,
                                                                 'variance': None},
                    

In [138]:
# for the predict on the test set, we set fresh__timeseries_container to `df_tspivdif` (differenced test data)
ppl2.set_params(fresh__timeseries_container=df_tspivdif)
y_pred2 = ppl2.predict(X_test)

Feature Extraction: 100%|██████████| 20/20 [00:00<00:00, 794.44it/s]


In [139]:
print(classification_report(y_test, y_pred2))

              precision    recall  f1-score   support

           0       0.74      0.73      0.74       131
           1       0.73      0.74      0.74       129

    accuracy                           0.74       260
   macro avg       0.74      0.74      0.74       260
weighted avg       0.74      0.74      0.74       260



Worse with diffs this time: accuracy drops from 0.77 to 0.74.

# Parameter play

In [214]:
ppl3 = Pipeline([('fresh', RelevantFeatureAugmenter(column_id='id_student', column_sort='date', 
                                                   default_fc_parameters=ComprehensiveFCParameters())),
                ('clf', RandomForestClassifier(n_estimators = 200,
                                max_depth=5,
                                random_state = 21))])

In [215]:
# for the fit on the train set, we set fresh__timeseries_container to `df_train`
ppl3.set_params(fresh__timeseries_container=df_train)
ppl3.fit(X_train, y_train)

Feature Extraction: 100%|██████████| 20/20 [00:39<00:00,  1.37s/it]


Feature Extraction: 100%|██████████| 20/20 [00:26<00:00,  1.01it/s]


Pipeline(memory=None,
         steps=[('fresh',
                 RelevantFeatureAugmenter(chunksize=None,
                                          column_id='id_student',
                                          column_kind=None, column_sort='date',
                                          column_value=None,
                                          default_fc_parameters={'abs_energy': None,
                                                                 'absolute_sum_of_changes': None,
                                                                 'agg_autocorrelation': [{'f_agg': 'mean',
                                                                                          'maxlag': 40},
                                                                                         {'f_agg': 'median',
                                                                                          'maxlag': 40},
                                                                              

In [216]:
# for the predict on the test set, we set fresh__timeseries_container to `df_test`
ppl3.set_params(fresh__timeseries_container=df_test)
y_pred3 = ppl3.predict(X_test)

Feature Extraction: 100%|██████████| 20/20 [00:05<00:00,  4.54it/s]


In [217]:
print(classification_report(y_test, y_pred3))

              precision    recall  f1-score   support

           0       0.87      0.69      0.77       131
           1       0.74      0.89      0.81       129

    accuracy                           0.79       260
   macro avg       0.80      0.79      0.79       260
weighted avg       0.80      0.79      0.79       260



**That's more like it.** This model achieved 79% accuracy and 87% precision on class 0 (failing students). Depending on the nature of the intervention that this model would support, it may be worth sacrificing some precision in a real-world setting to improve recall on class 0.
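
One common way to trade precision for recall on class 0 is to lower the probability threshold at which a student is flagged as failing. The sketch below is a minimal illustration on synthetic data from `make_classification`, not on the course data; the 0.3 threshold and the `report` helper are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the extracted feature matrix.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=21)
clf.fit(X_tr, y_tr)

# Column 0 of predict_proba is P(class 0) because classes_ is sorted.
p_fail = clf.predict_proba(X_te)[:, 0]


def report(threshold):
    """Precision and recall on class 0 when flagging P(class 0) >= threshold."""
    pred = np.where(p_fail >= threshold, 0, 1)
    precision = precision_score(y_te, pred, pos_label=0, zero_division=0)
    recall = recall_score(y_te, pred, pos_label=0)
    return precision, recall


prec_default, rec_default = report(0.5)
prec_low, rec_low = report(0.3)  # flag more students as at risk
```

Lowering the threshold can only add students to the flagged group, so recall on class 0 cannot decrease; precision usually falls in exchange.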

In [210]:
ppl3.named_steps['clf'].feature_importances_

array([0.391     , 0.31416356, 0.11127336, 0.05598614, 0.04973959,
       0.04213917, 0.01146875, 0.02422944])

In [201]:
ppl4 = Pipeline([('fresh', RelevantFeatureAugmenter(column_id='id_student', column_sort='date', 
                                                   default_fc_parameters=MinimalFCParameters())),
                ('clf', xgb.XGBClassifier(n_estimators    = 300, 
                              max_depth    =  5,
                              random_state = 21))])

In [202]:
# for the fit on the train set, we set fresh__timeseries_container to `df_train`
ppl4.set_params(fresh__timeseries_container=df_train)
ppl4.fit(X_train, y_train)

Feature Extraction: 100%|██████████| 20/20 [00:00<00:00, 241.92it/s]
Feature Extraction: 100%|██████████| 20/20 [00:00<00:00, 257.60it/s]


Pipeline(memory=None,
         steps=[('fresh',
                 RelevantFeatureAugmenter(chunksize=None,
                                          column_id='id_student',
                                          column_kind=None, column_sort='date',
                                          column_value=None,
                                          default_fc_parameters={'length': None,
                                                                 'maximum': None,
                                                                 'mean': None,
                                                                 'median': None,
                                                                 'minimum': None,
                                                                 'standard_deviation': None,
                                                                 'sum_values': None,
                                                                 'variance': None},
                    

In [203]:
# for the predict on the test set, we set fresh__timeseries_container to `df_test`
ppl4.set_params(fresh__timeseries_container=df_test)
y_pred4 = ppl4.predict(X_test)

Feature Extraction: 100%|██████████| 20/20 [00:00<00:00, 8312.96it/s]


In [204]:
print(classification_report(y_test, y_pred4))

              precision    recall  f1-score   support

           0       0.82      0.73      0.77       131
           1       0.75      0.84      0.79       129

    accuracy                           0.78       260
   macro avg       0.78      0.78      0.78       260
weighted avg       0.78      0.78      0.78       260

