# Time series classification

In this notebook I demonstrate how to do a classification task on panel data.

A classic example of this is churn prediction: for each time period, build a model that predicts the target variable 1 (churn) or 0 (no churn) based on data from previous time periods.

In [1]:
import pandas as pd 

## Methodology

Things to pay special attention to:

- Target definition: are you forecasting an event one period ahead, or multiple periods ahead?
- Cross validation: what time periods do you want to use for model selection, and which for testing?
- Class imbalance: is each class represented equally? If not, resample to reduce redundancy and improve model performance.

All three are demonstrated in this notebook's example.
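To make the first point concrete, here is a minimal sketch of how the forecast horizon changes the target. The toy frame, the column name `event` and the helper `add_future_target` are made up for illustration and are not part of the notebook, which uses a one-period horizon. The sketch assumes rows are sorted by `YEAR` within each `ID`.

```python
import pandas as pd

# Hypothetical toy panel: two individuals observed over four periods.
toy = pd.DataFrame({
    'ID':    [1, 1, 1, 1, 2, 2, 2, 2],
    'YEAR':  [1, 2, 3, 4, 1, 2, 3, 4],
    'event': [0, 0, 1, 0, 0, 0, 0, 1],
})

def add_future_target(frame, horizon):
    """Attach the value of `event` observed `horizon` periods later, per individual."""
    future = frame.groupby('ID')['event'].shift(-horizon)
    return frame.assign(**{f'event_in_{horizon}': future})

one_ahead = add_future_target(toy, horizon=1)
two_ahead = add_future_target(toy, horizon=2)

# A longer horizon leaves more rows at the end of each series without a target.
print(one_ahead['event_in_1'].isna().sum())  # 2 (last period of each ID)
print(two_ahead['event_in_2'].isna().sum())  # 4 (last two periods of each ID)
```

The longer the horizon, the fewer labelled rows remain, which matters for a panel that only spans a handful of periods.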

## Example

For this demo let's play with a small panel dataset called [Cornwell and Rupert Returns to Schooling Data](http://people.stern.nyu.edu/wgreene/Econometrics/PanelDataSets.htm). The data contains multivariate time series on 595 individuals, measured over 7 years.

One of the variables is marital status, where `MS == 1` means the person is married. In this exercise we are going to predict whether someone gets married in a given year, which is defined as: `MS_t-1 == 0 & MS_t == 1`.

### 1. Preprocessing data

First import the data.

In [2]:
df = pd.read_csv('../data/cornwell_rupert.csv').rename(columns=lambda x: x.strip())

Then select a subset of unmarried people, and define our target variable.

In [3]:
df_unmarried = (
    df
    .loc[df['ID'].isin(df.loc[(df['YEAR']==1) & (df['MS']==0)]['ID'])]  # subset only unmarried population
    .assign(MS_previous = lambda x: x.groupby('ID')['MS'].shift(1)) 
    .assign(got_married = lambda x: x['MS'] > x['MS_previous'])  # check if married that year
    .drop('MS_previous', axis=1)
)

Now we have to prepare our dataframe such that at each time period, we have:
- One record per individual
- The target variable whether someone got married that year (1 or 0)
- Various features that are based on previous time periods (married last 1 period, 3 periods, 6 periods, ...)

For the time based features, we need a function that calculates these given any original feature and the required time lag or shift. 

In this example, we are only going to use the last year to predict the next year. This means we are shifting all our features by 1 period. In other words, for every `X` we make sure that `X_t-1` is on the same row as the `y` we are predicting. This is handy because it allows for simple cross validation iterators later.

In [4]:
# function to lag features a specified number of periods
def shift_features(dataframe_in, *features, periods_to_shift=1, group_var='ID'):
    """Replaces features in a dataframe by their requested lag. 

    Args:
        dataframe_in: Input dataframe with features to lag
        *features: Names of the columns to lag, passed as separate arguments.
        periods_to_shift: Number of periods to shift each feature. Default is 1.
        group_var: Variable indicating the grouping within which to shift.

    Returns:
        dataframe_out: New dataframe in which each feature is replaced by its lagged version.
    """
    dataframe_out = dataframe_in.copy()
    
    for feature in features:
        feature_name = feature+'_lag'+str(periods_to_shift)
        dataframe_out[feature_name] = dataframe_out.groupby(group_var)[feature].shift(periods_to_shift)
        dataframe_out.drop(feature, axis=1, inplace=True)

    return dataframe_out

Apply and check if all went well:

In [5]:
# shift all features except indices and the target
features = [col for col in df_unmarried.columns if col not in ['ID', 'YEAR', 'got_married']]
X_all = shift_features(df_unmarried, *features)

X_all.head(9)[['got_married', 'ID', 'YEAR', 'MS_lag1']]

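| | got_married | ID | YEAR | MS_lag1 |
|---|---|---|---|---|
| 21 | False | 4 | 1 | NaN |
| 22 | False | 4 | 2 | 0.0 |
| 23 | False | 4 | 3 | 0.0 |
| 24 | False | 4 | 4 | 0.0 |
| 25 | False | 4 | 5 | 0.0 |
| 26 | False | 4 | 6 | 0.0 |
| 27 | False | 4 | 7 | 0.0 |
| 70 | False | 11 | 1 | NaN |
| 71 | False | 11 | 2 | 0.0 |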


Looks good, but now we have `NaN` values for the first year. Remove these from the data:

In [6]:
X_all = X_all.loc[X_all['YEAR'] > 1]

Great. This is the data we will be using for our model. As a final check, let's see whether the target is skewed in our data set.

In [7]:
def print_count_per_year(df):
    return (
        df
        .groupby('YEAR')
        .apply(lambda x: pd.Series({
            'got_married' : x['got_married'].sum(),
            'n_records' : len(x)
        }))
        .assign(perc_married = lambda x: x['got_married'] / x['n_records'] * 100)
    )

print_count_per_year(X_all)

| YEAR | got_married | n_records | perc_married |
|---|---|---|---|
| 2 | 3 | 105 | 2.857143 |
| 3 | 3 | 105 | 2.857143 |
| 4 | 4 | 105 | 3.809524 |
| 5 | 2 | 105 | 1.904762 |
| 6 | 2 | 105 | 1.904762 |
| 7 | 1 | 105 | 0.952381 |


Whoa! This is very skewed. Keeping it like this will make it very hard for a model to predict the right outcome. For example, a model that always predicts `got_married == 0` will be very hard to beat in terms of accuracy. 
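To put a number on that, here is a small sketch that computes how an always-negative model would score. The counts are copied from the table above; the variable names are my own.

```python
# Yearly counts copied from the table above: year -> (got_married, n_records).
counts = {2: (3, 105), 3: (3, 105), 4: (4, 105), 5: (2, 105), 6: (2, 105), 7: (1, 105)}

n_positive = sum(married for married, _ in counts.values())
n_total = sum(total for _, total in counts.values())

# A model that always predicts "not married" is right on every negative record.
baseline_accuracy = (n_total - n_positive) / n_total
# It never finds a positive, so its recall on the class we care about is zero.
baseline_recall = 0 / n_positive

print(f'{n_positive} positives out of {n_total} records')
print(f'always-negative accuracy: {baseline_accuracy:.3f}, recall: {baseline_recall:.1f}')
```

An accuracy of roughly 0.976 with zero recall is why accuracy is a poor metric here, and why the grid search later in the notebook scores on precision instead.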

A skew like this is quite normal for a churn-style prediction problem. The solution is to sample within each period to make the classes more balanced. This also reduces the redundancy that comes from otherwise including the same individual (`'ID'`) multiple times in the data set (once per year, which is 6 times in `X_all` after dropping year 1).

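The code below randomly keeps at most 5 records per class for each year. Since no year has more than 4 marriages, this keeps every person who got married and down-samples the people who didn't to 5 per year.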

In [8]:
X_sampled = (
    X_all
    .groupby(['YEAR', 'got_married'], group_keys=False)
    .apply(lambda x: x.sample(min(len(x), 5), random_state=777))
)

print_count_per_year(X_sampled)

| YEAR | got_married | n_records | perc_married |
|---|---|---|---|
| 2 | 3 | 8 | 37.5 |
| 3 | 3 | 8 | 37.5 |
| 4 | 4 | 9 | 44.444444 |
| 5 | 2 | 7 | 28.571429 |
| 6 | 2 | 7 | 28.571429 |
| 7 | 1 | 6 | 16.666667 |


Now we can make a time series classification model on who is getting married each period.

In [9]:
X = X_sampled.copy()
y = X.pop('got_married')

X.shape, y.shape

((45, 24), (45,))

### 2. Setting up the cross validation

There are 7 years of data to be used. Think about how you want to do model selection and evaluation. For example:
- **Train set**: years 1 to 5. Cross validation will look as follows:
    - The first two years are used for the first training set (`X` from period 1, `y` from period 2).
    - That means the first validation fold can be year 3.
    - This makes 3-fold time series cross validation possible (validation on years 3, 4 and 5).
- **Model evaluation**: years 6 and 7.

Note that if we had defined our target as predicting marriage 2 years ahead, we would have 1 gap year, and the first validation fold would be year 4. In that case, only 2 folds can be used for the cross validation.
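The fold counts for a one-period and a two-period horizon can be worked out with a few lines of plain Python. `validation_years` is a hypothetical helper written for this illustration, not something used elsewhere in the notebook.

```python
def validation_years(first_year, last_train_year, horizon):
    """List the years usable as validation folds for a given forecast horizon.

    Hypothetical helper: the earliest labelled rows pair features from
    `first_year` with a target from `first_year + horizon`, and that year
    must be kept for training before the first validation fold.
    """
    first_labelled_year = first_year + horizon
    first_validation_year = first_labelled_year + 1
    return list(range(first_validation_year, last_train_year + 1))

print(validation_years(1, 5, horizon=1))  # [3, 4, 5]
print(validation_years(1, 5, horizon=2))  # [4, 5]
```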

For training we use the *entire* history until time `t-1` to get as much training data as possible (the number of data points is pretty small, so we need as much as we can get). In other cases, it could make sense to train only on the most recent year of data. This can be a good option if you think old data is no longer relevant for the current time period, for example because the world has changed.
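The difference between the two training strategies can be sketched on period labels alone. The helper below and its `max_train_periods` parameter are hypothetical and only illustrate the idea; the iterator used in this notebook always takes the full history. The sketch assumes `periods` is sorted in ascending order.

```python
def expanding_or_sliding_folds(periods, validation_periods, max_train_periods=None):
    """Yield (train_periods, validation_period) pairs.

    Hypothetical helper for illustration only.
    max_train_periods=None keeps the full history (expanding window);
    an integer keeps only that many most recent periods (sliding window).
    """
    for period in validation_periods:
        history = [p for p in periods if p < period]
        if max_train_periods is not None:
            history = history[-max_train_periods:]
        yield history, period

years = [2, 3, 4, 5, 6, 7]  # years with a defined target in this notebook

expanding = list(expanding_or_sliding_folds(years, [3, 4, 5]))
sliding = list(expanding_or_sliding_folds(years, [3, 4, 5], max_train_periods=1))

print(expanding)  # [([2], 3), ([2, 3], 4), ([2, 3, 4], 5)]
print(sliding)    # [([2], 3), ([3], 4), ([4], 5)]
```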

We might be able to use `sklearn.model_selection.TimeSeriesSplit`, but here I would like to demonstrate how to create a custom cross validation iterator to be used in the grid search. We want to predict, say, period `t` (= test data) based on the values of periods `t-1, t-2, t-3, ...` (= train data).

In [10]:
def timeseries_cv(x, split_by, start_period=1, stop_period=None):
    """Cross validation iterator which predicts a period based on all previous periods.

    Args:
        x: DataFrame to iterate over.
        split_by: Name of the time column or index level. Periods are taken in order of appearance, so x should be sorted by it.
        start_period: Position (0-based) of the first validation period among the unique periods. Default of 1 means the first period is used for training only.
        stop_period: Position (exclusive) at which validation stops. By setting this, later periods can be kept apart for evaluation.

    Yields:
        train_index, test_index
    """
    
    x = x.reset_index()
    unique = x[split_by].unique()[start_period:stop_period]
    
    for value in unique:
        yield x[x[split_by] < value].index.values, x[x[split_by] == value].index.values

In [11]:
# define generator for creating train and test data folds
cv_splitter_train = timeseries_cv(X, 'YEAR', stop_period=4)
cv_splitter_test = timeseries_cv(X, 'YEAR', start_period=4)

# print('number of train/test folds:', (len(list(cv_splitter_train)), len(list(cv_splitter_test))))

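This creates 3 train folds and 2 test folds. You can check this yourself by uncommenting the last `print` line, but note that calling `list` on a generator consumes it, so re-create both splitters afterwards before using them.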
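Because the splitters are generators, they can only be iterated once. The toy generator below stands in for a cross validation iterator and shows what happens on a second pass; `count_up_to` is a made-up name for this illustration.

```python
def count_up_to(n):
    """Toy generator standing in for a cross validation iterator."""
    for i in range(n):
        yield i

folds = count_up_to(3)
first_pass = list(folds)   # consumes the generator
second_pass = list(folds)  # nothing left to yield

print(first_pass)   # [0, 1, 2]
print(second_pass)  # []

# Materialising the folds once gives a list that can be iterated repeatedly.
reusable_folds = list(count_up_to(3))
print(len(reusable_folds), len(list(reusable_folds)))  # 3 3
```

In this notebook, one way around it would be to wrap the call as `list(timeseries_cv(X, 'YEAR', stop_period=4))`, since a list of `(train, test)` index pairs can be counted and still be passed as `cv` afterwards.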

This satisfies our wishes: 
- For **model selection** we use folds where years 3, 4 and 5 are used for validation. Years 1 and 2 are reserved for training.
- For **model evaluation** we use folds with actuals from years 6 and 7 to compare with our model predictions.

### 3. Model selection

Now let's do a grid search to select our model. We hand the grid search algorithm X, y and a cross validation iterator. The iterator will then select the indices for each of the folds and subset the X and y data accordingly.


In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

model = RandomForestClassifier(random_state=707)  # set random seed to reproduce results

params = {'max_depth': [2, 3, 4, 10], 
          'n_estimators' : [10, 20, 50]}

grid_search = GridSearchCV(model, scoring='precision', param_grid=params, n_jobs=-1, cv=cv_splitter_train)

grid_search.fit(X, y)

print('Best params:', grid_search.best_params_, '\n')

Best params: {'max_depth': 2, 'n_estimators': 50} 





In [13]:
# print('All grid search results:')

# for comb in zip(grid_search.cv_results_['params'], grid_search.cv_results_['mean_test_score']):
#     print(comb)

### 4. Model evaluation

We are going to evaluate the model on the last 2 periods. This means for every period we have to retrain the model, score with it and get the results. The results over the 2 periods are then averaged to get the final evaluation metric.

In [14]:
from sklearn.metrics import confusion_matrix, precision_score

model = grid_search.best_estimator_

cv_splitter_test = timeseries_cv(X, 'YEAR', start_period=4)

# loop over the test data folds
for train_ind, test_ind in cv_splitter_test:
    
    print('Number of train/test obs in fold =', (len(train_ind), len(test_ind)))
    
    # select train and test data for the selected fold
    X_train, y_train = X.iloc[train_ind], y.iloc[train_ind]
    X_test, y_test = X.iloc[test_ind], y.iloc[test_ind]
    
    # fit and predict for the selected fold
    model.fit(X_train, y_train)
    y_test_pred = model.predict(X_test)

    # report on error metrics
    print('Precision:', precision_score(y_test, y_test_pred))
    print('Confusion matrix: \n', confusion_matrix(y_test, y_test_pred))

Number of train/test obs in fold = (32, 7)
Precision: 0.5
Confusion matrix: 
 [[4 1]
 [1 1]]
Number of train/test obs in fold = (39, 6)
Precision: 0.0
Confusion matrix: 
 [[4 1]
 [1 0]]


Looking at the evaluation metrics above we see:
* The first fold had 32 training observations (years 2, 3, 4 and 5) and 7 test observations from year 6.
* The 7 observations from year 6 are added to the training data for the last evaluation fold, which tests on year 7.
* In the first fold the model correctly identified one person who got married (the true positive in the bottom-right corner of the confusion matrix); in the second fold it identified none.
* The overall precision we can report is the average precision over evaluation folds: `(0.5 + 0) / 2 = 0.25`.
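
The precision figures above can be reproduced directly from the two confusion matrices. This is a plain Python sketch with the matrices copied from the output; the pooled variant at the end is an alternative that this notebook does not use and is shown only for comparison.

```python
# Confusion matrices copied from the output above, in scikit-learn's layout:
# rows are actual classes, columns are predicted classes, [[TN, FP], [FN, TP]].
fold_matrices = [
    [[4, 1], [1, 1]],  # tested on year 6
    [[4, 1], [1, 0]],  # tested on year 7
]

def precision(matrix):
    """Precision = TP / (TP + FP); taken as 0 when nothing is predicted positive."""
    false_pos, true_pos = matrix[0][1], matrix[1][1]
    predicted_pos = true_pos + false_pos
    return true_pos / predicted_pos if predicted_pos else 0.0

per_fold = [precision(m) for m in fold_matrices]
average_precision_over_folds = sum(per_fold) / len(per_fold)

# Alternative not used in the notebook: pool the counts over folds first.
pooled_tp = sum(m[1][1] for m in fold_matrices)
pooled_fp = sum(m[0][1] for m in fold_matrices)
pooled = pooled_tp / (pooled_tp + pooled_fp)

print(per_fold)                      # [0.5, 0.0]
print(average_precision_over_folds)  # 0.25
print(round(pooled, 3))              # 0.333
```

With so few positives per fold, a single prediction moves these numbers a lot, so they are best read as a rough indication.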

Note: we made a choice here to split our data into a *model selection* part (years 2-5) and a *model evaluation* part (years 6 and 7). An alternative could be to do *rolling evaluation* instead, as described in notebook `00-time-series-basics`.

Done.