# Time series basics

There are multiple use cases for time series data:

- Classification: predict next period's class. 
    - Example: churn y/n?
- Regression: predict next period's value. 
    - Example: stock price?
- Outlier detection: discover unusual values. 
    - Example: sudden spikes or crashes?

## Methodology

### Why it matters

Especially with time series, be careful **not** to leak information from the future into your training set. This would be cheating and could give the false impression that you have a good model. Therefore, take the time to get the following topics right.

### Target definition

What exactly are you trying to forecast?

* If it is churn in the coming two months, you need two months of 'gap time' before you can add your target column to your feature data set, because the target for a given month is only known two months later.
* If dealing with multiple time dimensions (like incident month and book month), think carefully about which dimension has business value to forecast (likely book month, as the finance department can make a reservation for it).
* Etc...
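
The gap time can be made concrete with a small sketch. The table layout (`client`, `month`, `active`), the helper name `add_churn_target` and the churn definition ("no activity in the coming two months") are all assumptions for illustration, not a prescribed schema.

```python
import pandas as pd

# Hypothetical long-format activity table: one row per client per month.
activity = pd.DataFrame({
    "client": ["a"] * 6 + ["b"] * 6,
    "month": list(pd.period_range("2020-01", periods=6, freq="M")) * 2,
    "active": [1, 1, 1, 1, 1, 1,   # client a stays active
               1, 1, 1, 0, 0, 0],  # client b stops after March
})

HORIZON = 2  # assumed target: no activity in the coming two months


def add_churn_target(df, horizon):
    """Add a `churn` column; rows whose target window is not fully observed get NaN."""
    df = df.sort_values(["client", "month"]).copy()
    grouped = df.groupby("client")["active"]
    # Activity summed over the next `horizon` months, excluding the current month.
    # Missing future months propagate as NaN, which marks the gap time.
    future = sum(grouped.shift(-k) for k in range(1, horizon + 1))
    df["churn"] = (future == 0).astype("float").where(future.notna())
    return df


labelled = add_churn_target(activity, HORIZON)
# Only rows with a known target can be used for training.
trainable = labelled.dropna(subset=["churn"])
print(trainable)
```

The last two months of every client end up without a target, which is exactly the gap time described above.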

In the case of classification, examine the distribution over the classes. If the classes are very unbalanced, resample to get more equal proportions of each class. For example, remove non-churners if the number of churners is relatively small.
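
A minimal sketch of such downsampling, assuming a pandas DataFrame with a binary target column. The helper `downsample_majority` and its `ratio` parameter are hypothetical names. It is usually safer to resample only the training rows, so that evaluation folds keep the real class proportions.

```python
import pandas as pd


def downsample_majority(df, target_col, ratio=1.0, seed=0):
    """Keep all minority rows and at most `ratio` times as many majority rows."""
    counts = df[target_col].value_counts()
    minority_label = counts.idxmin()
    minority = df[df[target_col] == minority_label]
    majority = df[df[target_col] != minority_label]
    n_keep = min(len(majority), int(ratio * len(minority)))
    sampled = majority.sample(n=n_keep, random_state=seed)
    # Restore the original row order after combining both parts.
    return pd.concat([minority, sampled]).sort_index()


# Toy data: 5 churners versus 95 non-churners.
toy = pd.DataFrame({"churn": [1] * 5 + [0] * 95, "nr_clicks": range(100)})
balanced = downsample_majority(toy, target_col="churn")
print(balanced["churn"].value_counts())
```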

### Features

Set up your dataset in 'long' format, also in the case of panel data where you have multiple dimensions like `client` and `time` (2-dimensional). In this example the feature table will have `client-time` as index, such that each client individually has multiple rows (for each timestamp) and each timestamp has multiple rows (for each of the clients). 

By doing this you can save lots of computation time when doing the cross validation, as you can now easily *slice* the rows that you need for each of the folds without recalculating the feature table for each fold.
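
As an illustration, the sketch below builds a small long-format table and slices it by time. The column name `nr_clicks` and the cut-off month are made up for the example.

```python
import pandas as pd

clients = ["a", "b", "c"]
months = pd.period_range("2020-01", periods=6, freq="M")

# One row per (client, time) combination: the 'long' format.
index = pd.MultiIndex.from_product([clients, months], names=["client", "time"])
features = pd.DataFrame({"nr_clicks": list(range(len(index)))}, index=index)

# Slicing a fold is a cheap boolean mask on the time level;
# the feature table itself is computed only once.
time = features.index.get_level_values("time")
cutoff = pd.Period("2020-05", freq="M")
train = features[time < cutoff]
test = features[time == cutoff]
print(train.shape, test.shape)
```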

#### Lookback features

When forecasting, for example, churn, in each of the cross validation folds (see below) we have to forecast whether a client is churning at a specific `time`. This means we have to flatten the `client-time` feature data set to a data set indexed by `client` only. To keep some of the time information in there, you will have to create features like:
* Number of clicks in last 1 month
* Number of clicks in last 3 months
* Number of clicks in last 6 months
* ...

We call these *lookback features*. It's best to create these using functions like:

```python 
create_lookback_features(feature_name='nr_clicks', months=[1, 3, 6])
```

These features ought to be added to the feature data set **before** model selection, such that model selection can be done simply by slicing through the feature data. See comment above.
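
One possible implementation of such a function is sketched below. It takes the DataFrame as an extra first argument, and it assumes one row per client per month without gaps and a window that includes the row's own month. These choices are assumptions, so adapt them to what is actually known at prediction time.

```python
import pandas as pd


def create_lookback_features(df, feature_name, months):
    """Add rolling sums of `feature_name` over the last `months` months per client.

    Assumes columns `client` and `time`, one row per client per month, no gaps.
    """
    df = df.sort_values(["client", "time"]).copy()
    grouped = df.groupby("client")[feature_name]
    for m in months:
        # min_periods=m leaves NaN where less than m months of history exist.
        df[f"{feature_name}_last_{m}m"] = grouped.transform(
            lambda s, m=m: s.rolling(m, min_periods=m).sum()
        )
    return df


clicks = pd.DataFrame({
    "client": ["a"] * 4 + ["b"] * 4,
    "time": list(pd.period_range("2020-01", periods=4, freq="M")) * 2,
    "nr_clicks": [1, 2, 3, 4, 10, 0, 0, 5],
})
lookback = create_lookback_features(clicks, feature_name="nr_clicks", months=[1, 3])
print(lookback)
```

Because the rolling window is computed per client, history from one client never ends up in the features of another.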

#### RBF features

If you want to use time as a feature, try using [radial basis functions](http://koaning.io/radial-basis-functions.html) instead of 'stupid' dummies.

For example, we can use the timestamp in seconds as a feature to capture the long-term trend (train on, for example, 3 months of data, or at most 2 years). We can then derive RBF values for 24 hour-of-day features, or for 48 hour-weekend features. We prefer 48 features that cross weekend/weekday with hour over 24 hour features plus a weekend dummy, because weekend effects are probably not linear. That is, traffic volume will likely not increase by the same amount over the entire day, as the parameter for a weekend dummy would suggest. Instead, it is likely that only specific hours of the day are affected by the fact that it is a weekend.

The same goes for day features: their effects will likely not be linear across each hour of the day either.
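
A minimal NumPy sketch of the 24 and 48 feature variants. The Gaussian shape, the `width` value and the circular distance (so that hour 23 is close to hour 0) are assumptions made here, and `periodic_rbf` is a hypothetical helper name.

```python
import numpy as np


def periodic_rbf(x, period, n_centers, width=1.0):
    """Return a (len(x), n_centers) matrix of Gaussian RBF values on a circular axis.

    Assumes 0 <= x < period.
    """
    x = np.asarray(x, dtype=float)
    centers = np.arange(n_centers) * period / n_centers
    diff = np.abs(x[:, None] - centers[None, :])
    # Circular distance: wrap around at the end of the period.
    dist = np.minimum(diff, period - diff)
    return np.exp(-(dist ** 2) / (2 * width ** 2))


# One week of hourly observations.
hours = np.tile(np.arange(24), 7)
day_of_week = np.repeat(np.arange(7), 24)  # 0 = Monday ... 6 = Sunday
is_weekend = (day_of_week >= 5).astype(float)

# 24 hour-of-day features.
hour_rbf = periodic_rbf(hours, period=24, n_centers=24)

# 48 hour-weekend features: a separate set of hour curves for weekdays and weekends.
X_time = np.hstack([
    hour_rbf * (1 - is_weekend)[:, None],
    hour_rbf * is_weekend[:, None],
])
print(hour_rbf.shape, X_time.shape)
```

With the 48 columns, a linear model can learn a different daily profile for weekends instead of a single constant shift.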

### Model evaluation framework

Two options:
1. One-off split of data into a train and test set. Use the train for model selection. Evaluate on the test set.
    - For example: use the last 3 years for training and model selection. Then, evaluate on the last 6 months.
    - This is the most straightforward option, very similar to normal cross-sectional (non time series) data.
2. Split the data into train and test sets for multiple time series folds ("nested cv"). Train and evaluate on each fold.
    - For example: use the last 3 years to split into 6 folds of 6 months to train and test on. 
    - The final test metric is the average of the 6 folds' test results.


#### Option 1: *'regular'* time series cross validation

In time series you cannot use regular k-fold cross validation, as you would be training on information from the future. Therefore, we use more of a 'sliding window' wherein the validation set moves forward in time after each iteration. This can be done in `sklearn` with `sklearn.model_selection.TimeSeriesSplit`. 

Note: the picture below only visualizes the cross validation process and doesn't show the hold-out set. We keep this data apart for model evaluation at the end.

![](https://i.stack.imgur.com/fXZ6k.png)
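
A short sketch of `TimeSeriesSplit` on a toy array of 12 time-ordered observations. With the default settings the training window grows with each fold while the validation window moves forward.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 observations, already sorted by time.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))
for train_idx, val_idx in folds:
    # Every training index lies before every validation index.
    print("train:", train_idx, "validate:", val_idx)
```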

Sometimes, you need to customise things. If the time series is fluctuating a lot, then you might want to consider not looking at **all** of your past data but merely a **subset**. It might also be the case that we're interested in predicting merely 1 day or two weeks ahead. We might be interested in something like:

![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRffkVyzaJ080jaAceiSup-177tDuiku5KxLZ6e1VLAW94GKG1kiA)

For this you may need to write your own cross-validator, as the standard implementation does not cover every such scheme.
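
A custom splitter along these lines could look like the sketch below. The function name and its parameters are hypothetical. Depending on your scikit-learn version, the `max_train_size`, `test_size` and `gap` parameters of `TimeSeriesSplit` may already cover simple cases.

```python
import numpy as np


def sliding_window_splits(n_samples, train_size, test_size, gap=0, step=None):
    """Yield (train_idx, test_idx) pairs with a fixed-length training window.

    `gap` leaves unused observations between train and test, for example
    to respect the gap time of the target definition.
    """
    step = step or test_size
    start = 0
    while start + train_size + gap + test_size <= n_samples:
        train_end = start + train_size
        test_start = train_end + gap
        yield (
            np.arange(start, train_end),
            np.arange(test_start, test_start + test_size),
        )
        start += step


splits = list(sliding_window_splits(n_samples=10, train_size=4, test_size=2))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Since the generator yields plain index arrays, the resulting list of splits can be passed as the `cv` argument of scikit-learn's model selection utilities.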

Model selection can be based upon performance in each of the time series validation folds. Calculate the metric of interest (like RMSE for regression or precision for classification), and examine the average and standard deviation over all folds to get a feeling for the model's bias/variance trade-off. This is the same as for k-fold cross validation on non time series data.
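
As a sketch, the mean and standard deviation of the RMSE over the folds can be compared between candidate models. The synthetic data and the two `Ridge` settings below are placeholders, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic, time-ordered data: a linear trend plus noise.
rng = np.random.default_rng(0)
X = np.arange(200, dtype=float).reshape(-1, 1)
y = 0.5 * X.ravel() + rng.normal(scale=5.0, size=200)

cv = TimeSeriesSplit(n_splits=5)
results = {}
for alpha in [0.1, 10.0]:
    # scikit-learn maximises scores, so RMSE is returned negated.
    scores = -cross_val_score(
        Ridge(alpha=alpha), X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    results[alpha] = (scores.mean(), scores.std())
    print(f"alpha={alpha}: RMSE mean={scores.mean():.2f}, std={scores.std():.2f}")
```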

#### Option 2: *'nested'* time series cross validation

The above method has some minor disadvantages that nested cross validation can improve upon: 
- Hyperparameters are not based upon the most recent data, as this data is reserved for testing. In nested cross validation, the final model will be trained on the most recent data. 
- The test error that we use for model evaluation is based on the most recent period only. Let's say we pick the last 3 months for model evaluation, then we don't get an unbiased idea about the true error over the entire year. In nested cross validation, we get the most unbiased estimate of the true error possible: the average of each split's error. 

Check out this [blog](https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9) for a good explanation of how it works. Read the [GoDataDriven blog](https://blog.godatadriven.com/time-series-nested-cv) for good pictures and example code. 

![](https://blog.godatadriven.com/images/time-series-nested-cv/custom_time_series_cv.png)

<br>

As an example, if our dataset has five days, then we would produce three different training and test splits, as shown in this figure from the above mentioned blog:

![](https://cdn-images-1.medium.com/max/800/1*2-zaRQ-dsv8KWxOlzc8VaA.png)
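
One way to sketch nested time series cross validation with scikit-learn is an outer `TimeSeriesSplit` for the test error and an inner `TimeSeriesSplit` inside `GridSearchCV` for hyperparameter selection. The data, the model and the grid are placeholders, and the linked blogs may set up their splits differently.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic, time-ordered data.
rng = np.random.default_rng(0)
X = np.arange(120, dtype=float).reshape(-1, 1)
y = 0.5 * X.ravel() + rng.normal(scale=5.0, size=120)

outer = TimeSeriesSplit(n_splits=3)
inner = TimeSeriesSplit(n_splits=3)

outer_errors = []
for train_idx, test_idx in outer.split(X):
    # Inner loop: choose hyperparameters using only the outer training data.
    search = GridSearchCV(
        Ridge(),
        param_grid={"alpha": [0.1, 1.0, 10.0]},
        cv=inner,
        scoring="neg_root_mean_squared_error",
    )
    search.fit(X[train_idx], y[train_idx])
    # Outer loop: evaluate the refitted best model on the later, unseen period.
    pred = search.predict(X[test_idx])
    rmse = float(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))
    outer_errors.append(rmse)

print("RMSE per outer fold:", outer_errors)
print("average RMSE:", np.mean(outer_errors))
```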

### Testing

Especially for time series, build some tests on a small toy data set to check your data pipeline. Specifically validate:
* If your target definition is computed correctly
* If your (lookback) features are computed correctly
* If your custom cross validation time series splitter works correctly

Making a small mistake is easy when working with time series.


## Tools

There are really a lot of time series tools out there. Just check out this list by [rob-med](https://github.com/rob-med/awesome-TS-anomaly-detection).

Some of the better-known ones are:

### Prophet

Easy-to-use forecasting library by Facebook.

### PyMC3

Bayesian approach. Can be handy when you have few observations. Benefits are:
- that you can specify a prior.
- that you get an uncertainty measure for the outcome, as the result is a distribution and not a single number (as in the frequentist approach).

### Statsmodels

For SARIMA models

### Pyflux

...