# Cross-validation & strategy backtesting

## Cross-validation

We follow the same steps as last week, but with different cross-validation approaches.

1. Import the data
2. Feature engineering and data labelling
3. Split the data into train, validation, and test datasets
4. Model builder
5. Train and cross-validate the model
6. Make predictions and evaluate the performance of the final selected model

### 1. Import the data

We will import hourly BTC/USD price data from CryptoCompare.

In [None]:
import cryptocompare
import pandas as pd

df = cryptocompare.get_historical_price_hour('BTC', curr='USD', limit=2000)
df = pd.DataFrame(df)
df.time = pd.to_datetime(df['time'], unit='s')
df.set_index('time', inplace=True)

df

In [None]:
from plotnine import *

(ggplot(df, aes(x='df.index', y='close'))
 + geom_line()
 + xlab('date'))

### 2. Feature engineering and data labelling

#### 2.1 Feature engineering

In [None]:
import talib as ta

df['ADX'] = ta.ADX(df['high'].values, df['low'].values, df['close'].values, timeperiod=14) / 30
df['RSI'] = ta.RSI(df['close'].values, timeperiod=14) / 30
df['SMA'] = ta.SMA(df['close'].values, timeperiod=20) / 1e4
df['SMA2'] = ta.SMA(df['volumeto'].values, timeperiod=20) / 1e7

df

#### 2.2 Data labelling

We will use the **fixed horizon method** with a non-zero threshold.

In [None]:
import numpy as np

label_window = 5
return_threshold = 0.0025

## Compute the future returns over the next `label_window` hours (the data is hourly)
df['fut_returns'] = df['close'].pct_change(+label_window).shift(-label_window)

## Attribute the class {-1, 0, 1} if the future return is {below, between, above} the thresholds
df['target_class'] = np.where(df.fut_returns > return_threshold, 1, 
                                np.where(df.fut_returns < -return_threshold, -1, 0))

df
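
To make the labelling rule concrete, here is a minimal sketch on a made-up price series. The prices, the 2-bar horizon, and the 1.5% threshold are illustrative assumptions, not the values used above.

```python
import numpy as np
import pandas as pd

# Toy price series (made-up numbers, for illustration only)
close = pd.Series([100.0, 101.0, 100.0, 99.0, 100.0, 102.0])

horizon = 2        # look-ahead window, in bars (assumed value)
threshold = 0.015  # +/- 1.5% band around zero (assumed value)

# Return realised between t and t + horizon, aligned at time t
fut_returns = close.shift(-horizon) / close - 1

# 1 above the band, -1 below it, 0 inside it
# (a NaN return fails both comparisons and therefore also gets 0)
labels = np.where(fut_returns > threshold, 1,
                  np.where(fut_returns < -threshold, -1, 0))

print(pd.DataFrame({'close': close, 'fut_returns': fut_returns, 'label': labels}))
```

The last `horizon` rows have no future return yet, so their label of 0 is meaningless. Such rows have to be dropped before training.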

Check that our target classes are balanced.

In [None]:
(ggplot(df, aes(x='target_class'))
 + geom_histogram())

#### 2.3 Extract $X$ and $y$

In [None]:
df = df.dropna()

## feature variables
predictors_list = ['ADX', 'RSI', 'SMA', 'SMA2']
X = df[predictors_list].to_numpy()

## target variable
y = df.target_class.to_numpy()

### 3. Split the data into train-test datasets

This time we **DO NOT** shuffle the data and reserve the last part as a **holdout** sample.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
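
As a quick check of what `shuffle=False` does, the sketch below splits ten time-ordered toy observations. The arrays `X_toy` and `y_toy` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten time-ordered observations, oldest first (toy data)
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, shuffle=False)

print(y_tr)  # the earliest 80% of the observations
print(y_te)  # the most recent 20%, kept as the holdout
```

The chronological order is preserved, so the holdout sample lies strictly after the training sample in time.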

Are the two samples balanced?

In [None]:
print('Percentage of 1s in the train and test sets: %.2f and %.2f' % (np.mean(y_train==1)*100, np.mean(y_test==1)*100))
print('Percentage of -1s in the train and test sets: %.2f and %.2f' % (np.mean(y_train==-1)*100, np.mean(y_test==-1)*100))

### 4. Model builder

Functions to easily build and evaluate models. *This is normally a big chunk of code!!!*

For simplicity, we use scikit-learn. The steps would be the same with TensorFlow, although the model construction would be a bit more involved.

We will work with a Random Forest, see https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
from sklearn.ensemble import RandomForestClassifier

def create_model(n_estimators, max_depth):
    return RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=0)

### 5. Train and cross-validate the model

In [None]:
from sklearn.model_selection import KFold

## model parameters
n_estimators = 100
max_depth = 10
## CV parameters
n_split = 5

scores_train = []
scores_test = []

for train_index, test_index in KFold(n_split).split(X_train):
    
    ## CV train-test split
    x_cv_train, x_cv_test = X_train[train_index], X_train[test_index]
    y_cv_train, y_cv_test = y_train[train_index], y_train[test_index]
  
    ## create and train the model
    model = create_model(n_estimators, max_depth)
    model.fit(x_cv_train, y_cv_train)
    
    ## collect accuracy metrics
    scores_train.append(model.score(x_cv_train, y_cv_train, sample_weight=None))
    scores_test.append(model.score(x_cv_test, y_cv_test, sample_weight=None))

print('Mean accuracy on CV train %.2f%%' % (100*np.mean(scores_train)))
print('Mean accuracy on CV test %.2f%%' % (100*np.mean(scores_test)))

**Exercise:** 
  * Are the results good? If not, why?
  * How would you perform hyper-parameter tuning?
  * Is the cross-validation approach appropriate here? If not, what would you change?
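
One possible direction for the last question, among others, is a forward-chaining split in which each validation fold lies after its training fold. The sketch below uses scikit-learn's `TimeSeriesSplit` on a toy index. The sample size and number of splits are arbitrary, and setting `gap` equal to the label window is an assumption meant to keep the training labels from overlapping the validation period.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples = 30   # toy sample size (assumed)
gap = 5          # assumption: same length as the label window

tscv = TimeSeriesSplit(n_splits=3, gap=gap)
splits = list(tscv.split(np.zeros((n_samples, 1))))

for train_index, test_index in splits:
    # Training data always precedes validation data, with `gap` samples skipped in between
    print('train: %d..%d | validation: %d..%d'
          % (train_index[0], train_index[-1], test_index[0], test_index[-1]))
```

The same loop body as in the `KFold` cell can be reused by swapping in `tscv.split(X_train)`.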

#### Hyper-parameter search

In [None]:
## function that generates two random hyper-parameters, drawn uniformly between the given bounds
def generate_random_hyperparams(par1_min, par1_max, par2_min, par2_max):
    random_par1 = np.random.uniform(par1_min, par1_max)
    random_par2 = np.random.uniform(par2_min, par2_max)
    return random_par1, random_par2

## create grid of parameters with 'meshgrid' as follows
np.array(np.meshgrid([1, 2, 3], [4, 5], [6, 7])).T.reshape(-1,3)
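
A minimal sketch of how such a grid could drive a search: each candidate is scored by cross-validation and the best mean validation accuracy wins. The data here is synthetic noise standing in for `X_train` and `y_train`, and the helper `cv_score` and the grid values are hypothetical placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

# Synthetic stand-in for X_train / y_train (random noise, illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = rng.integers(-1, 2, size=200)   # classes in {-1, 0, 1}

def cv_score(n_estimators, max_depth, n_split=5):
    """Mean validation accuracy over the CV folds (hypothetical helper)."""
    scores = []
    for train_index, test_index in KFold(n_split).split(X_demo):
        model = RandomForestClassifier(n_estimators=n_estimators,
                                       max_depth=max_depth, random_state=0)
        model.fit(X_demo[train_index], y_demo[train_index])
        scores.append(model.score(X_demo[test_index], y_demo[test_index]))
    return np.mean(scores)

# Grid of (n_estimators, max_depth) candidates; the values are arbitrary
grid = np.array(np.meshgrid([10, 50], [2, 5])).T.reshape(-1, 2)

# Score every candidate and keep the one with the best mean CV accuracy
results = {(int(n), int(d)): cv_score(int(n), int(d)) for n, d in grid}
best_params = max(results, key=results.get)

print(results)
print('Best (n_estimators, max_depth):', best_params)
```

The selection uses only the cross-validation scores. The holdout sample stays untouched until the final evaluation.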

### 6. Make predictions and evaluate the performance of the final selected model

In [None]:
max_depth = 5
n_estimators = 10
model = create_model(n_estimators, max_depth)
model.fit(X_train, y_train)

print('Mean accuracy on train %.2f%%' % (100*model.score(X_train, y_train, sample_weight=None)))
print('Mean accuracy on holdout %.2f%%' % (100*model.score(X_test, y_test, sample_weight=None)))

## Backtesting a betting rule

Assume that we have an infinite amount of cash available.

We will bet at most 1'000 USD every hour on the next 5-hour prediction and hold each bet until the end of its 5-hour horizon. Hence, at any given time, we hold at most five open bets, i.e. at most 5'000 USD.

We will size each bet in proportion to the confidence, or informativeness, of our signal.

In [None]:
from plotnine import *

## Predicted class along with its probability
signal = model.predict(X_test)
proba = model.predict_proba(X_test).max(axis=1)

## Future returns over the next `label_window` hours, for the holdout sample
fut_returns = df.tail(y_test.size).fut_returns

## Put together
df_strat = pd.DataFrame(data={'signal':signal, 'proba':proba, 'fut_returns':fut_returns.to_numpy()},
                       index=fut_returns.index)

df_strat

In [None]:
## Display the predicted class
(ggplot(df_strat, aes(x='df_strat.index', y='signal')) + 
   geom_point() + 
   xlab('time'))

In [None]:
from scipy.stats import norm

## Compute the bet size and side (long/short) and the realized PnL
bet_max = 1000
df_strat['z'] = (df_strat.proba - 1/3) / np.sqrt(df_strat.proba * (1 - df_strat.proba))
df_strat['position'] = bet_max * df_strat.signal * (2*norm.cdf(df_strat.z) - 1)
df_strat['pnl'] = df_strat.fut_returns * df_strat.position
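
To get a feel for the sizing rule, the sketch below evaluates it for a few probabilities. The probability of the predicted class is compared with the 1/3 expected from a uniform guess over three classes, and the resulting z-score is mapped to a fraction of the maximum bet. The helper `bet_fraction` and its `n_classes` argument are illustrative names, not part of the notebook.

```python
import numpy as np
from scipy.stats import norm

def bet_fraction(proba, n_classes=3):
    """Fraction of the maximum bet for a given predicted-class probability."""
    # z-score of the probability against a uniform guess over n_classes
    z = (proba - 1 / n_classes) / np.sqrt(proba * (1 - proba))
    # Map z to [-1, 1]; zero when the model is no better than a uniform guess
    return 2 * norm.cdf(z) - 1

for p in [1/3, 0.5, 0.7, 0.9]:
    print('proba = %.2f -> fraction of max bet = %.3f' % (p, bet_fraction(p)))
```

The fraction is zero at a probability of 1/3 and grows with the probability. At a probability of exactly 1 the denominator vanishes and the z-score becomes infinite, so clipping the probabilities slightly below 1 may be worth considering.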

In [None]:
df_strat

What does our total position look like?

In [None]:
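## Each bet stays open for `label_window` hours, so the total exposure
## is the rolling sum (not the average) of the last `label_window` bets
df_strat['total_position'] = df_strat['position'].rolling(label_window).sum()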

(ggplot(df_strat, aes(x='df_strat.index', y='total_position')) + 
  geom_line() + xlab("time") + ggtitle("Total long/short position"))

Did we make any money?

In [None]:
df_strat['cum_pnl'] = np.cumsum(df_strat.pnl)

(ggplot(df_strat, aes(x='df_strat.index', y='cum_pnl')) + 
  geom_line() + xlab("time") + ggtitle("Strategy cumulative PnL"))

In [None]:
(ggplot(df.tail(y_test.size), aes(x='df.tail(y_test.size).index', y='close')) + geom_line())

**Exercise:** 
  * What do you think of the performance of this strategy? Would you go to production with it? Why?
  * How would you make this backtest more realistic? (e.g. transaction costs)
  * What alternative betting strategy would you use?
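
As one illustration of the second question, the sketch below subtracts a proportional transaction cost from the PnL of each bet. The toy positions and returns are made up, and the cost rate of 10 basis points per trade is an assumption, not a quoted fee. Netting between overlapping bets is ignored.

```python
import numpy as np
import pandas as pd

# Toy positions (USD) and realised returns, for illustration only
toy = pd.DataFrame({'position': [500.0, -200.0, 0.0, 800.0],
                    'fut_returns': [0.004, 0.001, 0.002, -0.003]})

cost_rate = 0.001  # assumption: 10 bps charged on each trade

toy['pnl_gross'] = toy.fut_returns * toy.position
# Each bet is opened once and closed once, so it pays the cost twice
# (simplification: both legs are charged on the entry notional)
toy['cost'] = 2 * cost_rate * toy.position.abs()
toy['pnl_net'] = toy.pnl_gross - toy.cost

print(toy)
print('Gross PnL: %.2f, net PnL: %.2f' % (toy.pnl_gross.sum(), toy.pnl_net.sum()))
```

With hourly rebalancing, costs of this order can easily dominate the gross PnL, which is one reason to look at net figures before drawing conclusions.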

### Feature importance

In a future lecture we will study how to identify which features are important, or not. This is particularly useful during the model development stage as it will help you better understand and build your model.

Below is just a preview.

In [None]:
importances = model.feature_importances_
std = np.std([tree.feature_importances_ for tree in model.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

import matplotlib.pyplot as plt  
    
# Plot the impurity-based feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()

print(predictors_list)