# Cross Validation In Machine Learning Trading Models
Cross validation in machine learning is a technique that provides an <i><b>accurate measure of the performance</b></i> of a machine learning model. The measured performance is closer to what you can expect when the model is used on a future, unseen dataset.


This notebook will help you:
1. Determine whether the machine learning model is good at predicting buy signals and/or sell signals
2. Demonstrate the performance of your machine learning trading model in different stress scenarios
3. Carry out a comprehensive cross validation of a machine learning trading model


A sample <b><i>decision tree classifier model</i></b> is built on the <i><b>bank marketing dataset</b></i> to explain how to cross validate a machine learning model.

## Importing the required packages

In [105]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
%matplotlib inline
sns.set()

## Preparing Data
The dataset we use is related to the direct marketing campaigns of a Portuguese banking institution and is available [here](http://archive.ics.uci.edu/ml/datasets/Bank+Marketing). The classification goal is to predict whether the client will subscribe to a <b>term deposit (variable y)</b>.

In [107]:
bank_mkt = pd.read_csv('bankmkt.csv') # Read the dataset into dataframe
bank_mkt = bank_mkt.dropna() # Remove rows with missing values (dropna returns a new dataframe)
bank_mkt.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,30,blue-collar,married,basic.9y,no,yes,no,cellular,may,fri,487,2,999,0,nonexistent,-1.8,92.893,-46.2,1.313,5099.1,no
1,39,services,single,high.school,no,no,no,telephone,may,fri,346,4,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0,no
2,25,services,married,high.school,no,yes,no,telephone,jun,wed,227,1,999,0,nonexistent,1.4,94.465,-41.8,4.962,5228.1,no
3,38,services,married,basic.9y,no,unknown,unknown,telephone,jun,fri,17,3,999,0,nonexistent,1.4,94.465,-41.8,4.959,5228.1,no
4,47,admin.,married,university.degree,no,yes,no,cellular,nov,mon,58,1,999,0,nonexistent,-0.1,93.2,-42.0,4.191,5195.8,no


### Creating input dataset
Here we preprocess the data to build the input dataset. The choice of input features is arbitrary; you can learn more about feature selection [here](https://en.wikipedia.org/wiki/Feature_selection).

In [108]:
# We deep copy the dataframe for parsing
bank_data = bank_mkt.copy()

In [109]:
# Drop contact since every participant has been contacted
bank_data.drop('contact', axis = 1, inplace = True)

# Drop month and day_of_week: the date of the last contact
# carries no intrinsic meaning for the prediction
bank_data.drop('month', axis=1, inplace=True)
bank_data.drop('day_of_week', axis=1, inplace=True)

# Map values 'yes' / 'no' to 1 / 0 
bank_data.y = bank_data.y.map({'yes': 1, 'no': 0})

In [110]:
# Convert categorical variables to dummy variables
bank_with_dummies = pd.get_dummies(
    data=bank_data, 
    columns = ['job', 'default', 'marital', 'education', 'poutcome', 'housing', 'loan'], 
    prefix = ['job', 'default', 'marital', 'education', 'poutcome', 'housing', 'loan'])
bank_with_dummies.tail()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown,default_no,default_unknown,default_yes,marital_divorced,marital_married,marital_single,marital_unknown,education_basic.4y,education_basic.6y,education_basic.9y,education_high.school,education_illiterate,education_professional.course,education_university.degree,education_unknown,poutcome_failure,poutcome_nonexistent,poutcome_success,housing_no,housing_unknown,housing_yes,loan_no,loan_unknown,loan_yes
4114,30,53,1,999,0,1.4,93.918,-42.7,4.958,5228.1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1
4115,39,219,1,999,0,1.4,93.918,-42.7,4.959,5228.1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0
4116,27,64,2,999,1,-1.8,92.893,-46.2,1.354,5099.1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0
4117,58,528,1,999,0,1.4,93.444,-36.1,4.966,5228.1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0
4118,34,175,1,999,0,-0.1,93.2,-42.0,4.12,5195.8,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0


## Training the Machine Learning Model
We are going to train a decision tree classifier model here. The classifier is saved as 'clf', and its fit function is used to learn the relationship between the input X and the output variable y.

In [112]:
X = bank_with_dummies.drop('y', axis=1) # Pass axis by keyword; newer pandas versions reject the positional form
y = bank_with_dummies.y

In [113]:
clf = DecisionTreeClassifier(random_state = 5)
model = clf.fit(X, y)

The model is now ready. Let's see how to cross validate it.

## Cross Validation of the Machine Learning Model
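If the model's predictions are evaluated on the same data from which the model learned, its performance is bound to look spectacular. An unconstrained decision tree can simply memorize its training data, so such a score says little about how the model will behave on unseen data.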

In [120]:
from sklearn.metrics import accuracy_score
preds = model.predict(X) # Predicted output 
correct_Preds = accuracy_score(y, preds, normalize=False) # How many predictions are correct
print('Correct Predictions: {}'.format(correct_Preds))
print('Total Predictions: {}'.format(X.shape[0]))
print('Accuracy in Percentage: {}%'.format(accuracy_score(y, preds)*100))

Correct Predictions: 4119
Total Predictions: 4119
Accuracy in Percentage: 100.0%


### How to overcome this problem of using the same data for training and testing
<img style="height:200px" src='split.jpeg'/>
The easiest and most widely used way is to partition the data into two parts:

1. training dataset (used to train the model)
2. testing dataset (used to test the performance of the model)

In [122]:
# Total dataset length 
dataset_length = X.shape[0]

# Training set's length
split = int(dataset_length*0.75)
split

3089
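
The split index computed above can then be used to separate the training rows from the testing rows. The sketch below is one possible way to do this, not necessarily the approach the notebook goes on to use. It assumes the rows are kept in their original order (first 75% for training, last 25% for testing), and it uses synthetic data from `make_classification` as a stand-in for `X` and `y` so that it runs on its own. The names `X_demo`, `y_demo`, `split_demo` and `clf_demo` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (assumption: the real notebook would use
# the bank marketing features prepared earlier)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=5)

# First 75% of the rows for training, the remaining 25% for testing
split_demo = int(X_demo.shape[0] * 0.75)
X_train, X_test = X_demo[:split_demo], X_demo[split_demo:]
y_train, y_test = y_demo[:split_demo], y_demo[split_demo:]

# Train only on the training rows
clf_demo = DecisionTreeClassifier(random_state=5)
clf_demo.fit(X_train, y_train)

# Compare accuracy on rows the model has seen with rows it has not
train_accuracy = accuracy_score(y_train, clf_demo.predict(X_train))
test_accuracy = accuracy_score(y_test, clf_demo.predict(X_test))
print('Training accuracy: {:.2%}'.format(train_accuracy))
print('Testing accuracy: {:.2%}'.format(test_accuracy))
```

With the notebook's own dataframe, the equivalent slices would be `X.iloc[:split]` and `X.iloc[split:]`. The testing accuracy, which will be at or below the training accuracy here, is the more honest estimate of the two.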

### K-Fold Cross Validation Technique
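
In k-fold cross validation the data is divided into k parts (folds); each fold is used once as the testing set while the remaining folds are used for training, which gives k performance scores instead of one. A minimal sketch using scikit-learn's `KFold` and `cross_val_score` follows. The choice of five folds, the unshuffled split and the synthetic data from `make_classification` are all assumptions made for illustration, and the `_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (assumption, not the bank marketing data)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=5)

# Five consecutive folds; shuffle=False keeps the original row order
kfold = KFold(n_splits=5, shuffle=False)
clf_demo = DecisionTreeClassifier(random_state=5)

# One accuracy score per fold: train on four folds, test on the fifth
fold_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=kfold, scoring='accuracy')

for fold_number, score in enumerate(fold_scores, start=1):
    print('Fold {}: {:.2%}'.format(fold_number, score))
print('Mean accuracy: {:.2%}'.format(fold_scores.mean()))
```

A wide spread between the fold scores would suggest that the model's performance depends heavily on which part of the data it is tested on.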

### Confusion Matrix
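
A confusion matrix counts the predictions for each combination of actual and predicted class, which shows whether a model is better at predicting one class than the other (for example, buy signals versus sell signals in a trading model). The sketch below uses scikit-learn's `confusion_matrix` on a held-out testing set. It assumes synthetic data from `make_classification` and a 75/25 split in row order; the `_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (assumption, not the bank marketing data)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=5)

# Hold out the last 25% of the rows for testing
split_demo = int(X_demo.shape[0] * 0.75)
X_train, X_test = X_demo[:split_demo], X_demo[split_demo:]
y_train, y_test = y_demo[:split_demo], y_demo[split_demo:]

clf_demo = DecisionTreeClassifier(random_state=5)
clf_demo.fit(X_train, y_train)
test_preds = clf_demo.predict(X_test)

# Rows are actual classes, columns are predicted classes, in the order [0, 1]
matrix = confusion_matrix(y_test, test_preds, labels=[0, 1])
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print('True negatives: {}, False positives: {}'.format(tn, fp))
print('False negatives: {}, True positives: {}'.format(fn, tp))
```

The diagonal entries are the correct predictions for each class; the off-diagonal entries show which kind of mistake the model makes more often.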