# Decision Trees
(Reon)

### Contents:
1. Data Exploration
2. Feature Engineering
3. Training

## Packages


In [30]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from ta import add_all_ta_features

## Data Exploration

In [98]:
# This is from Samuel's part

df = pd.read_csv('AAPL.csv') #Read the data in
df.Date = pd.to_datetime(df.Date, format='%d/%m/%Y') #Set the date column to datetime
#df.set_index('Date', inplace=True) #Set the index to the date column
df = df.rename(columns = {'Adj Close':'Adj_Close'})
df = df.rename(columns = {'Date':'Timestamp'})
df.head(20) #Observe a few rows of data

Unnamed: 0,Timestamp,Open,High,Low,Close,Adj_Close,Volume
0,2015-09-01,110.150002,111.879997,107.360001,107.720001,100.533249,76845900
1,2015-09-02,110.230003,112.339996,109.129997,112.339996,104.845024,61888800
2,2015-09-03,112.489998,112.779999,110.040001,110.370003,103.006447,53233900
3,2015-09-04,108.970001,110.449997,108.510002,109.269997,101.979836,49996300
4,2015-09-08,111.75,112.559998,110.32,112.309998,104.817017,54843600
5,2015-09-09,113.760002,114.019997,109.769997,110.150002,102.801125,85010800
6,2015-09-10,110.269997,113.279999,109.900002,112.57,105.059677,62892800
7,2015-09-11,111.790001,114.209999,111.760002,114.209999,106.590263,49915500
8,2015-09-14,116.580002,116.889999,114.860001,115.309998,107.616867,58363400
9,2015-09-15,115.93,116.529999,114.419998,116.279999,108.522148,43341200


In [99]:
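#Target - is the adjusted close 4 rows (trading days) ahead higher than today's?
#The last 4 rows have no future price, so they are labelled False
target = df["Adj_Close"].shift(-4)
df["Binary_Target"] = target > df["Adj_Close"]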

df.head()

Unnamed: 0,Timestamp,Open,High,Low,Close,Adj_Close,Volume,Binary_Target
0,2015-09-01,110.150002,111.879997,107.360001,107.720001,100.533249,76845900,True
1,2015-09-02,110.230003,112.339996,109.129997,112.339996,104.845024,61888800,False
2,2015-09-03,112.489998,112.779999,110.040001,110.370003,103.006447,53233900,True
3,2015-09-04,108.970001,110.449997,108.510002,109.269997,101.979836,49996300,True
4,2015-09-08,111.75,112.559998,110.32,112.309998,104.817017,54843600,True
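
The labelling logic can be checked on a small made-up series. The sketch below assumes the same 4-row look-ahead as the cell above; the prices and the names `horizon`, `labelled`, and `valid` are illustrative only. It also shows one possible way to drop the trailing rows that have no future price, rather than labelling them False.

```python
import pandas as pd

# Made-up prices, for illustration only
prices = pd.Series([10.0, 11.0, 9.0, 12.0, 10.5, 13.0, 8.0])

horizon = 4  # assumption: same 4-row look-ahead as the notebook cell

# Price `horizon` rows ahead; the last `horizon` rows become NaN
future = prices.shift(-horizon)

# Any comparison with NaN is False, so the trailing rows are labelled False
target = future > prices

labelled = pd.DataFrame({"price": prices, "future": future, "target": target})

# Optional: keep only the rows whose future price is actually known
valid = labelled.dropna(subset=["future"])

print(labelled)
print(len(valid), "rows have a known future price")
```

Dropping the unlabeled tail is a design choice, not something the notebook does; it avoids training on rows whose label is False only because the future price is missing.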


## Feature Engineering

In this section we create technical indicators that are commonly used in stock technical analysis, using the package "ta".

### Indicators:

#### Volume
1. Accumulation/Distribution Index (ADI)
2. On-Balance Volume (OBV)
3. Chaikin Money Flow (CMF)
4. Force Index (FI)
5. Ease of Movement (EoM, EMV)
6. Volume-price Trend (VPT)
7. Negative Volume Index (NVI)

#### Volatility
1. Average True Range (ATR)
2. Bollinger Bands (BB)
3. Keltner Channel (KC)
4. Donchian Channel (DC)

#### Trend
1. Moving Average Convergence Divergence (MACD)
2. Average Directional Movement Index (ADX)
3. Vortex Indicator (VI)
4. Trix (TRIX)
5. Mass Index (MI)
6. Commodity Channel Index (CCI)
7. Detrended Price Oscillator (DPO)
8. KST Oscillator (KST)
9. Ichimoku Kinkō Hyō (Ichimoku)

#### Momentum
1. Money Flow Index (MFI)
2. Relative Strength Index (RSI)
3. True strength index (TSI)
4. Ultimate Oscillator (UO)
5. Stochastic Oscillator (SR)
6. Williams %R (WR)
7. Awesome Oscillator (AO)
8. Kaufman's Adaptive Moving Average (KAMA)

#### Others
1. Daily Return (DR)
2. Daily Log Return (DLR)
3. Cumulative Return (CR)

To start, we add every indicator to the dataset. The tree-based model will later pick out the features that are most useful for splitting the data.


In [101]:
df = add_all_ta_features(df, "Open", "High", "Low", "Close", "Volume", fillna=True)
df.head()

Unnamed: 0,Timestamp,Open,High,Low,Close,Adj_Close,Volume,Binary_Target,volume_adi,volume_obv,...,momentum_mfi,momentum_tsi,momentum_uo,momentum_stoch,momentum_stoch_signal,momentum_wr,momentum_ao,others_dr,others_dlr,others_cr
0,2015-09-01,110.150002,111.879997,107.360001,107.720001,100.533249,76845900,True,-63291490.0,0.0,...,0.0,-100.0,0.808747,7.964609,7.964609,-92.035391,0.0,-29.072455,0.0,0.0
1,2015-09-02,110.230003,112.339996,109.129997,112.339996,104.845024,61888800,False,-2716149.0,61888800.0,...,45.122342,-92.752788,10.135684,100.0,53.982304,-0.0,0.0,4.288892,4.199467,4.288892
2,2015-09-03,112.489998,112.779999,110.040001,110.370003,103.006447,53233900,True,21477750.0,8654900.0,...,32.522933,-90.286322,10.236476,55.535113,54.499907,-44.464887,0.0,-1.753599,-1.769157,2.460084
3,2015-09-04,108.970001,110.449997,108.510002,109.269997,101.979836,49996300,True,-51235150.0,-41341400.0,...,25.845868,-89.010639,11.279728,35.239792,63.591635,-64.760208,0.0,-0.996653,-1.001653,1.438912
4,2015-09-08,111.75,112.559998,110.32,112.309998,104.817017,54843600,True,31777620.0,13502200.0,...,39.711281,-86.788464,15.953536,91.328392,60.701099,-8.671608,0.0,2.7821,2.744103,4.261044
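
As a sanity check, the three "Others" columns can be recomputed by hand from the Close prices shown above. This is a sketch under the assumption that they are the percentage daily return, the percentage daily log return, and the percentage cumulative return relative to the first row. The values in the table are consistent with that reading for rows 1 to 4, but the library's own code remains the authority.

```python
import numpy as np

# Close prices of the first five rows, copied from the table above
close = np.array([107.720001, 112.339996, 110.370003, 109.269997, 112.309998])

# Percentage change from the previous row (compare with others_dr, rows 1-4)
daily_return = (close[1:] / close[:-1] - 1.0) * 100.0

# Log of the price ratio, in percent (compare with others_dlr, rows 1-4)
daily_log_return = np.log(close[1:] / close[:-1]) * 100.0

# Percentage change relative to the first row (compare with others_cr, rows 1-4)
cumulative_return = (close[1:] / close[0] - 1.0) * 100.0

for row, (dr, dlr, cr) in enumerate(
    zip(daily_return, daily_log_return, cumulative_return), start=1
):
    print(f"row {row}: dr={dr:.6f} dlr={dlr:.6f} cr={cr:.6f}")
```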


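We also scale the features before training, with the aim of keeping large-magnitude features from dominating the others. Here we use sklearn's normalize function. Note that by default it rescales each row (sample) to unit L2 norm rather than scaling each column, so with Volume several orders of magnitude larger than the prices, most of the normalized values below end up very close to zero.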

In [34]:
from sklearn.preprocessing import normalize

In [106]:
features = pd.DataFrame(normalize(df.drop(columns=['Timestamp', 'Binary_Target'])))
features["Timestamp"] = df["Timestamp"]
features["Binary_Target"] = df["Binary_Target"]
features.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,56,57,58,59,60,61,62,63,Timestamp,Binary_Target
0,1e-06,1e-06,1e-06,1e-06,9.852779e-07,0.75313,-0.620289,0.0,-8.239372e-09,0.0,...,7.926141e-09,7.805729e-08,7.805729e-08,-9.019945e-07,0.0,-2.849251e-07,0.0,0.0,2015-09-01,True
1,1e-06,1e-06,1e-06,1e-06,1.168166e-06,0.689555,-0.030263,0.689555,-2.18135e-10,0.0,...,1.129301e-07,1.114184e-06,6.01462e-07,-0.0,0.0,4.778614e-08,4.678978e-08,4.778614e-08,2015-09-02,False
2,1e-06,1e-06,1e-06,1e-06,1.206571e-06,0.623558,0.251581,0.10138,-2.631537e-09,-0.732938,...,1.199055e-07,6.505132e-07,6.383873e-07,-5.208416e-07,0.0,-2.054087e-08,-2.07231e-08,2.881631e-08,2015-09-03,True
3,1e-06,1e-06,1e-06,1e-06,1.128327e-06,0.55317,-0.566877,-0.45741,-2.467006e-09,0.403954,...,1.248013e-07,3.899006e-07,7.035916e-07,-7.165209e-07,0.0,-1.102718e-08,-1.10825e-08,1.592043e-08,2015-09-04,True
4,2e-06,2e-06,2e-06,2e-06,1.615295e-06,0.845174,0.489713,0.208077,-5.892825e-10,0.048124,...,2.458538e-07,1.407427e-06,9.354414e-07,-1.336348e-07,0.0,4.287388e-08,4.228832e-08,6.566532e-08,2015-09-08,True
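
The effect of row-wise normalization can be seen on two columns from the table. The sketch below uses the Close and Volume values of the first three rows and compares `normalize` (default `norm='l2'`, `axis=1`) with column-wise standardization via `StandardScaler`. The comparison is an illustration of the difference, not a step the notebook performs.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Close and Volume for the first three rows of the data shown earlier
X_demo = np.array([
    [107.720001, 76845900.0],
    [112.339996, 61888800.0],
    [110.370003, 53233900.0],
])

# Row-wise scaling: every row is divided by its own L2 norm
row_scaled = normalize(X_demo)
row_norms = np.linalg.norm(row_scaled, axis=1)

# Column-wise scaling: every column gets zero mean and unit variance
col_scaled = StandardScaler().fit_transform(X_demo)

print("row-normalized:\n", row_scaled)
print("row norms:", row_norms)
print("column-standardized:\n", col_scaled)
```

In the row-normalized output the Close column shrinks to roughly 1e-6 while Volume sits near 1, because Volume dominates each row's norm. Column-wise scaling keeps both columns on a comparable scale.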




## Iteration 1 (All 31 Features, Random Forest)
We split the data chronologically into a training set and a test set, then fit a random forest classifier to predict our binary target. Each tree in the forest considers a random subset of the features at every split.

In [107]:
# Split into independent and dependent variables
X = features.copy().drop('Binary_Target', axis =1)
Y = features[['Timestamp','Binary_Target']]

# Get Training set
X_train = X[X["Timestamp"] <= '2018-08-31']
y_train = Y[Y["Timestamp"] <= '2018-08-31']
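
The date-based split can also be wrapped in a small helper so that the training and test sets are produced the same way. This is a minimal sketch on made-up data; `split_by_date` is a hypothetical helper, not part of pandas or of the notebook, and it assumes a datetime column named `Timestamp`.

```python
import pandas as pd


def split_by_date(frame, cutoff, time_col="Timestamp"):
    """Return (train, test): rows up to and including `cutoff`, and rows after it."""
    cutoff = pd.Timestamp(cutoff)
    # set_index returns a new frame, so the caller's frame is left untouched
    indexed = frame.set_index(time_col).sort_index()
    train = indexed[indexed.index <= cutoff]
    test = indexed[indexed.index > cutoff]
    return train, test


# Made-up rows around the cutoff used in the notebook
toy = pd.DataFrame({
    "Timestamp": pd.to_datetime(
        ["2018-08-29", "2018-08-30", "2018-08-31", "2018-09-04", "2018-09-05"]
    ),
    "feature": [0.1, 0.2, 0.3, 0.4, 0.5],
    "Binary_Target": [True, False, True, True, False],
})

train, test = split_by_date(toy, "2018-08-31")
print(len(train), "training rows,", len(test), "test rows")
```

Splitting on time rather than at random keeps every test row strictly later than every training row, which matters for a target built from future prices.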

## Training our Classifier
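Tree-based models perform a form of feature selection automatically: at each node they pick the feature and threshold that best split the data. In this case, we use the Gini index (Gini impurity) as the splitting criterion, so each split is chosen to give the largest decrease in impurity. This is the default criterion='gini' setting, as opposed to the entropy-based information gain.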
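
For reference, the Gini impurity of a node with class proportions p_k is 1 - sum(p_k^2), and a split is scored by how much it lowers the sample-weighted impurity of the children. The sketch below computes both with NumPy. The parent counts are the False/True support of the training set reported further down; the child counts are a hypothetical split made up for illustration, and the function names are not from sklearn.

```python
import numpy as np


def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_k^2) for a node with the given class counts."""
    counts = np.asarray(class_counts, dtype=float)
    total = counts.sum()
    if total == 0:
        return 0.0
    proportions = counts / total
    return float(1.0 - np.sum(proportions ** 2))


def impurity_decrease(parent, left, right):
    """Parent impurity minus the sample-weighted impurity of the two children."""
    n = float(np.sum(parent))
    w_left = float(np.sum(left)) / n
    w_right = float(np.sum(right)) / n
    children = w_left * gini_impurity(left) + w_right * gini_impurity(right)
    return gini_impurity(parent) - children


parent = [321, 436]   # False / True counts in the training set
left = [250, 100]     # hypothetical child node, for illustration only
right = [71, 336]     # hypothetical child node, for illustration only

print("parent impurity:", round(gini_impurity(parent), 4))
print("impurity decrease:", round(impurity_decrease(parent, left, right), 4))
```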

In [116]:
clf = RandomForestClassifier(n_estimators=200, random_state=0)

In [117]:
# Index by date; reassign rather than use inplace=True on a filtered slice
X_train = X_train.set_index('Timestamp')
y_train = y_train.set_index('Timestamp')
# Run this cell once: a second run raises KeyError because 'Timestamp' is no longer a column

In [118]:
clf.fit(X_train, y_train["Binary_Target"])  # pass y as a 1-D Series, not a one-column DataFrame


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)
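
One way to inspect which features a fitted forest relies on is its `feature_importances_` attribute. The sketch below uses synthetic data in which only the first column determines the label, so it should rank first; the data and the names `X_demo`, `y_demo`, and `forest` are assumptions for illustration and are unrelated to the stock features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: four random features, label depends only on feature 0
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 4))
y_demo = X_demo[:, 0] > 0

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances, normalized to sum to 1
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]

for idx in ranking:
    print(f"feature {idx}: importance {importances[idx]:.3f}")
```

Applied to the notebook's classifier, the same attribute on `clf` would give one importance per input column, which could guide which indicators to keep in a later iteration.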

In [111]:
#Test set
X_test = X[X["Timestamp"] > '2018-08-31'].set_index('Timestamp') #Index by the date column
y_test = Y[Y["Timestamp"] > '2018-08-31'].set_index('Timestamp') #Index by the date column

In [119]:
predicted = clf.predict(X_test)

## Prediction Accuracy
In this part we evaluate the model's accuracy on the hold-out sample.

In [120]:
pd.crosstab(y_test["Binary_Target"],predicted, rownames=['Actual'], colnames=['Predicted'])

Actual,Predicted False,Predicted True
False,7,107
True,12,124


In [121]:
print(classification_report(y_train["Binary_Target"], clf.predict(X_train)))

              precision    recall  f1-score   support

       False       1.00      0.74      0.85       321
        True       0.84      1.00      0.91       436

    accuracy                           0.89       757
   macro avg       0.92      0.87      0.88       757
weighted avg       0.91      0.89      0.88       757



In [122]:
print(classification_report(y_test["Binary_Target"], predicted))

              precision    recall  f1-score   support

       False       0.37      0.06      0.11       114
        True       0.54      0.91      0.68       136

    accuracy                           0.52       250
   macro avg       0.45      0.49      0.39       250
weighted avg       0.46      0.52      0.42       250
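
The figures in the test report can be reproduced directly from the confusion matrix shown earlier, which also makes it easy to compare the model with a trivial baseline. This is a small sketch using NumPy; the baseline of always predicting the majority class is added here for comparison and is not computed in the notebook.

```python
import numpy as np

# Rows are actual classes (False, True); columns are predicted classes (False, True)
cm = np.array([
    [7, 107],
    [12, 124],
])

total = cm.sum()
accuracy = np.trace(cm) / total

# Recall: correct predictions divided by the actual count of each class
recall = np.diag(cm) / cm.sum(axis=1)

# Precision: correct predictions divided by the predicted count of each class
precision = np.diag(cm) / cm.sum(axis=0)

# Accuracy obtained by always predicting the most frequent actual class
majority_baseline = cm.sum(axis=1).max() / total

print(f"accuracy          {accuracy:.3f}")
print(f"recall    (F, T)  {recall.round(2)}")
print(f"precision (F, T)  {precision.round(2)}")
print(f"majority baseline {majority_baseline:.3f}")
```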



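Our model performs poorly on the hold-out sample. Its accuracy of 0.52 is no better than random chance for a binary classification problem, and it is slightly below the 0.54 we would get by always predicting True (136 of the 250 test rows). The gap between the training accuracy (0.89) and the test accuracy (0.52) also points to overfitting. Next, we will introduce scaling into our model.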