# Tree methods

We download and prepare some financial data (simple feature engineering and labelling).

We then implement and train a decision tree (first part) and boosted trees (second part) to try to predict the class of future returns.

## Classification Tree

1. Import the data
2. Feature engineering and data labelling
3. Split the data into train and test dataset
4. Fit a decision tree model on train data
5. Visualize the decision tree model
6. Make predictions and evaluate the performance

### 1. Import the data

We will import data from Yahoo! Finance using the `yfinance` package.

In [None]:
import yfinance as yf

data = yf.Ticker("TSLA")
df = data.history(period="max")
df.tail()

#### Graphics

We will use [plotnine](https://plotnine.readthedocs.io/en/stable/) as much as possible for figures.

There are many different packages for creating figures. The packages `plotnine` in Python and `ggplot2` in R both implement [*The Grammar of Graphics*](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448), which will help you save a lot of time in the long run.

Here is a small [tutorial](https://www.kaggle.com/residentmario/grammar-of-graphics-with-plotnine-optional/) on `plotnine`.

In [None]:
#!pip install plotnine
from plotnine import *

(ggplot(df, aes(x='df.index', y='Close'))
 + geom_line()
 + xlab('date'))

### 2. Feature engineering and data labelling

We define a list of predictors using the [TA-Lib library](https://mrjbq7.github.io/ta-lib/) for technical indicators (150+ available):
  * Average Directional Index (ADX)
  * Relative Strength Index (RSI) 
  * Simple Moving Average (SMA)

In [None]:
## run to install ta-lib (WARNING: must install from binary)
#!pip install talib-binary

In [None]:
import talib as ta
import numpy as np

df['ADX'] = ta.ADX(df['High'].values, df['Low'].values, df['Close'].values, timeperiod=14)
df['RSI'] = ta.RSI(df['Close'].values, timeperiod=14)
df['SMA'] = ta.SMA(df['Close'].values, timeperiod=20)

df['Return'] = df['Close'].pct_change(1).shift(-1)
df['target'] = np.where(df.Return > 0, 1, 0)

df.tail()
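To make the labelling explicit: `pct_change(1).shift(-1)` places on each row the return realised over the *following* day, so the target is 1 when the next close is higher than the current one. Below is a minimal pure-Python sketch of the same logic. The `closes` values and the variable names are made up for illustration and are not part of the dataset.

```python
# Hypothetical closing prices, for illustration only
closes = [100.0, 102.0, 101.0, 101.0, 105.0]

# Next-day return for each day t: closes[t + 1] / closes[t] - 1.
# The last day has no next close, so its return is undefined (None),
# which mirrors the NaN produced by shift(-1).
next_returns = [closes[t + 1] / closes[t] - 1 for t in range(len(closes) - 1)]
next_returns.append(None)

# Label: 1 if the next-day return is strictly positive, else 0
targets = [1 if r is not None and r > 0 else 0 for r in next_returns]

print(targets)
```

Note that the last row receives the label 0 only because its return is missing; in the notebook this row is removed later by `dropna()`.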

Remove NaN values and prepare the data for training.

In [None]:
df = df.dropna()

## feature variables
predictors_list = ['ADX', 'RSI', 'SMA']
X = df[predictors_list]
X.tail()

## target variable
y = df.target
y.tail()

### 3. Split the data into train and test dataset

Split the data into a train and a test set. 
* `stratify=y` keeps the proportion of 1s and 0s in the train and test sets the same as in the full dataset. It preserves the existing class balance; it does not rebalance the data.
* by default, `shuffle=True` randomly assigns observations (rows) to each set, with the random seed set using `random_state`.
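The idea behind stratification can be sketched in a few lines of plain Python: shuffle the indices of each class separately and take the same fraction of each class for the test set. This is only an illustration of the principle, not scikit-learn's actual implementation; the function `stratified_split` and the example labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(test_size * len(indices))  # same fraction from each class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

def share_of_ones(labels, idx):
    """Fraction of observations labelled 1 among the given indices."""
    return sum(labels[i] for i in idx) / len(idx)

# Hypothetical imbalanced labels: 70% zeros, 30% ones
labels = [0] * 70 + [1] * 30
train_idx, test_idx = stratified_split(labels)

print(share_of_ones(labels, train_idx), share_of_ones(labels, test_idx))
```

Both sets end up with the same share of 1s as the full dataset, which is what `stratify=y` aims for.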

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
print('Percentage of 1s in the train and test sets: %.2f and %.2f' % (np.mean(y_train)*100, np.mean(y_test)*100))

**Do you see any issues with the above train / test splitting?**

### 4. Fit a decision tree model on train data

We will use [scikit-learn](https://scikit-learn.org/), a great ML library to know: it is simple to use and provides many standard algorithms.

In [None]:
from sklearn.tree import DecisionTreeClassifier

## create the model
dtc = DecisionTreeClassifier(criterion='gini', max_depth=3, min_samples_leaf=5) 

## train the model
dtc = dtc.fit(X_train, y_train)
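With `criterion='gini'`, the tree compares candidate splits by the Gini impurity of the resulting child nodes, where the impurity of a node is $1 - \sum_k p_k^2$ and $p_k$ is the share of class $k$ in the node. The sketch below computes this quantity for toy label lists. The helper names `gini` and `split_impurity` are illustrative, not scikit-learn functions.

```python
def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    proportions = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(p ** 2 for p in proportions)

def split_impurity(left, right):
    """Average impurity of the two children, weighted by their sizes."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = [0, 0, 0, 1, 1, 1]
print(gini(parent))                          # 0.5 (maximally mixed, two classes)
print(split_impurity([0, 0, 0], [1, 1, 1]))  # 0.0 (perfect split)
print(split_impurity([0, 0, 1], [0, 1, 1]))  # a weaker split, still impure
```

A split is more attractive the lower its weighted impurity is; `max_depth` and `min_samples_leaf` then limit how far the tree keeps splitting.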

In [None]:
## uncomment the line below for help
# help(DecisionTreeClassifier)

### 5. Visualize the decision tree model

In [None]:
#!sudo apt install graphviz
#!pip install graphviz

In [None]:
from sklearn import tree
import graphviz

dot_data = tree.export_graphviz(dtc, out_file=None, filled=True, feature_names=predictors_list)   
graphviz.Source(dot_data) 

### 6. Make predictions and evaluate the performance

In [None]:
## make predictions on the train and test sets
y_hat_train = dtc.predict(X_train)
y_hat_test = dtc.predict(X_test)

Compare performance on train and test sets.

In [None]:
from sklearn.metrics import classification_report

print('Train set report:\n', classification_report(y_train, y_hat_train))
print('Test set report:\n', classification_report(y_test, y_hat_test))
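To read the report, it helps to recall how precision, recall and the F1 score are obtained from the counts of true positives, false positives and false negatives. Here is a minimal sketch for a single class, using made-up labels; the function `binary_metrics` is a hypothetical helper, not part of scikit-learn.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical labels and predictions
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 1, 0, 0, 1]

precision, recall, f1 = binary_metrics(y_true_toy, y_pred_toy)
print(precision, recall, f1)
```

The report gives these quantities for each class in turn, together with the support, i.e. the number of true observations in that class.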

### Exercise

Implement some more advanced feature engineering techniques and data labelling techniques discussed in the course.

## Boosted trees

We will use a boosted trees model prepared with [TensorFlow Estimator](https://www.tensorflow.org/guide/estimator).

Why? It lets us build a model following a high-level logic.

![tf-estimator](https://files.virgool.io/upload/users/11692/posts/t1molsna5wnn/mvr6hysy4acc.png)

We will follow the model construction from [this tutorial](https://www.tensorflow.org/tutorials/estimator/boosted_trees).
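Before building the TensorFlow model, here is a rough sketch of the boosting principle itself: trees are added one at a time, each one fitted to the errors left by the current ensemble. For simplicity the sketch uses squared error, depth-1 trees (stumps) and a single feature, whereas a boosted trees classifier would typically use a classification loss. All function names and the toy data are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a depth-1 regression tree by minimising the squared error."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean

def boost(x, y, n_trees=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    preds = [sum(y) / len(y)] * len(y)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        preds = [pi + learning_rate * predict_stump(stump, xi)
                 for xi, pi in zip(x, preds)]
    return preds

def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)

# Toy data: one feature, binary target
x_toy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_toy = [0.0, 0.0, 1.0, 1.0, 0.0, 1.0]

preds_1 = boost(x_toy, y_toy, n_trees=1)
preds_20 = boost(x_toy, y_toy, n_trees=20)
print(mse(y_toy, preds_1), mse(y_toy, preds_20))
```

The training error keeps shrinking as trees are added, which is also why boosted models can overfit the training set if the number of trees is not controlled.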

### Make the input function

In [None]:
import tensorflow as tf

NUMERIC_COLUMNS = ['ADX', 'RSI', 'SMA']
CATEGORICAL_COLUMNS = [] ## not used here

fc = tf.feature_column

## add numerical features
feature_columns = []
for feature_name in NUMERIC_COLUMNS:
    feature_columns.append(fc.numeric_column(feature_name, dtype=tf.float32))

## map classes to one-hot vectors
def one_hot_cat_column(feature_name, vocab):
    return fc.indicator_column(
        fc.categorical_column_with_vocabulary_list(feature_name, vocab))

## add categorical features
for feature_name in CATEGORICAL_COLUMNS:
    vocabulary = df[feature_name].unique()
    feature_columns.append(one_hot_cat_column(feature_name, vocabulary))

What is one-hot encoding?

In [None]:
import pandas as pd

## Example of one-hot encoding
example = {'country': pd.Series(['Switzerland'])}
## use a list (not a set) so that the order of the vocabulary is deterministic
class_fc = one_hot_cat_column('country', ['US', 'China', 'Switzerland'])

print('Feature value: "{}"'.format(example['country'].iloc[0]))
print('One-hot encoded: ', tf.keras.layers.DenseFeatures([class_fc])(example).numpy())   
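The same idea can be written without TensorFlow: each category is mapped to a vector of zeros with a single 1 at the position of that category in the vocabulary. The helper `one_hot` below is a hypothetical illustration, not a TensorFlow function.

```python
def one_hot(value, vocabulary):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if value == item else 0 for item in vocabulary]

vocabulary = ['US', 'China', 'Switzerland']  # the order fixes the vector layout

print(one_hot('Switzerland', vocabulary))  # [0, 0, 1]
print(one_hot('China', vocabulary))        # [0, 1, 0]
```

In this sketch, a value that is not in the vocabulary is encoded as a vector of zeros.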

In [None]:
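## input_fn() maker
def make_input_fn(X, y, n_epochs=None, shuffle=True, batch_size=None):
    ## by default, use the whole dataset passed in as a single batch
    ## (a `batch_size=len(y)` default would be evaluated only once, at
    ## definition time, on the global `y` rather than on the argument)
    if batch_size is None:
        batch_size = len(y)
    def input_fn():
        dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))
        if shuffle:
            dataset = dataset.shuffle(batch_size)
        ## n_epochs=None repeats the dataset indefinitely
        dataset = dataset.repeat(n_epochs)
        dataset = dataset.batch(batch_size)
        return dataset
    return input_fn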

# Training and evaluation input functions
train_input_fn = make_input_fn(X_train, y_train)
test_input_fn = make_input_fn(X_test, y_test, shuffle=False, n_epochs=1)

### Train and evaluate the model

We will use the pre-canned TensorFlow [boosted tree](https://www.tensorflow.org/api_docs/python/tf/estimator/BoostedTreesClassifier) estimator.

In [None]:
# Boosted trees
nbpl = int(np.ceil(0.5 * len(y_train) / 128))
btc = tf.estimator.BoostedTreesClassifier(feature_columns,
                                          n_batches_per_layer=nbpl)

In [None]:
# Train model
btc.train(train_input_fn, max_steps=1000)

In [None]:
# Evaluate
result = btc.evaluate(test_input_fn)
print(pd.Series(result))

In [None]:
# Train data: build a new unshuffled, single-epoch input function; with
# train_input_fn, evaluation would keep running over the endlessly repeated batches
train_input_fn_2 = make_input_fn(X_train, y_train, shuffle=False, n_epochs=1)
results_train = btc.evaluate(train_input_fn_2)

# Test data
results_test = btc.evaluate(test_input_fn)

print('Accuracy (train data): ', results_train['accuracy'])
print('Dummy model (train data): ', results_train['accuracy_baseline'])
print('Accuracy (test data): ', results_test['accuracy'])
print('Dummy model (test data): ', results_test['accuracy_baseline'])
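The "dummy model" is a useful reference point. Assuming it is a model that always predicts the most frequent class, its accuracy is simply the share of that class in the data. A small sketch with made-up labels (the function name is hypothetical):

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class in y_true."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

y_example = [1, 1, 1, 0, 0]  # hypothetical labels: 60% ones
print(majority_baseline_accuracy(y_example))  # 0.6
```

A classifier is only informative if its accuracy is clearly above this baseline on the test set.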

In [None]:
## make predictions, and generate reports as above
preds_train = list(btc.predict(train_input_fn_2))
preds_test = list(btc.predict(test_input_fn))
y_hat_train = [pred['class_ids'][0] for pred in preds_train]
y_hat_test = [pred['class_ids'][0] for pred in preds_test]

print('Train set report:\n', classification_report(y_train, y_hat_train))
print('Test set report:\n', classification_report(y_test, y_hat_test))

### Exercise

Try more advanced feature engineering techniques and data labelling.

Reflect on, and try out, what could be done to better align performance on the train and test sets.

Compare your model performance to a simple linear model using the TensorFlow [linear classifier](https://www.tensorflow.org/api_docs/python/tf/estimator/LinearClassifier) estimator.

```python
## Linear classifier
lc = tf.estimator.LinearClassifier(feature_columns)
```