# Decision Tree

**Basic Description**

A decision tree is a model that can be drawn as a tree: each internal node applies a binary split on a single feature, and each leaf node holds a prediction. Decision trees are flexible and often perform well in practice for both classification and regression. Training is greedy and recursive: at each node the algorithm picks the split that most reduces error (for classification, an impurity measure such as Gini), then repeats the process on each resulting subset.
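To make the greedy step concrete, here is a minimal pure-Python sketch of choosing one split by weighted Gini impurity. The helper names (`gini`, `best_split`) and the toy data are illustrative assumptions, not part of this notebook or of scikit-learn's implementation, which is considerably more optimized.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_impurity) for the best binary split."""
    best = (None, None, gini(labels))  # baseline: impurity with no split
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # candidate thresholds: midpoints between consecutive distinct values
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best

# toy data (hypothetical): feature 0 separates the classes, feature 1 is noise
rows = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
labels = [0, 0, 1, 1]
feature, threshold, impurity = best_split(rows, labels)
print(feature, threshold, impurity)
```

A full tree would call `best_split` again on the left and right subsets until a stopping rule (for example a maximum depth or minimum leaf size) is reached.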

**Bias-Variance Tradeoff**

An unconstrained tree is highly flexible, which gives it low bias but high variance.
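One way to see this tradeoff is to compare a fully grown tree with a depth-limited one on held-out data. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption; it is not this notebook's dataset), so the exact scores say nothing about the model fitted here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data with some label noise (assumption, not the notebook's data)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# unconstrained tree vs. a tree limited to depth 4
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

for name, model in [("unconstrained", full), ("max_depth=4", shallow)]:
    print(f"{name:>14}: depth={model.get_depth():2d} "
          f"train={model.score(X_tr, y_tr):.3f} test={model.score(X_te, y_te):.3f}")
```

The unconstrained tree memorizes the training set, so a large gap between its train and test scores is the usual symptom of high variance.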

**Upsides**

**Downsides**

Risk of overfitting, especially when the tree is grown without constraints such as `max_depth` or `min_samples_leaf`

**Other Notes**

## Load Packages and Prep Data

In [3]:
# custom utils
from utils import custom
from utils.cf_matrix import make_confusion_matrix

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

In [4]:
# load data
X_train, y_train, X_test, y_test = custom.load_data()

X_train (62889, 42)
y_train (62889,)
X_test (15723, 42)
y_test (15723,)


## Model 1
- Defaults

In [5]:
# fit decision tree with default hyperparameters
dt_1 = DecisionTreeClassifier()
_ = dt_1.fit(X_train, y_train)  # assign to _ to suppress the estimator repr in the cell output

In [6]:
# cross-validation scoring
dt_1_scores = custom.cv_metrics(dt_1, X_train, y_train)
dt_1_scores

accuracy     0.950
precision    0.624
recall       0.616
f1           0.618
dtype: float64
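`custom.cv_metrics` is a project-specific helper whose source is not shown here. A plausible sketch of what such a helper could do, assuming 5-fold cross-validation and the four metrics in the output above, is given below. The function name `cv_metrics_sketch`, the fold count, and the synthetic data are all assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

def cv_metrics_sketch(model, X, y, cv=5):
    """Hypothetical stand-in for custom.cv_metrics: mean CV score per metric."""
    scoring = ["accuracy", "precision", "recall", "f1"]
    results = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # cross_validate stores each metric under the key "test_<metric>"
    return pd.Series({m: results[f"test_{m}"].mean() for m in scoring}).round(3)

# imbalanced synthetic data (assumption, not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
demo_scores = cv_metrics_sketch(DecisionTreeClassifier(random_state=0), X_demo, y_demo)
print(demo_scores)
```

Reporting precision, recall, and F1 alongside accuracy matters here: an accuracy of 0.950 next to an F1 of 0.618 is consistent with a rare positive class, where accuracy alone would look better than the model really is.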

## Model 2
- Grid search hyperparameters

In [None]:
# define grid to search
param_grid={
        "max_depth":[*range(3,16,3)]
        ,"min_samples_leaf":[*range(5,16,5)]
}

# instance
gs = GridSearchCV(DecisionTreeClassifier()
                ,param_grid
                ,scoring='f1'
                ,cv=5
                ,n_jobs=-1
                #,verbose=1
    )

# search and fit
gs.fit(X_train, y_train)

# best params and score
print(gs.best_params_)
print(gs.best_score_)

# store best model
dt_2 = gs.best_estimator_

In [None]:
# cross-validation scoring
dt_2_scores = custom.cv_metrics(dt_2, X_train, y_train)
dt_2_scores
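Beyond `best_params_` and `best_score_`, the fitted search object keeps per-combination results in `cv_results_`, which helps judge how sensitive the score is to each hyperparameter. The sketch below reuses the same grid on synthetic data (an assumption, so that the example runs on its own); the names `gs_demo`, `X_demo`, and `y_demo` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (assumption, not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=600, random_state=0)

# same grid as above: 5 depths x 3 leaf sizes = 15 combinations
param_grid = {"max_depth": [*range(3, 16, 3)], "min_samples_leaf": [*range(5, 16, 5)]}
gs_demo = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                       scoring="f1", cv=5)
gs_demo.fit(X_demo, y_demo)

# one row per hyperparameter combination, best-ranked first
results = pd.DataFrame(gs_demo.cv_results_)
cols = ["param_max_depth", "param_min_samples_leaf",
        "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score").head())
```

If the top few rows have nearly identical mean scores, the simpler setting (shallower tree, larger leaves) is often a reasonable choice.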

## Test

In [None]:
# test the performance of the selected model
y_pred = dt_2.predict(X_test)
# pred_metrics(y_test, y_pred)

# confusion matrix
cm = confusion_matrix(y_test, y_pred)
labels = ['True Neg','False Pos','False Neg','True Pos']
categories = ['Soil', 'Stone']
make_confusion_matrix(cm
                      ,group_names=labels
                      ,categories=categories
                      ,cmap='Blues'
                      ,count=False
                      ,title='Decision Tree')

## Visualize Tree
- Take a look at the decision tree

In [1]:
# text representation

# text_representation = tree.export_text(dt_2)  # dt_2 is the selected model; no dt_3 is defined in this notebook
# print(text_representation)

In [2]:
# diagram

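# fig = plt.figure(figsize=(25,20))
# _ = tree.plot_tree(dt_2,  # dt_2 is the selected model; no dt_3 is defined in this notebook
#                    feature_names=list(X_train.columns),
#                    class_names=['soil', 'stone'], # must follow dt_2.classes_ (ascending); assumes 0 = soil, 1 = stone as in the confusion matrix categories
#                    filled=True)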
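The class ordering question above can be checked directly: scikit-learn expects `class_names` in the same ascending order as the fitted model's `classes_` attribute. Below is a small runnable sketch of the text export on a shallow tree, using synthetic data and made-up feature names (both assumptions) rather than this notebook's model.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# synthetic stand-in data and hypothetical feature names
X_demo, y_demo = make_classification(n_samples=200, n_features=4, n_informative=2,
                                     n_redundant=0, random_state=0)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

# keep the tree shallow so the printed rules stay readable
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# classes_ gives the label order that class_names must follow
print(small_tree.classes_)

text_representation = export_text(small_tree, feature_names=feature_names)
print(text_representation)
```

For the notebook's own model, printing `dt_2.classes_` before calling `tree.plot_tree` would confirm which label corresponds to soil and which to stone.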