# Simple decision trees

## Load X,y data from NPZ

Using the function added to *mylib.py*, it is now easy to load the data and get X/y vectors that are ready for model training and tuning.

In [3]:
# Run content of mylib.py file
%run mylib.py

# Load data from NPZ file
#data=loadNpz()
(data, X, y)=loadXy()

Loading 'train' set
  loading  data
     shape: (281, 299, 299, 3) - dtype: float64
  loading  features
     shape: (281, 2048) - dtype: float64
  loading  filenames
     shape: (281,) - dtype: <U46
  loading  labels
     shape: (281,) - dtype: int32


Loading 'test' set
  loading  data
     shape: (51, 299, 299, 3) - dtype: float64
  loading  features
     shape: (51, 2048) - dtype: float64
  loading  filenames
     shape: (51,) - dtype: <U50
  loading  labels
     shape: (51,) - dtype: int32


Loading 'valid' set
  loading  data
     shape: (139, 299, 299, 3) - dtype: float64
  loading  features
     shape: (139, 2048) - dtype: float64
  loading  filenames
     shape: (139,) - dtype: <U30
  loading  labels
     shape: (139,) - dtype: int32


building 'trainX' set
  building  data
     shape: (420, 299, 299, 3) - dtype: float64
  building  features
     shape: (420, 2048) - dtype: float64
  building  filenames
     shape: (420,) - dtype: <U46
  building  labels
     shape: (420,) - dt
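
The shapes in the log suggest that 'trainX' is the 'train' and 'valid' sets stacked together (281 + 139 = 420 rows). The code of `loadXy()` is not shown in this Notebook, so the following is only a minimal sketch of such a merge, assuming a row-wise `np.concatenate`. The helper name `merge_sets`, the random arrays, and the number of classes are hypothetical stand-ins for the real data.

```python
import numpy as np


def merge_sets(first, second):
    """Stack two dicts of arrays row-wise, key by key (hypothetical helper)."""
    return {key: np.concatenate([first[key], second[key]], axis=0)
            for key in first}


rng = np.random.default_rng(0)

# Random stand-ins with the same row counts and dtypes as the log above;
# the label range (0..5) is arbitrary, not taken from the real dataset
train = {
    'features': rng.normal(size=(281, 2048)),
    'labels': rng.integers(0, 6, size=281).astype(np.int32),
}
valid = {
    'features': rng.normal(size=(139, 2048)),
    'labels': rng.integers(0, 6, size=139).astype(np.int32),
}

trainX = merge_sets(train, valid)
print(trainX['features'].shape)  # (420, 2048)
print(trainX['labels'].shape)    # (420,)
```

Merging the two sets makes sense when cross-validation is used for tuning, since the grid search then creates its own validation folds.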

## Four different classifiers

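In this Notebook, I will try four different types of classifiers (two tree-based models and two support vector machines):

* DecisionTreeClassifier
* RandomForestClassifier
* LinearSVC, a linear SVM similar to SVC(kernel='linear') but with a different underlying implementation
* SVC(kernel='rbf')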
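
For reference, the list above maps to the following scikit-learn classes. This is only a sketch: the dictionary keys and the hyperparameter values are placeholders chosen for illustration, not the tuned values from the grid searches.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder settings; the grid searches decide the real values
models = {
    'dt': DecisionTreeClassifier(max_depth=3, random_state=0),
    'rf': RandomForestClassifier(n_estimators=100, random_state=0),
    'linear_svc': LinearSVC(random_state=0),
    'svc_rbf': SVC(kernel='rbf'),
}

for name, model in models.items():
    print(name, type(model).__name__)
```

All four share the same `fit`/`predict` interface, so each one can be dropped into the same kind of `Pipeline` and `GridSearchCV` setup.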

For each of them, I'll run a grid search to fine-tune the hyperparameters. I'll also save each model with its best hyperparameters to disk, so that I can reuse them at the end of this project.
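
A minimal sketch of the save-and-reload step, assuming `joblib` is used for persistence (the Notebook does not show which tool it uses). The synthetic dataset, the tiny grid, and the file name `model-dt.joblib` are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset standing in for X['trainX'], y['trainX']
X_demo, y_demo = make_classification(n_samples=60, n_features=8, random_state=0)

# Tiny grid search; refit=True (the default) retrains the best candidate
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), {'max_depth': [2, 3]}, cv=3)
search.fit(X_demo, y_demo)

# Persist the refitted best estimator (hypothetical file name and location)
model_path = os.path.join(tempfile.gettempdir(), 'model-dt.joblib')
joblib.dump(search.best_estimator_, model_path)

# Reload it later and check that it predicts like the original
restored = joblib.load(model_path)
same = (restored.predict(X_demo) == search.predict(X_demo)).all()
print(same)
```

Saving `best_estimator_` rather than the whole `GridSearchCV` object keeps the file small, at the cost of losing `cv_results_`.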

In the next cell, I import the needed libraries and define some constants that I will use across the Notebook.

In [23]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# cv parameter of the GridSearchCV object
CV=5

# max tree depth to use in the model
MAX_DEPTH=3

## Simple Decision Tree

In [41]:
from sklearn.tree import DecisionTreeClassifier

# Create decision tree
dt = DecisionTreeClassifier(
    criterion='gini', max_depth=MAX_DEPTH, random_state=0)

# Create the pipeline (it is fitted later, by the grid search)
dt_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=None)),
    ('dt', dt)
])

# Here are the different parameters I will vary
dt_grid_param={
    'scaler': [None, StandardScaler()],
    'dt__criterion': ['gini', 'entropy'],
    'dt__max_depth': [3, 5, 6, 7, 8, 9],
    'pca__n_components': [None, 128, 76, 16]
}

# Build the GridSearchCV object using verbose and parallel execution options
# (the `iid` argument is dropped here: recent scikit-learn releases no longer accept it)
grid_dt = GridSearchCV(
    dt_pipe, dt_grid_param, cv=CV, refit=True,
    return_train_score=True, verbose=True, n_jobs=-1)


# grid_dt.get_params().keys()  # lists the parameter names that can be tuned
# Fit the model
grid_dt.fit(X['trainX'], y['trainX'])


# Display the five candidates with the best mean test score
columns=[
    'param_scaler', 'param_dt__criterion', 'param_dt__max_depth', 'param_pca__n_components', 'mean_test_score', 'std_test_score', 'mean_train_score'
]
pd.DataFrame(grid_dt.cv_results_).sort_values('mean_test_score', ascending=False)[columns].head(5)


Fitting 5 folds for each of 96 candidates, totalling 480 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    1.8s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:    9.3s
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:   22.0s
[Parallel(n_jobs=-1)]: Done 480 out of 480 | elapsed:   25.7s finished


|    | param_scaler | param_dt__criterion | param_dt__max_depth | param_pca__n_components | mean_test_score | std_test_score | mean_train_score |
|----|--------------|---------------------|---------------------|-------------------------|-----------------|----------------|------------------|
| 76 | None | entropy | 7 | 76 | 0.883333 | 0.035642 | 0.993463 |
| 78 | None | entropy | 7 | 16 | 0.880952 | 0.04136 | 0.989868 |
| 23 | StandardScaler(copy=True, with_mean=True, with... | gini | 6 | 16 | 0.880952 | 0.024973 | 0.954749 |
| 39 | StandardScaler(copy=True, with_mean=True, with... | gini | 8 | 16 | 0.87381 | 0.036461 | 0.990469 |
| 47 | StandardScaler(copy=True, with_mean=True, with... | gini | 9 | 16 | 0.871429 | 0.02656 | 0.996416 |
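
The search log reports 96 candidates and 480 fits. Those numbers follow directly from the grid: 2 scalers × 2 criteria × 6 depths × 4 PCA settings = 96 combinations, each fitted on 5 folds. The sketch below reproduces that arithmetic with `ParameterGrid`, which enumerates the same combinations that `GridSearchCV` evaluates; the variable names are chosen for illustration.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import StandardScaler

# Same grid as dt_grid_param; 'step__param' keys address one pipeline step
param_grid = {
    'scaler': [None, StandardScaler()],
    'dt__criterion': ['gini', 'entropy'],
    'dt__max_depth': [3, 5, 6, 7, 8, 9],
    'pca__n_components': [None, 128, 76, 16],
}

candidates = list(ParameterGrid(param_grid))
n_fits = len(candidates) * 5  # CV=5 folds per candidate

print(len(candidates), n_fits)  # 96 480

# Candidate 76 is the row index of the best result in the table above
print(candidates[76])
```

The row index in `cv_results_` is the position of the candidate in this enumeration, which is how index 76 can be traced back to a specific parameter combination.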


In [None]:
from sklearn.tree import export_graphviz

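# Export the fitted tree from the best pipeline found by the grid search
# (`dt` itself is never fitted: GridSearchCV works on clones of the pipeline)
best_tree = grid_dt.best_estimator_.named_steps['dt']
dot_data = export_graphviz(
    best_tree, out_file=None,
    # feature_names and class_names are omitted: the tree sees PCA components,
    # and the class names of this dataset are not defined in this Notebook
    filled=True, rounded=True, proportion=True
)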
