# Simple decision trees

## Load X,y data from NPZ

Using the function added to the *mylib.py* file, it is now easy to load the data and get X/y vectors ready for model training and tuning.

In [1]:
# Run content of mylib.py file
%run mylib.py

# Load data from NPZ file
#data=loadNpz()
(data, X, y)=loadXy()

Loading 'train' set
  loading  data
     shape: (281, 299, 299, 3) - dtype: float64
  loading  features
     shape: (281, 2048) - dtype: float64
  loading  filenames
     shape: (281,) - dtype: <U48
  loading  labels
     shape: (281,) - dtype: int32


Loading 'test' set
  loading  data
     shape: (50, 299, 299, 3) - dtype: float64
  loading  features
     shape: (50, 2048) - dtype: float64
  loading  filenames
     shape: (50,) - dtype: <U30
  loading  labels
     shape: (50,) - dtype: int32


Loading 'valid' set
  loading  data
     shape: (139, 299, 299, 3) - dtype: float64
  loading  features
     shape: (139, 2048) - dtype: float64
  loading  filenames
     shape: (139,) - dtype: <U30
  loading  labels
     shape: (139,) - dtype: int32


building 'trainX' set
  building  data
     shape: (420, 299, 299, 3) - dtype: float64
  building  features
     shape: (420, 2048) - dtype: float64
  building  filenames
     shape: (420,) - dtype: <U48
  building  labels
     shape: (420,) - dt
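
The shapes printed above suggest that the 'trainX' set is the 'train' and 'valid' sets stacked together (281 + 139 = 420 samples). The sketch below shows one way such a merge could be done with NumPy. It is not the actual code of *mylib.py*: the helper names `load_split` and `merge_splits`, the file names, the reduced feature size and the number of classes are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np


# Hypothetical helper: load one split from an NPZ file into a dict of arrays.
def load_split(path):
    with np.load(path) as npz:
        return {key: npz[key] for key in npz.files}


# Hypothetical helper: concatenate two splits key by key along the sample axis.
def merge_splits(a, b):
    return {key: np.concatenate([a[key], b[key]]) for key in a}


# Small synthetic stand-ins for the real files: same sample counts as above,
# but only 8 features and an arbitrary number of classes (6).
rng = np.random.default_rng(0)
tmp_dir = tempfile.mkdtemp()
for name, n_samples in [('train', 281), ('valid', 139)]:
    np.savez(
        os.path.join(tmp_dir, name + '.npz'),
        features=rng.normal(size=(n_samples, 8)),
        labels=rng.integers(0, 6, size=n_samples).astype(np.int32),
    )

train = load_split(os.path.join(tmp_dir, 'train.npz'))
valid = load_split(os.path.join(tmp_dir, 'valid.npz'))
trainX = merge_splits(train, valid)

print(trainX['features'].shape, trainX['labels'].shape)
```

Merging the two sets makes sense here because GridSearchCV performs its own cross-validation splits, so a separate validation set is not needed during the search.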

## Four different classifiers

In this Notebook, I will try four different types of classifiers, two tree-based models and two support vector machines:

* DecisionTreeClassifier
* RandomForestClassifier
* LinearSVC, which is similar to SVC(kernel='linear')
* SVC(kernel='rbf')

For each of them, I'll run a grid search to fine-tune the hyperparameters. I'll also save the model with the best hyperparameters to disk, so that I can reuse it at the end of this project.

The grid search parameters will include PCA on the data (I'll build pipelines to do so). I've decided to vary the number of PCA components according to the share of total variance they explain:

* 90% => 128 components
* 80% => 76 components
* 50% => 16 components

> Note: the values above are taken from the results available in Notebook No 02
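
As a reminder of how such component counts can be obtained, here is a minimal sketch that fits a full PCA and looks for the smallest number of components whose cumulative explained variance reaches a target. It runs on synthetic data (the `X_demo` array is an assumption, not the real features), so the counts it prints will differ from the 128 / 76 / 16 values listed above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data standing in for the real 2048-dimensional features (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 50)) * np.linspace(5.0, 0.1, 50)

# Fit a full PCA and accumulate the explained variance ratio.
pca = PCA(n_components=None).fit(X_demo)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches each target.
needed = {}
for target in (0.5, 0.8, 0.9):
    needed[target] = int(np.searchsorted(cumulative, target) + 1)
    print(f'{target:.0%} => {needed[target]} components')
```

scikit-learn's PCA also accepts a float between 0 and 1 for `n_components` (with `svd_solver='full'`) to select the components by explained variance directly. I keep explicit integers in the grid so that the results table stays easy to read.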

In the next cell, I will import the needed libraries and define some constants that I will use across the Notebook.

In [2]:
import pandas as pd

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

# cv parameter of the GridSearchCV object
CV=5

# max tree depth to use in the model
MAX_DEPTH=3

## DecisionTreeClassifier

First, I will perform a grid search on some hyperparameters with a fixed *max_depth* value of 3, in order to answer these questions:

* What accuracy can you achieve with a depth of 3?
* Plot the corresponding tree with graphviz
* Do you get better results if you reduce the number of dimensions with PCA first?

Second, I'll redo the grid search using different values for the *max_depth* parameter and see what happens. If I get better results, I'll use that model for the evaluation and the comparison with the others.


### Work with fixed max_depth=3

In [3]:
from sklearn.tree import DecisionTreeClassifier


# Create the pipeline and fit it to training data
dt_pipe = Pipeline([
    ('pca', PCA(n_components=None)),
    ('dt', DecisionTreeClassifier(criterion='gini', max_depth=MAX_DEPTH, random_state=0))
    
])

# Here are the different parameters I will vary
dt_grid_param={
    'dt__criterion': ['gini', 'entropy'],
    'dt__max_depth': [3],
    'pca__n_components': [None, 128, 76, 16]
}

# Build the GridSearchCV object using verbose and parallel execution options
# Note: the iid parameter was removed in scikit-learn 0.24; drop iid=True on recent versions
grid_dt=GridSearchCV(dt_pipe, dt_grid_param, cv=CV, refit=True, return_train_score=True, verbose=True, n_jobs=-1, iid=True)


# Fit the model
grid_dt.fit(X['trainX'], y['trainX'])


Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:    6.5s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('dt', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=...        min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best'))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'dt__criterion': ['gini', 'entropy'], 'dt__max_depth': [3], 'pca__n_components': [None, 128, 76, 16]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=True)

In [4]:
# Display the 5 best mean test scores
columns=[
    'param_dt__criterion', 'param_dt__max_depth', 'param_pca__n_components', 'mean_test_score', 'std_test_score', 'mean_train_score'
]
pd.DataFrame(grid_dt.cv_results_).sort_values('mean_test_score', ascending=False)[columns].head(5)


| | param_dt__criterion | param_dt__max_depth | param_pca__n_components | mean_test_score | std_test_score | mean_train_score |
|---|---|---|---|---|---|---|
| 4 | entropy | 3 | None | 0.809524 | 0.038771 | 0.853656 |
| 5 | entropy | 3 | 128.0 | 0.809524 | 0.038771 | 0.853656 |
| 6 | entropy | 3 | 76.0 | 0.809524 | 0.038771 | 0.853656 |
| 7 | entropy | 3 | 16.0 | 0.809524 | 0.038771 | 0.854256 |
| 0 | gini | 3 | None | 0.778571 | 0.018679 | 0.792872 |


In [5]:
best_estimator=grid_dt.best_estimator_

best_estimator.score(X['test'], y['test'])

# saveModel(dt, 'decision-tree')


0.8

In [6]:
from sklearn.tree import export_graphviz

# Export decision tree
#dot_data = export_graphviz(
#    dt, out_file=None,
#    class_names=data['class_name'],
#    filled=True, rounded=True, proportion=True   
#)
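
The export above is commented out because `dt` is not defined in this Notebook: the tree now lives inside a pipeline. A possible way around this, sketched below on the Iris dataset as a stand-in for my own data, is to pull the fitted tree out of the pipeline with `named_steps` before calling `export_graphviz`. The pipeline and variable names in this sketch are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Iris is a stand-in dataset; the Notebook's own data is not used here.
iris = load_iris()

pipe = Pipeline([
    ('pca', PCA(n_components=2)),
    ('dt', DecisionTreeClassifier(max_depth=3, random_state=0)),
])
pipe.fit(iris.data, iris.target)

# export_graphviz expects the fitted tree itself, not the whole pipeline.
tree = pipe.named_steps['dt']

# With out_file=None, the DOT source is returned as a string.
dot_data = export_graphviz(
    tree, out_file=None,
    class_names=list(iris.target_names),
    filled=True, rounded=True, proportion=True,
)
print(dot_data[:200])
```

Keep in mind that when PCA is part of the pipeline, the splits shown in the tree refer to principal components rather than to the original features. The DOT string can then be rendered with the graphviz tools if they are installed.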


### *max_depth* as a hyperparameter of the grid search

In [7]:
# Here are the different parameters I will vary
dt_grid_param['dt__max_depth']=[3, 4, 5, 6, 7, 8, 9, 10]

# Build the GridSearchCV object using verbose and parallel execution options
# Note: the iid parameter was removed in scikit-learn 0.24; drop iid=True on recent versions
grid_dt=GridSearchCV(dt_pipe, dt_grid_param, cv=CV, refit=True, return_train_score=True, verbose=True, n_jobs=-1, iid=True)

# Fit the model
grid_dt.fit(X['trainX'], y['trainX'])

Fitting 5 folds for each of 64 candidates, totalling 320 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    5.9s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   28.6s
[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed:   47.0s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('dt', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=...        min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best'))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'dt__criterion': ['gini', 'entropy'], 'dt__max_depth': [3, 4, 5, 6, 7, 8, 9, 10], 'pca__n_components': [None, 128, 76, 16]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=True)
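
The number of candidates reported by GridSearchCV is simply the product of the number of values for each parameter, and the number of fits is that count multiplied by the number of folds. A quick check with `ParameterGrid` confirms the figures printed for both searches. The two dictionary names below are introduced only for this sketch.

```python
from sklearn.model_selection import ParameterGrid

CV = 5

# Same grids as in the two searches of this section.
grid_fixed_depth = {
    'dt__criterion': ['gini', 'entropy'],
    'dt__max_depth': [3],
    'pca__n_components': [None, 128, 76, 16],
}
grid_varied_depth = dict(grid_fixed_depth)
grid_varied_depth['dt__max_depth'] = [3, 4, 5, 6, 7, 8, 9, 10]

for name, grid in [('fixed depth', grid_fixed_depth), ('varied depth', grid_varied_depth)]:
    n_candidates = len(ParameterGrid(grid))
    print(f'{name}: {n_candidates} candidates, {n_candidates * CV} fits')
```

This prints 8 candidates (40 fits) for the fixed depth and 64 candidates (320 fits) once *max_depth* varies, which matches the logs.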

In [8]:
# Display the 5 best mean test scores
columns=[
    'param_dt__criterion', 'param_dt__max_depth', 'param_pca__n_components', 'mean_test_score', 'std_test_score', 'mean_train_score'
]
pd.DataFrame(grid_dt.cv_results_).sort_values('mean_test_score', ascending=False)[columns].head(5)


| | param_dt__criterion | param_dt__max_depth | param_pca__n_components | mean_test_score | std_test_score | mean_train_score |
|---|---|---|---|---|---|---|
| 30 | gini | 10 | 76 | 0.895238 | 0.036078 | 0.999408 |
| 25 | gini | 9 | 128 | 0.892857 | 0.031901 | 0.998225 |
| 26 | gini | 9 | 76 | 0.890476 | 0.041704 | 0.998225 |
| 13 | gini | 6 | 128 | 0.890476 | 0.038211 | 0.967856 |
| 22 | gini | 8 | 76 | 0.890476 | 0.039482 | 0.99347 |
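
With deeper trees, the mean train scores get very close to 1.0 while the cross-validated scores stay around 0.89, which may be a sign of overfitting. One simple way to look at it is to compute the gap between both scores. The sketch below does so on the five rows displayed above, with the values copied by hand and shortened column names chosen for this example.

```python
import pandas as pd

# Values copied from the top 5 results displayed above.
top5 = pd.DataFrame({
    'max_depth': [10, 9, 9, 6, 8],
    'n_components': [76, 128, 76, 128, 76],
    'mean_test_score': [0.895238, 0.892857, 0.890476, 0.890476, 0.890476],
    'mean_train_score': [0.999408, 0.998225, 0.998225, 0.967856, 0.993470],
})

# Difference between the training score and the cross-validated score.
top5['gap'] = top5['mean_train_score'] - top5['mean_test_score']
print(top5.sort_values('gap'))
```

Among these five candidates, the one with *max_depth* 6 and 128 components has the smallest gap (about 0.077), for a cross-validated score that is within one standard deviation of the best one.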


In [9]:
grid_dt.best_estimator_.score(X['test'], y['test'])

0.84

In [13]:
grid_dt.best_estimator_.get_params()

{'memory': None,
 'steps': [('pca',
   PCA(copy=True, iterated_power='auto', n_components=76, random_state=None,
     svd_solver='auto', tol=0.0, whiten=False)),
  ('dt',
   DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
               max_features=None, max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, presort=False, random_state=0,
               splitter='best'))],
 'pca': PCA(copy=True, iterated_power='auto', n_components=76, random_state=None,
   svd_solver='auto', tol=0.0, whiten=False),
 'dt': DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
             max_features=None, max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, presort=False, random_state=0,
        

In [11]:
saveModel(grid_dt.best_estimator_, 'decision-tree')

Saving model decision-tree to model-decision-tree.sav
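
For reference, here is a minimal sketch of what a helper like `saveModel` could look like, together with a matching loader. The real function is defined in *mylib.py* and may work differently; the names `save_model` and `load_model`, the `directory` argument and the use of pickle are assumptions. Only the `model-<name>.sav` naming is taken from the message printed above.

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier


# Hypothetical re-implementation; the real saveModel lives in mylib.py.
def save_model(model, name, directory='.'):
    path = os.path.join(directory, f'model-{name}.sav')
    print(f'Saving model {name} to {os.path.basename(path)}')
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    return path


# Hypothetical counterpart used to reload a saved model.
def load_model(name, directory='.'):
    path = os.path.join(directory, f'model-{name}.sav')
    with open(path, 'rb') as f:
        return pickle.load(f)


# Round trip with a toy model written to a temporary directory.
toy = DecisionTreeClassifier(max_depth=1, random_state=0).fit([[0], [1]], [0, 1])
out_dir = tempfile.mkdtemp()
save_model(toy, 'decision-tree', out_dir)
restored = load_model('decision-tree', out_dir)
print(restored.predict([[0], [1]]))
```

A pickled model should be reloaded with the same scikit-learn version that was used to save it, and pickle files should only be loaded from a trusted source.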
