<a href="https://colab.research.google.com/github/timcsmith/MIS536-Public/blob/master/Notebooks/Class08b_decision_tree_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Class08 - Prediction using Decision Tree (with Hyperparameter Tuning)

## Introduction and Overview



In this project, we reuse the census income data from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/census+income).

The first section of this exercise is a duplicate of the previous Class08-random-forest-default.ipynb. The second section demonstrates how we can use hyperparameter tuning techniques to identify the best-performing parameters for the given model (in this case, a decision tree).



# Section 1: Prediction with Decision Tree (using default parameters)



## Step 1: Install and import necessary packages

In [None]:
# import packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import numpy as np

## Step 2: Load, clean and prepare data


### 2.1 Read data (income.csv)

In [None]:
income_df = pd.read_csv("https://raw.githubusercontent.com/timcsmith/MIS536-Public/master/Data/income.csv", engine='python', delimiter=", ")

### 2.2 Explore the dataset

In [None]:
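# Explore the dataset
# show the first five rows of the dataset
print(income_df.head())
# list the column names
print(income_df.columns)
# summary statistics for the numeric columns
print(income_df.describe())
# column types and non-null counts (info() prints directly and returns None)
income_df.info()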

   age         workclass  fnlwgt  ... hours-per-week  native-country income
0   39         State-gov   77516  ...             40   United-States  <=50K
1   50  Self-emp-not-inc   83311  ...             13   United-States  <=50K
2   38           Private  215646  ...             40   United-States  <=50K
3   53           Private  234721  ...             40   United-States  <=50K
4   28           Private  338409  ...             40            Cuba  <=50K

[5 rows x 15 columns]
Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')
                age        fnlwgt  ...  capital-loss  hours-per-week
count  32561.000000  3.256100e+04  ...  32561.000000    32561.000000
mean      38.581647  1.897784e+05  ...     87.303830       40.437456
std       13.640433  1.055500e+05  ...    402.960219       12.

### 2.3 Clean/transform data (where necessary)

In [None]:
# based on findings from data exploration, we need to clean up column names, as some have leading whitespace characters
income_df.columns = [s.strip() for s in income_df.columns] 
income_df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')

In [None]:
# clean the dataset: sex is not numeric, so encode Male as 0 and Female as 1
income_df.sex = income_df.sex.replace("Male", 0, regex=True)
income_df.sex = income_df.sex.replace("Female", 1, regex=True)
income_df.sex

0        0
1        0
2        0
3        0
4        1
        ..
32556    1
32557    0
32558    1
32559    0
32560    1
Name: sex, Length: 32561, dtype: int64

In [None]:
# Transform the target (income) into integers. This is necessary if we later want to test precision and recall.
print(income_df.income.unique())
income_df.income = income_df.income.replace("<=50K", 0, regex=True)
income_df.income = income_df.income.replace(">50K", 1, regex=True)


## Step 3: Split data into training and validation sets

In [None]:
# construct datasets for analysis
target = 'income'
predictors = ['age', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week']
X = income_df[predictors]
y = income_df[target]
print(X)
print(y)

       age  sex  capital-gain  capital-loss  hours-per-week
0       39    0          2174             0              40
1       50    0             0             0              13
2       38    0             0             0              40
3       53    0             0             0              40
4       28    1             0             0              40
...    ...  ...           ...           ...             ...
32556   27    1             0             0              38
32557   40    0             0             0              40
32558   58    1             0             0              40
32559   22    0             0             0              20
32560   52    1         15024             0              40

[32561 rows x 5 columns]
0        0
1        0
2        0
3        0
4        0
        ..
32556    0
32557    1
32558    0
32559    0
32560    1
Name: income, Length: 32561, dtype: int64


In [None]:
# create the training set and the validation set
train_X, valid_X, train_y, valid_y = train_test_split(X,y, test_size=0.3, random_state=1)
print(train_X)
print(valid_X)

       age  sex  capital-gain  capital-loss  hours-per-week
16525   44    0             0             0              60
14551   22    1             0             0              30
518     21    1             0             0              35
22524   46    0             0             0              40
11425   17    0             0             0              20
...    ...  ...           ...           ...             ...
32511   25    1             0             0              40
5192    32    0         15024             0              45
12172   27    0             0             0              40
235     59    0             0             0              40
29733   33    0             0          1902              45

[22792 rows x 5 columns]
       age  sex  capital-gain  capital-loss  hours-per-week
9646    62    1             0             0              66
709     18    0             0             0              25
7385    25    0         27828             0              50
16671   33    

## Step 4: Create and train model


You can find details about the DecisionTree classifier [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

### 4.1 Create a decision tree using all of the default parameters

In [None]:
dtree=DecisionTreeClassifier()

### 4.2 Fit the model to the training data

In [None]:
dtree.fit(train_X, train_y)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

### 4.3 Review the performance of the model on the validation/test data

In [None]:
validation_predictions = dtree.predict(valid_X)

print(confusion_matrix(valid_y, validation_predictions))
print(accuracy_score(valid_y, validation_predictions))
print(precision_score(valid_y, validation_predictions))
print(recall_score(valid_y, validation_predictions))

[[7125  425]
 [1289  930]]
0.8245470365441704
0.6863468634686347
0.4191077061739522
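
The three scores follow directly from the confusion matrix. As a quick check, the sketch below recomputes them by hand from the four counts printed above, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes). The variable names are chosen here for illustration.

```python
# Counts copied from the confusion matrix printed above.
# Row 0 = actual class 0 (<=50K), row 1 = actual class 1 (>50K).
tn, fp = 7125, 425   # actual 0: predicted 0, predicted 1
fn, tp = 1289, 930   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp          # size of the validation set
accuracy = (tp + tn) / total       # share of all predictions that are correct
precision = tp / (tp + fp)         # share of predicted >50K that really are >50K
recall = tp / (tp + fn)            # share of actual >50K that the model found

print(total)                # 9769
print(round(accuracy, 4))   # 0.8245
print(round(precision, 4))  # 0.6863
print(round(recall, 4))     # 0.4191
```

Read this way, the default tree finds only about 42% of the higher-income records, even though its overall accuracy is about 82%.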


## Step 5: Deploy model

In this exercise (predicting income), there is no model deployment. Here we develop a model and test its performance on the validation data.

What does "deploying" a model mean? Up to this point, we've fit a model to our training data and then estimated how this model will perform on new data by testing it on validation data.

In this course, we finish after building the model. In practice, the model is used by an organization/company in some way. Using the model is often referred to as "deploying" the model.

How a model is deployed can vary. It may simply be deployed as a notebook that reads the latest predictor data and uses the developed model to make predictions. The model can also be deployed inside enterprise decision support software that automatically makes predictions on incoming data.
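
As a rough illustration of the simplest case (a notebook that reuses a previously trained model), one common option is to save the fitted model with joblib, which is installed alongside scikit-learn, and load it again later. This is only a sketch: the synthetic data stands in for `train_X` and `train_y`, and the file name `income_dtree.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Small synthetic stand-in for train_X / train_y (assumption: not the census data).
rng = np.random.default_rng(0)
demo_X = rng.integers(0, 100, size=(200, 5))
demo_y = (demo_X[:, 0] > 50).astype(int)

# Train once, as in Step 4.
model = DecisionTreeClassifier(random_state=0).fit(demo_X, demo_y)

# Save the fitted model to disk; the file name is hypothetical.
model_path = os.path.join(tempfile.mkdtemp(), "income_dtree.joblib")
joblib.dump(model, model_path)

# Later, in the "deployed" notebook: load the model and predict on new rows.
loaded_model = joblib.load(model_path)
new_rows = demo_X[:5]
print(loaded_model.predict(new_rows))
```

The loading side needs the same predictor columns, in the same order, as the data the model was trained on.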

# Section 2: Prediction with Decision Tree (using hyperparameter tuning)


This section demonstrates how to refine the performance of a decision tree using hyperparameter tuning techniques. 

This section doesn't duplicate the data loading, cleaning, and splitting of the first section. It shows how to create and test a decision tree classifier using the RandomizedSearchCV and GridSearchCV techniques.

Both RandomizedSearchCV and GridSearchCV fit the model with different combinations of parameter values and score each combination by cross-validation. This helps determine the parameters that produce the best-performing model.
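
The difference between the two searches is which combinations they try. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` helpers on a small made-up grid (`toy_grid` is an assumption for illustration, not the grid used later) to show that a grid search enumerates every combination while a randomized search draws only `n_iter` of them.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# A deliberately small, hypothetical grid: 2 x 4 = 8 combinations.
toy_grid = {"criterion": ["gini", "entropy"], "max_depth": [5, 10, 20, None]}

# Grid search evaluates every combination.
all_combinations = list(ParameterGrid(toy_grid))

# Randomized search evaluates only n_iter randomly drawn combinations.
sampled_combinations = list(ParameterSampler(toy_grid, n_iter=3, random_state=0))

print(len(all_combinations))      # 8
print(len(sampled_combinations))  # 3
for combination in sampled_combinations:
    print(combination)
```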

## Step 1: Determine the parameters that can be "tuned"

You can review the parameters of the model that you're trying to "tune". In this case, we're using a DecisionTreeClassifier. Begin by reviewing the parameters for this model found [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

After reviewing these parameters (while also understanding something about DecisionTrees), we can identify the following parameters that could affect model fit. 

* criterion
* max_depth
* min_samples_split
* min_samples_leaf
* max_leaf_nodes
* min_impurity_decrease
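
Besides the documentation page, the parameters and their defaults can also be read from the estimator itself. A minimal sketch, using `get_params()` to print the default value of each parameter listed above (the list name `candidates` is chosen here for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# get_params() returns a dict of every constructor parameter and its current value.
default_params = DecisionTreeClassifier().get_params()

# The six parameters selected above as candidates for tuning.
candidates = [
    "criterion",
    "max_depth",
    "min_samples_split",
    "min_samples_leaf",
    "max_leaf_nodes",
    "min_impurity_decrease",
]

for name in candidates:
    print(f"{name}: {default_params[name]}")
```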



## Step 2: Create an initial 'wide' range of possible hyperparameter values


Here we create a wide range of possible parameter values for each of the hyperparameters we've listed above. 


In [None]:
# Criterion used to guide data splits
criterion = ['gini', 'entropy']

# Maximum number of levels in tree. If None, then nodes are expanded until all leaves are pure or until all 
# leaves contain less than min_samples_split samples.
# default = None
max_depth = [int(x) for x in np.linspace(5, 200, num = 40)]
max_depth.append(None)

# Minimum number of samples required to split a node
# default is 2; integer values must be at least 2, so 1 is not a valid candidate
min_samples_split = [2, 3, 5, 8, 10, 15]

# Minimum number of samples required at each leaf node
# default = 1 
min_samples_leaf = [1, 2, 3, 4]

# max_leaf_nodes  - Grow trees with max_leaf_nodes in best-first fashion.
# If None then unlimited number of leaf nodes.
# default=None 
max_leaf_nodes = [None]

# min_impurity_decrease - A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
# default=0.0
min_impurity_decrease = [0.000, 0.0005, 0.001, 0.005, 0.01]

# Create the random grid
param_grid_random = { 'criterion': criterion,
                      'max_depth': max_depth,
                      'min_samples_split': min_samples_split,
                      'min_samples_leaf' : min_samples_leaf,
                      'max_leaf_nodes' : max_leaf_nodes,
                      'min_impurity_decrease' : min_impurity_decrease,
                     }
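
To see why a randomized search is a sensible first pass, it helps to count how many combinations this wide grid contains. The sketch below multiplies the number of candidate values per parameter, with counts read off the lists defined above (`max_depth` has 40 values from `np.linspace` plus `None`).

```python
import math

# Number of candidate values per hyperparameter in the wide grid above.
values_per_parameter = {
    "criterion": 2,
    "max_depth": 41,              # 40 values from np.linspace plus None
    "min_samples_split": 6,
    "min_samples_leaf": 4,
    "max_leaf_nodes": 1,
    "min_impurity_decrease": 5,
}

total_combinations = math.prod(values_per_parameter.values())
cv_folds = 3
n_iter = 100  # combinations a randomized search with n_iter=100 would try

print(total_combinations)             # 9840
print(total_combinations * cv_folds)  # 29520 fits for an exhaustive search
print(n_iter * cv_folds)              # 300 fits for the randomized search
```

With 3-fold cross-validation, sampling 100 combinations covers roughly 1% of the grid at about 1% of the cost of an exhaustive search.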



## Step 3: Use Randomized Search to narrow the possible range of parameter values


In [None]:
# Use the param_grid_random for an initial "rough" search using Randomized search
dtree_default = DecisionTreeClassifier()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
randomSearch = RandomizedSearchCV(estimator = dtree_default, param_distributions = param_grid_random, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
randomSearch.fit(train_X, train_y)
bestRandomModel = randomSearch.best_estimator_
print('Best parameters found: ', randomSearch.best_params_)


Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 40 concurrent workers.
[Parallel(n_jobs=-1)]: Done  84 tasks      | elapsed:    0.6s


Best parameters found:  {'min_samples_split': 5, 'min_samples_leaf': 3, 'min_impurity_decrease': 0.0005, 'max_leaf_nodes': None, 'max_depth': 95, 'criterion': 'entropy'}


[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:    1.5s finished
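
`best_params_` reports only the winning combination. When deciding how to narrow the ranges, it can help to look at the other high-scoring candidates too, which the fitted search object stores in `cv_results_`. The sketch below is self-contained and uses synthetic data and a small hypothetical grid (`demo_X`, `demo_y`, `demo_grid`), so its scores are unrelated to the census results; the same pattern would apply to `randomSearch`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the census data).
demo_X, demo_y = make_classification(n_samples=300, n_features=5, random_state=0)

# Small hypothetical grid so the sketch runs quickly.
demo_grid = {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 2, 4]}

demo_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=demo_grid,
    n_iter=6,
    cv=3,
    random_state=42,
)
demo_search.fit(demo_X, demo_y)

# cv_results_ holds one entry per candidate; a DataFrame makes it easy to sort.
results = pd.DataFrame(demo_search.cv_results_)
columns = ["rank_test_score", "mean_test_score", "std_test_score", "params"]
print(results.sort_values("rank_test_score")[columns].head())
```

If several top-ranked candidates share a value (for example the same `criterion`), that is a reasonable hint for fixing it in the narrower grid.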


## Step 4: Test the performance of the selected parameters

In [None]:
validation_predictions = bestRandomModel.predict(valid_X)
print('Accuracy Score: ', accuracy_score(valid_y, validation_predictions))
print('Precision Score: ', precision_score(valid_y, validation_predictions))
print('Recall Score: ', recall_score(valid_y, validation_predictions))


Accuracy Score:  0.8301770907974204
Precision Score:  0.9388714733542319
Recall Score:  0.2699414150518252


## Step 5: Use knowledge gained from Step 3 to create a new 'narrow' range of possible hyperparameter values

In [None]:
# let's take the best parameters from the random search, and use them as a base for the grid search
param_grid = {
              'min_samples_split': [2, 3, 5, 7, 9],  # integer values must be at least 2
              'min_samples_leaf': [1, 2, 3, 4, 5],
              'min_impurity_decrease': [0.0003, 0.0005, 0.0008, 0.001, 0.002],
              'max_leaf_nodes': [None], 
              'max_depth': [90,93,95,97,100],
              'criterion': ['entropy'],
              }


## Step 6: Use Grid Search (exhaustive) to refine the model

In [None]:
# refine our search using param_grid
dtree_tuned = DecisionTreeClassifier()
# Exhaustive grid search of parameters, using 3 fold cross validation,
# evaluate every combination in param_grid, and use all available cores
gridSearch = GridSearchCV(estimator = dtree_tuned, param_grid=param_grid, cv = 3, verbose=2,  n_jobs = -1)
# Fit the grid search model
gridSearch.fit(train_X, train_y)
bestGridModel = gridSearch.best_estimator_
print('Best parameters found: ', gridSearch.best_params_)


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 40 concurrent workers.


Fitting 3 folds for each of 625 candidates, totalling 1875 fits


[Parallel(n_jobs=-1)]: Done  84 tasks      | elapsed:    0.6s
[Parallel(n_jobs=-1)]: Done 490 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done 1056 tasks      | elapsed:    4.5s
[Parallel(n_jobs=-1)]: Done 1796 out of 1875 | elapsed:    7.3s remaining:    0.3s


Best parameters found:  {'criterion': 'entropy', 'max_depth': 90, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0003, 'min_samples_leaf': 2, 'min_samples_split': 3}


[Parallel(n_jobs=-1)]: Done 1875 out of 1875 | elapsed:    7.7s finished


## Step 7: Test the performance of the model using identified parameters

In [None]:
validation_predictions = bestGridModel.predict(valid_X)
print(accuracy_score(valid_y, validation_predictions))
print(precision_score(valid_y, validation_predictions))
print(recall_score(valid_y, validation_predictions))


0.8324291124987204
0.9422492401215805
0.2794051374493015


## Step 8: If necessary, repeat steps 5 through 7 with more granular and targeted parameter search

Continuing to tune your model is optional, but if there is evidence that there is room for continued improvement in the model, use the information from Step 6 to create a more refined set of parameter ranges and rerun the GridSearchCV. 
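
One way to build such a refined grid is to centre each numeric range on the best value found so far and shrink the step size. The sketch below does this with a small helper, `values_around`, which is a hypothetical function written for this example (it is not part of scikit-learn). The starting values are the best parameters printed in Step 6; the step sizes are arbitrary choices.

```python
def values_around(best, step, count=2, minimum=None):
    """Return candidate values centred on `best`, spaced by `step`.

    Hypothetical helper: `count` values are generated on each side of `best`,
    and anything below `minimum` (if given) is dropped.
    """
    candidates = [best + i * step for i in range(-count, count + 1)]
    if minimum is not None:
        candidates = [value for value in candidates if value >= minimum]
    return candidates


# Best numeric parameters reported by the grid search in Step 6.
best_params = {"min_samples_split": 3, "min_samples_leaf": 2, "max_depth": 90}

refined_grid = {
    # min_samples_split must be at least 2 when given as an integer
    "min_samples_split": values_around(best_params["min_samples_split"], step=1, count=1, minimum=2),
    "min_samples_leaf": values_around(best_params["min_samples_leaf"], step=1, count=1, minimum=1),
    "max_depth": values_around(best_params["max_depth"], step=5, count=2, minimum=1),
    "criterion": ["entropy"],
}

print(refined_grid)
```

The resulting dictionary can be passed as `param_grid` to another GridSearchCV run. Note that the best `max_depth` (90) sat at the lower edge of the Step 5 range, which is why this sketch extends the range below it.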

In this case, we increased accuracy from 0.8302 to 0.8324. The importance of this increase is context-dependent. In some contexts, every little increase is worth the effort; in other contexts, such increases would have minimal to no benefit. In this case, we'll assume that any incremental increase in performance from repeated searching is not enough to justify continuing.
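
To put the accuracy change in context, the metrics printed in the three evaluation cells can be placed side by side. The values below are copies of the outputs above, rounded to four decimals. The row labels and the extra column name are descriptive names chosen for this sketch.

```python
import pandas as pd

# Validation metrics reported earlier in this notebook, rounded to 4 decimals.
reported = pd.DataFrame(
    {
        "accuracy": [0.8245, 0.8302, 0.8324],
        "precision": [0.6863, 0.9389, 0.9422],
        "recall": [0.4191, 0.2699, 0.2794],
    },
    index=["default tree", "random search", "grid search"],
)

# Accuracy improvement of each model relative to the default tree.
reported["accuracy_gain_vs_default"] = (
    reported["accuracy"] - reported.loc["default tree", "accuracy"]
)

print(reported.round(4))
```

The table shows that the tuned models gain under one percentage point of accuracy over the default tree, while precision rises and recall falls. Whether that trade is worthwhile is another context-dependent judgment.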

NOTE: The decision to continue further depends on the expected incremental increase in performance, the importance of any small incremental increase (to the business), and the resources you have (time and computing power). Recognize that you generally face a "diminishing returns" situation, where it takes an increasingly large amount of computation (training) to gain an increasingly small increase in performance.