__Assignment 5__

1. [Import](#Import)
1. [Assignment 5](#Assignment-5)
    1. [Train decision tree](#Train-decision-tree)
    1. [Improve the model](#Improve-the-model)
    1. [Review results](#Review-results)

# Import

<a id = 'Import'></a>

In [1]:
import pandas as pd
import numpy as np
from sklearn import tree

import warnings

warnings.simplefilter("ignore")
np.set_printoptions(threshold=np.inf, suppress=True)
from IPython.display import display, HTML

display(HTML("<style>.container { width:88% !important; }</style>"))

# Assignment 5

Assignment Content:

Launch the “DecisionTreesAssignment.ipynb” file in Jupyter Notebook and complete the assignment. Don’t forget to include tae.data in your directory or change the code to read this input file. Namely, build a decision tree classifier and test the classifier on previous and future data. Write a one-page report explaining the following:

- How the decision tree classifier works
- Thoroughly explain the code (i.e. your implementation of the decision tree classifier)
- Try to get as high accuracy as possible on the future data.
- Is it possible to have both high training/previous and test/future accuracy? Why or why not?
- Play around with the sklearn DecisionTreeClassifier library, tweaking the various decision tree parameters and report on the results of your findings

Deliverables:

- Completed DecisionTreesAssignment.ipynb code which compiles, runs, and gives fairly good training and test accuracy (they don’t need to be 100%, but justify your answer in the report)
- Report (as docx, pdf, or any other popular text format)
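
The first report item asks how a decision tree classifier works. As a rough illustration only (not sklearn's actual implementation), the sketch below scores candidate thresholds on a single numeric feature by weighted Gini impurity, which is the default `criterion` of `DecisionTreeClassifier`. The helper names `gini` and `best_threshold` and the toy data are made up for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k ** 2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Try midpoints between sorted unique values.

    Returns (threshold, weighted impurity of the two children), or
    (None, parent impurity) if no split improves on the parent node.
    """
    best = (None, gini(labels))
    unique = sorted(set(values))
    for lo, hi in zip(unique, unique[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Toy, made-up data: one numeric feature and class labels 1, 2, or 3
feature = [3, 5, 7, 20, 22, 40, 45]
classes = [1, 1, 1, 2, 2, 3, 3]

print("Parent impurity: {0:.3f}".format(gini(classes)))
print("Best split (threshold, weighted impurity): {0}".format(best_threshold(feature, classes)))
```

A full tree repeats this search over every feature at every node and stops when a node is pure or a limit such as `max_depth` or `min_samples_split` is reached.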


<a id = 'Assignment-5'></a>

In [2]:
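# Dataset used: https://archive.ics.uci.edu/ml/datasets/Teaching+Assistant+Evaluation
# Note: tae.data has no header row, so read_csv with default arguments treats the
# first record as column names (hence 150 rows below); pass header=None to keep it.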

ta_eval_raw = pd.read_csv("s3://tdp-ml-datasets/misc/tae.data")

# Used for sampling (previous data) and (future data)
total_sample_num = ta_eval_raw.shape[0]

# To simulate a real-life scenario, draw a random sample half the size of the
# dataset to serve as our previous (training) data. Sampling is done with
# replacement, so the same row can appear more than once.

previous_data_to_train_on = ta_eval_raw.sample(
    int(0.5 * total_sample_num), replace=True
)

# review size of dataset
print("Full dataset dimensions: {0}".format(ta_eval_raw.shape))

Full dataset dimensions: (150, 6)


In [3]:
# convert data into numpy arrays
X = previous_data_to_train_on.iloc[:, :-1].values
y = previous_data_to_train_on.iloc[:, -1].values.reshape(-1, 1)

print("X dimensions: {0}".format(X.shape))
print("y dimensions: {0}".format(y.shape))

X dimensions: (75, 5)
y dimensions: (75, 1)


In [4]:
# TODO: Build your decision tree model here (keep the name of your decision tree consistent throughout the program)
# Hint: Read the sklearn DecisionTreeClassifier docs:
# http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
# Hint: You may also find it useful to visualize the tree in Graphviz as we've done in the lecture
decision_tree = tree.DecisionTreeClassifier()
decision_tree.fit(X, y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
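
The hint above suggests visualizing the tree. Where Graphviz is not installed, a text dump of the learned rules is one possible alternative. The sketch below assumes scikit-learn 0.21 or later (where `sklearn.tree.export_text` is available) and uses the bundled iris data as a stand-in for `tae.data`; `clf` and `rules` are names chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in dataset (assumption): iris ships with scikit-learn, tae.data does not
iris = load_iris()

# A shallow tree keeps the printed rules short enough to read
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text returns the split rules as an indented plain-text string
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

The same call should work on the `decision_tree` fitted above, with `feature_names` omitted or set to names of your choosing.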

## Train decision tree

<a id = 'Train-decision-tree'></a>

In [5]:
# To simulate a real-life scenario, draw a second random sample half the size of
# the dataset to serve as our future (test) data. It is also drawn with
# replacement, so it can contain rows that appear in the training sample.
future_data_we_will_test_on = ta_eval_raw.sample(
    int(0.5 * total_sample_num), replace=True
)

# true labels, used to measure how well the classifier we built performs
true_value_training = list(previous_data_to_train_on.iloc[:, 5])
true_value_future = list(future_data_we_will_test_on.iloc[:, 5])

# using Decision Tree to predict the labels i.e. 1, 2, or 3 for training and test data
prediction_on_trainingdata = decision_tree.predict(
    previous_data_to_train_on.iloc[:, :-1]
)
prediction_on_futuredata = decision_tree.predict(
    future_data_we_will_test_on.iloc[:, :-1]
)

# accuracy is a fraction in [0, 1]; the .2% format renders it as a percentage
print(
    "Data We Trained On Prediction Accuracy: {0:.2%}".format(
        sum(true_value_training == prediction_on_trainingdata)
        / len(prediction_on_trainingdata)
    )
)
print(
    "Future Data Prediction Accuracy: {0:.2%}".format(
        sum(true_value_future == prediction_on_futuredata)
        / len(prediction_on_futuredata)
    )
)

Data We Trained On Prediction Accuracy: 97.33%
Future Data Prediction Accuracy: 65.33%
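
Both samples above are drawn from the same 150 rows with `replace=True`, so the "future" sample is likely to contain rows the tree already saw during training. The sketch below estimates how large that overlap can be. It uses a stand-in frame with only a hypothetical `row_id` column and fixed seeds chosen for reproducibility, so the exact counts will differ from the notebook's unseeded samples.

```python
import numpy as np
import pandas as pd

# Stand-in for ta_eval_raw (assumption): only the row count matters here
df = pd.DataFrame({"row_id": np.arange(150)})
n = int(0.5 * len(df))

# Same sampling scheme as the notebook, with seeds added for reproducibility
previous = df.sample(n, replace=True, random_state=0)
future = df.sample(n, replace=True, random_state=1)

# Distinct rows in the training sample (fewer than n because of repeats)
unique_previous = previous.index.nunique()

# Rows that occur in both samples, and the share of future rows already seen
shared = set(previous.index) & set(future.index)
share_seen = future.index.isin(previous.index).mean()

print("Distinct training rows: {0} of {1}".format(unique_previous, n))
print("Rows present in both samples: {0}".format(len(shared)))
print("Share of future rows seen in training: {0:.2%}".format(share_seen))
```

Any overlap inflates the "future" accuracy, which is one reason the next section switches to `train_test_split`, where the two sets are disjoint.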


## Improve the model

<a id = 'Improve-the-model'></a>

In [6]:
# use sklearn's train_test_split to create train/test sets
# hold out 20% of the data as the test set (the workflow above used 50% samples)
from sklearn.model_selection import train_test_split

X = ta_eval_raw.iloc[:, :-1].values
y = ta_eval_raw.iloc[:, -1].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

print("X_train dimensions: {0}".format(X_train.shape))
print("X_test dimensions: {0}".format(X_test.shape))
print("y_train dimensions: {0}".format(y_train.shape))
print("y_test dimensions: {0}".format(y_test.shape))

X_train dimensions: (120, 5)
X_test dimensions: (30, 5)
y_train dimensions: (120,)
y_test dimensions: (30,)


In [7]:
# baseline
decision_tree = tree.DecisionTreeClassifier(random_state=1)
decision_tree.fit(X_train, y_train)

y_pred_train = decision_tree.predict(X_train)
y_pred_test = decision_tree.predict(X_test)

print("Baseline performance: \n")

# accuracy is a fraction in [0, 1]; the .2% format renders it as a percentage
print(
    "Data We Trained On Prediction Accuracy: {0:.2%}".format(
        sum(y_train == y_pred_train) / len(y_pred_train)
    )
)
print(
    "Future Data Prediction Accuracy: {0:.2%}".format(
        sum(y_test == y_pred_test) / len(y_pred_test)
    )
)

Baseline performance: 

Data We Trained On Prediction Accuracy: 95.83%
Future Data Prediction Accuracy: 56.67%


In [8]:
# setup parameter grid for grid search
parameters = {
    "criterion": ["gini", "entropy"],
    "min_samples_split": range(2, 10),
    "max_depth": range(2, 10),
    "max_leaf_nodes": range(2, 30),
}
print(parameters)

{'criterion': ['gini', 'entropy'], 'min_samples_split': range(2, 10), 'max_depth': range(2, 10), 'max_leaf_nodes': range(2, 30)}
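
Before running an exhaustive search it can help to know how many models the grid implies. The sketch below counts the combinations with `sklearn.model_selection.ParameterGrid`; it restates the grid with plain lists so that it runs on its own, and `n_folds`, `n_candidates`, and `n_fits` are names chosen for this example.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, written with lists so the block is self-contained
parameters = {
    "criterion": ["gini", "entropy"],
    "min_samples_split": list(range(2, 10)),
    "max_depth": list(range(2, 10)),
    "max_leaf_nodes": list(range(2, 30)),
}

n_folds = 10  # matches cv=10 in the grid search cell that follows

# 2 criteria * 8 split sizes * 8 depths * 28 leaf limits
n_candidates = len(ParameterGrid(parameters))
n_fits = n_candidates * n_folds

print("Candidate settings: {0}".format(n_candidates))
print("Cross-validation fits: {0}".format(n_fits))
```

With `refit=True`, the search also fits the best setting once more on the full training set.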


In [9]:
# perform grid search with 10-fold cross-validation using the parameter grid above
# this ran in less than 30 seconds on my local machine
from sklearn.model_selection import GridSearchCV

decision_tree = tree.DecisionTreeClassifier(random_state=1)
grid_search = GridSearchCV(
    estimator=decision_tree, param_grid=parameters, cv=10, refit=True
)
grid_search.fit(X_train, y_train)

GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=DecisionTreeClassifier(class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort=False, random_state=1,
                                              splitter='best'),
             iid='warn', n_jobs=None,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': range(2, 10),
                         'm

In [10]:
# display best parameters chosen by grid search
print("Best parameters: {0}".format(grid_search.best_params_))

Best parameters: {'criterion': 'entropy', 'max_depth': 6, 'max_leaf_nodes': 10, 'min_samples_split': 2}
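
`best_params_` only names the winning setting. The cross-validated score behind it, and how close the runners-up were, can be read from `best_score_` and `cv_results_`. The sketch below shows one way to do that on a deliberately small grid, using the bundled iris data as a stand-in for `tae.data`; `search`, `results`, and `top` are names chosen for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (assumption) and a small grid so the example runs quickly
X_demo, y_demo = load_iris(return_X_y=True)
search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_grid={"criterion": ["gini", "entropy"], "max_depth": [2, 3, 4]},
    cv=5,
)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays with one entry per candidate setting
results = pd.DataFrame(search.cv_results_)
top = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
].head(3)

print("Best mean CV accuracy: {0:.2%}".format(search.best_score_))
print(top)
```

If the top few rows differ by less than their `std_test_score`, the choice among them is largely down to noise.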


## Review results

<a id = 'Review-results'></a>

In [11]:
# generate predictions using best model and review accuracies
y_pred_train = grid_search.predict(X_train)
y_pred_test = grid_search.predict(X_test)

print("Grid search model performance: \n")

# accuracy is a fraction in [0, 1]; the .2% format renders it as a percentage
print(
    "Data We Trained On Prediction Accuracy: {0:.2%}".format(
        sum(y_train == y_pred_train) / len(y_pred_train)
    )
)
print(
    "Future Data Prediction Accuracy: {0:.2%}".format(
        sum(y_test == y_pred_test) / len(y_pred_test)
    )
)

Grid search model performance: 

Data We Trained On Prediction Accuracy: 66.67%
Future Data Prediction Accuracy: 50.00%
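
On a 30-row test set each row is worth about 3.3 percentage points, so the gap between the baseline and the tuned tree above amounts to two rows. Cross-validating both models gives a less noisy comparison than a single small hold-out. The sketch below shows the idea on the bundled wine data as a stand-in for `tae.data`; the `candidates` and `scores` dictionaries are names chosen for this example, and the constrained tree reuses the parameters reported by the grid search.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (assumption): a small three-class problem bundled with sklearn
X_demo, y_demo = load_wine(return_X_y=True)

candidates = {
    "unconstrained": DecisionTreeClassifier(random_state=1),
    "constrained": DecisionTreeClassifier(
        criterion="entropy", max_depth=6, max_leaf_nodes=10, random_state=1
    ),
}

scores = {}
for name, model in candidates.items():
    # 10 accuracy values, one per held-out fold
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=10)
    scores[name] = fold_scores
    print(
        "{0}: mean accuracy {1:.2%}, std {2:.2%}".format(
            name, fold_scores.mean(), fold_scores.std()
        )
    )
```

The standard deviation across folds gives a sense of how much a single train/test split could move the reported accuracy.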


In [12]:
# review best model as determined by GridSearchCV
grid_search.best_estimator_

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=6,
                       max_features=None, max_leaf_nodes=10,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=1, splitter='best')
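
The repr above lists the limits the tree was given, not the shape it ended up with. As a possible follow-up for the report, the sketch below reads the fitted depth, leaf count, and feature importances from a tree built with the same parameters. It assumes scikit-learn 0.21 or later (for `get_depth` and `get_n_leaves`) and uses the bundled iris data as a stand-in for `tae.data`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (assumption); the parameters mirror the best estimator above
iris = load_iris()
model = DecisionTreeClassifier(
    criterion="entropy", max_depth=6, max_leaf_nodes=10, random_state=1
)
model.fit(iris.data, iris.target)

# Actual size of the fitted tree, which may be below the configured limits
print("Depth: {0}, leaves: {1}".format(model.get_depth(), model.get_n_leaves()))

# Importances sum to 1; a feature never used in a split gets 0
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print("{0}: {1:.3f}".format(name, importance))
```

On the notebook's own model the same attributes are available through `grid_search.best_estimator_`.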