__Assignment 6B__

1. [Import](#Import)
1. [Assignment 6B](#Assignment-6B)
    1. [Load-data](#load-data)
    1. [Cross-validation](#Cross-validation)    
    1. [Evaluation](#Evaluation)

# Import

<a id = 'Import'></a>

In [1]:
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.model_selection import (
    cross_val_score,
    train_test_split,
    StratifiedKFold,
    StratifiedShuffleSplit,
    KFold,
)
from sklearn.metrics import accuracy_score

import warnings

warnings.simplefilter("ignore")
np.set_printoptions(threshold=np.inf, suppress=True)
from IPython.core.display import display, HTML

display(HTML("<style>.container { width:88% !important; }</style>"))

# Assignment 6B
Assignment Content:

1. Using the dataset tae.data implement two different cross-validation procedures in the following way:

    - Import data
    - Split data as needed into training and test sets
    - Fit a decision tree algorithm to the training data (Hint: we did this in the Decision Tree module)
    - Test the trained decision tree to the test data
    - Evaluate the performance of the decision tree on the test data reporting error rate or accuracy rate

Deliverables:

 -Two .ipynb files each pertaining to a different cross-validation procedure and each following steps 1 through 5. The code should also print out the error rate or accuracy rate of the cross-validation procedure (averaged over the number of iterations if needed)

<a id = 'Assignment-6B'></a>

## Load data

<a id = 'load-data'></a>

In [2]:
# load data
ta_eval_raw = pd.read_csv("s3://tdp-ml-datasets/misc/tae.data")

# split independent and dependent variables
X = ta_eval_raw.iloc[:, :-1].values
y = ta_eval_raw.iloc[:, -1].values.reshape(-1)

# X_train, X_test will be used in the CV procedure.
# y_train, y_test is the holdout data set and will be used as a final evaluation of the model
# outside of the CV procedure.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)

## Cross-validation

<a id = 'Cross-validation'></a>

In [3]:
# create decision tree model and evaluate using KFold cross validation
decision_tree = tree.DecisionTreeClassifier(random_state=1)
kf = KFold(n_splits=10, random_state=1, shuffle=True)

scores = []
for train_index, test_index in kf.split(X_train):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    decision_tree.fit(X_train, y_train)
    y_pred = decision_tree.predict(X_test)
    scores.append(accuracy_score(y_test, y_pred))

print("Individual scores: \n {0}".format(scores))
print("")
print("Mean score: {}".format(np.mean(scores)))

Individual scores: 
 [0.5714285714285714, 0.7142857142857143, 0.7142857142857143, 0.42857142857142855, 0.7142857142857143, 0.6923076923076923, 0.6923076923076923, 0.6923076923076923, 0.46153846153846156, 0.6153846153846154]

Mean score: 0.6296703296703295


> Remarks - I am intentionally passing in X_train and y_train, as opposed to the full dataset comprised by X and y. cross_val_score internally splits the data its given into train and test sets, then performs KFold cross validation. In this case, X_train and y_train will be evaluated in 10 folds. Then I will make predictions using X_test and evaluate against y_test, which is truly unseen data in this implementation.

> Instead of using cross_val_score, I used KFold to perform a 10 fold split. KFold creates sets of indices that are used for specifying which samples from the training data will be in the training fold or the validation fold. For each split, I fit the decision tree model using the training fold, then calculate the accuracy score on the fold set aside as the validation set.


## Evaluation

<a id = 'Evaluation'></a>

In [4]:
# baseline
decision_tree.fit(X_train, y_train)

y_pred_test = decision_tree.predict(X_test)

print(
    "Future Data Prediction Accuracy: {0}%".format(
        sum(y_test == y_pred_test) / len(y_pred_test)
    )
)

Future Data Prediction Accuracy: 0.6%


> Remarks - The model's performance on the holdout set is quite a bit worse than the average cross validation accuracy. The cross_validation accuracy is slightly worse in this implementation (with this random number seed).