Assignment Content:

1. Using the dataset tae.data, implement two different cross-validation procedures in the following way:

    1. Import data
    2. Split data as needed into training and test sets
    3. Fit a decision tree algorithm to the training data (Hint: we did this in the Decision Tree module)
    4. Test the trained decision tree on the test data
    5. Evaluate the performance of the decision tree on the test data, reporting the error rate or accuracy rate

Deliverables:

 - Two .ipynb files, each pertaining to a different cross-validation procedure and each following steps 1 through 5. The code should also print out the error rate or accuracy rate of the cross-validation procedure (averaged over the number of iterations if needed).

## Import data

In [1]:
import pandas as pd
import numpy as np
from sklearn import tree
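# Note: sklearn.cross_validation was deprecated in scikit-learn 0.18 and removed in 0.20.
# Later versions provide these names in sklearn.model_selection, where
# StratifiedShuffleSplit has a different constructor signature.
from sklearn.cross_validation import cross_val_score, train_test_split, StratifiedKFold, StratifiedShuffleSplit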

import warnings; warnings.simplefilter('ignore')
np.set_printoptions(threshold = np.inf, suppress = True)
from IPython.core.display import display, HTML; display(HTML("<style>.container { width:88% !important; }</style>"))

ta_eval_raw = pd.read_csv('tae.data')




## Split data

In [2]:
# Split independent and dependent variables

X = ta_eval_raw.iloc[:,:-1].values
y = ta_eval_raw.iloc[:,-1].values.reshape(-1)

# XTrain, yTrain will be used in the CV procedure.
# XTest, yTest is the holdout data set and will be used as a final evaluation of the model
# outside of the CV procedure.

XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size = 0.1, random_state = 1)
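
As a rough check of what this split produces, the sketch below applies the same call to a synthetic array. The 151-row, 5-feature shape and the `*_demo`, `*_cv`, `*_holdout` names are assumptions standing in for tae.data, and the import uses `sklearn.model_selection`, where `train_test_split` lives in scikit-learn 0.18 and later.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 151 rows and 5 features are assumptions.
rng = np.random.RandomState(1)
X_demo = rng.randint(0, 10, size=(151, 5))
y_demo = rng.randint(1, 4, size=151)  # three hypothetical class labels: 1, 2, 3

X_cv, X_holdout, y_cv, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=1
)

# The CV portion and the holdout portion together cover every row exactly once.
print(X_cv.shape, X_holdout.shape)
```

With a fractional `test_size`, the holdout size is rounded up, so a 10% holdout of 151 rows contains 16 rows and leaves 135 for cross-validation.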



## Evaluate decision tree using a cross-validation technique

In [3]:
decision_tree = tree.DecisionTreeClassifier(random_state = 1)
scores = cross_val_score(decision_tree
                        ,XTrain
                        ,yTrain
                        ,cv = StratifiedShuffleSplit(yTrain, n_iter = 10, random_state = 1)
                        ,scoring = 'accuracy')
print('Individual scores: \n {0}'.format(scores))
print('')
print('Mean score: {}'.format(np.mean(scores)))

Individual scores: 
 [0.35714286 0.71428571 0.71428571 0.5        0.85714286 0.64285714
 0.64285714 0.42857143 0.57142857 0.78571429]

Mean score: 0.6214285714285713


> Remarks - I am intentionally passing in XTrain and yTrain, as opposed to the full dataset (X and y). cross_val_score repeatedly splits the data it is given into training and test subsets according to the cv argument; here, that is 10 stratified shuffle-split iterations over XTrain and yTrain. I then make predictions on XTest and evaluate them against yTest, which is truly unseen data in this implementation.
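
The flow described in this remark (cross-validate on the training portion only, then score once on the holdout) can be sketched with the newer `sklearn.model_selection` API. This is a minimal sketch on synthetic data, not the notebook's actual run: the array shapes, labels, and `*_demo` / `*_cv` / `*_holdout` names are assumptions, and the scores it prints are unrelated to tae.data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import (
    StratifiedShuffleSplit, cross_val_score, train_test_split)

# Synthetic stand-in data; shape and labels are assumptions, not tae.data.
rng = np.random.RandomState(1)
X_demo = rng.randint(0, 10, size=(151, 5))
y_demo = rng.randint(1, 4, size=151)

X_cv, X_holdout, y_cv, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=1)

# In sklearn.model_selection the splitter no longer takes y in its constructor;
# n_splits replaces n_iter, and cross_val_score passes the labels to it.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=1)
clf = DecisionTreeClassifier(random_state=1)
cv_scores = cross_val_score(clf, X_cv, y_cv, cv=splitter, scoring='accuracy')
print('Mean CV accuracy: {0:.3f}'.format(cv_scores.mean()))

# The holdout rows are used only once, after cross-validation is finished.
holdout_accuracy = clf.fit(X_cv, y_cv).score(X_holdout, y_holdout)
print('Holdout accuracy: {0:.3f}'.format(holdout_accuracy))
```

Keeping the holdout out of `cross_val_score` means the final score is computed on rows that played no part in any of the 10 iterations.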

> Remarks - The difference between this implementation and the first is that I am explicitly telling cross_val_score to use StratifiedShuffleSplit, rather than letting it default to StratifiedKFold. With StratifiedKFold, every sample appears in a test set exactly once, whereas StratifiedShuffleSplit reshuffles the data set before making each split, so a given sample may appear in the test set several times or not at all.
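
To make the contrast between the two splitters concrete, the sketch below counts how often each sample index lands in a test split. It uses the newer `sklearn.model_selection` API and synthetic labels only; the 135-sample, three-class layout and the helper name `count_test_appearances` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

# Synthetic labels only; 135 samples in three balanced classes are assumptions.
y_demo = np.repeat([1, 2, 3], 45)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect the splits

def count_test_appearances(splitter):
    """Count how many times each sample index lands in a test split."""
    counts = np.zeros(len(y_demo), dtype=int)
    for _, test_idx in splitter.split(X_demo, y_demo):
        counts[test_idx] += 1
    return counts

kfold_counts = count_test_appearances(
    StratifiedKFold(n_splits=10, shuffle=True, random_state=1))
shuffle_counts = count_test_appearances(
    StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=1))

# K-fold: every sample is tested exactly once.
print('KFold min/max:', kfold_counts.min(), kfold_counts.max())
# Shuffle-split: typically some samples are never tested and others repeat.
print('ShuffleSplit min/max:', shuffle_counts.min(), shuffle_counts.max())
```

Because each shuffle-split draws its test set independently, the mean score averages over overlapping test sets rather than over a partition of the data.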

## Final evaluation using holdout set

In [4]:
# Baseline

decision_tree.fit(XTrain, yTrain)

yPredTest = decision_tree.predict(XTest)

# Format the fraction of correct predictions as a percentage
print('Future Data Prediction Accuracy: {0:.2%}'.format(sum(yTest == yPredTest) / len(yPredTest)))


Future Data Prediction Accuracy: 56.25%


> Remarks - The model's accuracy on the holdout set (0.5625) is noticeably lower than the mean cross-validation accuracy (about 0.62). The cross-validation accuracy is also slightly worse in this implementation than in the first one (with this random number seed).
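
Since the assignment accepts either an accuracy rate or an error rate, the two can be reported side by side. The sketch below uses `sklearn.metrics.accuracy_score` on hypothetical labels; the arrays are made up for illustration (16 samples, 9 of them predicted correctly) and are not the notebook's actual predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for illustration only: 16 samples, 9 predicted correctly.
y_true = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1])
y_pred = y_true.copy()
y_pred[:7] = [2, 3, 1, 2, 3, 1, 2]  # force the first 7 predictions to be wrong

accuracy = accuracy_score(y_true, y_pred)  # fraction of matching labels
error_rate = 1.0 - accuracy                # complement of the accuracy

print('Accuracy:   {0:.2%}'.format(accuracy))    # Accuracy:   56.25%
print('Error rate: {0:.2%}'.format(error_rate))  # Error rate: 43.75%
```

On a test set this small, each sample moves the accuracy by 6.25 percentage points, so a gap of a few points between the holdout and cross-validation scores corresponds to a single prediction.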