Assignment Content:

Launch the “DecisionTreesAssignment.ipynb” file in Jupyter Notebook and complete the assignment. Don’t forget to include tae.data in your directory or change the code to read this input file. Namely, build a decision tree classifier and test the classifier on previous and future data. Write a one page report explaining the following:

- How the decision tree classifier works
- Thoroughly explain the code (i.e. your implementation of the decision tree classifier)
- Try to get as high accuracy as possible on the future data.
- Is it possible to have both high training/previous and test/future accuracy? Why or why not?
- Play around with the sklearn DecisionTreeClassifier library, tweaking the various decision tree parameters and report on the results of your findings

Deliverables:

- Completed DecisionTreesAssignment.ipynb code which compiles, runs, and gives fairly good training and test accuracy (they don’t need to be 100%, but justify your answer in the report)
- Report (as docx, pdf, or any other popular text format)


## Importing Data

In [1]:
import pandas as pd
from sklearn import tree

#Dataset used: https://archive.ics.uci.edu/ml/datasets/Teaching+Assistant+Evaluation

ta_eval_raw = pd.read_csv('tae.data')

#Used for sampling (previous data) and (future data) 

total_sample_num = ta_eval_raw.shape[0]

#in order to simulate what would happen in real life scenarios, we choose
#50% of this dataset at random to be our previous (training) data

previous_data_to_train_on = ta_eval_raw.sample(int(0.5 * total_sample_num), replace = True)

## Assignment: Write code to build a decision tree classifier on this dataset

In [2]:
print('Dataset dimensions: {0}'.format(ta_eval_raw.shape))

Dataset dimensions: (151, 6)


In [3]:
# Convert data into numpy arrays
X = previous_data_to_train_on.iloc[:,:-1].values
y = previous_data_to_train_on.iloc[:,-1].values.reshape(-1, 1)


In [4]:
#TODO: Build your decision tree model here (keep the name of your decision tree consistent throughout the program)
#Hint: Read the sklearn DecisionTreeClassifier docs:
#http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
#Hint: You may also find it useful to visualize the tree in Graphviz as we've done in the lecture

decision_tree = tree.DecisionTreeClassifier()
decision_tree.fit(X, y)


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

## Testing classifier on previous (training) data and future (test) data

In [6]:
#in order to simulate what would happen in real life scenarios, we choose
#50% of this dataset at random to be our future (test) data

future_data_we_will_test_on = ta_eval_raw.sample(int(0.5 * total_sample_num), replace = True)

#True values so that we can compare how good our classifier that we built performs

true_value_training = list(previous_data_to_train_on.iloc[:, 5])
true_value_future = list(future_data_we_will_test_on.iloc[:, 5])

#Using Decision Tree to predict the labels i.e. 1, 2, or 3 for training and test data

prediction_on_trainingdata = decision_tree.predict(previous_data_to_train_on.iloc[:, :-1])
prediction_on_futuredata = decision_tree.predict(future_data_we_will_test_on.iloc[:, :-1])

print('Data We Trained On Prediction Accuracy: {0:.2%}'.format(sum(true_value_training == prediction_on_trainingdata) / len(prediction_on_trainingdata)))
print('Future Data Prediction Accuracy: {0:.2%}'.format(sum(true_value_future == prediction_on_futuredata) / len(prediction_on_futuredata)))


Data We Trained On Prediction Accuracy: 100.00%
Future Data Prediction Accuracy: 65.33%


- How the decision tree classifier works
    * The decision tree classifier represents the relationship between the independent variables (features) and the dependent variable (class label) in the form of an upside-down tree. At each level of the tree, the model splits groups of observations based on the value of a single feature, choosing the split that most reduces an impurity measure such as Gini impurity or entropy. These splits are referred to as 'nodes'. As the number of levels in the tree increases, the rules defining each group become more specific. A 'leaf' is a node that does not split the observations any further; the leaf assigns the final prediction, which is the most common class among the training observations that reach it.
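
The split selection described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not scikit-learn's actual implementation: the helper names `gini` and `best_threshold` and the toy `class_size`/`score` data are hypothetical, and only a single numeric feature is considered.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_impurity) for the best 'value <= threshold' split."""
    best = (None, gini(labels))  # baseline: impurity of the unsplit node
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical toy data: one numeric feature and one class label per observation.
class_size = [3, 5, 10, 20, 30, 40]
score = [1, 1, 1, 2, 2, 2]

threshold, impurity = best_threshold(class_size, score)
print(threshold, impurity)
```

A full tree would apply the same search to every feature, split on the best one, and then repeat the process on each resulting group until a stopping rule is met.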
    
- Is it possible to have both high training/previous and test/future accuracy? Why or why not?
    * Yes, it is possible to have both high training/previous and test/future accuracy. The challenge is to ensure that the model fit on the training data does not overfit it. Overfitting occurs when the model captures patterns in the training data, often noise, that do not hold in the test dataset. Overfit models are said to have high variance: the model stretches too far to capture the training relationships exactly, becomes very sensitive to changes in the underlying data, and predicts unseen observations poorly. While avoiding models with high variance, we also need to avoid models with high bias. Models with high bias do a poor job of capturing the relationship between the variables and are said to underfit the data.
     The goal is to define a model that generalizes well, which means the model adequately captures the training data (low bias) while also performing well on data unseen during the fit stage (low variance).
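
The gap between training and test accuracy can be illustrated with a short sketch. It assumes a synthetic dataset from `make_classification` rather than tae.data, so the numbers it prints say nothing about the assignment data; the names `X_demo`, `y_demo`, and `results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data with some label noise (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, flip_y=0.2, random_state=0)

# Disjoint halves: one to fit on, one held out as "future" data.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)

results = {}
for depth in (None, 3):  # None grows the tree until every leaf is pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    results[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print('max_depth={0}: train={1:.2f}, test={2:.2f}'.format(depth, *results[depth]))
```

The unrestricted tree memorizes the training half, so its training accuracy says little about how it will do on the held-out half. Comparing the two printed rows shows how much of that training accuracy carries over once the depth is limited.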
    
- Thoroughly explain the code (i.e. your implementation of the decision tree classifier)
    * First, a DecisionTreeClassifier is instantiated with its default parameters. Second, the model's fit method is called on the training features X and labels y. Finally, the predict method is used to label both the previous (training) and future (test) samples, and each accuracy is computed as the fraction of predictions that match the true labels.
    * The default DecisionTreeClassifier parameters create a model that overfits the data. This is evident from the fact that the training accuracy is very high, while the test accuracy is much lower (the specific accuracy values depend on the samples chosen, but the train/test accuracies are typically around 97%/60%). This means the model is identifying characteristics in the training data that do not occur in the test data set. Fortunately, scikit-learn's implementation of the DecisionTreeClassifier provides many parameters that can be adjusted to look for a model with both low bias and low variance. Some of the key parameters include:
        - criterion: takes values 'gini' or 'entropy' - the function used to measure the quality of the split
        - max_depth: the maximum depth of the tree
        - min_samples_split: the minimum number of observations required to split an internal node.
        - min_samples_leaf: the minimum number of samples needed to be in a leaf node
        - max_features: the number of features to evaluate when looking for the best split
        - max_leaf_nodes: caps the number of leaf nodes possible
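
One possible way to explore the parameters listed above is a cross-validated grid search. The sketch below is an assumption-laden example, not the method used in this notebook: it runs on synthetic data from `make_classification` instead of tae.data, and the grid values and the names `X_demo`, `y_demo`, `param_grid`, and `search` are hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); swap in the real X and y to use it.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, flip_y=0.2, random_state=0)

# Illustrative candidate values for a few of the parameters listed above.
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 3, 5, None],
    'min_samples_leaf': [1, 5, 10],
    'max_leaf_nodes': [None, 10],
}

# 5-fold cross-validation scores each combination on data it was not fit on.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_demo, y_demo)

print('Best parameters:', search.best_params_)
print('Best cross-validated accuracy: {0:.2%}'.format(search.best_score_))
```

Cross-validation is used here so that parameter choices are judged on held-out folds rather than on the training accuracy, which the unrestricted tree inflates.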
        
- Play around with the sklearn DecisionTreeClassifier library, tweaking the various decision tree parameters and report on the results of your findings
    * 