# Assignment 8 - Decision Trees 

## Introduction

We will be using the wine quality data set for these exercises. This data set contains various chemical properties of wine, such as acidity, sugar, pH, and alcohol. It also contains a quality metric (3 to 9, where higher is better) and a color (red or white). The name of the file is `Wine_Quality_Data.csv`.

In [20]:
from __future__ import print_function
import os
#data_path = ['data']

## Question 1

* Import the data and examine the features.
* We will be using all of them to predict `color` (white or red), but the `color` column will need to be integer encoded.

In [21]:
import pandas as pd
import numpy as np

#read the dataset

df = pd.read_csv("Wine_Quality_Data.csv")

In [22]:
df.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red


In [None]:
df.shape

In [30]:
#Write your code to integer encode the target column, color (red -> 1, white -> 0).
df = df.replace({'color' : {'red' :1, 'white' :0}})

In [31]:
df.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,1
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,1
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1
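
As an alternative to `DataFrame.replace`, the same encoding can be written with `Series.map`, which makes unexpected labels easy to detect because they become `NaN`. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the wine file) and keeps the same convention of red as 1 and white as 0.

```python
import pandas as pd

# Toy frame standing in for the wine data (values are illustrative only).
toy = pd.DataFrame({"alcohol": [9.4, 9.8, 10.1], "color": ["red", "white", "red"]})

# Explicit mapping, same convention as the notebook: red -> 1, white -> 0.
color_codes = {"red": 1, "white": 0}
encoded = toy["color"].map(color_codes)

# map() yields NaN for labels missing from the dictionary, so check before casting.
assert encoded.notna().all(), "unexpected color label"
toy["color"] = encoded.astype(int)
print(toy)
```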


## Question 2

* Use `train_test_split` to split data into train and test sets. 
* Check the percent composition of each quality level for both the train and test data sets.

In [39]:
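#Write your code
from sklearn.model_selection import train_test_split
# Features are every column except the last; the target is the last column (color)
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
# Hold out 20% of the rows for testing; fix the seed so the split is repeatable
X_train, X_test, Y_train, Y_test = train_test_split(X,y, test_size=0.2, random_state=42)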

print('y_train\n',Y_train.value_counts(normalize=True))
print('y_test\n',Y_test.value_counts(normalize=True))

y_train
 0    0.757937
1    0.242063
Name: color, dtype: float64
y_test
 0    0.737692
1    0.262308
Name: color, dtype: float64


## Question 3

* Fit a decision tree classifier with no set limits on maximum depth, features, or leaves.
* Determine how many nodes are present and what the depth of this (very large) tree is.
* Using this tree, measure the prediction error in the train and test data sets. What do you think is going on here based on the differences in prediction error?

In [43]:
#Write code for fitting a decision tree classifier with no set limits on maximum depth, features, or leaves.
from sklearn.tree import DecisionTreeClassifier
DTC = DecisionTreeClassifier(criterion='gini')

DTC = DTC.fit(X_train, Y_train)
y_predict = DTC.predict(X_test)
y_predict

array([0, 1, 0, ..., 0, 0, 0], dtype=int64)

In [44]:
def model_accuracy(real,predict):
    return sum(real == predict) / float(real.shape[0])

In [45]:
print(model_accuracy(Y_test,y_predict))

0.98


In [52]:
print("depth\n",DTC.get_depth())
print("leaves\n",DTC.get_n_leaves())
print("nodes\n",DTC.tree_.node_count)

depth
 19
leaves
 81
nodes
 161
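
The three numbers are linked: every internal node of a scikit-learn decision tree has exactly two children, so a tree with 81 leaves has 80 internal nodes and 2 * 81 - 1 = 161 nodes in total. The sketch below checks that relationship on a tree fitted to synthetic data (the data set and variable names are assumptions for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, only used to grow some tree.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

n_leaves = tree.get_n_leaves()
n_nodes = tree.tree_.node_count
n_internal = n_nodes - n_leaves  # nodes that hold a split

# In a binary tree the internal nodes number one fewer than the leaves.
print(n_leaves, n_internal, n_nodes)
```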


In [51]:
DTC.tree_.node_count
DTC.tree_.max_depth

19

In [None]:
#Write code for determining how many nodes are present and what the depth of this (very large) tree is.
n_nodes = DTC.tree_.node_count
depth = DTC.tree_.max_depth
print('Number of nodes:',n_nodes)
print('Depth:', depth)

In [53]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def eval_performance(y_real, y_pred, prediction_type):
    accuracy = accuracy_score(y_real,y_pred)
    precision = precision_score(y_real,y_pred)
    recall = recall_score(y_real,y_pred)
    f1score = f1_score(y_real,y_pred)
    
    metrics = pd.Series({'accuracy': accuracy,
                         'precision': precision,
                         'recall': recall,
                         'f1_score': f1score },
                          name = prediction_type)
    return metrics
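
To make the four metrics concrete, here is a small sketch that computes them directly from confusion-matrix counts. The function name and the counts are hypothetical. It also shows a pattern worth recognizing in results tables: when the number of false positives equals the number of false negatives, precision, recall and F1 all take the same value.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts with as many false positives as false negatives.
acc, prec, rec, f1 = metrics_from_counts(tp=90, fp=10, fn=10, tn=290)
print(acc, prec, rec, f1)
```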

In [57]:
#Train
y_train_prediction = DTC.predict(X_train)
train_per = eval_performance(Y_train,y_train_prediction, "Train")
# Test
y_test_prediction = DTC.predict(X_test)
test_per = eval_performance(Y_test,y_test_prediction, "Test")

prediction_performance_table = pd.concat([train_per, test_per], axis = 1)
print(prediction_performance_table)

              Train      Test
accuracy   0.999615  0.980000
precision  0.999205  0.961877
recall     0.999205  0.961877
f1_score   0.999205  0.961877


The unrestricted tree performs well on both sets, but it is nearly perfect on the training data (accuracy 0.9996) and noticeably lower on the test data (accuracy 0.98, F1 0.962).
A gap like this is the usual sign of overfitting: with no limit on depth, the tree keeps splitting until it has fit almost every training sample, including noise that does not carry over to new data.
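
The effect is easier to see on data with deliberate label noise. The sketch below is an illustration on synthetic data, not a rerun of the wine experiment: it fits an unrestricted tree and a depth-3 tree and prints train and test accuracy for each. The unrestricted tree memorizes the training set, and its test accuracy falls well short of that.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 assigns random labels to about 20% of the samples (label noise).
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

results = {}
for depth in (None, 3):  # None means no depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.3f} test={results[depth][1]:.3f}")
```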

## Question 4

* Find the feature importance of the dataset.
* Using grid search with cross validation, find a decision tree that performs well on the test data set. 
* Measure the accuracy, precision, recall and F1-score on the training and test sets as before.

In [59]:

DTC.feature_importances_

array([0.0046868 , 0.04922317, 0.00247365, 0.00524934, 0.20778294,
       0.00167511, 0.68909374, 0.01031577, 0.01068861, 0.01641021,
       0.00138561, 0.00101507])
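
The raw array is hard to read without feature names. The sketch below pairs the values printed above with the column names, assuming the columns are in the order shown by `df.head()` and that the leading `Unnamed: 0` in that display is the row index rather than a feature (the 12 importances match the 12 remaining columns). Under that assumption, `total_sulfur_dioxide` and `chlorides` together account for roughly 90% of the importance.

```python
# Column order as shown by df.head(), excluding the target `color` (assumed order).
feature_names = [
    "fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar",
    "chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide", "density",
    "pH", "sulphates", "alcohol", "quality",
]
# Importances copied from the array printed above.
importances = [
    0.0046868, 0.04922317, 0.00247365, 0.00524934, 0.20778294, 0.00167511,
    0.68909374, 0.01031577, 0.01068861, 0.01641021, 0.00138561, 0.00101507,
]

# Sort features from most to least important and show the top three.
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:3]:
    print(f"{name:22s} {score:.3f}")
```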

In [None]:
#Write your code to find out feature importance
pd.Series(DTC.feature_importances_, index=X_train.columns).sort_values(ascending=False)



In [61]:
from sklearn.model_selection import GridSearchCV

# Write your code to set parameters (max_depth and max_features) for cross validation
params = {'max_depth':range(1,DTC.tree_.max_depth),'max_features':range(1,len(DTC.feature_importances_)+1)}

# Write your code to apply grid search CV using the parameters
grid_search_CV = GridSearchCV(DTC,params,scoring = 'accuracy',cv=3)

# Write your code to fit the model
grid_search_CV.fit(X_train,Y_train )

GridSearchCV(cv=3, estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': range(1, 19),
                         'max_features': range(1, 13)},
             scoring='accuracy')

In [63]:
print('Number of nodes: ', grid_search_CV.best_estimator_.tree_.node_count)
print('Depth of model:  ', grid_search_CV.best_estimator_.tree_.max_depth)

Number of nodes:  71
Depth of model:   7
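
The grid above has 18 depth values and 12 `max_features` values, so 216 candidates are each evaluated on 3 folds (648 fits), and the best one is then refit on the full training set because `refit=True` is the default. That refit is why `grid_search_CV.predict` can be called directly. The sketch below shows how the chosen parameters and cross-validated score can be inspected, on synthetic data and a smaller grid (both are assumptions for illustration). It also sets `random_state` on the tree, since `max_features` below the total picks feature subsets at random and results otherwise vary between runs.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Smaller grid than in the notebook: 5 depths x 8 feature counts = 40 candidates.
param_grid = {"max_depth": range(1, 6), "max_features": range(1, 9)}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),  # fixed seed for repeatable feature sampling
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_["params"])
print(search.best_params_, round(search.best_score_, 3), n_candidates)
```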


In [65]:
#Train
y_train_prediction = grid_search_CV.predict(X_train)
train_per = eval_performance(Y_train,y_train_prediction , "Train")
# Test
y_test_prediction = grid_search_CV.predict(X_test)
test_per = eval_performance(Y_test,y_test_prediction , "Test")

prediction_performance_table = pd.concat([train_per, test_per], axis = 1)
print(prediction_performance_table)

              Train      Test
accuracy   0.994612  0.983077
precision  0.994373  0.973294
recall     0.983307  0.961877
f1_score   0.988809  0.967552
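
One way to compare the two models is the gap between train and test accuracy. The sketch below only redoes the arithmetic on the values reported in the two results tables. The tuned tree (71 nodes, depth 7) shows a smaller gap than the unrestricted tree (161 nodes, depth 19), about 0.012 versus 0.020, with slightly higher test accuracy. The difference in test accuracy is small, so it should be read as suggestive rather than conclusive.

```python
# Accuracy values copied from the two results tables in this notebook.
full_tree = {"train": 0.999615, "test": 0.980000}
tuned_tree = {"train": 0.994612, "test": 0.983077}

# Train-minus-test accuracy as a rough indicator of overfitting.
gap_full = full_tree["train"] - full_tree["test"]
gap_tuned = tuned_tree["train"] - tuned_tree["test"]
print(f"unrestricted gap: {gap_full:.6f}")
print(f"tuned gap:        {gap_tuned:.6f}")
```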
