
## Decision Tree Classifier


For this week's homework we are going to explore ideas around decision tree implementation!  

We will implement some helper functions that would be necessary for a home-grown tree:
  - calc_entropy
  - calc_gini
  
and then test them out on the given data splits.
  
And finally, to perform predictive and descriptive analytics we use the [Decision Tree Classifier](https://scikit-learn.org/stable/modules/tree.html#classification) class in the scikit-learn package.

  
For this assignment, the stopping condition will be the depth of the tree. The impurity measure should be `Entropy`.

To test our tree built from the Decision Tree Classifier class, we will use the Melbourne housing data (that has been cleaned and pruned) and use the files:

   - `melb_tree_train.csv` for training the decision tree (in the last problem we'll also see what happens when we test on the same data we used for training)
   - `melb_tree_test.csv` for testing the decision tree

There are 10 features in these dataframes that we can use to describe and predict the class label housing "Type", which is 'h' house, 'u' duplex, or 't' townhome.

In [1]:
# import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from math import log2
from sklearn import tree # you'll probably need to install this - look at Q6 for a link
import graphviz # you'll probably need to install this - look at Q6 for a link
import seaborn as sns
sns.set_style('darkgrid')

In [2]:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

In [3]:
df_train = pd.read_csv('https://gist.githubusercontent.com/yanyanzheng96/f8ba57f8377dee0810271475c728fca8/raw/afa3fa4305b55e31135980835d40b27af31f288c/melb_tree_train.csv')
df_train.head()

Unnamed: 0,Rooms,Type,Price,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea
0,2,t,771000.0,13.8,2.0,1.0,1.0,0.0,99.0,1992.0,Bayside
1,3,t,700000.0,7.9,3.0,2.0,1.0,189.0,110.0,1990.0,Banyule
2,3,u,975000.0,12.1,3.0,2.0,1.0,277.0,109.0,1975.0,Glen Eira
3,3,h,1290000.0,8.0,3.0,1.0,1.0,618.0,132.0,1960.0,Moonee Valley
4,2,u,500000.0,4.2,2.0,1.0,1.0,0.0,86.0,2000.0,Melbourne


In [4]:
df_test = pd.read_csv('https://gist.githubusercontent.com/yanyanzheng96/ec66da011b165f0e282c0c1f8447010e/raw/d02c3829a7e9db2d156ba1ab9d5bb4d18ae57be8/melb_tree_test.csv')
df_test.head()

Unnamed: 0,Rooms,Type,Price,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea
0,2,t,930000.0,2.6,2.0,1.0,1.0,97.0,85.0,2004.0,Yarra
1,3,t,815000.0,11.0,3.0,2.0,2.0,159.0,130.0,2014.0,Hobsons Bay
2,4,h,638000.0,13.0,4.0,2.0,1.0,624.0,258.0,2005.0,Moreland
3,2,t,595000.0,11.2,2.0,2.0,1.0,201.0,111.0,2005.0,Moreland
4,3,t,620500.0,11.2,3.0,2.0,1.0,158.0,117.0,2011.0,Darebin


## Q1 Load the Data
Load in the melb_tree_train.csv into a dataframe, and split that dataframe into `df_X`, which contains the features of the data set (everything but `Type`), and `s_y`, the series containing just the class label (just `Type`). The lengths of `df_X` and `s_y` should match.

In [5]:
df_X = df_train.drop(['Type'], axis=1)
s_y = df_train['Type']

In [6]:
print(df_X.shape)
print(s_y.shape)

(810, 10)
(810,)


## Q2 Implement a function to calculate entropy
Implement a function `calc_entropy` that takes the class label series, `s_y`, as a parameter. Implement it using the definition on p128 in the DM book, and use only pandas and `log2` from the math library.
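For reference, a sketch of the standard entropy definition, assumed here to match the textbook's (the symbols $c$ for the number of classes and $p_i$ for the fraction of records in class $i$ are notation chosen for this note):

$$H(s_y) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

As a check against the class counts printed in this section (281, 281, and 248 out of 810 records):

$$H = -2 \cdot \tfrac{281}{810}\log_2\tfrac{281}{810} - \tfrac{248}{810}\log_2\tfrac{248}{810} \approx 1.5825$$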

In [7]:
# calc_entropy(s_y) definition
def calc_entropy(s_y):
    probs = s_y.value_counts(normalize=True).values
    entropy = 0
    for prob in probs:
        entropy -= (prob * log2(prob))

    return entropy

In [8]:
s_y.value_counts()

t    281
u    281
h    248
Name: Type, dtype: int64

In [9]:
s_y.value_counts(normalize=True)

t    0.346914
u    0.346914
h    0.306173
Name: Type, dtype: float64

In [10]:
s_y.value_counts(normalize=True).values

array([0.34691358, 0.34691358, 0.30617284])

In [11]:

probs = s_y.value_counts(normalize=True).values
entropy = 0
for prob in probs:
    entropy -= (prob * log2(prob))

In [12]:
entropy

1.582533311426178
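The introduction also lists `calc_gini` as a helper, but no implementation appears in this notebook. A minimal sketch, assuming the usual Gini index definition (one minus the sum of squared class probabilities) and mirroring the structure of `calc_entropy`; the toy series is made up for illustration:

```python
import pandas as pd

def calc_gini(s_y):
    """Gini index of a class-label series: 1 - sum(p_i ** 2)."""
    probs = s_y.value_counts(normalize=True).values
    return 1 - sum(prob ** 2 for prob in probs)

# Toy labels (hypothetical, not from the Melbourne data): two classes, evenly split
toy_labels = pd.Series(['h', 'h', 'u', 'u'])
print(calc_gini(toy_labels))  # 0.5
```

A perfectly pure series gives 0, and an even two-class split gives 0.5, the maximum for two classes.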

## Q3 Use the entropy function to
  - (a) Calculate the entropy of the entire training set
  - (b) Calculate the entropy of the three partitions formed from the following three intervals:

(i) Landsize $\in [0,200]$

(ii) Landsize $\in (200,450]$

(iii) Landsize $\in (450, \infty)$

In [13]:
# The entire data set
calc_entropy(s_y)

1.582533311426178

In [14]:
# Less than or equal to 200
calc_entropy(df_train[df_train['Landsize'] <= 200]['Type'])

1.3456432116206725

In [15]:
# Between 200 and 450
calc_entropy(df_train[(df_train['Landsize'] > 200) & (df_train['Landsize'] <= 450)]['Type'])

1.4660501816027975

In [16]:
# greater than 450
calc_entropy(df_train[(df_train['Landsize'] > 450)]['Type'])

1.09954792005911
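The three partition entropies above are usually combined into a single number by weighting each partition by its share of the records; subtracting that from the parent entropy gives the information gain of the split. The sketch below shows one way to do this. The function name `weighted_split_entropy`, the `edges` parameter, and the toy dataframe are all hypothetical and not part of the assignment:

```python
from math import log2
import pandas as pd

def calc_entropy(s_y):
    probs = s_y.value_counts(normalize=True).values
    return -sum(p * log2(p) for p in probs)

def weighted_split_entropy(df, feature, edges, label='Type'):
    """Weighted average entropy of the partitions of `feature` cut at `edges`."""
    # pd.cut builds right-closed intervals (a, b]; include_lowest closes the first one
    parts = pd.cut(df[feature], bins=edges, include_lowest=True)
    total = len(df)
    weighted = 0.0
    # observed=True skips intervals that contain no records
    for _, group in df.groupby(parts, observed=True):
        weighted += len(group) / total * calc_entropy(group[label])
    return weighted

# Toy data (made up for illustration, not the Melbourne training set)
toy = pd.DataFrame({
    'Landsize': [0, 150, 300, 400, 500, 900],
    'Type': ['u', 't', 't', 'h', 'h', 'h'],
})
edges = [0, 200, 450, float('inf')]  # same interval boundaries as Q3
gain = calc_entropy(toy['Type']) - weighted_split_entropy(toy, 'Landsize', edges)
print(gain)
```

Calling `weighted_split_entropy(df_train, 'Landsize', edges)` should apply the same calculation to the training data, assuming `df_train` is loaded as above.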

## Q4 Create a decision tree
Using [scikit-learn](https://scikit-learn.org/stable/modules/tree.html#tree), create a multi-class classifier for the data set using the Entropy impurity measure and a max depth of 3.

Note that scikit-learn's algorithm doesn't handle categorical data, so it needs to be preprocessed using a one-hot encoding.
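To illustrate what one-hot encoding does, here is a small sketch on toy data (the frames and their values are made up). It also shows one possible way to handle a pitfall: encoding two files separately can yield different columns when a category appears in only one of them, so the second frame is reindexed to the first frame's columns.

```python
import pandas as pd

# Toy frames (hypothetical values), each with one categorical column
train = pd.DataFrame({'Rooms': [2, 3, 3],
                      'CouncilArea': ['Bayside', 'Banyule', 'Yarra']})
test = pd.DataFrame({'Rooms': [4, 2],
                     'CouncilArea': ['Yarra', 'Moreland']})

# Each category becomes its own indicator column, e.g. CouncilArea_Yarra
train_encoded = pd.get_dummies(train)

# Reuse the training columns: unseen categories are dropped,
# and categories missing from the test frame are filled with 0
test_encoded = pd.get_dummies(test).reindex(columns=train_encoded.columns,
                                            fill_value=0)

print(list(train_encoded.columns))
print(test_encoded)
```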

Display the tree using `export_text` from sklearn.tree, and use that information to write some descriptive analytics on the classification of houses.  For extra fun, use the export_graphviz to draw the graph (see documentation on the [scikit-learn webpage](https://scikit-learn.org/stable/modules/tree.html#classification)).

In [17]:
# Create Tree
df_X = pd.get_dummies(df_X) # one hot encoding for categorical data

clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3) # specify 'hyperparameters' for a decision tree
clf = clf.fit(df_X, s_y) # fit feature dataframes df_X and target column s_y into the decision tree model for training to build the 'real nodes', i.e. 'parameters'

In [18]:
# Display text version of the tree
t = tree.export_text(clf, feature_names=list(df_X.columns))
print(t)

|--- YearBuilt <= 1977.50
|   |--- Rooms <= 2.50
|   |   |--- Price <= 915000.00
|   |   |   |--- class: u
|   |   |--- Price >  915000.00
|   |   |   |--- class: h
|   |--- Rooms >  2.50
|   |   |--- Landsize <= 429.00
|   |   |   |--- class: h
|   |   |--- Landsize >  429.00
|   |   |   |--- class: h
|--- YearBuilt >  1977.50
|   |--- BuildingArea <= 104.31
|   |   |--- Landsize <= 76.00
|   |   |   |--- class: u
|   |   |--- Landsize >  76.00
|   |   |   |--- class: t
|   |--- BuildingArea >  104.31
|   |   |--- Landsize <= 391.00
|   |   |   |--- class: t
|   |   |--- Landsize >  391.00
|   |   |   |--- class: h



In [19]:
# Display graphviz version of the tree
graph_data = tree.export_graphviz(clf, out_file=None,
                                  feature_names=list(df_X.columns),
                                  # use the classifier's own class order; set(s_y) has arbitrary order and can mislabel the nodes
                                  class_names=list(clf.classes_),
                                  filled=True, rounded=True)
graph = graphviz.Source(graph_data)
graph

ExecutableNotFound: failed to execute PosixPath('dot'), make sure the Graphviz executables are on your systems' PATH

<graphviz.sources.Source at 0x163306650>

➡️ Answer containing your descriptive analytics in markdown here ⬅️
### Answer
* **h - house**: among dwellings built in 1977 or earlier, anything with 3 or more rooms is classified as a house regardless of land size, and those with 2 or fewer rooms are classified as houses only when they cost more than \$915,000. Among dwellings built after 1977, houses are the large ones: building area above 104.31 square meters and land size above 391 square meters.
* **u - duplex**: duplexes are the small dwellings. Older ones (built in 1977 or earlier) have 2 or fewer rooms and cost \$915,000 or less; newer ones have a building area of 104.31 square meters or less and a land size of 76 square meters or less.
* **t - townhome**: townhomes are newer, built after 1977. They either have a building area of 104.31 square meters or less with a land size above 76 square meters (more than the comparable duplexes), or a building area above 104.31 square meters with a land size of 391 square meters or less (less than the comparable houses).

## Q5 Calculate the Accuracy and Display Learning Curve
Load in the test data from melb_tree_test.csv.

Use the scikit-learn library to create many decision trees, each one with a different configuration (aka hyperparameters).  You will create 14 different trees by:

  - Varying the max depth from 2 to 15 with Entropy as the impurity measure

Implementation tip: you can create an array of numbers from 2 to 15 by using the numpy function [arange](https://numpy.org/doc/stable/reference/generated/numpy.arange.html).

For each of the 14 decision trees, calculate the error rate by using the data in the:
  - Training set, and
  - Test set.

Display the results graphically, and offer an analysis of the trend (or, if no trend is present, offer a hypothesis of why).  The max depth should be on the x-axis, and the error rate should be on the y-axis (see figure 3.23 in your DM textbook for a similar style of graph that uses leaf nodes instead of depth for the x-axis). Your plot will include 2 series of data:
   - Test error (entropy)
   - Training error (entropy)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
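# Load in the test data
X_test = df_test.drop(['Type'], axis=1)
X_test = pd.get_dummies(X_test)
# Align with the encoded training columns; a CouncilArea value present in only one file would otherwise change the feature set
X_test = X_test.reindex(columns=df_X.columns, fill_value=0)
y_test = df_test['Type']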

In [None]:
# Build the trees using the training data
trees_dict = {}

for criterion in ['entropy']:
    for i in range(2, 16):
        name = criterion + str(i)
        clf = tree.DecisionTreeClassifier(criterion=criterion, max_depth=i)
        clf = clf.fit(df_X, s_y)
        trees_dict[name] = clf

# Error rate = 1 - accuracy, computed for each tree on the training set
train_errors = []
for t in trees_dict.values():  # loop over the fitted trees, in order of increasing depth
    y_pred = t.predict(df_X)   # predict the training labels with each tree
    train_errors.append(1 - accuracy_score(s_y, y_pred))

# Same calculation on the held-out test set
test_errors = []
for t in trees_dict.values():
    y_pred = t.predict(X_test)
    test_errors.append(1 - accuracy_score(y_test, y_pred))

In [None]:
# Plot the 2 learning curves
depths = list(range(2, 16))

plt.plot(depths, train_errors, label='entropy train')
plt.plot(depths, test_errors, label='entropy test')
plt.legend()
plt.xlabel('Maximum Tree Depth')
plt.ylabel('Error Rate')
plt.title('Error Rate by Maximum Tree Depth (Entropy)')
plt.show()

➡️ Answer containing your analysis of the trend (or if no trend present, offer a hypothesis of why) here ⬅️

#### **Answer**

With entropy as the impurity measure, the training error consistently decreases as the maximum tree depth increases before leveling off very close to 0% (nearly 100% training accuracy). The test error, however, decreases only up to a maximum tree depth of around 3-4 before increasing again. Both trends are consistent with overfitting: deeper trees fit the training data ever more closely, so the training error keeps shrinking, but they lose the ability to generalize because their rules capture patterns specific to the training data that do not hold for the test data.
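The trend described above can also be read off numerically. The sketch below finds the depth with the lowest test error and the train/test gap at each depth. The helper name `summarize_curves` is hypothetical, and the error values are made-up illustrative numbers, not results from this notebook:

```python
def summarize_curves(depths, train_errors, test_errors):
    """Return the depth with the lowest test error and the test-minus-train gaps."""
    best_index = min(range(len(depths)), key=lambda i: test_errors[i])
    gaps = [test - train for train, test in zip(train_errors, test_errors)]
    return depths[best_index], gaps

# Made-up illustrative numbers, NOT results from this notebook
depths = [2, 3, 4, 5, 6]
train_errors = [0.30, 0.25, 0.20, 0.12, 0.05]
test_errors = [0.32, 0.28, 0.27, 0.30, 0.33]

best, gaps = summarize_curves(depths, train_errors, test_errors)
print(best)  # 4
print(gaps)  # a widening gap at larger depths is a typical sign of overfitting
```

In practice the depth is usually chosen on a separate validation set or by cross-validation rather than on the test set, so that the test error remains an unbiased estimate.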