# Decision Trees in Practice

In this assignment we will explore several techniques for preventing overfitting in decision trees. We will extend the binary decision tree implementation from the previous assignment, so you will need your solutions from that assignment.

In this assignment you will:

- Implement binary decision trees with different early stopping methods.
- Compare models with different stopping parameters.
- Visualize the concept of overfitting in decision trees.

In [2]:
from __future__ import division  # make / return floats under Python 2
import graphlab
import pandas as pd
import numpy as np

In [3]:
loans = graphlab.SFrame('lending-club-data.gl/')

In [4]:
loans['safe_loans'] = loans['bad_loans'].apply(lambda x: +1 if(x==0) else -1)

In [5]:
features = ['grade',              # grade of the loan
            'term',               # the term of the loan
            'home_ownership',     # home_ownership status: own, mortgage or rent
            'emp_length',         # number of years of employment
           ]
target = 'safe_loans'

In [6]:
loans = loans[features + [target]]

Subsample dataset to make sure classes are balanced

In [7]:
safe_loans_data = loans[loans['safe_loans'] == +1]
risky_loans_data = loans[loans['safe_loans'] == -1]

pctage = (len(risky_loans_data) / len(safe_loans_data))
safe_loans_data2 = safe_loans_data.sample(pctage, seed=1)

print "num safe loans:", len(safe_loans_data2)
print "num risky loans:", len(risky_loans_data)

loans_data = risky_loans_data.append(safe_loans_data2)

num safe loans: 23358
num risky loans: 23150
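
The subsampling above keeps every risky loan and draws safe loans at a rate equal to the class ratio, so the two counts end up close but not identical (23358 versus 23150). As a minimal sketch of the same idea on plain Python lists, assuming toy rows and a hypothetical helper named `balance_classes` (not part of GraphLab Create), the larger class can instead be downsampled to exactly the size of the smaller one:

```python
import random

def balance_classes(rows, label_key, seed=1):
    """Downsample the larger class so both classes have the same count.

    Hypothetical helper for illustration only.
    """
    positives = [r for r in rows if r[label_key] == +1]
    negatives = [r for r in rows if r[label_key] == -1]
    small, large = sorted([positives, negatives], key=len)
    rng = random.Random(seed)
    # Keep the whole smaller class, plus an equally sized random draw from the larger one.
    return small + rng.sample(large, len(small))

# Toy data: 8 safe loans and 3 risky loans.
toy_rows = [{'safe_loans': +1}] * 8 + [{'safe_loans': -1}] * 3
balanced = balance_classes(toy_rows, 'safe_loans')
num_safe = sum(1 for r in balanced if r['safe_loans'] == +1)
num_risky = sum(1 for r in balanced if r['safe_loans'] == -1)
print("num safe: %d, num risky: %d" % (num_safe, num_risky))
```

Whether an exact or an approximate balance is preferable depends on the use case; the notebook's fraction-based sampling is the approach actually used for the results reported here.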


Apply one-hot encoding

In [8]:
for feature in features:
    loans_data_one_hot_encoded = loans_data[feature].apply(lambda x: {x: 1})    
    loans_data_unpacked = loans_data_one_hot_encoded.unpack(column_name_prefix=feature)
    
    # Change None's to 0's
    for column in loans_data_unpacked.column_names():
        loans_data_unpacked[column] = loans_data_unpacked[column].fillna(0)

    loans_data.remove_column(feature)
    loans_data.add_columns(loans_data_unpacked)
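
The loop above turns each categorical feature into one 0/1 column per category, named `feature.value`. A minimal sketch of that transformation on plain Python dictionaries, assuming toy rows and a hypothetical helper named `one_hot_encode` (the notebook itself relies on SFrame `apply` and `unpack`):

```python
def one_hot_encode(rows, feature):
    """Replace row[feature] with one 0/1 column per observed category.

    Hypothetical helper for illustration only.
    """
    categories = sorted(set(row[feature] for row in rows))
    encoded = []
    for row in rows:
        # Copy every column except the one being encoded.
        new_row = dict((k, v) for k, v in row.items() if k != feature)
        for category in categories:
            column = '%s.%s' % (feature, category)
            new_row[column] = 1 if row[feature] == category else 0
        encoded.append(new_row)
    return encoded

toy_loans = [
    {'grade': 'A', 'safe_loans': +1},
    {'grade': 'C', 'safe_loans': -1},
    {'grade': 'A', 'safe_loans': -1},
]
encoded = one_hot_encode(toy_loans, 'grade')
for row in encoded:
    print(sorted(row.items()))
```

Exactly one of the new columns is 1 in each row, which is what lets the tree code below treat every feature as a binary split (value 0 goes left, value 1 goes right).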

In [9]:
features = loans_data.column_names()
features.remove('safe_loans')  # Remove the response variable
features

['grade.A',
 'grade.B',
 'grade.C',
 'grade.D',
 'grade.E',
 'grade.F',
 'grade.G',
 'term. 36 months',
 'term. 60 months',
 'home_ownership.MORTGAGE',
 'home_ownership.OTHER',
 'home_ownership.OWN',
 'home_ownership.RENT',
 'emp_length.1 year',
 'emp_length.10+ years',
 'emp_length.2 years',
 'emp_length.3 years',
 'emp_length.4 years',
 'emp_length.5 years',
 'emp_length.6 years',
 'emp_length.7 years',
 'emp_length.8 years',
 'emp_length.9 years',
 'emp_length.< 1 year',
 'emp_length.n/a']

In [10]:
train_data, validation_data = loans_data.random_split(0.8, seed=1)

In [11]:
def intermediate_node_num_mistakes(labels_in_node):
    if (len(labels_in_node) == 0):
        return 0
    
    num_safe_loans = len(labels_in_node[labels_in_node == +1])
    num_bad_loans = len(labels_in_node[labels_in_node == -1])
    
    if(num_safe_loans > num_bad_loans):
        num_mistakes = num_bad_loans
    else:
        num_mistakes = num_safe_loans
    
    return num_mistakes

In [12]:
# Test case 1
example_labels = np.array([-1, -1, 1, 1, 1])
if intermediate_node_num_mistakes(example_labels) == 2:
    print 'Test passed!'
else:
    print 'Test 1 failed... try again!'

# Test case 2
example_labels = np.array([-1, -1, 1, 1, 1, 1, 1])
if intermediate_node_num_mistakes(example_labels) == 2:
    print 'Test passed!'
else:
    print 'Test 2 failed... try again!'
    
# Test case 3
example_labels = np.array([-1, -1, -1, -1, -1, 1, 1])
if intermediate_node_num_mistakes(example_labels) == 2:
    print 'Test passed!'
else:
    print 'Test 3 failed... try again!'

Test passed!
Test passed!
Test passed!


In [13]:
def get_best_splitting_feature(data, features, target):
    total_num_samples = len(data)
    
    best_feature = None
    lowest_err = 2    
    
    for feature in features:
        left_split = data[data[feature] == 0]
        right_split = data[data[feature] == 1]
        
        left_split_ce = intermediate_node_num_mistakes(left_split[target])
        right_split_ce = intermediate_node_num_mistakes(right_split[target])
        
        ce = (left_split_ce + right_split_ce) / total_num_samples
        
        if(ce < lowest_err):
            best_feature = feature
            lowest_err = ce
    
    return best_feature

In [14]:
if get_best_splitting_feature(train_data, features, 'safe_loans') == 'term. 36 months':
    print 'Test passed!'
else:
    print 'Test failed... try again!'

Test passed!


In [15]:
def create_leaf(target_values):    
    # Create a leaf node
    leaf = {'splitting_feature' : None,
            'left' : None,
            'right' : None,
            'is_leaf': True}   ## YOUR CODE HERE 
   
    # Count the number of data points that are +1 and -1 in this node.
    num_ones = len(target_values[target_values == +1])
    num_minus_ones = len(target_values[target_values == -1])    

    # For the leaf node, set the prediction to be the majority class.
    # Store the predicted class (1 or -1) in leaf['prediction']
    if num_ones > num_minus_ones:
        leaf['prediction'] = +1         ## YOUR CODE HERE
    else:
        leaf['prediction'] = -1         ## YOUR CODE HERE        

    # Return the leaf node
    return leaf 

Early stopping condition 3: Minimum gain in error reduction

In [16]:
def error_reduction(error_before_split, error_after_split):
    # Return the error before the split minus the error after the split.
    return (error_before_split - error_after_split)
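
To make the quantity concrete, here is a worked example with made-up counts (the node sizes and the helper `majority_mistakes` are assumptions for illustration). Both errors are divided by the size of the node being split, so they are directly comparable:

```python
def majority_mistakes(num_safe, num_risky):
    # A majority-class prediction misclassifies every minority-class point.
    return min(num_safe, num_risky)

# Hypothetical node: (number of safe loans, number of risky loans).
node = (25, 15)
left = (20, 5)    # rows where the splitting feature is 0
right = (5, 10)   # rows where the splitting feature is 1

n = float(sum(node))
error_before = majority_mistakes(*node) / n                                # 15 / 40
error_after = (majority_mistakes(*left) + majority_mistakes(*right)) / n   # 10 / 40
gain = error_before - error_after
print("before=%.3f after=%.3f reduction=%.3f" % (error_before, error_after, gain))
```

With `min_error_reduction = 0.0` a split like this would be kept, while a threshold of, say, 0.2 would turn the node into a leaf.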

In [17]:
def decision_tree_create(data, features, target, current_depth = 0, max_depth = 10, min_node_size = 1, min_error_reduction = -1):
    remaining_features = features[:]
    
    target_values = data[target]
    
    # Stopping condition 1: All nodes are of the same type.    
    if(intermediate_node_num_mistakes(target_values) == 0):
        print "Stopping condition 1 reached. All nodes are of the same type."
        return (create_leaf(target_values))
    
    # Stopping condition 2: No more features to split on.
    # (remaining_features is a list, so test for emptiness rather than None.)
    if len(remaining_features) == 0:
        print "Stopping condition 2 reached. No more features to split on."
        return (create_leaf(target_values))
        
    # Early stopping condition 1: Reached max depth limit.
    if(current_depth >= max_depth):
        print "Early stopping condition 1 reached. Reached max depth limit."
        return (create_leaf(target_values))
        
    # Early stopping condition 2: Reached the minimum node size.
    if(len(data) <= min_node_size):
        print "Early stopping condition 2 reached. Reached the minimum node size."
        return (create_leaf(target_values))
    
    # Find the best splitting feature    
    splitting_feature = get_best_splitting_feature(data, features, target)
    
    left_split = data[data[splitting_feature] == 0]
    right_split = data[data[splitting_feature] == 1]
    
    remaining_features.remove(splitting_feature)
    
    # Early stopping condition 3: Minimum error reduction
    # Calculate the error before splitting (number of misclassified examples 
    # divided by the total number of examples)
    error_before_split = intermediate_node_num_mistakes(target_values) / float(len(data))
    
    # Calculate the error after splitting (number of misclassified examples 
    # in both groups divided by the total number of examples)
    left_mistakes = intermediate_node_num_mistakes(left_split[target])
    right_mistakes = intermediate_node_num_mistakes(right_split[target])
    error_after_split = (left_mistakes + right_mistakes) / float(len(data))

    # If the error reduction is LESS THAN OR EQUAL TO min_error_reduction, return a leaf.
    if (error_reduction(error_before_split, error_after_split) <= min_error_reduction):
        print "Early stopping condition 3 reached. Minimum error reduction."
        return (create_leaf(target_values))
    
    print "Split on feature %s. (%s, %s)" % (\
                      splitting_feature, len(left_split), len(right_split))
    
    # Create a leaf node if the split is "perfect"
    if len(left_split) == len(data):
        print "Creating leaf node."
        return create_leaf(left_split[target])
    if len(right_split) == len(data):
        print "Creating leaf node."    
        return create_leaf(right_split[target])
    
    left_tree = decision_tree_create(left_split, remaining_features, target, current_depth + 1, max_depth, min_node_size, min_error_reduction)
    right_tree = decision_tree_create(right_split, remaining_features, target, current_depth + 1, max_depth, min_node_size, min_error_reduction)

    return {'is_leaf'          : False, 
            'prediction'       : None,
            'splitting_feature': splitting_feature,
            'left'             : left_tree, 
            'right'            : right_tree}

Build a tree!

In [18]:
max_depth = 6
min_node_size = 100
min_error_reduction = 0.0

In [19]:
my_decision_tree_new = decision_tree_create(train_data, features, 'safe_loans', max_depth = 6, min_node_size = 100, min_error_reduction=0.0)

Split on feature term. 36 months. (9223, 28001)
Split on feature grade.A. (9122, 101)
Early stopping condition 3 reached. Minimum error reduction.
Split on feature emp_length.n/a. (96, 5)
Early stopping condition 2 reached. Reached the minimum node size.
Early stopping condition 2 reached. Reached the minimum node size.
Split on feature grade.D. (23300, 4701)
Split on feature grade.E. (22024, 1276)
Split on feature grade.F. (21666, 358)
Split on feature emp_length.n/a. (20734, 932)
Split on feature grade.G. (20638, 96)
Early stopping condition 1 reached. Reached max depth limit.
Early stopping condition 1 reached. Reached max depth limit.
Split on feature grade.A. (702, 230)
Early stopping condition 1 reached. Reached max depth limit.
Early stopping condition 1 reached. Reached max depth limit.
Split on feature emp_length.8 years. (347, 11)
Early stopping condition 3 reached. Minimum error reduction.
Early stopping condition 2 reached. Reached the minimum node size.
Early stopping cond

Let's now train a tree model ignoring early stopping conditions 2 and 3 so that we get the same tree as in the previous assignment. To ignore these conditions, we set min_node_size=0 and min_error_reduction=-1 (a negative value). Call this model my_decision_tree_old.

In [20]:
my_decision_tree_old = decision_tree_create(train_data, features, 'safe_loans', max_depth = 6, min_node_size = 0, min_error_reduction=-1)

Split on feature term. 36 months. (9223, 28001)
Split on feature grade.A. (9122, 101)
Split on feature grade.B. (8074, 1048)
Split on feature grade.C. (5884, 2190)
Split on feature grade.D. (3826, 2058)
Split on feature grade.E. (1693, 2133)
Early stopping condition 1 reached. Reached max depth limit.
Early stopping condition 1 reached. Reached max depth limit.
Split on feature grade.E. (2058, 0)
Creating leaf node.
Split on feature grade.D. (2190, 0)
Creating leaf node.
Split on feature emp_length.5 years. (969, 79)
Split on feature grade.C. (969, 0)
Creating leaf node.
Split on feature home_ownership.MORTGAGE. (34, 45)
Split on feature grade.C. (34, 0)
Creating leaf node.
Split on feature grade.C. (45, 0)
Creating leaf node.
Split on feature emp_length.n/a. (96, 5)
Split on feature emp_length.< 1 year. (85, 11)
Split on feature grade.B. (85, 0)
Creating leaf node.
Split on feature grade.B. (11, 0)
Creating leaf node.
Split on feature grade.B. (5, 0)
Creating leaf node.
Split on featu
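
Why does `min_error_reduction=-1` switch off early stopping condition 3? With majority-class predictions, a split can never increase the number of mistakes, so the error reduction is never negative and can never be less than or equal to -1. The sketch below checks this by brute force over small hypothetical class counts (the helper names are assumptions for illustration):

```python
def majority_mistakes(num_safe, num_risky):
    return min(num_safe, num_risky)

def reduction(left, right):
    """Error reduction of a split; left and right are (num_safe, num_risky) pairs."""
    total = float(sum(left) + sum(right))
    before = majority_mistakes(left[0] + right[0], left[1] + right[1]) / total
    after = (majority_mistakes(*left) + majority_mistakes(*right)) / total
    return before - after

# Try every split with up to 5 points per class on each side.
smallest = min(
    reduction((a, b), (c, d))
    for a in range(6) for b in range(6)
    for c in range(6) for d in range(6)
    if a + b + c + d > 0
)
print("smallest error reduction found: %s" % smallest)
print("would condition 3 fire with -1? %s" % (smallest <= -1))
```

A similar argument covers `min_node_size=0`: a split that sends every row to one side is turned into a leaf before recursing, so every node the function is called on holds at least one row and `len(data) <= 0` never holds.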

Making predictions

In [21]:
def classify(tree, x, annotate = False):
    # if the node is a leaf node.
    if tree['is_leaf']:
        if annotate:
            print "At leaf, predicting %s" % tree['prediction']
        return tree['prediction']
    else:        
        # split on feature.
        split_feature_value = x[tree['splitting_feature']]
        if annotate:
            print "Split on %s = %s" % (tree['splitting_feature'], split_feature_value)
        if split_feature_value == 0:
            if annotate:
                print "...left split"
            return classify(tree['left'], x, annotate)
        else:
            if annotate:
                print "...right split"
            return classify(tree['right'], x, annotate)

In [22]:
print validation_data[0]

{'emp_length.3 years': 0L, 'home_ownership.RENT': 1L, 'home_ownership.OWN': 0L, 'emp_length.6 years': 0L, 'emp_length.9 years': 0L, 'emp_length.1 year': 0L, 'home_ownership.OTHER': 0L, 'safe_loans': -1L, 'emp_length.< 1 year': 0L, 'emp_length.10+ years': 0L, 'emp_length.5 years': 0L, 'term. 60 months': 1L, 'home_ownership.MORTGAGE': 0L, 'emp_length.2 years': 1L, 'emp_length.7 years': 0L, 'emp_length.n/a': 0L, 'grade.D': 1L, 'grade.E': 0L, 'grade.F': 0L, 'grade.G': 0L, 'grade.A': 0L, 'grade.B': 0L, 'grade.C': 0L, 'emp_length.4 years': 0L, 'term. 36 months': 0L, 'emp_length.8 years': 0L}


In [23]:
print 'Predicted class: %s ' % classify(my_decision_tree_new, validation_data[0])

Predicted class: -1 


Let's add some annotations to our prediction to see the path that led to this predicted class:

In [24]:
classify(my_decision_tree_new, validation_data[0], annotate=True)

Split on term. 36 months = 0
...left split
Split on grade.A = 0
...left split
At leaf, predicting -1


-1

Let's now recall the prediction path for the decision tree learned in the previous assignment, which we recreated here as my_decision_tree_old.

In [25]:
classify(my_decision_tree_old, validation_data[0], annotate=True)

Split on term. 36 months = 0
...left split
Split on grade.A = 0
...left split
Split on grade.B = 0
...left split
Split on grade.C = 0
...left split
Split on grade.D = 1
...right split
At leaf, predicting -1


-1

Evaluating the model

In [26]:
def evaluate_classification_error(tree, data, target='safe_loans'):
    prediction = data.apply(lambda x: classify(tree, x, False))    
    num_err = np.sum(np.array(prediction != data[target]))
    total_num_points = len(data)
    classification_error = (num_err / total_num_points)
    return classification_error
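
The classification error is the number of mistakes divided by the number of data points. The following sketch reproduces that calculation without GraphLab on a hand-built one-split tree and four toy rows; the tree, the rows, and the iterative variant of `classify` are assumptions made for illustration:

```python
def classify(tree, x):
    # Walk down until a leaf is reached: feature value 0 goes left, anything else right.
    while not tree['is_leaf']:
        if x[tree['splitting_feature']] == 0:
            tree = tree['left']
        else:
            tree = tree['right']
    return tree['prediction']

def make_leaf(prediction):
    return {'is_leaf': True, 'prediction': prediction,
            'splitting_feature': None, 'left': None, 'right': None}

# Toy tree with a single split, in the same dictionary layout as the notebook's trees.
toy_tree = {'is_leaf': False, 'prediction': None,
            'splitting_feature': 'term. 36 months',
            'left': make_leaf(-1), 'right': make_leaf(+1)}

toy_rows = [
    {'term. 36 months': 1, 'safe_loans': +1},  # predicted +1, correct
    {'term. 36 months': 1, 'safe_loans': -1},  # predicted +1, mistake
    {'term. 36 months': 0, 'safe_loans': -1},  # predicted -1, correct
    {'term. 36 months': 0, 'safe_loans': -1},  # predicted -1, correct
]

num_mistakes = sum(1 for row in toy_rows
                   if classify(toy_tree, row) != row['safe_loans'])
classification_error = num_mistakes / float(len(toy_rows))
print("classification error: %s" % classification_error)
```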

# Is the validation error of the new decision tree (using early stopping conditions 2 and 3) lower than, higher than, or the same as that of the old decision tree from the previous assignment?

Now, let's use this function to evaluate the classification error of my_decision_tree_new on validation_data.

In [27]:
evaluate_classification_error(my_decision_tree_new, validation_data)

0.38367083153813014

Now, evaluate the validation error using my_decision_tree_old.

In [28]:
evaluate_classification_error(my_decision_tree_old, validation_data)

0.38377854373115039

Exploring the effect of max_depth

We will compare three models trained with different values of max_depth. The values are intentionally spread far apart (too small, just right, and too large).

In [29]:
model_1_max_depth = 2
model_2_max_depth = 6
model_3_max_depth = 14

In [30]:
model_1 = decision_tree_create(train_data, features, 'safe_loans', max_depth = model_1_max_depth, min_node_size = 0, min_error_reduction=-1)
model_2 = decision_tree_create(train_data, features, 'safe_loans', max_depth = model_2_max_depth, min_node_size = 0, min_error_reduction=-1)
model_3 = decision_tree_create(train_data, features, 'safe_loans', max_depth = model_3_max_depth, min_node_size = 0, min_error_reduction=-1)

Split on feature term. 36 months. (9223, 28001)
Split on feature grade.A. (9122, 101)
Early stopping condition 1 reached. Reached max depth limit.
Early stopping condition 1 reached. Reached max depth limit.
Split on feature grade.D. (23300, 4701)
Early stopping condition 1 reached. Reached max depth limit.
Early stopping condition 1 reached. Reached max depth limit.
Split on feature term. 36 months. (9223, 28001)
Split on feature grade.A. (9122, 101)
Split on feature grade.B. (8074, 1048)
Split on feature grade.C. (5884, 2190)
Split on feature grade.D. (3826, 2058)
Split on feature grade.E. (1693, 2133)
Early stopping condition 1 reached. Reached max depth limit.
Early stopping condition 1 reached. Reached max depth limit.
Split on feature grade.E. (2058, 0)
Creating leaf node.
Split on feature grade.D. (2190, 0)
Creating leaf node.
Split on feature emp_length.5 years. (969, 79)
Split on feature grade.C. (969, 0)
Creating leaf node.
Split on feature home_ownership.MORTGAGE. (34, 45)
S

In [31]:
print "Training data, classification error (model 1):", evaluate_classification_error(model_1, train_data)
print "Training data, classification error (model 2):", evaluate_classification_error(model_2, train_data)
print "Training data, classification error (model 3):", evaluate_classification_error(model_3, train_data)

Training data, classification error (model 1): 0.400037610144
Training data, classification error (model 2): 0.381850419084
Training data, classification error (model 3): 0.376182033097


# Which tree has the smallest error on the validation data?

In [32]:
print "Validation data, classification error (model 1):", evaluate_classification_error(model_1, validation_data)
print "Validation data, classification error (model 2):", evaluate_classification_error(model_2, validation_data)
print "Validation data, classification error (model 3):", evaluate_classification_error(model_3, validation_data)

Validation data, classification error (model 1): 0.398104265403
Validation data, classification error (model 2): 0.383778543731
Validation data, classification error (model 3): 0.37731581215
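
The two sets of numbers are easier to compare side by side. The sketch below only tabulates the errors printed above (copied by hand into dictionaries) and picks the model with the lowest validation error:

```python
# Errors copied from the printed outputs above.
train_error = {'model_1': 0.400037610144,
               'model_2': 0.381850419084,
               'model_3': 0.376182033097}
validation_error = {'model_1': 0.398104265403,
                    'model_2': 0.383778543731,
                    'model_3': 0.37731581215}

for name in sorted(train_error):
    gap = validation_error[name] - train_error[name]
    print("%s: train=%.4f validation=%.4f gap=%+.4f"
          % (name, train_error[name], validation_error[name], gap))

best = min(validation_error, key=validation_error.get)
print("lowest validation error: %s" % best)
```

On this particular split the deepest tree has the lowest error on both the training and the validation data, so these numbers alone do not yet show validation error rising with depth.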


Measuring the complexity of the tree

In [33]:
def count_leaves(tree):
    if tree['is_leaf']:
        return 1
    return count_leaves(tree['left']) + count_leaves(tree['right'])

Using the function count_leaves, compute the number of leaves in model_1, model_2, and model_3.

# Which tree has the largest complexity?

In [34]:
print "model_1 complexity:", count_leaves(model_1)
print "model_2 complexity:", count_leaves(model_2)
print "model_3 complexity:", count_leaves(model_3)

model_1 complexity: 4
model_2 complexity: 19
model_3 complexity: 41
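
Note that the complexity measure counts leaves, not all nodes. As a small sketch on a hand-built toy tree, the hypothetical companion function `count_nodes` below counts internal nodes as well, for comparison:

```python
def count_leaves(tree):
    if tree['is_leaf']:
        return 1
    return count_leaves(tree['left']) + count_leaves(tree['right'])

def count_nodes(tree):
    # Hypothetical companion: counts internal nodes as well as leaves.
    if tree['is_leaf']:
        return 1
    return 1 + count_nodes(tree['left']) + count_nodes(tree['right'])

# Toy tree: the root splits once, and its left child splits once more.
leaf = {'is_leaf': True}
toy_tree = {'is_leaf': False,
            'left': {'is_leaf': False, 'left': leaf, 'right': leaf},
            'right': leaf}

num_leaves = count_leaves(toy_tree)
num_nodes = count_nodes(toy_tree)
print("leaves: %d, nodes: %d" % (num_leaves, num_nodes))
```

Because every internal node built by decision_tree_create has exactly two children, the total node count is always 2 * leaves - 1, so the 4, 19, and 41 leaves reported above correspond to 7, 37, and 81 nodes.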


Exploring the effect of min_error_reduction

In [35]:
model_4_min_error_reduction = -1 # (ignoring this early stopping condition)
model_5_min_error_reduction = 0 # (just right)
model_6_min_error_reduction = 5 # (too positive)

In [36]:
model_4 = decision_tree_create(train_data, features, 'safe_loans', max_depth = 6, min_node_size = 0, min_error_reduction=model_4_min_error_reduction)
model_5 = decision_tree_create(train_data, features, 'safe_loans', max_depth = 6, min_node_size = 0, min_error_reduction=model_5_min_error_reduction)
model_6 = decision_tree_create(train_data, features, 'safe_loans', max_depth = 6, min_node_size = 0, min_error_reduction=model_6_min_error_reduction)

Split on feature term. 36 months. (9223, 28001)
Split on feature grade.A. (9122, 101)
Split on feature grade.B. (8074, 1048)
Split on feature grade.C. (5884, 2190)
Split on feature grade.D. (3826, 2058)
Split on feature grade.E. (1693, 2133)
Early stopping condition 1 reached. Reached max depth limit.
Early stopping condition 1 reached. Reached max depth limit.
Split on feature grade.E. (2058, 0)
Creating leaf node.
Split on feature grade.D. (2190, 0)
Creating leaf node.
Split on feature emp_length.5 years. (969, 79)
Split on feature grade.C. (969, 0)
Creating leaf node.
Split on feature home_ownership.MORTGAGE. (34, 45)
Split on feature grade.C. (34, 0)
Creating leaf node.
Split on feature grade.C. (45, 0)
Creating leaf node.
Split on feature emp_length.n/a. (96, 5)
Split on feature emp_length.< 1 year. (85, 11)
Split on feature grade.B. (85, 0)
Creating leaf node.
Split on feature grade.B. (11, 0)
Creating leaf node.
Split on feature grade.B. (5, 0)
Creating leaf node.
Split on featu

In [37]:
print "Validation data, classification error (model 4):", evaluate_classification_error(model_4, validation_data)
print "Validation data, classification error (model 5):", evaluate_classification_error(model_5, validation_data)
print "Validation data, classification error (model 6):", evaluate_classification_error(model_6, validation_data)

Validation data, classification error (model 4): 0.383778543731
Validation data, classification error (model 5): 0.383778543731
Validation data, classification error (model 6): 0.503446790177


# Using the complexity definition above, which model (model_4, model_5, or model_6) has the largest complexity? Did this match your expectation?

In [38]:
print "model_4 complexity:", count_leaves(model_4)
print "model_5 complexity:", count_leaves(model_5)
print "model_6 complexity:", count_leaves(model_6)

model_4 complexity: 19
model_5 complexity: 13
model_6 complexity: 1
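
The single leaf in model_6 follows from the range of the error reduction. The error before a split is a majority-class error rate, so it is at most 0.5, and the error after the split is at least 0; the reduction therefore never exceeds 0.5, and a threshold of 5 stops the tree at the root. The brute-force sketch below, with hypothetical helper names and small made-up class counts, illustrates the bound:

```python
def majority_mistakes(num_safe, num_risky):
    return min(num_safe, num_risky)

def reduction(left, right):
    """Error reduction of a split; left and right are (num_safe, num_risky) pairs."""
    total = float(sum(left) + sum(right))
    before = majority_mistakes(left[0] + right[0], left[1] + right[1]) / total
    after = (majority_mistakes(*left) + majority_mistakes(*right)) / total
    return before - after

# Largest reduction over every split with up to 5 points per class on each side.
largest = max(
    reduction((a, b), (c, d))
    for a in range(6) for b in range(6)
    for c in range(6) for d in range(6)
    if a + b + c + d > 0
)
print("largest error reduction found: %s" % largest)

# Does early stopping condition 3 fire even for the best possible split?
always_stops = dict((t, largest <= t) for t in (-1, 0, 5))
for threshold in sorted(always_stops):
    print("min_error_reduction=%s stops every split: %s"
          % (threshold, always_stops[threshold]))
```

A one-leaf tree predicts the same class for every loan, which on a roughly balanced data set gives an error near 0.5, in line with the 0.503 reported for model_6.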


Exploring the effect of min_node_size

In [39]:
model_7_min_node_size = 0 # (too small)
model_8_min_node_size = 2000 # (just right)
model_9_min_node_size = 50000 # (too large)

In [40]:
model_7 = decision_tree_create(train_data, features, 'safe_loans', max_depth = 6, min_node_size = model_7_min_node_size, min_error_reduction=-1)
model_8 = decision_tree_create(train_data, features, 'safe_loans', max_depth = 6, min_node_size = model_8_min_node_size, min_error_reduction=-1)
model_9 = decision_tree_create(train_data, features, 'safe_loans', max_depth = 6, min_node_size = model_9_min_node_size, min_error_reduction=-1)

Split on feature term. 36 months. (9223, 28001)
Split on feature grade.A. (9122, 101)
Split on feature grade.B. (8074, 1048)
Split on feature grade.C. (5884, 2190)
Split on feature grade.D. (3826, 2058)
Split on feature grade.E. (1693, 2133)
Early stopping condition 1 reached. Reached max depth limit.
Early stopping condition 1 reached. Reached max depth limit.
Split on feature grade.E. (2058, 0)
Creating leaf node.
Split on feature grade.D. (2190, 0)
Creating leaf node.
Split on feature emp_length.5 years. (969, 79)
Split on feature grade.C. (969, 0)
Creating leaf node.
Split on feature home_ownership.MORTGAGE. (34, 45)
Split on feature grade.C. (34, 0)
Creating leaf node.
Split on feature grade.C. (45, 0)
Creating leaf node.
Split on feature emp_length.n/a. (96, 5)
Split on feature emp_length.< 1 year. (85, 11)
Split on feature grade.B. (85, 0)
Creating leaf node.
Split on feature grade.B. (11, 0)
Creating leaf node.
Split on feature grade.B. (5, 0)
Creating leaf node.
Split on featu

# Using the results obtained in this section, which model (model_7, model_8, or model_9) would you choose to use?

In [41]:
print "Validation data, classification error (model 7):", evaluate_classification_error(model_7, validation_data)
print "Validation data, classification error (model 8):", evaluate_classification_error(model_8, validation_data)
print "Validation data, classification error (model 9):", evaluate_classification_error(model_9, validation_data)

Validation data, classification error (model 7): 0.383778543731
Validation data, classification error (model 8): 0.384532529082
Validation data, classification error (model 9): 0.503446790177


In [42]:
print "model_7 complexity:", count_leaves(model_7)
print "model_8 complexity:", count_leaves(model_8)
print "model_9 complexity:", count_leaves(model_9)

model_7 complexity: 19
model_8 complexity: 12
model_9 complexity: 1
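
The single leaf in model_9 can be checked with simple arithmetic. The first split in the training logs sends 9223 rows left and 28001 rows right, so the root holds 37224 rows, and early stopping condition 2 fires whenever `len(data) <= min_node_size`. A short sketch using only those logged numbers:

```python
# Rows reaching the root, taken from the first "Split on feature" line above.
root_size = 9223 + 28001

stops_at_root = {}
for min_node_size in (0, 2000, 50000):
    # Same comparison as early stopping condition 2 in decision_tree_create.
    stops_at_root[min_node_size] = root_size <= min_node_size
    print("min_node_size=%d: root becomes a leaf? %s"
          % (min_node_size, stops_at_root[min_node_size]))
```

Only the value 50000 exceeds the size of the training set, which is why model_9 never splits and, like any constant predictor on balanced data, ends up with an error near 0.5.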
