## Implementing binary decision trees

The goal of this notebook is to implement your own binary decision tree classifier. You will:
* Use SFrames (or, as in this notebook, pandas DataFrames) to do some feature engineering.
* Transform categorical variables into binary variables.
* Write a function to compute the number of misclassified examples in an intermediate node.
* Write a function to find the best feature to split on.
* Build a binary decision tree from scratch.
* Make predictions using the decision tree.
* Evaluate the accuracy of the decision tree.
* Visualize the decision at the root node.

**Important Note:** In this assignment, we will focus on building decision trees where the data contain **only binary (0 or 1) features.** This allows us to avoid dealing with:
* Multiple intermediate nodes in a split
* The thresholding issues of real-valued features.

In [1]:
import numpy as np
import pandas as pd

1. Load in the LendingClub dataset with the software of your choice.

In [2]:
loans = pd.read_csv('lending-club-data.csv', low_memory=False)

In [3]:
loans.shape

(122607, 68)

In [4]:
loans.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,sub_grade_num,delinq_2yrs_zero,pub_rec_zero,collections_12_mths_zero,short_emp,payment_inc_ratio,final_d,last_delinq_none,last_record_none,last_major_derog_none
0,1077501,1296599,5000,5000,4975,36 months,10.65,162.87,B,B2,...,0.4,1.0,1.0,1.0,0,8.1435,20141201T000000,1,1,1
1,1077430,1314167,2500,2500,2500,60 months,15.27,59.83,C,C4,...,0.8,1.0,1.0,1.0,1,2.3932,20161201T000000,1,1,1
2,1077175,1313524,2400,2400,2400,36 months,15.96,84.33,C,C5,...,1.0,1.0,1.0,1.0,0,8.25955,20141201T000000,1,1,1
3,1076863,1277178,10000,10000,10000,36 months,13.49,339.31,C,C1,...,0.2,1.0,1.0,1.0,0,8.27585,20141201T000000,0,1,1
4,1075269,1311441,5000,5000,5000,36 months,7.9,156.46,A,A4,...,0.8,1.0,1.0,1.0,0,5.21533,20141201T000000,1,1,1


2. Reassign the labels to have +1 for a safe loan, and -1 for a risky (bad) loan. You should have code analogous to

In [5]:
loans['safe_loans'] = loans['bad_loans'].apply(lambda x: 1 if x == 0  else -1)
loans.drop('bad_loans', axis = 'columns',  inplace = True)

3. We will only be considering these four features:

In [6]:
features = ['grade',              # grade of the loan
            'term',               # the term of the loan
            'home_ownership',     # home_ownership status: own, mortgage or rent
            'emp_length',         # number of years of employment
           ]
target = 'safe_loans'

In [7]:
loans_data = loans[features+[target]]
loans_data.head()

Unnamed: 0,grade,term,home_ownership,emp_length,safe_loans
0,B,36 months,RENT,10+ years,1
1,C,60 months,RENT,< 1 year,-1
2,C,36 months,RENT,10+ years,1
3,C,36 months,RENT,10+ years,1
4,A,36 months,RENT,3 years,1


### Subsample dataset to make sure classes are balanced

4. Just as we did in the previous assignment, we will undersample the larger class (safe loans) in order to balance out our dataset. This means we are throwing away many data points. You should have code analogous to

In [8]:
# safe_loans_raw = loans[loans[target] == 1]
# risky_loans_raw = loans[loans[target] == -1]

# # Since there are less risky loans than safe loans, find the ratio of the sizes
# # and use that percentage to undersample the safe loans.
# percentage = len(risky_loans_raw)/float(len(safe_loans_raw))
# safe_loans = safe_loans_raw.sample(percentage, seed = 1)
# risky_loans = risky_loans_raw
# loans_data = risky_loans.append(safe_loans)

# print "Percentage of safe loans                 :", len(safe_loans) / float(len(loans_data))
# print "Percentage of risky loans                :", len(risky_loans) / float(len(loans_data))
# print "Total number of loans in our new dataset :", len(loans_data)
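
The cell above is written against the SFrame API and is left commented out. A minimal pandas sketch of the same undersampling idea might look like the following. The helper name `undersample_majority` and the toy data are hypothetical, and this is not the code that produced the train/test index files loaded in the next section.

```python
import pandas as pd

def undersample_majority(df, target, seed=1):
    """Hypothetical helper: downsample every class to the size of the smallest one."""
    smallest = df[target].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=seed)
        for _, group in df.groupby(target)
    ]
    return pd.concat(parts)

# Toy data (an assumption for illustration): six safe loans (+1), two risky loans (-1).
toy = pd.DataFrame({'safe_loans': [1, 1, 1, 1, 1, 1, -1, -1],
                    'grade': list('ABABABCD')})
balanced = undersample_majority(toy, 'safe_loans')
print(balanced['safe_loans'].value_counts())
```

After the call, both classes contain the same number of rows, at the cost of discarding rows from the larger class.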

### Train-test split

In [9]:
import json

In [10]:
with open('module-5-assignment-2-train-idx.json') as file:
    train_idx = json.load(file)

In [11]:
with open('module-5-assignment-2-test-idx.json') as file:
    test_idx = json.load(file)

In [12]:
train = loans_data.iloc[train_idx]

In [13]:
test = loans_data.iloc[test_idx]

In [14]:
train.shape, test.shape

((37224, 5), (9284, 5))

In [15]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
total_data = pd.concat([train, test])

In [16]:
total_data

Unnamed: 0,grade,term,home_ownership,emp_length,safe_loans
1,C,60 months,RENT,< 1 year,-1
6,F,60 months,OWN,4 years,-1
7,B,60 months,RENT,< 1 year,-1
10,C,36 months,RENT,< 1 year,-1
12,B,36 months,RENT,3 years,-1
...,...,...,...,...,...
122390,C,60 months,MORTGAGE,10+ years,1
122419,C,36 months,RENT,7 years,1
122445,D,36 months,RENT,< 1 year,1
122461,E,36 months,RENT,7 years,1


In [17]:
total_data.reset_index(inplace = True)

In [18]:
total_data.drop('index', axis = 1, inplace = True)
total_data

Unnamed: 0,grade,term,home_ownership,emp_length,safe_loans
0,C,60 months,RENT,< 1 year,-1
1,F,60 months,OWN,4 years,-1
2,B,60 months,RENT,< 1 year,-1
3,C,36 months,RENT,< 1 year,-1
4,B,36 months,RENT,3 years,-1
...,...,...,...,...,...
46503,C,60 months,MORTGAGE,10+ years,1
46504,C,36 months,RENT,7 years,1
46505,D,36 months,RENT,< 1 year,1
46506,E,36 months,RENT,7 years,1


## Transform categorical data into binary features

In this assignment, we will implement **binary decision trees** (decision trees for binary features, a specific case of categorical variables taking on two values, e.g., true/false). Since all of our features are currently categorical features, we want to turn them into binary features.


In [19]:
categorical_features = list(total_data.select_dtypes('object').columns)
categorical_features

['grade', 'term', 'home_ownership', 'emp_length']

In [20]:
from sklearn.preprocessing import LabelEncoder

In [21]:
from sklearn.preprocessing import OneHotEncoder

In [22]:
encoder = OneHotEncoder()

In [23]:
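for feature in categorical_features:
    # Fit the encoder on one column and read back the category names it found
    encoder.fit(total_data[[feature]])
    name_of_new_features = encoder.categories_[0]
    # One 0/1 column per category (dense array from the sparse encoder output)
    enc_df = pd.DataFrame(encoder.transform(total_data[[feature]]).toarray())
    # Rename the integer column labels to the category names
    features_dict = {}
    for i in range(len(name_of_new_features)):
        features_dict[enc_df.columns[i]] = name_of_new_features[i]
    enc_df = enc_df.rename(columns = features_dict)
    # Append the binary columns and drop the original categorical column
    total_data = total_data.join(enc_df)
    total_data.drop(feature, axis = 1, inplace = True)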

In [24]:
# for feature in categorical_features:
#     dummy = pd.get_dummies(total_data[feature], prefix = feature)
#     total_data = total_data.join(dummy)
#     total_data.drop(feature, axis = 1, inplace = True)

In [25]:
total_data.head()

Unnamed: 0,safe_loans,A,B,C,D,E,F,G,36 months,60 months,...,2 years,3 years,4 years,5 years,6 years,7 years,8 years,9 years,< 1 year,NaN
0,-1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,-1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,-1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,-1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


5. This technique of turning categorical variables into binary variables is called one-hot encoding. Using the software of your choice, perform one-hot encoding on the four features described above. **You should now have 25 binary features.**
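
As a small illustration of one-hot encoding, the sketch below applies `pd.get_dummies` to a made-up three-row frame (the toy values are assumptions, not rows from the LendingClub data). Note that `get_dummies` prefixes each new column with the original column name, whereas the `OneHotEncoder` loop in this notebook names the columns by category only.

```python
import pandas as pd

# Toy frame with two categorical columns (values chosen for illustration only).
toy = pd.DataFrame({
    'grade': ['A', 'B', 'A'],
    'term': [' 36 months', ' 60 months', ' 36 months'],
})

# Each category becomes its own 0/1 column; exactly one column per original
# feature is 1 in every row.
toy_encoded = pd.get_dummies(toy, columns=['grade', 'term'], dtype=int)
print(toy_encoded)
```

Two categories in each of the two columns yield four binary features in total.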

In [26]:
total_data.shape

(46508, 26)

### Train-test split

In [27]:
train_data = total_data[:len(train_idx)]

In [28]:
train_data

Unnamed: 0,safe_loans,A,B,C,D,E,F,G,36 months,60 months,...,2 years,3 years,4 years,5 years,6 years,7 years,8 years,9 years,< 1 year,NaN
0,-1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,-1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,-1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,-1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37219,1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
37220,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
37221,1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
37222,1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
test_data = total_data[-len(test_idx):].reset_index().drop('index', axis = 1)

In [30]:
test_data

Unnamed: 0,safe_loans,A,B,C,D,E,F,G,36 months,60 months,...,2 years,3 years,4 years,5 years,6 years,7 years,8 years,9 years,< 1 year,NaN
0,-1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,-1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9279,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9280,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
9281,1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
9282,1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [31]:
new_features  = list(total_data.columns)
new_features.remove('safe_loans')
print(new_features)

['A', 'B', 'C', 'D', 'E', 'F', 'G', ' 36 months', ' 60 months', 'MORTGAGE', 'OTHER', 'OWN', 'RENT', '1 year', '10+ years', '2 years', '3 years', '4 years', '5 years', '6 years', '7 years', '8 years', '9 years', '< 1 year', nan]


## Decision tree implementation

In this section, we will implement binary decision trees from scratch. There are several steps involved in building a decision tree. For that reason, we have split the entire assignment into several sections.

#### Function to count number of mistakes while predicting majority class

 Now, we will write a function that calculates the number of misclassified examples when predicting the majority class. This will be used to help determine which feature is the best to split on at a given node of the tree.

**Note:** Keep in mind that in order to compute the number of mistakes for a majority classifier, we only need the label (y values) of the data points in the node.

`Steps to follow:`

* Step 1: Calculate the number of safe loans and risky loans.
* Step 2: Since we are assuming majority class prediction, all the data points that are not in the majority class are considered mistakes.
* Step 3: Return the number of mistakes.

Now, let us write the function *intermediate_node_num_mistakes* which computes the number of misclassified examples of an intermediate node given the set of labels (y values) of the data points contained in the node. Your code should be analogous to

In [32]:
def intermediate_node_num_mistakes(labels_in_node):
    # Corner case: If labels_in_node is empty, return 0
    if len(labels_in_node) == 0:
        return 0    
    # Count the number of 1's (safe loans)
    ## YOUR CODE HERE
    count_one = (labels_in_node==1).sum()
    # Count the number of -1's (risky loans)
    ## YOUR CODE HERE
    count_minus_one = (labels_in_node == -1).sum()
    # Return the number of mistakes that the majority classifier makes.
    ## YOUR CODE HERE
    return min(count_one, count_minus_one)


8. Because there are several steps in this assignment, we have introduced some stopping points where you can check your code before proceeding. To test your intermediate_node_num_mistakes function, run the following code. Proceed once every test prints Test passed!; otherwise, spend some time figuring out where things went wrong.

In [33]:
# Test case 1
example_labels = np.array([-1, -1, 1, 1, 1])
if intermediate_node_num_mistakes(example_labels) == 2:
    print('Test passed!')
else:
    print('Test 1 failed... try again!')

# Test case 2
example_labels = np.array([-1, -1, 1, 1, 1, 1, 1])
if intermediate_node_num_mistakes(example_labels) == 2:
    print('Test passed!')
else:
    print('Test 2 failed... try again!')
    
# Test case 3
example_labels = np.array([-1, -1, -1, -1, -1, 1, 1])
if intermediate_node_num_mistakes(example_labels) == 2:
    print('Test passed!')
else:
    print('Test 3 failed... try again!')

Test passed!
Test passed!
Test passed!


In [34]:
(np.array([-1, -1, 1, 1, 1, 1, 1]) ==1).sum()

5

### Function to pick best feature to split on

The function best_splitting_feature takes 3 arguments:

1. The data
2. The features to consider for splits (a list of strings of column names to consider for splits)
3. The name of the target/label column (string)

9. Follow these steps to implement best_splitting_feature:

* Step 1: Loop over each feature in the feature list
* Step 2: Within the loop, split the data into two groups: one group where all of the data has feature value 0 or False (we will call this the left split), and one group where all of the data has feature value 1 or True (we will call this the right split). Make sure the left split corresponds with 0 and the right split corresponds with 1 to ensure your implementation fits with our implementation of the tree building process.
* Step 3: Calculate the number of misclassified examples in both groups of data and compute the classification error: (number of mistakes in the left split + number of mistakes in the right split) / (number of data points).
* Step 4: If the computed error is smaller than the best error found so far, store this feature and its error.
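
The steps above can be checked by hand on a tiny dataset. The sketch below is a self-contained illustration with hypothetical names (`split_error`, `toy`, `f1`, `f2`); it is a compact restatement of the procedure, not the assignment's reference implementation.

```python
import pandas as pd

# Toy data (an assumption): f1 separates the labels perfectly, f2 does not.
toy = pd.DataFrame({
    'f1': [0, 0, 1, 1, 1, 0],
    'f2': [0, 1, 0, 1, 0, 1],
    'y':  [-1, -1, 1, 1, 1, -1],
})

def split_error(data, feature, target):
    """Classification error of a majority-vote split on one binary feature."""
    mistakes = 0
    for value in (0, 1):  # 0 is the left split, 1 is the right split
        labels = data.loc[data[feature] == value, target]
        if len(labels) > 0:
            # A majority vote misclassifies every point in the minority class.
            mistakes += int(min((labels == 1).sum(), (labels == -1).sum()))
    return mistakes / len(data)

errors = {f: split_error(toy, f, 'y') for f in ['f1', 'f2']}
best = min(errors, key=errors.get)
print(errors, best)
```

Here `f1` makes no mistakes on either side, while `f2` makes one mistake in each split (2 of 6 points), so `f1` is selected.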

In [35]:
def best_splitting_feature(data, features, target):
    
    target_values = data[target]
    best_feature = None # Keep track of the best feature 
    best_error = 10     # Keep track of the best error so far 
    # Note: Since error is always <= 1, we should initialize it with something larger than 1.

    # Convert to float to make sure error gets computed correctly.
    num_data_points = float(len(data))  
    
    # Loop through each feature to consider splitting on that feature
    for feature in features:
        
        # The left split will have all data points where the feature value is 0
        left_split = data[data[feature] == 0]
        
        # The right split will have all data points where the feature value is 1
        ## YOUR CODE HERE
        right_split =  data[data[feature]==1]
            
        # Calculate the number of misclassified examples in the left split.
        # Remember that we implemented a function for this! (It was called intermediate_node_num_mistakes)
        # YOUR CODE HERE
        left_mistakes = intermediate_node_num_mistakes(left_split[target])

        # Calculate the number of misclassified examples in the right split.
        ## YOUR CODE HERE
        right_mistakes = intermediate_node_num_mistakes(right_split[target])
            
        # Compute the classification error of this split.
        # Error = (# of mistakes (left) + # of mistakes (right)) / (# of data points)
        ## YOUR CODE HERE
        error = (left_mistakes+right_mistakes)/num_data_points

        # If this is the best error we have found so far, store the feature as best_feature and the error as best_error
        ## YOUR CODE HERE
        if error < best_error:
            best_feature = feature
            best_error = error
    
    return best_feature # Return the best feature we found

In [36]:
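# To test your best_splitting_feature function, run the following code.
# Note: the expected string uses the assignment's 'term_' prefix, but the one-hot
# columns built in this notebook have no prefix (the column is ' 36 months'),
# so this string comparison cannot succeed as written.

if best_splitting_feature(train_data, new_features, 'safe_loans') == 'term_ 36 months':
    print('Test passed!')
else:
    print('Test failed... try again!')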

Test failed... try again!


### Building the tree

 Each node in the decision tree is represented as a dictionary which contains the following keys and possible values:

In [37]:
# { 
#    'is_leaf'            : True/False.
#    'prediction'         : Prediction at the leaf node.
#    'left'               : (dictionary corresponding to the left tree).
#    'right'              : (dictionary corresponding to the right tree).
#    'splitting_feature'  : The feature that this node splits on
# }
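
To make the node format concrete, here is a hand-built two-level tree using the keys listed above. The helper `make_leaf`, the feature names `x1` and `x2`, and the function `count_leaves` are hypothetical and only serve to show how the nested dictionaries fit together.

```python
def make_leaf(prediction):
    """Hypothetical helper: a leaf node in the dictionary format above."""
    return {'is_leaf': True, 'prediction': prediction,
            'left': None, 'right': None, 'splitting_feature': None}

# Root splits on x1; its right child splits again on x2.
toy_tree = {
    'is_leaf': False,
    'prediction': None,
    'splitting_feature': 'x1',
    'left': make_leaf(-1),
    'right': {
        'is_leaf': False,
        'prediction': None,
        'splitting_feature': 'x2',
        'left': make_leaf(-1),
        'right': make_leaf(1),
    },
}

def count_leaves(tree):
    """Recursively count the leaves below (and including) this node."""
    if tree['is_leaf']:
        return 1
    return count_leaves(tree['left']) + count_leaves(tree['right'])

print(count_leaves(toy_tree))
```

Because every internal node stores its children under `'left'` and `'right'`, any traversal can be written as a short recursive function of this shape.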


10. First, we will write a function that creates a leaf node given a set of target values. Your code should be analogous to

In [38]:
def create_leaf(target_values):    
    # Create a leaf node
    leaf = {'splitting_feature' : None,
            'left' : None,
            'right' : None,
            'is_leaf': True   }   ## YOUR CODE HERE 
   
    # Count the number of data points that are +1 and -1 in this node.
    num_ones = len(target_values[target_values == +1])
    num_minus_ones = len(target_values[target_values == -1])    

    # For the leaf node, set the prediction to be the majority class.
    # Store the predicted class (1 or -1) in leaf['prediction']
    if num_ones > num_minus_ones:
        leaf['prediction'] = 1         ## YOUR CODE HERE
    else:
        leaf['prediction'] = -1        ## YOUR CODE HERE        

    # Return the leaf node
    return leaf 

We have provided a function that learns the decision tree recursively and implements 3 stopping conditions:

1. **Stopping condition 1:** All data points in a node are from the same class.
2. **Stopping condition 2:** No more features to split on.
3. **Additional stopping condition:** In addition to the above two stopping conditions covered in lecture, in this assignment we will also consider a stopping condition based on the max_depth of the tree. By not letting the tree grow too deep, we will save computational effort in the learning process.

11. Now, we will provide a Python skeleton of the learning algorithm. Note that this code is not complete; it needs to be completed by you if you are using Python. Otherwise, your code should be analogous to

In [39]:
def decision_tree_create(data, features, target, current_depth = 0, max_depth = 10):
    remaining_features = features[:] # Make a copy of the features.
    
    target_values = data[target]
    print("--------------------------------------------------------------------")
    print("Subtree, depth = %s (%s data points)." % (current_depth, len(target_values)))
    

    # Stopping condition 1
    # (Check if there are mistakes at current node.
    # Recall you wrote a function intermediate_node_num_mistakes to compute this.)
    if intermediate_node_num_mistakes(target_values) == 0:  ## YOUR CODE HERE
        print("Stopping condition 1 reached.")     
        # If there are no mistakes at the current node, make it a leaf node
        return create_leaf(target_values)
    
    # Stopping condition 2 (check if there are remaining features to consider splitting on)
    if remaining_features == []:   ## YOUR CODE HERE
        print("Stopping condition 2 reached.")    
        # If there are no remaining features to consider, make current node a leaf node
        return create_leaf(target_values)    
    
    # Additional stopping condition (limit tree depth)
    if current_depth >= max_depth :  ## YOUR CODE HERE
        print("Reached maximum depth. Stopping for now.")
        # If the max tree depth has been reached, make current node a leaf node
        return create_leaf(target_values)

    # Find the best splitting feature (recall the function best_splitting_feature implemented above)
    ## YOUR CODE HERE
    splitting_feature = best_splitting_feature(data, remaining_features, target)
    
    # Split on the best feature that we found. 
    left_split = data[data[splitting_feature] == 0]
    right_split = data[data[splitting_feature] == 1]      ## YOUR CODE HERE
    remaining_features.remove(splitting_feature)
    print("Split on feature %s. (%s, %s)" % (\
                      splitting_feature, len(left_split), len(right_split)))
    
    # Create a leaf node if the split is "perfect"
    if len(left_split) == len(data):
        print("Creating leaf node.")
        return create_leaf(left_split[target])
    if len(right_split) == len(data):
        print("Creating leaf node.")
        ## YOUR CODE HERE
        return create_leaf(right_split[target])

        
    # Repeat (recurse) on left and right subtrees
    left_tree = decision_tree_create(left_split, remaining_features, target, current_depth + 1, max_depth)        
    ## YOUR CODE HERE
    right_tree = decision_tree_create(right_split, remaining_features, target, current_depth+1, max_depth)

    return {'is_leaf'          : False, 
            'prediction'       : None,
            'splitting_feature': splitting_feature,
            'left'             : left_tree, 
            'right'            : right_tree}

### Build the tree!

12. Train a tree model on the train_data. Limit the depth to 6 (max_depth = 6) to make sure the algorithm doesn't run for too long. Call this tree my_decision_tree (in this notebook it is stored as decision_tree_6). Warning: The tree may take 1-2 minutes to learn.

In [40]:
decision_tree_6 = decision_tree_create(train_data, new_features, 'safe_loans', max_depth = 6)

--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).
Split on feature  36 months. (9223, 28001)
--------------------------------------------------------------------
Subtree, depth = 1 (9223 data points).
Split on feature A. (9122, 101)
--------------------------------------------------------------------
Subtree, depth = 2 (9122 data points).
Split on feature B. (8074, 1048)
--------------------------------------------------------------------
Subtree, depth = 3 (8074 data points).
Split on feature C. (5884, 2190)
--------------------------------------------------------------------
Subtree, depth = 4 (5884 data points).
Split on feature D. (3826, 2058)
--------------------------------------------------------------------
Subtree, depth = 5 (3826 data points).
Split on feature E. (1693, 2133)
--------------------------------------------------------------------
Subtree, depth = 6 (1693 data points).
Reached maximum depth. Stopping for 

In [41]:
decision_tree_6['left']['right']['right']

{'splitting_feature': None,
 'left': None,
 'right': None,
 'is_leaf': True,
 'prediction': -1}

### Making predictions with a decision tree

13. As discussed in the lecture, we can make predictions from the decision tree with a simple recursive function. Write a function called classify, which takes in a learned tree and a test point x to classify. Include an option annotate that describes the prediction path when set to True. Your code should be analogous to

In [42]:
def classify(tree, x, annotate = False):   
    # if the node is a leaf node.
    if tree['is_leaf']:
        if annotate: 
            print( "At leaf, predicting %s" % tree['prediction'])
        return tree['prediction'] 
    else:
        # split on feature.
        split_feature_value = x[tree['splitting_feature']]
        if annotate: 
            print ("Split on %s = %s" % (tree['splitting_feature'], split_feature_value))
        if split_feature_value == 0:
            return classify(tree['left'], x, annotate)
        else:
            return classify(tree['right'], x, annotate)

14. Now, let's consider the first example of the test set and see what my_decision_tree model predicts for this data point.

In [43]:
print(test_data.iloc[0])
print('Predicted class: %s ' % classify(decision_tree_6, test_data.iloc[0]))

safe_loans   -1.0
A             0.0
B             0.0
C             0.0
D             1.0
E             0.0
F             0.0
G             0.0
 36 months    0.0
 60 months    1.0
MORTGAGE      0.0
OTHER         0.0
OWN           0.0
RENT          1.0
1 year        0.0
10+ years     0.0
2 years       1.0
3 years       0.0
4 years       0.0
5 years       0.0
6 years       0.0
7 years       0.0
8 years       0.0
9 years       0.0
< 1 year      0.0
NaN           0.0
Name: 0, dtype: float64
Predicted class: -1 


15. Let's add some annotations to our prediction to see the prediction path that led to this predicted class:

In [44]:
classify(decision_tree_6, test_data.iloc[0], annotate=True)

Split on  36 months = 0.0
Split on A = 0.0
Split on B = 0.0
Split on C = 0.0
Split on D = 1.0
At leaf, predicting -1


-1

**Quiz question:** What was the feature that my_decision_tree first split on while making the prediction for test_data[0]?

**Ans:** term_ 36 months

**Quiz question:** What was the first feature that led to a right split of test_data[0]?

**Ans:** grade_D

**Quiz question:** What was the last feature split on before reaching a leaf node for test_data[0]?

**Ans:** grade_D

## Evaluating your decision tree

16. Now, we will write a function to evaluate a decision tree by computing the classification error of the tree on the given dataset. Write a function called evaluate_classification_error that takes in as input:

    1. tree (as described above)
    2. data (a data frame of data points)
    3. target (the name of the label column, as used in the call below)
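
The classification error is the fraction of data points whose prediction differs from the true label. The sketch below shows that arithmetic on made-up arrays; the function name `classification_error` and the toy values are assumptions for illustration, and it takes predictions directly rather than a tree.

```python
import numpy as np

def classification_error(predictions, labels):
    """Fraction of predictions that disagree with the true labels."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    return float(np.mean(predictions != labels))

# Toy example: two of the four predictions are wrong.
toy_predictions = [1, -1, 1, 1]
toy_labels = [1, 1, 1, -1]
toy_error = classification_error(toy_predictions, toy_labels)
print(toy_error)
```

Accuracy is simply one minus this value.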

In [45]:
def evaluate_classification_error(tree, data, target):
    # Apply the classify(tree, x) to each row in your data
    prediction = data.apply(lambda x: classify(tree, x), axis = 1)
    
    # Once you've made the predictions, calculate the classification error and return it
    ## YOUR CODE HERE
    error = (prediction != data[target]).sum()
    error = error/len(data)
    return round(error, 2)

In [46]:
evaluate_classification_error(decision_tree_6, test_data, "safe_loans")

0.38

**Quiz Question:** Rounded to 2nd decimal point, what is the classification error of my_decision_tree on the test_data?

**Ans:** 0.38

In [47]:
test_data.apply(lambda x: classify(decision_tree_6, x), axis = 1)

0      -1
1       1
2      -1
3      -1
4       1
       ..
9279   -1
9280    1
9281   -1
9282   -1
9283    1
Length: 9284, dtype: int64

## Printing out a decision stump

18. As discussed in the lecture, we can print out a single decision stump (printing out the entire tree is left as an exercise to the curious reader). Here we provide Python code to visualize a decision stump. If you are using different software, make sure your code is analogous to:

In [48]:
def print_stump(tree, name = 'root'):
    split_name = tree['splitting_feature'] # split_name is something like 'term. 36 months'
    if split_name is None:
        print( "(leaf, label: %s)" % tree['prediction'])
        return None
#     split_feature, split_value = split_name.split('.')
    print ('                       %s' % name)
    print('         |---------------|----------------|')
    print('         |                                |')
    print('         |                                |')
    print('         |                                |')
    print('  [{0} == 0]               [{0} == 1]    '.format(split_name))
    print('         |                                |')
    print('         |                                |')
    print('         |                                |')
    print('    (%s)                         (%s)' \
        % (('leaf, label: ' + str(tree['left']['prediction']) if tree['left']['is_leaf'] else 'subtree'),
           ('leaf, label: ' + str(tree['right']['prediction']) if tree['right']['is_leaf'] else 'subtree')))

### Exploring the intermediate left subtree

The tree is a recursive dictionary, so we do have access to all the nodes! We can use

* my_decision_tree['left'] to go left
* my_decision_tree['right'] to go right
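
Since every node exposes its children through the same two keys, the whole tree can be rendered with a short recursive walk. The following is one possible sketch; the function `tree_lines` and the toy tree are hypothetical, and the indentation-based layout is a design choice rather than the assignment's format.

```python
def tree_lines(tree, depth=0):
    """Return an indented text rendering of a dictionary-based decision tree."""
    pad = '    ' * depth
    if tree['is_leaf']:
        return ['%s(leaf, label: %s)' % (pad, tree['prediction'])]
    name = tree['splitting_feature']
    lines = ['%s[%s == 0]' % (pad, name)]
    lines += tree_lines(tree['left'], depth + 1)    # left subtree: feature == 0
    lines.append('%s[%s == 1]' % (pad, name))
    lines += tree_lines(tree['right'], depth + 1)   # right subtree: feature == 1
    return lines

def leaf(prediction):
    """Hypothetical helper: build a leaf node."""
    return {'is_leaf': True, 'prediction': prediction,
            'left': None, 'right': None, 'splitting_feature': None}

# Toy tree (an assumption): split on x, then on y in the right branch.
toy_tree = {'is_leaf': False, 'prediction': None, 'splitting_feature': 'x',
            'left': leaf(-1),
            'right': {'is_leaf': False, 'prediction': None,
                      'splitting_feature': 'y',
                      'left': leaf(-1), 'right': leaf(1)}}

rendered = tree_lines(toy_tree)
print('\n'.join(rendered))
```

Each level of indentation corresponds to one step deeper in the tree, so the left-most and right-most branches can be read off the first and last lines of each block.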

In [49]:
print_stump(decision_tree_6)

                       root
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [ 36 months == 0]               [ 36 months == 1]    
         |                                |
         |                                |
         |                                |
    (subtree)                         (subtree)


**Quiz Question:** What is the feature that is used for the split at the root node?

**Ans:** term_ 36 months

20. We can print out the left subtree by running the code

In [50]:
print_stump(decision_tree_6['left'], decision_tree_6['splitting_feature'])

                        36 months
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [A == 0]               [A == 1]    
         |                                |
         |                                |
         |                                |
    (subtree)                         (subtree)


We can similarly print out the left subtree of the left subtree of the root by running the code

In [51]:
print_stump(decision_tree_6['left']['left'], decision_tree_6['left']['splitting_feature'])

                       A
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [B == 0]               [B == 1]    
         |                                |
         |                                |
         |                                |
    (subtree)                         (subtree)


In [52]:
print_stump(decision_tree_6['left']['left']['left'], decision_tree_6['left']['left']['splitting_feature'])

                       B
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [C == 0]               [C == 1]    
         |                                |
         |                                |
         |                                |
    (subtree)                         (leaf, label: -1)


**Quiz question:** What is the path of the first 3 feature splits considered along the *left-most* branch of my_decision_tree?

**Ans:**  term. 36 months, grade.A, grade.B

In [53]:
print_stump(decision_tree_6['right'], decision_tree_6['splitting_feature'])

                        36 months
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [D == 0]               [D == 1]    
         |                                |
         |                                |
         |                                |
    (subtree)                         (leaf, label: -1)


In [54]:
print_stump(decision_tree_6['right']['right'], decision_tree_6['right']['splitting_feature'])

(leaf, label: -1)


**Quiz question:** What is the path of the first 3 feature splits considered along the *right-most* branch of my_decision_tree?

**Ans:**  term. 36 months, grade.D, no third feature because second split resulted in leaf