In [42]:
import pandas as pd
import numpy as np

loans = pd.read_csv('lending-club-data.csv')
loans.head()

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,sub_grade_num,delinq_2yrs_zero,pub_rec_zero,collections_12_mths_zero,short_emp,payment_inc_ratio,final_d,last_delinq_none,last_record_none,last_major_derog_none
0,1077501,1296599,5000,5000,4975,36 months,10.65,162.87,B,B2,...,0.4,1.0,1.0,1.0,0,8.1435,20141201T000000,1,1,1
1,1077430,1314167,2500,2500,2500,60 months,15.27,59.83,C,C4,...,0.8,1.0,1.0,1.0,1,2.3932,20161201T000000,1,1,1
2,1077175,1313524,2400,2400,2400,36 months,15.96,84.33,C,C5,...,1.0,1.0,1.0,1.0,0,8.25955,20141201T000000,1,1,1
3,1076863,1277178,10000,10000,10000,36 months,13.49,339.31,C,C1,...,0.2,1.0,1.0,1.0,0,8.27585,20141201T000000,0,1,1
4,1075269,1311441,5000,5000,5000,36 months,7.9,156.46,A,A4,...,0.8,1.0,1.0,1.0,0,5.21533,20141201T000000,1,1,1


### Getting the data ready

We will be using a dataset from the LendingClub.

Load the dataset into a data frame named loans.

Extracting the target and the feature columns

We will now repeat some of the feature processing steps that we saw in the previous assignment:

First, we re-assign the target to have +1 as a safe (good) loan, and -1 as a risky (bad) loan.

Next, we select four categorical features:

  *  grade of the loan
  *  the length of the loan term
  *  the home ownership status: own, mortgage, rent
  *  number of years of employment.

Your code should be analogous to the following:

In [43]:
features = ['grade',              # grade of the loan
            'term',               # the term of the loan
            'home_ownership',     # home ownership status: own, mortgage or rent
            'emp_length',         # number of years of employment
           ]
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
loans.drop('bad_loans',axis = 1)
target = 'safe_loans'
loans = loans[features + [target]]

### Transform categorical data into binary features

Just like the previous assignment, we will implement binary decision trees. Since all of our features are currently categorical features, we want to turn them into binary features. Here is a reminder of what one-hot encoding is.

In [44]:
categorical_variables = []
for feat_name, feat_type in zip(loans.columns.values, loans.dtypes):
    if feat_type == object:
        categorical_variables.append(feat_name)
categorical_variables

['grade', 'term', 'home_ownership', 'emp_length']

In [45]:
for feature in categorical_variables:
    data_val = list(loans[feature].unique())
    data_dict = {val:idx for val, idx in zip(data_val, range(len(data_val)))}
    #loans_data_new[feature] = loans_data[feature].apply(lambda x: data_dict[x])
    loans[feature] = loans[feature].apply(lambda x: data_dict[x])
loans.head()

Unnamed: 0,grade,term,home_ownership,emp_length,safe_loans
0,0,0,0,0,1
1,1,1,0,1,-1
2,1,0,0,0,1
3,1,0,0,0,1
4,2,0,0,2,1


Then follow the following steps:

 *   Apply one-hot encoding to loans. Your tool may have a function for one-hot encoding. Alternatively, see #7 for implementation hints.
 *   Load the JSON files into the lists train_idx and test_idx.
 *   Perform train/validation split using train_idx and test_idx. 

In [46]:
import json

train_idx = json.loads(open('module-8-assignment-2-train-idx.json').read())
test_idx = json.loads(open('module-8-assignment-2-test-idx.json').read())

train_data = loans.iloc[train_idx]
test_data = loans.iloc[test_idx]
train_data.head()

Unnamed: 0,grade,term,home_ownership,emp_length,safe_loans
1,1,1,0,1,-1
6,4,1,1,4,-1
7,0,1,0,1,-1
10,1,0,0,1,-1
12,0,0,0,2,-1


### Weighted decision trees

Let's modify our decision tree code from Module 5 to support weighting of individual data points.

Weighted error definition

Consider a model with N data points with:

 *   Predictions y^1,…,y^n
 *   Target y1,…,yn
 *   Data point weights α1,…,αn

Then the weighted error is defined by:

E(α,y^)=∑i=1nαi×1[yi≠yi^]∑i=1nαi

where 1[yi≠y^i] is an indicator function that is set to 1 if yi≠y^i.

Write a function to compute weight of mistakes

Write a function that calculates the weight of mistakes for making the "weighted-majority" predictions for a dataset. The function accepts two inputs:

  *  labels_in_node: y1,…,yn
  *  data_weights: Data point weights α1,…,αn

We are interested in computing the (total) weight of mistakes, i.e.

WM(α,y^)=∑i=1nαi×1[yi≠yi^]

This quantity is analogous to the number of mistakes, except that each mistake now carries different weight. It is related to the weighted error in the following way:

E(α,y^)=WM(α,y^)∑i=1nαi

The function intermediate_node_weighted_mistakes should first compute two weights:

  *  WM−1: weight of mistakes when all predictions are y^i=−1 i.e. WM(α,−1)
  *  WM+1: weight of mistakes when all predictions are y^i=+1 i.e. WM(α,+1)

where −1 and +1 are vectors where all values are -1 and +1 respectively.

After computing WM−1 and WM+1, the function intermediate_node_weighted_mistakes should return the lower of the two weights of mistakes, along with the class associated with that weight. The function should be analogous to the following Python function:

In [47]:
def intermediate_node_weighted_mistakes(labels_in_node, data_weights):
    # Sum the weights of all entries with label +1
    total_weight_positive = sum(data_weights[labels_in_node == +1])
    
    # Weight of mistakes for predicting all -1's is equal to the sum above
    weighted_mistakes_all_negative = total_weight_positive
    
    # Sum the weights of all entries with label -1
    total_weight_negative = sum(data_weights[labels_in_node == -1])
    
    # Weight of mistakes for predicting all +1's is equal to the sum above
    
    weighted_mistakes_all_positive = total_weight_negative
    
    # Return the tuple (weight, class_label) representing the lower of the two weights
    #    class_label should be an integer of value +1 or -1.
    # If the two weights are identical, return (weighted_mistakes_all_positive,+1)
    
    if weighted_mistakes_all_positive <= weighted_mistakes_all_negative:        
        return (weighted_mistakes_all_positive, +1)
    else:        
        return (weighted_mistakes_all_negative, -1)

Recall that the classification error is defined as follows:

classification error=# mistakes / # all data points

# Question 1
If we set the weights α=1 for all data points, how is the weight of mistakes WM(α,ŷ) related to the classification error? 

* WM(α,ŷ) = N * [classification error] 

### Function to pick best feature to split on

We continue modifying our decision tree code from the earlier assignment to incorporate weighting of individual data points. The next step is to pick the best feature to split on.

The best_splitting_feature function is similar to the one from the earlier assignment with two minor modifications:

  *  The function best_splitting_feature should now accept an extra parameter data_weights to take account of weights of data points.
  *  Instead of computing the number of mistakes in the left and right side of the split, we compute the weight of mistakes for both sides, add up the two weights, and divide it by the total weight of the data.

Your function should be analogous to the following Python function:

In [48]:
# If the data is identical in each feature, this function should return None

def best_splitting_feature(data, features, target, data_weights):
    
    # These variables will keep track of the best feature and the corresponding error
    best_feature = None
    best_error = float('+inf') 
    num_points = float(len(data))

    # Loop through each feature to consider splitting on that feature
    for feature in features:
        
        # The left split will have all data points where the feature value is 0
        # The right split will have all data points where the feature value is 1
        left_split = data[data[feature] == 0]
        right_split = data[data[feature] == 1]
        
        # Apply the same filtering to data_weights to create left_data_weights, right_data_weights
        
        left_data_weights = data_weights[data[feature] == 0]
        right_data_weights = data_weights[data[feature] == 1]
                    
        # DIFFERENT HERE
        # Calculate the weight of mistakes for left and right sides
    
        left_weighted_mistakes, left_class = intermediate_node_weighted_mistakes(left_split[target], left_data_weights)
        right_weighted_mistakes, right_class = intermediate_node_weighted_mistakes(right_split[target], right_data_weights)
        
        # DIFFERENT HERE
        # Compute weighted error by computing
        #  ( [weight of mistakes (left)] + [weight of mistakes (right)] ) / [total weight of all data points]
        ## YOUR CODE HERE
        error = (left_weighted_mistakes + right_weighted_mistakes)/(sum(left_data_weights) + sum(right_data_weights))
        
        # If this is the best error we have found so far, store the feature and the error
        if error < best_error:
            best_feature = feature
            best_error = error
    
    # Return the best feature we found
    return best_feature

### Building the tree

With the above functions implemented correctly, we are now ready to build our decision tree. Recall from the previous assignments that each node in the decision tree is represented as a dictionary which contains the following keys:

{ 

   'is_leaf'            : True/False.
   
   'prediction'         : Prediction at the leaf node.
   
   'left'               : (dictionary corresponding to the left tree).
   
   'right'              : (dictionary corresponding to the right tree).
   
   'features_remaining' : List of features that are posible splits.
   
}

Let us start with a function that creates a leaf node given a set of target values. The create_leaf function should be analogous to the following cell:

In [49]:
def create_leaf(target_values, data_weights):
    
    # Create a leaf node
    leaf = {'splitting_feature' : None,
            'is_leaf': True}
    
    # Computed weight of mistakes.
    # Store the predicted class (1 or -1) in leaf['prediction']
    weighted_error, best_class = intermediate_node_weighted_mistakes(target_values, data_weights)
    leaf['prediction'] = best_class
    
    return leaf

Now write a function that learns a weighted decision tree recursively and implements 3 stopping conditions:

  *  All data points in a node are from the same class.
  *  No more features to split on.
  *  Stop growing the tree when the tree depth reaches max_depth.

Since there are many steps involved, we provide you with a Python skeleton, along with explanatory comments.

In [50]:
def weighted_decision_tree_create(data, features, target, data_weights, current_depth = 1, max_depth = 10):
    remaining_features = features[:] # Make a copy of the features.
    target_values = data[target]
    print("--------------------------------------------------------------------")
    print("Subtree, depth = %s (%s data points)." % (current_depth, len(target_values)))
    
    # Stopping condition 1. Error is 0.
    if intermediate_node_weighted_mistakes(target_values, data_weights)[0] <= 1e-15:
        print("Stopping condition 1 reached.")                
        return create_leaf(target_values, data_weights)
    
    # Stopping condition 2. No more features.
    if remaining_features == []:
        print("Stopping condition 2 reached.")                
        return create_leaf(target_values, data_weights)    
    
    # Additional stopping condition (limit tree depth)
    if current_depth > max_depth:
        print("Reached maximum depth. Stopping for now.")
        return create_leaf(target_values, data_weights)
    
    # If all the datapoints are the same, splitting_feature will be None. Create a leaf
    splitting_feature = best_splitting_feature(data, features, target, data_weights)
    remaining_features.remove(splitting_feature)
        
    left_split = data[data[splitting_feature] == 0]
    right_split = data[data[splitting_feature] == 1]
    
    left_data_weights = data_weights[data[splitting_feature] == 0]
    right_data_weights = data_weights[data[splitting_feature] == 1]
    
    print("Split on feature %s. (%s, %s)" % (\
              splitting_feature, len(left_split), len(right_split)))
    
    # Create a leaf node if the split is "perfect"
    if len(left_split) == len(data):
        print("Creating leaf node.")
        return create_leaf(left_split[target], data_weights)
    if len(right_split) == len(data):
        print("Creating leaf node.")
        return create_leaf(right_split[target], data_weights)
    
    # Repeat (recurse) on left and right subtrees
    left_tree = weighted_decision_tree_create(
        left_split, remaining_features, target, left_data_weights, current_depth + 1, max_depth)
    right_tree = weighted_decision_tree_create(
        right_split, remaining_features, target, right_data_weights, current_depth + 1, max_depth)
    
    return {'is_leaf'          : False, 
            'prediction'       : None,
            'splitting_feature': splitting_feature,
            'left'             : left_tree, 
            'right'            : right_tree}

Finally, write a recursive function to count the nodes in your tree. The function should be analogous to

In [51]:
def count_nodes(tree):
    if tree['is_leaf']:
        return 1
    return 1 + count_nodes(tree['left']) + count_nodes(tree['right'])

### Making predictions with a weighted decision tree

To make a single prediction, we must start at the root and traverse down the decision tree in recursive fashion. Write a function classify that makes a single prediction. It should be analogous to the following:



In [52]:
def classify(tree, x, annotate = False):   
    # If the node is a leaf node.
    if tree['is_leaf']:
        if annotate: 
            print("At leaf, predicting %s" % tree['prediction'])
        return tree['prediction'] 
    else:
        # Split on feature.
        split_feature_value = x[tree['splitting_feature']]
        if annotate: 
            print("Split on %s = %s" % (tree['splitting_feature'], split_feature_value))
        if split_feature_value == 0:
            return classify(tree['left'], x, annotate)
        else:
            return classify(tree['right'], x, annotate)

### Evaluating the tree

Create a function called evaluate_classification_error. It takes in as input:

  *  tree (as described above)
  *  data (an data frame)

The function does not change because of adding data point weights. It is analogous to this Python function:

In [53]:
def evaluate_classification_error(tree, data):
    # Apply the classify(tree, x) to each row in your data
    prediction = data.apply(lambda x: classify(tree, x))
    
    # Once you've made the predictions, calculate the classification error
    return (prediction != data[target]).sum() / float(len(data))

To build intuition on how weighted data points affect the tree being built, consider the following:

Suppose we only care about making good predictions for the first 10 and last 10 items in train_data, we assign weights:

   * 1 to the last 10 items
   * 1 to the first 10 items
   * and 0 to the rest.

Let us fit a weighted decision tree with max_depth = 2. Then compute the classification error on the subset_20, i.e. the subset of data points whose weight is 1 (namely the first and last 10 data points). 

In [54]:
# Assign weights
example_data_weights = np.array([1.] * 10 + [0.]*(len(train_data) - 20) + [1.] * 10)
# Train a weighted decision tree model.
small_data_decision_tree_subset_20 = weighted_decision_tree_create(train_data, features, target,
                         example_data_weights, max_depth=2)

--------------------------------------------------------------------
Subtree, depth = 1 (37224 data points).
Split on feature home_ownership. (16710, 3075)
--------------------------------------------------------------------
Subtree, depth = 2 (16710 data points).
Split on feature grade. (4553, 4310)
--------------------------------------------------------------------
Subtree, depth = 3 (4553 data points).
Stopping condition 1 reached.
--------------------------------------------------------------------
Subtree, depth = 3 (4310 data points).
Stopping condition 1 reached.
--------------------------------------------------------------------
Subtree, depth = 2 (3075 data points).
Split on feature grade. (806, 784)
--------------------------------------------------------------------
Subtree, depth = 3 (806 data points).
Stopping condition 1 reached.
--------------------------------------------------------------------
Subtree, depth = 3 (784 data points).
Stopping condition 1 reached.




Now, we will compute the classification error on the subset_20, i.e. the subset of data points whose weight is 1 (namely the first and last 10 data points).

In [55]:
evaluate_classification_error(small_data_decision_tree_subset_20, train_data)

KeyError: ('home_ownership', 'occurred at index grade')