<a href="https://colab.research.google.com/github/yimingm/MSSP608-Practical-Machine-Learning/blob/master/Homework1_DecisionTree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to homework 1
On Canvas (or in this notebook's files) you’ll find a new file, **lendingclub.csv**. Each row of this file represents a single user account on **LendingClub.com**. The site has two types of users: **borrowers**, who apply for new loans, and **investors**, who lend money for fixed periods of time. Each row in our dataset represents a single borrower at the time they apply for their first peer-to-peer loan, and each row contains nine columns:

- Amount requested for their first loan
- Year the loan was requested (this dataset covers only a five-year period, 2008-2012)
- Title of the loan application (written by the borrower)
- FICO score (credit rating) of the borrower
- “Debt-to-Income”: A ratio of the borrower’s total monthly debt payments, excluding home
mortgage and the requested loan, to the borrower’s self-reported monthly income.
- ZIP code of the borrower (the last two digits of each ZIP are masked for anonymity)
- U.S. state that the borrower resides in.
- Length of time that the borrower has been employed at their current job, from 0 to 10+ years.
- A binary outcome variable for whether the user’s loan application was accepted by investors.

Our goal with this project will be to automatically predict whether a borrower will be approved for a loan from the investor members of the website, based only on the data provided above. For each of the following questions, add a series of code and markdown cells to develop an easily readable report responding to the question. 

In [None]:
import math
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from scipy import stats
from matplotlib import dates
from datetime import datetime

from sklearn.metrics import accuracy_score, precision_score, recall_score, cohen_kappa_score, confusion_matrix, mutual_info_score
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split

# Download the Data 
Keep this if you are working in Google Colab. Delete this if you are working on your own computer and have the data downloaded already. 

In [None]:
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=0B5qTk6DHjanhOV9LRE5DY3l1T2pGemVBNTVQVzVsMlFCcHF3' -O lendingclub.csv


# Question 1:
- Why is this data about users valuable to LendingClub?
1. First, this user data is useful for targeted advertising. For example, using each application's date and location, LendingClub can adjust how much it advertises by season (peak vs. off-peak) and by region.
2. Beyond attracting more LendingClub users, the dataset is also useful to LendingClub investors. The loan title tells investors what the money will be used for, and investors can use the loan amount and period to maximize their returns.


- Name at least two different ways this automated prediction could be used either for in-app product changes, or business decision-making.

1. Automated prediction can speed up the loan application process, so investors' money can be allocated and reinvested more efficiently, with less time spent on manual review.
2. Because automated prediction applies the same rules to every application, the results are highly consistent, which is fair to borrowers. It also avoids human error.
3. As the business expands, the company needs to attract more investors. Loan purposes with high acceptance rates may attract investors interested in those purposes.



# Question 2: 


- Class distribution of the outcome labels.

  The outcome has two labels: accept and reject. “accept” means the loan was approved and “reject” means it was not; there are no other options. The dataset contains 71858 rejections and 9245 acceptances.

- List of features you used from the data, including their name and data type (numeric or
nominal). For each feature, provide min, mean, and max values (if the feature is numeric) or
list all possible labels (if the feature is nominal).
![alt text](https://drive.google.com/uc?id=1AdFLvxXvOw-qugfv8ydusaHSa6K4PvJJ)

- List of hyperparameter settings for the decision tree.

  Most of the decision tree's parameters are left at their defaults. The exceptions are criterion, where I used entropy instead of gini, and random_state, which I set to 123.

  DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=123, splitter='best')


- Performance of the trained classifier
  
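  The 95% accuracy is quite satisfying for a single decision tree built without tuning, although it should be read against the 88.6% accuracy that always predicting “reject” would achieve (71858 of 81103 rows). The kappa of 0.76 is considered good.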

  Accuracy: 95.0 

  Kappa: 0.760 

  Precision: 0.779
  
  ![alt text](https://drive.google.com/uc?id=1irieTWWNGuACoV6_Q5UPK9Tj_KV2t6OO)


**Step 1: Open Dataset**

In [None]:
lendingclub = pd.read_csv("lendingclub.csv") 
lendingclub.head()

Unnamed: 0,amount,date,title,fico,dti,zip,state,emp_length,policy_code,year,outcome
0,2500.0,Dec-2011,bike,740.0,1.0,309xx,GA,< 1 year,1.0,2011,accept
1,12000.0,Dec-2011,Consolidation,675.0,10.78,913xx,CA,10+ years,1.0,2011,accept
2,21000.0,Dec-2011,Debt Cleanup,705.0,13.22,335xx,FL,10+ years,1.0,2011,accept
3,31825.0,Dec-2011,Debt Consolidation Loan,760.0,14.03,080xx,NJ,5 years,1.0,2011,accept
4,12000.0,Dec-2011,Debt Consolidation,725.0,16.7,088xx,NJ,10+ years,1.0,2011,accept


**Step 2: Descriptive Analysis - Nominal Features**

I chose amount, fico, and emp_length as the features for the decision tree. The following code shows the outcome labels and the possible values of emp_length.

In [None]:
print(lendingclub["outcome"].value_counts())
print(lendingclub.emp_length.unique())

reject    71858
accept     9245
Name: outcome, dtype: int64
['< 1 year' '10+ years' '5 years' '9 years' '6 years' '2 years' '3 years'
 '7 years' '8 years' '4 years' '1 year']
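
The class counts above also give a baseline worth keeping in mind when reading accuracy figures. A minimal sketch, using only the two counts printed by `value_counts()`, of the accuracy a model would reach by always predicting the majority class:

```python
# Counts copied from the value_counts() output above
n_reject = 71858
n_accept = 9245

total = n_reject + n_accept
# Accuracy of a trivial model that predicts "reject" for every row
majority_baseline = n_reject / total

print(f"Total rows: {total}")
print(f"Majority-class baseline accuracy: {100 * majority_baseline:.1f}%")
```

Any accuracy reported later is only informative to the extent that it exceeds this baseline.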


**Descriptive Analysis - Numeric Features**

In [None]:
print(lendingclub["fico"].describe())
print(lendingclub["amount"].describe())

count    81103.000000
mean       603.010961
std        173.772811
min          0.000000
25%        585.000000
50%        653.000000
75%        691.000000
max        850.000000
Name: fico, dtype: float64
count     81103.000000
mean      12959.437855
std       10315.880464
min         500.000000
25%        5000.000000
50%       10000.000000
75%       20000.000000
max      500000.000000
Name: amount, dtype: float64
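
The `fico` summary above has a minimum of 0 while the 25th percentile is 585, which suggests that some rows may use 0 as a placeholder rather than a real score (FICO scores normally range from 300 to 850). This is an assumption about the data, not something the notebook confirms. A minimal sketch of how one might count such rows, using a small made-up Series in place of `lendingclub["fico"]`; the helper name `count_out_of_range` is hypothetical:

```python
import pandas as pd

def count_out_of_range(scores: pd.Series, low: float = 300, high: float = 850) -> int:
    """Count values outside the usual FICO range [low, high]; missing values count too."""
    return int((~scores.between(low, high)).sum())

# Toy values for illustration only; not taken from lendingclub.csv
toy_fico = pd.Series([0.0, 585.0, 653.0, 691.0, 850.0, 0.0])

print(count_out_of_range(toy_fico))
```

On the real data this would be called as `count_out_of_range(lendingclub["fico"])`. If the count is large, the reported mean of about 603 would be pulled down by placeholder values.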


**Step 3: Build Decision Tree**

The following code trains the decision tree with scikit-learn and reports the performance (accuracy, kappa, and precision) of the trained classifier.
In addition to the tree built on all three features, it also trains and evaluates a tree on each single feature.

In [None]:
# Define the single-feature sets and the combined three-feature set
employment_features = ["emp_length"]
amount_features = ["amount"]
fico_features = ["fico"]

feature_sets = {
    "employment": employment_features,
    "amount": amount_features,
    "fico": fico_features,
    "employment + amount + fico": employment_features + amount_features + fico_features
}
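
The training loop in this notebook passes each feature set through `pd.get_dummies` before fitting. A small sketch, on made-up rows rather than the real file, of what that call does: numeric columns such as `amount` pass through unchanged, while the nominal `emp_length` column becomes one indicator column per label.

```python
import pandas as pd

# Toy rows for illustration only; the labels match values seen in emp_length
toy = pd.DataFrame({
    "emp_length": ["< 1 year", "10+ years", "5 years", "10+ years"],
    "amount": [2500.0, 12000.0, 31825.0, 21000.0],
})

encoded = pd.get_dummies(toy)

# Numeric columns come first, followed by one column per emp_length label
print(list(encoded.columns))
print(encoded)
```

Because each label becomes an independent column, the tree treats `emp_length` as nominal; the ordering from `< 1 year` to `10+ years` is not encoded.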

In [None]:
# Compare feature sets after training each feature set
best = 0
best_name = None
best_actual = None
best_predictions = None

precisions = []
kappas = []
accuracies = []

predictions = {}
actual = None

# For each feature set, train on the training split and evaluate on the held-out test split
for set_name, feature_set in feature_sets.items():

    # Keep only the columns in this feature set and one-hot encode any nominal ones
    X = lendingclub.loc[:, feature_set]
    X = pd.get_dummies(X)

    y = lendingclub["outcome"]

    # Use scikit-learn to create our train/test split and train our decision tree
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.20, random_state=123)
    model = DecisionTreeClassifier(criterion="entropy", random_state=123).fit(X_train, y_train)
    
    # Print the model's hyperparameter settings
    print(model)
    # Calculate accuracy, kappa and precision on the held-out test set
    y_pred = model.predict(X_test)

    accuracy = 100*accuracy_score(y_test, y_pred)
    kappa = cohen_kappa_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, pos_label="accept")

    metric_to_optimize = accuracy
    
    if metric_to_optimize > best:
        best = metric_to_optimize
        best_name = set_name
        
    predictions[set_name] = y_pred
    actual = np.array(list(y_test))
    
    # The printed result includes performance of the feature set and the confusion matrix
    print(f"Results for {set_name}:")
    print(confusion_matrix(y_test, y_pred))
    print(f"Accuracy: {accuracy:.1f} Kappa: {kappa:.3f} Precision: {precision:.3f}")
    precisions.append(precision)
    kappas.append(kappa)
    accuracies.append(accuracy)
    print("------------------------")
    
print(f"Best feature set is: {best_name} \nWith: {best:.1f}% accuracy.")    

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=123, splitter='best')


  _warn_prf(average, modifier, msg_start, len(result))


Results for employment:
[[    0  1872]
 [    0 14349]]
Accuracy: 88.5 Kappa: 0.000 Precision: 0.000
------------------------
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=123, splitter='best')
Results for amount:
[[  417  1455]
 [  223 14126]]
Accuracy: 89.7 Kappa: 0.290 Precision: 0.652
------------------------
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_we
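
The printed metrics can be reproduced by hand from a confusion matrix, which helps when interpreting them. A worked sketch using the `amount` matrix shown above, assuming scikit-learn's default layout (rows are true labels, columns are predicted labels, both in sorted order, so `accept` comes before `reject`). Recall is included for comparison even though the notebook does not print it.

```python
# Confusion matrix for the "amount" feature set, copied from the output above.
# Rows: true accept, true reject. Columns: predicted accept, predicted reject.
tp, fn = 417, 1455     # true accept: predicted accept / predicted reject
fp, tn = 223, 14126    # true reject: predicted accept / predicted reject

n = tp + fn + fp + tn

accuracy = (tp + tn) / n
precision = tp / (tp + fp)   # of predicted accepts, how many were truly accepted
recall = tp / (tp + fn)      # of true accepts, how many the tree found

# Cohen's kappa: observed agreement corrected for chance agreement
p_observed = accuracy
p_chance = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n**2
kappa = (p_observed - p_chance) / (1 - p_chance)

print(f"Accuracy: {100 * accuracy:.1f} Kappa: {kappa:.3f} Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
```

The same arithmetic applied to the `employment` matrix explains its results: that tree never predicts `accept`, so `tp + fp` is 0 and precision is undefined (scikit-learn reports 0 with a warning), and kappa is 0 because the agreement is no better than chance.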

# Question 3: 

Is the decision tree that you trained accurate enough to be used for each of the two business purposes you proposed in question 1? Why or why not?

The current decision tree uses only three features: loan amount, FICO score, and employment length. It leaves out several relevant features, such as geographic information and loan purpose. Although 95% accuracy and 77.9% precision suggest the model is accurate and would lose little money, we still don’t know what the false predictions look like. If they share characteristics, for example if the misclassified loans all have the same purpose, that would be unfair and irresponsible toward the investors interested in those loans.

For the purpose of attracting more investors, the “title” feature needs to be added. With the new decision tree, false predictions will still exist, but we will be able to tell in general which loan purposes are worth investing in.
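
One way to see what the false predictions look like is to group the misclassified test rows by a column of interest. A minimal sketch on made-up data, assuming a frame that holds the true label, the predicted label, and a cleaned loan-purpose column; the column names `purpose`, `actual`, and `predicted` and the helper `error_rate_by_group` are hypothetical, not part of the notebook.

```python
import pandas as pd

def error_rate_by_group(df: pd.DataFrame, group_col: str) -> pd.Series:
    """Share of misclassified rows within each value of group_col."""
    is_error = df["actual"] != df["predicted"]
    return is_error.groupby(df[group_col]).mean().sort_values(ascending=False)

# Toy rows for illustration only; not taken from lendingclub.csv
toy = pd.DataFrame({
    "purpose":   ["debt", "debt", "debt", "debt", "bike", "bike"],
    "actual":    ["accept", "reject", "reject", "reject", "accept", "reject"],
    "predicted": ["accept", "reject", "reject", "accept", "reject", "accept"],
})

print(error_rate_by_group(toy, "purpose"))
```

If error rates differ sharply between groups, the overall accuracy hides an uneven burden on particular kinds of loans.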


# Scoring Rubric
![alt text](https://docs.google.com/uc?export=view&id=1ELG4QWnPjWgiUJI0eL6YiVbMWKlSVdcP)
