# Description
在这个项目中，我们将完整走完一个data science的完整流程：从data cleaning and feature selection to machine learning. 这个项目中，我们将聚焦『信用建模』体系。我们将使用[Lending Club](https://www.lendingclub.com/info/download-data.action)提供的数据。Lending Club是一个个人放贷平台。简单的工作流程是这样的：申请借贷的人(borrower)填写一个详细的申请单子，提供以往的金融历史信息，借款的理由等等.然后Lending Club会根据这些信息，估算这个borrower的信用值，根据信用值不同，生成的利率不同。高利率说明该borrower信用不好，更具『危险性』，不愿意还钱；低利率则相反。利率可以是5.32%-30.99%不等，如果borrower接受Lending Club提供的利率，那么Lending Club便会将在自己的平台上发布借贷信息。投资者（放贷的人）会通过这个平台浏览到这些信息，他们会根据自己的需要决定是否放贷，选择相应的借贷人放贷等。当投资者同意放贷后，借贷人便会收到贷款。之后就是分期付款，还款总数即是投资人的贷款金额+利息+Lending Club的中间费用。通过Lending Club提供的数据，我们要回答的问题是：

* Can we build a machine learning model that can accurately predict if a borrower will pay off their loan on time or not?

关于数据集的数据字典，可以从[这里](https://docs.google.com/spreadsheets/d/191B2yJ4H1ZPXq0_ByhUgWMFZOYem5jFz0Y3by_7YBY4/edit)找到

## Reading the data
首先对数据集进行一个初步的整理：
* 移除掉原数据的第一行，因为他是额外的信息`Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)`并不是行名。
* `desc`列，因为他提供的是关于每一个贷款的描述，是无用信息。
* `url`列，提供的是每一个贷款在Lending Club网站上的位置，无用。
* 删除掉那些缺失值超过总数据50%的列。

上述过程的代码：
```python
import pandas as pd
loans_2007 = pd.read_csv('LoanStats3a.csv', skiprows=1)
half_count = len(loans_2007) / 2
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)
loans_2007 = loans_2007.drop(['desc', 'url'],axis=1)
loans_2007.to_csv('loans_2007.csv', index=False)
```

We work on `loans_2007.csv` file.

In [136]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

loans_2007 = pd.read_csv("loans_2007.csv")
print(loans_2007.iloc[0,:])
print(loans_2007.shape)

id                                1077501
member_id                      1.2966e+06
loan_amnt                            5000
funded_amnt                          5000
funded_amnt_inv                      4975
term                            36 months
int_rate                           10.65%
installment                        162.87
grade                                   B
sub_grade                              B2
emp_title                             NaN
emp_length                      10+ years
home_ownership                       RENT
annual_inc                          24000
verification_status              Verified
issue_d                          Dec-2011
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
zip_code                            860xx
addr_state                             AZ
dti                                 27.65
delinq_2yrs                       

  interactivity=interactivity, compiler=compiler, result=result)


一共有52列，我们需要删除一些属性：
* leak information from the future (after the loan has already been funded)

* don't affect a borrower's ability to pay back a loan (e.g. a randomly generated ID value by Lending Club)

* formatted poorly and need to be cleaned up

* require more data or a lot of processing to turn into a useful feature

* contain redundant information

We need to especially pay attention to data leakage, since it can cause our model to overfit. This is because the model would be using data about the target column that wouldn't be available when we're using the model on future loans.

对比数据字典，对于头18列，我们drop掉下面的几列：
* `id` 由LC随机生成的数字，无用。
* `member_id` 同上
* `funded_amnt` leaks data from the future.(这样看，我们的目的是判断给定信息的借贷人，是否应该借贷给他？`funded_amnt`是借款人收到的借贷金额，即判定同意借贷给他了，所以泄露了未来信息)
* `funded_amnt_inv` 投资人给的钱，同上。
* `grade` 冗余信息，表达的意思跟利率列(`int_rate`)重复。
* `sub_grade` 同上
* `emp_tittle` 借款人的职业信息，这是有用的，但是需要别的数据支持和大量的处理，故放弃。
* `issue_d` leaks data from the future

In [137]:
drop_cols = ['id','member_id','funded_amnt','funded_amnt_inv'
            ,'grade','sub_grade','emp_title','issue_d']

loans_2007 = loans_2007.drop(drop_cols, axis=1)

在接着看18列，决定删掉这些列：
* `zip_code` 与`addr_state`信息重复
* `out_prncp` leaks data from the future
* `out_prncp_inv` 同上
* `total_pymnt` 同上
* `total_pymnt_inv`同上
* `total_rec_prncp` 同上

In [138]:
drop_cols = ['zip_code', 'out_prncp', 'out_prncp_inv',
            'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp']
loans_2007 = loans_2007.drop(drop_cols, axis=1)
loans_2007

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,pymnt_plan,...,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,5000.0,36 months,10.65%,162.87,10+ years,RENT,24000.00,Verified,Fully Paid,n,...,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,2500.0,60 months,15.27%,59.83,< 1 year,RENT,30000.00,Source Verified,Charged Off,n,...,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,2400.0,36 months,15.96%,84.33,10+ years,RENT,12252.00,Not Verified,Fully Paid,n,...,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,10000.0,36 months,13.49%,339.31,10+ years,RENT,49200.00,Source Verified,Fully Paid,n,...,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,3000.0,60 months,12.69%,67.79,1 year,RENT,80000.00,Source Verified,Current,n,...,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
5,5000.0,36 months,7.90%,156.46,3 years,RENT,36000.00,Source Verified,Fully Paid,n,...,161.03,Jan-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
6,7000.0,60 months,15.96%,170.08,8 years,RENT,47004.00,Not Verified,Fully Paid,n,...,1313.76,May-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
7,3000.0,36 months,18.64%,109.43,9 years,RENT,48000.00,Source Verified,Fully Paid,n,...,111.34,Dec-2014,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
8,5600.0,60 months,21.28%,152.39,4 years,OWN,40000.00,Source Verified,Charged Off,n,...,152.39,Aug-2012,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
9,5375.0,60 months,12.69%,121.45,< 1 year,RENT,15000.00,Verified,Charged Off,n,...,121.45,Mar-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


再看最后18列，以下是不要的列：
* `total_rec_int`
* `total_rec_late_fee`
* `recoveries`
* `collection_recovery_fee`
* `last_pymnt_d`
* `last_pymnt_amnt`
All because of leaking data from future.

In [139]:
drop_cols = ['total_rec_int', 'total_rec_late_fee', 'recoveries',
            'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt']
loans_2007 = loans_2007.drop(drop_cols, axis=1)
print(loans_2007.iloc[0,:])
print(loans_2007.shape)

loan_amnt                            5000
term                            36 months
int_rate                           10.65%
installment                        162.87
emp_length                      10+ years
home_ownership                       RENT
annual_inc                          24000
verification_status              Verified
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
addr_state                             AZ
dti                                 27.65
delinq_2yrs                             0
earliest_cr_line                 Jan-1985
inq_last_6mths                          1
open_acc                                3
pub_rec                                 0
revol_bal                           13648
revol_util                          83.7%
total_acc                               9
initial_list_status                     f
last_credit_pull_d               J

## Target column
我们应该使用`loan_status`来作为要预测属性

In [140]:
print(loans_2007['loan_status'].value_counts())

Fully Paid                                             33136
Charged Off                                             5634
Does not meet the credit policy. Status:Fully Paid      1988
Current                                                  961
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
Name: loan_status, dtype: int64


对于这些结果表示什么意思，可以google得到。charged off通俗讲就是『钱没还(完)』

Only the `Fully Paid` and `Charged Off` values describe the final outcome of the loan. While the `Default` status resembles the `Charged Off` status, in Lending Club's eyes, loans that are charged off have essentially no chance of being repaid while default ones have a small chance.

In [141]:
#移除掉不需要的行
loans_2007 = loans_2007[(loans_2007['loan_status'] == "Fully Paid") | (loans_2007['loan_status'] == "Charged Off")]

status_replace = {
    "loan_status" : {
        "Fully Paid": 1,
        "Charged Off": 0,
    }
}

loans_2007 = loans_2007.replace(status_replace)
loans_2007.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,pymnt_plan,...,initial_list_status,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,5000.0,36 months,10.65%,162.87,10+ years,RENT,24000.0,Verified,1,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,2500.0,60 months,15.27%,59.83,< 1 year,RENT,30000.0,Source Verified,0,n,...,f,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,2400.0,36 months,15.96%,84.33,10+ years,RENT,12252.0,Not Verified,1,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,10000.0,36 months,13.49%,339.31,10+ years,RENT,49200.0,Source Verified,1,n,...,f,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
5,5000.0,36 months,7.90%,156.46,3 years,RENT,36000.0,Source Verified,1,n,...,f,Jan-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


drop掉那些只有一个值得列，这些列对模型的建立无用

In [142]:
orig_columns = loans_2007.columns
drop_columns = []
for col in orig_columns:
    col_series = loans_2007[col].dropna().unique() #先把series的Na drop了，否则na也会被当做一个unique value, 由于不是inplace的，
    #所以并不改变原df
    if len(col_series) == 1:
        drop_columns.append(col)
loans_2007 = loans_2007.drop(drop_columns, axis=1)
print(drop_columns)

['pymnt_plan', 'initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']


In [143]:
print(loans_2007.shape)

(38770, 23)


## Preparing the features
初步的data cleaning 结束了，进行进一步的处理

In [144]:
null_counts = loans_2007.isnull().sum()
print(null_counts)

loan_amnt                  0
term                       0
int_rate                   0
installment                0
emp_length              1036
home_ownership             0
annual_inc                 0
verification_status        0
loan_status                0
purpose                    0
title                     11
addr_state                 0
dti                        0
delinq_2yrs                0
earliest_cr_line           0
inq_last_6mths             0
open_acc                   0
pub_rec                    0
revol_bal                  0
revol_util                50
total_acc                  0
last_credit_pull_d         2
pub_rec_bankruptcies     697
dtype: int64


对于这些包含null值的列：
* remove `pub_rec_bankruptcies`列
* 对于 `emp_length`列，na值用0替代
* remove `revol_util`和`title`列中包含null的行，保留这两列。

In [145]:
loans_2007 = loans_2007.drop(['pub_rec_bankruptcies'],
                            axis=1)
loans_2007 = loans_2007.dropna(axis=0)
print(loans_2007.dtypes.value_counts())

object     11
float64    10
int64       1
dtype: int64


In [146]:
object_cols_df = loans_2007.select_dtypes(include=["object"])
print(object_cols_df.iloc[0,:])

term                     36 months
int_rate                    10.65%
emp_length               10+ years
home_ownership                RENT
verification_status       Verified
purpose                credit_card
title                     Computer
addr_state                      AZ
earliest_cr_line          Jan-1985
revol_util                   83.7%
last_credit_pull_d        Jun-2016
Name: 0, dtype: object


In [147]:
cols = ['home_ownership', 'emp_length', 'verification_status','term', 'addr_state']
for c in cols:
    print(loans_2007[c].value_counts())

RENT        18112
MORTGAGE    16686
OWN          2778
OTHER          96
NONE            3
Name: home_ownership, dtype: int64
10+ years    8545
< 1 year     4513
2 years      4303
3 years      4022
4 years      3353
5 years      3202
1 year       3176
6 years      2177
7 years      1714
8 years      1442
9 years      1228
Name: emp_length, dtype: int64
Not Verified       16281
Verified           11856
Source Verified     9538
Name: verification_status, dtype: int64
 36 months    28234
 60 months     9441
Name: term, dtype: int64
CA    6776
NY    3614
FL    2704
TX    2613
NJ    1776
IL    1447
PA    1442
VA    1347
GA    1323
MA    1272
OH    1149
MD    1008
AZ     807
WA     788
CO     748
NC     729
CT     711
MI     678
MO     648
MN     581
NV     466
SC     454
WI     427
OR     422
LA     420
AL     420
KY     311
OK     285
KS     249
UT     249
AR     229
DC     209
RI     194
NM     180
WV     164
HI     162
NH     157
DE     110
MT      77
AK      76
WY      76
SD      60
VT  

In [148]:
print(loans_2007["purpose"].value_counts())
print(loans_2007["title"].value_counts())

debt_consolidation    17751
credit_card            4911
other                  3711
home_improvement       2808
major_purchase         2083
small_business         1719
car                    1459
wedding                 916
medical                 655
moving                  552
house                   356
vacation                348
educational             312
renewable_energy         94
Name: purpose, dtype: int64
Debt Consolidation                                2068
Debt Consolidation Loan                           1599
Personal Loan                                      624
Consolidation                                      488
debt consolidation                                 466
Credit Card Consolidation                          345
Home Improvement                                   336
Debt consolidation                                 314
Small Business Loan                                298
Credit Card Loan                                   294
Personal                      

`home_ownership` `verification_status` `term`只含有少量的diecrete categorical values, 我们可以使用get dummies来转换这些categorical类型。

`purpose`和`tittle`包含重复信息，所以我们保留`purpose`列因为他含有少量的diecrete categorical values

`addr_state`值太多，如果用get_dummies处理会多出来很多列，故不要。

In [149]:
mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}
loans_2007 = loans_2007.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
loans_2007["int_rate"] = loans_2007["int_rate"].str.rstrip("%").astype("float")
loans_2007["revol_util"] = loans_2007["revol_util"].str.rstrip("%").astype("float")
loans_2007 = loans_2007.replace(mapping_dict)

In [150]:
cat_columns = ["home_ownership", "verification_status", "purpose", "term"]
dummy_df = pd.get_dummies(loans_2007[cat_columns])
loans_2007 = pd.concat([loans_2007, dummy_df], axis=1)
loans_2007 = loans_2007.drop(cat_columns, axis=1)

## Building Machine Learning Model

In [151]:
loans_2007.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37675 entries, 0 to 39785
Data columns (total 38 columns):
loan_amnt                              37675 non-null float64
int_rate                               37675 non-null float64
installment                            37675 non-null float64
emp_length                             37675 non-null int64
annual_inc                             37675 non-null float64
loan_status                            37675 non-null int64
dti                                    37675 non-null float64
delinq_2yrs                            37675 non-null float64
inq_last_6mths                         37675 non-null float64
open_acc                               37675 non-null float64
pub_rec                                37675 non-null float64
revol_bal                              37675 non-null float64
revol_util                             37675 non-null float64
total_acc                              37675 non-null float64
home_ownership_MORTGAGE    

## Error Metrics
我们的target column是`loan_status`,实际上是一个二分类问题，我们可以用fpr(false prediction rate), tpr(true prediction rate)来衡量模型

In [152]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
'''
 by setting the class_weight parameter to balanced when creating the LogisticRegression instance. 
 This tells scikit-learn to penalize the misclassification of the minority class during the training process. 
 The penalty means that the logistic regression classifier pays more attention to correctly classifying rows 
 where loan_status is 0.
'''
lr = LogisticRegression()

cols = loans_2007.columns
train_cols = cols.drop("loan_status")
features = loans_2007[train_cols]
target = loans_2007["loan_status"]

predictions = cross_val_predict(lr, features, target, cv=3)

predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans_2007["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans_2007["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans_2007["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans_2007["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp  / (tp + fn)
fpr = fp  / (fp + tn)
print(tpr)
print(fpr)

0.9984268484530676
0.9986179664363277


In [153]:
'''
 by setting the class_weight parameter to balanced when creating the LogisticRegression instance. 
 This tells scikit-learn to penalize the misclassification of the minority class during the training process. 
 The penalty means that the logistic regression classifier pays more attention to correctly classifying rows 
 where loan_status is 0.
'''

lr = LogisticRegression(class_weight="balanced")

predictions = cross_val_predict(lr, features, target, cv=3)

predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans_2007["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans_2007["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans_2007["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans_2007["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp  / (tp + fn)
fpr = fp  / (fp + tn)
print(tpr)
print(fpr)

0.627884111169376
0.6177690029615005


In [154]:
penalty = {
    0: 10,
    1: 1
}

lr = LogisticRegression(class_weight=penalty)

predictions = cross_val_predict(lr, features, target, cv=3)

predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans_2007["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans_2007["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans_2007["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans_2007["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp  / (tp + fn)
fpr = fp  / (fp + tn)
print(tpr)
print(fpr)

0.22623885684320924
0.2238894373149062


In [155]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(class_weight="balanced", random_state=1)
predictions = cross_val_predict(rf, features, target, cv=3)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans_2007["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.`
tp_filter = (predictions == 1) & (loans_2007["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans_2007["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans_2007["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr)
print(fpr)

0.9630964866282119
0.9634748272458046
