## Lending Club
Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return.

Each borrower completes a comprehensive application, providing their financial history, the reason for the loan, and more. Lending Club evaluates each borrower's credit score using historical data and its own data science process, then assigns an interest rate to the borrower. The interest rate is the percentage the borrower has to pay back in addition to the requested loan amount.

A higher interest rate means that the borrower is riskier and less likely to pay back the loan, while a lower interest rate means that the borrower has a good credit history and is more likely to pay back the loan. Interest rates range from 5.32% to 30.99%, and each borrower is given a grade according to the interest rate they were assigned. If the borrower accepts the interest rate, the loan is listed on the Lending Club marketplace.

Investors are primarily interested in receiving a return on their investments. Approved loans are listed on the Lending Club website, where qualified investors can browse recently approved loans, the borrower's credit score, the purpose for the loan, and other information from the application. Once they're ready to back a loan, they select the amount of money they want to fund. Once a loan's requested amount is fully funded, the borrower receives the money they requested minus the origination fee that Lending Club charges.

The borrower makes monthly payments back to Lending Club over either 36 or 60 months. Lending Club redistributes these payments to the investors, so investors don't have to wait until the full amount is paid off before they start to see a return. If a loan is fully paid off on time, the investors make a return that corresponds to the interest rate the borrower had to pay in addition to the requested amount. Many loans, however, aren't completely paid off on time, and some borrowers default on the loan.
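To make the repayment arithmetic concrete, the sketch below applies the standard fixed-rate amortization formula to the first loan in the dataset (5,000 at 10.65% over 36 months). That Lending Club computes installments exactly this way is an assumption, and the function name `monthly_installment` is illustrative rather than part of the dataset or any library.

```python
def monthly_installment(principal, annual_rate_pct, n_months):
    """Fixed monthly payment for a fully amortizing loan (standard formula)."""
    r = annual_rate_pct / 100 / 12  # monthly interest rate
    if r == 0:
        return principal / n_months
    return principal * r / (1 - (1 + r) ** -n_months)

# First loan in the dataset: 5,000 at 10.65% over 36 months
payment = monthly_installment(5000, 10.65, 36)
total_repaid = payment * 36

print(f"monthly payment: {payment:.2f}")
print(f"total repaid:    {total_repaid:.2f}")
print(f"total interest:  {total_repaid - 5000:.2f}")
```

Under this assumption the result comes out close to the `installment` of 162.87 that the dataset reports for this loan.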

### Objective
Can we build a machine learning model that can accurately predict if a borrower will pay off their loan on time or not?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn')

In [2]:
loans_2007 = pd.read_csv("loans_2007.csv", low_memory = False)
loans_2007.shape

(42538, 52)

In [3]:
loans_2007.head()

|   | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | last_pymnt_amnt | last_credit_pull_d | collections_12_mths_ex_med | policy_code | application_type | acc_now_delinq | chargeoff_within_12_mths | delinq_amnt | pub_rec_bankruptcies | tax_liens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077501 | 1296599.0 | 5000.0 | 5000.0 | 4975.0 | 36 months | 10.65% | 162.87 | B | B2 | ... | 171.62 | Jun-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1077430 | 1314167.0 | 2500.0 | 2500.0 | 2500.0 | 60 months | 15.27% | 59.83 | C | C4 | ... | 119.66 | Sep-2013 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1077175 | 1313524.0 | 2400.0 | 2400.0 | 2400.0 | 36 months | 15.96% | 84.33 | C | C5 | ... | 649.91 | Jun-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1076863 | 1277178.0 | 10000.0 | 10000.0 | 10000.0 | 36 months | 13.49% | 339.31 | C | C1 | ... | 357.48 | Apr-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1075358 | 1311748.0 | 3000.0 | 3000.0 | 3000.0 | 60 months | 12.69% | 67.79 | B | B5 | ... | 67.79 | Jun-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### Data Cleaning

In [4]:
loans_2007.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42538 entries, 0 to 42537
Data columns (total 52 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   id                          42538 non-null  object 
 1   member_id                   42535 non-null  float64
 2   loan_amnt                   42535 non-null  float64
 3   funded_amnt                 42535 non-null  float64
 4   funded_amnt_inv             42535 non-null  float64
 5   term                        42535 non-null  object 
 6   int_rate                    42535 non-null  object 
 7   installment                 42535 non-null  float64
 8   grade                       42535 non-null  object 
 9   sub_grade                   42535 non-null  object 
 10  emp_title                   39909 non-null  object 
 11  emp_length                  41423 non-null  object 
 12  home_ownership              42535 non-null  object 
 13  annual_inc                  425

In [5]:
print("Percentage of missing values in each column")
(loans_2007.isna().sum()/loans_2007.shape[0]) * 100

Percentage of missing values in each column


id                            0.000000
member_id                     0.007053
loan_amnt                     0.007053
funded_amnt                   0.007053
funded_amnt_inv               0.007053
term                          0.007053
int_rate                      0.007053
installment                   0.007053
grade                         0.007053
sub_grade                     0.007053
emp_title                     6.180356
emp_length                    2.621186
home_ownership                0.007053
annual_inc                    0.016456
verification_status           0.007053
issue_d                       0.007053
loan_status                   0.007053
pymnt_plan                    0.007053
purpose                       0.007053
title                         0.037613
zip_code                      0.007053
addr_state                    0.007053
dti                           0.007053
delinq_2yrs                   0.075227
earliest_cr_line              0.075227
inq_last_6mths           
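The same per-column percentages can also be obtained with `isna().mean()`, which is handy when sorting columns by how incomplete they are. The sketch below uses a small made-up frame (the name `toy` and its values are illustrative, not taken from `loans_2007.csv`).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for loans_2007 (values are made up for illustration)
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, np.nan, 10000.0],
    "emp_title": [np.nan, "teacher", np.nan, "nurse"],
    "term": ["36 months", "60 months", "36 months", "36 months"],
})

# isna().mean() is the fraction of missing values per column,
# equivalent to isna().sum() / len(toy)
missing_pct = toy.isna().mean() * 100

# Sort so the columns with the most gaps come first
missing_pct = missing_pct.sort_values(ascending=False)
print(missing_pct)
```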

Looking at the first 19 columns:
| name                | dtype   | first value | description                                                                                                                       |
|---------------------|---------|-------------|-----------------------------------------------------------------------------------------------------------------------------------|
| id                  | object  | 1077501     | A unique LC assigned ID for the loan listing.                                                                                     |
| member_id           | float64 | 1.2966e+06  | A unique LC assigned ID for the borrower member.                                                                                  |
| loan_amnt           | float64 | 5000        | The listed amount of the loan applied for by the borrower.                                                                        |
| funded_amnt         | float64 | 5000        | The total amount committed to that loan at that point in time.                                                                    |
| funded_amnt_inv     | float64 | 4975        | The total amount committed by investors for that loan at that point in time.                                                      |
| term                | object  | 36 months   | The number of payments on the loan. Values are in months and can be either 36 or 60.                                              |
| int_rate            | object  | 10.65%      | Interest Rate on the loan                                                                                                         |
| installment         | float64 | 162.87      | The monthly payment owed by the borrower if the loan originates.                                                                  |
| grade               | object  | B           | LC assigned loan grade                                                                                                            |
| sub_grade           | object  | B2          | LC assigned loan subgrade                                                                                                         |
| emp_title           | object  | NaN         | The job title supplied by the Borrower when applying for the loan.                                                                |
| emp_length          | object  | 10+ years   | Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years. |
| home_ownership      | object  | RENT        | The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER.               |
| annual_inc          | float64 | 24000       | The self-reported annual income provided by the borrower during registration.                                                     |
| verification_status | object  | Verified    | Indicates if income was verified by LC, not verified, or if the income source was verified                                        |
| issue_d             | object  | Dec-2011    | The month which the loan was funded                                                                                               |
| loan_status         | object  | Charged Off | Current status of the loan                                                                                                        |
| pymnt_plan          | object  | n           | Indicates if a payment plan has been put in place for the loan                                                                    |
| purpose             | object  | car         | A category provided by the borrower for the loan request.                                                                         |

- Columns like id and member_id only identify the loan and the borrower respectively, so they are not useful as features.
- grade and sub_grade are categorical columns that carry the same information already present in int_rate.
- funded_amnt, funded_amnt_inv and issue_d are only known once the loan has been funded, so they leak information from the future.
- emp_title could help us learn which kinds of borrowers tend to borrow more, but we will ignore this column because it requires additional processing and more data.

In [6]:
loans_2007 = loans_2007.drop(['id', 'member_id', 'grade', 'sub_grade', 'funded_amnt', 'funded_amnt_inv', 'emp_title', 'issue_d'], axis = 1)
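One way to sanity-check the claim that grade repeats what int_rate already says is to look at the range of interest rates inside each grade. The sketch below does this on a few made-up rows in the same string format as the dataset; the frame `toy` and the helper column `int_rate_num` are illustrative names. On the real file the ranges may overlap, since the rate attached to a grade can change over time, so this is a quick check rather than a proof.

```python
import pandas as pd

# Made-up rows in the same format as the int_rate and grade columns
toy = pd.DataFrame({
    "int_rate": ["6.03%", "7.90%", "10.65%", "12.69%", "13.49%", "15.96%"],
    "grade":    ["A",     "A",     "B",      "B",      "C",      "C"],
})

# Strip the percent sign so the rate can be treated as a number
toy["int_rate_num"] = toy["int_rate"].str.rstrip("%").astype(float)

# Range of interest rates within each grade
ranges = toy.groupby("grade")["int_rate_num"].agg(["min", "max"])
print(ranges)
```

If the ranges line up in grade order with little or no overlap, keeping only int_rate loses very little.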

For the next 18 columns:
| name                | dtype   | first value | description                                                                                                                                                                                              |
|---------------------|---------|-------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| title               | object  | Computer    | The loan title provided by the borrower                                                                                                                                                                  |
| zip_code            | object  | 860xx       | The first 3 numbers of the zip code provided by the borrower in the loan application.                                                                                                                    |
| addr_state          | object  | AZ          | The state provided by the borrower in the loan application                                                                                                                                               |
| dti                 | float64 | 27.65       | A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income. |
| delinq_2yrs         | float64 | 0           | The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years                                                                                             |
| earliest_cr_line    | object  | Jan-1985    | The month the borrower's earliest reported credit line was opened                                                                                                                                        |
| inq_last_6mths      | float64 | 1           | The number of inquiries in past 6 months (excluding auto and mortgage inquiries)                                                                                                                         |
| open_acc            | float64 | 3           | The number of open credit lines in the borrower's credit file.                                                                                                                                           |
| pub_rec             | float64 | 0           | Number of derogatory public records                                                                                                                                                                      |
| revol_bal           | float64 | 13648       | Total credit revolving balance                                                                                                                                                                           |
| revol_util          | object  | 83.7%       | Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.                                                                               |
| total_acc           | float64 | 9           | The total number of credit lines currently in the borrower's credit file                                                                                                                                 |
| initial_list_status | object  | f           | The initial listing status of the loan. Possible values are – W, F                                                                                                                                       |
| out_prncp           | float64 | 0           | Remaining outstanding principal for total amount funded                                                                                                                                                  |
| out_prncp_inv       | float64 | 0           | Remaining outstanding principal for portion of total amount funded by investors                                                                                                                          |
| total_pymnt         | float64 | 5863.16     | Payments received to date for total amount funded                                                                                                                                                        |
| total_pymnt_inv     | float64 | 5833.84     | Payments received to date for portion of total amount funded by investors                                                                                                                                |
| total_rec_prncp     | float64 | 5000        | Principal received to date                                                                                                                                                                               |    

- zip_code shows only the first 3 of its 5 digits and is largely redundant with addr_state.
- out_prncp, out_prncp_inv, total_pymnt, total_pymnt_inv and total_rec_prncp are only available once the loan has been funded, so they leak information.

In [7]:
loans_2007 = loans_2007.drop(['zip_code', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp'], axis = 1)
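The redundancy between zip_code and addr_state can be checked by counting how many distinct states each zip prefix maps to. The sketch below runs that check on made-up rows (the frame `toy` is illustrative); the same two lines could be pointed at the full dataset before the column is dropped.

```python
import pandas as pd

# Made-up rows in the format of the zip_code and addr_state columns
toy = pd.DataFrame({
    "zip_code":   ["860xx", "860xx", "309xx", "606xx", "606xx"],
    "addr_state": ["AZ",    "AZ",    "GA",    "IL",    "IL"],
})

# Count how many distinct states each zip prefix maps to
states_per_zip = toy.groupby("zip_code")["addr_state"].nunique()

# Share of zip prefixes that point to exactly one state
share_single_state = (states_per_zip == 1).mean()
print(states_per_zip)
print(f"zip prefixes mapping to one state: {share_single_state:.0%}")
```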

For the remaining columns:
| name                       | dtype   | first value | description                                                                                          |
|----------------------------|---------|-------------|------------------------------------------------------------------------------------------------------|
| total_rec_int              | float64 | 863.16      | Interest received to date                                                                            |
| total_rec_late_fee         | float64 | 0           | Late fees received to date                                                                           |
| recoveries                 | float64 | 0           | post charge off gross recovery                                                                       |
| collection_recovery_fee    | float64 | 0           | post charge off collection fee                                                                       |
| last_pymnt_d               | object  | Jan-2015    | Last month payment was received                                                                      |
| last_pymnt_amnt            | float64 | 171.62      | Last total payment amount received                                                                   |
| last_credit_pull_d         | object  | Jun-2016    | The most recent month LC pulled credit for this loan                                                 |
| collections_12_mths_ex_med | float64 | 0           | Number of collections in 12 months excluding medical collections                                     |
| policy_code                | float64 | 1           | publicly available policy_code=1 new products not publicly available policy_code=2                   |
| application_type           | object  | INDIVIDUAL  | Indicates whether the loan is an individual application or a joint application with two co-borrowers |
| acc_now_delinq             | float64 | 0           | The number of accounts on which the borrower is now delinquent.                                      |
| chargeoff_within_12_mths   | float64 | 0           | Number of charge-offs within 12 months                                                               |
| delinq_amnt                | float64 | 0           | The past-due amount owed for the accounts on which the borrower is now delinquent.                   |
| pub_rec_bankruptcies       | float64 | 0           | Number of public record bankruptcies                                                                 |
| tax_liens                  | float64 | 0           | Number of tax liens                                                                                  |

The total_rec_int, total_rec_late_fee, recoveries, collection_recovery_fee, last_pymnt_d and last_pymnt_amnt columns are only populated once the loan has been granted and the borrower has started making payments, so they leak information.

In [8]:
loans_2007 = loans_2007.drop(['total_rec_int', 'total_rec_late_fee', 'recoveries',
                              'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt'], axis = 1)

In [9]:
loans_2007.shape

(42538, 32)

By understanding what each column means, we were able to reduce the number of columns from 52 to 32.
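The three separate drops can also be collected in one place together with the reason for each removal, which makes the cleaning step easier to audit. This is only a sketch: the names `DROP_REASONS` and `drop_documented_columns` are hypothetical, and the grouping simply restates the reasons given in the text.

```python
import pandas as pd

# Columns the notebook removes, grouped by the reason given in the text
DROP_REASONS = {
    "identifier": ["id", "member_id"],
    "redundant": ["grade", "sub_grade", "zip_code"],
    "needs_extra_processing": ["emp_title"],
    "leaks_future_information": [
        "funded_amnt", "funded_amnt_inv", "issue_d",
        "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv",
        "total_rec_prncp", "total_rec_int", "total_rec_late_fee",
        "recoveries", "collection_recovery_fee",
        "last_pymnt_d", "last_pymnt_amnt",
    ],
}

def drop_documented_columns(df):
    """Return a copy of df without the columns listed in DROP_REASONS."""
    to_drop = [col for cols in DROP_REASONS.values() for col in cols]
    # errors="ignore" keeps the call safe if a column is already gone
    return df.drop(columns=to_drop, errors="ignore")

# Number of columns removed in total (52 original columns minus 32 remaining)
n_dropped = sum(len(cols) for cols in DROP_REASONS.values())
print(n_dropped)
```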

The loan_status column can serve as our target since it contains the information about whether the loan has been paid off or not. However, this column holds categorical information in text form, so we need to convert it into an appropriate numerical form.

In [10]:
loans_2007['loan_status'].value_counts(normalize = True) * 100

Fully Paid                                             77.902903
Charged Off                                            13.245562
Does not meet the credit policy. Status:Fully Paid      4.673798
Current                                                 2.259316
Does not meet the credit policy. Status:Charged Off     1.789115
Late (31-120 days)                                      0.056424
In Grace Period                                         0.047020
Late (16-30 days)                                       0.018808
Default                                                 0.007053
Name: loan_status, dtype: float64

| Loan Status                                         | Meaning                                                                                                                                           |
|-----------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| Fully Paid                                          | Loan has been fully paid off.                                                                                                                     |
| Charged Off                                         | Loan for which there is no longer a reasonable expectation of further payments.                                                                   |
| Does not meet the credit policy. Status:Fully Paid  | While the loan was paid off, the loan application today would no longer meet the credit policy and wouldn't be approved on to the marketplace.    |
| Does not meet the credit policy. Status:Charged Off | While the loan was charged off, the loan application today would no longer meet the credit policy and wouldn't be approved on to the marketplace. |
| In Grace Period                                     | The loan is past due but still in the grace period of 15 days.                                                                                    |
| Late (16-30 days)                                   | Loan hasn't been paid in 16 to 30 days (late on the current payment).                                                                             |
| Late (31-120 days)                                  | Loan hasn't been paid in 31 to 120 days (late on the current payment).                                                                            |
| Current                                             | Loan is up to date on current payments.                                                                                                           |
| Default                                             | Loan is defaulted on and no payment has been made for more than 121 days.                                                                         |

We are mainly interested in whether a loan will be paid off completely or charged off, so we will remove the other loan statuses and treat this as a binary classification problem.
Fully Paid will be represented by 1 and Charged Off by 0.

In [11]:
# Keep only loans with a final outcome; .copy() avoids a SettingWithCopyWarning
loans_2007 = loans_2007[loans_2007['loan_status'].isin(['Fully Paid', 'Charged Off'])].copy()
# Encode the target: 1 = Fully Paid, 0 = Charged Off
loans_2007['loan_status'] = loans_2007['loan_status'].map({
    "Fully Paid" : 1,
    "Charged Off" : 0})

In [12]:
loans_2007.shape

(38770, 32)
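The row count after filtering can be cross-checked against the percentages printed earlier. The sketch below uses only figures copied from the notebook output; the count of 3 missing loan_status values is inferred from its 0.007053% missing share, the same share as the columns that show 42535 non-null entries out of 42538.

```python
# Figures copied from the notebook output above
total_rows = 42538
missing_loan_status = 3          # 42538 rows minus 42535 non-null values
pct_fully_paid = 77.902903
pct_charged_off = 13.245562

# value_counts(normalize=True) ignores missing values by default,
# so the percentages are relative to the non-null rows
non_null_rows = total_rows - missing_loan_status
kept_rows = round(non_null_rows * (pct_fully_paid + pct_charged_off) / 100)
print(kept_rows)
```

The result should agree with the number of rows in the shape printed above.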

In [13]:
loans_2007['loan_status'].value_counts()

1    33136
0     5634
Name: loan_status, dtype: int64

There are far more fully paid records (33,136) than charged-off ones (5,634), roughly six to one. This imbalance can bias our model towards predicting fully paid.
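To see why the imbalance matters, it helps to compute the accuracy of a trivial model that always predicts "Fully Paid", along with one common way of weighting the classes. The weighting formula below is the "balanced" heuristic (the same one scikit-learn uses for `class_weight="balanced"`); whether this notebook goes on to use it is an assumption, so treat it as an illustration only.

```python
# Class counts copied from the value_counts() output above
n_fully_paid = 33136   # label 1
n_charged_off = 5634   # label 0
n_total = n_fully_paid + n_charged_off

# Accuracy of a trivial model that always predicts "Fully Paid"
baseline_accuracy = n_fully_paid / n_total
print(f"always-predict-1 accuracy: {baseline_accuracy:.4f}")

# "Balanced" weights: n_samples / (n_classes * class_count),
# so the rarer class gets a proportionally larger weight
weights = {
    1: n_total / (2 * n_fully_paid),
    0: n_total / (2 * n_charged_off),
}
print(weights)
```

Any model we build has to beat this baseline to be useful, which is why plain accuracy is a weak metric for this problem.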

To reduce the number of columns further, we will remove the columns that contain just a single unique value, since they won't contribute anything to the model.

In [14]:
drop_columns = []
for col in loans_2007.columns:
    if len(loans_2007[col].dropna(axis = 0).unique()) == 1:
        drop_columns.append(col)
loans_2007 = loans_2007.drop(drop_columns, axis = 1)
print("Columns that are dropped are: \n{}".format(drop_columns))

Columns that are dropped are: 
['pymnt_plan', 'initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']
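The same filter can be written without an explicit loop by using `DataFrame.nunique()`, which ignores missing values by default. The sketch below shows the idea on a small made-up frame (the name `toy` and its values are illustrative).

```python
import numpy as np
import pandas as pd

# Made-up rows; policy_code and pymnt_plan never vary, tax_liens has a gap
toy = pd.DataFrame({
    "loan_amnt":   [5000.0, 2500.0, 2400.0],
    "policy_code": [1.0, 1.0, 1.0],
    "pymnt_plan":  ["n", "n", "n"],
    "tax_liens":   [0.0, np.nan, 0.0],
})

# nunique() skips NaN by default, matching dropna().unique() in the loop above
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()

toy = toy.drop(columns=constant_cols)
print(constant_cols)
print(toy.columns.tolist())
```

Note that a column holding one value plus some missing entries, like tax_liens here, is also treated as constant, which matches the behavior of the loop.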
