# LendingClub

This project develops the machine learning component of a hypothetical loan-approval process at LendingClub. The goal of this component is to let machine learning handle routine screening decisions, freeing humans for the tasks that only humans can do and allowing the process to scale almost without limit.

## Business Context

I read [this article](https://www.debt.org/credit/loans/personal/lending-club-review/) on how LendingClub loans work to get an idea of how to structure the problem.

There are multiple areas of LendingClub's business model, as described in the article above, in which machine learning could be useful.

However, we are focused strictly on *whether or not to allow a borrower to pass the screening phase*.

## Import Packages

In [2]:
# data loading and wrangling
from zipfile import ZipFile
import pandas as pd
import numpy as np

## Load Data

In [4]:
# open the archive to see the names of the files within
with ZipFile('archive.zip') as zipArchive:
    print(zipArchive.namelist())

['accepted_2007_to_2018Q4.csv.gz', 'accepted_2007_to_2018q4.csv/accepted_2007_to_2018Q4.csv', 'rejected_2007_to_2018Q4.csv.gz', 'rejected_2007_to_2018q4.csv/rejected_2007_to_2018Q4.csv']


In [9]:
# load data on accepted applications as a Pandas DataFrame
with ZipFile('archive.zip') as zipArchive:
    with zipArchive.open('accepted_2007_to_2018Q4.csv.gz') as file:
        accepted = pd.read_csv(file, compression='gzip')


In [10]:
# load data on rejected applications
with ZipFile('archive.zip') as zipArchive:
    with zipArchive.open('rejected_2007_to_2018Q4.csv.gz') as file:
        rejected = pd.read_csv(file, compression='gzip')
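Both files are large, so it can help to read only a few rows first and check the columns before committing to a full load. The sketch below is a minimal example of that idea: `preview_member` is a hypothetical helper, and the tiny in-memory archive with a `demo.csv.gz` member is a made-up stand-in for `archive.zip`.

```python
import gzip
import io
from zipfile import ZipFile

import pandas as pd


def preview_member(archive, member, nrows=5):
    """Read only the first `nrows` rows of a gzipped CSV stored inside a zip archive."""
    with ZipFile(archive) as zip_archive:
        with zip_archive.open(member) as file:
            return pd.read_csv(file, compression='gzip', nrows=nrows)


# build a tiny stand-in archive in memory (made-up data, not the real files)
csv_bytes = b'loan_amnt,dti\n1000,10.5\n2000,20.1\n3000,5.0\n'
buffer = io.BytesIO()
with ZipFile(buffer, 'w') as zip_archive:
    zip_archive.writestr('demo.csv.gz', gzip.compress(csv_bytes))
buffer.seek(0)

preview = preview_member(buffer, 'demo.csv.gz', nrows=2)
print(preview)
```

Against the real data, the equivalent call would be `preview_member('archive.zip', 'rejected_2007_to_2018Q4.csv.gz')`.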

## Data Wrangling

<img src="https://assets.change.org/photos/8/ls/ns/SALsnsqkxAKFuim-1600x900-noPad.jpg?1586140235" align='right' alt="Lasso" style="width: 10%;"/>

In [11]:
# how many records? How many features?
accepted.shape

(2260701, 151)

In [12]:
rejected.shape

(27648741, 9)

I wonder why the rejected applications have so many fewer features than the accepted ones.

In [13]:
rejected.columns

Index(['Amount Requested', 'Application Date', 'Loan Title', 'Risk_Score',
       'Debt-To-Income Ratio', 'Zip Code', 'State', 'Employment Length',
       'Policy Code'],
      dtype='object')

In [84]:
# count how many column names in rejected also appear in accepted
# (pass the Index itself to isin; wrapping it in a list compares against the whole Index as one item)
num_cols_in_both = rejected.columns.isin(accepted.columns).sum()
print(f'{num_cols_in_both} out of {len(rejected.columns)} columns in rejected are also in accepted.')

0 out of 9 columns in rejected are also in accepted.
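An exact-name comparison misses columns that differ only in case or punctuation. As a rough sketch, the names can be normalized before comparing. Here `normalize` is a hypothetical helper, and `accepted_cols` is a small hand-picked subset of the accepted columns that appear later in this notebook, not the full list of 151.

```python
import re


def normalize(name):
    """Lowercase a column name and collapse runs of non-alphanumerics into underscores."""
    return re.sub(r'[^0-9a-z]+', '_', name.lower()).strip('_')


# column names copied from rejected.columns above
rejected_cols = ['Amount Requested', 'Application Date', 'Loan Title', 'Risk_Score',
                 'Debt-To-Income Ratio', 'Zip Code', 'State', 'Employment Length',
                 'Policy Code']

# assumption: a hand-picked subset of accepted.columns, for illustration only
accepted_cols = ['loan_amnt', 'purpose', 'dti', 'zip_code', 'addr_state', 'emp_length',
                 'fico_range_low', 'fico_range_high']

accepted_normalized = {normalize(col) for col in accepted_cols}
near_matches = [col for col in rejected_cols if normalize(col) in accepted_normalized]
print(near_matches)
```

With this subset only Zip Code lines up automatically, so most of the matching still has to be done by reading the column names.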


In [49]:
# look at accepted columns, a few at a time
accepted.dtypes[135:]

hardship_end_date                              object
payment_plan_start_date                        object
hardship_length                               float64
hardship_dpd                                  float64
hardship_loan_status                           object
orig_projected_additional_accrued_interest    float64
hardship_payoff_balance_amount                float64
hardship_last_payment_amount                  float64
disbursement_method                            object
debt_settlement_flag                           object
debt_settlement_flag_date                      object
settlement_status                              object
settlement_date                                object
settlement_amount                             float64
settlement_percentage                         float64
settlement_term                               float64
dtype: object

In order to predict whether an applicant should be accepted or rejected, we need a set of features which is common to both types of applicants in our data.

Let's see if we can find common features within the two frames, even if they are named differently. Below are the obvious matches.

        rejected                accepted
        Amount Requested        loan_amnt
        Loan Title              purpose?
        Risk_Score              ???
        Debt-To-Income Ratio    dti
        Zip Code                zip_code
        State                   addr_state
        Employment Length       emp_length

Let's see if we can match up Loan Title with a column in accepted (maybe purpose).

In [71]:
# inspect examples of Loan Titles
rejected['Loan Title'].value_counts()

Debt consolidation                      6418016
debt_consolidation                      5895211
Other                                   2656222
Credit card refinancing                 2298199
other                                   2042528
                                         ...   
dacapo                                        1
Consolidation for Good Credit Senior          1
Independent Auto Dealer                       1
Need money to relocate and buy home           1
The Right Way, One More Time                  1
Name: Loan Title, Length: 73928, dtype: int64

In [72]:
# inspect examples of "purpose" in accepted df
accepted.purpose.value_counts()

debt_consolidation    1277877
credit_card            516971
home_improvement       150457
other                  139440
major_purchase          50445
medical                 27488
small_business          24689
car                     24013
vacation                15525
moving                  15403
house                   14136
wedding                  2355
renewable_energy         1445
educational               424
Name: purpose, dtype: int64

Excellent! It appears that Loan Title in rejected holds the same, or at least very similar, information as purpose in accepted.

        rejected                accepted
        Amount Requested        loan_amnt
        Loan Title              purpose
        Risk_Score              ???
        Debt-To-Income Ratio    dti
        Zip Code                zip_code
        State                   addr_state
        Employment Length       emp_length
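Loan Title has 73,928 distinct values while purpose has only 14, so the titles would need to be folded into the purpose categories before the two columns can be treated as one feature. The sketch below shows one possible approach. `title_to_purpose` and the `ALIASES` table are hypothetical, and sending unmatched or missing titles to 'other' is an assumption rather than a decision made in this notebook.

```python
# the 14 categories shown in accepted.purpose.value_counts() above
PURPOSES = {
    'debt_consolidation', 'credit_card', 'home_improvement', 'other',
    'major_purchase', 'medical', 'small_business', 'car', 'vacation',
    'moving', 'house', 'wedding', 'renewable_energy', 'educational',
}

# hypothetical aliases for titles that do not normalize directly to a category
ALIASES = {'credit_card_refinancing': 'credit_card'}


def title_to_purpose(title):
    """Map a free-text loan title onto one of the purpose categories."""
    if not isinstance(title, str):
        return 'other'  # assumption: treat missing titles as 'other'
    key = '_'.join(title.lower().split())  # 'Debt consolidation' -> 'debt_consolidation'
    key = ALIASES.get(key, key)
    return key if key in PURPOSES else 'other'


# a few titles taken from the Loan Title output above
titles = ['Debt consolidation', 'debt_consolidation', 'Other',
          'Credit card refinancing', 'dacapo']
mapped = [title_to_purpose(title) for title in titles]
print(mapped)
```

On the real data this could be applied with `rejected['Loan Title'].map(title_to_purpose)`.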

We were able to match every feature that seems suitable for prediction except the risk score. =( The accepted dataframe records several scores, but none appears to be LendingClub's "proprietary risk score", which I assume is the "risk score" in the rejected dataframe.

Let me check one more thing:

In [83]:
# what range are risk scores in? (peek at the first six rows)
rejected.loc[:5, 'Risk_Score']

0    693.0
1    703.0
2    715.0
3    698.0
4    509.0
5    645.0
Name: Risk_Score, dtype: float64

In [82]:
# what range are FICO scores in? (peek at the first six rows)
accepted.loc[:5, ['fico_range_low', 'fico_range_high']]

   fico_range_low  fico_range_high
0           675.0            679.0
1           715.0            719.0
2           695.0            699.0
3           785.0            789.0
4           695.0            699.0
5           690.0            694.0
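Six rows are only a peek. Comparing the minimum and maximum of each column is a more direct way to see whether the two scores share a scale. The sketch below uses just the six values printed above as stand-in data; the 300 to 850 bounds are the standard FICO score range, and the variable names are illustrative.

```python
import pandas as pd

# stand-in data: the six rows printed above, not the full columns
risk_score = pd.Series([693.0, 703.0, 715.0, 698.0, 509.0, 645.0], name='Risk_Score')
fico = pd.DataFrame({
    'fico_range_low': [675.0, 715.0, 695.0, 785.0, 695.0, 690.0],
    'fico_range_high': [679.0, 719.0, 699.0, 789.0, 699.0, 694.0],
})

# put the extremes of both scores side by side
summary = pd.DataFrame(
    {
        'min': [risk_score.min(), fico['fico_range_low'].min()],
        'max': [risk_score.max(), fico['fico_range_high'].max()],
    },
    index=['rejected Risk_Score', 'accepted FICO'],
)
print(summary)

# do all sampled risk scores fall inside the standard FICO range?
all_in_fico_range = bool(risk_score.between(300, 850).all())
print(all_in_fico_range)
```

On the real data, `rejected['Risk_Score'].describe()` and `accepted[['fico_range_low', 'fico_range_high']].describe()` would give the same comparison over every row.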


Actually, it looks like the risk score in the rejected data is NOT proprietary and is probably just a FICO score, since its values fall in the same range as fico_range_low and fico_range_high. Let's match those together.

        rejected                accepted
        Amount Requested        loan_amnt
        Loan Title              purpose
        Risk_Score              average of: fico_range_low, fico_range_high
        Debt-To-Income Ratio    dti
        Zip Code                zip_code
        State                   addr_state
        Employment Length       emp_length
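The mapping above can be turned into a single labeled frame roughly as follows. This is a minimal sketch, not the wrangling this notebook goes on to do: `common_features`, the `fico` column name, and the `accepted` label column are all hypothetical, and the two toy rows are made up apart from the scores, which come from the outputs above.

```python
import pandas as pd

# rejected column name -> accepted column name, following the table above
RENAME = {
    'Amount Requested': 'loan_amnt',
    'Loan Title': 'purpose',
    'Risk_Score': 'fico',  # hypothetical name for the shared score column
    'Debt-To-Income Ratio': 'dti',
    'Zip Code': 'zip_code',
    'State': 'addr_state',
    'Employment Length': 'emp_length',
}
SHARED = ['loan_amnt', 'purpose', 'dti', 'zip_code', 'addr_state', 'emp_length']


def common_features(accepted, rejected):
    """Stack both frames on their shared features, with a 0/1 'accepted' label."""
    acc = accepted[SHARED].copy()
    # average the two FICO bounds to get one score per application
    acc['fico'] = accepted[['fico_range_low', 'fico_range_high']].mean(axis=1)
    acc['accepted'] = 1
    rej = rejected[list(RENAME)].rename(columns=RENAME)
    rej['accepted'] = 0
    return pd.concat([acc, rej], ignore_index=True)


# toy rows: only the scores come from the outputs above, the rest is made up
toy_accepted = pd.DataFrame({
    'loan_amnt': [3600.0], 'purpose': ['debt_consolidation'], 'dti': [5.9],
    'zip_code': ['190xx'], 'addr_state': ['PA'], 'emp_length': ['10+ years'],
    'fico_range_low': [675.0], 'fico_range_high': [679.0],
})
toy_rejected = pd.DataFrame({
    'Amount Requested': [1000.0], 'Loan Title': ['Other'], 'Risk_Score': [693.0],
    'Debt-To-Income Ratio': [10.0], 'Zip Code': ['481xx'], 'State': ['NM'],
    'Employment Length': ['4 years'],
})

combined = common_features(toy_accepted, toy_rejected)
print(combined)
```

Keeping the rename table in one dictionary means the mapping lives in a single place if any of the matches turn out to be wrong.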

Now, it's time to REALLY wrangle this data!