## 1. Data Cleaning

We follow the __Quartz: Bad Data Guide__ at https://github.com/Quartz/bad-data-guide, and conduct the following steps:

1. Check data size before reading-in. As a rule of thumb, the data size should be at most 1/4 of the computer's RAM.
1. Summarize the variables through grouping. Understand the meaning of each variable, and group them into meaningful baskets.
1. Inspect missing values. Analyze their causes, and decides what to do with them.
1. Variable transformation. 
    - Properly format variables. For dates, convert them into computable formats; for texts, remove the redundencies and check for spelling; for numerical values, make sure the units are consistent. 
    - Analyze scope of the data. Does the dataset include enough variables to provide meaningful answers to the questions we're interested in?
    - Check granularity of data. What does each data point represent and in what ways can they be meaningufully aggregated?



We first do each step manually, then wrap our actions into functions so that the process is more replicable on new datasets. We export the cleaned training and test set.

In [57]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_colwidth',2000)
sns.set()

### 1.1 Data Size & First Peek

In [2]:
!ls -lh ../data

total 30M
-rw-r--r-- 1 nleea 197609 1.1M Dec  3 01:32 LCtest_halfCleaned.xlsx
-rw-r--r-- 1 nleea 197609  12M Dec  3 01:32 LCtraining_halfCleaned.xlsx
-rw-r--r-- 1 nleea 197609 1.5M Nov  7 18:52 LendingClubData_testing.xlsx
-rw-r--r-- 1 nleea 197609  16M Dec  2 14:45 LendingClubData_training.xlsx
drwxr-xr-x 1 nleea 197609    0 Dec  3 02:59 statewide_monthly_unemployment
-rw-r--r-- 1 nleea 197609  25K Dec  3 03:01 unemployment.csv
-rw-r--r-- 1 nleea 197609 885K Dec  3 08:47 us_companies_names_industries.csv


The raw data is about 16MB, safe to read into a RAM of 8GB.  

In [3]:
%%time
## Reading in the dataset
rawTraining = pd.read_excel("../data/LendingClubData_training.xlsx")
rawTest = pd.read_excel("../data/LendingClubData_testing.xlsx")

Wall time: 31.5 s


The training set contains 35808 rows, and the test set 3978 rows; both contain the same set of 145 features. Our training to test sample size is very close to 9:1.

In [4]:
rawTraining.shape
# rawTraining.head(5)
rawTest.shape
# rawTest.head(5)
rawTraining.shape[0] / rawTest.shape[0]

(35808, 145)

(3978, 145)

9.001508295625943

A quick peek shows that out of the 145 features, 83 are completely empty. Though they apparently cannot be used for predictive purpose in this iteration, their future inclusion may improve the model's quality. Before hastily discarding them, we want to assess whether they can add any _material insights_ to the current dataset. To do this, we have to first understand what the non-empty variables are telling us.

In [150]:
# The function takes in a lending club dataframe and returns a list of names of columns that are empty.
def empty_columns(lendingClub_df):
    empty_col_bool = lendingClub_df.isnull().sum(axis=0) == lendingClub_df.shape[0]
    empty_cols = lendingClub_df.columns[empty_col_bool]
    return empty_cols
# Do training and test share the same empty columns? Yes!
(empty_columns(rawTraining) != empty_columns(rawTest)).sum()
# We can use one variable to reflect the empty columns in both. We use underline separated names for functions, and upper-case letters
# for variable names.
emptyColumns = empty_columns(rawTraining)
emptyColumns.unique().shape

def non_empty_columns(lendingClub_df):
    nonEmptyCol_bool = lendingClub_df.isnull().sum(axis=0) != lendingClub_df.shape[0]
    nonEmptyCols = lendingClub_df.columns[nonEmptyCol_bool]
    return nonEmptyCols
nonEmptyColumns = non_empty_columns(rawTraining)
nonEmptyColumns.shape

0

(83,)

(62,)

### 1.2 Summary of Variables

__Cleaning the Data Dictionary__

To conviniently consult the data dictionary, we read it in as a pandas DataFrame. 

- The original dictionary was split between three excel sheets. We manually copied and pasted three sheets into one, and call it the "flattened" dictionary.

- Notice that some variable names in the data dictionary contain a trailing space. We remove them.

- We further discover there's one column that isn't documented in the data dictionary -- `verification_status_joint`. A closer look shows that in the data dictionary, the same variable is termed as `verified_status_joint`. We'd modify the data dictionary to ensure consistency in naming. 

In [151]:
## Read in the data dictionary. 
dataDict = pd.read_excel("../dict/LendingClubDataDictionary_Flattened.xlsx")
dataDict.columns = ["Variable", "Description"]
dataDict.drop_duplicates(subset="Variable", keep="first", inplace=True)
# Notice some variable names in the data dictionary end with an extra space. We need to remove them.
dataDict["Variable"] = dataDict["Variable"].str.rstrip()
variableNames = list(dataDict["Variable"])
# Any columns not documented in the data dictionary?
columns = list(rawTraining.columns)
inDictBool = [(i in variableNames) for i in columns]
(~np.array(inDictBool)).sum()
# We change the naming in data dictionary to ensure consistency
dataDict["Variable"] = dataDict["Variable"].str.replace("verified_status_joint","verification_status_joint")
# Set variable as index for easier selection.
dataDict.set_index("Variable", inplace=True)
dataDict.to_excel("../dict/LendingClubDataDictionary_Cleaned.xlsx")

1

__Summarizing the non-empty variables__

We have 62 non-empty variables. To better summarize the content, we put them into four baskets:

- Loan Conditions:
    - _Information on the application_. Such as `Loan Amount` applied for, `Title`, `Purpose`, `Description` for the loan and `Application Type` indicating whether the application is made by an individual or jointly), 
    - _Loan parameters_. Such as the actual `Amount Funded`, `Interest Rate`, and `Grade` assigned to the loan
    - _`Loan Status`_. Whether it's fully paid for charged-off.
- Borrower's Financial Conditions:
    - _Employment Status_. `Annual Income`, `Employer Title`, `Employment Length` all fall under this category.
    - _Residental Status_. This includes `Homeonwership`, `State`, `Zip Code`.
    - _`Debt-to-income Ratio`_ is also in this category.
- Borrower's Credit Situation:
    - _Length of credit history_, as reflected in `Earliest Credit Line`.
    - _History of late payment_, as captured by `Delinquencies in 2 years`, `Monthes Since Last Delinquency`, number of `Accounts Now Delinquent`, `Delinquent Amount`, `Chargeoffs in 12 Months` etc.
    - _Credit utilization_. This includes `Revolving Balance`, `Revolving Balance Utilization`, `Number of Total Accounts`. 
    - _New credit line inquiries_. `Inquiries in 6 months`, `Last Credit Pulled Date`.
    - _Other credit burdens_ such as number of `Tax Liens`, number of `Public Bankcruptcy Records` or `Derogatory Public Records`, `Collections in 12 Months`.
- Payment on this loan.
    - _Payments received_ so far on principle and interest.
    - _Settlement plan_. If default occurs, whether a `Debt Settlement Flag`(plan) is agreed upon, and parameters and progress on that settlement.

In [152]:
# Tidying up the variable names. 
variable_name_original = list(non_empty_columns(rawTraining))
variable_name_tidy = ["Loan Amount", "Funded Amount", "Funded Amount Investor", "Term", "Interest Rate", "Installment", "Grade", 
                      "Sub Grade", "Employer Title", "Employment Length", "Home Ownership", "Annual Income", "Verification Status",
                     "Issued Date", "Loan Status", "Payment Plan", "Description", "Purpose", "Title", "Zip Code", "State", "Debt-to-income Ratio",
                     "Delinquencies in 2 years", "Earliest Credit Line", "Inquiries in 6 months", "Months Since Last Delinquency",
                     "Months Since Last Public Record", "Open Accounts", "Derogatory Public Records", "Revolving Balance", 
                     "Revolving Balance Utilization", "Number of Total Accounts", "Initial List Status", "Outstanding Principle", 
                     "Outstanding Principle Investor", "Total Payment", "Total Payment Investor", "Total Received Principle", "Total Received Interest",
                     "Total Received Late Fee", "Recoveries", "Collection Recovery Fee", "Last Payment Date", "Last Payment Amount", "Last Credit Pulled Date",
                     "Collections in 12 Months", "Policy Code", "Application Type", "Accounts Now Delinquent", "Chargeoffs in 12 Months", "Delinquent Amount", 
                      "Public Bankcruptcy Records", "Tax Liens", "Hardship Flag", "Disbursement Method", "Debt Settlement Flag", "Debt Settlement Flag Date",
                     "Settlement Status", "Settlement Date", "Settlement Amount", "Settlement Percentage", "Settlement Term"]
tidy_original_dict = dict(zip(variable_name_tidy, variable_name_original))

variable_names_grouped = {
    "Loan Condition": ["Loan Amount", "Funded Amount", "Funded Amount Investor", "Term", "Interest Rate", "Installment", "Grade", 
                      "Sub Grade", "Loan Status", "Issued Date","Title", "Description", "Purpose", "Application Type", "Policy Code",
                      "Initial List Status"],
    "Borrower Financial Condition": ["Employer Title", "Employment Length", "Home Ownership", "Annual Income", "Verification Status",
                     "Zip Code", "State", "Debt-to-income Ratio"],
    "Credit Situation": ["Delinquencies in 2 years", "Earliest Credit Line", "Inquiries in 6 months", "Months Since Last Delinquency",
                     "Months Since Last Public Record", "Open Accounts", "Derogatory Public Records", "Revolving Balance", 
                     "Revolving Balance Utilization", "Number of Total Accounts", "Last Credit Pulled Date", "Collections in 12 Months", 
                        "Accounts Now Delinquent", "Chargeoffs in 12 Months", "Delinquent Amount", "Public Bankcruptcy Records", "Tax Liens",
                        ],
    "Payment on Loan": ["Outstanding Principle", 
                     "Outstanding Principle Investor", "Total Payment", "Total Payment Investor", "Total Received Principle", 
                        "Total Received Interest","Total Received Late Fee", "Recoveries", "Collection Recovery Fee", "Payment Plan",
                        "Last Payment Date", "Last Payment Amount", "Hardship Flag", "Disbursement Method", "Debt Settlement Flag", 
                        "Debt Settlement Flag Date", "Settlement Status", "Settlement Date", "Settlement Amount", "Settlement Percentage", "Settlement Term"]
}
variable_names_grouped_list = []
for group in variable_names_grouped.values():
    variable_names_grouped_list.extend(group)
len(variable_names_grouped_list)

62

### 1.3 Missing Values

#### Empty Variables

The empty variables fall into one of the three categories: information on joint application, borrower's other credit burdens, and hardship plan status. We think that while both information on co-applicants and hardship status can reveal interesting patterns, only other credit burdens can add material insights to our analysis of default probability.

- Information on __joint applicants__. Our dataset only contains information on loans applied for by individuals. However, we don't think that joint application should be treated in a fundamentally different way than an individual ones, as long as we can properly aggregate information on all applicants. In future iterations where joint applications are present, the data should be processed in a similar way to this project.
- Applicant's __other credit burdens__. This includes information on the applicant's installment accounts, such as mortgage, auto-loan, and sstudent loans.<span style="color:red"> We think this information can be very helpful for predicting default in some cases. </span> Installment loans typically are collateralized (with the exception of student loans), while Lending Club loans are mostly unsecured, so defaulting on the former usually has a much severer consequence than the latter. If one has already recently defaulted on a mortgage or auto loan, he/she is very likely under significant hardship, and will default on a lending club loan as well. _As such, we recommend including this information in future data collection_.
- __Hardship plan status__. Lending Club offers borrowers three-month "hardship" plans when only a reduced installment has to be paid. Analysis on hardship plan data can shed light on some interesting questions, such as whether enrollment in the hardship plan signals stronger willingness to avoid default, should Lending Club make automatic recommendation of hardship plans to all borrower's likely to become delinquent, etc. 

We remove the empty columns before proceeding to further analysis.

In [8]:
# Check the meaning of empty variables.
dataDict.loc[list(emptyColumns), :]

Unnamed: 0_level_0,Description
Variable,Unnamed: 1_level_1
id,A unique LC assigned ID for the loan listing.
member_id,A unique LC assigned Id for the borrower member.
url,URL for the LC page with listing data.
next_pymnt_d,Next scheduled payment date
mths_since_last_major_derog,Months since most recent 90-day or worse rating
...,...
hardship_dpd,Account days past due as of the hardship plan start date
hardship_loan_status,Loan Status as of the hardship plan start date
orig_projected_additional_accrued_interest,The original projected additional interest amount that will accrue for the given hardship payment plan as of the Hardship Start Date. This field will be null if the borrower has broken their hardship payment plan.
hardship_payoff_balance_amount,The payoff balance amount as of the hardship plan start date


In [153]:
# Removing the emtpy columns, and re-order the columns according to grouping.
def get_non_empty_columns(lendingClub_df_raw):
    training_nonempty = lendingClub_df_raw[list(non_empty_columns(rawTraining))]
    training_nonempty.columns = variable_name_tidy
    training_nonempty = training_nonempty.reindex(columns=variable_names_grouped_list)
    return training_nonempty

training_nonempty = rawTraining.pipe(get_non_empty_columns)
test_nonempty = rawTest.pipe(get_non_empty_columns)


#### Missing Values in Columns

There are 19 columns that contain missing values. 

- Some are discretionary fields that the __applicant didn't fill in__. These include loan `Title`, `Description` for loan purpose, `Employer Title` and `Employment Length`. As the fact that applicant opted to omit these might contain meaningful information, we don't discard them. We simply replace them with "not provided".
- Some are due to __inconsistency in data recording__. 
    - For applicants who have no delinquencies in the last two years, `Months Since Last Delinquency` are recorded as 0 for some, and left as blank for most. We fill the blank entries with 0, keeping in mind that 0 means no delinquency record. For `Months Since Last Public Record`, we repeat the same procedure against the number of `Derogatory Public Records`.
    - All loans that have no `Last Payment Date` have been defaulted on. A closer look reveals that no regular payment was ever recieved on these loans. To ensure consitency in computation for days since the last missed payment, we fill these blanks with loan `Issued Date`.
- Some seem to be caused by __inadequecy in data collection__. `Chargeoffs in 12 Months`, `Collections in 12 Months` have values of either 0 or blank. `Revolving Balance Utilization`, `Public Bankcruptcy Records`, `Last Credit Pulled Date` and `Tax Liens` contain both 0 and blank. As we don't know whether the blank values indicate no such incidences occur or no data available, we discard them in this iteration.<span style="color:orange"> A total of 749 data points (about 2% of the training set) are removed from the training set; 1 data point is removed from the test set. </span> <span style="color:red"> We think that in future iteration, such inconsistency can be resolved through better exception handling on when referencing external databases </span> 
- At last, some are empty because the __features don't apply to the entry__. Most loans don't have a settlement plan, but for those that do, how the settlement is formualated might shed insights on how lending club might capture loss. We leave them as is.

In [154]:
# Check what non-empty columns contain missing values
missing_value_by_column = training_nonempty.isnull().sum(axis=0).to_frame()
missing_value_nonzero = missing_value_by_column[(missing_value_by_column != 0).any(axis=1)]
missing_value_nonzero.shape
missing_value_nonzero

(19, 1)

Unnamed: 0,0
Title,12
Description,11188
Employer Title,2220
Employment Length,953
Months Since Last Delinquency,22933
Months Since Last Public Record,33140
Revolving Balance Utilization,49
Last Credit Pulled Date,2
Collections in 12 Months,56
Chargeoffs in 12 Months,56


In [155]:
# Replace missing values in Title, Description, Employer Title and Employment Length with "Not Provided"
def replace_with_empty_string(lendingClub_df):
    variable_list = ['Title', 'Description', 'Employer Title', 'Employment Length']
    for i in variable_list:
        lendingClub_df[i].fillna("", inplace=True)
    return lendingClub_df

# For Months Since Last Delinquency, and Months Since Last Public Record, we fill the banks with 0,
# keeping in mind 0 means no delinquency/public records.
def fill_blank_months_with_zero(lendingClub_df):
    variable_list = ["Months Since Last Delinquency", "Months Since Last Public Record"]
    for i in variable_list:
        lendingClub_df[i].fillna(0, inplace=True)
    return lendingClub_df

# Fill Last Payment Date of loans on which no payments were ever made with the loan issuance date.
def replace_empty_last_pmt_d_with_issuance_d(lendingClub_df):
    no_last_pmt_bool = lendingClub_df["Last Payment Date"].isnull()
    (lendingClub_df["Last Payment Date"])[no_last_pmt_bool] = (lendingClub_df["Issued Date"])[no_last_pmt_bool]
    return lendingClub_df


# Discard the missing entries in Revolving Balance Utilization, Collections in 12 Months, Chargeoffs in 12 Months, 
# Public Bankcruptcy Records, Tax Liens and Last Credit Pulled Date. 749 entries are removed in the process.
def drop_no_data_entries(lendingClub_df):
    variable_list = ["Revolving Balance Utilization", "Collections in 12 Months", "Chargeoffs in 12 Months",
                    "Public Bankcruptcy Records", "Tax Liens", "Last Credit Pulled Date"]
    lendingClub_df.dropna(axis=0, how='any', subset=variable_list, inplace=True)
    return lendingClub_df

def missing_value_handling(lendingClub_df):
    lendingClub_df.pipe(replace_with_empty_string).pipe(fill_blank_months_with_zero).pipe(replace_empty_last_pmt_d_with_issuance_d).pipe(drop_no_data_entries);
    return lendingClub_df

training_nonempty.pipe(missing_value_handling)
test_nonempty.pipe(missing_value_handling)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,Loan Amount,Funded Amount,Funded Amount Investor,Term,Interest Rate,Installment,Grade,Sub Grade,Loan Status,Issued Date,Title,Description,Purpose,Application Type,Policy Code,Initial List Status,Employer Title,Employment Length,Home Ownership,Annual Income,Verification Status,Zip Code,State,Debt-to-income Ratio,Delinquencies in 2 years,Earliest Credit Line,Inquiries in 6 months,Months Since Last Delinquency,Months Since Last Public Record,Open Accounts,Derogatory Public Records,Revolving Balance,Revolving Balance Utilization,Number of Total Accounts,Last Credit Pulled Date,Collections in 12 Months,Accounts Now Delinquent,Chargeoffs in 12 Months,Delinquent Amount,Public Bankcruptcy Records,Tax Liens,Outstanding Principle,Outstanding Principle Investor,Total Payment,Total Payment Investor,Total Received Principle,Total Received Interest,Total Received Late Fee,Recoveries,Collection Recovery Fee,Payment Plan,Last Payment Date,Last Payment Amount,Hardship Flag,Disbursement Method,Debt Settlement Flag,Debt Settlement Flag Date,Settlement Status,Settlement Date,Settlement Amount,Settlement Percentage,Settlement Term
0,35000,35000,34975.0,60 months,0.1171,773.44,B,B3,Charged Off,2011-11-01,Restaurant Inventory,Borrower added on 11/03/11 > Loan proceeds will be used to partially fund asset purchase for a restaurant and to keep cash on had. Asset purchase includes restaurant equipment and leasehold improvements<br/>,small_business,Individual,1,f,US Department of Labor,10+ years,MORTGAGE,110000.00,Verified,945xx,CA,1.06,0,1971-01-01,0,0.0,0.0,10,0,4142,0.064,27,2017-07-01,0.0,0,0.0,0,0.0,0.0,0,0,11601.600000,11593.34,6926.82,4652.28,0.0,22.50,0.00,n,2013-02-01,773.44,N,Cash,N,NaT,,NaT,,,
1,9500,9500,9500.0,36 months,0.1465,327.70,C,C3,Fully Paid,2011-11-01,familyneeds my help,Borrower added on 11/01/11 > i need this money to help my family in Thailand due to flooding there...thank you<br/>,other,Individual,1,f,costco wholesales,10+ years,RENT,54000.00,Verified,334xx,FL,17.69,0,2001-05-01,1,0.0,0.0,6,0,5460,0.853,11,2018-10-01,0.0,0,0.0,0,0.0,0.0,0,0,9616.540000,9616.54,9500.00,116.54,0.0,0.00,0.00,n,2011-12-01,9616.95,N,Cash,N,NaT,,NaT,,,
2,3800,3800,3800.0,36 months,0.0751,118.23,A,A3,Charged Off,2011-11-01,Motorcycle Loan,,car,Individual,1,f,Five Guys,< 1 year,MORTGAGE,47000.00,Source Verified,132xx,NY,22.52,0,2002-06-01,3,0.0,0.0,10,0,8100,0.393,41,2012-08-01,0.0,0,0.0,0,0.0,0.0,0,0,1064.070000,1064.07,869.42,191.95,0.0,2.70,0.00,n,2012-08-01,118.23,N,Cash,N,NaT,,NaT,,,
3,12400,12400,12400.0,60 months,0.2206,342.90,F,F4,Charged Off,2011-11-01,Debt Consolidation Loan,,debt_consolidation,Individual,1,f,carmelo policaro construction,9 years,OWN,65004.00,Source Verified,077xx,NJ,6.26,0,2004-04-01,3,78.0,0.0,11,0,8990,0.775,21,2016-10-01,0.0,0,0.0,0,0.0,0.0,0,0,2127.630000,2127.63,595.32,1116.48,0.0,415.83,4.43,n,2012-04-01,342.90,N,Cash,N,NaT,,NaT,,,
4,4000,4000,4000.0,60 months,0.1727,100.00,D,D3,Charged Off,2011-11-01,Medical,,other,Individual,1,f,Tax Return Center,4 years,RENT,45000.00,Source Verified,331xx,FL,7.37,0,2003-02-01,0,50.0,0.0,10,0,4786,0.825,13,2016-10-01,0.0,0,0.0,0,0.0,0.0,0,0,829.140000,829.14,309.36,388.82,0.0,130.96,1.39,n,2012-06-01,100.00,N,Cash,N,NaT,,NaT,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35566,12000,12000,725.0,36 months,0.0901,381.66,B,B2,Fully Paid,2007-12-01,Debt Consolidation,To paydown credit cards at a more favorable rate.,credit_card,Individual,1,f,Bank of America Corp.,6 years,MORTGAGE,100671.39,Not Verified,604xx,IL,6.64,0,1979-10-01,0,62.0,115.0,16,1,7606,0.186,39,2018-06-01,0.0,0,0.0,0,1.0,0.0,0,0,12347.219878,745.98,12000.00,347.22,0.0,0.00,0.00,n,2008-05-01,11202.55,N,Cash,N,NaT,,NaT,,,
35639,12375,12375,1000.0,36 months,0.1091,404.62,C,C3,Fully Paid,2007-12-01,no credit cards for me,"Simply looking to pay off credit cards, consolidating balances, and reduce the debt carrying costs. The banking industry and it's credit card cartel are insidious. I believe history will judge this business harshly, for their monopoly position in the electronic payments space with only the appearance of competition, and for the harm their business model inflicts on a massive scale. I'm clearly a little unhappy with these folks, their policies and my recent customer service interactions with them have driven me here. I'm considering liquidating some assets just to be done with them. The ways they have come up with to gouge their customers are absolutely unbelievable. At this point I plan on being done with them as soon as possible, one way or another, and I'd much rather pay interest to individual investors than these bottom feeders.",debt_consolidation,Individual,1,f,Fullmoon Software,2 years,RENT,80000.00,Not Verified,201xx,VA,9.23,0,2000-12-01,0,0.0,103.0,4,1,13126,0.965,4,2016-10-01,0.0,0,0.0,0,1.0,0.0,0,0,14370.922249,1161.29,12375.00,1995.92,0.0,0.00,0.00,n,2010-02-01,4259.11,N,Cash,N,NaT,,NaT,,,
35653,4800,4800,1100.0,36 months,0.1028,155.52,C,C1,Fully Paid,2007-11-01,Want to pay off high intrest cards,"Need loan to pay off high intrest credit cards, so I can improve my credit score",debt_consolidation,Individual,1,f,E.E. Wine Inc,1 year,RENT,35000.00,Not Verified,226xx,VA,7.51,0,2000-03-01,0,52.0,114.0,11,1,5836,0.687,12,2008-08-01,0.0,0,0.0,0,1.0,0.0,0,0,5134.085288,1176.56,4800.00,334.09,0.0,0.00,0.00,n,2008-08-01,3891.08,N,Cash,N,NaT,,NaT,,,
35664,7000,7000,1000.0,36 months,0.1059,227.82,C,C2,Fully Paid,2007-11-01,Taking the First Step by Consolidating,"I want to pay off 3 of my credit cards with high interest and do it in a reasonable amount of time, 3 years or sooner. It would take me 10 years or more of minimum payments to accomplish this without consolidating them and paying them off in a shorter period of time. I want to be debt free by the time my son, who is 11 goes to college, so I can help him out if he needs me to with tuition and books. This gives me about 6 years to become completely debt free and this is my first real stabb at getting there. My credit is good, I pay things on time since a bankruptcy I had in 1988, due to a business partnership breaking up. This will be completely off my record in September of 2008, and I am very committed to keeping my good credit and increasing my credit score as much as possible.",debt_consolidation,Individual,1,f,,3 years,MORTGAGE,63500.00,Not Verified,853xx,AZ,8.50,0,1989-02-01,1,0.0,113.0,9,1,14930,0.790,21,2017-06-01,0.0,0,0.0,0,1.0,0.0,0,0,8174.021910,1167.72,7000.00,1174.02,0.0,0.00,0.00,n,2010-05-01,1571.29,N,Cash,N,NaT,,NaT,,,


Unnamed: 0,Loan Amount,Funded Amount,Funded Amount Investor,Term,Interest Rate,Installment,Grade,Sub Grade,Loan Status,Issued Date,Title,Description,Purpose,Application Type,Policy Code,Initial List Status,Employer Title,Employment Length,Home Ownership,Annual Income,Verification Status,Zip Code,State,Debt-to-income Ratio,Delinquencies in 2 years,Earliest Credit Line,Inquiries in 6 months,Months Since Last Delinquency,Months Since Last Public Record,Open Accounts,Derogatory Public Records,Revolving Balance,Revolving Balance Utilization,Number of Total Accounts,Last Credit Pulled Date,Collections in 12 Months,Accounts Now Delinquent,Chargeoffs in 12 Months,Delinquent Amount,Public Bankcruptcy Records,Tax Liens,Outstanding Principle,Outstanding Principle Investor,Total Payment,Total Payment Investor,Total Received Principle,Total Received Interest,Total Received Late Fee,Recoveries,Collection Recovery Fee,Payment Plan,Last Payment Date,Last Payment Amount,Hardship Flag,Disbursement Method,Debt Settlement Flag,Debt Settlement Flag Date,Settlement Status,Settlement Date,Settlement Amount,Settlement Percentage,Settlement Term
0,5000,5000,4975.0,36 months,0.1065,162.87,B,B2,Fully Paid,2011-12-01,Computer,Borrower added on 12/22/11 > I need to upgrade my business technologies.<br>,credit_card,Individual,1,f,,10+ years,RENT,24000.0,Verified,860xx,AZ,27.65,0,1985-01-01,1,0.0,0.0,3,0,13648,0.837,9,2018-10-01,0,0,0,0,0,0,0,0,5863.155187,5833.84,5000.00,863.16,0.00,0.0,0.00,n,2015-01-01,171.62,N,Cash,N,NaT,,NaT,,,
1,2500,2500,2500.0,60 months,0.1527,59.83,C,C4,Charged Off,2011-12-01,bike,Borrower added on 12/22/11 > I plan to use this money to finance the motorcycle i am looking at. I plan to have it paid off as soon as possible/when i sell my old bike. I only need this money because the deal im looking at is to good to pass up.<br><br> Borrower added on 12/22/11 > I plan to use this money to finance the motorcycle i am looking at. I plan to have it paid off as soon as possible/when i sell my old bike.I only need this money because the deal im looking at is to good to pass up. I have finished college with an associates degree in business and its takingmeplaces<br>,car,Individual,1,f,Ryder,< 1 year,RENT,30000.0,Source Verified,309xx,GA,1.00,0,1999-04-01,5,0.0,0.0,3,0,1687,0.094,4,2016-10-01,0,0,0,0,0,0,0,0,1014.530000,1014.53,456.46,435.17,0.00,122.9,1.11,n,2013-04-01,119.66,N,Cash,N,NaT,,NaT,,,
2,2400,2400,2400.0,36 months,0.1596,84.33,C,C5,Fully Paid,2011-12-01,real estate business,,small_business,Individual,1,f,,10+ years,RENT,12252.0,Not Verified,606xx,IL,8.72,0,2001-11-01,2,0.0,0.0,2,0,2956,0.985,10,2017-06-01,0,0,0,0,0,0,0,0,3005.666844,3005.67,2400.00,605.67,0.00,0.0,0.00,n,2014-06-01,649.91,N,Cash,N,NaT,,NaT,,,
3,10000,10000,10000.0,36 months,0.1349,339.31,C,C1,Fully Paid,2011-12-01,personel,"Borrower added on 12/21/11 > to pay for property tax (borrow from friend, need to pay back) & central A/C need to be replace. I'm very sorry to let my loan expired last time.<br>",other,Individual,1,f,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,917xx,CA,20.00,0,1996-02-01,1,35.0,0.0,10,0,5598,0.210,37,2016-04-01,0,0,0,0,0,0,0,0,12231.890000,12231.89,10000.00,2214.92,16.97,0.0,0.00,n,2015-01-01,357.48,N,Cash,N,NaT,,NaT,,,
4,3000,3000,3000.0,60 months,0.1269,67.79,B,B5,Fully Paid,2011-12-01,Personal,"Borrower added on 12/21/11 > I plan on combining three large interest bills together and freeing up some extra each month to pay toward other bills. I've always been a good payor but have found myself needing to make adjustments to my budget due to a medical scare. My job is very stable, I love it.<br>",other,Individual,1,f,University Medical Group,1 year,RENT,80000.0,Source Verified,972xx,OR,17.94,0,1996-01-01,0,38.0,0.0,15,0,27783,0.539,38,2018-04-01,0,0,0,0,0,0,0,0,4066.908161,4066.91,3000.00,1066.91,0.00,0.0,0.00,n,2017-01-01,67.30,N,Cash,N,NaT,,NaT,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3973,7450,7450,7450.0,36 months,0.1269,249.91,B,B5,Fully Paid,2011-11-01,wedding,,wedding,Individual,1,f,HUB International Insurance Services,1 year,MORTGAGE,95000.0,Source Verified,802xx,CO,6.61,0,1973-10-01,1,40.0,0.0,13,0,0,0.000,40,2012-06-01,0,0,0,0,0,0,0,0,7717.267751,7717.27,7450.00,267.27,0.00,0.0,0.00,n,2012-06-01,741.40,N,Cash,N,NaT,,NaT,,,
3974,22000,22000,22000.0,36 months,0.0751,684.44,A,A3,Fully Paid,2011-11-01,small_business,,small_business,Individual,1,f,Swiss Re Holding,10+ years,MORTGAGE,115000.0,Verified,640xx,MO,0.31,0,1988-06-01,0,0.0,0.0,9,0,286,0.008,23,2014-11-01,0,0,0,0,0,0,0,0,24639.753202,24639.75,22000.00,2639.75,0.00,0.0,0.00,n,2014-11-01,722.26,N,Cash,N,NaT,,NaT,,,
3975,3250,3250,3250.0,36 months,0.1349,110.28,C,C1,Fully Paid,2011-11-01,November,,credit_card,Individual,1,f,Applebees,10+ years,MORTGAGE,55000.0,Source Verified,560xx,MN,10.71,2,2000-10-01,1,10.0,0.0,17,0,5383,0.376,32,2018-01-01,0,0,0,0,0,0,0,0,3986.026361,3986.03,3250.00,721.03,15.00,0.0,0.00,n,2014-11-01,117.70,N,Cash,N,NaT,,NaT,,,
3976,10000,10000,10000.0,36 months,0.1065,325.74,B,B2,Fully Paid,2011-11-01,debt_consolidation,,debt_consolidation,Individual,1,f,Pfizer,1 year,RENT,122500.0,Verified,111xx,NY,2.88,0,2004-03-01,2,0.0,0.0,8,0,9336,0.333,14,2015-03-01,0,0,0,0,0,0,0,0,11600.366416,11600.37,10000.00,1600.37,0.00,0.0,0.00,n,2014-02-01,3145.92,N,Cash,N,NaT,,NaT,,,


In [156]:
# Check the missing values by column after cleaning
missing_value_by_column = training_nonempty.isnull().sum(axis=0).to_frame()
missing_value_nonzero = missing_value_by_column[(missing_value_by_column != 0).any(axis=1)]
missing_value_nonzero

Unnamed: 0,0
Debt Settlement Flag Date,34933
Settlement Status,34933
Settlement Date,34933
Settlement Amount,34933
Settlement Percentage,34933
Settlement Term,34933


### 1.4 Variable Transformation

#### Removing Columns With No Variation

13 columns contain only one value. 

- Some variables contain only one value because the dataset was pre-processed (possibly by our instructor) to simplify the problem. Examples include `Application Type`, `Hardship Flag`, and `Payment Plan`. `Outstanding Principle` is uniformly zero because in our dataset, a loan is either charged-off or fully paid, in either case no more princple payment is expected. 
- Some might result from incomplete record. It's hard to believe that out of more than thirty thousand applicants, no one has an `Account Now Delinquent`. The same goes for `Tax Liens`, `Chargeoffs in 12 Months` and `Collection in 12 Months`. <span style=color:red> We recommend looking into the data collection process to check for completeness of data. </span>

In [157]:
# Check which columns in the dataset contain no variation.
def columns_no_variation(lendingClub_df):
    col_names_list = list(lendingClub_df.columns)
    num_of_features = len(col_names_list)
    col_names_no_variation = []
    for i in range(num_of_features):
        if lendingClub_df.iloc[:, i].unique().size == 1:
            col_names_no_variation.append(col_names_list[i])
    return col_names_no_variation

def drop_columns_no_variation(lendingClub_df):
    lendingClub_df.drop(columns = columns_no_variation(lendingClub_df), inplace=True)
    return lendingClub_df

columns_no_variation(training_nonempty) == columns_no_variation(test_nonempty)
training_nonempty.pipe(drop_columns_no_variation)
test_nonempty.pipe(drop_columns_no_variation)
training_nonempty.shape
test_nonempty.shape

True

Unnamed: 0,Loan Amount,Funded Amount,Funded Amount Investor,Term,Interest Rate,Installment,Grade,Sub Grade,Loan Status,Issued Date,Title,Description,Purpose,Employer Title,Employment Length,Home Ownership,Annual Income,Verification Status,Zip Code,State,Debt-to-income Ratio,Delinquencies in 2 years,Earliest Credit Line,Inquiries in 6 months,Months Since Last Delinquency,Months Since Last Public Record,Open Accounts,Derogatory Public Records,Revolving Balance,Revolving Balance Utilization,Number of Total Accounts,Last Credit Pulled Date,Public Bankcruptcy Records,Total Payment,Total Payment Investor,Total Received Principle,Total Received Interest,Total Received Late Fee,Recoveries,Collection Recovery Fee,Last Payment Date,Last Payment Amount,Debt Settlement Flag,Debt Settlement Flag Date,Settlement Status,Settlement Date,Settlement Amount,Settlement Percentage,Settlement Term
0,35000,35000,34975.0,60 months,0.1171,773.44,B,B3,Charged Off,2011-11-01,Restaurant Inventory,Borrower added on 11/03/11 > Loan proceeds will be used to partially fund asset purchase for a restaurant and to keep cash on had. Asset purchase includes restaurant equipment and leasehold improvements<br/>,small_business,US Department of Labor,10+ years,MORTGAGE,110000.00,Verified,945xx,CA,1.06,0,1971-01-01,0,0.0,0.0,10,0,4142,0.064,27,2017-07-01,0.0,11601.600000,11593.34,6926.82,4652.28,0.0,22.50,0.00,2013-02-01,773.44,N,NaT,,NaT,,,
1,9500,9500,9500.0,36 months,0.1465,327.70,C,C3,Fully Paid,2011-11-01,familyneeds my help,Borrower added on 11/01/11 > i need this money to help my family in Thailand due to flooding there...thank you<br/>,other,costco wholesales,10+ years,RENT,54000.00,Verified,334xx,FL,17.69,0,2001-05-01,1,0.0,0.0,6,0,5460,0.853,11,2018-10-01,0.0,9616.540000,9616.54,9500.00,116.54,0.0,0.00,0.00,2011-12-01,9616.95,N,NaT,,NaT,,,
2,3800,3800,3800.0,36 months,0.0751,118.23,A,A3,Charged Off,2011-11-01,Motorcycle Loan,,car,Five Guys,< 1 year,MORTGAGE,47000.00,Source Verified,132xx,NY,22.52,0,2002-06-01,3,0.0,0.0,10,0,8100,0.393,41,2012-08-01,0.0,1064.070000,1064.07,869.42,191.95,0.0,2.70,0.00,2012-08-01,118.23,N,NaT,,NaT,,,
3,12400,12400,12400.0,60 months,0.2206,342.90,F,F4,Charged Off,2011-11-01,Debt Consolidation Loan,,debt_consolidation,carmelo policaro construction,9 years,OWN,65004.00,Source Verified,077xx,NJ,6.26,0,2004-04-01,3,78.0,0.0,11,0,8990,0.775,21,2016-10-01,0.0,2127.630000,2127.63,595.32,1116.48,0.0,415.83,4.43,2012-04-01,342.90,N,NaT,,NaT,,,
4,4000,4000,4000.0,60 months,0.1727,100.00,D,D3,Charged Off,2011-11-01,Medical,,other,Tax Return Center,4 years,RENT,45000.00,Source Verified,331xx,FL,7.37,0,2003-02-01,0,50.0,0.0,10,0,4786,0.825,13,2016-10-01,0.0,829.140000,829.14,309.36,388.82,0.0,130.96,1.39,2012-06-01,100.00,N,NaT,,NaT,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35566,12000,12000,725.0,36 months,0.0901,381.66,B,B2,Fully Paid,2007-12-01,Debt Consolidation,To paydown credit cards at a more favorable rate.,credit_card,Bank of America Corp.,6 years,MORTGAGE,100671.39,Not Verified,604xx,IL,6.64,0,1979-10-01,0,62.0,115.0,16,1,7606,0.186,39,2018-06-01,1.0,12347.219878,745.98,12000.00,347.22,0.0,0.00,0.00,2008-05-01,11202.55,N,NaT,,NaT,,,
35639,12375,12375,1000.0,36 months,0.1091,404.62,C,C3,Fully Paid,2007-12-01,no credit cards for me,"Simply looking to pay off credit cards, consolidating balances, and reduce the debt carrying costs. The banking industry and it's credit card cartel are insidious. I believe history will judge this business harshly, for their monopoly position in the electronic payments space with only the appearance of competition, and for the harm their business model inflicts on a massive scale. I'm clearly a little unhappy with these folks, their policies and my recent customer service interactions with them have driven me here. I'm considering liquidating some assets just to be done with them. The ways they have come up with to gouge their customers are absolutely unbelievable. At this point I plan on being done with them as soon as possible, one way or another, and I'd much rather pay interest to individual investors than these bottom feeders.",debt_consolidation,Fullmoon Software,2 years,RENT,80000.00,Not Verified,201xx,VA,9.23,0,2000-12-01,0,0.0,103.0,4,1,13126,0.965,4,2016-10-01,1.0,14370.922249,1161.29,12375.00,1995.92,0.0,0.00,0.00,2010-02-01,4259.11,N,NaT,,NaT,,,
35653,4800,4800,1100.0,36 months,0.1028,155.52,C,C1,Fully Paid,2007-11-01,Want to pay off high intrest cards,"Need loan to pay off high intrest credit cards, so I can improve my credit score",debt_consolidation,E.E. Wine Inc,1 year,RENT,35000.00,Not Verified,226xx,VA,7.51,0,2000-03-01,0,52.0,114.0,11,1,5836,0.687,12,2008-08-01,1.0,5134.085288,1176.56,4800.00,334.09,0.0,0.00,0.00,2008-08-01,3891.08,N,NaT,,NaT,,,
35664,7000,7000,1000.0,36 months,0.1059,227.82,C,C2,Fully Paid,2007-11-01,Taking the First Step by Consolidating,"I want to pay off 3 of my credit cards with high interest and do it in a reasonable amount of time, 3 years or sooner. It would take me 10 years or more of minimum payments to accomplish this without consolidating them and paying them off in a shorter period of time. I want to be debt free by the time my son, who is 11 goes to college, so I can help him out if he needs me to with tuition and books. This gives me about 6 years to become completely debt free and this is my first real stabb at getting there. My credit is good, I pay things on time since a bankruptcy I had in 1988, due to a business partnership breaking up. This will be completely off my record in September of 2008, and I am very committed to keeping my good credit and increasing my credit score as much as possible.",debt_consolidation,,3 years,MORTGAGE,63500.00,Not Verified,853xx,AZ,8.50,0,1989-02-01,1,0.0,113.0,9,1,14930,0.790,21,2017-06-01,1.0,8174.021910,1167.72,7000.00,1174.02,0.0,0.00,0.00,2010-05-01,1571.29,N,NaT,,NaT,,,


Unnamed: 0,Loan Amount,Funded Amount,Funded Amount Investor,Term,Interest Rate,Installment,Grade,Sub Grade,Loan Status,Issued Date,Title,Description,Purpose,Employer Title,Employment Length,Home Ownership,Annual Income,Verification Status,Zip Code,State,Debt-to-income Ratio,Delinquencies in 2 years,Earliest Credit Line,Inquiries in 6 months,Months Since Last Delinquency,Months Since Last Public Record,Open Accounts,Derogatory Public Records,Revolving Balance,Revolving Balance Utilization,Number of Total Accounts,Last Credit Pulled Date,Public Bankcruptcy Records,Total Payment,Total Payment Investor,Total Received Principle,Total Received Interest,Total Received Late Fee,Recoveries,Collection Recovery Fee,Last Payment Date,Last Payment Amount,Debt Settlement Flag,Debt Settlement Flag Date,Settlement Status,Settlement Date,Settlement Amount,Settlement Percentage,Settlement Term
0,5000,5000,4975.0,36 months,0.1065,162.87,B,B2,Fully Paid,2011-12-01,Computer,Borrower added on 12/22/11 > I need to upgrade my business technologies.<br>,credit_card,,10+ years,RENT,24000.0,Verified,860xx,AZ,27.65,0,1985-01-01,1,0.0,0.0,3,0,13648,0.837,9,2018-10-01,0,5863.155187,5833.84,5000.00,863.16,0.00,0.0,0.00,2015-01-01,171.62,N,NaT,,NaT,,,
1,2500,2500,2500.0,60 months,0.1527,59.83,C,C4,Charged Off,2011-12-01,bike,Borrower added on 12/22/11 > I plan to use this money to finance the motorcycle i am looking at. I plan to have it paid off as soon as possible/when i sell my old bike. I only need this money because the deal im looking at is to good to pass up.<br><br> Borrower added on 12/22/11 > I plan to use this money to finance the motorcycle i am looking at. I plan to have it paid off as soon as possible/when i sell my old bike.I only need this money because the deal im looking at is to good to pass up. I have finished college with an associates degree in business and its takingmeplaces<br>,car,Ryder,< 1 year,RENT,30000.0,Source Verified,309xx,GA,1.00,0,1999-04-01,5,0.0,0.0,3,0,1687,0.094,4,2016-10-01,0,1014.530000,1014.53,456.46,435.17,0.00,122.9,1.11,2013-04-01,119.66,N,NaT,,NaT,,,
2,2400,2400,2400.0,36 months,0.1596,84.33,C,C5,Fully Paid,2011-12-01,real estate business,,small_business,,10+ years,RENT,12252.0,Not Verified,606xx,IL,8.72,0,2001-11-01,2,0.0,0.0,2,0,2956,0.985,10,2017-06-01,0,3005.666844,3005.67,2400.00,605.67,0.00,0.0,0.00,2014-06-01,649.91,N,NaT,,NaT,,,
3,10000,10000,10000.0,36 months,0.1349,339.31,C,C1,Fully Paid,2011-12-01,personel,"Borrower added on 12/21/11 > to pay for property tax (borrow from friend, need to pay back) & central A/C need to be replace. I'm very sorry to let my loan expired last time.<br>",other,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,917xx,CA,20.00,0,1996-02-01,1,35.0,0.0,10,0,5598,0.210,37,2016-04-01,0,12231.890000,12231.89,10000.00,2214.92,16.97,0.0,0.00,2015-01-01,357.48,N,NaT,,NaT,,,
4,3000,3000,3000.0,60 months,0.1269,67.79,B,B5,Fully Paid,2011-12-01,Personal,"Borrower added on 12/21/11 > I plan on combining three large interest bills together and freeing up some extra each month to pay toward other bills. I've always been a good payor but have found myself needing to make adjustments to my budget due to a medical scare. My job is very stable, I love it.<br>",other,University Medical Group,1 year,RENT,80000.0,Source Verified,972xx,OR,17.94,0,1996-01-01,0,38.0,0.0,15,0,27783,0.539,38,2018-04-01,0,4066.908161,4066.91,3000.00,1066.91,0.00,0.0,0.00,2017-01-01,67.30,N,NaT,,NaT,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3973,7450,7450,7450.0,36 months,0.1269,249.91,B,B5,Fully Paid,2011-11-01,wedding,,wedding,HUB International Insurance Services,1 year,MORTGAGE,95000.0,Source Verified,802xx,CO,6.61,0,1973-10-01,1,40.0,0.0,13,0,0,0.000,40,2012-06-01,0,7717.267751,7717.27,7450.00,267.27,0.00,0.0,0.00,2012-06-01,741.40,N,NaT,,NaT,,,
3974,22000,22000,22000.0,36 months,0.0751,684.44,A,A3,Fully Paid,2011-11-01,small_business,,small_business,Swiss Re Holding,10+ years,MORTGAGE,115000.0,Verified,640xx,MO,0.31,0,1988-06-01,0,0.0,0.0,9,0,286,0.008,23,2014-11-01,0,24639.753202,24639.75,22000.00,2639.75,0.00,0.0,0.00,2014-11-01,722.26,N,NaT,,NaT,,,
3975,3250,3250,3250.0,36 months,0.1349,110.28,C,C1,Fully Paid,2011-11-01,November,,credit_card,Applebees,10+ years,MORTGAGE,55000.0,Source Verified,560xx,MN,10.71,2,2000-10-01,1,10.0,0.0,17,0,5383,0.376,32,2018-01-01,0,3986.026361,3986.03,3250.00,721.03,15.00,0.0,0.00,2014-11-01,117.70,N,NaT,,NaT,,,
3976,10000,10000,10000.0,36 months,0.1065,325.74,B,B2,Fully Paid,2011-11-01,debt_consolidation,,debt_consolidation,Pfizer,1 year,RENT,122500.0,Verified,111xx,NY,2.88,0,2004-03-01,2,0.0,0.0,8,0,9336,0.333,14,2015-03-01,0,11600.366416,11600.37,10000.00,1600.37,0.00,0.0,0.00,2014-02-01,3145.92,N,NaT,,NaT,,,


(35059, 49)

(3977, 49)

#### Formatting the Variables

We did the following basic manipulations:

- `Loan Status` was changed to indicators.
- `Description`: removed strings that are not meaningful.
- `Employment Length`: we checked sorting works on this column.
- `Earliest Credit Line`: we added a column that represents the length of credit history at the time of application in months.
- `Public Bankcruptcy Records`: we discovered that no `Derogatory Public Records` entry was less than `Public Bankcruptcy Records`, which very likely implies that Derogatory Records include Bankcruptcy records by definition. 

We also attempted the following more advanced operations:

- Add statewide unemployment rate at the time of issuance. 
- Add the US quarterly GDP growth at the time of issuance.
- Extract industry that the applicant works in from `Employer Title`. 

In [158]:
# Use a chargeoff indicator for Loan Status.
def chargeoff_indicator(lendingClub_df):
    defaultDict = {"Charged Off" : 1, "Fully Paid" : 0}
    lendingClub_df["Loan Status"].replace(defaultDict, inplace=True)
    return lendingClub_df

training_nonempty.pipe(chargeoff_indicator)
test_nonempty.pipe(chargeoff_indicator)

Unnamed: 0,Loan Amount,Funded Amount,Funded Amount Investor,Term,Interest Rate,Installment,Grade,Sub Grade,Loan Status,Issued Date,Title,Description,Purpose,Employer Title,Employment Length,Home Ownership,Annual Income,Verification Status,Zip Code,State,Debt-to-income Ratio,Delinquencies in 2 years,Earliest Credit Line,Inquiries in 6 months,Months Since Last Delinquency,Months Since Last Public Record,Open Accounts,Derogatory Public Records,Revolving Balance,Revolving Balance Utilization,Number of Total Accounts,Last Credit Pulled Date,Public Bankcruptcy Records,Total Payment,Total Payment Investor,Total Received Principle,Total Received Interest,Total Received Late Fee,Recoveries,Collection Recovery Fee,Last Payment Date,Last Payment Amount,Debt Settlement Flag,Debt Settlement Flag Date,Settlement Status,Settlement Date,Settlement Amount,Settlement Percentage,Settlement Term
0,35000,35000,34975.0,60 months,0.1171,773.44,B,B3,1,2011-11-01,Restaurant Inventory,Borrower added on 11/03/11 > Loan proceeds will be used to partially fund asset purchase for a restaurant and to keep cash on had. Asset purchase includes restaurant equipment and leasehold improvements<br/>,small_business,US Department of Labor,10+ years,MORTGAGE,110000.00,Verified,945xx,CA,1.06,0,1971-01-01,0,0.0,0.0,10,0,4142,0.064,27,2017-07-01,0.0,11601.600000,11593.34,6926.82,4652.28,0.0,22.50,0.00,2013-02-01,773.44,N,NaT,,NaT,,,
1,9500,9500,9500.0,36 months,0.1465,327.70,C,C3,0,2011-11-01,familyneeds my help,Borrower added on 11/01/11 > i need this money to help my family in Thailand due to flooding there...thank you<br/>,other,costco wholesales,10+ years,RENT,54000.00,Verified,334xx,FL,17.69,0,2001-05-01,1,0.0,0.0,6,0,5460,0.853,11,2018-10-01,0.0,9616.540000,9616.54,9500.00,116.54,0.0,0.00,0.00,2011-12-01,9616.95,N,NaT,,NaT,,,
2,3800,3800,3800.0,36 months,0.0751,118.23,A,A3,1,2011-11-01,Motorcycle Loan,,car,Five Guys,< 1 year,MORTGAGE,47000.00,Source Verified,132xx,NY,22.52,0,2002-06-01,3,0.0,0.0,10,0,8100,0.393,41,2012-08-01,0.0,1064.070000,1064.07,869.42,191.95,0.0,2.70,0.00,2012-08-01,118.23,N,NaT,,NaT,,,
3,12400,12400,12400.0,60 months,0.2206,342.90,F,F4,1,2011-11-01,Debt Consolidation Loan,,debt_consolidation,carmelo policaro construction,9 years,OWN,65004.00,Source Verified,077xx,NJ,6.26,0,2004-04-01,3,78.0,0.0,11,0,8990,0.775,21,2016-10-01,0.0,2127.630000,2127.63,595.32,1116.48,0.0,415.83,4.43,2012-04-01,342.90,N,NaT,,NaT,,,
4,4000,4000,4000.0,60 months,0.1727,100.00,D,D3,1,2011-11-01,Medical,,other,Tax Return Center,4 years,RENT,45000.00,Source Verified,331xx,FL,7.37,0,2003-02-01,0,50.0,0.0,10,0,4786,0.825,13,2016-10-01,0.0,829.140000,829.14,309.36,388.82,0.0,130.96,1.39,2012-06-01,100.00,N,NaT,,NaT,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35566,12000,12000,725.0,36 months,0.0901,381.66,B,B2,0,2007-12-01,Debt Consolidation,To paydown credit cards at a more favorable rate.,credit_card,Bank of America Corp.,6 years,MORTGAGE,100671.39,Not Verified,604xx,IL,6.64,0,1979-10-01,0,62.0,115.0,16,1,7606,0.186,39,2018-06-01,1.0,12347.219878,745.98,12000.00,347.22,0.0,0.00,0.00,2008-05-01,11202.55,N,NaT,,NaT,,,
35639,12375,12375,1000.0,36 months,0.1091,404.62,C,C3,0,2007-12-01,no credit cards for me,"Simply looking to pay off credit cards, consolidating balances, and reduce the debt carrying costs. The banking industry and it's credit card cartel are insidious. I believe history will judge this business harshly, for their monopoly position in the electronic payments space with only the appearance of competition, and for the harm their business model inflicts on a massive scale. I'm clearly a little unhappy with these folks, their policies and my recent customer service interactions with them have driven me here. I'm considering liquidating some assets just to be done with them. The ways they have come up with to gouge their customers are absolutely unbelievable. At this point I plan on being done with them as soon as possible, one way or another, and I'd much rather pay interest to individual investors than these bottom feeders.",debt_consolidation,Fullmoon Software,2 years,RENT,80000.00,Not Verified,201xx,VA,9.23,0,2000-12-01,0,0.0,103.0,4,1,13126,0.965,4,2016-10-01,1.0,14370.922249,1161.29,12375.00,1995.92,0.0,0.00,0.00,2010-02-01,4259.11,N,NaT,,NaT,,,
35653,4800,4800,1100.0,36 months,0.1028,155.52,C,C1,0,2007-11-01,Want to pay off high intrest cards,"Need loan to pay off high intrest credit cards, so I can improve my credit score",debt_consolidation,E.E. Wine Inc,1 year,RENT,35000.00,Not Verified,226xx,VA,7.51,0,2000-03-01,0,52.0,114.0,11,1,5836,0.687,12,2008-08-01,1.0,5134.085288,1176.56,4800.00,334.09,0.0,0.00,0.00,2008-08-01,3891.08,N,NaT,,NaT,,,
35664,7000,7000,1000.0,36 months,0.1059,227.82,C,C2,0,2007-11-01,Taking the First Step by Consolidating,"I want to pay off 3 of my credit cards with high interest and do it in a reasonable amount of time, 3 years or sooner. It would take me 10 years or more of minimum payments to accomplish this without consolidating them and paying them off in a shorter period of time. I want to be debt free by the time my son, who is 11 goes to college, so I can help him out if he needs me to with tuition and books. This gives me about 6 years to become completely debt free and this is my first real stabb at getting there. My credit is good, I pay things on time since a bankruptcy I had in 1988, due to a business partnership breaking up. This will be completely off my record in September of 2008, and I am very committed to keeping my good credit and increasing my credit score as much as possible.",debt_consolidation,,3 years,MORTGAGE,63500.00,Not Verified,853xx,AZ,8.50,0,1989-02-01,1,0.0,113.0,9,1,14930,0.790,21,2017-06-01,1.0,8174.021910,1167.72,7000.00,1174.02,0.0,0.00,0.00,2010-05-01,1571.29,N,NaT,,NaT,,,


Unnamed: 0,Loan Amount,Funded Amount,Funded Amount Investor,Term,Interest Rate,Installment,Grade,Sub Grade,Loan Status,Issued Date,Title,Description,Purpose,Employer Title,Employment Length,Home Ownership,Annual Income,Verification Status,Zip Code,State,Debt-to-income Ratio,Delinquencies in 2 years,Earliest Credit Line,Inquiries in 6 months,Months Since Last Delinquency,Months Since Last Public Record,Open Accounts,Derogatory Public Records,Revolving Balance,Revolving Balance Utilization,Number of Total Accounts,Last Credit Pulled Date,Public Bankcruptcy Records,Total Payment,Total Payment Investor,Total Received Principle,Total Received Interest,Total Received Late Fee,Recoveries,Collection Recovery Fee,Last Payment Date,Last Payment Amount,Debt Settlement Flag,Debt Settlement Flag Date,Settlement Status,Settlement Date,Settlement Amount,Settlement Percentage,Settlement Term
0,5000,5000,4975.0,36 months,0.1065,162.87,B,B2,0,2011-12-01,Computer,Borrower added on 12/22/11 > I need to upgrade my business technologies.<br>,credit_card,,10+ years,RENT,24000.0,Verified,860xx,AZ,27.65,0,1985-01-01,1,0.0,0.0,3,0,13648,0.837,9,2018-10-01,0,5863.155187,5833.84,5000.00,863.16,0.00,0.0,0.00,2015-01-01,171.62,N,NaT,,NaT,,,
1,2500,2500,2500.0,60 months,0.1527,59.83,C,C4,1,2011-12-01,bike,Borrower added on 12/22/11 > I plan to use this money to finance the motorcycle i am looking at. I plan to have it paid off as soon as possible/when i sell my old bike. I only need this money because the deal im looking at is to good to pass up.<br><br> Borrower added on 12/22/11 > I plan to use this money to finance the motorcycle i am looking at. I plan to have it paid off as soon as possible/when i sell my old bike.I only need this money because the deal im looking at is to good to pass up. I have finished college with an associates degree in business and its takingmeplaces<br>,car,Ryder,< 1 year,RENT,30000.0,Source Verified,309xx,GA,1.00,0,1999-04-01,5,0.0,0.0,3,0,1687,0.094,4,2016-10-01,0,1014.530000,1014.53,456.46,435.17,0.00,122.9,1.11,2013-04-01,119.66,N,NaT,,NaT,,,
2,2400,2400,2400.0,36 months,0.1596,84.33,C,C5,0,2011-12-01,real estate business,,small_business,,10+ years,RENT,12252.0,Not Verified,606xx,IL,8.72,0,2001-11-01,2,0.0,0.0,2,0,2956,0.985,10,2017-06-01,0,3005.666844,3005.67,2400.00,605.67,0.00,0.0,0.00,2014-06-01,649.91,N,NaT,,NaT,,,
3,10000,10000,10000.0,36 months,0.1349,339.31,C,C1,0,2011-12-01,personel,"Borrower added on 12/21/11 > to pay for property tax (borrow from friend, need to pay back) & central A/C need to be replace. I'm very sorry to let my loan expired last time.<br>",other,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,917xx,CA,20.00,0,1996-02-01,1,35.0,0.0,10,0,5598,0.210,37,2016-04-01,0,12231.890000,12231.89,10000.00,2214.92,16.97,0.0,0.00,2015-01-01,357.48,N,NaT,,NaT,,,
4,3000,3000,3000.0,60 months,0.1269,67.79,B,B5,0,2011-12-01,Personal,"Borrower added on 12/21/11 > I plan on combining three large interest bills together and freeing up some extra each month to pay toward other bills. I've always been a good payor but have found myself needing to make adjustments to my budget due to a medical scare. My job is very stable, I love it.<br>",other,University Medical Group,1 year,RENT,80000.0,Source Verified,972xx,OR,17.94,0,1996-01-01,0,38.0,0.0,15,0,27783,0.539,38,2018-04-01,0,4066.908161,4066.91,3000.00,1066.91,0.00,0.0,0.00,2017-01-01,67.30,N,NaT,,NaT,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3973,7450,7450,7450.0,36 months,0.1269,249.91,B,B5,0,2011-11-01,wedding,,wedding,HUB International Insurance Services,1 year,MORTGAGE,95000.0,Source Verified,802xx,CO,6.61,0,1973-10-01,1,40.0,0.0,13,0,0,0.000,40,2012-06-01,0,7717.267751,7717.27,7450.00,267.27,0.00,0.0,0.00,2012-06-01,741.40,N,NaT,,NaT,,,
3974,22000,22000,22000.0,36 months,0.0751,684.44,A,A3,0,2011-11-01,small_business,,small_business,Swiss Re Holding,10+ years,MORTGAGE,115000.0,Verified,640xx,MO,0.31,0,1988-06-01,0,0.0,0.0,9,0,286,0.008,23,2014-11-01,0,24639.753202,24639.75,22000.00,2639.75,0.00,0.0,0.00,2014-11-01,722.26,N,NaT,,NaT,,,
3975,3250,3250,3250.0,36 months,0.1349,110.28,C,C1,0,2011-11-01,November,,credit_card,Applebees,10+ years,MORTGAGE,55000.0,Source Verified,560xx,MN,10.71,2,2000-10-01,1,10.0,0.0,17,0,5383,0.376,32,2018-01-01,0,3986.026361,3986.03,3250.00,721.03,15.00,0.0,0.00,2014-11-01,117.70,N,NaT,,NaT,,,
3976,10000,10000,10000.0,36 months,0.1065,325.74,B,B2,0,2011-11-01,debt_consolidation,,debt_consolidation,Pfizer,1 year,RENT,122500.0,Verified,111xx,NY,2.88,0,2004-03-01,2,0.0,0.0,8,0,9336,0.333,14,2015-03-01,0,11600.366416,11600.37,10000.00,1600.37,0.00,0.0,0.00,2014-02-01,3145.92,N,NaT,,NaT,,,


In [159]:
# Remove the formatting strings from description.
def format_description(lendingClub_df):
    lendingClub_df["Description"] = lendingClub_df["Description"].str.replace(r"Borrower.* > ", "").str.replace(r"<.*>", "").str.strip().astype(str)
    return lendingClub_df

training_nonempty.pipe(format_description)
test_nonempty.pipe(format_description)

# Check if sorting works on Employment Length
training_nonempty["Employment Length"].sort_values(ascending=False)

# Add a column that represent the length of credit history at the time of issuance in months.
def add_credit_history_in_months(lendingClub_df):
    credit_history_in_months = np.round((lendingClub_df["Issued Date"] - lendingClub_df["Earliest Credit Line"])/np.timedelta64(1,'M'))
    lendingClub_df["Credit History Length in Months"] = credit_history_in_months
    return lendingClub_df

training_nonempty.pipe(add_credit_history_in_months)
test_nonempty.pipe(add_credit_history_in_months)

# Check if Derogatory Public Records contain the same information as Public Bankcruptcy Records
(training_nonempty["Derogatory Public Records"] != training_nonempty["Public Bankcruptcy Records"]).sum()
# Check if Derogatory Public Records also contain bankcruptcy.
(training_nonempty["Derogatory Public Records"] < training_nonempty["Public Bankcruptcy Records"]).sum()

Unnamed: 0,Loan Amount,Funded Amount,Funded Amount Investor,Term,Interest Rate,Installment,Grade,Sub Grade,Loan Status,Issued Date,Title,Description,Purpose,Employer Title,Employment Length,Home Ownership,Annual Income,Verification Status,Zip Code,State,Debt-to-income Ratio,Delinquencies in 2 years,Earliest Credit Line,Inquiries in 6 months,Months Since Last Delinquency,Months Since Last Public Record,Open Accounts,Derogatory Public Records,Revolving Balance,Revolving Balance Utilization,Number of Total Accounts,Last Credit Pulled Date,Public Bankcruptcy Records,Total Payment,Total Payment Investor,Total Received Principle,Total Received Interest,Total Received Late Fee,Recoveries,Collection Recovery Fee,Last Payment Date,Last Payment Amount,Debt Settlement Flag,Debt Settlement Flag Date,Settlement Status,Settlement Date,Settlement Amount,Settlement Percentage,Settlement Term
0,35000,35000,34975.0,60 months,0.1171,773.44,B,B3,1,2011-11-01,Restaurant Inventory,Loan proceeds will be used to partially fund asset purchase for a restaurant and to keep cash on had. Asset purchase includes restaurant equipment and leasehold improvements,small_business,US Department of Labor,10+ years,MORTGAGE,110000.00,Verified,945xx,CA,1.06,0,1971-01-01,0,0.0,0.0,10,0,4142,0.064,27,2017-07-01,0.0,11601.600000,11593.34,6926.82,4652.28,0.0,22.50,0.00,2013-02-01,773.44,N,NaT,,NaT,,,
1,9500,9500,9500.0,36 months,0.1465,327.70,C,C3,0,2011-11-01,familyneeds my help,i need this money to help my family in Thailand due to flooding there...thank you,other,costco wholesales,10+ years,RENT,54000.00,Verified,334xx,FL,17.69,0,2001-05-01,1,0.0,0.0,6,0,5460,0.853,11,2018-10-01,0.0,9616.540000,9616.54,9500.00,116.54,0.0,0.00,0.00,2011-12-01,9616.95,N,NaT,,NaT,,,
2,3800,3800,3800.0,36 months,0.0751,118.23,A,A3,1,2011-11-01,Motorcycle Loan,,car,Five Guys,< 1 year,MORTGAGE,47000.00,Source Verified,132xx,NY,22.52,0,2002-06-01,3,0.0,0.0,10,0,8100,0.393,41,2012-08-01,0.0,1064.070000,1064.07,869.42,191.95,0.0,2.70,0.00,2012-08-01,118.23,N,NaT,,NaT,,,
3,12400,12400,12400.0,60 months,0.2206,342.90,F,F4,1,2011-11-01,Debt Consolidation Loan,,debt_consolidation,carmelo policaro construction,9 years,OWN,65004.00,Source Verified,077xx,NJ,6.26,0,2004-04-01,3,78.0,0.0,11,0,8990,0.775,21,2016-10-01,0.0,2127.630000,2127.63,595.32,1116.48,0.0,415.83,4.43,2012-04-01,342.90,N,NaT,,NaT,,,
4,4000,4000,4000.0,60 months,0.1727,100.00,D,D3,1,2011-11-01,Medical,,other,Tax Return Center,4 years,RENT,45000.00,Source Verified,331xx,FL,7.37,0,2003-02-01,0,50.0,0.0,10,0,4786,0.825,13,2016-10-01,0.0,829.140000,829.14,309.36,388.82,0.0,130.96,1.39,2012-06-01,100.00,N,NaT,,NaT,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35566,12000,12000,725.0,36 months,0.0901,381.66,B,B2,0,2007-12-01,Debt Consolidation,To paydown credit cards at a more favorable rate.,credit_card,Bank of America Corp.,6 years,MORTGAGE,100671.39,Not Verified,604xx,IL,6.64,0,1979-10-01,0,62.0,115.0,16,1,7606,0.186,39,2018-06-01,1.0,12347.219878,745.98,12000.00,347.22,0.0,0.00,0.00,2008-05-01,11202.55,N,NaT,,NaT,,,
35639,12375,12375,1000.0,36 months,0.1091,404.62,C,C3,0,2007-12-01,no credit cards for me,"Simply looking to pay off credit cards, consolidating balances, and reduce the debt carrying costs. The banking industry and it's credit card cartel are insidious. I believe history will judge this business harshly, for their monopoly position in the electronic payments space with only the appearance of competition, and for the harm their business model inflicts on a massive scale. I'm clearly a little unhappy with these folks, their policies and my recent customer service interactions with them have driven me here. I'm considering liquidating some assets just to be done with them. The ways they have come up with to gouge their customers are absolutely unbelievable. At this point I plan on being done with them as soon as possible, one way or another, and I'd much rather pay interest to individual investors than these bottom feeders.",debt_consolidation,Fullmoon Software,2 years,RENT,80000.00,Not Verified,201xx,VA,9.23,0,2000-12-01,0,0.0,103.0,4,1,13126,0.965,4,2016-10-01,1.0,14370.922249,1161.29,12375.00,1995.92,0.0,0.00,0.00,2010-02-01,4259.11,N,NaT,,NaT,,,
35653,4800,4800,1100.0,36 months,0.1028,155.52,C,C1,0,2007-11-01,Want to pay off high intrest cards,"Need loan to pay off high intrest credit cards, so I can improve my credit score",debt_consolidation,E.E. Wine Inc,1 year,RENT,35000.00,Not Verified,226xx,VA,7.51,0,2000-03-01,0,52.0,114.0,11,1,5836,0.687,12,2008-08-01,1.0,5134.085288,1176.56,4800.00,334.09,0.0,0.00,0.00,2008-08-01,3891.08,N,NaT,,NaT,,,
35664,7000,7000,1000.0,36 months,0.1059,227.82,C,C2,0,2007-11-01,Taking the First Step by Consolidating,"I want to pay off 3 of my credit cards with high interest and do it in a reasonable amount of time, 3 years or sooner. It would take me 10 years or more of minimum payments to accomplish this without consolidating them and paying them off in a shorter period of time. I want to be debt free by the time my son, who is 11 goes to college, so I can help him out if he needs me to with tuition and books. This gives me about 6 years to become completely debt free and this is my first real stabb at getting there. My credit is good, I pay things on time since a bankruptcy I had in 1988, due to a business partnership breaking up. This will be completely off my record in September of 2008, and I am very committed to keeping my good credit and increasing my credit score as much as possible.",debt_consolidation,,3 years,MORTGAGE,63500.00,Not Verified,853xx,AZ,8.50,0,1989-02-01,1,0.0,113.0,9,1,14930,0.790,21,2017-06-01,1.0,8174.021910,1167.72,7000.00,1174.02,0.0,0.00,0.00,2010-05-01,1571.29,N,NaT,,NaT,,,


Unnamed: 0,Loan Amount,Funded Amount,Funded Amount Investor,Term,Interest Rate,Installment,Grade,Sub Grade,Loan Status,Issued Date,Title,Description,Purpose,Employer Title,Employment Length,Home Ownership,Annual Income,Verification Status,Zip Code,State,Debt-to-income Ratio,Delinquencies in 2 years,Earliest Credit Line,Inquiries in 6 months,Months Since Last Delinquency,Months Since Last Public Record,Open Accounts,Derogatory Public Records,Revolving Balance,Revolving Balance Utilization,Number of Total Accounts,Last Credit Pulled Date,Public Bankcruptcy Records,Total Payment,Total Payment Investor,Total Received Principle,Total Received Interest,Total Received Late Fee,Recoveries,Collection Recovery Fee,Last Payment Date,Last Payment Amount,Debt Settlement Flag,Debt Settlement Flag Date,Settlement Status,Settlement Date,Settlement Amount,Settlement Percentage,Settlement Term
0,5000,5000,4975.0,36 months,0.1065,162.87,B,B2,0,2011-12-01,Computer,I need to upgrade my business technologies.,credit_card,,10+ years,RENT,24000.0,Verified,860xx,AZ,27.65,0,1985-01-01,1,0.0,0.0,3,0,13648,0.837,9,2018-10-01,0,5863.155187,5833.84,5000.00,863.16,0.00,0.0,0.00,2015-01-01,171.62,N,NaT,,NaT,,,
1,2500,2500,2500.0,60 months,0.1527,59.83,C,C4,1,2011-12-01,bike,I plan to use this money to finance the motorcycle i am looking at. I plan to have it paid off as soon as possible/when i sell my old bike.I only need this money because the deal im looking at is to good to pass up. I have finished college with an associates degree in business and its takingmeplaces,car,Ryder,< 1 year,RENT,30000.0,Source Verified,309xx,GA,1.00,0,1999-04-01,5,0.0,0.0,3,0,1687,0.094,4,2016-10-01,0,1014.530000,1014.53,456.46,435.17,0.00,122.9,1.11,2013-04-01,119.66,N,NaT,,NaT,,,
2,2400,2400,2400.0,36 months,0.1596,84.33,C,C5,0,2011-12-01,real estate business,,small_business,,10+ years,RENT,12252.0,Not Verified,606xx,IL,8.72,0,2001-11-01,2,0.0,0.0,2,0,2956,0.985,10,2017-06-01,0,3005.666844,3005.67,2400.00,605.67,0.00,0.0,0.00,2014-06-01,649.91,N,NaT,,NaT,,,
3,10000,10000,10000.0,36 months,0.1349,339.31,C,C1,0,2011-12-01,personel,"to pay for property tax (borrow from friend, need to pay back) & central A/C need to be replace. I'm very sorry to let my loan expired last time.",other,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,917xx,CA,20.00,0,1996-02-01,1,35.0,0.0,10,0,5598,0.210,37,2016-04-01,0,12231.890000,12231.89,10000.00,2214.92,16.97,0.0,0.00,2015-01-01,357.48,N,NaT,,NaT,,,
4,3000,3000,3000.0,60 months,0.1269,67.79,B,B5,0,2011-12-01,Personal,"I plan on combining three large interest bills together and freeing up some extra each month to pay toward other bills. I've always been a good payor but have found myself needing to make adjustments to my budget due to a medical scare. My job is very stable, I love it.",other,University Medical Group,1 year,RENT,80000.0,Source Verified,972xx,OR,17.94,0,1996-01-01,0,38.0,0.0,15,0,27783,0.539,38,2018-04-01,0,4066.908161,4066.91,3000.00,1066.91,0.00,0.0,0.00,2017-01-01,67.30,N,NaT,,NaT,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3973,7450,7450,7450.0,36 months,0.1269,249.91,B,B5,0,2011-11-01,wedding,,wedding,HUB International Insurance Services,1 year,MORTGAGE,95000.0,Source Verified,802xx,CO,6.61,0,1973-10-01,1,40.0,0.0,13,0,0,0.000,40,2012-06-01,0,7717.267751,7717.27,7450.00,267.27,0.00,0.0,0.00,2012-06-01,741.40,N,NaT,,NaT,,,
3974,22000,22000,22000.0,36 months,0.0751,684.44,A,A3,0,2011-11-01,small_business,,small_business,Swiss Re Holding,10+ years,MORTGAGE,115000.0,Verified,640xx,MO,0.31,0,1988-06-01,0,0.0,0.0,9,0,286,0.008,23,2014-11-01,0,24639.753202,24639.75,22000.00,2639.75,0.00,0.0,0.00,2014-11-01,722.26,N,NaT,,NaT,,,
3975,3250,3250,3250.0,36 months,0.1349,110.28,C,C1,0,2011-11-01,November,,credit_card,Applebees,10+ years,MORTGAGE,55000.0,Source Verified,560xx,MN,10.71,2,2000-10-01,1,10.0,0.0,17,0,5383,0.376,32,2018-01-01,0,3986.026361,3986.03,3250.00,721.03,15.00,0.0,0.00,2014-11-01,117.70,N,NaT,,NaT,,,
3976,10000,10000,10000.0,36 months,0.1065,325.74,B,B2,0,2011-11-01,debt_consolidation,,debt_consolidation,Pfizer,1 year,RENT,122500.0,Verified,111xx,NY,2.88,0,2004-03-01,2,0.0,0.0,8,0,9336,0.333,14,2015-03-01,0,11600.366416,11600.37,10000.00,1600.37,0.00,0.0,0.00,2014-02-01,3145.92,N,NaT,,NaT,,,


17537    < 1 year
17751    < 1 year
17758    < 1 year
17757    < 1 year
17756    < 1 year
           ...   
25875            
25868            
19351            
19354            
18598            
Name: Employment Length, Length: 35059, dtype: object

Unnamed: 0,Loan Amount,Funded Amount,Funded Amount Investor,Term,Interest Rate,Installment,Grade,Sub Grade,Loan Status,Issued Date,Title,Description,Purpose,Employer Title,Employment Length,Home Ownership,Annual Income,Verification Status,Zip Code,State,Debt-to-income Ratio,Delinquencies in 2 years,Earliest Credit Line,Inquiries in 6 months,Months Since Last Delinquency,Months Since Last Public Record,Open Accounts,Derogatory Public Records,Revolving Balance,Revolving Balance Utilization,Number of Total Accounts,Last Credit Pulled Date,Public Bankcruptcy Records,Total Payment,Total Payment Investor,Total Received Principle,Total Received Interest,Total Received Late Fee,Recoveries,Collection Recovery Fee,Last Payment Date,Last Payment Amount,Debt Settlement Flag,Debt Settlement Flag Date,Settlement Status,Settlement Date,Settlement Amount,Settlement Percentage,Settlement Term,Credit History Length in Months
0,35000,35000,34975.0,60 months,0.1171,773.44,B,B3,1,2011-11-01,Restaurant Inventory,Loan proceeds will be used to partially fund asset purchase for a restaurant and to keep cash on had. Asset purchase includes restaurant equipment and leasehold improvements,small_business,US Department of Labor,10+ years,MORTGAGE,110000.00,Verified,945xx,CA,1.06,0,1971-01-01,0,0.0,0.0,10,0,4142,0.064,27,2017-07-01,0.0,11601.600000,11593.34,6926.82,4652.28,0.0,22.50,0.00,2013-02-01,773.44,N,NaT,,NaT,,,,490.0
1,9500,9500,9500.0,36 months,0.1465,327.70,C,C3,0,2011-11-01,familyneeds my help,i need this money to help my family in Thailand due to flooding there...thank you,other,costco wholesales,10+ years,RENT,54000.00,Verified,334xx,FL,17.69,0,2001-05-01,1,0.0,0.0,6,0,5460,0.853,11,2018-10-01,0.0,9616.540000,9616.54,9500.00,116.54,0.0,0.00,0.00,2011-12-01,9616.95,N,NaT,,NaT,,,,126.0
2,3800,3800,3800.0,36 months,0.0751,118.23,A,A3,1,2011-11-01,Motorcycle Loan,,car,Five Guys,< 1 year,MORTGAGE,47000.00,Source Verified,132xx,NY,22.52,0,2002-06-01,3,0.0,0.0,10,0,8100,0.393,41,2012-08-01,0.0,1064.070000,1064.07,869.42,191.95,0.0,2.70,0.00,2012-08-01,118.23,N,NaT,,NaT,,,,113.0
3,12400,12400,12400.0,60 months,0.2206,342.90,F,F4,1,2011-11-01,Debt Consolidation Loan,,debt_consolidation,carmelo policaro construction,9 years,OWN,65004.00,Source Verified,077xx,NJ,6.26,0,2004-04-01,3,78.0,0.0,11,0,8990,0.775,21,2016-10-01,0.0,2127.630000,2127.63,595.32,1116.48,0.0,415.83,4.43,2012-04-01,342.90,N,NaT,,NaT,,,,91.0
4,4000,4000,4000.0,60 months,0.1727,100.00,D,D3,1,2011-11-01,Medical,,other,Tax Return Center,4 years,RENT,45000.00,Source Verified,331xx,FL,7.37,0,2003-02-01,0,50.0,0.0,10,0,4786,0.825,13,2016-10-01,0.0,829.140000,829.14,309.36,388.82,0.0,130.96,1.39,2012-06-01,100.00,N,NaT,,NaT,,,,105.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35566,12000,12000,725.0,36 months,0.0901,381.66,B,B2,0,2007-12-01,Debt Consolidation,To paydown credit cards at a more favorable rate.,credit_card,Bank of America Corp.,6 years,MORTGAGE,100671.39,Not Verified,604xx,IL,6.64,0,1979-10-01,0,62.0,115.0,16,1,7606,0.186,39,2018-06-01,1.0,12347.219878,745.98,12000.00,347.22,0.0,0.00,0.00,2008-05-01,11202.55,N,NaT,,NaT,,,,338.0
35639,12375,12375,1000.0,36 months,0.1091,404.62,C,C3,0,2007-12-01,no credit cards for me,"Simply looking to pay off credit cards, consolidating balances, and reduce the debt carrying costs. The banking industry and it's credit card cartel are insidious. I believe history will judge this business harshly, for their monopoly position in the electronic payments space with only the appearance of competition, and for the harm their business model inflicts on a massive scale. I'm clearly a little unhappy with these folks, their policies and my recent customer service interactions with them have driven me here. I'm considering liquidating some assets just to be done with them. The ways they have come up with to gouge their customers are absolutely unbelievable. At this point I plan on being done with them as soon as possible, one way or another, and I'd much rather pay interest to individual investors than these bottom feeders.",debt_consolidation,Fullmoon Software,2 years,RENT,80000.00,Not Verified,201xx,VA,9.23,0,2000-12-01,0,0.0,103.0,4,1,13126,0.965,4,2016-10-01,1.0,14370.922249,1161.29,12375.00,1995.92,0.0,0.00,0.00,2010-02-01,4259.11,N,NaT,,NaT,,,,84.0
35653,4800,4800,1100.0,36 months,0.1028,155.52,C,C1,0,2007-11-01,Want to pay off high intrest cards,"Need loan to pay off high intrest credit cards, so I can improve my credit score",debt_consolidation,E.E. Wine Inc,1 year,RENT,35000.00,Not Verified,226xx,VA,7.51,0,2000-03-01,0,52.0,114.0,11,1,5836,0.687,12,2008-08-01,1.0,5134.085288,1176.56,4800.00,334.09,0.0,0.00,0.00,2008-08-01,3891.08,N,NaT,,NaT,,,,92.0
35664,7000,7000,1000.0,36 months,0.1059,227.82,C,C2,0,2007-11-01,Taking the First Step by Consolidating,"I want to pay off 3 of my credit cards with high interest and do it in a reasonable amount of time, 3 years or sooner. It would take me 10 years or more of minimum payments to accomplish this without consolidating them and paying them off in a shorter period of time. I want to be debt free by the time my son, who is 11 goes to college, so I can help him out if he needs me to with tuition and books. This gives me about 6 years to become completely debt free and this is my first real stabb at getting there. My credit is good, I pay things on time since a bankruptcy I had in 1988, due to a business partnership breaking up. This will be completely off my record in September of 2008, and I am very committed to keeping my good credit and increasing my credit score as much as possible.",debt_consolidation,,3 years,MORTGAGE,63500.00,Not Verified,853xx,AZ,8.50,0,1989-02-01,1,0.0,113.0,9,1,14930,0.790,21,2017-06-01,1.0,8174.021910,1167.72,7000.00,1174.02,0.0,0.00,0.00,2010-05-01,1571.29,N,NaT,,NaT,,,,225.0


Unnamed: 0,Loan Amount,Funded Amount,Funded Amount Investor,Term,Interest Rate,Installment,Grade,Sub Grade,Loan Status,Issued Date,Title,Description,Purpose,Employer Title,Employment Length,Home Ownership,Annual Income,Verification Status,Zip Code,State,Debt-to-income Ratio,Delinquencies in 2 years,Earliest Credit Line,Inquiries in 6 months,Months Since Last Delinquency,Months Since Last Public Record,Open Accounts,Derogatory Public Records,Revolving Balance,Revolving Balance Utilization,Number of Total Accounts,Last Credit Pulled Date,Public Bankcruptcy Records,Total Payment,Total Payment Investor,Total Received Principle,Total Received Interest,Total Received Late Fee,Recoveries,Collection Recovery Fee,Last Payment Date,Last Payment Amount,Debt Settlement Flag,Debt Settlement Flag Date,Settlement Status,Settlement Date,Settlement Amount,Settlement Percentage,Settlement Term,Credit History Length in Months
0,5000,5000,4975.0,36 months,0.1065,162.87,B,B2,0,2011-12-01,Computer,I need to upgrade my business technologies.,credit_card,,10+ years,RENT,24000.0,Verified,860xx,AZ,27.65,0,1985-01-01,1,0.0,0.0,3,0,13648,0.837,9,2018-10-01,0,5863.155187,5833.84,5000.00,863.16,0.00,0.0,0.00,2015-01-01,171.62,N,NaT,,NaT,,,,323.0
1,2500,2500,2500.0,60 months,0.1527,59.83,C,C4,1,2011-12-01,bike,I plan to use this money to finance the motorcycle i am looking at. I plan to have it paid off as soon as possible/when i sell my old bike.I only need this money because the deal im looking at is to good to pass up. I have finished college with an associates degree in business and its takingmeplaces,car,Ryder,< 1 year,RENT,30000.0,Source Verified,309xx,GA,1.00,0,1999-04-01,5,0.0,0.0,3,0,1687,0.094,4,2016-10-01,0,1014.530000,1014.53,456.46,435.17,0.00,122.9,1.11,2013-04-01,119.66,N,NaT,,NaT,,,,152.0
2,2400,2400,2400.0,36 months,0.1596,84.33,C,C5,0,2011-12-01,real estate business,,small_business,,10+ years,RENT,12252.0,Not Verified,606xx,IL,8.72,0,2001-11-01,2,0.0,0.0,2,0,2956,0.985,10,2017-06-01,0,3005.666844,3005.67,2400.00,605.67,0.00,0.0,0.00,2014-06-01,649.91,N,NaT,,NaT,,,,121.0
3,10000,10000,10000.0,36 months,0.1349,339.31,C,C1,0,2011-12-01,personel,"to pay for property tax (borrow from friend, need to pay back) & central A/C need to be replace. I'm very sorry to let my loan expired last time.",other,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,917xx,CA,20.00,0,1996-02-01,1,35.0,0.0,10,0,5598,0.210,37,2016-04-01,0,12231.890000,12231.89,10000.00,2214.92,16.97,0.0,0.00,2015-01-01,357.48,N,NaT,,NaT,,,,190.0
4,3000,3000,3000.0,60 months,0.1269,67.79,B,B5,0,2011-12-01,Personal,"I plan on combining three large interest bills together and freeing up some extra each month to pay toward other bills. I've always been a good payor but have found myself needing to make adjustments to my budget due to a medical scare. My job is very stable, I love it.",other,University Medical Group,1 year,RENT,80000.0,Source Verified,972xx,OR,17.94,0,1996-01-01,0,38.0,0.0,15,0,27783,0.539,38,2018-04-01,0,4066.908161,4066.91,3000.00,1066.91,0.00,0.0,0.00,2017-01-01,67.30,N,NaT,,NaT,,,,191.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3973,7450,7450,7450.0,36 months,0.1269,249.91,B,B5,0,2011-11-01,wedding,,wedding,HUB International Insurance Services,1 year,MORTGAGE,95000.0,Source Verified,802xx,CO,6.61,0,1973-10-01,1,40.0,0.0,13,0,0,0.000,40,2012-06-01,0,7717.267751,7717.27,7450.00,267.27,0.00,0.0,0.00,2012-06-01,741.40,N,NaT,,NaT,,,,457.0
3974,22000,22000,22000.0,36 months,0.0751,684.44,A,A3,0,2011-11-01,small_business,,small_business,Swiss Re Holding,10+ years,MORTGAGE,115000.0,Verified,640xx,MO,0.31,0,1988-06-01,0,0.0,0.0,9,0,286,0.008,23,2014-11-01,0,24639.753202,24639.75,22000.00,2639.75,0.00,0.0,0.00,2014-11-01,722.26,N,NaT,,NaT,,,,281.0
3975,3250,3250,3250.0,36 months,0.1349,110.28,C,C1,0,2011-11-01,November,,credit_card,Applebees,10+ years,MORTGAGE,55000.0,Source Verified,560xx,MN,10.71,2,2000-10-01,1,10.0,0.0,17,0,5383,0.376,32,2018-01-01,0,3986.026361,3986.03,3250.00,721.03,15.00,0.0,0.00,2014-11-01,117.70,N,NaT,,NaT,,,,133.0
3976,10000,10000,10000.0,36 months,0.1065,325.74,B,B2,0,2011-11-01,debt_consolidation,,debt_consolidation,Pfizer,1 year,RENT,122500.0,Verified,111xx,NY,2.88,0,2004-03-01,2,0.0,0.0,8,0,9336,0.333,14,2015-03-01,0,11600.366416,11600.37,10000.00,1600.37,0.00,0.0,0.00,2014-02-01,3145.92,N,NaT,,NaT,,,,92.0


397

0

__Adding Unemployment Data__

We consulted the Bereau of Labor Statistics website and downloaded the montly unemployment data for each state. We then added to each loan application the statewide unemployment rate at the time of issuance.

In [160]:
%%time
## Add state-specfic unemployment rate during the issuance month
unemployment = pd.read_csv("../data/unemployment.csv")
unemployment.reset_index(drop=True, inplace=True)
unemployment["date"] = pd.date_range(start='2006-01-01', end='2015-12-01', freq="MS")
unemployment.set_index("date", inplace=True)


def get_unemployment(row):
    issued_date = row['Issued Date']
    state = row['State']
    return unemployment.loc[issued_date, state]

def add_unemployment(lendingClub_df):
    unemployment = []
    for i in range(lendingClub_df.shape[0]):
        unemployment.append(get_unemployment(lendingClub_df.iloc[i, :]))
    unemployment = np.array(unemployment)
    lendingClub_df["Statewide Unemployment at Issuance"] = unemployment
    return lendingClub_df

test_nonempty.pipe(add_unemployment)
training_nonempty.pipe(add_unemployment)

Wall time: 13.5 s


Unnamed: 0,Loan Amount,Funded Amount,Funded Amount Investor,Term,Interest Rate,Installment,Grade,Sub Grade,Loan Status,Issued Date,Title,Description,Purpose,Employer Title,Employment Length,Home Ownership,Annual Income,Verification Status,Zip Code,State,Debt-to-income Ratio,Delinquencies in 2 years,Earliest Credit Line,Inquiries in 6 months,Months Since Last Delinquency,Months Since Last Public Record,Open Accounts,Derogatory Public Records,Revolving Balance,Revolving Balance Utilization,Number of Total Accounts,Last Credit Pulled Date,Public Bankcruptcy Records,Total Payment,Total Payment Investor,Total Received Principle,Total Received Interest,Total Received Late Fee,Recoveries,Collection Recovery Fee,Last Payment Date,Last Payment Amount,Debt Settlement Flag,Debt Settlement Flag Date,Settlement Status,Settlement Date,Settlement Amount,Settlement Percentage,Settlement Term,Credit History Length in Months,Statewide Unemployment at Issuance
0,35000,35000,34975.0,60 months,0.1171,773.44,B,B3,1,2011-11-01,Restaurant Inventory,Loan proceeds will be used to partially fund asset purchase for a restaurant and to keep cash on had. Asset purchase includes restaurant equipment and leasehold improvements,small_business,US Department of Labor,10+ years,MORTGAGE,110000.00,Verified,945xx,CA,1.06,0,1971-01-01,0,0.0,0.0,10,0,4142,0.064,27,2017-07-01,0.0,11601.600000,11593.34,6926.82,4652.28,0.0,22.50,0.00,2013-02-01,773.44,N,NaT,,NaT,,,,490.0,11.3
1,9500,9500,9500.0,36 months,0.1465,327.70,C,C3,0,2011-11-01,familyneeds my help,i need this money to help my family in Thailand due to flooding there...thank you,other,costco wholesales,10+ years,RENT,54000.00,Verified,334xx,FL,17.69,0,2001-05-01,1,0.0,0.0,6,0,5460,0.853,11,2018-10-01,0.0,9616.540000,9616.54,9500.00,116.54,0.0,0.00,0.00,2011-12-01,9616.95,N,NaT,,NaT,,,,126.0,9.3
2,3800,3800,3800.0,36 months,0.0751,118.23,A,A3,1,2011-11-01,Motorcycle Loan,,car,Five Guys,< 1 year,MORTGAGE,47000.00,Source Verified,132xx,NY,22.52,0,2002-06-01,3,0.0,0.0,10,0,8100,0.393,41,2012-08-01,0.0,1064.070000,1064.07,869.42,191.95,0.0,2.70,0.00,2012-08-01,118.23,N,NaT,,NaT,,,,113.0,8.5
3,12400,12400,12400.0,60 months,0.2206,342.90,F,F4,1,2011-11-01,Debt Consolidation Loan,,debt_consolidation,carmelo policaro construction,9 years,OWN,65004.00,Source Verified,077xx,NJ,6.26,0,2004-04-01,3,78.0,0.0,11,0,8990,0.775,21,2016-10-01,0.0,2127.630000,2127.63,595.32,1116.48,0.0,415.83,4.43,2012-04-01,342.90,N,NaT,,NaT,,,,91.0,9.2
4,4000,4000,4000.0,60 months,0.1727,100.00,D,D3,1,2011-11-01,Medical,,other,Tax Return Center,4 years,RENT,45000.00,Source Verified,331xx,FL,7.37,0,2003-02-01,0,50.0,0.0,10,0,4786,0.825,13,2016-10-01,0.0,829.140000,829.14,309.36,388.82,0.0,130.96,1.39,2012-06-01,100.00,N,NaT,,NaT,,,,105.0,9.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35566,12000,12000,725.0,36 months,0.0901,381.66,B,B2,0,2007-12-01,Debt Consolidation,To paydown credit cards at a more favorable rate.,credit_card,Bank of America Corp.,6 years,MORTGAGE,100671.39,Not Verified,604xx,IL,6.64,0,1979-10-01,0,62.0,115.0,16,1,7606,0.186,39,2018-06-01,1.0,12347.219878,745.98,12000.00,347.22,0.0,0.00,0.00,2008-05-01,11202.55,N,NaT,,NaT,,,,338.0,5.4
35639,12375,12375,1000.0,36 months,0.1091,404.62,C,C3,0,2007-12-01,no credit cards for me,"Simply looking to pay off credit cards, consolidating balances, and reduce the debt carrying costs. The banking industry and it's credit card cartel are insidious. I believe history will judge this business harshly, for their monopoly position in the electronic payments space with only the appearance of competition, and for the harm their business model inflicts on a massive scale. I'm clearly a little unhappy with these folks, their policies and my recent customer service interactions with them have driven me here. I'm considering liquidating some assets just to be done with them. The ways they have come up with to gouge their customers are absolutely unbelievable. At this point I plan on being done with them as soon as possible, one way or another, and I'd much rather pay interest to individual investors than these bottom feeders.",debt_consolidation,Fullmoon Software,2 years,RENT,80000.00,Not Verified,201xx,VA,9.23,0,2000-12-01,0,0.0,103.0,4,1,13126,0.965,4,2016-10-01,1.0,14370.922249,1161.29,12375.00,1995.92,0.0,0.00,0.00,2010-02-01,4259.11,N,NaT,,NaT,,,,84.0,3.3
35653,4800,4800,1100.0,36 months,0.1028,155.52,C,C1,0,2007-11-01,Want to pay off high intrest cards,"Need loan to pay off high intrest credit cards, so I can improve my credit score",debt_consolidation,E.E. Wine Inc,1 year,RENT,35000.00,Not Verified,226xx,VA,7.51,0,2000-03-01,0,52.0,114.0,11,1,5836,0.687,12,2008-08-01,1.0,5134.085288,1176.56,4800.00,334.09,0.0,0.00,0.00,2008-08-01,3891.08,N,NaT,,NaT,,,,92.0,3.3
35664,7000,7000,1000.0,36 months,0.1059,227.82,C,C2,0,2007-11-01,Taking the First Step by Consolidating,"I want to pay off 3 of my credit cards with high interest and do it in a reasonable amount of time, 3 years or sooner. It would take me 10 years or more of minimum payments to accomplish this without consolidating them and paying them off in a shorter period of time. I want to be debt free by the time my son, who is 11 goes to college, so I can help him out if he needs me to with tuition and books. This gives me about 6 years to become completely debt free and this is my first real stabb at getting there. My credit is good, I pay things on time since a bankruptcy I had in 1988, due to a business partnership breaking up. This will be completely off my record in September of 2008, and I am very committed to keeping my good credit and increasing my credit score as much as possible.",debt_consolidation,,3 years,MORTGAGE,63500.00,Not Verified,853xx,AZ,8.50,0,1989-02-01,1,0.0,113.0,9,1,14930,0.790,21,2017-06-01,1.0,8174.021910,1167.72,7000.00,1174.02,0.0,0.00,0.00,2010-05-01,1571.29,N,NaT,,NaT,,,,225.0,4.1


__Adding Quarterly GDP Growth__

We obtained our data from St.Lious Fed https://fred.stlouisfed.org/series/A191RL1Q225SBEA.

In [161]:
gdp = pd.read_csv("../data/us_quarterly_gdp_growth.csv")
dates = pd.to_datetime(gdp["DATE"])
gdp["Year"] = dates.dt.year
gdp["Quarter"] = dates.dt.quarter
gdp.drop(columns = ["DATE"], inplace=True)
gdp.columns = ["GDP Quarterly Growth at Issuance", "Year", "Quarter"]

def add_gdp(lendingClub_df):
    lendingClub_df["Year"] = lendingClub_df["Issued Date"].dt.year
    lendingClub_df["Quarter"] = lendingClub_df["Issued Date"].dt.quarter
    lendingClub_df = lendingClub_df.merge(gdp, on=["Year", "Quarter"], how='left')
    lendingClub_df.drop(columns=["Year", "Quarter"], inplace=True)
    return lendingClub_df

training_nonempty.pipe(add_gdp)
test_nonempty.pipe(add_gdp)

Unnamed: 0,Loan Amount,Funded Amount,Funded Amount Investor,Term,Interest Rate,Installment,Grade,Sub Grade,Loan Status,Issued Date,Title,Description,Purpose,Employer Title,Employment Length,Home Ownership,Annual Income,Verification Status,Zip Code,State,Debt-to-income Ratio,Delinquencies in 2 years,Earliest Credit Line,Inquiries in 6 months,Months Since Last Delinquency,Months Since Last Public Record,Open Accounts,Derogatory Public Records,Revolving Balance,Revolving Balance Utilization,Number of Total Accounts,Last Credit Pulled Date,Public Bankcruptcy Records,Total Payment,Total Payment Investor,Total Received Principle,Total Received Interest,Total Received Late Fee,Recoveries,Collection Recovery Fee,Last Payment Date,Last Payment Amount,Debt Settlement Flag,Debt Settlement Flag Date,Settlement Status,Settlement Date,Settlement Amount,Settlement Percentage,Settlement Term,Credit History Length in Months,Statewide Unemployment at Issuance,GDP Quarterly Growth at Issuance
0,35000,35000,34975.0,60 months,0.1171,773.44,B,B3,1,2011-11-01,Restaurant Inventory,Loan proceeds will be used to partially fund asset purchase for a restaurant and to keep cash on had. Asset purchase includes restaurant equipment and leasehold improvements,small_business,US Department of Labor,10+ years,MORTGAGE,110000.00,Verified,945xx,CA,1.06,0,1971-01-01,0,0.0,0.0,10,0,4142,0.064,27,2017-07-01,0.0,11601.600000,11593.34,6926.82,4652.28,0.0,22.50,0.00,2013-02-01,773.44,N,NaT,,NaT,,,,490.0,11.3,4.7
1,9500,9500,9500.0,36 months,0.1465,327.70,C,C3,0,2011-11-01,familyneeds my help,i need this money to help my family in Thailand due to flooding there...thank you,other,costco wholesales,10+ years,RENT,54000.00,Verified,334xx,FL,17.69,0,2001-05-01,1,0.0,0.0,6,0,5460,0.853,11,2018-10-01,0.0,9616.540000,9616.54,9500.00,116.54,0.0,0.00,0.00,2011-12-01,9616.95,N,NaT,,NaT,,,,126.0,9.3,4.7
2,3800,3800,3800.0,36 months,0.0751,118.23,A,A3,1,2011-11-01,Motorcycle Loan,,car,Five Guys,< 1 year,MORTGAGE,47000.00,Source Verified,132xx,NY,22.52,0,2002-06-01,3,0.0,0.0,10,0,8100,0.393,41,2012-08-01,0.0,1064.070000,1064.07,869.42,191.95,0.0,2.70,0.00,2012-08-01,118.23,N,NaT,,NaT,,,,113.0,8.5,4.7
3,12400,12400,12400.0,60 months,0.2206,342.90,F,F4,1,2011-11-01,Debt Consolidation Loan,,debt_consolidation,carmelo policaro construction,9 years,OWN,65004.00,Source Verified,077xx,NJ,6.26,0,2004-04-01,3,78.0,0.0,11,0,8990,0.775,21,2016-10-01,0.0,2127.630000,2127.63,595.32,1116.48,0.0,415.83,4.43,2012-04-01,342.90,N,NaT,,NaT,,,,91.0,9.2,4.7
4,4000,4000,4000.0,60 months,0.1727,100.00,D,D3,1,2011-11-01,Medical,,other,Tax Return Center,4 years,RENT,45000.00,Source Verified,331xx,FL,7.37,0,2003-02-01,0,50.0,0.0,10,0,4786,0.825,13,2016-10-01,0.0,829.140000,829.14,309.36,388.82,0.0,130.96,1.39,2012-06-01,100.00,N,NaT,,NaT,,,,105.0,9.3,4.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35054,12000,12000,725.0,36 months,0.0901,381.66,B,B2,0,2007-12-01,Debt Consolidation,To paydown credit cards at a more favorable rate.,credit_card,Bank of America Corp.,6 years,MORTGAGE,100671.39,Not Verified,604xx,IL,6.64,0,1979-10-01,0,62.0,115.0,16,1,7606,0.186,39,2018-06-01,1.0,12347.219878,745.98,12000.00,347.22,0.0,0.00,0.00,2008-05-01,11202.55,N,NaT,,NaT,,,,338.0,5.4,2.5
35055,12375,12375,1000.0,36 months,0.1091,404.62,C,C3,0,2007-12-01,no credit cards for me,"Simply looking to pay off credit cards, consolidating balances, and reduce the debt carrying costs. The banking industry and it's credit card cartel are insidious. I believe history will judge this business harshly, for their monopoly position in the electronic payments space with only the appearance of competition, and for the harm their business model inflicts on a massive scale. I'm clearly a little unhappy with these folks, their policies and my recent customer service interactions with them have driven me here. I'm considering liquidating some assets just to be done with them. The ways they have come up with to gouge their customers are absolutely unbelievable. At this point I plan on being done with them as soon as possible, one way or another, and I'd much rather pay interest to individual investors than these bottom feeders.",debt_consolidation,Fullmoon Software,2 years,RENT,80000.00,Not Verified,201xx,VA,9.23,0,2000-12-01,0,0.0,103.0,4,1,13126,0.965,4,2016-10-01,1.0,14370.922249,1161.29,12375.00,1995.92,0.0,0.00,0.00,2010-02-01,4259.11,N,NaT,,NaT,,,,84.0,3.3,2.5
35056,4800,4800,1100.0,36 months,0.1028,155.52,C,C1,0,2007-11-01,Want to pay off high intrest cards,"Need loan to pay off high intrest credit cards, so I can improve my credit score",debt_consolidation,E.E. Wine Inc,1 year,RENT,35000.00,Not Verified,226xx,VA,7.51,0,2000-03-01,0,52.0,114.0,11,1,5836,0.687,12,2008-08-01,1.0,5134.085288,1176.56,4800.00,334.09,0.0,0.00,0.00,2008-08-01,3891.08,N,NaT,,NaT,,,,92.0,3.3,2.5
35057,7000,7000,1000.0,36 months,0.1059,227.82,C,C2,0,2007-11-01,Taking the First Step by Consolidating,"I want to pay off 3 of my credit cards with high interest and do it in a reasonable amount of time, 3 years or sooner. It would take me 10 years or more of minimum payments to accomplish this without consolidating them and paying them off in a shorter period of time. I want to be debt free by the time my son, who is 11 goes to college, so I can help him out if he needs me to with tuition and books. This gives me about 6 years to become completely debt free and this is my first real stabb at getting there. My credit is good, I pay things on time since a bankruptcy I had in 1988, due to a business partnership breaking up. This will be completely off my record in September of 2008, and I am very committed to keeping my good credit and increasing my credit score as much as possible.",debt_consolidation,,3 years,MORTGAGE,63500.00,Not Verified,853xx,AZ,8.50,0,1989-02-01,1,0.0,113.0,9,1,14930,0.790,21,2017-06-01,1.0,8174.021910,1167.72,7000.00,1174.02,0.0,0.00,0.00,2010-05-01,1571.29,N,NaT,,NaT,,,,225.0,4.1,2.5


Unnamed: 0,Loan Amount,Funded Amount,Funded Amount Investor,Term,Interest Rate,Installment,Grade,Sub Grade,Loan Status,Issued Date,Title,Description,Purpose,Employer Title,Employment Length,Home Ownership,Annual Income,Verification Status,Zip Code,State,Debt-to-income Ratio,Delinquencies in 2 years,Earliest Credit Line,Inquiries in 6 months,Months Since Last Delinquency,Months Since Last Public Record,Open Accounts,Derogatory Public Records,Revolving Balance,Revolving Balance Utilization,Number of Total Accounts,Last Credit Pulled Date,Public Bankcruptcy Records,Total Payment,Total Payment Investor,Total Received Principle,Total Received Interest,Total Received Late Fee,Recoveries,Collection Recovery Fee,Last Payment Date,Last Payment Amount,Debt Settlement Flag,Debt Settlement Flag Date,Settlement Status,Settlement Date,Settlement Amount,Settlement Percentage,Settlement Term,Credit History Length in Months,Statewide Unemployment at Issuance,GDP Quarterly Growth at Issuance
0,5000,5000,4975.0,36 months,0.1065,162.87,B,B2,0,2011-12-01,Computer,I need to upgrade my business technologies.,credit_card,,10+ years,RENT,24000.0,Verified,860xx,AZ,27.65,0,1985-01-01,1,0.0,0.0,3,0,13648,0.837,9,2018-10-01,0,5863.155187,5833.84,5000.00,863.16,0.00,0.0,0.00,2015-01-01,171.62,N,NaT,,NaT,,,,323.0,8.8,4.7
1,2500,2500,2500.0,60 months,0.1527,59.83,C,C4,1,2011-12-01,bike,I plan to use this money to finance the motorcycle i am looking at. I plan to have it paid off as soon as possible/when i sell my old bike.I only need this money because the deal im looking at is to good to pass up. I have finished college with an associates degree in business and its takingmeplaces,car,Ryder,< 1 year,RENT,30000.0,Source Verified,309xx,GA,1.00,0,1999-04-01,5,0.0,0.0,3,0,1687,0.094,4,2016-10-01,0,1014.530000,1014.53,456.46,435.17,0.00,122.9,1.11,2013-04-01,119.66,N,NaT,,NaT,,,,152.0,9.8,4.7
2,2400,2400,2400.0,36 months,0.1596,84.33,C,C5,0,2011-12-01,real estate business,,small_business,,10+ years,RENT,12252.0,Not Verified,606xx,IL,8.72,0,2001-11-01,2,0.0,0.0,2,0,2956,0.985,10,2017-06-01,0,3005.666844,3005.67,2400.00,605.67,0.00,0.0,0.00,2014-06-01,649.91,N,NaT,,NaT,,,,121.0,9.4,4.7
3,10000,10000,10000.0,36 months,0.1349,339.31,C,C1,0,2011-12-01,personel,"to pay for property tax (borrow from friend, need to pay back) & central A/C need to be replace. I'm very sorry to let my loan expired last time.",other,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,917xx,CA,20.00,0,1996-02-01,1,35.0,0.0,10,0,5598,0.210,37,2016-04-01,0,12231.890000,12231.89,10000.00,2214.92,16.97,0.0,0.00,2015-01-01,357.48,N,NaT,,NaT,,,,190.0,11.2,4.7
4,3000,3000,3000.0,60 months,0.1269,67.79,B,B5,0,2011-12-01,Personal,"I plan on combining three large interest bills together and freeing up some extra each month to pay toward other bills. I've always been a good payor but have found myself needing to make adjustments to my budget due to a medical scare. My job is very stable, I love it.",other,University Medical Group,1 year,RENT,80000.0,Source Verified,972xx,OR,17.94,0,1996-01-01,0,38.0,0.0,15,0,27783,0.539,38,2018-04-01,0,4066.908161,4066.91,3000.00,1066.91,0.00,0.0,0.00,2017-01-01,67.30,N,NaT,,NaT,,,,191.0,9.2,4.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3972,7450,7450,7450.0,36 months,0.1269,249.91,B,B5,0,2011-11-01,wedding,,wedding,HUB International Insurance Services,1 year,MORTGAGE,95000.0,Source Verified,802xx,CO,6.61,0,1973-10-01,1,40.0,0.0,13,0,0,0.000,40,2012-06-01,0,7717.267751,7717.27,7450.00,267.27,0.00,0.0,0.00,2012-06-01,741.40,N,NaT,,NaT,,,,457.0,8.2,4.7
3973,22000,22000,22000.0,36 months,0.0751,684.44,A,A3,0,2011-11-01,small_business,,small_business,Swiss Re Holding,10+ years,MORTGAGE,115000.0,Verified,640xx,MO,0.31,0,1988-06-01,0,0.0,0.0,9,0,286,0.008,23,2014-11-01,0,24639.753202,24639.75,22000.00,2639.75,0.00,0.0,0.00,2014-11-01,722.26,N,NaT,,NaT,,,,281.0,7.8,4.7
3974,3250,3250,3250.0,36 months,0.1349,110.28,C,C1,0,2011-11-01,November,,credit_card,Applebees,10+ years,MORTGAGE,55000.0,Source Verified,560xx,MN,10.71,2,2000-10-01,1,10.0,0.0,17,0,5383,0.376,32,2018-01-01,0,3986.026361,3986.03,3250.00,721.03,15.00,0.0,0.00,2014-11-01,117.70,N,NaT,,NaT,,,,133.0,6.0,4.7
3975,10000,10000,10000.0,36 months,0.1065,325.74,B,B2,0,2011-11-01,debt_consolidation,,debt_consolidation,Pfizer,1 year,RENT,122500.0,Verified,111xx,NY,2.88,0,2004-03-01,2,0.0,0.0,8,0,9336,0.333,14,2015-03-01,0,11600.366416,11600.37,10000.00,1600.37,0.00,0.0,0.00,2014-02-01,3145.92,N,NaT,,NaT,,,,92.0,8.5,4.7


__Adding Industry Section__

We consulted a NASDAQ database (https://public.opendatasoft.com/explore/dataset/us-companies-names-industries/export/) for mapping from company name to industry, and added each employer's industry based on a `Employer Title` name search. 

The result is less than ideal -- out more than 35000 samples in the training set, only 4800+ entries found their industry. Is this because there aren't that many applicants who work for publicly listed companies? According to a WSJ estimate(https://www.nysscpa.org/news/publications/the-trusted-professional/article/more-americans-work-at-big-firms-than-small-ones-040717), in 2014, small companies (used as a proxy for private companies) employ about 1.5 times as many people as large companies. 

The distortion in our data is clearly larger, which suggests our text processing on `Employer Title` might have been too coarse. For this project, we will move forward with the current result; <span style=color:red> for future iterations, we recommend doing keyword extraction on Employer Title before proceeding to search. </span>

In [162]:
## Adding industry according to company title
company_industry = pd.read_excel("../data/us_company_names_industries.xlsx")
company_industry.columns = ["Name", "Industry"]

In [163]:
# Remove punctuations and NAs from Employer Title
def format_emp_title(lendingClub_df):
    mask = (lendingClub_df["Employer Title"] == "")
    (lendingClub_df["Employer Title"])[mask] = "NOT PROVIDED"
    lendingClub_df["Employer Title"] = lendingClub_df["Employer Title"].str.replace('[{}]'.format(string.punctuation), "")
    lendingClub_df["Employer Title"].fillna("NOT PROVIDED", inplace=True)
    return lendingClub_df

training_nonempty.pipe(format_emp_title)
test_nonempty.pipe(format_emp_title)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


Unnamed: 0,Loan Amount,Funded Amount,Funded Amount Investor,Term,Interest Rate,Installment,Grade,Sub Grade,Loan Status,Issued Date,Title,Description,Purpose,Employer Title,Employment Length,Home Ownership,Annual Income,Verification Status,Zip Code,State,Debt-to-income Ratio,Delinquencies in 2 years,Earliest Credit Line,Inquiries in 6 months,Months Since Last Delinquency,Months Since Last Public Record,Open Accounts,Derogatory Public Records,Revolving Balance,Revolving Balance Utilization,Number of Total Accounts,Last Credit Pulled Date,Public Bankcruptcy Records,Total Payment,Total Payment Investor,Total Received Principle,Total Received Interest,Total Received Late Fee,Recoveries,Collection Recovery Fee,Last Payment Date,Last Payment Amount,Debt Settlement Flag,Debt Settlement Flag Date,Settlement Status,Settlement Date,Settlement Amount,Settlement Percentage,Settlement Term,Credit History Length in Months,Statewide Unemployment at Issuance,Year,Quarter
0,35000,35000,34975.0,60 months,0.1171,773.44,B,B3,1,2011-11-01,Restaurant Inventory,Loan proceeds will be used to partially fund asset purchase for a restaurant and to keep cash on had. Asset purchase includes restaurant equipment and leasehold improvements,small_business,US Department of Labor,10+ years,MORTGAGE,110000.00,Verified,945xx,CA,1.06,0,1971-01-01,0,0.0,0.0,10,0,4142,0.064,27,2017-07-01,0.0,11601.600000,11593.34,6926.82,4652.28,0.0,22.50,0.00,2013-02-01,773.44,N,NaT,,NaT,,,,490.0,11.3,2011,4
1,9500,9500,9500.0,36 months,0.1465,327.70,C,C3,0,2011-11-01,familyneeds my help,i need this money to help my family in Thailand due to flooding there...thank you,other,costco wholesales,10+ years,RENT,54000.00,Verified,334xx,FL,17.69,0,2001-05-01,1,0.0,0.0,6,0,5460,0.853,11,2018-10-01,0.0,9616.540000,9616.54,9500.00,116.54,0.0,0.00,0.00,2011-12-01,9616.95,N,NaT,,NaT,,,,126.0,9.3,2011,4
2,3800,3800,3800.0,36 months,0.0751,118.23,A,A3,1,2011-11-01,Motorcycle Loan,,car,Five Guys,< 1 year,MORTGAGE,47000.00,Source Verified,132xx,NY,22.52,0,2002-06-01,3,0.0,0.0,10,0,8100,0.393,41,2012-08-01,0.0,1064.070000,1064.07,869.42,191.95,0.0,2.70,0.00,2012-08-01,118.23,N,NaT,,NaT,,,,113.0,8.5,2011,4
3,12400,12400,12400.0,60 months,0.2206,342.90,F,F4,1,2011-11-01,Debt Consolidation Loan,,debt_consolidation,carmelo policaro construction,9 years,OWN,65004.00,Source Verified,077xx,NJ,6.26,0,2004-04-01,3,78.0,0.0,11,0,8990,0.775,21,2016-10-01,0.0,2127.630000,2127.63,595.32,1116.48,0.0,415.83,4.43,2012-04-01,342.90,N,NaT,,NaT,,,,91.0,9.2,2011,4
4,4000,4000,4000.0,60 months,0.1727,100.00,D,D3,1,2011-11-01,Medical,,other,Tax Return Center,4 years,RENT,45000.00,Source Verified,331xx,FL,7.37,0,2003-02-01,0,50.0,0.0,10,0,4786,0.825,13,2016-10-01,0.0,829.140000,829.14,309.36,388.82,0.0,130.96,1.39,2012-06-01,100.00,N,NaT,,NaT,,,,105.0,9.3,2011,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35566,12000,12000,725.0,36 months,0.0901,381.66,B,B2,0,2007-12-01,Debt Consolidation,To paydown credit cards at a more favorable rate.,credit_card,Bank of America Corp,6 years,MORTGAGE,100671.39,Not Verified,604xx,IL,6.64,0,1979-10-01,0,62.0,115.0,16,1,7606,0.186,39,2018-06-01,1.0,12347.219878,745.98,12000.00,347.22,0.0,0.00,0.00,2008-05-01,11202.55,N,NaT,,NaT,,,,338.0,5.4,2007,4
35639,12375,12375,1000.0,36 months,0.1091,404.62,C,C3,0,2007-12-01,no credit cards for me,"Simply looking to pay off credit cards, consolidating balances, and reduce the debt carrying costs. The banking industry and it's credit card cartel are insidious. I believe history will judge this business harshly, for their monopoly position in the electronic payments space with only the appearance of competition, and for the harm their business model inflicts on a massive scale. I'm clearly a little unhappy with these folks, their policies and my recent customer service interactions with them have driven me here. I'm considering liquidating some assets just to be done with them. The ways they have come up with to gouge their customers are absolutely unbelievable. At this point I plan on being done with them as soon as possible, one way or another, and I'd much rather pay interest to individual investors than these bottom feeders.",debt_consolidation,Fullmoon Software,2 years,RENT,80000.00,Not Verified,201xx,VA,9.23,0,2000-12-01,0,0.0,103.0,4,1,13126,0.965,4,2016-10-01,1.0,14370.922249,1161.29,12375.00,1995.92,0.0,0.00,0.00,2010-02-01,4259.11,N,NaT,,NaT,,,,84.0,3.3,2007,4
35653,4800,4800,1100.0,36 months,0.1028,155.52,C,C1,0,2007-11-01,Want to pay off high intrest cards,"Need loan to pay off high intrest credit cards, so I can improve my credit score",debt_consolidation,EE Wine Inc,1 year,RENT,35000.00,Not Verified,226xx,VA,7.51,0,2000-03-01,0,52.0,114.0,11,1,5836,0.687,12,2008-08-01,1.0,5134.085288,1176.56,4800.00,334.09,0.0,0.00,0.00,2008-08-01,3891.08,N,NaT,,NaT,,,,92.0,3.3,2007,4
35664,7000,7000,1000.0,36 months,0.1059,227.82,C,C2,0,2007-11-01,Taking the First Step by Consolidating,"I want to pay off 3 of my credit cards with high interest and do it in a reasonable amount of time, 3 years or sooner. It would take me 10 years or more of minimum payments to accomplish this without consolidating them and paying them off in a shorter period of time. I want to be debt free by the time my son, who is 11 goes to college, so I can help him out if he needs me to with tuition and books. This gives me about 6 years to become completely debt free and this is my first real stabb at getting there. My credit is good, I pay things on time since a bankruptcy I had in 1988, due to a business partnership breaking up. This will be completely off my record in September of 2008, and I am very committed to keeping my good credit and increasing my credit score as much as possible.",debt_consolidation,NOT PROVIDED,3 years,MORTGAGE,63500.00,Not Verified,853xx,AZ,8.50,0,1989-02-01,1,0.0,113.0,9,1,14930,0.790,21,2017-06-01,1.0,8174.021910,1167.72,7000.00,1174.02,0.0,0.00,0.00,2010-05-01,1571.29,N,NaT,,NaT,,,,225.0,4.1,2007,4


Unnamed: 0,Loan Amount,Funded Amount,Funded Amount Investor,Term,Interest Rate,Installment,Grade,Sub Grade,Loan Status,Issued Date,Title,Description,Purpose,Employer Title,Employment Length,Home Ownership,Annual Income,Verification Status,Zip Code,State,Debt-to-income Ratio,Delinquencies in 2 years,Earliest Credit Line,Inquiries in 6 months,Months Since Last Delinquency,Months Since Last Public Record,Open Accounts,Derogatory Public Records,Revolving Balance,Revolving Balance Utilization,Number of Total Accounts,Last Credit Pulled Date,Public Bankcruptcy Records,Total Payment,Total Payment Investor,Total Received Principle,Total Received Interest,Total Received Late Fee,Recoveries,Collection Recovery Fee,Last Payment Date,Last Payment Amount,Debt Settlement Flag,Debt Settlement Flag Date,Settlement Status,Settlement Date,Settlement Amount,Settlement Percentage,Settlement Term,Credit History Length in Months,Statewide Unemployment at Issuance,Year,Quarter
0,5000,5000,4975.0,36 months,0.1065,162.87,B,B2,0,2011-12-01,Computer,I need to upgrade my business technologies.,credit_card,NOT PROVIDED,10+ years,RENT,24000.0,Verified,860xx,AZ,27.65,0,1985-01-01,1,0.0,0.0,3,0,13648,0.837,9,2018-10-01,0,5863.155187,5833.84,5000.00,863.16,0.00,0.0,0.00,2015-01-01,171.62,N,NaT,,NaT,,,,323.0,8.8,2011,4
1,2500,2500,2500.0,60 months,0.1527,59.83,C,C4,1,2011-12-01,bike,I plan to use this money to finance the motorcycle i am looking at. I plan to have it paid off as soon as possible/when i sell my old bike.I only need this money because the deal im looking at is to good to pass up. I have finished college with an associates degree in business and its takingmeplaces,car,Ryder,< 1 year,RENT,30000.0,Source Verified,309xx,GA,1.00,0,1999-04-01,5,0.0,0.0,3,0,1687,0.094,4,2016-10-01,0,1014.530000,1014.53,456.46,435.17,0.00,122.9,1.11,2013-04-01,119.66,N,NaT,,NaT,,,,152.0,9.8,2011,4
2,2400,2400,2400.0,36 months,0.1596,84.33,C,C5,0,2011-12-01,real estate business,,small_business,NOT PROVIDED,10+ years,RENT,12252.0,Not Verified,606xx,IL,8.72,0,2001-11-01,2,0.0,0.0,2,0,2956,0.985,10,2017-06-01,0,3005.666844,3005.67,2400.00,605.67,0.00,0.0,0.00,2014-06-01,649.91,N,NaT,,NaT,,,,121.0,9.4,2011,4
3,10000,10000,10000.0,36 months,0.1349,339.31,C,C1,0,2011-12-01,personel,"to pay for property tax (borrow from friend, need to pay back) & central A/C need to be replace. I'm very sorry to let my loan expired last time.",other,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,917xx,CA,20.00,0,1996-02-01,1,35.0,0.0,10,0,5598,0.210,37,2016-04-01,0,12231.890000,12231.89,10000.00,2214.92,16.97,0.0,0.00,2015-01-01,357.48,N,NaT,,NaT,,,,190.0,11.2,2011,4
4,3000,3000,3000.0,60 months,0.1269,67.79,B,B5,0,2011-12-01,Personal,"I plan on combining three large interest bills together and freeing up some extra each month to pay toward other bills. I've always been a good payor but have found myself needing to make adjustments to my budget due to a medical scare. My job is very stable, I love it.",other,University Medical Group,1 year,RENT,80000.0,Source Verified,972xx,OR,17.94,0,1996-01-01,0,38.0,0.0,15,0,27783,0.539,38,2018-04-01,0,4066.908161,4066.91,3000.00,1066.91,0.00,0.0,0.00,2017-01-01,67.30,N,NaT,,NaT,,,,191.0,9.2,2011,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3973,7450,7450,7450.0,36 months,0.1269,249.91,B,B5,0,2011-11-01,wedding,,wedding,HUB International Insurance Services,1 year,MORTGAGE,95000.0,Source Verified,802xx,CO,6.61,0,1973-10-01,1,40.0,0.0,13,0,0,0.000,40,2012-06-01,0,7717.267751,7717.27,7450.00,267.27,0.00,0.0,0.00,2012-06-01,741.40,N,NaT,,NaT,,,,457.0,8.2,2011,4
3974,22000,22000,22000.0,36 months,0.0751,684.44,A,A3,0,2011-11-01,small_business,,small_business,Swiss Re Holding,10+ years,MORTGAGE,115000.0,Verified,640xx,MO,0.31,0,1988-06-01,0,0.0,0.0,9,0,286,0.008,23,2014-11-01,0,24639.753202,24639.75,22000.00,2639.75,0.00,0.0,0.00,2014-11-01,722.26,N,NaT,,NaT,,,,281.0,7.8,2011,4
3975,3250,3250,3250.0,36 months,0.1349,110.28,C,C1,0,2011-11-01,November,,credit_card,Applebees,10+ years,MORTGAGE,55000.0,Source Verified,560xx,MN,10.71,2,2000-10-01,1,10.0,0.0,17,0,5383,0.376,32,2018-01-01,0,3986.026361,3986.03,3250.00,721.03,15.00,0.0,0.00,2014-11-01,117.70,N,NaT,,NaT,,,,133.0,6.0,2011,4
3976,10000,10000,10000.0,36 months,0.1065,325.74,B,B2,0,2011-11-01,debt_consolidation,,debt_consolidation,Pfizer,1 year,RENT,122500.0,Verified,111xx,NY,2.88,0,2004-03-01,2,0.0,0.0,8,0,9336,0.333,14,2015-03-01,0,11600.366416,11600.37,10000.00,1600.37,0.00,0.0,0.00,2014-02-01,3145.92,N,NaT,,NaT,,,,92.0,8.5,2011,4


In [106]:
%%time
## This chunk will take approximately 13 minutes...
def index_of_first_true(lst):
    for i, v in enumerate(lst):
        if v is True:
            return i
    return -1

def get_industry(row):
    emp_title = row["Employer Title"]
    bools = company_industry["Name"].str.contains(emp_title, case=False)
    first_true = index_of_first_true(bools)
    if first_true == -1:
        return "Not Listed"
    else:
        return company_industry.loc[first_true, "Industry"]

def add_industry(lendingClub_df):
    industry = []
    for i in range(lendingClub_df.shape[0]):
        industry.append(get_industry(lendingClub_df.iloc[i, :]))
    industry = np.array(industry)
    lendingClub_df["Industry"] = industry
    return lendingClub_df

test_nonempty.pipe(add_industry)
training_nonempty.pipe(add_industry)

Wall time: 9min 10s


Unnamed: 0,Loan Amount,Funded Amount,Funded Amount Investor,Term,Interest Rate,Installment,Grade,Sub Grade,Loan Status,Issued Date,Title,Description,Purpose,Employer Title,Employment Length,Home Ownership,Annual Income,Verification Status,Zip Code,State,Debt-to-income Ratio,Delinquencies in 2 years,Earliest Credit Line,Inquiries in 6 months,Months Since Last Delinquency,Months Since Last Public Record,Open Accounts,Derogatory Public Records,Revolving Balance,Revolving Balance Utilization,Number of Total Accounts,Last Credit Pulled Date,Public Bankcruptcy Records,Total Payment,Total Payment Investor,Total Received Principle,Total Received Interest,Total Received Late Fee,Recoveries,Collection Recovery Fee,Last Payment Date,Last Payment Amount,Debt Settlement Flag,Debt Settlement Flag Date,Settlement Status,Settlement Date,Settlement Amount,Settlement Percentage,Settlement Term,Credit History Length in Months,Statewide Unemployment at Issuance,Industry
0,35000,35000,34975.0,60 months,0.1171,773.44,B,B3,1,2011-11-01,Restaurant Inventory,Loan proceeds will be used to partially fund asset purchase for a restaurant and to keep cash on had. Asset purchase includes restaurant equipment and leasehold improvements,small_business,US Department of Labor,10+ years,MORTGAGE,110000.00,Verified,945xx,CA,1.06,0,1971-01-01,0,0.0,0.0,10,0,4142,0.064,27,2017-07-01,0.0,11601.600000,11593.34,6926.82,4652.28,0.0,22.50,0.00,2013-02-01,773.44,N,NaT,,NaT,,,,490.0,11.3,Not Listed
1,9500,9500,9500.0,36 months,0.1465,327.70,C,C3,0,2011-11-01,familyneeds my help,i need this money to help my family in Thailand due to flooding there...thank you,other,costco wholesales,10+ years,RENT,54000.00,Verified,334xx,FL,17.69,0,2001-05-01,1,0.0,0.0,6,0,5460,0.853,11,2018-10-01,0.0,9616.540000,9616.54,9500.00,116.54,0.0,0.00,0.00,2011-12-01,9616.95,N,NaT,,NaT,,,,126.0,9.3,Not Listed
2,3800,3800,3800.0,36 months,0.0751,118.23,A,A3,1,2011-11-01,Motorcycle Loan,,car,Five Guys,< 1 year,MORTGAGE,47000.00,Source Verified,132xx,NY,22.52,0,2002-06-01,3,0.0,0.0,10,0,8100,0.393,41,2012-08-01,0.0,1064.070000,1064.07,869.42,191.95,0.0,2.70,0.00,2012-08-01,118.23,N,NaT,,NaT,,,,113.0,8.5,Not Listed
3,12400,12400,12400.0,60 months,0.2206,342.90,F,F4,1,2011-11-01,Debt Consolidation Loan,,debt_consolidation,carmelo policaro construction,9 years,OWN,65004.00,Source Verified,077xx,NJ,6.26,0,2004-04-01,3,78.0,0.0,11,0,8990,0.775,21,2016-10-01,0.0,2127.630000,2127.63,595.32,1116.48,0.0,415.83,4.43,2012-04-01,342.90,N,NaT,,NaT,,,,91.0,9.2,Not Listed
4,4000,4000,4000.0,60 months,0.1727,100.00,D,D3,1,2011-11-01,Medical,,other,Tax Return Center,4 years,RENT,45000.00,Source Verified,331xx,FL,7.37,0,2003-02-01,0,50.0,0.0,10,0,4786,0.825,13,2016-10-01,0.0,829.140000,829.14,309.36,388.82,0.0,130.96,1.39,2012-06-01,100.00,N,NaT,,NaT,,,,105.0,9.3,Not Listed
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35566,12000,12000,725.0,36 months,0.0901,381.66,B,B2,0,2007-12-01,Debt Consolidation,To paydown credit cards at a more favorable rate.,credit_card,Bank of America Corp,6 years,MORTGAGE,100671.39,Not Verified,604xx,IL,6.64,0,1979-10-01,0,62.0,115.0,16,1,7606,0.186,39,2018-06-01,1.0,12347.219878,745.98,12000.00,347.22,0.0,0.00,0.00,2008-05-01,11202.55,N,NaT,,NaT,,,,338.0,5.4,Finance and Insurance
35639,12375,12375,1000.0,36 months,0.1091,404.62,C,C3,0,2007-12-01,no credit cards for me,"Simply looking to pay off credit cards, consolidating balances, and reduce the debt carrying costs. The banking industry and it's credit card cartel are insidious. I believe history will judge this business harshly, for their monopoly position in the electronic payments space with only the appearance of competition, and for the harm their business model inflicts on a massive scale. I'm clearly a little unhappy with these folks, their policies and my recent customer service interactions with them have driven me here. I'm considering liquidating some assets just to be done with them. The ways they have come up with to gouge their customers are absolutely unbelievable. At this point I plan on being done with them as soon as possible, one way or another, and I'd much rather pay interest to individual investors than these bottom feeders.",debt_consolidation,Fullmoon Software,2 years,RENT,80000.00,Not Verified,201xx,VA,9.23,0,2000-12-01,0,0.0,103.0,4,1,13126,0.965,4,2016-10-01,1.0,14370.922249,1161.29,12375.00,1995.92,0.0,0.00,0.00,2010-02-01,4259.11,N,NaT,,NaT,,,,84.0,3.3,Not Listed
35653,4800,4800,1100.0,36 months,0.1028,155.52,C,C1,0,2007-11-01,Want to pay off high intrest cards,"Need loan to pay off high intrest credit cards, so I can improve my credit score",debt_consolidation,EE Wine Inc,1 year,RENT,35000.00,Not Verified,226xx,VA,7.51,0,2000-03-01,0,52.0,114.0,11,1,5836,0.687,12,2008-08-01,1.0,5134.085288,1176.56,4800.00,334.09,0.0,0.00,0.00,2008-08-01,3891.08,N,NaT,,NaT,,,,92.0,3.3,Not Listed
35664,7000,7000,1000.0,36 months,0.1059,227.82,C,C2,0,2007-11-01,Taking the First Step by Consolidating,"I want to pay off 3 of my credit cards with high interest and do it in a reasonable amount of time, 3 years or sooner. It would take me 10 years or more of minimum payments to accomplish this without consolidating them and paying them off in a shorter period of time. I want to be debt free by the time my son, who is 11 goes to college, so I can help him out if he needs me to with tuition and books. This gives me about 6 years to become completely debt free and this is my first real stabb at getting there. My credit is good, I pay things on time since a bankruptcy I had in 1988, due to a business partnership breaking up. This will be completely off my record in September of 2008, and I am very committed to keeping my good credit and increasing my credit score as much as possible.",debt_consolidation,NOT PROVIDED,3 years,MORTGAGE,63500.00,Not Verified,853xx,AZ,8.50,0,1989-02-01,1,0.0,113.0,9,1,14930,0.790,21,2017-06-01,1.0,8174.021910,1167.72,7000.00,1174.02,0.0,0.00,0.00,2010-05-01,1571.29,N,NaT,,NaT,,,,225.0,4.1,Not Listed


In [107]:
test_nonempty["Industry"].value_counts().to_frame()
training_nonempty["Industry"].value_counts().to_frame()

Unnamed: 0,Industry
Not Listed,3484
Finance and Insurance,125
Manufacturing,105
Retail Trade,74
Information,49
Transportation and Warehousing,23
Accommodation and Food Services,21
"Professional, Scientific, and Technical Services",15
Other Services (except Public Administration),14
Health Care and Social Assistance,13


Unnamed: 0,Industry
Not Listed,30438
Finance and Insurance,1184
Manufacturing,1097
Retail Trade,592
Information,509
"Professional, Scientific, and Technical Services",212
Transportation and Warehousing,183
Other Services (except Public Administration),154
Accommodation and Food Services,131
Utilities,92


In [134]:
# %%time
# training_nonempty.to_excel("../data/training_cleaned.xlsx", index=False)
# test_nonempty.to_excel("../data/test_cleaned.xlsx", index=False)

Wall time: 51.7 s


### 1.5 A Note on Automation

One caveat: the new dataset has to have exactly the same columns where there's no variation. This is because if the dataset contains too few data points, the `Issued Date` for all entries will be the same, and we will have an extra column removed in the `drop_columns_no_variation` step. 

As a rule of thumb for using the current iteration, make sure the new dataset contains at least 2000 entries.

In [169]:
def get_non_empty_columns(lendingClub_df_raw):
    training_nonempty = lendingClub_df_raw[list(non_empty_columns(rawTraining))]
    training_nonempty.columns = variable_name_tidy
    training_nonempty = training_nonempty.reindex(columns=variable_names_grouped_list)
    return training_nonempty

def missing_value_handling(lendingClub_df):
    lendingClub_df.pipe(replace_with_empty_string).pipe(fill_blank_months_with_zero).pipe(replace_empty_last_pmt_d_with_issuance_d).pipe(drop_no_data_entries);
    return lendingClub_df

def variable_transformation(lendingClub_df):
    lendingClub_df.pipe(drop_columns_no_variation).pipe(chargeoff_indicator).pipe(format_description).pipe(add_credit_history_in_months)
    return lendingClub_df

def add_external_data(lendingClub_df):
    lendingClub_df.pipe(add_unemployment).pipe(add_gdp).pipe(format_emp_title).pipe(add_industry)
    return lendingClub_df
    
def clean_data(lendingClub_df_raw):
    lendingClub_df_raw = lendingClub_df_raw.pipe(get_non_empty_columns)
    lendingClub_df_raw.pipe(missing_value_handling).pipe(variable_transformation).pipe(add_external_data)
    return lendingClub_df_raw

In [179]:
new_data = pd.read_excel("../data/LendingClubData_new_training.xlsx")
new_data.pipe(clean_data)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


Unnamed: 0,Loan Amount,Funded Amount,Funded Amount Investor,Term,Interest Rate,Installment,Grade,Sub Grade,Loan Status,Issued Date,Title,Description,Purpose,Employer Title,Employment Length,Home Ownership,Annual Income,Verification Status,Zip Code,State,Debt-to-income Ratio,Delinquencies in 2 years,Earliest Credit Line,Inquiries in 6 months,Months Since Last Delinquency,Months Since Last Public Record,Open Accounts,Derogatory Public Records,Revolving Balance,Revolving Balance Utilization,Number of Total Accounts,Last Credit Pulled Date,Public Bankcruptcy Records,Total Payment,Total Payment Investor,Total Received Principle,Total Received Interest,Total Received Late Fee,Recoveries,Collection Recovery Fee,Last Payment Date,Last Payment Amount,Debt Settlement Flag,Debt Settlement Flag Date,Settlement Status,Settlement Date,Settlement Amount,Settlement Percentage,Settlement Term,Credit History Length in Months,Statewide Unemployment at Issuance,Year,Quarter
0,35000,35000,34975.0,60 months,0.1171,773.44,B,B3,1,2011-11-01,Restaurant Inventory,Loan proceeds will be used to partially fund asset purchase for a restaurant and to keep cash on had. Asset purchase includes restaurant equipment and leasehold improvements,small_business,US Department of Labor,10+ years,MORTGAGE,110000.0,Verified,945xx,CA,1.06,0,1971-01-01,0,0.0,0.0,10,0,4142,0.064,27,2017-07-01,0,11601.600000,11593.34,6926.82,4652.28,0.0,22.50,0.00,2013-02-01,773.44,N,NaT,,NaT,,,,490.0,11.3,2011,4
1,9500,9500,9500.0,36 months,0.1465,327.70,C,C3,0,2011-11-01,familyneeds my help,i need this money to help my family in Thailand due to flooding there...thank you,other,costco wholesales,10+ years,RENT,54000.0,Verified,334xx,FL,17.69,0,2001-05-01,1,0.0,0.0,6,0,5460,0.853,11,2018-10-01,0,9616.540000,9616.54,9500.00,116.54,0.0,0.00,0.00,2011-12-01,9616.95,N,NaT,,NaT,,,,126.0,9.3,2011,4
2,3800,3800,3800.0,36 months,0.0751,118.23,A,A3,1,2011-11-01,Motorcycle Loan,,car,Five Guys,< 1 year,MORTGAGE,47000.0,Source Verified,132xx,NY,22.52,0,2002-06-01,3,0.0,0.0,10,0,8100,0.393,41,2012-08-01,0,1064.070000,1064.07,869.42,191.95,0.0,2.70,0.00,2012-08-01,118.23,N,NaT,,NaT,,,,113.0,8.5,2011,4
3,12400,12400,12400.0,60 months,0.2206,342.90,F,F4,1,2011-11-01,Debt Consolidation Loan,,debt_consolidation,carmelo policaro construction,9 years,OWN,65004.0,Source Verified,077xx,NJ,6.26,0,2004-04-01,3,78.0,0.0,11,0,8990,0.775,21,2016-10-01,0,2127.630000,2127.63,595.32,1116.48,0.0,415.83,4.43,2012-04-01,342.90,N,NaT,,NaT,,,,91.0,9.2,2011,4
4,4000,4000,4000.0,60 months,0.1727,100.00,D,D3,1,2011-11-01,Medical,,other,Tax Return Center,4 years,RENT,45000.0,Source Verified,331xx,FL,7.37,0,2003-02-01,0,50.0,0.0,10,0,4786,0.825,13,2016-10-01,0,829.140000,829.14,309.36,388.82,0.0,130.96,1.39,2012-06-01,100.00,N,NaT,,NaT,,,,105.0,9.3,2011,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
711,12800,12800,12800.0,60 months,0.1727,319.98,D,D3,0,2011-10-01,pay off old debts,,credit_card,Little Elm I.S.D,2 years,RENT,46200.0,Source Verified,760xx,TX,15.12,0,2002-01-01,0,44.0,0.0,6,0,5523,0.339,15,2014-01-01,0,16875.664594,16875.66,12800.00,4075.66,0.0,0.00,0.00,2014-01-01,8893.15,N,NaT,,NaT,,,,117.0,7.6,2011,4
712,10600,10600,10600.0,36 months,0.1596,372.46,C,C5,0,2011-10-01,Debt Consolidation,I am looking to consolidate some higher interest rate cards into an easier to manage monthly payment. I have a good credit history and a stable career.,debt_consolidation,Johnson County Fire District # 2,7 years,MORTGAGE,47000.0,Source Verified,660xx,KS,16.03,0,2003-08-01,1,42.0,0.0,12,0,14713,0.547,28,2016-04-01,0,12856.986244,12856.99,10600.00,2256.99,0.0,0.00,0.00,2013-08-01,5417.62,N,NaT,,NaT,,,,98.0,6.3,2011,4
713,1500,1500,1500.0,36 months,0.0603,45.66,A,A1,0,2011-10-01,Getting Me Back On My Feet Financially,,debt_consolidation,Kern Health Systems,4 years,RENT,22968.0,Not Verified,933xx,CA,28.34,0,1993-09-01,3,0.0,0.0,10,0,3089,0.094,18,2018-10-01,0,1568.920000,1568.92,1500.00,68.92,0.0,0.00,0.00,2012-10-01,763.19,N,NaT,,NaT,,,,217.0,11.5,2011,4
714,6250,6250,6250.0,36 months,0.0991,201.41,B,B1,0,2011-10-01,Credit Card Refinancing,Feel free to ask for more details about my financial situation and expenses. I have them all here in my &quot;Make it to the End of the Year&quot; plan.,credit_card,,,RENT,16000.0,Source Verified,209xx,MD,11.85,0,2000-04-01,0,0.0,0.0,5,0,7529,0.886,11,2016-12-01,0,7240.750239,7240.75,6250.00,990.75,0.0,0.00,0.00,2014-08-01,798.76,N,NaT,,NaT,,,,138.0,7.2,2011,4
