In [1]:
import pandas as pd

We import the data produced by the Data Cleaning Python script. This script prepares that data for machine learning by handling missing values, converting categorical columns to numeric columns, and removing any extraneous columns we encounter along the way.

In [2]:
loans= pd.read_csv("data/filtered_loans_2007.csv")

In [3]:
loans.shape

(39786, 25)

In [4]:
# Checking to see number of missing values for all columns
null_counts= loans.isnull().sum()
print(null_counts)

Unnamed: 0                 0
loan_amnt                  0
term                       0
int_rate                   0
installment                0
emp_length              1078
home_ownership             0
annual_inc                 0
verification_status        0
loan_status                0
purpose                    0
title                     11
addr_state                 0
dti                        0
delinq_2yrs                0
earliest_cr_line           0
inq_last_6mths             0
open_acc                   0
pub_rec                    0
revol_bal                  0
revol_util                50
total_acc                  0
last_credit_pull_d         2
pub_rec_bankruptcies     697
debt_settlement_flag       0
dtype: int64


We'll remove every column with more than 1% missing data, except emp_length, which we keep because it is a relevant column. For the remaining columns, we'll only delete the rows that contain missing values.
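
As a rough illustration of the rule above, here is a minimal sketch on a made-up toy frame (the values, the `keep_anyway` list and the 0.01 threshold variable are assumptions for the example, not taken from the loans data). It also shows that `dropna(axis=0)` removes rows where a kept column such as emp_length is missing.

```python
import numpy as np
import pandas as pd

# Toy data with 200 rows; the values are invented for illustration only
toy = pd.DataFrame({
    "loan_amnt": [1000.0] * 200,
    "emp_length": ["1 year"] * 190 + [None] * 10,         # 5% missing, kept on purpose
    "pub_rec_bankruptcies": [0.0] * 195 + [np.nan] * 5,   # 2.5% missing
    "title": ["x"] * 199 + [None],                        # 0.5% missing
})

# Fraction of missing values per column
missing_frac = toy.isnull().mean()

# Columns we keep even if they exceed the threshold (hypothetical helper list)
keep_anyway = ["emp_length"]
threshold = 0.01

to_drop = [
    col for col in missing_frac.index
    if missing_frac[col] > threshold and col not in keep_anyway
]

# Drop the sparse columns, then drop any row that still has a missing value
trimmed = toy.drop(columns=to_drop).dropna(axis=0)

print(to_drop)
print(trimmed.shape)
```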

In [5]:
loans= loans.drop(["pub_rec_bankruptcies"], axis=1)
loans= loans.dropna(axis=0)
print(loans.dtypes.value_counts())

object     12
float64    10
int64       2
dtype: int64


In [6]:
# Exploring all the columns of object datatype
object_columns_df= loans.select_dtypes(include=['object'])
object_columns_df.head()

Unnamed: 0,term,int_rate,emp_length,home_ownership,verification_status,purpose,title,addr_state,earliest_cr_line,revol_util,last_credit_pull_d,debt_settlement_flag
0,36 months,10.65%,10+ years,RENT,Verified,credit_card,Computer,AZ,Jan-1985,83.7%,Apr-2018,N
1,60 months,15.27%,< 1 year,RENT,Source Verified,car,bike,GA,Apr-1999,9.4%,Oct-2016,N
2,36 months,15.96%,10+ years,RENT,Not Verified,small_business,real estate business,IL,Nov-2001,98.5%,Jun-2017,N
3,36 months,13.49%,10+ years,RENT,Source Verified,other,personel,CA,Feb-1996,21%,Apr-2016,N
4,60 months,12.69%,1 year,RENT,Source Verified,other,Personal,OR,Jan-1996,53.9%,Apr-2018,N


The following changes could be made to prepare these columns for modeling:
- The int_rate and revol_util columns contain percentage signs that need to be stripped. revol_util is the revolving utilization rate, i.e. the amount of credit the borrower is using relative to all available credit. [More info.](http://blog.credit.com/2013/04/what-is-revolving-utilization-65530/)
- The earliest_cr_line and last_credit_pull_d columns are dates and require a lot of preparation. However, since these are not very relevant to our model, we can remove them.
- For the rest of the columns, we can check whether some of them are categorical and can be converted to that type.
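
One quick way to judge whether a text column is categorical is to count its distinct values. The sketch below uses a small made-up frame (the column names mirror ours, the rows are only illustrative): columns with a handful of distinct values are good candidates for encoding, while columns with nearly one value per row are closer to free text.

```python
import pandas as pd

# Tiny illustrative frame; not the real loans data
toy = pd.DataFrame({
    "term": [" 36 months", " 60 months", " 36 months", " 36 months"],
    "home_ownership": ["RENT", "RENT", "OWN", "MORTGAGE"],
    "title": ["Computer", "bike", "real estate business", "personel"],
})

# Number of distinct values per column, smallest first
cardinality = toy.nunique().sort_values()
print(cardinality)
```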

In [7]:
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']

for each in cols:
    print(loans[each].value_counts())

RENT        18471
MORTGAGE    17242
OWN          2837
OTHER          96
NONE            3
Name: home_ownership, dtype: int64
Not Verified       16468
Verified           12377
Source Verified     9804
Name: verification_status, dtype: int64
10+ years    8897
< 1 year     4576
2 years      4389
3 years      4094
4 years      3435
5 years      3279
1 year       3240
6 years      2227
7 years      1771
8 years      1483
9 years      1258
Name: emp_length, dtype: int64
 36 months    28234
 60 months    10415
Name: term, dtype: int64
CA    6907
NY    3711
FL    2779
TX    2674
NJ    1825
IL    1487
PA    1481
VA    1378
GA    1358
MA    1313
OH    1190
MD    1034
AZ     832
WA     807
CO     769
NC     761
CT     734
MI     688
MO     661
MN     591
NV     482
SC     464
WI     445
OR     436
AL     433
LA     426
KY     323
OK     293
KS     260
UT     253
AR     235
DC     212
RI     197
NM     184
HI     169
WV     168
NH     162
DE     110
AK      79
MT      79
WY      79
SD      62
VT  

In [8]:
print(loans["purpose"].value_counts())
print(loans["title"].value_counts())

debt_consolidation    18262
credit_card            5004
other                  3824
home_improvement       2884
major_purchase         2109
small_business         1783
car                    1497
wedding                 934
medical                 668
moving                  557
house                   369
vacation                351
educational             312
renewable_energy         95
Name: purpose, dtype: int64
Debt Consolidation                             2149
Debt Consolidation Loan                        1695
Personal Loan                                   643
Consolidation                                   510
debt consolidation                              489
Credit Card Consolidation                       349
Home Improvement                                347
Debt consolidation                              324
Small Business Loan                             317
Credit Card Loan                                308
Personal                                        297
Consolid

- The purpose and title columns contain overlapping information, but we'll keep the purpose column since it contains only a few discrete values. In addition, the title column has data quality issues, since many of the values are repeated with slight modifications (e.g. Debt Consolidation, Debt Consolidation Loan and debt consolidation).
- Adding dummy variables for the addr_state column would make the dataset unnecessarily complex, so let's get rid of it.
- We can use a mapping for the emp_length column. This assumes that people with 10+ years of experience have exactly 10 years of it, and it treats < 1 year and n/a as the same. Note that the value counts above show no n/a entries, because the rows with a missing emp_length were already removed by dropna.
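
To see how much of the variety in title is just formatting, here is a small sketch using a few of the title values printed above. Lowercasing and trimming whitespace is only one assumed normalization; it would not merge variants such as Debt Consolidation Loan.

```python
import pandas as pd

# A few title variants taken from the value counts above
titles = pd.Series([
    "Debt Consolidation",
    "Debt Consolidation Loan",
    "debt consolidation",
    "Debt consolidation",
    "Consolidation",
])

# Normalize case and surrounding whitespace only
normalized = titles.str.strip().str.lower()

print(titles.nunique(), normalized.nunique())
```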


In [9]:
remove_cols= ["last_credit_pull_d", "addr_state", "title", "earliest_cr_line"]
loans= loans.drop(remove_cols, axis=1)

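def prepare_rates(x):
    # Strip the trailing percent sign and convert to float, e.g. "10.65%" -> 10.65
    return float(x.rstrip('%'))

# Convert both percentage columns from strings to floats
loans["int_rate"]= loans["int_rate"].apply(prepare_rates)
loans["revol_util"]= loans["revol_util"].apply(prepare_rates)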
    
    
mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}

loans= loans.replace(mapping_dict)
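
An alternative to the cell above, sketched on a toy frame: the same conversions can be written with vectorized string methods and `Series.map`. Unlike `replace`, `map` turns any value missing from the dictionary into NaN, which makes unmapped categories easy to spot. The `emp_length_map` name and its abbreviated contents are assumptions for this example.

```python
import pandas as pd

# Toy rows modeled on the head() output above
toy = pd.DataFrame({
    "int_rate": ["10.65%", "15.27%", "13.49%"],
    "revol_util": ["83.7%", "9.4%", "21%"],
    "emp_length": ["10+ years", "< 1 year", "1 year"],
})

# Strip the percent sign and convert the whole column at once
for col in ["int_rate", "revol_util"]:
    toy[col] = toy[col].str.rstrip("%").astype(float)

# Abbreviated mapping, just enough for the toy rows
emp_length_map = {"10+ years": 10, "< 1 year": 0, "1 year": 1}
toy["emp_length"] = toy["emp_length"].map(emp_length_map)

# Any value not covered by the mapping would show up here as NaN
unmapped = toy["emp_length"].isnull().sum()
print(toy)
print(unmapped)
```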

In [10]:
cols_to_encode= ["home_ownership", "verification_status", "purpose", "term"]

dummy_df= pd.get_dummies(loans[cols_to_encode])
loans= pd.concat([loans, dummy_df], axis=1)
loans= loans.drop(cols_to_encode, axis=1)
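
One detail worth checking: the term values printed earlier appear to start with a space (" 36 months"), and `get_dummies` builds column names from the raw values. The sketch below, on a made-up frame, shows the resulting names with and without stripping that whitespace first. Whether to strip is a judgment call; the cell above does not do it.

```python
import pandas as pd

# Toy frame; term values carry a leading space as in the output above
toy = pd.DataFrame({
    "term": [" 36 months", " 60 months", " 36 months"],
    "home_ownership": ["RENT", "OWN", "RENT"],
})

# Dummy columns are named <column>_<value>, so the space ends up in the name
raw_dummies = pd.get_dummies(toy)

# Stripping whitespace first gives cleaner column names
toy["term"] = toy["term"].str.strip()
clean_dummies = pd.get_dummies(toy)

print(list(raw_dummies.columns))
print(list(clean_dummies.columns))
```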

In [11]:
loans= loans.drop(["debt_settlement_flag"], axis=1)

In [12]:
loans.to_csv("data/cleaned_loans_2007.csv")
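
The Unnamed: 0 column seen at the top of this script is what pandas produces when a CSV that was written with its index is read back without declaring that index. A minimal sketch of the round trip, using an in-memory buffer and made-up rows instead of our files:

```python
import io
import pandas as pd

# Toy frame standing in for the loans data
toy = pd.DataFrame({"loan_amnt": [5000, 2500], "int_rate": [10.65, 15.27]})

# Default to_csv writes the index as an unnamed first column
buf_default = io.StringIO()
toy.to_csv(buf_default)
buf_default.seek(0)
with_index = pd.read_csv(buf_default)

# index=False leaves the index out of the file
buf_noindex = io.StringIO()
toy.to_csv(buf_noindex, index=False)
buf_noindex.seek(0)
without_index = pd.read_csv(buf_noindex)

print(list(with_index.columns))
print(list(without_index.columns))
```

If the index column is not wanted as a feature, writing with `index=False` (or reading with `index_col=0`) keeps it out of the modeling data.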