# [4]  EDA Part 1 - Cleaning

---
With all my data now stored in SQL, I can begin exploratory data analysis. The Lending Club dataset is very large (more than 1.7 million records and 145 columns) and will require extensive cleaning. I will also need to map the median household income (US Census data) to the zip code of the borrower.

---

In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

engine = create_engine('postgresql://postgres:database@localhost:5432/Capstone')

In [2]:
# Read in loan dataset from SQL database.
SQL_STRING = '''

select * from "Lending Club"

'''

df = pd.read_sql(SQL_STRING, con=engine)

In [3]:
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,1,,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,...,,,Cash,N,,,,,,
1,2,,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,...,,,Cash,N,,,,,,
2,3,,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,...,,,Cash,N,,,,,,
3,4,,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,...,,,Cash,N,,,,,,
4,5,,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,...,,,Cash,N,,,,,,


In [4]:
df.shape

(1765427, 145)

## Target - Default / Fully Paid

---
The goal of the project is to predict whether or not someone will default on their loan. Therefore, I need to explore the loan_status column to check the unique classes in this column.

---

In [5]:
df.loan_status.value_counts()

Current                                                843754
Fully Paid                                             698690
Charged Off                                            182199
Late (31-120 days)                                      21742
In Grace Period                                         11812
Late (16-30 days)                                        4423
Does not meet the credit policy. Status:Fully Paid       1988
Does not meet the credit policy. Status:Charged Off       761
Default                                                    57
Name: loan_status, dtype: int64

---
Because I am trying to predict which loans default, I am only interested in the loans that have come to completion and have either been fully paid or defaulted.

The value counts show that nearly 850,000 loans are still 'Current', i.e. the borrower is still repaying and has neither paid off the loan nor defaulted. There are also several other categories, such as 'Late' and 'In Grace Period'. I do not know whether these loans will end up defaulting or being fully paid, so I will drop the records in these categories.

There are only 57 records listed as 'Default'. However, there are more than 180,000 records listed as 'Charged Off'. Lending Club defines 'Default' as 120+ days past due. 'Charged Off' is defined as 150+ days past due with no expectation of the loan being paid. For the purpose of this project, I will group these two categories together as 'Default'.

---

In [6]:
# Slice dataframe to only include loans that are 'Fully Paid', 'Charged Off', or 'Default'.
loan = df[(df.loan_status == 'Fully Paid') | (df.loan_status == 'Charged Off') | (df.loan_status == 'Default')]

In [7]:
loan.loan_status.value_counts()

Fully Paid     698690
Charged Off    182199
Default            57
Name: loan_status, dtype: int64

In [8]:
loan.shape

(880946, 145)

---
I am now left with roughly 880,000 records. I will group 'Charged Off' and 'Default' into a single 'Default' class.

---

In [9]:
# Rename 'Charged Off' records to 'Default'.
loan.loan_status = loan.loan_status.map(lambda x: 'Default' if x=='Charged Off' else x)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


In [10]:
loan.loan_status.value_counts()

Fully Paid    698690
Default       182256
Name: loan_status, dtype: int64

---
I now have two classes in my target variable. I will binarise this column so that 'Default' becomes 1 and 'Fully Paid' becomes 0.

---

In [11]:
# Rename 'Default' records to 1, all others to 0.
loan.loan_status = loan.loan_status.map(lambda x: 1 if x=='Default' else 0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


In [12]:
loan.loan_status.value_counts()

0    698690
1    182256
Name: loan_status, dtype: int64

---
With the target column processed, I can now explore all of the other features.

---
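
The SettingWithCopyWarning messages above appear because `loan` is a slice of `df`. A minimal sketch of one way the filtering and binarising steps could be written to avoid the warning, assuming a small toy dataframe (the values below are made up for illustration) in place of the full table:

```python
import pandas as pd

# Toy stand-in for the full loans table (hypothetical values).
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'loan_status': ['Fully Paid', 'Charged Off', 'Current',
                    'Default', 'Late (31-120 days)'],
})

completed = ['Fully Paid', 'Charged Off', 'Default']

# .copy() makes `loan` an independent dataframe rather than a slice of `df`,
# so later assignments do not trigger SettingWithCopyWarning.
loan = df[df['loan_status'].isin(completed)].copy()

# Binarise in one step: 1 for 'Charged Off' or 'Default', 0 for 'Fully Paid'.
loan['loan_status'] = loan['loan_status'].isin(['Charged Off', 'Default']).astype(int)

print(loan['loan_status'].value_counts())
```

Taking an explicit copy costs some memory on a table this size, but it makes it unambiguous that later edits apply to `loan` and not to `df`.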

## Dropping NaN Columns

---
Having looked at the data, there appear to be a lot of columns with many NaN/None values. These columns will not have enough (if any) data to act as a predictor so they will need to be dropped. I will start by removing columns that have at least 50% NaNs.

---

In [13]:
# Create dataframe with percentage of nulls per column.
nulls = (loan.isnull().sum()/loan.shape[0]).to_frame()  # Number of nulls over total number of records for each column.
nulls['column'] = loan.columns
nulls.columns = ['null_pcnt', 'column']

In [14]:
nulls.head(15)

Unnamed: 0,null_pcnt,column
id,0.0,id
member_id,1.0,member_id
loan_amnt,0.0,loan_amnt
funded_amnt,0.0,funded_amnt
funded_amnt_inv,0.0,funded_amnt_inv
term,0.0,term
int_rate,0.0,int_rate
installment,0.0,installment
grade,0.0,grade
sub_grade,0.0,sub_grade


In [15]:
# Now create a list of columns with at least 50% nulls.
many_nulls = nulls[nulls.null_pcnt >= 0.5]
null_cols = list(many_nulls.column)
print(len(null_cols))

57


---
There are 57 columns that have at least 50% null values.

---

In [16]:
null_cols

['member_id',
 'url',
 'desc',
 'mths_since_last_delinq',
 'mths_since_last_record',
 'next_pymnt_d',
 'mths_since_last_major_derog',
 'annual_inc_joint',
 'dti_joint',
 'verification_status_joint',
 'open_acc_6m',
 'open_act_il',
 'open_il_12m',
 'open_il_24m',
 'mths_since_rcnt_il',
 'total_bal_il',
 'il_util',
 'open_rv_12m',
 'open_rv_24m',
 'max_bal_bc',
 'all_util',
 'inq_fi',
 'total_cu_tl',
 'inq_last_12m',
 'mths_since_recent_bc_dlq',
 'mths_since_recent_revol_delinq',
 'revol_bal_joint',
 'sec_app_earliest_cr_line',
 'sec_app_inq_last_6mths',
 'sec_app_mort_acc',
 'sec_app_open_acc',
 'sec_app_revol_util',
 'sec_app_open_act_il',
 'sec_app_num_rev_accts',
 'sec_app_chargeoff_within_12_mths',
 'sec_app_collections_12_mths_ex_med',
 'sec_app_mths_since_last_major_derog',
 'hardship_type',
 'hardship_reason',
 'hardship_status',
 'deferral_term',
 'hardship_amount',
 'hardship_start_date',
 'hardship_end_date',
 'payment_plan_start_date',
 'hardship_length',
 'hardship_dpd',
 'h

In [17]:
# Now drop the columns with at least 50% nulls.
loan.drop(null_cols, axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [18]:
loan.shape

(880946, 88)
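
A compact alternative for the null-fraction step, sketched on a toy dataframe (the column values are made up for illustration). `isnull().mean()` gives the fraction of nulls per column directly, and the 0.5 threshold matches the one used above:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loans table (hypothetical values).
frame = pd.DataFrame({
    'loan_amnt': [5000.0, 2500.0, 2400.0, 10000.0],
    'member_id': [np.nan, np.nan, np.nan, np.nan],
    'desc': ['text', np.nan, np.nan, 'text'],
    'emp_title': ['Ryder', np.nan, 'FDA', 'Teacher'],
})

# The mean of a boolean mask is the fraction of True values,
# i.e. the fraction of nulls in each column.
null_frac = frame.isnull().mean()

# Columns where at least half of the values are null.
null_cols = null_frac[null_frac >= 0.5].index.tolist()

# drop() returns a new dataframe, so the original is left untouched.
trimmed = frame.drop(columns=null_cols)

print(null_cols)
print(trimmed.shape)
```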

## Datetime

---
Several columns relate to a date. I will convert these date columns to datetime objects.

---

In [19]:
print(loan.issue_d.dtype)
print(loan.earliest_cr_line.dtype)
print(loan.last_pymnt_d.dtype)
print(loan.last_credit_pull_d.dtype)

object
object
object
object


In [20]:
# Convert columns from object to datetime.
loan.issue_d = pd.to_datetime(loan.issue_d)
loan.earliest_cr_line = pd.to_datetime(loan.earliest_cr_line)
loan.last_pymnt_d = pd.to_datetime(loan.last_pymnt_d)
loan.last_credit_pull_d = pd.to_datetime(loan.last_credit_pull_d)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


In [21]:
print(loan.issue_d.dtype)
print(loan.earliest_cr_line.dtype)
print(loan.last_pymnt_d.dtype)
print(loan.last_credit_pull_d.dtype)

datetime64[ns]
datetime64[ns]
datetime64[ns]
datetime64[ns]
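
The four conversions could also be written as a loop over the date columns. This is only a sketch: it assumes the raw strings look like 'Dec-2011' (month abbreviation and year), which is an assumption about the stored format and should be checked against the actual data before passing `format`:

```python
import pandas as pd

# Toy stand-in; the 'Mon-YYYY' string format is an assumption.
frame = pd.DataFrame({
    'issue_d': ['Dec-2011', 'Jan-2012'],
    'earliest_cr_line': ['Jan-1985', 'Apr-1999'],
})

date_cols = ['issue_d', 'earliest_cr_line']

for col in date_cols:
    # An explicit format avoids per-value format guessing on a large column.
    frame[col] = pd.to_datetime(frame[col], format='%b-%Y')

print(frame.dtypes)
```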


---
The columns where more than 50% of the values are NaN have now been removed. Date columns have also been converted to datetime. I will push this cleaned dataframe into a new SQL table. I will regularly push the dataframe into new SQL tables to save any changes as I continue to clean and preprocess the data.

---

In [22]:
loan.to_sql(name='LC_Cleaning', con=engine, if_exists='replace', index = False)

## Other Columns

---
Now I will explore and clean the other columns.

---

In [23]:
# Read in cleaned table.
SQL_STRING = '''

select * from "LC_Cleaning"

'''

df = pd.read_sql(SQL_STRING, con=engine)

In [24]:
df.head()

Unnamed: 0,id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,hardship_flag,disbursement_method,debt_settlement_flag
0,1,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,...,,0.0,0.0,,,,,N,Cash,N
1,2,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,...,,0.0,0.0,,,,,N,Cash,N
2,3,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,...,,0.0,0.0,,,,,N,Cash,N
3,4,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,...,,0.0,0.0,,,,,N,Cash,N
4,5,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,...,,0.0,0.0,,,,,N,Cash,N


In [25]:
df.shape

(880946, 88)

### Dropping Irrelevant Columns

---
There are still more than 80 features in the dataset. I want to reduce this before modelling by dropping any irrelevant and unnecessary features. The problem I am trying to solve is to predict whether someone defaults on their loan. Therefore, any features related to what happens after a loan has defaulted, such as collection, hardship, and settlement plans, are irrelevant to the question. I will now drop columns that pertain to what happens after a loan has defaulted / been paid off.

---

In [26]:
df.drop(['recoveries', 'collection_recovery_fee', 'collections_12_mths_ex_med', 'hardship_flag',
         'debt_settlement_flag'], axis=1, inplace=True)

In [27]:
df.shape

(880946, 83)

---
There are also columns that record what happens during the life of a loan, such as payment dates, how much the investor has received in payments, and the total interest/principal that the borrower ended up paying. These values are only known after the loan has been issued, so they are not useful for predicting whether someone will default. I will drop these columns too.

---

In [28]:
df.drop(['out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
        'total_rec_late_fee', 'last_pymnt_d', 'last_pymnt_amnt', 'last_credit_pull_d'], axis=1, inplace=True)

In [29]:
df.shape

(880946, 73)

---
For my model, I will only be looking at individual loans. I will not include joint loans that have a secondary applicant. I have already dropped most columns related to joint loans and the secondary applicant due to high proportions of NaNs. I will now drop any other columns related to this and also any rows that contain joint loans.



The column application_type indicates whether the loan is an individual application or a joint application. I will slice the dataframe to include only the individual applications and then drop the column.

---

In [30]:
df.application_type.value_counts()

Individual    875697
Joint App       5249
Name: application_type, dtype: int64

---
The vast majority are individual loans.

---

In [31]:
# Slice dataframe to only include individual loans.
df = df[df.application_type == 'Individual']

In [32]:
# Drop application_type column.
df.drop(['application_type'], axis=1, inplace=True)

In [33]:
df.shape

(875697, 72)

In [34]:
df.pymnt_plan.value_counts()

n    875697
Name: pymnt_plan, dtype: int64

---
All values are the same, 'n' (no payment plan), so I will drop the column.

---

In [35]:
df.drop(['pymnt_plan'], axis=1, inplace=True)
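
Rather than inspecting columns one at a time, single-valued columns such as pymnt_plan could be found in one pass. A minimal sketch on a toy dataframe (hypothetical values):

```python
import pandas as pd

# Toy stand-in for the loans table (hypothetical values).
frame = pd.DataFrame({
    'pymnt_plan': ['n', 'n', 'n'],
    'term': [' 36 months', ' 60 months', ' 36 months'],
    'loan_amnt': [5000.0, 2500.0, 2400.0],
})

# Count distinct values per column; dropna=False counts NaN as its own value.
n_unique = frame.nunique(dropna=False)

# A column with a single distinct value carries no information for a model.
constant_cols = n_unique[n_unique <= 1].index.tolist()

frame = frame.drop(columns=constant_cols)

print(constant_cols)
```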

### Nulls in Other Columns

---
Having dropped features that are irrelevant to whether a loan defaults or not, I will now check the values of other columns that have some relevance to the question. I will start by seeing which columns still have null values and then decide what to do with them.

---

In [36]:
# Create dataframe to calculate percentage of nulls per column.
nulls = (df.isnull().sum()/df.shape[0]).to_frame()

In [37]:
nulls.columns = ['null_pcnt']

In [38]:
nulls.null_pcnt.sort_values(ascending=False).head(40)

mths_since_recent_inq         0.141687
num_tl_120dpd_2m              0.109628
mo_sin_old_il_acct            0.104376
pct_tl_nvr_dlq                0.077271
avg_cur_bal                   0.077127
mo_sin_rcnt_rev_tl_op         0.077113
mo_sin_old_rev_tl_op          0.077113
num_rev_accts                 0.077113
num_il_tl                     0.077112
mo_sin_rcnt_tl                0.077112
tot_hi_cred_lim               0.077112
tot_cur_bal                   0.077112
num_accts_ever_120_pd         0.077112
num_actv_bc_tl                0.077112
num_actv_rev_tl               0.077112
num_tl_op_past_12m            0.077112
num_bc_tl                     0.077112
total_il_high_credit_limit    0.077112
num_op_rev_tl                 0.077112
num_rev_tl_bal_gt_0           0.077112
tot_coll_amt                  0.077112
total_rev_hi_lim              0.077112
num_tl_30dpd                  0.077112
num_tl_90g_dpd_24m            0.077112
bc_util                       0.064378
percent_bc_gt_75         

---
A number of columns still have null values. mths_since_recent_inq (months since most recent inquiry) contains more than 14% NaNs. I have another column related to inquiries, inq_last_6mths (inquiries in the last 6 months). The inquiries these columns refer to are hard credit inquiries, which show up on someone's credit report when they have applied for a loan or credit (Lending Club does not include mortgage and auto loan inquiries in the data). A high number of inquiries in the preceding 6 months could indicate that the person is in high need of money, so it would likely be a better indicator of default than the number of months since the last inquiry. I will therefore drop mths_since_recent_inq.

---

In [39]:
df.drop('mths_since_recent_inq', axis=1, inplace=True)

---
mo_sin_old_il_acct (months since oldest instalment account opened) has more than 10% NaNs. mo_sin_rcnt_rev_tl_op (months since most recent revolving account opened) and mo_sin_old_rev_tl_op (months since oldest revolving account opened) have more than 7% NaNs. As with inquiries, the total number of accounts, credit limits, utilisation rates, etc. are more likely to have an impact on default rate than the number of months since someone most recently opened an account. The number of months since someone first opened an instalment/revolving account (i.e. for how long they have been using loans / had a line of credit) could have an effect. However, I also have the column earliest_cr_line (earliest credit line). Therefore, I will use earliest_cr_line and drop the other columns related to the date of opening an account.

---

In [40]:
df.drop(['mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl',
         'mths_since_recent_bc'], axis=1, inplace=True)

#### No Pre-2012 Data

---
A recent Lending Club blog (https://blog.lendingclub.com/additional-data-in-issued-loans-csv-download-file/) stated that a number of new features, which were not previously available, have been added to their datasets. However, many of these new features contain data that only goes as far back as 2012, whereas my data goes back to 2007. I cannot use features that have no data for about half the years that I have loan data for. I will therefore check the remaining features that have more than 5% nulls and remove them if they do not have data from before 2012.

---

In [41]:
# Number of Accounts 120 Days Past Due Last 2 Months
SQL_STRING = '''

select issue_d, num_tl_120dpd_2m from "LC_Cleaning"
where num_tl_120dpd_2m is not null
order by issue_d

'''
# Select feature and loan issue date where feature does not contain nulls.
# Sort by issue date to find first date that has data for this feature.

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,num_tl_120dpd_2m
0,2012-08-01,0.0
1,2012-08-01,0.0
2,2012-08-01,0.0


In [42]:
# Percentage of Accounts Never Delinquent
SQL_STRING = '''

select issue_d, pct_tl_nvr_dlq from "LC_Cleaning"
where pct_tl_nvr_dlq is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,pct_tl_nvr_dlq
0,2012-08-01,100.0
1,2012-08-01,100.0
2,2012-08-01,100.0


In [43]:
# Average Current Balance
SQL_STRING = '''

select issue_d, avg_cur_bal from "LC_Cleaning"
where avg_cur_bal is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,avg_cur_bal
0,2012-08-01,18902.0
1,2012-08-01,3556.0
2,2012-08-01,3684.0


In [44]:
# Number of Revolving Accounts
SQL_STRING = '''

select issue_d, num_rev_accts from "LC_Cleaning"
where num_rev_accts is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,num_rev_accts
0,2012-08-01,27.0
1,2012-08-01,17.0
2,2012-08-01,11.0


In [45]:
# Number of Instalment Accounts
SQL_STRING = '''

select issue_d, num_il_tl from "LC_Cleaning"
where num_il_tl is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,num_il_tl
0,2012-08-01,4.0
1,2012-08-01,10.0
2,2012-08-01,11.0


In [46]:
# Total high credit/credit limit
SQL_STRING = '''

select issue_d, tot_hi_cred_lim from "LC_Cleaning"
where tot_hi_cred_lim is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,tot_hi_cred_lim
0,2012-08-01,51892.0
1,2012-08-01,83557.0
2,2012-08-01,58087.0


In [47]:
# Total Current Balance
SQL_STRING = '''

select issue_d, tot_cur_bal from "LC_Cleaning"
where tot_cur_bal is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,tot_cur_bal
0,2012-08-01,147739.0
1,2012-08-01,24351.0
2,2012-08-01,4879.0


In [48]:
# Number of Accounts Ever 120 Days Past Due
SQL_STRING = '''

select issue_d, num_accts_ever_120_pd from "LC_Cleaning"
where num_accts_ever_120_pd is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,num_accts_ever_120_pd
0,2012-08-01,1.0
1,2012-08-01,0.0
2,2012-08-01,0.0


In [49]:
# Number of Active Bankcard Accounts
SQL_STRING = '''

select issue_d, num_actv_bc_tl from "LC_Cleaning"
where num_actv_bc_tl is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,num_actv_bc_tl
0,2012-08-01,3.0
1,2012-08-01,2.0
2,2012-08-01,4.0


In [50]:
# Number of Active Revolving Accounts
SQL_STRING = '''

select issue_d, num_actv_rev_tl from "LC_Cleaning"
where num_actv_rev_tl is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,num_actv_rev_tl
0,2012-08-01,4.0
1,2012-08-01,6.0
2,2012-08-01,7.0


In [51]:
# Number of Accounts Open Past 12 Months
SQL_STRING = '''

select issue_d, num_tl_op_past_12m from "LC_Cleaning"
where num_tl_op_past_12m is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,num_tl_op_past_12m
0,2012-08-01,0.0
1,2012-08-01,2.0
2,2012-08-01,1.0


In [52]:
# Number of Bankcard Accounts
SQL_STRING = '''

select issue_d, num_bc_tl from "LC_Cleaning"
where num_bc_tl is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,num_bc_tl
0,2012-08-01,16.0
1,2012-08-01,9.0
2,2012-08-01,8.0


In [53]:
# Total instalment high credit/credit limit
SQL_STRING = '''

select issue_d, total_il_high_credit_limit from "LC_Cleaning"
where total_il_high_credit_limit is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,total_il_high_credit_limit
0,2012-08-01,32675.0
1,2012-08-01,0.0
2,2012-08-01,8192.0


In [54]:
# Number of Open Revolving Accounts
SQL_STRING = '''

select issue_d, num_op_rev_tl from "LC_Cleaning"
where num_op_rev_tl is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,num_op_rev_tl
0,2012-08-01,7.0
1,2012-08-01,10.0
2,2012-08-01,7.0


In [55]:
# Number of Revolving Accounts with Balance Greater than 0
SQL_STRING = '''

select issue_d, num_rev_tl_bal_gt_0 from "LC_Cleaning"
where num_rev_tl_bal_gt_0 is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,num_rev_tl_bal_gt_0
0,2012-08-01,4.0
1,2012-08-01,6.0
2,2012-08-01,7.0


In [56]:
# Total Collection Amount
SQL_STRING = '''

select issue_d, tot_coll_amt from "LC_Cleaning"
where tot_coll_amt is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,tot_coll_amt
0,2012-08-01,0.0
1,2012-08-01,0.0
2,2012-08-01,0.0


In [57]:
# Total Revolving High Limit
SQL_STRING = '''

select issue_d, total_rev_hi_lim from "LC_Cleaning"
where total_rev_hi_lim is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,total_rev_hi_lim
0,2012-08-01,5900.0
1,2012-08-01,28000.0
2,2012-08-01,42300.0


In [58]:
# Number of Accounts 30 Days Past Due
SQL_STRING = '''

select issue_d, num_tl_30dpd from "LC_Cleaning"
where num_tl_30dpd is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,num_tl_30dpd
0,2012-08-01,0.0
1,2012-08-01,0.0
2,2012-08-01,0.0


In [59]:
# Number of Accounts 90+ Days Past Due Last 24 Months
SQL_STRING = '''

select issue_d, num_tl_90g_dpd_24m from "LC_Cleaning"
where num_tl_90g_dpd_24m is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,num_tl_90g_dpd_24m
0,2012-08-01,0.0
1,2012-08-01,0.0
2,2012-08-01,0.0


In [60]:
# Bankcard Utilisation Rate
SQL_STRING = '''

select issue_d, bc_util from "LC_Cleaning"
where bc_util is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,bc_util
0,2012-03-01,48.2
1,2012-03-01,94.6
2,2012-03-01,89.6


In [61]:
# Percentage Bankcards Greater than 75% Limit
SQL_STRING = '''

select issue_d, percent_bc_gt_75 from "LC_Cleaning"
where percent_bc_gt_75 is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,percent_bc_gt_75
0,2012-03-01,75.0
1,2012-03-01,100.0
2,2012-03-01,54.5


In [62]:
# Number of Satisfactory Bankcards
SQL_STRING = '''

select issue_d, num_bc_sats from "LC_Cleaning"
where num_bc_sats is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,num_bc_sats
0,2012-06-01,2.0
1,2012-06-01,2.0
2,2012-06-01,7.0


In [63]:
# Number of Satisfactory Accounts
SQL_STRING = '''

select issue_d, num_sats from "LC_Cleaning"
where num_sats is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,num_sats
0,2012-06-01,4.0
1,2012-06-01,5.0
2,2012-06-01,13.0


In [64]:
# Bankcards Open to Buy
SQL_STRING = '''

select issue_d, bc_open_to_buy from "LC_Cleaning"
where bc_open_to_buy is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,bc_open_to_buy
0,2012-03-01,3668.0
1,2012-03-01,1567.0
2,2012-03-01,9459.0


In [65]:
# Employment Title
SQL_STRING = '''

select issue_d, emp_title from "LC_Cleaning"
where emp_title is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,emp_title
0,2007-06-01,Evergreen Center
1,2007-07-01,The Dartmouth Company
2,2007-07-01,FDA


In [66]:
# Accounts Opened Past 24 Months
SQL_STRING = '''

select issue_d, acc_open_past_24mths from "LC_Cleaning"
where acc_open_past_24mths is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,acc_open_past_24mths
0,2012-03-01,7.0
1,2012-03-01,2.0
2,2012-03-01,3.0


In [67]:
# Total bankcard high credit/credit limit
SQL_STRING = '''

select issue_d, total_bc_limit from "LC_Cleaning"
where total_bc_limit is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,total_bc_limit
0,2012-03-01,8200.0
1,2012-03-01,32000.0
2,2012-03-01,34129.0


In [68]:
# Number of Mortgage Accounts
SQL_STRING = '''

select issue_d, mort_acc from "LC_Cleaning"
where mort_acc is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,mort_acc
0,2012-03-01,0.0
1,2012-03-01,0.0
2,2012-03-01,3.0


In [69]:
# Total credit balance excluding mortgage
SQL_STRING = '''

select issue_d, total_bal_ex_mort from "LC_Cleaning"
where total_bal_ex_mort is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,total_bal_ex_mort
0,2012-03-01,27219.0
1,2012-03-01,81607.0
2,2012-03-01,23214.0


In [70]:
# Employment Length
SQL_STRING = '''

select issue_d, emp_length from "LC_Cleaning"
where emp_length is not null
order by issue_d

'''

df2 = pd.read_sql(SQL_STRING, con=engine)
df2.head(3)

Unnamed: 0,issue_d,emp_length
0,2007-06-01,< 1 year
1,2007-07-01,< 1 year
2,2007-07-01,< 1 year


---
Of these columns, only emp_title (employment title) and emp_length (employment length) have data going back to 2007. None of the others have any data before March 2012, so the first five years of loans are missing these features. Therefore, I will remove the columns without pre-2012 data.

---
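
The per-column checks above all follow the same pattern, so they could be condensed into a single helper. This is a sketch on a toy dataframe (hypothetical values); `first_issue_date` is an illustrative name, not part of any library, and it works on a dataframe already in memory rather than querying SQL:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the LC_Cleaning table (hypothetical values).
frame = pd.DataFrame({
    'issue_d': pd.to_datetime(['2007-06-01', '2010-01-01',
                               '2012-03-01', '2012-08-01']),
    'emp_length': ['< 1 year', '2 years', np.nan, '10+ years'],
    'bc_util': [np.nan, np.nan, 48.2, 94.6],
    'avg_cur_bal': [np.nan, np.nan, np.nan, 18902.0],
})


def first_issue_date(frame, cols, date_col='issue_d'):
    """Return the earliest issue date with a non-null value, per column."""
    return pd.Series({
        col: frame.loc[frame[col].notna(), date_col].min()
        for col in cols
    })


first_seen = first_issue_date(frame, ['emp_length', 'bc_util', 'avg_cur_bal'])

# Columns whose first non-null record is in 2012 or later.
cutoff = pd.Timestamp('2012-01-01')
late_cols = first_seen[first_seen >= cutoff].index.tolist()

print(first_seen)
print(late_cols)
```

The resulting `late_cols` list could then be passed straight to `drop`, which avoids retyping the column names by hand.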

In [71]:
df.drop(['num_tl_120dpd_2m', 'pct_tl_nvr_dlq', 'avg_cur_bal', 'num_rev_accts', 'num_il_tl', 'tot_hi_cred_lim',
        'tot_cur_bal', 'num_accts_ever_120_pd', 'num_actv_bc_tl', 'num_actv_rev_tl', 'num_tl_op_past_12m',
        'num_bc_tl', 'total_il_high_credit_limit', 'num_op_rev_tl', 'num_rev_tl_bal_gt_0', 'tot_coll_amt',
        'total_rev_hi_lim', 'num_tl_30dpd', 'num_tl_90g_dpd_24m', 'bc_util', 'percent_bc_gt_75', 'num_bc_sats',
        'num_sats', 'bc_open_to_buy', 'acc_open_past_24mths', 'total_bc_limit', 'mort_acc', 'total_bal_ex_mort'],
        axis=1, inplace=True)

#### Null Rows

In [72]:
nulls = (df.isnull().sum()/df.shape[0]).to_frame()
nulls.columns = ['null_pcnt']
nulls.null_pcnt.sort_values(ascending=False).head(10)

emp_title                   0.059465
emp_length                  0.052186
title                       0.009844
pub_rec_bankruptcies        0.000796
revol_util                  0.000596
chargeoff_within_12_mths    0.000064
tax_liens                   0.000045
inq_last_6mths              0.000001
disbursement_method         0.000000
sub_grade                   0.000000
Name: null_pcnt, dtype: float64

---
Some columns still have NaNs. However, apart from employment title and employment length, the percentage of nulls in each is less than 1%. Because I have so many records, I can now remove any rows that contain null values. I will still have more than 800,000 records left once this is done.

---

In [73]:
df.dropna(axis=0, how='any', inplace=True)
# Drop rows that contain any null values.

In [74]:
df.shape

(814617, 37)

In [75]:
df.to_sql(name='LC_Cleaning2', con=engine, if_exists='replace', index = False)
# Push to SQL to save changes.

### Categorical Columns

---
Now that I have dropped all records containing null values, I will check the values in the categorical columns to see if they need cleaning.

---

In [76]:
SQL_STRING = '''

select * from "LC_Cleaning2"

'''

df = pd.read_sql(SQL_STRING, con=engine)

In [77]:
df.shape

(814617, 37)

In [78]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 814617 entries, 0 to 814616
Data columns (total 37 columns):
id                          814617 non-null int64
loan_amnt                   814617 non-null float64
funded_amnt                 814617 non-null float64
funded_amnt_inv             814617 non-null float64
term                        814617 non-null object
int_rate                    814617 non-null object
installment                 814617 non-null float64
grade                       814617 non-null object
sub_grade                   814617 non-null object
emp_title                   814617 non-null object
emp_length                  814617 non-null object
home_ownership              814617 non-null object
annual_inc                  814617 non-null float64
verification_status         814617 non-null object
issue_d                     814617 non-null datetime64[ns]
loan_status                 814617 non-null int64
purpose                     814617 non-null object
title      

---
I will be looking at the columns that are of type object.

---

In [79]:
df.term.value_counts()

 36 months    614376
 60 months    200241
Name: term, dtype: int64

In [80]:
df.int_rate.value_counts().head()

 10.99%    26101
 12.99%    19553
 11.99%    18491
 15.61%    17063
 13.99%    16253
Name: int_rate, dtype: int64

---
The values in the int_rate (interest rate) column have a percentage sign (%) at the end. I will need to remove these percentage signs and then convert the column to float type.

---

In [81]:
df.int_rate = df.int_rate.str.replace('%', '')    # Remove '%' from records.

In [82]:
df.int_rate.value_counts().head()

 10.99    26101
 12.99    19553
 11.99    18491
 15.61    17063
 13.99    16253
Name: int_rate, dtype: int64

In [83]:
df.int_rate = df.int_rate.astype(float)   # Convert to float.
df.int_rate.dtype

dtype('float64')
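
The value counts above suggest the raw strings carry a leading space as well as the trailing '%'. A minimal sketch of doing the whole conversion in one chained step, using a toy series (hypothetical values):

```python
import pandas as pd

# Toy stand-in for the int_rate column (hypothetical values).
int_rate = pd.Series([' 10.99%', ' 12.99%', '15.61%'])

# Strip surrounding whitespace, remove the trailing '%', then parse as numbers.
cleaned = pd.to_numeric(int_rate.str.strip().str.rstrip('%'))

print(cleaned.dtype)
print(cleaned.tolist())
```

`pd.to_numeric` raises an error on any value it cannot parse, which is a useful check that no unexpected strings are hiding in the column.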

In [84]:
df.grade.value_counts()

B    235267
C    226612
A    135900
D    125627
E     62623
F     22627
G      5961
Name: grade, dtype: int64

In [85]:
df.sub_grade.value_counts()

B3    51013
B4    50982
C1    50335
C2    47853
B5    47844
C3    44911
B2    44586
C4    43997
B1    40842
C5    39516
A5    38614
D1    32042
A4    31776
D2    27753
D3    24139
D4    22781
A1    22759
A3    21810
A2    20941
D5    18912
E1    16072
E2    14542
E3    12262
E4    10574
E5     9173
F1     6928
F2     5160
F3     4361
F4     3455
F5     2723
G1     1965
G2     1480
G3     1054
G4      818
G5      644
Name: sub_grade, dtype: int64

---
This page, https://www.lendingclub.com/foliofn/rateDetail.action, shows how grade and sub_grade are used to calculate the interest rate on a loan. Lending Club assigns a grade and sub_grade based on credit risk indicators in the credit report. The sub_grade then determines the interest rate of the loan. These three features would therefore be highly correlated, so I will drop grade and sub_grade and keep the interest rate.

---

In [86]:
df.drop(['grade', 'sub_grade'], axis=1, inplace=True)

In [87]:
df.emp_title.value_counts()

Teacher                                    12305
Manager                                    11400
Registered Nurse                            5217
RN                                          5103
Supervisor                                  4965
Owner                                       4952
Sales                                       4434
Driver                                      4050
Project Manager                             3911
Office Manager                              3224
General Manager                             3054
Director                                    3026
manager                                     2863
Engineer                                    2639
teacher                                     2545
owner                                       2380
driver                                      2289
President                                   2264
Vice President                              2144
Operations Manager                          2086
Accountant          

---
There are many unique values for employment title so I will group them into new categories at a later stage.

---

In [88]:
df.emp_length.value_counts()

10+ years    279522
2 years       77997
3 years       68691
< 1 year      67892
1 year        56357
5 years       54694
4 years       51514
6 years       42246
7 years       41125
8 years       40938
9 years       33641
Name: emp_length, dtype: int64

In [89]:
df.home_ownership.value_counts()

MORTGAGE    406408
RENT        328556
OWN          79376
OTHER          136
ANY            100
NONE            41
Name: home_ownership, dtype: int64

---
The data dictionary lists the values in home_ownership as being 'RENT', 'OWN', 'MORTGAGE', and 'OTHER'. However, my data has two other categories: 'ANY' and 'NONE'. No description has been provided as to what these two categories mean compared with 'OTHER'. Between 'OTHER', 'ANY', and 'NONE', there are only 277 records with 'OTHER' being the largest category. Therefore, I will group 'ANY' and 'NONE' with 'OTHER' into one category.

---

In [90]:
df.home_ownership = df.home_ownership.map(lambda x: 'OTHER' if x in ['ANY', 'NONE'] else x)
# Rename 'ANY' and 'NONE' to 'OTHER'.

In [91]:
df.home_ownership.value_counts()

MORTGAGE    406408
RENT        328556
OWN          79376
OTHER          277
Name: home_ownership, dtype: int64
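
---
The same grouping can also be written with Series.replace and a dictionary, which leaves every other value untouched. A minimal sketch on a toy column (the sample values are made up for illustration):

```python
import pandas as pd

# Toy column using the same category labels as home_ownership.
home = pd.Series(['RENT', 'ANY', 'OWN', 'NONE', 'OTHER', 'MORTGAGE'])

# Map the two undocumented labels onto 'OTHER'; all other values pass through unchanged.
home_grouped = home.replace({'ANY': 'OTHER', 'NONE': 'OTHER'})
print(home_grouped.value_counts())
```

---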

In [92]:
df.verification_status.value_counts()

Source Verified    304662
Not Verified       259225
Verified           250730
Name: verification_status, dtype: int64

In [93]:
df.purpose.value_counts()

debt_consolidation    483362
credit_card           175243
home_improvement       50060
other                  43716
major_purchase         17431
small_business          9363
car                     8984
medical                 8572
moving                  5683
vacation                5044
house                   4109
wedding                 2179
renewable_energy         592
educational              279
Name: purpose, dtype: int64

In [94]:
df.title.value_counts().head(40)

Debt consolidation           383478
Credit card refinancing      137612
Home improvement              39851
Other                         34280
Debt Consolidation            14585
Major purchase                12780
Medical expenses               6935
Business                       6679
Car financing                  5994
Consolidation                  4927
Moving and relocation          4362
debt consolidation             4276
Vacation                       4224
Debt Consolidation Loan        3549
Home buying                    2942
Credit Card Consolidation      2175
consolidation                  1938
Personal Loan                  1937
Consolidation Loan             1650
Home Improvement               1579
Credit Card Refinance          1361
Credit Card Payoff             1268
Consolidate                    1146
Personal                       1091
Loan                            963
Freedom                         719
Credit Card Loan                713
consolidate                 

---
The title column is the title/name of the loan that was chosen by the borrower. Most of the loans have names that are related to the purpose of the loan. Because I already have a feature for loan purpose, I can drop the title column.

---

In [95]:
df.drop('title', axis=1, inplace=True)

In [96]:
df.zip_code.value_counts().head(20)

945xx    9538
750xx    9168
112xx    8473
606xx    7650
300xx    7306
331xx    7142
100xx    6803
900xx    6789
070xx    6723
770xx    6369
891xx    6264
917xx    6083
330xx    6060
104xx    5580
117xx    5498
921xx    5437
852xx    5315
926xx    5129
913xx    5022
925xx    4758
Name: zip_code, dtype: int64

---
I have downloaded median household income per zip code from the US Census Bureau. I will be able to join this to my loan data on this zip column. However, I first need to remove the 'xx' at the end of each record.

---

In [97]:
df.zip_code = df.zip_code.str.replace('xx', '')   # Remove 'xx' from records.

In [98]:
df.zip_code.value_counts().head(10)

945    9538
750    9168
112    8473
606    7650
300    7306
331    7142
100    6803
900    6789
070    6723
770    6369
Name: zip_code, dtype: int64
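
---
The zip codes need to stay as strings for the join: a value such as '070' would lose its leading zero if it were converted to a number. A minimal sketch on toy values, using regex=False so that 'xx' is treated as literal text:

```python
import pandas as pd

# Toy values in the same 'NNNxx' format as zip_code.
zips = pd.Series(['945xx', '070xx', '112xx'])

# Remove the literal 'xx' suffix and keep the result as strings.
zip3 = zips.str.replace('xx', '', regex=False)
print(zip3.tolist())

# Converting to int would drop the leading zero and break the join key.
as_int = int(zip3.iloc[1])
print(as_int)
```

---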

In [99]:
df.addr_state.unique()

array(['GA', 'CA', 'OR', 'AZ', 'NC', 'TX', 'OH', 'FL', 'VA', 'IL', 'MO',
       'CT', 'UT', 'SC', 'NY', 'PA', 'MN', 'NJ', 'KY', 'RI', 'LA', 'MA',
       'WA', 'WI', 'AL', 'CO', 'KS', 'NV', 'AK', 'MD', 'WV', 'VT', 'MI',
       'DC', 'NH', 'NM', 'AR', 'MT', 'HI', 'WY', 'OK', 'SD', 'DE', 'MS',
       'TN', 'IA', 'NE', 'ID', 'IN', 'ME', 'ND'], dtype=object)

In [100]:
len(df.addr_state.unique())

51

---
addr_state contains the 50 US states plus the federal district DC (District of Columbia).

---

In [101]:
df.revol_util.value_counts().head(10)

0%     3871
53%    1592
58%    1587
57%    1584
54%    1573
61%    1558
48%    1547
59%    1546
52%    1541
47%    1514
Name: revol_util, dtype: int64

---
Like int_rate, the values in revol_util (revolving utilisation rate) are in the form of percentages. However, the records contain the percentage sign (%) which will need to be removed so the column can be converted to float.

---

In [102]:
df.revol_util = df.revol_util.str.replace('%', '')   # Remove '%'.

In [103]:
df.revol_util.value_counts().head(10)

0     3871
53    1592
58    1587
57    1584
54    1573
61    1558
48    1547
59    1546
52    1541
47    1514
Name: revol_util, dtype: int64

In [104]:
df.revol_util = df.revol_util.astype(float)     # Convert column to float.
df.revol_util.dtype

dtype('float64')
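
---
The strip and the conversion can also be combined in one step. This is a minimal sketch on toy values, assuming percentages are stored as strings with a trailing '%'. pd.to_numeric with errors='coerce' turns anything unparseable into NaN instead of raising an error.

```python
import pandas as pd

# Toy percentage strings, including a missing value.
pct = pd.Series(['53%', '0%', '61.5%', None])

# Strip the trailing '%' and convert to float; missing values become NaN.
pct_float = pd.to_numeric(pct.str.rstrip('%'), errors='coerce')
print(pct_float.dtype)
```

---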

In [105]:
df.initial_list_status.value_counts()

w    409434
f    405183
Name: initial_list_status, dtype: int64

---
The initial_list_status values are 'w' and 'f', which refer to whole loans (funded by one investor) and fractional loans (funded by multiple investors). For ease of interpretation, I will change the values from 'w' and 'f' to 'whole' and 'fractional', respectively.

---

In [106]:
df.initial_list_status = df.initial_list_status.str.replace('w', 'whole')
df.initial_list_status = df.initial_list_status.str.replace('f', 'fractional')

In [107]:
df.initial_list_status.value_counts()

whole         409434
fractional    405183
Name: initial_list_status, dtype: int64
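
---
An alternative to the two substring replacements is an exact-match lookup with Series.map. A minimal sketch on toy values:

```python
import pandas as pd

# Toy column with the two codes used by initial_list_status.
status = pd.Series(['w', 'f', 'f', 'w'])

# Exact-match lookup: each code is replaced by its full label.
labels = {'w': 'whole', 'f': 'fractional'}
status_named = status.map(labels)
print(status_named.value_counts())
```

With map, any code that is not in the dictionary becomes NaN, which makes unexpected values easy to spot.

---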

In [108]:
df.disbursement_method.value_counts()

Cash         813534
DirectPay      1083
Name: disbursement_method, dtype: int64

In [109]:
df.policy_code.value_counts()

1.0    814617
Name: policy_code, dtype: int64

---
Even though the policy_code column is a float, the feature itself is categorical. Values of 1 indicate publicly available policy codes whereas values of 2 are meant to indicate policy codes not publicly available. However, my data contains only one value: 1 (publicly available policy codes). Therefore, I can drop this column.

---

In [110]:
df.drop('policy_code', axis=1, inplace=True)
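
---
Columns with a single value can also be found programmatically rather than one at a time. A minimal sketch on a toy frame (the column values are illustrative only):

```python
import pandas as pd

# Toy frame: one constant column and one informative column.
toy = pd.DataFrame({
    'policy_code': [1.0, 1.0, 1.0],
    'loan_amnt': [5000.0, 2500.0, 2400.0],
})

# A column with exactly one distinct value (counting NaN as a value) carries no information.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
toy_reduced = toy.drop(columns=constant_cols)
print(constant_cols)
```

---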

### Date to Age

---
The column earliest_cr_line indicates the date that the borrower's earliest reported line of credit (LOC) was opened. In order to include it in a machine learning model, I will change it from a date to an age, i.e. approximate number of days since the earliest LOC was opened. The age will be relative to the issue date as I want the feature to represent the amount of time between the earliest LOC and the issue date of the Lending Club loan.

---

In [111]:
df['time_since_1st_LOC'] = df.issue_d - df.earliest_cr_line
# New column to represent difference between issue date and date of earliest credit line.

In [112]:
df.time_since_1st_LOC.head(10)

0    4627 days
1    5782 days
2    5813 days
3    2586 days
4    2344 days
5    1795 days
6    2647 days
7    5327 days
8   10530 days
9    5082 days
Name: time_since_1st_LOC, dtype: timedelta64[ns]

---
The difference has been calculated in days. As both issue_d and earliest_cr_line only listed the month and year, the datetime objects set the day of every record to the 1st of the month. Therefore, time_since_1st_LOC is an approximation rather than the exact number of days.

---

In [113]:
df.time_since_1st_LOC = df.time_since_1st_LOC / np.timedelta64(1, 'D')  # Converts n days to n.

In [114]:
df.time_since_1st_LOC = df.time_since_1st_LOC.map(lambda x: float(x))    # Convert type to float.

In [115]:
df.time_since_1st_LOC.head(10)

0     4627.0
1     5782.0
2     5813.0
3     2586.0
4     2344.0
5     1795.0
6     2647.0
7     5327.0
8    10530.0
9     5082.0
Name: time_since_1st_LOC, dtype: float64
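
---
The same conversion can be done through the .dt.days accessor of a timedelta column. A minimal sketch, assuming both columns are already datetime64 (the dates are toy values, with the day set to the 1st as described above):

```python
import pandas as pd

# Toy dates, one row per loan.
dates = pd.DataFrame({
    'issue_d': pd.to_datetime(['2011-12-01', '2015-06-01']),
    'earliest_cr_line': pd.to_datetime(['1999-04-01', '2015-05-01']),
})

# Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days.
age_days = (dates['issue_d'] - dates['earliest_cr_line']).dt.days.astype(float)
print(age_days)
```

---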

In [116]:
df.drop('earliest_cr_line', axis=1, inplace=True)  # Drop original earliest_cr_line feature.

In [117]:
df.shape

(814617, 33)

---
I have now reduced the number of features to just over 30.

---

In [118]:
df.to_sql(name='LC_Cleaning3', con=engine, if_exists='replace', index = False)

### Match Median Income to Zip Code

---
Now I will map the median household income data to the loan data, joining on zip code.

---

#### Clean Census Zip Codes

---
Firstly, I need to clean the ZCTAs (ZIP Code Tabulation Areas) in the census income data. I will remove the 'ZCTA5 ' prefix and the last two digits of each zip code so that it can be joined to the three-digit zip codes in the loan data.

---

In [119]:
SQL_STRING = '''

select * from "all_incomes"

'''

df = pd.read_sql(SQL_STRING, con=engine)

In [120]:
df.head()

Unnamed: 0,zip,income_2011,income_2012,income_2013,income_2014,income_2015,income_2016,mean_median_household_income
0,ZCTA5 00602,14947.0,15106.0,15663.0,16353.0,16079.0,15106.0,15542.333333
1,ZCTA5 00610,16367.0,16923.0,16707.0,17265.0,17475.0,16923.0,16943.333333
2,ZCTA5 00622,11871.0,12059.0,14281.0,14993.0,15689.0,12059.0,13492.0
3,ZCTA5 00623,16163.0,16447.0,17389.0,17044.0,16593.0,16447.0,16680.5
4,ZCTA5 00624,14475.0,15500.0,14768.0,15467.0,15573.0,15500.0,15213.833333


In [121]:
df.shape

(33120, 8)

In [122]:
df.zip = df.zip.str.replace('ZCTA5 ', '')  # Remove 'ZCTA5'.

In [123]:
df.head()

Unnamed: 0,zip,income_2011,income_2012,income_2013,income_2014,income_2015,income_2016,mean_median_household_income
0,00602,14947.0,15106.0,15663.0,16353.0,16079.0,15106.0,15542.333333
1,00610,16367.0,16923.0,16707.0,17265.0,17475.0,16923.0,16943.333333
2,00622,11871.0,12059.0,14281.0,14993.0,15689.0,12059.0,13492.0
3,00623,16163.0,16447.0,17389.0,17044.0,16593.0,16447.0,16680.5
4,00624,14475.0,15500.0,14768.0,15467.0,15573.0,15500.0,15213.833333


In [124]:
df.zip = df.zip.map(lambda x: x[0:3])    # Slice first 3 digits of zip code.

In [125]:
df.head()

Unnamed: 0,zip,income_2011,income_2012,income_2013,income_2014,income_2015,income_2016,mean_median_household_income
0,006,14947.0,15106.0,15663.0,16353.0,16079.0,15106.0,15542.333333
1,006,16367.0,16923.0,16707.0,17265.0,17475.0,16923.0,16943.333333
2,006,11871.0,12059.0,14281.0,14993.0,15689.0,12059.0,13492.0
3,006,16163.0,16447.0,17389.0,17044.0,16593.0,16447.0,16680.5
4,006,14475.0,15500.0,14768.0,15467.0,15573.0,15500.0,15213.833333


In [126]:
df.tail()

Unnamed: 0,zip,income_2011,income_2012,income_2013,income_2014,income_2015,income_2016,mean_median_household_income
33115,998,75500.0,56667.0,56250.0,56250.0,82813.0,56667.0,64024.5
33116,998,73000.0,71583.0,71667.0,72868.0,69318.0,71583.0,71669.833333
33117,998,46875.0,49167.0,62813.0,70625.0,54375.0,49167.0,55503.666667
33118,999,54271.0,50875.0,54554.0,57083.0,61908.0,50875.0,54927.666667
33119,999,17639.0,19107.0,19000.0,17946.0,19861.0,19107.0,18776.666667
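
---
The two cleaning steps can be chained with the .str accessor. A minimal sketch on toy ZCTA labels, keeping the values as strings so that leading zeros are preserved:

```python
import pandas as pd

# Toy ZCTA labels in the same format as the census zip column.
zcta = pd.Series(['ZCTA5 00602', 'ZCTA5 99801', 'ZCTA5 07030'])

# Remove the literal prefix, then keep the first three characters of the zip code.
zip3 = zcta.str.replace('ZCTA5 ', '', regex=False).str[:3]
print(zip3.tolist())
```

---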


In [128]:
df.to_sql(name='all_incomes2', con=engine, if_exists='replace', index = False)
# Save to SQL.

#### Join Census to Loans

---
To join the loan and census data, I must first group the census records by their three-digit zip code, since that is the level of detail available in the loan data. Each group's income will be the average of mean_median_household_income across the five-digit zip codes it contains. To make the grouped census data available for the join, I will create an SQL view.

I created a view in pgAdmin4 using the following code:

`create view myview as 
 select zip, avg(mean_median_household_income) as zip_income from "all_incomes2"
 group by zip`
 
This groups by zip so that I have a mean income value for each three-digit zip code. I can then join this view onto my main dataframe to create a new feature called zip_income.

---
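
---
For reference, the grouping performed by the view can be sketched in pandas as follows. This uses a toy frame with made-up incomes and is an equivalent illustration, not the code that was run. As with SQL avg, a zip whose incomes are all missing gets a null average.

```python
import pandas as pd

# Toy census rows after the zip codes have been cut to three digits.
incomes = pd.DataFrame({
    'zip': ['006', '006', '998', '999'],
    'mean_median_household_income': [15000.0, 17000.0, 64000.0, None],
})

# One row per three-digit zip, holding the average income of its members.
zip_income = (
    incomes.groupby('zip', as_index=False)['mean_median_household_income']
    .mean()
    .rename(columns={'mean_median_household_income': 'zip_income'})
)
print(zip_income)
```

---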

In [129]:
# Select loan data and view, joining on zip code.
SQL_STRING = '''

select lc.*, mv.zip_income from "LC_Cleaning3" as lc
inner join myview as mv on lc.zip_code = mv.zip

'''

df = pd.read_sql(SQL_STRING, con=engine)

In [130]:
df.head()

Unnamed: 0,id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,emp_title,emp_length,home_ownership,...,total_acc,initial_list_status,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens,disbursement_method,time_since_1st_LOC,zip_income
0,2,2500.0,2500.0,2500.0,60 months,15.27,59.83,Ryder,< 1 year,RENT,...,4.0,fractional,0.0,0.0,0.0,0.0,0.0,Cash,4627.0,45064.0
1,4,10000.0,10000.0,10000.0,36 months,13.49,339.31,AIR RESOURCES BOARD,10+ years,RENT,...,37.0,fractional,0.0,0.0,0.0,0.0,0.0,Cash,5782.0,67071.007246
2,5,3000.0,3000.0,3000.0,60 months,12.69,67.79,University Medical Group,1 year,RENT,...,38.0,fractional,0.0,0.0,0.0,0.0,0.0,Cash,5813.0,55278.30303
3,6,5000.0,5000.0,5000.0,36 months,7.9,156.46,Veolia Transportaton,3 years,RENT,...,12.0,fractional,0.0,0.0,0.0,0.0,0.0,Cash,2586.0,69793.462121
4,7,7000.0,7000.0,7000.0,60 months,15.96,170.08,Southern Star Photography,8 years,RENT,...,11.0,fractional,0.0,0.0,0.0,0.0,0.0,Cash,2344.0,45574.417829


In [131]:
df.shape

(814379, 34)
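
---
The row count has gone from 814,617 (the table saved as LC_Cleaning3) to 814,379, so the inner join left out 238 loans whose zip_code has no match in the view. One way to inspect such rows is a left merge in pandas with indicator=True. The sketch below uses toy frames with hypothetical values.

```python
import pandas as pd

# Toy loan rows and toy grouped incomes; zip '000' has no census match.
loans = pd.DataFrame({'id': [1, 2, 3], 'zip_code': ['945', '070', '000']})
zip_income = pd.DataFrame({'zip': ['945', '070'], 'zip_income': [67071.01, 55278.30]})

# A left merge keeps every loan; the _merge column records whether a match was found.
audit = loans.merge(zip_income, how='left', left_on='zip_code', right_on='zip', indicator=True)
unmatched = audit[audit['_merge'] == 'left_only']
print(unmatched[['id', 'zip_code']])
```

---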

In [132]:
# Round the averaged median income to the nearest cent.
df.zip_income = df.zip_income.round(2)

In [133]:
# Drop zip code and keep zip_income.
df.drop('zip_code', axis=1, inplace=True)

In [134]:
# Check number of null rows in zip_income.
df.zip_income.isnull().sum()

76

---
There are 76 rows that have null values in zip_income. I will remove these rows.

---

In [135]:
df.dropna(subset=['zip_income'], inplace=True)   # Drop only the rows where zip_income is null.

In [136]:
df.to_sql(name='LC_Cleaning3', con=engine, if_exists='replace', index = False)

### Job Title

---
The emp_title column contains many unique values. I will group these into new categories by looking at some of the most common job titles in the feature.

---

In [137]:
SQL_STRING = '''

select * from "LC_Cleaning3"

'''

df = pd.read_sql(SQL_STRING, con=engine)

In [138]:
df.shape

(814303, 33)

In [139]:
len(df.emp_title.unique())

290645

---
There are more than 290,000 unique job titles in this feature.

---

In [140]:
df.emp_title = df.emp_title.map(lambda x: x.lower())   # Make values lower case.

In [141]:
# Top 40 most common categories in emp_title.
df.emp_title.value_counts().head(40)

teacher                     15042
manager                     14838
owner                        7639
registered nurse             7299
supervisor                   6788
driver                       6561
sales                        6329
rn                           5755
project manager              4573
office manager               4351
general manager              4100
truck driver                 3546
director                     3284
engineer                     3158
sales manager                2742
president                    2708
police officer               2613
operations manager           2571
store manager                2437
vice president               2398
accountant                   2340
administrative assistant     2175
technician                   2092
nurse                        2083
account manager              2065
attorney                     2057
mechanic                     1921
analyst                      1819
assistant manager            1770
executive assi

In [142]:
# Create a new column for the new job title categories.
df['job'] = df['emp_title']

---
The following function will find all rows that contain a particular string. This will allow me to rename jobs based on index using df.loc.

---

In [143]:
def search(series, *words):
    # Return the rows of a string Series that contain every one of the given words.
    return series[np.logical_and.reduce([series.str.contains(word) for word in words])]
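
---
When more than one word is passed, the function keeps only the rows that contain all of them, because the individual matches are combined with a logical AND. A minimal sketch on a toy Series (the job titles are examples only):

```python
import numpy as np
import pandas as pd

def search(series, *words):
    # Keep rows whose text contains every word in `words`.
    return series[np.logical_and.reduce([series.str.contains(word) for word in words])]

# Toy job titles.
jobs = pd.Series(['teacher', 'vibe teacher recruitment', 'army recruiter', 'truck driver'])

one_word = search(jobs, 'teacher')              # rows containing 'teacher'
both = search(jobs, 'teacher', 'recruit')       # rows containing both words
print(one_word.index.tolist())
print(both.index.tolist())
```

The returned Series keeps the original index, which is what allows the matching rows to be renamed with df.loc.

---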

In [144]:
# Call function to search for records containing 'teacher'.
search(df.job, 'teacher')

903         global teachers research and resources
4677         teachers college, columbia university
8828                               source4teachers
14806     state teachers retirement system of ohio
23401                teachers federal credit union
23634     state teachers retirement system of ohio
24552                             teachers college
32687                     vibe teacher recruitment
36564                                      teacher
36572                                teacher coach
36580                                      teacher
36584                                      teacher
36644                                      teacher
36685                                      teacher
36698                                      teacher
36722                   teacher/guidance counselor
36725                                      teacher
36736                                      teacher
36747                                      teacher
36783                          

---
There are more than 18,905 records with 'teacher' in the job title. Some of these are recruitment jobs rather than teaching jobs, so I will run a separate search for 'recruit'.

---

In [145]:
search(df.job, 'recruit')   # Check recruitment jobs.

32687                   vibe teacher recruitment
37640                         recruiting manager
37797                             army recruiter
37940               vice president of recruiting
38388                                  recruiter
38493                           senior recruiter
38840                           senior recruiter
39126                        technical recruiter
39156        deputy chief, recruiting operations
39690                    sr. technical recruiter
40285                                  recruiter
40562                     recruiting coordinator
40915                     recruiting coordinator
41226                                  recruiter
43988                               sr recruiter
44101                                  recruiter
44522                        corporate recruiter
45923                        territory recruiter
46090           lead customer research recruiter
47086                           senior recruiter
47215               

---
I now have the index of the 'recruit' jobs. I can rename these records in the job column to 'recruiter' so that they are all grouped as a single category. This will also remove 'teacher' from the recruitment records so that I can group the 'teacher' records as 'teacher' without including recruitment jobs in the teacher category.

---

In [146]:
recruiter = search(df.job, 'recruit').index  # Create object of indexes of 'recruit' jobs.
df.loc[recruiter, 'job'] = 'recruiter'
# Use df.loc to rename job records to 'recruiter' based on 'recruiter' index.

---
With the 'recruit' jobs all renamed to 'recruiter', they have now been removed from the 'teacher' search. I can now rename 'teacher' records to 'teacher' without including any recruitment jobs.

---

In [147]:
teacher = search(df.job, 'teacher').index
df.loc[teacher, 'job'] = 'teacher'

In [148]:
search(df.job, 'assistant')   # Search for assistants.

18643                               call assistant
36554         assistant director - human resources
36560                            medical assistant
36602                     administrative assistant
36710                      sr. executive assistant
36714                             office assistant
36826                          assistant professor
36828                              legal assistant
36847                      assistant store manager
36848     executive assistant to board of director
36855                            medical assistant
36875                          executive assistant
36878                         assistant controller
36883                emergency physician assistant
36937                     administrative assistant
37022                       hs assistant principal
37034     assistant director of document managemen
37086                         front desk assistant
37107                          executive assistant
37128                      assi

In [149]:
assistant = search(df.job, 'assistant').index
df.loc[assistant, 'job'] = 'assistant'

In [150]:
search(df.job, 'manager')     # Search for managers.

2784                       full circle manager
14365     building owner's manager association
27253          monitor liability managers  llc
36177               jmb financial managers inc
36558                       operations manager
36562                          on road manager
36568                       area sales manager
36569                            parts manager
36574                          project manager
36575             manager information delivery
36591                                  manager
36607                                  manager
36615                             case manager
36624                           senior manager
36631                         category manager
36635                                  manager
36639           project manager & web marketer
36641                                  manager
36646                          general manager
36647                               pr manager
36653                inventory control manager
36659        

In [151]:
manager = search(df.job, 'manager').index
df.loc[manager, 'job'] = 'manager'

In [152]:
search(df.job, 'nurse')    # Search for nurses.

8201                                 nurse exchange
9911                                  medsol nurses
15126                                 nurse on call
15619                    visiting nurse association
18745           geriatric nurse practitioners, inc.
19583                             village nurseries
19809             visting nurse service of new york
19820                    nurse rx / american mobile
21851                                 nurse on call
22772         visiting nurse association of vt & nh
24970                        calloways nursery, inc
25181                            flowerwood nursery
26402                                 nurse on call
27180                 visiting nurse services of ny
27289                                  pike nursery
28064                                 nurse finders
29370      southern colorado critical care nurses, 
32302                              bordines nursery
32370     orlando health visiting nurse association
35159       

---
There are a number of nurseries included in this search. I will do a separate search for 'nursery' and rename those records so that they are excluded from the 'nurse' search.

---

In [153]:
search(df.job, 'nursery')    # Search for nursery.

24970                      calloways nursery, inc
25181                          flowerwood nursery
27289                                pike nursery
32302                            bordines nursery
35989                            fraleigh nursery
80972                        redlands day nursery
87151                  hopewell wholesale nursery
97084           elderberry creek farm and nursery
99474                                pike nursery
102115                            wilsons nursery
102458                      community day nursery
102609                       san jose day nursery
107727                        dupont nursery inc.
124282                         newark day nursery
131039                  a. ferrucci & son nursery
140347                         crisis nursery inc
143616            mesquite valley growers nursery
144935    weecare nursery school and kindergarten
150245                           raintree nursery
153460                     williams plant nursery


---
There are only a few records with 'nursery'. I will rename these to 'other'.

---

In [154]:
nursery = search(df.job, 'nursery').index
df.loc[nursery, 'job'] = 'other'      # Rename to 'other'.

In [155]:
nurse = search(df.job, 'nurse').index
df.loc[nurse, 'job'] = 'nurse'
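
---
A plain substring search for 'nursery' does not match the plural 'nurseries', so a title such as 'village nurseries' would still be picked up by the 'nurse' search. One possible alternative, shown here only as a sketch on toy titles, is a regular expression with word boundaries so that 'nurse' and 'nurses' match as whole words.

```python
import pandas as pd

# Toy job titles covering both nurses and nurseries.
jobs = pd.Series(['registered nurse', 'pike nursery', 'medsol nurses', 'village nurseries'])

# \b marks a word boundary, so 'nursery' and 'nurseries' do not match.
is_nurse = jobs.str.contains(r'\bnurses?\b', regex=True)
print(jobs[is_nurse].tolist())
```

---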

In [156]:
search(df.job, 'owner')   # Search for owners.

3455                       wyndham vacation ownership
13937                      wyndham vacation ownership
18096                      homeowners financial group
19132                           auto-owners insurance
30432                               resales buy owner
33178     community homeownership counseling services
36810                                           owner
39562                                           owner
40197                                owner / operator
40526                                           owner
41109               content owner, business sales ops
41421                                           owner
41679                                   process owner
43086                                owner/worker bee
43327                                           owner
43841                               president / owner
44341                                           owner
44632                                    agency owner
44748                       

In [157]:
owner = search(df.job, 'owner').index
df.loc[owner, 'job'] = 'owner'

In [158]:
search(df.job, 'driver')    # Search for drivers.

4992                             driver
19991     redriver federal credit union
36571                            driver
36658                      truck driver
36673                            driver
36677                            driver
36679                   forklift driver
36689                        p&d driver
36700                      truck driver
36748                      truck driver
36760                 school bus driver
36936          therapist/driver trainer
36972                            driver
37038             transportation driver
37078                            driver
37080                package car driver
37137               professional driver
37189                    company driver
37204                            driver
37219                      truck driver
37240                       semi driver
37260                      truck driver
37288                      relay driver
37308                            driver
37383                            driver


In [159]:
driver = search(df.job, 'driver').index
df.loc[driver, 'job'] = 'driver'

In [160]:
search(df.job, 'supervisor')     # Search for supervisors.

1550                                 supervisor
36573              street operations supervisor
36634                           dept supervisor
36657                    gas station supervisor
36676               customer service supervisor
36695                                supervisor
36777                                supervisor
36804                               supervisor 
36838                                supervisor
36873               customer service supervisor
36893                            evs supervisor
36992                             ap supervisor
37004                    supervisory k9 officer
37014                          admin supervisor
37120                                supervisor
37126                                supervisor
37158                            job supervisor
37186                     purchasing supervisor
37267                                supervisor
37320     electrical engineer/signal supervisor
37326                         safety sup

In [161]:
supervisor = search(df.job, 'supervisor').index
df.loc[supervisor, 'job'] = 'supervisor'

In [162]:
search(df.job, 'director')    # Search for directors.

22304                              white directory
23327                         all star directories
36617                  associate director of sales
36627                           executive director
36632               director, crisis response team
36661                            director of sales
36681                            director of sales
36711              director of strategic marketing
36712                                     director
36724                           marketing director
36764             rn director of surgical services
36790                            director of sales
36796               director of revenue management
36797        director and assoc teaching professor
36798              director, small business center
36805                       director of admissions
36807        planning & inventory control director
36840     senior director, organizational effectiv
36881                  director of human resources
36889                          

---
The 'director' search also included directory and directories jobs. I will rename 'directory' and 'directories' to 'other' to remove them from the 'director' search.

---

In [163]:
directory = search(df.job, 'directory').index
df.loc[directory, 'job'] = 'other'

In [164]:
directories = search(df.job, 'directories').index
df.loc[directories, 'job'] = 'other'

In [165]:
director = search(df.job, 'director').index
df.loc[director, 'job'] = 'director'

In [166]:
search(df.job, 'vice president')     # Search for vice presidents.

36640                              vice president
36772                              vice president
36802                              vice president
36949                              vice president
37030                              vice president
37094                    executive vice president
37209                              vice president
37562                              vice president
37570                              vice president
37822               vice president, due diligence
38070                    vice president, tech ops
38477                              vice president
38783           area vice president of operations
39252                              vice president
39355                 vice president - production
39412                              vice president
39555                              vice president
40281                              vice president
40655                vice president - investments
40984        vice president of technical services


In [167]:
vice_pres = search(df.job, 'vice president').index
df.loc[vice_pres, 'job'] = 'vice_pres'
# Rename to 'vice_pres' so that these records are not included in the later 'president' search.

In [168]:
search(df.job, 'president')      # Search for presidents.

30964                       presidential airways
37129                                  president
37900                                  president
38435                                  president
38533                             vice-president
38803                                  president
38834                                  president
39237                                  president
39539                                  president
40378                                  president
41795                             vice-president
41853                                  president
42040                                  president
42471                                  president
42506                                  president
42550                                  president
42813            chief of staff to the president
43731                                  president
44206                                  president
45040                                  president
45656               

---
The 'president' search also included 'vice-president' and 'vicepresident'. I will rename these to 'vice_pres' and then rename 'president' search records to 'president'.

---

In [169]:
vice_pres2 = search(df.job, 'vice-president').index
df.loc[vice_pres2, 'job'] = 'vice_pres'

In [170]:
vice_pres3 = search(df.job, 'vicepresident').index
df.loc[vice_pres3, 'job'] = 'vice_pres'

In [171]:
president = search(df.job, 'president').index
df.loc[president, 'job'] = 'president'
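
The order of these renames matters, because 'president' is a substring of 'vice president'. Below is a minimal sketch on a toy Series, assuming the `search` helper does a plain substring match. `search_sketch` and the sample values are hypothetical stand-ins, not the helper defined earlier in the notebook.

```python
import pandas as pd

# Toy stand-in for df.job (illustrative values only).
jobs = pd.Series(['vice president', 'vice-president', 'president',
                  'presidential airways', 'nurse'])

def search_sketch(series, term):
    """Hypothetical stand-in for `search`: rows whose text contains `term`."""
    return series[series.str.contains(term, regex=False)]

# Relabel the narrow variants first so the broad 'president' match cannot absorb them.
for term in ['vice president', 'vice-president', 'vicepresident']:
    jobs.loc[search_sketch(jobs, term).index] = 'vice_pres'

# The label 'vice_pres' does not contain 'president', so it is safe from this step.
jobs.loc[search_sketch(jobs, 'president').index] = 'president'

print(jobs.tolist())
```

Under this assumption, an employer name such as 'presidential airways' is also relabelled 'president', since it contains the search term.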

In [172]:
search(df.job, 'account')    # Search for accounting jobs.

5                                 mkc accounting 
1411               nd accounting & consulting, pc
2246            comprehensive accounting services
2892                         accountmate software
9893                   central accounting norfolk
11048                          ag accounting, llc
16196        texas comptroller of public accounts
18332                         cm2 accounting, llp
21626     defense finance and accounting services
26915                accounting consultants group
33037                          ag accounting, llc
35558               big 4 audit & accounting firm
35570          accounting/financial advisory firm
36622                                  accountant
36662                  department head accounting
36708                                  accountant
36867                     senior staff accountant
36887                           senior accountant
36913                            accounts payable
36915                           account executive


In [173]:
account = search(df.job, 'account').index
df.loc[account, 'job'] = 'accountant'

In [174]:
search(df.job, 'police')   # Search for police jobs.

1872         borough of prospect park police dept
2306              new york city police department
2720                          chicago police dept
3197            city of durango police department
3436                                dallas police
3483            pocono township police department
3827                       ucsf police department
3995              new york city police department
4311                               clayton police
4595                       south salt lake police
4751           st. louis county police department
4802              new york city police department
5165             clayton county police department
5193                   tuckahoe police department
5321              stevens point police department
5539                      st. louis county police
5666                          raleigh police dept
5969                 monterey airport dist police
6270                         demarest police dept
7245                       bridgeport police dept


In [175]:
police = search(df.job, 'police').index
df.loc[police, 'job'] = 'police'

In [176]:
search(df.job, 'law')    # Search for law jobs.

34                                sharp lawn inc.
499                             lawrence hospital
1144         wilson ford and lovelace law officce
1594                              lawson software
1916                        naturalawn of america
1984                   uop mcgeorge school of law
2142                                  ip law firm
2343                              toler law group
2442                                     law cash
2570                 the john marshall law school
2647                delaware river port authority
2703            heaven sent lawn maintenance inc.
2775                     secondary school for law
2851                               udm law school
3126       lawrence livermore national laboratory
3187                           lexington law firm
3551                               ramji law firm
3771                aqua sun lawn and landscaping
4224      federal law enforcement training center
4325            greater lawrence technical school


---
The 'law' search is too broad to rename directly: it also matched unrelated records such as 'lawn' businesses, 'lawrence' and 'delaware'. It included 'law enforcement' and 'law-enforcement', which I will rename to 'police'. It also included 'law firm' and 'lawyer'. I will do separate searches for 'law firm' and 'lawyer', as well as 'solicitor' and 'attorney', to create a 'law' category.

---
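
One possible alternative to the separate searches, not used in this notebook, is a word-boundary regular expression that only matches 'law' as a whole word. The sketch below uses a toy Series with values taken from the output above and assumes plain `str.contains` behaviour.

```python
import pandas as pd

# A few employer names from the 'law' search output.
jobs = pd.Series(['sharp lawn inc.', 'lawrence hospital', 'ip law firm',
                  'delaware river port authority', 'udm law school'])

# Plain substring match: every row contains the letters 'law'.
substring_hits = jobs[jobs.str.contains('law', regex=False)]

# \b marks a word boundary, so 'lawn', 'lawrence' and 'delaware' no longer match.
whole_word_hits = jobs[jobs.str.contains(r'\blaw\b', regex=True)]

print(len(substring_hits))
print(whole_word_hits.tolist())
```

A whole-word match would still keep entries such as 'udm law school', so the results would need the same manual review as the searches in this notebook.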

In [177]:
law_enf = search(df.job, 'law enforcement').index
df.loc[law_enf, 'job'] = 'police'

In [178]:
law_enf2 = search(df.job, 'law-enforcement').index
df.loc[law_enf2, 'job'] = 'police'

In [179]:
search(df.job, 'law firm')   # Search for law firms.

2142                                        ip law firm
3187                                 lexington law firm
3551                                     ramji law firm
6780                                the gibson law firm
8422                           joyce and reyes law firm
8562                                  stelzner law firm
10282                            brooks pierce law firm
10383                                 daphneys law firm
11832                                          law firm
12099                              howard rice law firm
12191                                   o'neil law firm
12561                                          law firm
13009                               spottswood law firm
15027                                          law firm
15843                                          law firm
17086                                          law firm
17464                                    small law firm
24554                           washington, dc l

In [180]:
search(df.job, 'lawyer')   # Search for lawyers.

7390                  lawyers title of az, inc
9065            your jacksonville lawyer, p.a.
45549                                   lawyer
57298                                   lawyer
60726                                   lawyer
86137     los angeles dependency lawyers, inc.
89396                            rocket lawyer
95362                 minnesota lawyers mutual
126618          your jacksonville lawyer, p.a.
130735           international lawyers network
173153    san diego county bar lawyer referral
183705           international lawyers network
191335       texas lawyers' insurance exchange
202200                   american lawyer media
216337                                  lawyer
219540                                  lawyer
223762                          lawyer-partner
226916                                  lawyer
234620                                  lawyer
238691                                  lawyer
244164                                  lawyer
250707       

In [181]:
search(df.job, 'solicitor')    # Search for solicitors.

152295    9th circuit solicitor's office -berkeley
299968                                   solicitor
386428                    associate city solicitor
504562                              bail solicitor
532567                    associate city solicitor
562663                              loan solicitor
686916                                   solicitor
Name: job, dtype: object

In [182]:
search(df.job, 'attorney')    # Search for attorneys.

299          kings county district attorney's office
354                            ohio attorney general
1201              attorneys title fund services, llc
2143                          texas attorney general
4542                  utah attorney general's office
8366                    ct attorney general's office
11145       granowitz white & weber attorneys at law
11325                           leigh hart  attorney
13229                            us attorneys office
14420             richard w. gibson, attorney at law
15902                                 state attorney
19918        kings county district attorney's office
20063                         u.s. attorney's office
21261        ganowitz white & weber attorneys at law
22702                     attorney james golden, jr.
23781        6th judicial district attorney's office
24369           19th circuit state attorney's office
24720     connors and sullivan attorneys at law pllc
24907       dept. of justice/attorney general 

In [183]:
law_firm = search(df.job, 'law firm').index
df.loc[law_firm, 'job'] = 'law'

In [184]:
lawyer = search(df.job, 'lawyer').index
df.loc[lawyer, 'job'] = 'law'

In [185]:
solicitor = search(df.job, 'solicitor').index
df.loc[solicitor, 'job'] = 'law'

In [186]:
attorney = search(df.job, 'attorney').index
df.loc[attorney, 'job'] = 'law'
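
The four renames above could also be written as a single pass. This is a sketch under the assumption that a substring match is what is wanted. It joins the terms into one regular expression, and the toy Series is illustrative only.

```python
import re
import pandas as pd

jobs = pd.Series(['ramji law firm', 'lawyer', 'associate city solicitor',
                  'texas attorney general', 'sharp lawn inc.'])

law_terms = ['law firm', 'lawyer', 'solicitor', 'attorney']

# re.escape guards against terms that contain regex metacharacters.
pattern = '|'.join(re.escape(term) for term in law_terms)

# Boolean mask of rows that contain any of the terms.
is_law = jobs.str.contains(pattern, regex=True)
jobs.loc[is_law] = 'law'

print(jobs.tolist())
```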

In [187]:
search(df.job, 'bank')   # Search for bank jobs.

22                             wells fargo bank
39                                citizens bank
139                            bank of the west
169                                     us bank
202                         jpmorgan chase bank
209                        jp morgan chase bank
244            first national bank of st. louis
275                kearny ferderal savings bank
362                       sterling savings bank
410                                m and t bank
433                 national bank of california
460                                  chase bank
464                             bank of america
486                             bank of america
494                                     us bank
705                             bank of america
784                            bank of oklahoma
786       coldwell banker residential brokerage
883          coldwell banker pacific properties
917                            food bank of cny
918                                     

---
The 'bank' search also included 'food bank' jobs, which I will rename to 'other'. It also included 'coldwell banker', which is a real estate company. I will do separate searches for 'coldwell banker', 'real estate', 'property', and 'properties' and rename them all to 'property'. I can then rename the remaining 'banker' records to 'banker'.

---

In [188]:
search(df.job, 'food bank')  # Search for food banks.

917                           food bank of cny
8112                     north texas food bank
22465                     the food bank of wma
31152                southeast texas food bank
92302     food bank of contra costa and solano
99448                         placer food bank
100785              georgia mountain food bank
108534                    harvesters food bank
117076             atlanta community food bank
120118             atlanta community food bank
130528               blue ridge area food bank
140900                          utah food bank
176853                        oregon food bank
178271                            ct food bank
185118                     community food bank
Name: job, dtype: object

In [189]:
food_bank = search(df.job, 'food bank').index
df.loc[food_bank, 'job'] = 'other'

In [190]:
search(df.job, 'coldwell banker')  # Search for Coldwell Banker jobs.

786          coldwell banker residential brokerage
883             coldwell banker pacific properties
2791                          coldwell banker bain
30241            coldwell banker the aspen brokers
30568                              coldwell banker
30698                              coldwell banker
33216                              coldwell banker
35037                              coldwell banker
85423                 coldwell banker d'ann harper
95441                          coldwell banker one
101363                 coldwell banker west realty
103203                 coldwell banker real estate
103688                             coldwell banker
112662                             coldwell banker
115897                 coldwell banker real estate
116072                 coldwell banker west realty
126220                 coldwell banker real estate
127346            coldwell banker united, realtors
133178       coldwell banker residential brokerage
135248                         

In [191]:
search(df.job, 'real estate')    # Search for real estate jobs.

168                          edgestone real estate
2999                   commercial real estate firm
4270              greenstreet real estate partners
4834                        devonshire real estate
5513         american real estate associates, inc.
5526                       wells real estate funds
7124                            insite real estate
8082                            hecker real estate
9233                       wells real estate funds
10664                           simons real estate
11963                 green real estate group, llc
13353                 green real estate group  llc
13421      washington real estate investment trust
13680                 vintage real estate services
14757                            asset real estate
16301                          pinkham real estate
22579                              bnc real estate
24849          pcv murcor - real estate appraisals
26836           boston and north shore real estate
28700       grubb and ellis com

In [192]:
search(df.job, 'property')     # Search for property jobs.

733            gilchrist county property appraiser
1652           lighthouse property management inc.
5242                        simpson property group
5730      optimum professional property management
6479                  investment property vultures
8678                 schernecker property services
9719                west coast property management
10682                     chesapeake property cons
10818                      h&j property management
10852                         simon property group
14411                   lamden property management
15587                hmmy property management corp
15784                  assurant specialty property
15948                      sts property management
18003                               boxer property
19080                   celtic property management
19381                  g&k property management llc
21988                schernecker property services
22535                 lewis property management co
23249         palm beach county

In [193]:
search(df.job, 'properties')   # Search for properties.

547                       heritage properties inc.
883             coldwell banker pacific properties
2298                     usa properties fund, inc.
2822                           jl properties, inc.
3390                 liberty investment properties
3716      professional properties management, inc.
4423                       branch properties, inc.
5014                          southwind properties
5400                           st. john properties
6489                          tarantino properties
7095                          bre properties, inc.
7588                     starpoint properties, llc
7653                                avr properties
7665                          lion properties, llc
7923                      sudberry properties, inc
8321                             kohner properties
8846              national retail properties, inc.
12270                            gordon properties
15090                         fairfield properties
15265                         b

In [194]:
cold_bank = search(df.job, 'coldwell banker').index
df.loc[cold_bank, 'job'] = 'property'

In [195]:
real_estate = search(df.job, 'real estate').index
df.loc[real_estate, 'job'] = 'property'

In [196]:
properties = search(df.job, 'properties').index
df.loc[properties, 'job'] = 'property'

In [197]:
properties2 = search(df.job, 'property').index
df.loc[properties2, 'job'] = 'property'

In [198]:
banker = search(df.job, 'banker').index
df.loc[banker, 'job'] = 'banker'
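
The sequence above can be summarised as an ordered list of (search term, label) rules, where earlier rules take priority. The sketch below is a compact restatement on a toy Series, assuming substring matching. The `rules` list and sample values are illustrative and not part of the original notebook.

```python
import pandas as pd

jobs = pd.Series(['food bank of cny', 'coldwell banker real estate',
                  'simon property group', 'personal banker', 'wells fargo bank'])

# Order matters: 'coldwell banker' must be relabelled before the 'banker' rule runs.
rules = [
    ('food bank', 'other'),
    ('coldwell banker', 'property'),
    ('real estate', 'property'),
    ('properties', 'property'),
    ('property', 'property'),
    ('banker', 'banker'),
]

for term, label in rules:
    # Relabel every row that still contains the term.
    jobs.loc[jobs.str.contains(term, regex=False)] = label

print(jobs.tolist())
```

In this sketch an employer name that contains only 'bank', such as 'wells fargo bank', is not matched by the 'banker' rule and is left unchanged.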

In [199]:
search(df.job, 'consultant')   # Search for consultants.

24                 winfield pathology consultants
2525                           matrix consultants
2970             expert technical consultants inc
3762                        princeton consultants
4354                        geosyntec consultants
4505                 scheduling consultants, ltd.
4666                  texas inpatient consultants
6460      critical care and pulmonary consultants
6811                 icon information consultants
7751                          spectal consultants
8727           environ strategy consultants, inc.
9262               all world language consultants
10379                       geosyntec consultants
10601             coastal engineering consultants
10811                       geosyntec consultants
11982                monument medical consultants
12372               parish anesthesia consultants
15584              aegis construction consultants
15746          leighfisher management consultants
16733                consultant engineering, inc.


In [200]:
consultant = search(df.job, 'consultant').index
df.loc[consultant, 'job'] = 'consultant'

In [201]:
search(df.job, 'engineer')    # Search for engineers.

315                                  jacobs engineering
336                          us army corps of engineers
1066                         us army corps of engineers
2457                            pacific engineers group
2772                    milwaukee school of engineering
2932                  plant engineering and maintenance
3023                         us army corps of engineers
3101                nei contracting & engineering, inc.
3904           ebert & baumann consulting engineers, in
4000                                sunrise engineering
4027                     r. g. vanderweil engineers llp
4301                              truevance engineering
4510                       us army aviation engineering
4705                        becht engineering co., inc.
4865                    madison county engineers office
5160                       engineered machined products
5449      florence  and  hutcheson consulting engineers
5934                               ccjm engineer

---
The 'engineer' search also included army engineers (for example 'us army corps of engineers'). I will do a separate search for 'army' and rename those records to 'army' before grouping the engineers.

---

In [202]:
search(df.job, 'army')      # Search for army jobs.

70                       tx army national guard
153                                     us army
331                                     us army
336                  us army corps of engineers
644                                   u.s. army
684                         army national guard
724                      16th mp bde, u.s. army
922                          department of army
972                                     us army
1045                                  u.s. army
1066                 us army corps of engineers
1132                                  u.s. army
1179                                    us army
1262                                    us army
1371                                   us army 
1374                                    us army
1389                                    us army
1443                                    us army
1499                                  u.s. army
1585                         the salvation army
1751                         united stat

In [203]:
army = search(df.job, 'army').index
df.loc[army, 'job'] = 'army'

In [204]:
engineer = search(df.job, 'engineer').index
df.loc[engineer, 'job'] = 'engineer'

In [205]:
search(df.job, 'technician')     # Search for technicians.

36567                    surgical technician
36577          technician support specialist
36581                        hvac technician
36589                       alarm technician
36664                       field technician
36693                             technician
36751                     network technician
36794                  spacecraft technician
36817               mental health technician
36866                      supply technician
36900          regulatory affairs technician
36916                    pharmacy technician
36948                 electronics technician
36996               tire and lube technician
37045                     service technician
37327          industrial hygiene technician
37334                  board test technician
37336                        hvac technician
37491                  electronic technician
37571        certified ophthalmic technician
37602               human service technician
37660                             technician
37693     

In [206]:
technician = search(df.job, 'technician').index
df.loc[technician, 'job'] = 'technician'

In [207]:
search(df.job, 'mechanic')   # Search for mechanics.

2942                                    mechanical inc
3596                                forsyth mechanical
5095                   highland mechanical contractors
6100                                    rac mechanical
6720                     mdm mechanical services, inc.
7500                            performance mechanical
8413                            mechanical contractors
8528                                   city mechanical
10200     westmoreland mechanical testing and research
13621                             ppc mechanical seals
15016                                renkow mechanical
16166                                elvins mechanical
16869                                  hart mechanical
17152                       armistead mechanical, inc.
17547                                delta mechanical 
17922                     mdm mechanical services, inc
19739                                  lusk mechanical
20462                            interstate mechanical
23379     

In [208]:
mechanic = search(df.job, 'mechanic').index
df.loc[mechanic, 'job'] = 'mechanic'

In [209]:
search(df.job, 'sales')    # Search for sales jobs.

775                      aarons sales and lease
1030                   acosta sales & marketing
1953                  norris sales company inc.
2556                           hayes auto sales
2571                           hayes auto sales
3008              quaker sales and distribution
3741                          costco wholesales
3777                                  mvp sales
6737                            oasis sales inc
8728            advantage sales  and  marketing
9111                       harlequin sales corp
9354                 all-state ford truck sales
12161                uniform sales assoc., inc.
12563                       sheridan ford sales
15239                       walter's auto sales
17147                        chapman ford sales
17302                       comins lumber sales
18071                  aaa electric motor sales
18496                      get fresh sales, inc
18741               stainless sales corporation
19133                            cellula

In [210]:
sales = search(df.job, 'sales').index
df.loc[sales, 'job'] = 'sales'

In [211]:
search(df.job, 'analyst')    # Search for analysts.

35447                    analysts inernational
36576                    lead business analyst
36598                             data analyst
36599                               it analyst
36618                          systems analyst
36651             information security analyst
36729                             data analyst
36738                    reimbursement analyst
36753                    senior policy analyst
36813                    credit review analyst
36827                           equity analyst
36865                    sr. business analyst 
36922                               analyst ii
36924                                  analyst
36945                    financial analyst iii
36952                   senior systems analyst
36969                 meas & reporting analyst
36978                           credit analyst
36981                  sr supply chain analyst
37152                     applications analyst
37159                                analyst 2
37161        

In [212]:
search(df.job, 'scientist')    # Search for scientists.

36856                        biological scientist
36858                          forensic scientist
36993                    bioinformatics scientist
38339                          research scientist
38986     method development specialist-scientist
39205                          research scientist
40122                        research scientist i
40193                            senior scientist
41095        senior clinical laboratory scientist
41230                                   scientist
41339                            pharma scientist
41920               clinical laboratory scientist
42693                        biomedical scientist
43308                              data scientist
44344                      clinical lab scientist
44847                                   scientist
44905                        research scientist 3
45221                associate research scientist
45892                                   scientist
45947                          research scientist


In [213]:
analyst = search(df.job, 'analyst').index
df.loc[analyst, 'job'] = 'analyst'

In [214]:
scientist = search(df.job, 'scientist').index
df.loc[scientist, 'job'] = 'scientist'

In [215]:
search(df.job, 'customer service')    # Search for customer service jobs.

4238               new customer service companies
15205                      apac customer services
29290                      apac customer services
30852       governor's office of customer service
36603             customer service representative
36606                        customer service rep
36927                            customer service
37330                    customer service/drafter
37875                            customer service
38043                            customer service
38181                 team lead, customer service
38273                      customer service agent
38315                           customer service 
38735                            customer service
39192                customer service-call center
39203                        customer service rep
39509                            customer service
39569                            customer service
39676                            customer service
39734                        customer service rep


In [216]:
customer_service = search(df.job, 'customer service').index
df.loc[customer_service, 'job'] = 'customer service'

In [217]:
search(df.job, 'electrician')    # Search for electricians.

34518                        union electrician
36682                              electrician
36843                              electrician
36929                              electrician
37021                              electrician
37106                   electrician aprrentice
37199                       master electrician
37649                   journeyman electrician
37695                              electrician
37866                   journeyman electrician
38195                              electrician
38491                              electrician
38525                  maintanence electrician
39246                      nuclear electrician
39284                              electrician
39388                              electrician
39422                   industrial electrician
39935                              electrician
40119                     aircraft electrician
40156                              electrician
40641                    utilities electrician
40887        

In [218]:
electrician = search(df.job, 'electrician').index
df.loc[electrician, 'job'] = 'electrician'

In [219]:
search(df.job, 'operator')    # Search for operators.

36630                    plant operator
36728                        operator 3
36803                 bottomer operator
36868            manufacturing operator
36879                equipment operator
36903                  laborer/operator
36905                      operator/itr
36914                      bus operator
36959                       ip operator
37143                     line operator
37290                          operator
37331                        operator 2
37345                          operator
37370          heavy equipment operator
37555                  machine operator
37731                          operator
37741                equipment operator
37761               journeyman operator
37813                          operator
37823                          operator
37861        2nd class network operator
37931                equipment operator
37946                  machine operator
38062                       operator ii
38072              fire engine operator


---
I will not group the operators as these jobs are too diverse to be grouped into one category.

---

I now have a number of the most common jobs as new categories. These categories are: 'recruiter', 'teacher', 'assistant', 'manager', 'nurse', 'owner', 'driver', 'supervisor', 'director', 'vice_pres' (vice president), 'president', 'accountant', 'police', 'law', 'property', 'banker', 'consultant', 'army', 'engineer', 'technician', 'mechanic', 'sales', 'analyst', 'scientist', 'customer service', and 'electrician'.

I will set all other values to 'other'.

---

In [220]:
# Rename any job title that is not one of the new categories to 'other'.
df.job = df.job.map(lambda x: 'other' if x not in ['recruiter', 'teacher', 'assistant', 'manager', 'nurse',
                                                   'owner', 'driver', 'supervisor', 'director', 'vice_pres',
                                                   'president', 'accountant', 'police', 'law', 'property',
                                                   'banker', 'consultant', 'army', 'engineer', 'technician',
                                                   'mechanic', 'sales', 'analyst', 'scientist', 'customer service',
                                                   'electrician'] else x)
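
An equivalent vectorised form of this step uses `Series.isin` with `Series.where`, which avoids calling a Python lambda for every row. The sketch below uses a shortened, illustrative category list and toy values.

```python
import pandas as pd

jobs = pd.Series(['manager', 'nurse', 'bail bonds inc', 'vice_pres', None])

# Shortened list for illustration; the notebook uses the full set of categories.
keep = ['manager', 'nurse', 'vice_pres']

# where() keeps values where the condition is True and substitutes 'other' elsewhere.
cleaned = jobs.where(jobs.isin(keep), 'other')

print(cleaned.tolist())
```

Missing values are not in the category list, so they also become 'other', which matches the behaviour of the lambda version.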

In [221]:
df.job.value_counts()

other               457397
manager             101475
director             25911
assistant            24407
engineer             20852
supervisor           20335
analyst              19031
teacher              18904
sales                17324
driver               16056
nurse                14033
technician           13132
accountant           11612
owner                10407
consultant            7566
vice_pres             5106
police                5016
mechanic              4967
customer service      3888
law                   3452
president             3324
banker                2439
electrician           2177
army                  1787
recruiter             1369
property              1285
scientist             1051
Name: job, dtype: int64
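
As a quick check on how much detail the grouping retains, the share of records in 'other' can be worked out from the counts above and the 814303 rows in the dataset. This is plain arithmetic on the figures shown, not an additional result from the notebook.

```python
# Figures copied from the value_counts() output and the row count of df.
other_count = 457397
total_rows = 814303

# Fraction of borrowers whose job title fell into the catch-all category.
share_other = other_count / total_rows

print(f"{share_other:.1%}")
```

On the real DataFrame, `df.job.value_counts(normalize=True)` would give the same proportions for every category directly.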

In [222]:
# Drop original emp_title column and keep new job column.
df.drop('emp_title', axis=1, inplace=True)

In [223]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 814303 entries, 0 to 814302
Data columns (total 33 columns):
id                          814303 non-null int64
loan_amnt                   814303 non-null float64
funded_amnt                 814303 non-null float64
funded_amnt_inv             814303 non-null float64
term                        814303 non-null object
int_rate                    814303 non-null float64
installment                 814303 non-null float64
emp_length                  814303 non-null object
home_ownership              814303 non-null object
annual_inc                  814303 non-null float64
verification_status         814303 non-null object
issue_d                     814303 non-null datetime64[ns]
loan_status                 814303 non-null int64
purpose                     814303 non-null object
addr_state                  814303 non-null object
dti                         814303 non-null float64
delinq_2yrs                 814303 non-null float64
inq_last

In [224]:
df.to_sql(name='LC_Cleaning4', con=engine, if_exists='replace', index=False)

---
The dataset is now clean. I can move on to the second part of the EDA: plotting and analysing the data to look at distributions and relationships between variables.

---