# Financial Data Analysis – Data Processing 1: Loan Eligibility Prediction

[Link](https://www.kdnuggets.com/2018/09/financial-data-analysis-loan-eligibility-prediction.html)

This notebook follows a KDnuggets tutorial (link above) on loan eligibility prediction using the Lending Club loan dataset. Lending Club is the world’s largest online marketplace connecting borrowers and investors. An inevitable outcome of lending is default by borrowers. The idea of this tutorial is to create a predictive model that **identifies applicants who are relatively risky for a loan.**

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 
warnings.filterwarnings("ignore")

Data: three years of loan data (2014, 2015, and the first three quarters of 2017), stored in five separate CSV files.

In [35]:
df1 = pd.read_csv('./data/LoanEligibilityPrediction/2017Q1.csv', skiprows=[0])
df2 = pd.read_csv('./data/LoanEligibilityPrediction/2017Q2.csv', skiprows=[0])
df3 = pd.read_csv('./data/LoanEligibilityPrediction/2017Q3.csv', skiprows=[0])
df4 = pd.read_csv('./data/LoanEligibilityPrediction/2014.csv', skiprows=[0])
df5 = pd.read_csv('./data/LoanEligibilityPrediction/2015.csv', skiprows=[0])

In [36]:
df1.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,104046719,,14000.0,14000.0,14000.0,36 months,15.99%,492.13,C,C5,...,,,Cash,N,,,,,,
1,104048967,,5000.0,5000.0,5000.0,36 months,25.49%,200.1,E,E4,...,,,Cash,N,,,,,,
2,104028593,,4600.0,4600.0,4600.0,36 months,11.39%,151.45,B,B3,...,,,Cash,N,,,,,,
3,104046702,,14000.0,14000.0,14000.0,60 months,12.74%,316.69,C,C1,...,,,Cash,N,,,,,,
4,104280113,,15000.0,15000.0,15000.0,36 months,5.32%,451.73,A,A1,...,,,Cash,N,,,,,,


Since the data are stored in separate files, we have to make sure that each file has the same features, in the same order.

In [37]:
columns=np.dstack((list(df1.columns),list(df2.columns),list(df3.columns),list(df4.columns),list(df5.columns)))
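# np.dstack raises "all the input array dimensions except for the concatenation axis must match exactly"
# unless every column list has the same length, so this call also confirms equal column counts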
coldf = pd.DataFrame(columns[0])
coldf

Unnamed: 0,0,1,2,3,4
0,id,id,id,id,id
1,member_id,member_id,member_id,member_id,member_id
2,loan_amnt,loan_amnt,loan_amnt,loan_amnt,loan_amnt
3,funded_amnt,funded_amnt,funded_amnt,funded_amnt,funded_amnt
4,funded_amnt_inv,funded_amnt_inv,funded_amnt_inv,funded_amnt_inv,funded_amnt_inv
...,...,...,...,...,...
146,settlement_status,settlement_status,settlement_status,settlement_status,settlement_status
147,settlement_date,settlement_date,settlement_date,settlement_date,settlement_date
148,settlement_amount,settlement_amount,settlement_amount,settlement_amount,settlement_amount
149,settlement_percentage,settlement_percentage,settlement_percentage,settlement_percentage,settlement_percentage


In [38]:
df1.columns == df2.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,
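
The element-wise comparison above covers only one pair of files and prints a long boolean array. As a minimal sketch, the same check can be reduced to a single answer for any number of frames. The helper names `columns_match` and `column_differences` and the toy frames are hypothetical and are not part of the original tutorial.

```python
import pandas as pd

def columns_match(frames):
    """Return True if every DataFrame has the same columns in the same order."""
    reference = frames[0].columns
    return all(reference.equals(frame.columns) for frame in frames[1:])

def column_differences(frames):
    """For each frame position, list the columns it lacks relative to the union."""
    all_columns = set().union(*(frame.columns for frame in frames))
    return {i: sorted(all_columns - set(frame.columns))
            for i, frame in enumerate(frames)}

# Toy frames standing in for df1..df5 (hypothetical data)
a = pd.DataFrame({"id": [1], "loan_amnt": [1000.0]})
b = pd.DataFrame({"id": [2], "loan_amnt": [2000.0]})
c = pd.DataFrame({"id": [3], "term": ["36 months"]})

same_ab = columns_match([a, b])            # identical column sets
same_abc = columns_match([a, b, c])        # c differs from a and b
missing = column_differences([a, b, c])    # which columns each frame lacks
```

With the real data, `columns_match([df1, df2, df3, df4, df5])` would be the call to try before concatenating.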

In [39]:
df = pd.concat([df1, df2, df3, df4, df5])

In [40]:
df.shape

(981665, 151)

In [41]:
print(list(df.columns))

['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose', 'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line', 'fico_range_low', 'fico_range_high', 'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d', 'last_fico_range_high', 'last_fico_range_low', 'collections_12_mths_ex_med', 'mths_since_last_major_derog', 'policy_code', 'application_type', 'annual_inc_joint', 'dti_joint', 'verification_status_joint', 'acc_now_delinq',

**We will go through every feature and then select the relevant ones. Let's start with the target feature “loan_status”.**

In [42]:
df.loan_status.value_counts()

Current               500937
Fully Paid            358629
Charged Off            99099
Late (31-120 days)     13203
In Grace Period         6337
Late (16-30 days)       3414
Default                   36
Name: loan_status, dtype: int64

In this tutorial, we are interested in two classes:  
1. Fully Paid: borrowers who repaid the loan with interest, and
2. Charged Off: borrowers who could not repay and whose loans were eventually charged off.

Therefore, we select the rows that belong to these two classes:

In [43]:
keep = df['loan_status'].isin(['Fully Paid' , 'Charged Off'])
df = df.loc[keep]

In [44]:
df.shape

(457728, 151)

Looking at the shape, we see that we now have **slightly fewer than half of the original data points** (457,728 of 981,665 rows) and the same number of features. Before processing and cleaning manually, let’s do some general data processing steps first:  
1. Remove features associated with >90% missing values
2. Remove constant features
3. Remove duplicate features
4. Remove duplicate rows
5. Remove highly collinear features

# 1. Remove features associated with >90% missing values: 
- use isnull() to flag the missing values
- sum the flags to count the missing values for each feature
- sort the features by their number of missing values and create a data frame for further analysis

In [45]:
missing_df = df.isnull().sum(axis = 0).sort_values().to_frame('missing_value').reset_index()

In [46]:
miss_40000 = list(missing_df[missing_df.missing_value >= 400000]['index'])
print(len(miss_40000))

53


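53 features have at least 400,000 missing values (about 87% of the 457,728 rows, so the cutoff used here is slightly below the 90% mentioned above). We use pandas’ drop method to remove these 53 features.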

In [47]:
df.drop(miss_40000, axis = 1 , inplace = True)
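
The cell above uses a hard-coded count of 400,000. As an alternative sketch, the cutoff can be expressed as a fraction of rows so that it follows the stated 90% rule whatever the size of the data. The helper `drop_mostly_missing`, its parameter name, and the toy frame are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

def drop_mostly_missing(frame, max_missing_fraction=0.9):
    """Return (reduced frame, dropped column names) for columns whose
    share of missing values is strictly greater than the given fraction."""
    missing_fraction = frame.isnull().mean()  # per-column share of missing values
    to_drop = missing_fraction[missing_fraction > max_missing_fraction].index
    return frame.drop(columns=to_drop), list(to_drop)

# Toy data (hypothetical): member_id is fully missing, desc is 75% missing
toy = pd.DataFrame({
    "loan_amnt": [1000.0, 2000.0, 3000.0, 4000.0],
    "member_id": [np.nan, np.nan, np.nan, np.nan],
    "desc": [None, "car", None, None],
})

reduced, dropped = drop_mostly_missing(toy, max_missing_fraction=0.9)
```

Unlike `inplace=True`, this version returns a new frame, which makes it easier to compare the data before and after the step.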

# 2. Remove constant features:

At this step, we remove features that have a single unique value. Such a feature does not help the model generalize, since its variance is zero. A tree-based model cannot take advantage of these types of features because there is nothing to split on. Constant features can lead to errors in some models and provide no information that can be learned from the training set. Identifying features with a single unique value is relatively straightforward:
1. we can define a function, or
2. we can use sklearn.feature_selection

**It is important to mention here that, in order to avoid overfitting to information from the test data, feature selection should be fitted on the training set only and then applied unchanged to the test set.**

In [48]:
from sklearn.feature_selection import VarianceThreshold

In [32]:
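def find_constant_features(dataFrame):
    """Return the names of columns that contain fewer than two unique values."""
    const_features = []
    for column in list(dataFrame.columns):
        # unique() counts NaN as a value, so a column with one value plus NaN is kept
        if dataFrame[column].unique().size < 2:
            const_features.append(column)
    return const_features

const_features = find_constant_features(df)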

In [34]:
df.drop(const_features, axis = 1, inplace = True)
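
The notebook imports `VarianceThreshold` but then uses the hand-written function. As a rough sketch of the scikit-learn route, the selector below is fitted on a training split only and the resulting column choice is reused for the test split, in line with the note above. The toy frame and the split parameters are hypothetical. `VarianceThreshold` expects numeric input, so with the real loan data it would presumably have to be restricted to numeric columns first.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

# Toy numeric data (hypothetical): policy_code is constant
toy = pd.DataFrame({
    "loan_amnt": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 6000.0],
    "policy_code": [1.0] * 6,
    "dti": [10.5, 20.1, 15.3, 8.7, 30.2, 12.9],
})

train, test = train_test_split(toy, test_size=0.33, random_state=0)

# threshold=0.0 removes features whose variance in the training data is zero
selector = VarianceThreshold(threshold=0.0)
selector.fit(train)

# get_support() returns a boolean mask over the input columns
kept = list(train.columns[selector.get_support()])
train_reduced = train[kept]
test_reduced = test[kept]  # same columns as chosen on the training split
```

Selecting columns by name, rather than calling `selector.transform`, keeps the results as DataFrames with their original column labels.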