# Introduction

You are working for a consumer finance company. When the company receives a loan application, it must decide whether to approve the loan based on the applicant’s profile. Two types of risk are associated with this decision:

*	If the applicant is likely to repay the loan, then rejecting the application results in a loss of business for the company.
*	If the applicant is not likely to repay the loan (i.e. is likely to default), then approving the application results in a financial loss for the company.

In this case study, we consider only consumers whose loan application was approved. Our aim is to understand how consumer attributes and loan attributes influence the tendency to default.


## Business Understanding
This company is the largest online credit marketplace, facilitating personal loans, business loans, and financing for elective medical procedures. Borrowers can easily access lower interest rate loans through a fast online interface. Investors provide the capital to enable many of the loans in exchange for earning interest. 

## Data Understanding
The company has identified some important attributes for understanding the behaviour of its approved-loan customers with respect to loan default, and has decided to work only with these variables to mitigate future risk. The driver variables you need to consider for this case study are:

| Attributes  | Definition    |
|------|------|
|annual_inc| Annual Income of applicant|
|loan_amnt| The listed amount of the loan applied for by the borrower|
|funded_amnt|The total amount committed to that loan at that point in time|
|int_rate|Interest Rate on the loan|
|grade|Lending Club assigned loan grade|
|dti|Debt to income ratio|
|emp_length|Employment length in years|
|purpose|A category provided by the borrower for the loan request.|
|home_ownership|The home ownership status provided by the borrower during registration|
|loan_status|Current status of loan|

# Business Objective

The business objective is straightforward: the company wants to understand the driving factors behind loan default (loan_status_1). It can use this knowledge for portfolio and risk assessment. Specifically, the company wants to determine which driver variables have the most influence on the tendency to default.

Your goal is divided into 3 main parts:
1.	Data Preparation
2.	Exploratory Data Analysis
3.	Hypothesis testing


## Data Preparation

In [153]:
import pandas as pd
import numpy as np

In [154]:
loan = pd.read_csv('loan.csv',
                   dtype={'annual_inc':float,'loan_amnt':float,'funded_amnt':float,'int_rate':object ,'grade':object ,
                         'dti':float,'emp_length':object ,'purpose':object ,'home_ownership':object ,'loan_status':object })



https://stackoverflow.com/questions/24251219/pandas-read-csv-low-memory-and-dtype-options
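The linked discussion concerns how `read_csv` infers column types and what the `dtype` and `low_memory` options change. A minimal sketch of the idea, using a small made-up in-memory CSV (the `sample_csv` contents are hypothetical, not rows from `loan.csv`):

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for loan.csv (not the real data).
sample_csv = io.StringIO(
    "annual_inc,int_rate,grade\n"
    "24000,10.65%,B\n"
    "30000,15.27%,C\n"
)

# Explicit dtypes pin the type of each listed column, so pandas does not
# have to guess it. low_memory=False asks pandas to process the file in
# one pass rather than in chunks when inferring any remaining columns.
sample = pd.read_csv(
    sample_csv,
    dtype={"annual_inc": float, "int_rate": object, "grade": object},
    low_memory=False,
)
print(sample.dtypes)
```

Keeping `int_rate` as `object` here is deliberate: the values still carry a `%` sign and cannot be parsed as numbers until that is stripped.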

In [155]:
colsNeeded=['annual_inc', 'loan_amnt', 'funded_amnt', 'int_rate','grade','dti','emp_length','purpose','home_ownership','loan_status']

In [156]:
loan = loan[colsNeeded]

In [157]:
loan.shape

(42542, 10)

In [158]:
loan.head()

| | annual_inc | loan_amnt | funded_amnt | int_rate | grade | dti | emp_length | purpose | home_ownership | loan_status |
|------|------|------|------|------|------|------|------|------|------|------|
| 0 | 24000.0 | 5000.0 | 5000.0 | 10.65% | B | 27.65 | 10+ years | credit_card | RENT | Fully Paid |
| 1 | 30000.0 | 2500.0 | 2500.0 | 15.27% | C | 1.0 | < 1 year | car | RENT | Charged Off |
| 2 | 12252.0 | 2400.0 | 2400.0 | 15.96% | C | 8.72 | 10+ years | small_business | RENT | Fully Paid |
| 3 | 49200.0 | 10000.0 | 10000.0 | 13.49% | C | 20.0 | 10+ years | other | RENT | Fully Paid |
| 4 | 80000.0 | 3000.0 | 3000.0 | 12.69% | B | 17.94 | 1 year | other | RENT | Current |


In [159]:
loan.isnull().sum()

annual_inc        11
loan_amnt          7
funded_amnt        7
int_rate           7
grade              7
dti                7
emp_length         7
purpose            7
home_ownership     7
loan_status        7
dtype: int64

In [160]:
loan[loan['annual_inc'].isnull()]

| | annual_inc | loan_amnt | funded_amnt | int_rate | grade | dti | emp_length | purpose | home_ownership | loan_status |
|------|------|------|------|------|------|------|------|------|------|------|
| 39786 | | | | | | | | | | |
| 39787 | | | | | | | | | | |
| 39788 | | | | | | | | | | |
| 42452 | | 5000.0 | 5000.0 | 7.43% | A | 1.0 | < 1 year | other | NONE | Does not meet the credit policy. Status:Fully ... |
| 42453 | | 7000.0 | 7000.0 | 7.75% | A | 1.0 | < 1 year | other | NONE | Does not meet the credit policy. Status:Fully ... |
| 42483 | | 6700.0 | 6700.0 | 7.75% | A | 1.0 | < 1 year | other | NONE | Does not meet the credit policy. Status:Fully ... |
| 42536 | | 6500.0 | 6500.0 | 8.38% | A | 4.0 | < 1 year | other | NONE | Does not meet the credit policy. Status:Fully ... |
| 42538 | | | | | | | | | | |
| 42539 | | | | | | | | | | |
| 42540 | | | | | | | | | | |


In [161]:
# Remove every row that contains at least one missing value
loan.dropna(inplace=True)
loan.shape

(42531, 10)
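The null rows listed earlier fall into two groups: rows that are entirely empty and rows missing only `annual_inc`. A plain `dropna()` removes both. A minimal sketch on a hypothetical toy frame (`toy` is made up for illustration) showing how the `how` argument separates the two cases:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one complete row, one row missing only
# annual_inc, and one entirely empty row.
toy = pd.DataFrame({
    "annual_inc": [24000.0, np.nan, np.nan],
    "loan_amnt": [5000.0, 7000.0, np.nan],
    "grade": ["B", "A", None],
})

# Default how="any": drops every row with at least one missing value.
any_missing = toy.dropna()

# how="all": drops only rows in which every column is missing.
all_missing = toy.dropna(how="all")

print(len(any_missing), len(all_missing))
```

Whether the partially filled rows should be kept and imputed instead is a modelling decision; the notebook drops them.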

In [162]:
loan.describe(include = ['O'])

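| | int_rate | grade | emp_length | purpose | home_ownership | loan_status |
|------|------|------|------|------|------|------|
| count | 42531 | 42531 | 42531 | 42531 | 42531 | 42531 |
| unique | 394 | 7 | 12 | 14 | 5 | 9 |
| top | 10.99% | B | 10+ years | debt_consolidation | RENT | Fully Paid |
| freq | 970 | 12389 | 9369 | 19776 | 20181 | 32950 |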


In [163]:
loan['loan_status'].unique()

array(['Fully Paid', 'Charged Off', 'Current', 'In Grace Period',
       'Late (31-120 days)', 'Late (16-30 days)', 'Default',
       'Does not meet the credit policy. Status:Fully Paid',
       'Does not meet the credit policy. Status:Charged Off'], dtype=object)

### Remove all rows whose status is Fully Paid, because we are not analysing fully paid loans

In [164]:
loan =loan.drop(loan[(loan['loan_status']=='Fully Paid') 
               | (loan['loan_status']=='Does not meet the credit policy. Status:Fully Paid')].index)

In [165]:
loan.shape

(7597, 10)

In [166]:
loan['loan_status'].unique()

array(['Charged Off', 'Current', 'In Grace Period', 'Late (31-120 days)',
       'Late (16-30 days)', 'Default',
       'Does not meet the credit policy. Status:Charged Off'], dtype=object)
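The same kind of filter can also be written as a boolean mask with `isin`, which avoids building an index to drop. A minimal sketch on a hypothetical toy frame (the `toy` rows and the `paid_statuses` name are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data with a few of the status labels seen above.
toy = pd.DataFrame({"loan_status": [
    "Fully Paid",
    "Charged Off",
    "Current",
    "Does not meet the credit policy. Status:Fully Paid",
]})

# Statuses to exclude from the analysis.
paid_statuses = [
    "Fully Paid",
    "Does not meet the credit policy. Status:Fully Paid",
]

# Keep only rows whose status is NOT in the excluded list.
not_paid = toy[~toy["loan_status"].isin(paid_statuses)]
print(not_paid["loan_status"].tolist())
```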

### Remove % symbol from "int_rate" variable

In [167]:
# Strip the trailing '%' and convert the remaining text to a float
loan['int_rate'] = pd.to_numeric(loan['int_rate'].str.rstrip('%'))
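A minimal sketch of the percent-to-float conversion on made-up values. The sample strings, the whitespace stripping, and the use of `errors="coerce"` (which turns unparseable entries into NaN instead of raising) are illustrative assumptions, not part of the notebook:

```python
import pandas as pd

# Hypothetical raw values, including stray whitespace and a non-numeric entry.
rates = pd.Series(["10.65%", "15.27%", " 7.9%", "n/a"])

# Trim whitespace, drop the trailing percent sign, then parse as numbers.
cleaned = pd.to_numeric(rates.str.strip().str.rstrip("%"), errors="coerce")

print(cleaned.dtype)
print(cleaned.isna().sum())  # number of entries that could not be parsed
```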

#### Convert grade variable to category

In [168]:
loan['grade'] = loan['grade'].astype('category')

#### Clean Employment Length

In [169]:
loan['emp_length'].unique()

array(['< 1 year', '1 year', '4 years', '3 years', '10+ years', '9 years',
       '2 years', '8 years', '7 years', '5 years', '6 years', 'n/a'], dtype=object)

In [170]:
loan['emp_length']=loan['emp_length'].replace('[^0-9]','',regex=True)

In [171]:
loan['emp_length'].unique()

array(['1', '4', '3', '10', '9', '2', '8', '7', '5', '6', ''], dtype=object)

In [172]:
loan['emp_length'].value_counts()

10    1898
1     1456
2      753
3      713
4      620
5      600
6      413
7      359
       288
8      285
9      212
Name: emp_length, dtype: int64

In [182]:
# Mark the empty strings left over from 'n/a' as missing values
loan['emp_length'] = loan['emp_length'].replace('', np.nan)

In [184]:
loan['emp_length'].value_counts()

10    1898
1     1456
2      753
3      713
4      620
5      600
6      413
7      359
8      285
9      212
Name: emp_length, dtype: int64
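Note that stripping all non-digit characters maps both `'< 1 year'` and `'1 year'` to `'1'`, as the unique values above show. If that distinction matters for the analysis, one possible alternative is sketched below on made-up values. Coding `'< 1 year'` as 0 and `'10+ years'` as 10 is an assumption made for illustration:

```python
import pandas as pd

# Hypothetical sample of the raw emp_length labels.
emp = pd.Series(["< 1 year", "1 year", "4 years", "10+ years", "n/a"])

# Pull out the first run of digits; labels without digits ("n/a") become NaN.
years = emp.str.extract(r"(\d+)")[0].astype(float)

# Assumption: treat "< 1 year" as 0 so it stays distinct from "1 year".
years = years.mask(emp.str.startswith("<"), 0.0)

print(years.tolist())
```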

In [130]:
loan['int_rate'].dtype

dtype('float64')

#### Convert "home_ownership", "loan_status" & "purpose" variables to category

In [140]:
loan["home_ownership"]=loan['home_ownership'].astype('category')

In [142]:
loan["purpose"]=loan['purpose'].astype('category')

In [141]:
loan["loan_status"]=loan['loan_status'].astype('category')
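Several columns can be converted in one `astype` call by passing a dictionary. A minimal sketch on a hypothetical toy frame; treating `grade` as an ordered category running from A to G is an assumption (the notebook only reports 7 unique grades and uses an unordered category):

```python
import pandas as pd

# Hypothetical toy frame with three of the categorical columns.
toy = pd.DataFrame({
    "grade": ["B", "C", "A"],
    "purpose": ["car", "other", "car"],
    "home_ownership": ["RENT", "OWN", "RENT"],
})

# Assumption: grades are ordered A < B < ... < G.
grade_type = pd.CategoricalDtype(categories=list("ABCDEFG"), ordered=True)

# Convert all three columns in a single call.
toy = toy.astype({
    "grade": grade_type,
    "purpose": "category",
    "home_ownership": "category",
})

print(toy.dtypes)
print(toy["grade"].min())  # ordered categories support min/max and comparisons
```

An ordered category makes comparisons such as `toy["grade"] <= "B"` meaningful, which can be convenient when grouping loans by grade band.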