In [1]:
import pandas as pd
import numpy as np
from scipy import stats

During exploratory data analysis, loan purpose appeared to influence whether a loan is paid off. To check whether that association is statistically significant, we run a chi-square test of independence on it.

In [2]:
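# Parse the start date so the filter below compares dates rather than raw strings
lend = pd.read_csv('data/lending_clean.csv', parse_dates=['loan_start_d'])
# Keep only loans that started on or after 1 January 2009
lend_post_2008 = lend[lend['loan_start_d'] >= '2009-01-01']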

Null hypothesis (H0): loan purpose and paid-off status are independent, i.e. the paid-off rate is the same for every loan purpose.

Alternative hypothesis (H1): the paid-off rate differs across loan purposes.

In [3]:
table = pd.crosstab(lend_post_2008['purpose'], lend_post_2008['target'])
print(table)

target                 0      1
purpose                        
car                  140   1263
credit_card          499   4073
debt_consolidation  2648  14608
educational           32    127
home_improvement     346   2417
house                 56    288
major_purchase       214   1854
medical              103    524
moving                91    442
other                609   3018
renewable_energy      20     80
small_business       423   1120
vacation              54    306
wedding               87    740


In [4]:
stat, p, dof, expected = stats.chi2_contingency(table)
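
As a sanity check on what `stats.chi2_contingency` returns, the sketch below recomputes the expected counts and the Pearson statistic by hand. The counts are copied from the crosstab output above; the variable names (`observed`, `manual_stat`, and so on) are illustrative only and do not appear elsewhere in this notebook.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the crosstab output above
# (rows: purpose, columns: target 0 / 1).
observed = np.array([
    [140, 1263], [499, 4073], [2648, 14608], [32, 127],
    [346, 2417], [56, 288], [214, 1854], [103, 524],
    [91, 442], [609, 3018], [20, 80], [423, 1120],
    [54, 306], [87, 740],
])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected count under independence: row total * column total / grand total.
expected_manual = row_totals * col_totals / grand_total

# Pearson chi-square statistic, degrees of freedom, and upper-tail p-value.
manual_stat = ((observed - expected_manual) ** 2 / expected_manual).sum()
manual_dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
manual_p = stats.chi2.sf(manual_stat, manual_dof)

# Library result for comparison.
stat, p, dof, expected = stats.chi2_contingency(observed)
print(f"manual: chi2={manual_stat:.2f}, dof={manual_dof}, p={manual_p:.3g}")
print(f"scipy:  chi2={stat:.2f}, dof={dof}, p={p:.3g}")
```

The two lines should agree here because the continuity correction in `chi2_contingency` only applies when there is a single degree of freedom, which is not the case for this 14 x 2 table.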

In [5]:
# interpret the p-value at the 5% significance level (95% confidence level)
prob = 0.95
alpha = 1.0 - prob
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')

Dependent (reject H0)


Conclusion: loan purpose is significantly associated with the loan paid-off rate.
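
The test says only that the rates differ, not which purposes drive the difference. A quick way to see that is to turn the crosstab counts into a paid-off rate per purpose. This is a minimal sketch that re-enters the counts printed above and assumes `target == 1` marks a paid-off loan; `counts` and `paid_off_rate` are illustrative names.

```python
import pandas as pd

# Counts copied from the crosstab output above; columns are target 0 / 1.
counts = pd.DataFrame(
    {
        0: [140, 499, 2648, 32, 346, 56, 214, 103, 91, 609, 20, 423, 54, 87],
        1: [1263, 4073, 14608, 127, 2417, 288, 1854, 524, 442, 3018, 80,
            1120, 306, 740],
    },
    index=[
        "car", "credit_card", "debt_consolidation", "educational",
        "home_improvement", "house", "major_purchase", "medical", "moving",
        "other", "renewable_energy", "small_business", "vacation", "wedding",
    ],
)

# Assumption: target == 1 means the loan was paid off.
paid_off_rate = counts[1] / counts.sum(axis=1)
print(paid_off_rate.sort_values().round(3))
```

On these counts, `small_business` has the lowest rate (about 73%) and `car` the highest (about 90%).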

In addition, the following attributes flagged during exploratory analysis are tested as well:

- Credit history
- DTI
- Number of credit lines 

In [6]:
# Helper to streamline the testing steps for a single column
def chi_square(column):
    """Chi-square test of independence between lend_post_2008[column] and 'target'."""
    table = pd.crosstab(lend_post_2008[column], lend_post_2008['target'])
    stat, p, dof, expected = stats.chi2_contingency(table)
    prob = 0.95
    alpha = 1.0 - prob  # significance level
    if p <= alpha:
        print('Dependent (reject H0)')
    else:
        print('Independent (fail to reject H0)')
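
The helper above prints only a verdict. A variant that returns the statistic, p-value, and degrees of freedom makes results easier to compare across columns. The sketch below is one possible shape, not this notebook's actual helper: `chi_square_result`, its `target` and `alpha` parameters, and the toy DataFrame are all hypothetical.

```python
import pandas as pd
from scipy import stats


def chi_square_result(df, column, target="target", alpha=0.05):
    """Run a chi-square test of independence between df[column] and df[target]."""
    table = pd.crosstab(df[column], df[target])
    stat, p, dof, _ = stats.chi2_contingency(table)
    return {"column": column, "chi2": stat, "p": p, "dof": dof,
            "reject_h0": p <= alpha}


# Toy data for illustration only; not taken from lending_clean.csv.
toy = pd.DataFrame({
    "purpose": ["car"] * 40 + ["small_business"] * 40,
    "target": [1] * 36 + [0] * 4 + [1] * 24 + [0] * 16,
})
result = chi_square_result(toy, "purpose")
print(result)
```

Passing the DataFrame in explicitly, rather than reading a global, also lets the same function be reused on other subsets of the data.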

In [7]:
# Credit history
chi_square('yr_credit')

Dependent (reject H0)


In [8]:
# DTI
chi_square('dti')

Independent (fail to reject H0)
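
One caveat on this result: `pd.crosstab` treats every distinct value as its own category. If `dti` is stored as a continuous ratio, the table can become very wide with many near-empty cells, and the chi-square approximation is unreliable in that situation. A common workaround is to bin the column before testing. The sketch below uses synthetic data and a hypothetical helper, `chi_square_binned`, purely for illustration; it does not reproduce any result from `lending_clean.csv`.

```python
import numpy as np
import pandas as pd
from scipy import stats


def chi_square_binned(df, column, target="target", bins=5):
    """Bin a numeric column into quantiles, then test independence from target."""
    binned = pd.qcut(df[column], q=bins, duplicates="drop")
    table = pd.crosstab(binned, df[target])
    stat, p, dof, expected = stats.chi2_contingency(table)
    return table, stat, p, dof, expected


# Synthetic data: the chance of target == 0 rises with dti by construction.
rng = np.random.default_rng(0)
dti = rng.uniform(0, 30, size=5000)
target = (rng.uniform(size=5000) > 0.05 + 0.01 * dti).astype(int)
demo = pd.DataFrame({"dti": dti, "target": target})

table, stat, p, dof, expected = chi_square_binned(demo, "dti")
print(table)
print(f"chi2={stat:.1f}, dof={dof}, p={p:.3g}, min expected={expected.min():.1f}")
```

Checking the smallest expected count, as the last line does, is a quick way to confirm that the binned table is dense enough for the test to be meaningful.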


In [9]:
# Number of credit lines
chi_square('total_acc')

Dependent (reject H0)


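Conclusion: credit history and number of credit lines are significantly associated with the loan paid-off rate, while the test did not detect a significant association for DTI (which is not the same as showing that none exists).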