About Data

- credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
- purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").
- int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
- installment: The monthly installments owed by the borrower if the loan is funded.
- log.annual.inc: The natural log of the self-reported annual income of the borrower.
- dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
- fico: The FICO credit score of the borrower.
- days.with.cr.line: The number of days the borrower has had a credit line.
- revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
- revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
- inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
- delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
- pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style = "darkgrid", context = 'notebook', palette = 'deep')
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator)

In [None]:
%matplotlib inline

In [None]:
data_file = '../input/lending-club-loan-data-analysis/loan_data.csv'

In [None]:
loan_data = pd.read_csv(data_file)

In [None]:
loan_data.head()

In [None]:
# Change the columns to have _ instead of .
loan_data.columns

### Data Wrangling 

#### Step1 : Rename columns - from credit.policy to credit_policy etc

In [None]:
def rename_columns(a):
    list_a = list(a)
    return [str(i).replace(".","_") for i in list_a]

    

In [None]:
loan_data.columns = rename_columns(loan_data.columns)
loan_data.columns
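The same renaming can also be done without a helper function, using the string accessor on the column index. A minimal sketch on a made-up two-column frame (`demo` is a hypothetical name, not part of the notebook):

```python
import pandas as pd

# Toy frame with dotted column names, standing in for loan_data
demo = pd.DataFrame({"credit.policy": [1, 0], "int.rate": [0.11, 0.15]})

# regex=False treats "." as a literal dot rather than "any character"
demo.columns = demo.columns.str.replace(".", "_", regex=False)
print(list(demo.columns))
```

Passing `regex=False` matters here because an unescaped `.` in a regular expression would match every character.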

In [None]:
loan_data.head()

#### Step2 : Taking a look at the data. 

#### 2.1 Checking the datatypes of the columns

In [None]:
loan_data.head()

In [None]:
loan_data.dtypes


### Checking for missing values

In [None]:
loan_data.info()
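`info()` reports non-null counts per column, which have to be compared against the row count by eye. Counting nulls directly is a more explicit check. A small sketch on a made-up frame (`demo` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value
demo = pd.DataFrame({"fico": [700, np.nan, 680], "dti": [12.5, 10.1, 9.8]})

# Number of missing values in each column; all zeros means nothing to impute
missing_per_column = demo.isna().sum()
print(missing_per_column)
```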

It looks like there are no missing values in the dataset. Next, we check the data column by column.

In [None]:
loan_data["credit_policy"].value_counts(normalize = True)*100

About 80% of the applicants meet LendingClub.com's credit underwriting criteria; the remaining 20% do not.
For loan underwriting, false negatives are more critical than false positives. In other words, it is more acceptable for the club to rate a safe customer as "Risky" than to rate a truly risky customer as "Safe".
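To make the two error types concrete, here is a minimal sketch that counts them on toy labels, treating "risky" as the positive class. The arrays are invented for illustration and do not come from the dataset:

```python
import numpy as np

# 1 = risky, 0 = safe (toy labels, not dataset values)
actual    = np.array([1, 1, 1, 0, 0, 0, 0, 0])
predicted = np.array([1, 0, 0, 0, 1, 0, 0, 0])

# False negative: a truly risky customer rated safe (the costly mistake)
false_negatives = int(np.sum((actual == 1) & (predicted == 0)))
# False positive: a safe customer rated risky (a lost opportunity)
false_positives = int(np.sum((actual == 0) & (predicted == 1)))

print(false_negatives, false_positives)
```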

Studying the 20% of risky cases may give further insight into the "Risky" criteria.

In [None]:
# Purpose in the rejected cases
risky_loans = loan_data[loan_data["credit_policy"]==0]
risky_loans.head()

In [None]:
risky_loans["purpose"].value_counts(normalize = True)*100

Around 40% of risky loans are for debt consolidation purposes. Credit cards rank #3 with 13% of cases.
It would be interesting to see whether loans for debt consolidation are mostly risky.

In [None]:
# loan_data[loan_data["purpose"]=="debt_consolidation"]["credit_policy"].value_counts()

In [None]:
loan_data[loan_data["purpose"]=="debt_consolidation"]["credit_policy"].value_counts(normalize = True)*100

Just 18.5% of the debt consolidation loan applications are considered risky; the remaining 81.5% are healthy loans.
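The two percentages answer different questions: the share of risky loans that belong to a purpose, and the share of a purpose's loans that are risky. A sketch on made-up rows (the `demo` frame is hypothetical) shows how both can be read off one table of counts:

```python
import pandas as pd

# Toy data: a large purpose with few risky loans and a small purpose with many
demo = pd.DataFrame({
    "purpose": ["debt_consolidation"] * 10 + ["educational"] * 2,
    "credit_policy": [0, 0] + [1] * 8 + [0, 1],
})

# Rows are purposes, columns are credit_policy values (0 = risky, 1 = healthy)
counts = pd.crosstab(demo["purpose"], demo["credit_policy"])

# Of all risky loans, what percentage falls under each purpose?
share_of_risky = counts[0] / counts[0].sum() * 100
# Of all loans with a given purpose, what percentage is risky?
risky_rate = counts[0] / counts.sum(axis=1) * 100

print(share_of_risky.round(1))
print(risky_rate.round(1))
```

In the toy data the larger purpose accounts for most risky loans simply because it has the most loans, while its own risky rate is the lower of the two.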


We can extend this analysis to the other purposes and compare the number of risky and safe loans.

In [None]:
fig = plt.figure(figsize = (8,6))
ax1 = plt.subplot(1,1,1)

sns.countplot(x = "purpose", hue = "credit_policy", data = loan_data)
plt.xticks(rotation = 90, fontsize = 13)
plt.title("Credit Health by purpose", fontsize = 16)
plt.xlabel("Purpose", fontsize = 13)
plt.ylabel("Number of loans", fontsize = 13)

plt.show()


In [None]:
# Calculating Risky/Total count of loans by purpose
tot_count_of_loans_by_purpose = loan_data.groupby(["purpose"])["credit_policy"].count().reset_index()
tot_count_of_risky_loans_by_purpose = loan_data[loan_data["credit_policy"]==0].groupby(["purpose"])["credit_policy"].count().reset_index()

risky_to_total_ratio_by_purpose = pd.merge(tot_count_of_risky_loans_by_purpose,tot_count_of_loans_by_purpose, on = 'purpose', suffixes = ('_x', '_y'))
risky_to_total_ratio_by_purpose.columns = ["purpose", "Risky","Total"]
risky_to_total_ratio_by_purpose


In [None]:
risky_to_total_ratio_by_purpose["ratio"] = risky_to_total_ratio_by_purpose["Risky"]/risky_to_total_ratio_by_purpose["Total"]*100
risky_to_total_ratio_by_purpose.sort_values("ratio", ascending = False)
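A shorter route to the same percentage is to take the mean of a boolean "is risky" flag per purpose. Unlike the inner merge above, this keeps any purpose that has no risky loans at all (it shows up as 0%). A minimal sketch on invented rows; `demo` and `risky_pct` are hypothetical names:

```python
import pandas as pd

# Toy data: credit_card has no risky loans at all
demo = pd.DataFrame({
    "purpose": ["educational"] * 4 + ["credit_card"] * 2,
    "credit_policy": [0, 1, 1, 1, 1, 1],
})

# The mean of a boolean flag is the fraction of True values in each group
risky_pct = (
    demo["credit_policy"].eq(0)
    .groupby(demo["purpose"])
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)
print(risky_pct)
```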

Around 26% of the educational-purpose loans are termed "Risky", followed by small_business and all_other loans.

#### Interest Rate

In [None]:
loan_data.groupby("credit_policy").agg({'int_rate':['min','max','mean']})

In [None]:
fig = plt.figure(figsize = (10,8))

ax1 = plt.subplot(211)

sns.histplot(loan_data[loan_data["credit_policy"]==0]["int_rate"], kde = False)
plt.axvline(x = loan_data[loan_data["credit_policy"]==0]["int_rate"].mean(), color = 'r') # Mean line
#plt.axvline(x = loan_data[loan_data["credit_policy"]==0]["int_rate"].median(), color = 'b') # median line
plt.ylabel("Counts")
plt.text(0.16,150,"For Risky Loans. Mean={}".format(round(loan_data[loan_data["credit_policy"]==0]["int_rate"].mean(),2)), fontsize =13)


ax2 = plt.subplot(212)
sns.histplot(loan_data[loan_data["credit_policy"]==1]["int_rate"], kde = False)
plt.axvline(x = loan_data[loan_data["credit_policy"]==1]["int_rate"].mean(), color = 'r') # Mean line
#plt.axvline(x = loan_data[loan_data["credit_policy"]==0]["int_rate"].median(), color = 'b') # median line
plt.ylabel("Counts")
plt.text(0.16,500,"For Healthy Loans. Mean={}".format(round(loan_data[loan_data["credit_policy"]==1]["int_rate"].mean(),2)), fontsize =13)

plt.show()

The average interest rate for the risky loans is around 2 percentage points higher than for the healthy loans.
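A gap between two rates can be quoted in percentage points or as a relative change, and the two numbers differ a lot. The sketch below uses made-up means (assumptions for illustration, not values computed from the dataset) to show the difference:

```python
risky_mean = 0.14    # illustrative value, not a dataset result
healthy_mean = 0.12  # illustrative value, not a dataset result

# Absolute gap, expressed in percentage points
gap_in_points = (risky_mean - healthy_mean) * 100
# Relative gap, expressed as a percentage of the healthy mean
relative_gap = (risky_mean - healthy_mean) / healthy_mean * 100

print(f"{gap_in_points:.1f} percentage points, {relative_gap:.1f}% relative")
```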

In [None]:
fig = plt.figure(figsize = (10,8))

ax1 = plt.subplot(211)

sns.histplot(loan_data[loan_data["credit_policy"]==0]["dti"], kde = False)
plt.axvline(x = loan_data[loan_data["credit_policy"]==0]["dti"].mean(), color = 'r') # Mean line
#plt.axvline(x = loan_data[loan_data["credit_policy"]==0]["int_rate"].median(), color = 'b') # median line
plt.ylabel("Counts")
plt.xlabel("Debt-to-income Ratio", fontsize = 14)
plt.text(20,150,"For Risky Loans. Mean={}".format(round(loan_data[loan_data["credit_policy"]==0]["dti"].mean(),2)), fontsize =13)


ax2 = plt.subplot(212)
sns.histplot(loan_data[loan_data["credit_policy"]==1]["dti"], kde = False)
plt.axvline(x = loan_data[loan_data["credit_policy"]==1]["dti"].mean(), color = 'r') # Mean line
#plt.axvline(x = loan_data[loan_data["credit_policy"]==0]["int_rate"].median(), color = 'b') # median line
plt.ylabel("Counts")
plt.text(20,400,"For Healthy Loans. Mean={}".format(round(loan_data[loan_data["credit_policy"]==1]["dti"].mean(),2)), fontsize =13)

plt.show()

The debt-to-income ratio is higher for the risky loans. We can also check the DTI ratio for the debt_consolidation loans to assess the risk appetite.

In [None]:
fig = plt.figure(figsize = (10,8))

ax1 = plt.subplot(211)

sns.histplot(loan_data[(loan_data["credit_policy"]==0) & (loan_data["purpose"]=="debt_consolidation")]["dti"], kde = False)
plt.axvline(x = loan_data[(loan_data["credit_policy"]==0) & (loan_data["purpose"]=="debt_consolidation")]["dti"].mean(), color = 'r') # Mean line
#plt.axvline(x = loan_data[loan_data["credit_policy"]==0]["int_rate"].median(), color = 'b') # median line
plt.ylabel("Counts")
plt.xlabel("Debt-to-income Ratio", fontsize = 14)
plt.text(20,85,"For Risky Loans - Debt Consolidation. Mean={}".format(round(loan_data[(loan_data["credit_policy"]==0) & (loan_data["purpose"]=="debt_consolidation")]["dti"].mean(),2)), fontsize =13)
plt.show()

In [None]:
mean_dti_by_purpose_policy= loan_data.groupby(["purpose","credit_policy"])["dti"].mean().reset_index()
mean_dti_by_purpose_policy

In [None]:
fig = plt.figure(figsize = (10,8))
ax1 = plt.subplot(1,1,1)

sns.barplot(data = mean_dti_by_purpose_policy,x ="purpose" , y="dti", hue = "credit_policy")
plt.xticks(rotation = 90, fontsize = 14)
plt.ylabel("Debt to Income Ratio", fontsize = 14)
plt.xlabel("Purpose", fontsize = 13)
plt.title("Purpose vs DTI for Healthy and Risky Loans", fontsize = 14)

# Draw the overall group means on the same axes as the bars (ax2 is not defined in this cell)
ax1.axhline(y = loan_data[loan_data["credit_policy"]==0]["dti"].mean(), color = 'blue', linestyle = '--', label = "Avg DTI for Risky loans")
ax1.axhline(y = loan_data[loan_data["credit_policy"]==1]["dti"].mean(), color = 'orange', linestyle = '--', label = "Avg DTI for Healthy loans")
ax1.legend(loc = 1)
plt.show()

The higher the DTI ratio, the higher the chance of risk. The average DTI for a healthy loan is ~12.5. However, loans with DTI values as low as 10.5 can also be risky, so this does not imply that the credit policy is correlated with the DTI.

It will be interesting to check the variables that are correlated to DTI. 

In [None]:
fig = plt.figure(figsize = (15,8))
ax1 = plt.subplot(121)
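# Correlate numeric columns only; "purpose" is a string column and would raise an error in recent pandas
sns.heatmap(loan_data.select_dtypes("number").corr(method = 'spearman'), annot = True, fmt = ".2f", cmap = "RdYlBu")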
plt.title("Correlation Heat Map - Spearman Coeff", fontsize = 14)

ax2 = plt.subplot(122)
sns.heatmap(loan_data.select_dtypes("number").corr(method = 'pearson'), annot = True, fmt = ".2f", cmap = "RdYlBu")
plt.title("Correlation Heat Map - Pearson Coeff", fontsize = 14)
plt.tight_layout()
plt.show()

With the Spearman coefficient, DTI is correlated with both revolving balance and revolving utilization. With the Pearson coefficient, however, DTI is only correlated with revolving utilization.
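The two coefficients can disagree because Pearson measures linear association while Spearman measures monotonic association on ranks. A minimal sketch on synthetic data (the cubic relationship is an assumption chosen for illustration, not a property of the loan data):

```python
import pandas as pd

# y grows monotonically with x, but not linearly
demo = pd.DataFrame({"x": range(1, 11)})
demo["y"] = demo["x"] ** 3

pearson = demo.corr(method="pearson").loc["x", "y"]
spearman = demo.corr(method="spearman").loc["x", "y"]

# Spearman sees a perfect monotonic relationship; Pearson is penalised by the curvature
print(round(pearson, 3), round(spearman, 3))
```

Skewed variables such as a revolving balance can show the same pattern: a visible rank correlation alongside a weak linear one.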

To decide which correlation method is appropriate, it helps to look at how DTI varies with revol_bal and revol_util in a scatter plot.

In [None]:
fig = plt.figure(figsize=(8,6))
ax1 = plt.subplot(211)

plt.title("Variation of Debt-to-Income Ratio vs Revolving Balance", fontsize = 14)
sns.scatterplot(data = loan_data[["dti", "revol_bal", "revol_util","credit_policy"]], x = "dti", y = "revol_bal", hue = "credit_policy")

ax2 = plt.subplot(212)
sns.scatterplot(data = loan_data[["dti", "revol_bal", "revol_util","credit_policy"]], x = "dti", y = "revol_util", hue = "credit_policy")


plt.show()

There appears to be no correlation between DTI and revolving balance/utilization. However, the correlation coefficients suggest that fico (the credit score) and the interest rate are negatively correlated.

In [None]:
from matplotlib import patches
fig = plt.figure(figsize = (8,6))
ax1 = plt.subplot(111)

sns.scatterplot(data = loan_data[["fico", "int_rate","credit_policy"]], x = "fico", y = "int_rate", color = 'b', hue = "credit_policy")
sns.regplot(data = loan_data[["fico", "int_rate"]], x = "fico", y = "int_rate", color = 'b', scatter = False)
plt.title("FICO vs Int. Rate ", fontsize = 14)
plt.ylabel("Int Rate", fontsize = 13)
plt.xlabel("FICO - Credit Score", fontsize = 13)
rect = patches.Rectangle((660,0.055),100,0.02,edgecolor = "r", fill = False, linewidth = 2)
ax1.add_patch(rect)
plt.text(625,0.08,"Poor FICO score,\nLow Interest Rates and healthy", fontsize = 12, color = "r", fontweight = "semibold", )
plt.show()

The higher the FICO score, the lower the interest rate. Higher credit scores also carry better risk credibility. However, there are a few cases where low interest rates were offered even though the FICO score was very poor.
One possibility is that the loan disbursement policy was compromised and favours were granted.

For the sake of analysis, any FICO score lower than 725 can be deemed a poor credit score.

In [None]:
from matplotlib import patches
fig = plt.figure(figsize = (8,6))
ax1 = plt.subplot(111)

sns.scatterplot(data = loan_data[loan_data["credit_policy"]==0], x = "fico", y = "int_rate", color = 'b')
#sns.regplot(data = loan_data[loan_data["credit_policy"]==0], x = "fico", y = "int_rate", color = 'b', scatter = False)
plt.title("FICO vs Int. Rate for Risky loans", fontsize = 14)
plt.ylabel("Int Rate", fontsize = 13)
plt.xlabel("FICO - Credit Score", fontsize = 13)
#rect = patches.Rectangle((660,0.055),100,0.02,edgecolor = "r", fill = False, linewidth = 2)
#ax1.add_patch(rect)
#plt.text(625,0.08,"Poor FICO score,\nLow Interest Rates and healthy", fontsize = 12, color = "r", fontweight = "semibold", )
plt.show()

Identifying the loans that are termed "Healthy" in spite of having a low credit score and very low interest rates.
Listing all the loans with FICO < 725 and interest rates between 5% and 8%.

In [None]:
low_Fico_Int_Rates = loan_data[(loan_data["credit_policy"]==1) & ((0.05<=loan_data["int_rate"]) & (0.08>=loan_data["int_rate"]))& (loan_data["fico"]<725)]
low_Fico_Int_Rates.head()
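The same filter reads a little more easily with `Series.between`, which is inclusive on both ends by default and so matches the `<=` / `>=` comparisons above. A minimal sketch on made-up rows (the `demo` frame is hypothetical):

```python
import pandas as pd

# Toy rows: only the first one satisfies all three conditions
demo = pd.DataFrame({
    "credit_policy": [1, 1, 0, 1],
    "int_rate": [0.07, 0.12, 0.06, 0.08],
    "fico": [700, 690, 680, 760],
})

mask = (
    demo["credit_policy"].eq(1)              # termed healthy
    & demo["int_rate"].between(0.05, 0.08)   # 5% to 8%, both ends included
    & demo["fico"].lt(725)                   # low credit score
)
selected = demo[mask]
print(selected)
```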

Things to check :
Relationship of other attributes with Credit Policy

In [None]:
loan_data[loan_data["credit_policy"]==0].describe()

In [None]:
loan_data[loan_data["credit_policy"]==1].describe()

### Checking the correlation between Credit Policy and the Days with Credit Line

In [None]:
loan_data[["credit_policy", "days_with_cr_line"]].groupby("credit_policy").mean()

We will convert the credit line duration from days to years to see whether the short-term loans are safer than the longer-term loans.

In [None]:
# Take an explicit copy so that adding a column does not trigger SettingWithCopyWarning
df_cr_line_policy = loan_data[["credit_policy", "days_with_cr_line"]].copy()
df_cr_line_policy["cr line duration in years"] = np.round(df_cr_line_policy["days_with_cr_line"] / 365.25, 2)
df_cr_line_policy.head()

In [None]:
# Plotting a scatter plot with years and the credit policy
fig = plt.figure(figsize = (10,8))
ax1 = plt.subplot(111)
sns.scatterplot(data = df_cr_line_policy, x = "cr line duration in years", y = "credit_policy")
plt.title("Relationship Between Years of Credit and Health of a loan", fontsize = 14)
plt.show()

In [None]:
# Let's group the years into short term, medium term, long term and very long term loans.
# Upper bounds are exclusive so fractional values such as 3.5 or 10.5 years still get a label
# (the original integer ranges left gaps that returned None).
def duration_type(y):
    if y < 4:
        return "Short Term"
    elif y < 11:
        return "Medium Term"
    elif y < 21:
        return "Long Term"
    else:
        return "Very Long Term"

df_cr_line_policy["Duration_Type"] = df_cr_line_policy["cr line duration in years"].apply(lambda x : duration_type(x))
df_cr_line_policy.sort_values(["cr line duration in years"])
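An alternative to a hand-written function is `pd.cut`, whose contiguous bins leave no gaps between categories. The bin edges below are assumptions chosen to follow the whole-year ranges used here (0 to 3, 4 to 10, 11 to 20, 21 and above), and the sample values are invented:

```python
import numpy as np
import pandas as pd

# Toy durations in years, including fractional values near the bin edges
years = pd.Series([2.5, 3.5, 10.0, 15.2, 25.0])
labels = ["Short Term", "Medium Term", "Long Term", "Very Long Term"]

# right=False makes each bin include its left edge: [0, 4), [4, 11), [11, 21), [21, inf)
duration = pd.cut(years, bins=[0, 4, 11, 21, np.inf], labels=labels, right=False)
print(duration.tolist())
```

The result is an ordered categorical, so plots and group-bys keep the short-to-long order without extra sorting.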

In [None]:
plt.figure(figsize = (8,6))
ax1 = plt.subplot(111)

sns.countplot(data =df_cr_line_policy, x = "Duration_Type", hue = "credit_policy" )

plt.title("Count of Loans by Duration and Credit Policy", fontsize = 14)
plt.xlabel("Duration of Loans", fontsize = 13)
plt.ylabel("Number of loans", fontsize = 13)
plt.legend(loc  = 1)
plt.show()

From the above chart, it is evident that long-duration loans are less risky than short-term loans. However, very long-term loans are also slightly riskier than medium- and long-term loans.
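Raw counts can be misleading when the groups differ in size, so it can help to compare the share of risky loans within each duration group. A minimal sketch using a row-normalised cross-tabulation on invented rows (the `demo` frame is hypothetical and its numbers are not dataset results):

```python
import pandas as pd

# Toy data: 2 of 4 short-term loans and 1 of 5 long-term loans are risky
demo = pd.DataFrame({
    "Duration_Type": ["Short Term"] * 4 + ["Long Term"] * 5,
    "credit_policy": [0, 0, 1, 1, 0, 1, 1, 1, 1],
})

# normalize="index" turns each row into fractions that sum to 1
shares = pd.crosstab(demo["Duration_Type"], demo["credit_policy"], normalize="index")
risky_share = shares[0] * 100
print(risky_share)
```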



### From the correlation matrix in the heat map (from above) 

In [None]:
fig = plt.figure(figsize = (10,8))
ax2 = plt.subplot(111)
# Correlate numeric columns only; "purpose" is a string column
sns.heatmap(loan_data.select_dtypes("number").corr(method = 'pearson'), annot = True, fmt = ".2f", cmap = "RdYlBu")
plt.title("Correlation Heat Map - Pearson Coeff", fontsize = 14)
plt.tight_layout()
plt.show()

The credit policy is slightly negatively correlated with the inquiries made in the last 6 months. Let us verify this with a scatter plot.

In [None]:
loan_data["credit_policy"].map({0:"Risky", 1:"Healthy"}).head()

In [None]:
plt.figure(figsize=(6,6))
ax = plt.subplot(111)
plt.scatter( x=loan_data["credit_policy"].map({0:"Risky", 1:"Healthy"}), y = loan_data["inq_last_6mths"], color = "b")
plt.title("Credit Policy vs Inquiries in last 6 months", fontsize = 14)
plt.xlabel("Credit Policy", fontsize = 13)
plt.ylabel("# of Inquiries in last 6 months", fontsize = 13)
plt.text(0,10,"Average Inquiries = {} per loan".format(np.round(loan_data[loan_data["credit_policy"]==1]["inq_last_6mths"].mean(),0)), fontsize = 13, color = "r")
plt.text(0.25,30,"Average Inquiries = {} per loan".format(np.round(loan_data[loan_data["credit_policy"]==0]["inq_last_6mths"].mean(),0)), fontsize = 13, color = "r")
plt.show()

It is evident that loans with more inquiries tend to be classified as risky. For healthy loans, very few inquiries (<1 per loan) are made, whereas for risky loans the average is around 4 inquiries per loan.

This happens when a customer applies for a loan at multiple banks to compare interest rate quotes; this approach usually works against the loan application.

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_df = pd.DataFrame()
list_col = ['credit_policy', 'int_rate', 'installment', 'log_annual_inc',
       'dti', 'fico', 'days_with_cr_line', 'revol_bal', 'revol_util',
       'inq_last_6mths', 'delinq_2yrs', 'pub_rec', 'not_fully_paid']
# Copy the slice before modifying it, then add a constant column so the VIF is calculated properly
X = loan_data[list_col].copy()
X["constant"] = 1
vif_df["Features"] = X.columns

vif_df["VIF"] = [variance_inflation_factor(X.values,i) for i in range(len(X.columns))]
vif_df.sort_values("VIF", ascending = False)
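For intuition, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes this directly with NumPy on synthetic data. The `vif` helper and the arrays are hypothetical; this is not the statsmodels implementation, and unlike the cell above it adds the intercept inside the helper:

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r_squared = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r_squared)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)               # independent of a
c = a + 0.1 * rng.normal(size=200)     # nearly a copy of a

X = np.column_stack([a, b, c])
vifs = [vif(X, j) for j in range(X.shape[1])]
print([round(v, 1) for v in vifs])
```

The near-duplicate pair gets a large VIF while the independent column stays close to 1; values above roughly 5 to 10 are a common rule-of-thumb warning level.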

## CONCLUSIONS 

### The data did not contain any missing values. Therefore, imputation was not required.

1. Around 40% of risky loans are for debt consolidation purposes, and credit cards rank #3 with 13% of cases. However, only 18.5% of debt consolidation loans are themselves risky.
2. Around 1/4th of the educational-purpose loans are termed "Risky".
3. The average interest rate for the risky loans is around 2 percentage points higher than for the healthy loans.
4. The higher the DTI ratio, the higher the chance of risk. The average DTI for a healthy loan is ~12.5. However, loans with DTI values as low as 10.5 can also be risky.
5. The higher the FICO score, the lower the interest rate. Higher credit scores also carry better risk credibility. However, there are a few cases where low interest rates were offered even though the FICO score was very poor. One possibility is that the loan disbursement policy was compromised and favours were granted.
6. Long-duration loans (11 to 20 years) are less risky than short-term loans (0 to 3 years). However, very long-term loans (>20 years) are also slightly riskier than medium-term (4 to 10 years) and long-term loans.
7. Loans with more inquiries tend to be classified as risky. For healthy loans, very few inquiries (<1 per loan) are made, whereas for risky loans the average is around 4 inquiries per loan.
8. The dataset was also tested for multicollinearity, and no features were found to be highly correlated with other features.