# Lending Club Data Exploration

## Data Overview
The data given below contains the information about past loan applicants and whether they ‘defaulted’ or not. The aim is to identify patterns which indicate if a person is likely to default, which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc.

## Business Objective
Lending Club wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default.  The company can utilise this knowledge for its portfolio and risk assessment. 

## Preliminary Wrangling
This document explores a dataset containing loan data and attributes for approximately 40,000 loan applications.

In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.options.display.float_format = '{:.2f}'.format

In [None]:
# load in the dataset into a pandas dataframe, print statistics
loan_df = pd.read_csv('../input/lending-club-loan-dataset-2007-2011/loan.csv',encoding = "ISO-8859-1", low_memory=False)
print(loan_df.shape)
print(loan_df.dtypes)
print(loan_df.head(10))

In [None]:
# percentage of null values in each column
round(100 * loan_df.isnull().sum()/loan_df['id'].count())

In [None]:
# Removing columns that have more than 50% nulls
threshold_number = loan_df['id'].count()/2
loan_df = loan_df.loc[:, loan_df.isnull().sum(axis=0) <= threshold_number]
loan_df.shape
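
The same 50% rule can also be written in terms of the fraction of missing values per column. The following is a minimal sketch on a toy frame; the column names are made up for illustration and are not taken from the loan dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for loan_df (hypothetical column names).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "mostly_null": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "half_null": [np.nan, np.nan, 1.0, 2.0],       # exactly 50% missing
    "complete": [10, 20, 30, 40],                  # 0% missing
})

# isnull().mean() gives the fraction of missing values in each column.
null_fraction = toy.isnull().mean()

# Keep columns whose missing fraction is at most 50%.
kept = toy.loc[:, null_fraction <= 0.5]
print(list(kept.columns))
```

A column that is exactly half empty is kept, because the comparison is `<=`, which matches the threshold check used in the cell above.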

In [None]:
# Checking number of unique values in each column
loan_df.nunique()

In [None]:
# Removing columns that have a single value. Those columns will not give us any insights
loan_df = loan_df.loc[:, loan_df.nunique(axis=0) > 1]
loan_df.shape

In [None]:
loan_df.nunique().sort_values(ascending=False)

In [None]:
# Checking data in the columns with low variation
loan_df['term'].value_counts()

In [None]:
# extracting the number of months from term as an int, stored in a new term_months column
loan_df['term_months'] = loan_df['term'].str.lstrip().str.slice(stop=2).astype('int')

In [None]:
loan_df['term_months'].value_counts()
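
As an alternative to slicing the first two characters, the digits can be pulled out with a regular expression, which does not depend on the exact position of the number in the string. This is only a sketch on a few sample strings that are assumed to look like the values in `term`.

```python
import pandas as pd

# Sample values assumed to resemble the raw term column (leading space included).
term = pd.Series([" 36 months", " 60 months", " 36 months"])

# Extract the first run of digits and convert it to an integer.
term_months = term.str.extract(r"(\d+)", expand=False).astype(int)
print(term_months.tolist())
```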

In [None]:
# dropping unused term column
loan_df = loan_df.drop('term', axis=1)

In [None]:
# check unique values for pub_rec_bankruptcies
loan_df['pub_rec_bankruptcies'].value_counts()
# a few values appear to be missing

In [None]:
# check null value count
loan_df['pub_rec_bankruptcies'].isnull().sum()

In [None]:
# we don't want to bias the analysis on bankruptcies. Removing rows with null values is safe here, since
# the percentage of null values is low
loan_df = loan_df[~loan_df['pub_rec_bankruptcies'].isnull()]
# verify null values have been removed
loan_df['pub_rec_bankruptcies'].isnull().sum() == 0

In [None]:
# check for unique values in loan_status column
loan_df['loan_status'].value_counts()

In [None]:
# check null value count
loan_df['loan_status'].isnull().sum()

In [None]:
# percentage of null values in each column
round(100 * loan_df.isnull().sum()/loan_df['id'].count(),2)

In [None]:
# removing description as it's not significant
loan_df = loan_df.drop('desc', axis=1)

In [None]:
# removing rows with null values(as they are low in percentage):
# employee title, employee length, title, revol_util, last_pymnt_d
loan_df = loan_df[~loan_df['emp_title'].isnull()]
loan_df = loan_df[~loan_df['emp_length'].isnull()]
loan_df = loan_df[~loan_df['title'].isnull()]
loan_df = loan_df[~loan_df['revol_util'].isnull()]
loan_df = loan_df[~loan_df['last_pymnt_d'].isnull()]

In [None]:
# percentage of null values in each column
round(100 * loan_df.isnull().sum()/loan_df['id'].count(),2)

None of the columns contains missing values.

In [None]:
# exploring data values in each column
loan_df.head()

## Other Quality Issues
- int_rate and revol_util are percentage strings. % value can be removed and the column datatype needs to be changed to float instead of string.
- emp_length can be numeric as well
- object date columns: last_pymnt_d, last_credit_pull_d, earliest_cr_line, issue_d
- splitting of month and year on issue date 
- Remove consumer behaviour columns
- Remove zip_code, addr_state, url, id, member_id, title not to be used for data analysis

In [None]:
# int_rate and revol_util are percentage strings
loan_df['int_rate'] = loan_df['int_rate'].str.strip('%').astype('float')
loan_df['revol_util'] = loan_df['revol_util'].str.strip('%').astype('float')

In [None]:
# emp_length can be numeric as well
loan_df['emp_length'].value_counts()

In [None]:
# map to values 0 to 10: 0 for '< 1 year', 10 for '10+ years'
# using replace method on dataframe
replace_dict = {
    '10+ years': 10,
    '2 years': 2,
    '< 1 year': 0,
    '3 years': 3,
    '4 years': 4,
    '5 years': 5,
    '6 years': 6,
    '1 year': 1,
    '7 years': 7,
    '8 years': 8,
    '9 years': 9
}
loan_df = loan_df.replace({"emp_length": replace_dict })
loan_df['emp_length'].value_counts()

In [None]:
# object date columns: last_pymnt_d, last_credit_pull_d, earliest_cr_line, issue_d
# converting them to datetime columns
loan_df['last_pymnt_d'] = pd.to_datetime(loan_df['last_pymnt_d'], format='%b-%y')
loan_df['last_credit_pull_d'] = pd.to_datetime(loan_df['last_credit_pull_d'], format='%b-%y')
loan_df['earliest_cr_line'] = pd.to_datetime(loan_df['earliest_cr_line'], format='%b-%y')
loan_df['issue_d'] = pd.to_datetime(loan_df['issue_d'], format='%b-%y')

In [None]:
# verify columns are converted to datetime
loan_df.info()
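
One thing worth checking after this conversion is how two-digit years are interpreted. With `%y`, years 00 to 68 are read as 2000 to 2068, so an old credit line such as `Jan-65` would land in the future. The sketch below uses made-up sample strings and an assumed cutoff date (the end of the 2007-2011 period named in the dataset path); whether the real `earliest_cr_line` column contains such values has to be checked on the data.

```python
import pandas as pd

# Hypothetical sample values in the same 'Mon-yy' format as the date columns.
raw = pd.Series(["Jan-65", "Dec-99", "Mar-08"])
parsed = pd.to_datetime(raw, format="%b-%y")
print(parsed.dt.year.tolist())

# Assumed cutoff: no date in this dataset should be later than the end of 2011.
cutoff = pd.Timestamp("2011-12-31")

# Shift any date after the cutoff back by one century.
fixed = parsed.where(parsed <= cutoff, parsed - pd.DateOffset(years=100))
print(fixed.dt.year.tolist())
```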

In [None]:
# splitting of month and year on issue date
loan_df['issue_d_month'] = loan_df['issue_d'].dt.month
loan_df['issue_d_year'] = loan_df['issue_d'].dt.year

## The aim is to identify patterns which indicate if a person is likely to default
EDA to understand how consumer attributes and loan attributes influence the tendency of default. <br>
Consumer behaviour might be irrelevant for our analysis. <br>
Target column will be loan_status = 'Fully Paid' or 'Charged Off'.

In [None]:
# Listing Consumer behaviour columns
behaviour_columns = ['last_pymnt_d', 'last_pymnt_amnt', 'last_credit_pull_d', 'delinq_2yrs', 
                     'earliest_cr_line', 'inq_last_6mths', 'open_acc', 'pub_rec',
                    'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee',
                    'revol_bal', 'revol_util', 'total_acc', 'out_prncp', 'out_prncp_inv',
                    'recoveries', 'collection_recovery_fee']
# Listing unused columns for analysis
# id and member_id are insignificant columns. can remove them.
# title can be ignored too as purpose column drives our analysis better
unused_columns = ['funded_amnt_inv', 'zip_code', 'addr_state', 'url', 'id', 'member_id', 'title']
to_drop_columns = behaviour_columns + unused_columns

In [None]:
# dropping unused columns
loan_df = loan_df.drop(to_drop_columns, axis=1)
loan_df.shape

In [None]:
loan_df['loan_status'].value_counts()

In [None]:
# keeping the original column before converting to numeric
loan_df['loan_status_cat'] = loan_df['loan_status']

In [None]:
loan_df['loan_status_cat']

### Reason to change loan_status to numerical value
Loan status is a categorical variable, which is not suitable for numerical computation. Converting it to a numeric variable (1 for Charged Off, 0 for Fully Paid) lets us compute its average, which is the default rate, and display that rate in various plots.
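
To illustrate why the 0/1 encoding is convenient, here is a small sketch on a toy frame (the rows are invented, and `defaulted` is a hypothetical column name). The mean of a 0/1 column is the share of ones, so a plain mean or a grouped mean gives the default rate directly. This is also what `sns.barplot` shows by default, since its default estimator is the mean.

```python
import pandas as pd

# Invented rows; only the column names mirror the notebook.
toy = pd.DataFrame({
    "grade": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "loan_status": ["Fully Paid", "Fully Paid", "Fully Paid", "Charged Off",
                    "Charged Off", "Charged Off", "Fully Paid", "Fully Paid"],
})

# Encode Charged Off as 1 and everything else as 0.
toy["defaulted"] = (toy["loan_status"] == "Charged Off").astype(int)

# Mean of a 0/1 column is the proportion of defaults.
overall_rate = toy["defaulted"].mean()
rate_by_grade = toy.groupby("grade")["defaulted"].mean()
print(overall_rate)
print(rate_by_grade)
```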

In [None]:
# Keeping only Fully Paid and Charged Off loans (dropping 'Current'), then converting them to numeric
# .copy() avoids a SettingWithCopyWarning on the assignment below
loan_df = loan_df.loc[loan_df['loan_status'] != 'Current', :].copy()
loan_df['loan_status'] = loan_df['loan_status'].apply(lambda x: 1 if x=='Charged Off' else 0)
loan_df['loan_status']

In [None]:
loan_df['loan_status'].value_counts()

In [None]:
loan_df.info()

### Copy Dataframe

In [None]:
df = loan_df.copy()

### What is the structure of your dataset?
There are 35367 loans in the dataset with 20 features (int_rate, installment, grade, emp_length, purpose, annual_inc, loan_status, term_months, dti etc). Most variables are numeric in nature, but the variables grade, term_months, purpose, and home_ownership are categorical with the following levels. Grade is ordered from best (A) to worst (G); purpose and home_ownership have no inherent order.

- grade: A, B, C, D, E, F, G
- term_months: 36, 60
- purpose: credit_card, car, small_business, other, wedding, debt_consolidation, home_improvement, major_purchase, medical, moving, vacation, house, renewable_energy, educational
- home_ownership: RENT, OWN, MORTGAGE, OTHER

### What is/are the main feature(s) of interest in your dataset?
I'm most interested in figuring out what features are best for predicting the loan status (paid or default) in the dataset.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?
I expect that interest rate and income will have the strongest effect on loan status: the higher the interest rate and the lower the income, the higher the rate of default. I also think that purpose, grade, loan term, employment length and debt-to-income ratio will have an impact on loan status, though to a smaller degree.

## Univariate Exploration

In [None]:
# let's check the proportion of loans that defaulted (the mean of the 0/1 loan_status column)
loan_df['loan_status'].describe()

## Distribution of Loan Status
Most of the loans are fully paid. <br> 
About 14% of all loans are charged off.

In [None]:
# let's plot the countplot for loan status category
base_color = sns.color_palette()[0]
sns.countplot(data = df, x = 'loan_status_cat', color = base_color);
plt.title("Distribution of Loan Status")
plt.xlabel('Loan Status')
plt.show();

In [None]:
# let's check the distribution of interest rate charged to customers
df.int_rate.describe()

In [None]:
# let's also plot box plot for interest rate
df.int_rate.plot(kind='box');

The median interest rate charged is around 12.5%. High interest rates, up to a maximum of about 24%, are charged to a few clients. There are outliers at the higher end as well.

In [None]:
# let's check the distribution of Annual Income
df.annual_inc.describe()

In [None]:
# let's also plot box plot for annual income
df.annual_inc.plot(kind='box');

The median income is 60k and the minimum is 4k. It looks like people with very low income were offered loans. Also, there are very high income earners as well.

I'll now move on to the other variables in the dataset: grade, term_months, and purpose.

In [None]:
# grade
level_order = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
sns.countplot(data = df, x = 'grade', color = base_color, order=level_order);

Grades A and B account for the most loans. This is consistent with the best-to-worst ordering of grades: the lower-quality grades have fewer loans.

In [None]:
# term_month
sns.countplot(data = df, x = 'term_months', color = base_color);

In [None]:
# Let's check what the distribution of purpose looks like
sns.countplot(y='purpose', data=df, color = base_color);

In [None]:
#proportion of unique purpose in dataset
df.purpose.value_counts() / df.shape[0]

`debt_consolidation` makes up almost 48% of all loans under the purpose category, and it would be interesting to see how it relates to other variables.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?<br>
14% of total loans are charged off. Loan status is a categorical variable which is transformed into a numeric variable for computation.

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?<br>
A few outliers were found in interest rate and annual income. No transformation was applied to these variables.
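
The points drawn beyond the whiskers of a box plot follow, by default, the 1.5 × IQR rule. As a minimal sketch of how such outliers could be listed explicitly, here is the rule applied to a handful of invented income values (not taken from the dataset):

```python
import pandas as pd

# Invented incomes; the last one is deliberately extreme.
income = pd.Series([30000, 45000, 60000, 75000, 90000, 1000000])

# Quartiles and interquartile range.
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = income[(income < lower) | (income > upper)]
print(outliers.tolist())
```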

# Bivariate Exploration

### Interest Rate Vs Loan Status
Loans with higher interest rates default more often.

In [None]:
# let's check interest rate descriptive stats
df['int_rate'].describe()

In [None]:
# binning int_rate
df['int_rate_bin'] = pd.cut(df['int_rate'], 
                                [0,5,10,15,20,25,30], 
                                labels=['0-5','5-10','10-15','15-20','20-25','25-30'])
df['int_rate_bin'].value_counts()

In [None]:
# Plot between interest rate and loan status
sns.barplot(x='int_rate_bin', y='loan_status', data=df, color = base_color)
plt.title('Interest Rate Status')
plt.xlabel('Interest Rate')
plt.ylabel('Loan Status')
plt.show()

### Annual Income Vs Loan Status
The plot below clearly shows that low income earners have the highest default rate, followed by medium income earners.

In [None]:
# Continuous variable: annual_inc
df['annual_inc_raw'] = df['annual_inc']
df['annual_inc'].describe().astype('int')

In [None]:
# binning annual income
def annual_inc(inc):
    if inc <= 50000:
        return 'low'
    elif inc > 50000 and inc <=100000:
        return 'medium'
    elif inc > 100000 and inc <=150000:
        return 'high'
    else:
        return 'very high'

df['annual_inc'] = df['annual_inc'].apply(lambda x: annual_inc(x))
df['annual_inc'].value_counts()

In [None]:
# cross tab between annual_inc and loan_status
pd.crosstab(df.annual_inc, df.loan_status_cat, margins=True, margins_name="Total")

In [None]:
## bar plot on categorical variable : annual_inc
sns.barplot(x='annual_inc', y='loan_status', data=df, color = base_color)
plt.title('Annual Income Default Status')
plt.xlabel('Annual Income')
plt.ylabel('Loan Status')
plt.show()

### Grade Vs Loan Status 
Loans in the later grades (towards `G`) have a higher percentage of defaults.

In [None]:
# crosstab between loan status and grade
pd.crosstab(df.grade, df.loan_status_cat, margins=True, margins_name="Total", normalize="index")
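
As a small illustration of what `normalize="index"` does in the crosstab above, here is a sketch on invented rows. Each row of the result is divided by its own total, so the `Charged Off` column reads directly as the default rate for that grade.

```python
import pandas as pd

# Invented rows; only the column names mirror the notebook.
toy = pd.DataFrame({
    "grade": ["A", "A", "A", "A", "B", "B"],
    "loan_status_cat": ["Fully Paid", "Fully Paid", "Fully Paid", "Charged Off",
                        "Fully Paid", "Charged Off"],
})

# Row-normalised crosstab: every row sums to 1.
rates = pd.crosstab(toy.grade, toy.loan_status_cat, normalize="index")
print(rates)
```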

In [None]:
# bar plot on categorical variable : grade
sns.barplot(x='grade', y='loan_status', data=df, color = base_color, order = level_order)
plt.title('Loan status for Grades')
plt.xlabel('Grade')
plt.ylabel('Loan Status')
plt.show()

### Term Vs Loan Status 
Overall, loans with a `60` month term default more than twice as often as loans with a `36` month term.

In [None]:
# crosstab between month term and loan status. Showing percentage of defaults
pd.crosstab(df.term_months, df.loan_status_cat, margins=True, margins_name="Total", normalize="index")

In [None]:
# bar plot on categorical variable : term_months
plt.title('Loan status for Term')
sns.barplot(x='term_months', y='loan_status', data=df, color = base_color)
plt.xlabel('Terms in months')
plt.ylabel('Loan Status')
plt.show()

### Loan Purpose Vs Status
Clients with `small_business` as purpose default the most, followed by `renewable_energy` and `house`.

In [None]:
# crosstab between purpose and loan_status
pd.crosstab(df.purpose, df.loan_status_cat, margins=True, margins_name="Total", normalize="index")

In [None]:
## bar plot on categorical variable : purpose
plt.figure(figsize = [12, 5])
sns.barplot(y='purpose', x='loan_status', data=df, color = base_color)
plt.title('Loan Purpose and Status')
# purpose is on the y axis and loan_status on the x axis
plt.xlabel('Loan Status')
plt.ylabel('Purpose')
plt.show()

### Loan Year Vs Status

In [None]:
# crosstab between loan status and issue year
pd.crosstab(df.loan_status_cat, df.issue_d_year, margins=True, margins_name="Total")

In [None]:
## bar plot on categorical variable : issue_d_year
sns.barplot(x='issue_d_year', y='loan_status', data=df,  color = base_color)
plt.show()

The default rate increased suddenly in 2011, in spite of decreasing from 2008 to 2010.

### Home Owners Vs Loan Status 

In [None]:
## bar plot on categorical variable : home_ownership
sns.barplot(x='home_ownership', y='loan_status', data=df, color = base_color)
plt.show()

Clients having `OTHER` as home ownership could be considered under High Risk Category.

### Employment Length Vs Loan Status 

In [None]:
# crosstab between emp_length and loan_status
pd.crosstab(df.loan_status_cat, df.emp_length, margins=True, margins_name="Total", normalize="index")

In [None]:
## bar plot on categorical variable : emp_length
plt.figure(figsize=(10,5))
sns.barplot(x='emp_length', y='loan_status', data=df, color = base_color)
plt.show()

`Employment length` is not much of a predictor of default

### Loan Amount Vs Loan Status
Very high and high amount loans tend to have more defaulters.

In [None]:
df['loan_amnt'].describe().astype('int')

In [None]:
# binning loan_amnt
def loan_amnt(amt):
    if amt <= 5500:
        return 'low'
    elif amt > 5500 and amt <=10000:
        return 'medium'
    elif amt > 10000 and amt <=15000:
        return 'high'
    else:
        return 'very high'

df['loan_amnt_bin'] = df['loan_amnt'].apply(lambda x: loan_amnt(x))
df['loan_amnt_bin'].value_counts()

In [None]:
## bar plot on categorical variable : loan_amnt_bin
loan_order = ['low', 'medium', 'high', 'very high']
sns.barplot(x='loan_amnt_bin', y='loan_status', data=df, color = base_color, order = loan_order)
plt.title('Loan Amount Default Status')
plt.xlabel('Loan Amount')
plt.ylabel('Status')
plt.show()

### Debt To Income Ratio Vs Loan Status

In [None]:
df['dti'].describe()

In [None]:
# binning debt to income ratio
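# include_lowest=True keeps rows with dti == 0 in the first bin instead of turning them into NaN
df['dti_bin'] = pd.cut(df['dti'], 
                                [0,5,10,15,20,25,30], include_lowest=True,
                                labels=['0-5','5-10','10-15','15-20','20-25','25-30'])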
df['dti_bin'].value_counts()

In [None]:
## bar plot on categorical variable : dti_bin
sns.barplot(x='dti_bin', y='loan_status', data=df, color = base_color)
plt.show()

The default rate is highest for a debt-to-income ratio between 20% and 25%, but there is not much difference between the lowest and highest bins. The default rate rises from the lower to the higher bins, with the exception of a drop for the highest debt-to-income bin.

### Installment Vs Loan Status

In [None]:
# installment
def installment(n):
    if n <= 200:
        return 'low'
    elif n > 200 and n <=400:
        return 'medium'
    elif n > 400 and n <=600:
        return 'high'
    else:
        return 'very high'
    
df['installment'] = df['installment'].apply(lambda x: installment(x))

In [None]:
## bar plot
sns.barplot(x='installment', y='loan_status', data=df, color = base_color)
plt.show();

The higher the installment amount, the higher the default rate. But the difference between the low and very high categories is small, at approximately 4 percentage points of default rate.

### Loan Purpose Vs Interest Rate

In [None]:
# loan purpose Vs Interest Rate
pd.crosstab(df.purpose, df.int_rate_bin, margins=True, margins_name="Total", normalize="index").apply(lambda r: round(100*(r/r.sum())), axis=1)

In [None]:
# Box plot between the loan purpose and interest rate offered
plt.figure(figsize=(20, 10))
sns.boxplot(x='purpose', y='int_rate', data=df, color = base_color);
plt.show();

`Small Business`, `House` and `Debt Consolidation` are considered high risk loan purposes and are hence charged higher interest rates.

### Loan Amount Vs Interest Rate
Higher interest rates are charged for high loan amounts.

In [None]:
## bar plot on categorical variable : loan_amnt_bin
sns.barplot(x='loan_amnt_bin', y='int_rate', data=df, color = base_color)
plt.show()

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
Loans with higher interest rates have a higher default rate. Low income earners tend to default more. The default rate also rises from grade A towards grade G, and the 60 month term has a high default rate. Debt consolidation is the most common loan purpose (almost 48% of loans), while `small_business` has the highest default rate. Home ownership, debt-to-income ratio, employment length and installment amount are not significant predictors of loan status.

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
The interest charged for `small_business` and `debt_consolidation` is higher than for other purposes. Also, high interest rates are charged for high loan amounts.

### Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
Interest rate bins are created to distribute the data in intervals of 5%. Annual income bins are created as low, medium, high and very high to visualize the data in different income groups. Similarly, debt-to-income ratio bins are created to distribute the data in groups. Installment bins are also created to visualize the data in different installment categories like low, medium etc.
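
One detail of binning with `pd.cut` is worth keeping in mind: by default the intervals are closed on the right and open on the left, so a value equal to the lowest edge falls outside every bin. The sketch below uses invented rate values to show the default behaviour next to `include_lowest=True`.

```python
import pandas as pd

# Invented values, including both edge cases (0.0 and exact bin edges).
rates = pd.Series([0.0, 5.0, 5.42, 10.0, 24.59])
edges = [0, 5, 10, 15, 20, 25, 30]
labels = ['0-5', '5-10', '10-15', '15-20', '20-25', '25-30']

# Default: bins are (0, 5], (5, 10], ... so 0.0 is not assigned to any bin.
default_bins = pd.cut(rates, edges, labels=labels)

# include_lowest=True makes the first bin [0, 5].
inclusive_bins = pd.cut(rates, edges, labels=labels, include_lowest=True)

print(default_bins.tolist())
print(inclusive_bins.tolist())
```

Values that sit exactly on an inner edge, such as 10.0, go to the lower bin (`5-10`) in both cases.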

# Multivariate Exploration
As seen during the bivariate analysis, `loan purpose`, `loan term`, `grade`, `interest rate`, `annual income` and `loan amount` are significant variables for determining the rate of default. Hence, we combine a few of these variables to see their joint relationship with the rate of default.

In [None]:
# let's check the most common loan purposes
df.purpose.value_counts()

In [None]:
# let's take the top 4 purposes, excluding 'other' since its details are not very clear
main_purposes = ["debt_consolidation", "credit_card","home_improvement","major_purchase"]
df = df[df['purpose'].isin(main_purposes)]
df['purpose'].value_counts()

In [None]:
sns.countplot(data = df, x = 'purpose', color = base_color);

## Loan Term, Purpose Vs Loan Status
Plotting loan purpose against loan term shows that `debt consolidation` has the highest rate of default for both the 36 and 60 month terms. The default rate increases when we move from the shorter term to the longer term: it is above 25% for the 60 month term against only about 12% for the 36 month term, roughly double, so 60 month loans can be considered very risky.
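
The numbers behind a grouped bar plot like the one below can also be read from a two-way table of default rates. This is a sketch on invented rows; only the column names follow the notebook.

```python
import pandas as pd

# Invented rows: loan_status is 1 for Charged Off and 0 for Fully Paid.
toy = pd.DataFrame({
    "term_months": [36, 36, 36, 36, 60, 60, 60, 60],
    "purpose": ["debt_consolidation", "debt_consolidation",
                "credit_card", "credit_card"] * 2,
    "loan_status": [0, 1, 0, 0, 1, 1, 0, 1],
})

# Mean of loan_status per (purpose, term) cell is the default rate of that cell.
rate_table = toy.pivot_table(index="purpose", columns="term_months",
                             values="loan_status", aggfunc="mean")
print(rate_table)
```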

In [None]:
# let's now compare the default rates across two types of categorical variables
# purpose of loan (constant) and another categorical variable (which changes)
plt.figure(num=None, figsize=(12, 10), dpi=80, facecolor='w', edgecolor='k')
sns.barplot(x='term_months', y='loan_status', hue="purpose", data=df)
plt.title('Loan Purpose, Term Default Rate')
plt.xlabel('Loan term (Months)')
plt.ylabel('Loan Status')
plt.show();

## Loan Amount, Interest Rate Vs Loan Status
Let's check how the default rate varies with loan amount and interest rate, both of which were significant variables in the bivariate analysis. As can be seen, high loan amounts are charged high interest rates, and as the interest rate goes up, the default rate goes up as well. High loan amounts with an interest rate between 20% and 25% have a default rate of more than 40%.

In [None]:
# Loan Amount
plt.figure(num=None, figsize=(12, 10), dpi=80, facecolor='w', edgecolor='k')
sns.barplot(x='loan_amnt_bin', y='loan_status', hue="int_rate_bin", data=df, order = loan_order)
plt.title('Loan Amount and Interest Default Rate')
plt.xlabel('Loan Amount')
plt.ylabel('Loan Status')
plt.show();

## Installment, Annual Income Vs Loan Status
Let's plot installment and annual income against loan status. We've already seen that the low income group and high installments have high default rates. A similar trend can be seen below, where the low income group with high installments has the highest default rate. The highest default rate for the low income group is around 18%, and the lowest default rate for the high income group is 12.5%.

In [None]:
# Installment
plt.figure(num=None, figsize=(12, 10), dpi=80, facecolor='w', edgecolor='k')
sns.barplot(x='annual_inc', y='loan_status', hue="installment", data=df, order = loan_order)
plt.title('Loan Installment, Annual Income Default Rate')
plt.xlabel('Annual Income')
plt.ylabel('Loan Status')
plt.show();

## Grade, Purpose Vs Loan Status
Extending our earlier finding that `debt_consolidation` defaults the most among the main purposes, let's plot purpose against grade. We've already seen that the later grades tend to default more. Let's see whether this trend also holds for `debt_consolidation`. As seen below, the default rate increases towards grades E, F and G, and within those grades loans with `debt_consolidation` as purpose default the most.

In [None]:
# grade
plt.figure(num=None, figsize=(12, 10), dpi=80, facecolor='w', edgecolor='k')
sns.barplot(x='grade', y='loan_status', hue="purpose", data=df, order = level_order)
plt.title('Loan Purpose, Grade Default Rate')
plt.xlabel('Grades')
plt.ylabel('Loan Status')
plt.show();

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
`debt_consolidation` with a 60 month loan term is the top category to default. This also holds true for the later grades. Higher loan amounts attract higher interest rates, with a high risk of default. The low income group with a high installment amount has a high default rate.

### Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
The top 4 purposes are separated out as the main purposes and plotted against grade and loan term.