<a href="https://www.kaggle.com/code/swapnil09/lending-club-loan-default-prediction?scriptVersionId=208946170" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Problem Statement
LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. LendingClub is the world's largest peer-to-peer lending platform.

Solving this case study will give us an idea about how real business problems are solved using EDA and Machine Learning. In this case study, we will also develop a basic understanding of risk analytics in banking and financial services and understand how data is used to minimise the risk of losing money while lending to customers.

# Business Understanding
You work for the LendingClub company which specialises in lending various types of loans to urban customers. When the company receives a loan application, the company has to make a decision for loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision:

* If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company
* If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company

The data given contains information about past loan applicants and whether they ‘defaulted’ or not. The aim is to identify patterns which indicate if a person is likely to default, which may be used for taking actions such as denying the loan, reducing the amount of the loan, lending (to risky applicants) at a higher interest rate, etc.

When a person applies for a loan, there are two types of decisions that could be taken by the company:

1. Loan accepted: If the company approves the loan, there are 3 possible scenarios:
   * Fully paid: The applicant has fully repaid the loan (the principal and the interest).
   * Current: The applicant is in the process of paying the instalments, i.e. the tenure of the loan is not yet completed. These candidates are not labelled as 'defaulted'.
   * Charged-off: The applicant has not paid the instalments in due time for a long period of time, i.e. he/she has defaulted on the loan.
2. Loan rejected: The company rejected the loan (because the candidate did not meet its requirements, etc.). Since the loan was rejected, these applicants have no transactional history with the company, so this data is not available to the company (and thus not in this dataset).

# Business Objectives

* LendingClub is the largest online loan marketplace, facilitating personal loans, business loans, and financing of medical procedures. Borrowers can easily access lower interest rate loans through a fast online interface.
* As with most other lending companies, lending to ‘risky’ applicants is the largest source of financial loss (called credit loss). Credit loss is the amount of money lost by the lender when the borrower refuses to pay or runs away with the money owed. In other words, borrowers who default cause the largest losses to the lenders. In this case, the customers labelled as 'charged-off' are the 'defaulters'.
* If one is able to identify these risky loan applicants, then such loans can be reduced thereby cutting down the amount of credit loss. Identification of such applicants using EDA and machine learning is the aim of this case study.
* In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilise this knowledge for its portfolio and risk assessment.
* To develop your understanding of the domain, you are advised to independently research a little about risk analytics (understanding the types of variables and their significance should be enough).

# Data Description

Here is the information on this particular data set:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: left;">
      <th></th>
      <th>LoanStatNew</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>loan_amnt</td>
      <td>The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.</td>
    </tr>
    <tr>
      <th>1</th>
      <td>term</td>
      <td>The number of payments on the loan. Values are in months and can be either 36 or 60.</td>
    </tr>
    <tr>
      <th>2</th>
      <td>int_rate</td>
      <td>Interest Rate on the loan</td>
    </tr>
    <tr>
      <th>3</th>
      <td>installment</td>
      <td>The monthly payment owed by the borrower if the loan originates.</td>
    </tr>
    <tr>
      <th>4</th>
      <td>grade</td>
      <td>LC assigned loan grade</td>
    </tr>
    <tr>
      <th>5</th>
      <td>sub_grade</td>
      <td>LC assigned loan subgrade</td>
    </tr>
    <tr>
      <th>6</th>
      <td>emp_title</td>
      <td>The job title supplied by the Borrower when applying for the loan.*</td>
    </tr>
    <tr>
      <th>7</th>
      <td>emp_length</td>
      <td>Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.</td>
    </tr>
    <tr>
      <th>8</th>
      <td>home_ownership</td>
      <td>The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER</td>
    </tr>
    <tr>
      <th>9</th>
      <td>annual_inc</td>
      <td>The self-reported annual income provided by the borrower during registration.</td>
    </tr>
    <tr>
      <th>10</th>
      <td>verification_status</td>
      <td>Indicates if income was verified by LC, not verified, or if the income source was verified</td>
    </tr>
    <tr>
      <th>11</th>
      <td>issue_d</td>
      <td>The month which the loan was funded</td>
    </tr>
    <tr>
      <th>12</th>
      <td>loan_status</td>
      <td>Current status of the loan</td>
    </tr>
    <tr>
      <th>13</th>
      <td>purpose</td>
      <td>A category provided by the borrower for the loan request.</td>
    </tr>
    <tr>
      <th>14</th>
      <td>title</td>
      <td>The loan title provided by the borrower</td>
    </tr>
    <tr>
      <th>15</th>
      <td>zip_code</td>
      <td>The first 3 numbers of the zip code provided by the borrower in the loan application.</td>
    </tr>
    <tr>
      <th>16</th>
      <td>addr_state</td>
      <td>The state provided by the borrower in the loan application</td>
    </tr>
    <tr>
      <th>17</th>
      <td>dti</td>
      <td>A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.</td>
    </tr>
    <tr>
      <th>18</th>
      <td>earliest_cr_line</td>
      <td>The month the borrower's earliest reported credit line was opened</td>
    </tr>
    <tr>
      <th>19</th>
      <td>open_acc</td>
      <td>The number of open credit lines in the borrower's credit file.</td>
    </tr>
    <tr>
      <th>20</th>
      <td>pub_rec</td>
      <td>Number of derogatory public records</td>
    </tr>
    <tr>
      <th>21</th>
      <td>revol_bal</td>
      <td>Total credit revolving balance</td>
    </tr>
    <tr>
      <th>22</th>
      <td>revol_util</td>
      <td>Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.</td>
    </tr>
    <tr>
      <th>23</th>
      <td>total_acc</td>
      <td>The total number of credit lines currently in the borrower's credit file</td>
    </tr>
    <tr>
      <th>24</th>
      <td>initial_list_status</td>
      <td>The initial listing status of the loan. Possible values are – W, F</td>
    </tr>
    <tr>
      <th>25</th>
      <td>application_type</td>
      <td>Indicates whether the loan is an individual application or a joint application with two co-borrowers</td>
    </tr>
    <tr>
      <th>26</th>
      <td>mort_acc</td>
      <td>Number of mortgage accounts.</td>
    </tr>
    <tr>
      <th>27</th>
      <td>pub_rec_bankruptcies</td>
      <td>Number of public record bankruptcies</td>
    </tr>
  </tbody>
</table>

---
----

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pointbiserialr
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report, roc_auc_score,
                             roc_curve, auc, ConfusionMatrixDisplay, RocCurveDisplay)
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE 
from imblearn.over_sampling import BorderlineSMOTE 
from imblearn.over_sampling import ADASYN  



# Data Collection

In [None]:
df = pd.read_csv("/kaggle/input/lending-club-dataset/lending_club_loan_two.csv") 
df.head()

# Missing Values Check

In [None]:
df.isnull().sum()

In [None]:
def nulls_summary_table(df):
    """
    Returns a summary table showing null value counts and percentage
    
    Parameters:
    df (DataFrame): Dataframe to check
    
    Returns:
    null_values (DataFrame)
    """
    null_values = pd.DataFrame(df.isnull().sum())
    null_values[1] = null_values[0]*100/len(df)
    null_values.columns = ['null_count','null_pct']
    return null_values

nulls_summary_table(df)

In [None]:
drop_df = df.copy()
drop_df = drop_df.dropna()

In [None]:
nulls_summary_table(drop_df)

# Exploratory Data Analysis

The goal is to understand the data and identify the important variables.

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
# Count the occurrences of each loan status
status_counts = drop_df['loan_status'].value_counts()

# Create a bar chart
plt.bar(status_counts.index, status_counts.values)
plt.xlabel('Loan Status')
plt.ylabel('Count')
plt.title('Loan Status Distribution')
plt.show()

In [None]:
term_counts = drop_df['term'].value_counts()

# Create a bar chart
plt.bar(term_counts.index, term_counts.values)
plt.xlabel('Terms')
plt.ylabel('Count')
plt.title('Loan Terms Distribution')
plt.show()

In [None]:
plt.violinplot(drop_df['loan_amnt'])

In [None]:
sns.boxplot(drop_df['loan_amnt'])

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(drop_df.corr(numeric_only=True), annot=True, cmap='viridis')

In [None]:
sns.scatterplot(x = drop_df['installment'], y = drop_df['loan_amnt'], hue=drop_df['loan_status'], alpha=0.5)
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='loan_status', y='loan_amnt', data=drop_df)
plt.title('Loan Amount Distribution by Loan Status')
plt.xlabel('Loan Status')
plt.ylabel('Loan Amount')
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='loan_status', y='installment', data=drop_df)
plt.title('Installments Distribution by Loan Status')
plt.xlabel('Loan Status')
plt.ylabel('Installments')
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(data=drop_df, x='loan_amnt', hue='loan_status', kde=True, multiple='stack')
plt.xlabel('Loan Amount')
plt.ylabel('Frequency')
plt.title('Distribution of Loan Amount by Loan Status')
plt.show()


In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(data=drop_df, x='installment', hue='loan_status', kde=True, multiple='stack')
plt.xlabel('Installment')
plt.ylabel('Frequency')
plt.title('Distribution of Installments by Loan Status')
plt.show()

In [None]:
drop_df.groupby(by='loan_status')['loan_amnt'].describe()

## **grade & sub_grade**
* grade: LC assigned loan grade
* sub_grade: LC assigned loan subgrade

Let's explore the Grade and SubGrade columns that LendingClub attributes to the loans.

What are the possible unique values of grade & sub_grade?

In [None]:
plt.figure(figsize=(15, 10))

plt.subplot(2, 2, 1)
grade = sorted(drop_df.grade.unique().tolist())
sns.countplot(x='grade', data=drop_df, hue='loan_status', order=grade)

plt.subplot(2, 2, 2)
sub_grade = sorted(drop_df.sub_grade.unique().tolist())
g = sns.countplot(x='sub_grade', data=drop_df, hue='loan_status', order=sub_grade)
g.set_xticklabels(g.get_xticklabels(), rotation=90);

## **term, home_ownership, verification_status & purpose**
* term: The number of payments on the loan. Values are in months and can be either 36 or 60.
* home_ownership: The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER
* verification_status: Indicates if income was verified by LC, not verified, or if the income source was verified
* purpose: A category provided by the borrower for the loan request.

In [None]:
drop_df['home_ownership'].value_counts()

In [None]:
plt.figure(figsize=(15, 20))

plt.subplot(4, 2, 1)
sns.countplot(data=drop_df, x='home_ownership', hue='loan_status')
plt.title('Home Ownership by Loan Status')
plt.xlabel('Home Ownership Type')
plt.ylabel('Count')
plt.legend(title='Loan Status')


plt.subplot(4, 2, 2)
sns.countplot(data=drop_df, x='term', hue='loan_status')
plt.title('Term by Loan Status')
plt.xlabel('Term Type')
plt.ylabel('Count')
plt.legend(title='Loan Status')


plt.subplot(4, 2, 3)
sns.countplot(data=drop_df, x='verification_status', hue='loan_status')
plt.title('Verification Status by Loan Status')
plt.xlabel('Verification Status')
plt.ylabel('Count')
plt.legend(title='Loan Status')


plt.subplot(4, 2, 4)
sns.countplot(data=drop_df, x='purpose', hue='loan_status')
plt.title('Purpose by Loan Status')
plt.xlabel('Purpose')
plt.ylabel('Count')
plt.xticks(rotation=90)  # Rotate x-axis labels if needed
plt.legend(title='Loan Status')


In [None]:
drop_df.loc[drop_df['home_ownership']=='OTHER', 'loan_status'].value_counts()


## **int_rate & annual_inc**
* int_rate: Interest Rate on the loan
* annual_inc: The self-reported annual income provided by the borrower during registration

In [None]:
plt.figure(figsize=(15, 20))

plt.subplot(4, 2, 1)
sns.histplot(data=drop_df, x='int_rate', hue='loan_status', bins =20, multiple="stack", kde=True)
plt.title('Interest Rate by Loan Status')
plt.xlabel('Interest Rate')
plt.ylabel('Count')
plt.legend(title='Loan Status')


plt.subplot(4, 2, 2)
sns.histplot(data=drop_df, x='annual_inc', hue='loan_status', bins =50, multiple="stack", kde=True)
plt.title('Annual Income by Loan Status')
plt.xlabel('Annual Income')
plt.ylabel('Count')
plt.legend(title='Loan Status')

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(data=drop_df[drop_df['annual_inc']<= 250000], x='annual_inc', hue='loan_status', bins =50, multiple="stack", kde=True)
plt.title('Annual Income(<= 250000) by Loan Status')
plt.xlabel('Annual Income')
plt.ylabel('Count')
plt.legend(title='Loan Status')

In [None]:
print((drop_df[drop_df.annual_inc >= 250000].shape[0] / drop_df.shape[0]) * 100)
print((drop_df[drop_df.annual_inc >= 1000000].shape[0] / drop_df.shape[0]) * 100)

* **Only 1.05% of borrowers have an annual income greater than 250000, and only 0.018% have an annual income greater than 1000000.**

In [None]:
drop_df.loc[drop_df.annual_inc >= 1000000, 'loan_status'].value_counts()

In [None]:
drop_df.loc[drop_df.annual_inc >= 250000, 'loan_status'].value_counts()

* Loans with higher interest rates are more likely to be unpaid.
* Only 61 borrowers have an annual income greater than 1 million, and 3538 borrowers have an annual income greater than 250K.

## emp_title & emp_length
* emp_title: The job title supplied by the Borrower when applying for the loan.
* emp_length: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.

In [None]:
drop_df['emp_title'].value_counts()[:20]

In [None]:
plt.figure(figsize=(15, 12))

plt.subplot(2, 2, 1)
order = ['< 1 year', '1 year', '2 years', '3 years', '4 years', '5 years', 
          '6 years', '7 years', '8 years', '9 years', '10+ years',]
g = sns.countplot(x='emp_length', data=drop_df, hue='loan_status', order=order)
g.set_xticklabels(g.get_xticklabels(), rotation=90);

plt.subplot(2, 2, 2)
plt.barh(drop_df.emp_title.value_counts()[:30].index, drop_df.emp_title.value_counts()[:30])
plt.title("Top 30 job titles among borrowers")
plt.tight_layout()

## title
* title: The loan title provided by the borrower

In [None]:
drop_df.title.value_counts()[:10]

We will remove title column as we already have a purpose column.

## dti, open_acc, revol_bal, revol_util, & total_acc
* dti: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
* open_acc: The number of open credit lines in the borrower's credit file.
* revol_bal: Total credit revolving balance
* revol_util: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
* total_acc: The total number of credit lines currently in the borrower's credit file

In [None]:
# 1 if the borrower has at least one closed credit line (total_acc differs from open_acc), else 0
drop_df['closed_credit_line'] = (drop_df['total_acc'] != drop_df['open_acc']).astype(int)


In [None]:
drop_df['closed_credit_line'].value_counts()

This means that 332345 borrowers have at least 1 closed credit line. This may suggest a lower chance of default, since these borrowers have previous experience managing multiple credit lines and closing them by paying off what they owed.

In [None]:
plt.figure(figsize=(15, 20))

plt.subplot(4, 2, 1)
sns.histplot(data=drop_df, x='dti', hue='loan_status', bins =50, multiple="stack", kde=True)
plt.title('dti by Loan Status')
plt.xlabel('Debt to Income Ratio')
plt.ylabel('Count')
plt.legend(title='Loan Status')


plt.subplot(4, 2, 2)
sns.histplot(data=drop_df[drop_df['dti']<= 50], x='dti', hue='loan_status', bins =50, multiple="stack", kde=True)
plt.title('dti(<= 50) by Loan Status')
plt.xlabel('Debt to Income Ratio')
plt.ylabel('Count')
plt.legend(title='Loan Status')

In [None]:
drop_df.loc[drop_df['dti']>=50, 'loan_status'].value_counts()

In [None]:
plt.figure(figsize=(15, 20))

plt.subplot(4, 2, 1)
sns.histplot(data=drop_df, x='open_acc', hue='loan_status', bins =50, multiple="stack", kde=True)
plt.title('Loan Status by The number of open credit lines')
plt.xlabel('The number of open credit lines')
plt.ylabel('Count')
plt.legend(title='Loan Status')


plt.subplot(4, 2, 2)
sns.histplot(data=drop_df, x='total_acc', hue='loan_status', bins =50, multiple="stack", kde=True)
plt.title('Loan Status by The total number of credit lines')
plt.xlabel('The total number of credit lines')
plt.ylabel('Count')
plt.legend(title='Loan Status')

In [None]:
print(drop_df.shape)
print(drop_df[drop_df.open_acc > 40].shape)

In [None]:
print(drop_df.shape)
print(drop_df[drop_df.total_acc > 80].shape)

In [None]:
print(drop_df.shape)
print(drop_df[drop_df.revol_util > 120].shape)

In [None]:
plt.figure(figsize=(15, 20))

plt.subplot(4, 2, 1)
sns.histplot(data=drop_df, x='revol_util', hue='loan_status', bins =50, multiple="stack", kde=True)
plt.title('Loan Status by Revolving line utilization rate')
plt.xlabel('Revolving line utilization rate')
plt.ylabel('Count')
plt.legend(title='Loan Status')


plt.subplot(4, 2, 2)
sns.histplot(data=drop_df[drop_df['revol_util']<=120], x='revol_util', hue='loan_status', bins =50, multiple="stack", kde=True)
plt.title('Loan Status by Revolving line utilization rate(<= 120)')
plt.xlabel('Revolving line utilization rate')
plt.ylabel('Count')
plt.legend(title='Loan Status')

In [None]:
drop_df[drop_df['revol_util'] > 200]

In [None]:
print(drop_df.shape)
print(drop_df[drop_df.revol_bal > 250000].shape)

In [None]:
plt.figure(figsize=(15, 20))

plt.subplot(4, 2, 1)
sns.histplot(data=drop_df, x='revol_bal', hue='loan_status', bins =50, multiple="stack", kde=True)
plt.title('Loan Status by Revolving balance')
plt.xlabel('Revolving balance')
plt.ylabel('Count')
plt.legend(title='Loan Status')


plt.subplot(4, 2, 2)
sns.histplot(data=drop_df[drop_df['revol_bal'] <= 250000], x='revol_bal', hue='loan_status', bins =50, multiple="stack", kde=True)
plt.title('Loan Status by Revolving balance(<= 250000)')
plt.xlabel('Revolving balance')
plt.ylabel('Count')
plt.legend(title='Loan Status')

In [None]:
drop_df.loc[drop_df.revol_bal > 250000, 'loan_status'].value_counts()

* It seems that the higher the dti, the more likely it is that the loan will not be paid.
* Only 217 borrowers have more than 40 open credit lines.
* Only 266 borrowers have more than 80 credit lines in their credit file.

## Correlation between Loan Status and Numerical features

In [None]:
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
print(numerical_features)

In [None]:
drop_df['loan_status'] = drop_df.loan_status.map({'Fully Paid':1, 'Charged Off':0})

In [None]:
# Calculate Point-Biserial correlation for each numerical feature
for col in numerical_features:
    correlation, p_value = pointbiserialr(drop_df[col], drop_df['loan_status'])
    print(f"{col}: correlation = {correlation}, p-value = {p_value}")

* Strongest Predictors: int_rate (interest rate) and dti (debt-to-income ratio) have the highest correlation with loan default and are statistically significant. They may play important roles in any predictive model.
* Weak or Negligible Predictors: Features like revol_bal, total_acc, and pub_rec_bankruptcies show little to no correlation with default, suggesting they may not be valuable in predicting default risk.
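
As a reading aid for the numbers above: the point-biserial correlation is the Pearson correlation between a numeric feature and a 0/1 variable. The sketch below is a minimal illustration on made-up values (not taken from the dataset), with a hypothetical helper `point_biserial`. Because `loan_status` is coded as 1 for Fully Paid and 0 for Charged Off in this notebook, a negative coefficient means that higher feature values go together with default.

```python
import numpy as np

def point_biserial(x, y):
    """Point-biserial correlation between a numeric array x and a 0/1 array y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    m1 = x[y == 1].mean()        # mean of x where y == 1
    m0 = x[y == 0].mean()        # mean of x where y == 0
    p = (y == 1).mean()          # share of observations with y == 1
    s = x.std()                  # population standard deviation (ddof=0)
    return (m1 - m0) / s * np.sqrt(p * (1 - p))

# Made-up toy values for illustration only
rate = np.array([7.5, 9.0, 11.2, 13.5, 15.1, 18.3, 21.0, 24.5])
paid = np.array([1, 1, 1, 1, 0, 1, 0, 0])   # 1 = Fully Paid, 0 = Charged Off

r_pb = point_biserial(rate, paid)
r_pearson = np.corrcoef(rate, paid)[0, 1]
print(f"point-biserial = {r_pb:.4f}, Pearson = {r_pearson:.4f}")
```

Both numbers agree, which is why the sign and size of the coefficient can be read like an ordinary correlation.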

In [None]:
for col in numerical_features:
    plt.figure(figsize=(8, 4))
    sns.boxplot(x='loan_status', y=col, data=drop_df)
    plt.title(f'Distribution of {col} by loan status')
    plt.show()

# Data PreProcessing
**Objectives** -
* Convert categorical features into dummy variables.
* Detect outliers and remove them or winsorize them.
* Remove unnecessary or repetitive features.
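
The notebook later removes outliers with fixed thresholds. As a hedged illustration of the winsorizing alternative mentioned in the objectives, the sketch below caps values at chosen percentiles instead of dropping rows. The helper name `winsorize`, the percentile choices and the income values are all assumptions made for this example.

```python
import numpy as np

def winsorize(values, lower_pct=1.0, upper_pct=99.0):
    """Cap values below/above the given percentiles instead of removing them."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

# Made-up incomes with one extreme value
income = np.array([32_000, 45_000, 51_000, 58_000, 64_000, 72_000, 90_000, 1_500_000])
capped = winsorize(income, lower_pct=0, upper_pct=90)
print(capped)
```

In a train/test setting the percentile limits would normally be computed on the training split only and then applied to the test split, so that no information from the test data leaks into preprocessing.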

In [None]:
print(f"The Length of the data: {drop_df.shape}")

In [None]:
drop_df.emp_title.nunique()

* There are too many unique titles to create dummy variables for them, so we'll drop this column.

In [None]:
drop_df.drop('emp_title', axis=1, inplace=True)

In [None]:
drop_df.emp_length.unique()

In [None]:
for year in drop_df.emp_length.unique():
    print(f"{year} in this position:")
    print(f"{drop_df[drop_df.emp_length == year].loan_status.value_counts(normalize=True)}")
    print('==========================================')

Charge off rates are extremely similar across all employment lengths. So we are going to drop the emp_length column.
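
The per-category loop above can also be written as a single `groupby`, which makes the charge-off rates easier to compare side by side. This is a minimal sketch on a made-up toy frame (the values are not from the dataset), assuming the same coding as the notebook: 1 = Fully Paid, 0 = Charged Off.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "emp_length": ["< 1 year", "< 1 year", "10+ years", "10+ years", "10+ years", "3 years"],
    "loan_status": [1, 0, 1, 1, 0, 1],   # 1 = Fully Paid, 0 = Charged Off
})

# The mean of a 0/1 column is the share of ones, so 1 - mean is the charge-off rate
charge_off_rate = 1 - toy.groupby("emp_length")["loan_status"].mean()
print(charge_off_rate.sort_values())
```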

In [None]:
drop_df.drop('emp_length', axis=1, inplace=True)

In [None]:
drop_df.title.value_counts().head()

In [None]:
drop_df.purpose.value_counts().head()

The title column is simply a string subcategory/description of the purpose column. So we are going to drop the title column.

In [None]:
drop_df.drop('title', axis=1, inplace=True)

In [None]:
drop_df.mort_acc.value_counts()

In [None]:
def pub_rec(number):
    if number == 0.0:
        return 0
    else:
        return 1
    
def mort_acc(number):
    if number == 0.0:
        return 0
    elif number >= 1.0:
        return 1
    else:
        return number
    
def pub_rec_bankruptcies(number):
    if number == 0.0:
        return 0
    elif number >= 1.0:
        return 1
    else:
        return number

In [None]:
drop_df['pub_rec'] = drop_df.pub_rec.apply(pub_rec)
drop_df['mort_acc'] = drop_df.mort_acc.apply(mort_acc)
drop_df['pub_rec_bankruptcies'] = drop_df.pub_rec_bankruptcies.apply(pub_rec_bankruptcies)

In [None]:
drop_df.mort_acc.value_counts()

In [None]:
drop_df['pub_rec'].value_counts()

In [None]:
drop_df['pub_rec_bankruptcies'].value_counts()

In [None]:
drop_df.isnull().sum()

In [None]:
plt.figure(figsize=(12, 30))

plt.subplot(6, 2, 1)
sns.countplot(x='pub_rec', data=drop_df, hue='loan_status')

plt.subplot(6, 2, 2)
sns.countplot(x='initial_list_status', data=drop_df, hue='loan_status')

plt.subplot(6, 2, 3)
sns.countplot(x='application_type', data=drop_df, hue='loan_status')

plt.subplot(6, 2, 4)
sns.countplot(x='mort_acc', data=drop_df, hue='loan_status')

plt.subplot(6, 2, 5)
sns.countplot(x='pub_rec_bankruptcies', data=drop_df, hue='loan_status')

In [None]:
drop_df.dropna(inplace=True)

In [None]:
drop_df.shape

## Categorical Variables

In [None]:
print([column for column in drop_df.columns if drop_df[column].dtype == object])

In [None]:
drop_df.term.unique()

In [None]:
term_values = {' 36 months': 36, ' 60 months': 60}
drop_df['term'] = drop_df.term.map(term_values)

In [None]:
drop_df.term.unique()

In [None]:
drop_df.grade.unique()

In [None]:
drop_df.sub_grade.unique()

We know that the information in grade is already contained in sub_grade, so we are going to drop grade.

In [None]:
drop_df.drop('grade', axis=1, inplace=True)

In [None]:
dummies = ['sub_grade', 'verification_status', 'purpose', 'initial_list_status', 
           'application_type', 'home_ownership']
final_df = pd.get_dummies(drop_df, columns=dummies, drop_first=True)

In [None]:
final_df.head()

**Address**
* We are going to feature engineer a zip code column from the address in the data set. Create a column called 'zip_code' that extracts the zip code from the address column.

In [None]:
final_df.address.head()

In [None]:
final_df['zip_code'] = final_df.address.apply(lambda x: x[-5:])

In [None]:
final_df.zip_code.head()

In [None]:
final_df.zip_code.value_counts()

In [None]:
final_df = pd.get_dummies(final_df, columns=['zip_code'], drop_first=True)

In [None]:
final_df.head()

In [None]:
final_df.drop('address', axis=1, inplace=True)

**issue_d**
* Keeping this feature would be data leakage: when using our model, we wouldn't know beforehand whether or not a loan would be issued, so in theory we wouldn't have an issue date. We drop this feature.

In [None]:
final_df.drop('issue_d', axis=1, inplace=True)

In [None]:
# Convert 'earliest_cr_line' to datetime format
final_df['earliest_cr_line'] = pd.to_datetime(final_df['earliest_cr_line'], errors='coerce')

# Extract the year from 'earliest_cr_line'
final_df['earliest_cr_line'] = final_df['earliest_cr_line'].dt.year

In [None]:
final_df.earliest_cr_line.value_counts()

## Train Test Split

In [None]:
# loan_status is coded as 1 = Fully Paid, 0 = Charged Off
w_p = final_df.loan_status.value_counts()[0] / final_df.shape[0]
w_n = final_df.loan_status.value_counts()[1] / final_df.shape[0]

print(f"Share of class 0 (Charged Off): {w_p}")
print(f"Share of class 1 (Fully Paid): {w_n}")

As we can see there's a clear data imbalance and we'll have to address this issue to train our model by using techniques like SMOTE.
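
To see why the imbalance matters, consider a "model" that always predicts the majority class. The sketch below uses made-up labels with an 80/20 split, which is an illustrative assumption and not the dataset's actual ratio. Accuracy looks decent while not a single default is caught.

```python
import numpy as np

# Toy labels, coded like the notebook: 1 = Fully Paid, 0 = Charged Off
y_true = np.array([1] * 80 + [0] * 20)

# A trivial "model" that always predicts the majority class
y_pred = np.ones_like(y_true)

accuracy = (y_pred == y_true).mean()
# Recall on the minority class: share of actual defaults that were flagged
default_recall = ((y_pred == 0) & (y_true == 0)).sum() / (y_true == 0).sum()

print(f"accuracy = {accuracy:.2f}, recall on Charged Off = {default_recall:.2f}")
```

This is why the later evaluation looks at the classification report and confusion matrix rather than accuracy alone.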

In [None]:
train, test = train_test_split(final_df, test_size=0.33, random_state=42)

print(train.shape)
print(test.shape)

# Removing Outliers

In [None]:
print(train[train['dti'] <= 50].shape)
print(train.shape)

In [None]:
print(train.shape)
train = train[train['annual_inc'] <= 250000]
train = train[train['dti'] <= 50]
train = train[train['open_acc'] <= 40]
train = train[train['total_acc'] <= 80]
train = train[train['revol_util'] <= 120]
train = train[train['revol_bal'] <= 250000]
print(train.shape)

## Normalizing the Data

In [None]:
X_train, y_train = train.drop('loan_status', axis=1), train.loan_status
X_test, y_test = test.drop('loan_status', axis=1), test.loan_status

In [None]:
X_train.dtypes

In [None]:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
y_train.head()

# Handling data Imbalance

In [None]:
w_p_train = y_train.value_counts()[0] / y_train.shape[0]
w_n_train = y_train.value_counts()[1] / y_train.shape[0]

print(f"Share of class 0 (Charged Off) in train: {w_p_train}")
print(f"Share of class 1 (Fully Paid) in train: {w_n_train}")

We can clearly see that there's a huge imbalance between our classes and we need to address it so that our model is trained on sufficient data for good predictions.

## Synthetic Minority Oversampling (SMOTE)
Synthetic Minority Oversampling (SMOTE) is an oversampling technique that creates synthetic data points. It addresses a core problem of naive oversampling: plain random oversampling duplicates existing minority data points, whereas SMOTE creates new points by interpolating between a minority sample and one of its nearest minority-class neighbours.
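
The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the `imblearn` implementation: the helper name `smote_like_samples`, the brute-force neighbour search and the toy points are all made up for this example.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(X_min))             # pick a minority sample
        nn = rng.choice(neighbours[i])           # pick one of its k nearest neighbours
        lam = rng.random()                       # interpolation factor in [0, 1)
        synthetic[j] = X_min[i] + lam * (X_min[nn] - X_min[i])
    return synthetic

# Made-up minority-class points (already scaled to [0, 1])
X_minority = np.array([[0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
                       [0.8, 0.9], [0.9, 0.8], [0.7, 0.7]])
X_new = smote_like_samples(X_minority, n_new=4, k=2)
print(X_new)
```

Each synthetic point lies on the line segment between two existing minority points, so it is new data rather than an exact copy.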

In [None]:
smote = SMOTE(random_state = 42) 
X_smote, y_smote = smote.fit_resample(X_train,y_train)

In [None]:
y_smote.value_counts().plot.bar()

## Borderline Smote
The idea behind Borderline SMOTE is to oversample only the minority data points that are at risk of being misclassified. Rather than training a separate classifier, it looks at the nearest neighbours of each minority point: points whose neighbours are mostly (but not all) from the majority class lie near the class boundary, and only those are used to generate synthetic samples. This would hopefully train our algorithm to better recognize these difficult instances and correct for them.
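
The selection step can be illustrated as follows. This is a rough sketch of the "danger" rule from the original Borderline-SMOTE description (at least half, but not all, of the k nearest neighbours belong to the majority class), not the `imblearn` code. The helper name `danger_mask` and the one-dimensional toy data are assumptions for this example.

```python
import numpy as np

def danger_mask(X, y, minority_label=0, k=5):
    """Flag minority samples whose k nearest neighbours are mostly, but not all, majority."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # exclude the point itself
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Number of majority-class points among each sample's k neighbours
    n_majority = (y[neighbours] != minority_label).sum(axis=1)
    is_minority = y == minority_label
    return is_minority & (n_majority >= k / 2) & (n_majority < k)

# Toy data: one feature, 0 = minority (Charged Off), 1 = majority (Fully Paid)
X = np.array([0.0, 0.1, 0.25, 0.5, 0.58, 0.7, 0.8, 0.9, 1.0, 0.97]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

mask = danger_mask(X, y, minority_label=0, k=3)
print(X[mask].ravel())
```

Only the minority point at 0.5 is flagged. The minority points on the left are surrounded by their own class, and the one at 0.97 has only majority neighbours, so this rule treats it as noise rather than as a borderline case.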

In [None]:
bsmote = BorderlineSMOTE(random_state = 42) 

X_bsmote, y_bsmote = bsmote.fit_resample(X_train,y_train)

In [None]:
y_bsmote.value_counts().plot.bar()

## Adaptive Synthetic Oversampling (ADASYN)
The idea behind ADASYN is to use a weight distribution over the minority class. Essentially, we give higher weight to instances that are more difficult to learn (those with more majority-class neighbours) and lower weight to instances that are easier to learn, and generate more synthetic points around the higher-weighted instances. ADASYN is closely related to SMOTE; the main difference is how many synthetic data points are generated around each minority sample.
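
The weighting step can be sketched like this. It is a simplified illustration of the allocation idea only (how many synthetic points each minority sample receives) and makes no claim to match the `imblearn` implementation in detail. The helper name `adasyn_allocation`, the toy data and the total of 10 synthetic points are assumptions for this example.

```python
import numpy as np

def adasyn_allocation(X, y, minority_label=0, k=5, n_synthetic=100):
    """Return minority indices and the number of synthetic points allotted to each."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # exclude the point itself
    neighbours = np.argsort(dists, axis=1)[:, :k]
    min_idx = np.flatnonzero(y == minority_label)
    # r_i: share of majority-class points among the k neighbours of minority sample i
    r = (y[neighbours[min_idx]] != minority_label).mean(axis=1)
    if r.sum() == 0:                             # no minority sample is "hard"
        return min_idx, np.zeros(len(min_idx), dtype=int)
    weights = r / r.sum()                        # normalise to a distribution
    return min_idx, np.rint(weights * n_synthetic).astype(int)

# Toy data: one feature, 0 = minority (Charged Off), 1 = majority (Fully Paid)
X = np.array([0.0, 0.1, 0.25, 0.5, 0.58, 0.7, 0.8, 0.9, 1.0, 0.97]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

idx, counts = adasyn_allocation(X, y, minority_label=0, k=3, n_synthetic=10)
print(dict(zip(idx.tolist(), counts.tolist())))
```

Minority points surrounded by their own class receive no synthetic neighbours, while the points with many majority neighbours receive the most.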

In [None]:
adasyn = ADASYN(random_state = 42)
X_ada, y_ada = adasyn.fit_resample(X_train,y_train)

In [None]:
y_ada.value_counts().plot.bar()

As we can see, the value counts are almost the same for all three methods, but the synthetic points differ depending on the technique used to generate them. So we'll try all three resampled training sets in our model and see which one performs best.

# Model Building
We'll use Logistic Regression as our baseline model and then try the XGBoost and Random Forest classifiers to improve performance.

In [None]:
def print_score(true, pred, train=True):
    """Print accuracy, classification report and confusion matrix for one data split."""
    split = "Train" if train else "Test"
    clf_report = pd.DataFrame(classification_report(true, pred, output_dict=True))
    print(f"{split} Result:\n================================================")
    print(f"Accuracy Score: {accuracy_score(true, pred) * 100:.2f}%")
    print("_______________________________________________")
    print(f"CLASSIFICATION REPORT:\n{clf_report}")
    print("_______________________________________________")
    print(f"Confusion Matrix: \n {confusion_matrix(true, pred)}\n")

In [None]:
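# Convert the resampled training sets to float32 arrays.
# Each set is converted from itself; converting from X_train / y_train here
# would overwrite the oversampled data with the original imbalanced data.
X_smote = np.array(X_smote).astype(np.float32)
y_smote = np.array(y_smote).astype(np.float32)
X_bsmote = np.array(X_bsmote).astype(np.float32)
y_bsmote = np.array(y_bsmote).astype(np.float32)
X_ada = np.array(X_ada).astype(np.float32)
y_ada = np.array(y_ada).astype(np.float32)
X_test = np.array(X_test).astype(np.float32)
y_test = np.array(y_test).astype(np.float32)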

## Logistic Regression

**Trying with Simple SMOTE**

In [None]:
base = LogisticRegression(max_iter=500)
base.fit(X_smote, y_smote)

y_smote_pred = base.predict(X_smote)
y_test_pred = base.predict(X_test)

print_score(y_smote, y_smote_pred, train=True)
print_score(y_test, y_test_pred, train=False)

**Trying with Borderline SMOTE**

In [None]:
base_b = LogisticRegression(max_iter=500)
base_b.fit(X_bsmote, y_bsmote)

y_bsmote_pred = base_b.predict(X_bsmote)
y_test_pred = base_b.predict(X_test)

print_score(y_bsmote, y_bsmote_pred, train=True)
print_score(y_test, y_test_pred, train=False)

**Trying with ADASYN**

In [None]:
base_ada = LogisticRegression(max_iter=500)
base_ada.fit(X_ada, y_ada)

y_ada_pred = base_ada.predict(X_ada)
y_test_pred = base_ada.predict(X_test)

print_score(y_ada, y_ada_pred, train=True)
print_score(y_test, y_test_pred, train=False)

With all three techniques, the baseline model gives similar results on both the train and test sets, so we'll just go with simple SMOTE.

In [None]:
disp = ConfusionMatrixDisplay.from_estimator(
    base, X_test, y_test, 
    cmap='Blues', values_format='d', 
    display_labels=['Default', 'Fully-Paid']
)

disp = RocCurveDisplay.from_estimator(base, X_test, y_test)

In [None]:
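# ROC-AUC is a ranking metric, so score it with the predicted probability
# of the positive class rather than with hard 0/1 predictions
scores_dict = {
    'Logistic Regression': {
        'Train': roc_auc_score(y_smote, base.predict_proba(X_smote)[:, 1]),
        'Test': roc_auc_score(y_test, base.predict_proba(X_test)[:, 1]),
    },
}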

In [None]:
scores_dict
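
A note on the metric: ROC-AUC measures how well a model ranks positives above negatives, so it is normally computed from scores or probabilities, and thresholded 0/1 labels can give a different value. The sketch below uses the pairwise definition of AUC (the fraction of positive/negative pairs ranked correctly, with ties counted as half) on made-up numbers. `pairwise_auc` is an illustrative helper, not a scikit-learn function.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs where the positive scores higher."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

# Made-up labels and predicted probabilities for six samples
y_true = np.array([0, 0, 0, 1, 1, 1])
proba = np.array([0.1, 0.2, 0.45, 0.3, 0.48, 0.9])
labels = (proba >= 0.5).astype(int)  # hard predictions at the default 0.5 threshold

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, labels)
print(f"AUC from probabilities: {auc_from_proba:.3f}")
print(f"AUC from hard labels:   {auc_from_labels:.3f}")
```

Here the probabilities rank 8 of the 9 pairs correctly, while the hard labels collapse most samples into ties and yield a lower value.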

## Random Forest Classifier

In [None]:
# Fix the seed so the forest is reproducible, as with the samplers above
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_smote, y_smote)

y_smote_pred_rfc = rfc.predict(X_smote)
y_test_pred_rfc = rfc.predict(X_test)

print_score(y_smote, y_smote_pred_rfc, train = True)
print_score(y_test, y_test_pred_rfc, train = False)

In [None]:
disp = ConfusionMatrixDisplay.from_estimator(
    rfc, X_test, y_test, 
    cmap='Blues', values_format='d', 
    display_labels=['Default', 'Fully-Paid']
)

disp = RocCurveDisplay.from_estimator(rfc, X_test, y_test)

In [None]:
# Score ROC-AUC with the predicted probability of the positive class
scores_dict['Random Forest Classifier'] = {
    'Train': roc_auc_score(y_smote, rfc.predict_proba(X_smote)[:, 1]),
    'Test': roc_auc_score(y_test, rfc.predict_proba(X_test)[:, 1]),
}

In [None]:
scores_dict

## XGBoost Classifier

In [None]:
xgb = XGBClassifier(use_label_encoder=False)
xgb.fit(X_smote, y_smote)

y_smote_pred_xgb = xgb.predict(X_smote)
y_test_pred_xgb = xgb.predict(X_test)

print_score(y_smote, y_smote_pred_xgb, train = True)
print_score(y_test, y_test_pred_xgb, train = False)

In [None]:
disp = ConfusionMatrixDisplay.from_estimator(
    xgb, X_test, y_test, 
    cmap='Blues', values_format='d', 
    display_labels=['Default', 'Fully-Paid']
)

disp = RocCurveDisplay.from_estimator(xgb, X_test, y_test)

In [None]:
# Score ROC-AUC with the predicted probability of the positive class
scores_dict['XGBoost Classifier'] = {
    'Train': roc_auc_score(y_smote, xgb.predict_proba(X_smote)[:, 1]),
    'Test': roc_auc_score(y_test, xgb.predict_proba(X_test)[:, 1]),
}

In [None]:
scores_dict

# Comparing Model Performances

In [None]:
ml_models = {
    'Random Forest': rfc, 
    'XGBoost': xgb, 
    'Logistic Regression': base
}

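for name, model in ml_models.items():
    # Compute the score outside the f-string; the original expression was split
    # across lines inside the string. Probabilities are used because ROC-AUC
    # is a ranking metric.
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name.upper():30} roc_auc_score: {test_auc:.3f}")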

In [None]:
!pip install -q hvplot

In [None]:
import hvplot.pandas  # registers the .hvplot accessor on pandas DataFrames

scores_df = pd.DataFrame(scores_dict)
scores_df.hvplot.barh(
    width=500, height=400, 
    title="ROC-AUC Scores of ML Models", xlabel="ROC-AUC Score", 
    alpha=0.4, legend='top'
)

Based on the ROC-AUC score, the best-performing model was XGBoost.