## Problem Statement

LendingClub, a peer-to-peer lending platform, provides a dataset of historical loan data, including borrower information, loan characteristics, and repayment statuses. The goal is to analyze this data to gain insights into borrower behavior, identify factors that contribute to loan defaults, and build predictive models to assess the risk associated with future loans.

## Hypothesis

**Borrower Creditworthiness Hypothesis:** Borrowers with lower credit scores are more likely to default on loans.


**Loan Amount Hypothesis:** Larger loan amounts are associated with higher default rates.


**Debt-to-Income Ratio Hypothesis:** Borrowers with higher debt-to-income (DTI) ratios have a higher probability of defaulting.


**Interest Rate Hypothesis:** Higher interest rates correlate with higher default rates, possibly indicating riskier borrowers.
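
Each hypothesis above can be checked by comparing default rates across ranges of the relevant variable. The following is a minimal sketch, not the notebook's own method: it assumes that "Charged Off" counts as a default and "Fully Paid" as a non-default, and that loans with any other status are left out. The helper name `default_rate_by_bucket` and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd

def default_rate_by_bucket(df, feature, n_buckets=4):
    """Share of charged-off loans within quantile buckets of `feature`."""
    # Assumption: only these two statuses represent a final outcome
    closed = df[df["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()
    closed["is_default"] = (closed["loan_status"] == "Charged Off").astype(int)
    # duplicates="drop" avoids errors when many rows share the same value
    closed["bucket"] = pd.qcut(closed[feature], q=n_buckets, duplicates="drop")
    return closed.groupby("bucket", observed=True)["is_default"].mean()

# Tiny synthetic frame for illustration only (not real LendingClub data)
toy = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid", "Charged Off",
                    "Fully Paid", "Fully Paid", "Current", "Charged Off"],
    "dti": [5.0, 30.0, 10.0, 25.0, 8.0, 12.0, 15.0, 28.0],
})
rates = default_rate_by_bucket(toy, "dti", n_buckets=2)
print(rates)
```

A default rate that rises from the lowest bucket to the highest would be consistent with the corresponding hypothesis, although it would not by itself establish a causal link.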


## Variable Description

The LendingClub loan dataset contains numerous variables that provide information about loans and borrowers. Here are some of the key variables:

**loan_amnt:** The total amount of money borrowed.


**term:** The duration of the loan (e.g., 36 months or 60 months).


**int_rate:** The interest rate of the loan.


**installment:** The monthly payment amount for the loan.


**grade:** Loan grade assigned by LendingClub based on the borrower’s credit profile.


**sub_grade:** More granular loan grade.


**emp_length:** The number of years the borrower has been employed.


**home_ownership:** The ownership status of the borrower's residence (e.g., RENT, OWN, MORTGAGE).


**annual_inc:** The self-reported annual income of the borrower.


**verification_status:** The status of the income verification (e.g., Verified, Source Verified, Not Verified).


**issue_d:** The month the loan was issued.


**loan_status:** The current status of the loan (e.g., Fully Paid, Charged Off, Current).


**purpose:** The stated reason for the loan (e.g., debt consolidation, home improvement).

**addr_state:** The state provided by the borrower in the loan application.

**dti:** Debt-to-income ratio of the borrower.


**delinq_2yrs:** The number of delinquency incidents in the past two years.

**fico_range_low and fico_range_high:** The lower and upper range of the borrower’s FICO score.


**revol_util:** Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.


**total_pymnt:** The total amount paid by the borrower.

**application_type:** Indicates whether the loan is an individual or joint application.


**pub_rec:** Number of derogatory public records.

## Data Cleaning

Let's prepare the data for analysis.

In [1]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('loan.csv')


In [2]:
# Preview the first 5 rows
df.head()

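| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | hardship_payoff_balance_amount | hardship_last_payment_amount | disbursement_method | debt_settlement_flag | debt_settlement_flag_date | settlement_status | settlement_date | settlement_amount | settlement_percentage | settlement_term |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | 2500 | 2500 | 2500.0 | 36 months | 13.56 | 84.92 | C | C1 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | 30000 | 30000 | 30000.0 | 60 months | 18.94 | 777.23 | D | D2 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | 5000 | 5000 | 5000.0 | 36 months | 17.97 | 180.69 | D | D1 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | 4000 | 4000 | 4000.0 | 36 months | 18.94 | 146.51 | D | D2 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | 30000 | 30000 | 30000.0 | 60 months | 16.14 | 731.78 | C | C4 | ... | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |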


In [3]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0].sort_values(ascending=False)
print("Missing Values:\n", missing_values)

Missing Values:
 id                                            2260668
url                                           2260668
member_id                                     2260668
orig_projected_additional_accrued_interest    2252242
hardship_length                               2250055
                                               ...   
delinq_amnt                                        29
acc_now_delinq                                     29
pub_rec                                            29
annual_inc                                          4
zip_code                                            1
Length: 113, dtype: int64


In [4]:
# Drop columns with a high percentage of missing values (here, more than 50%)
# Columns this sparse are unlikely to be useful for prediction
df = df.drop(columns=missing_values[missing_values > len(df) * 0.5].index)
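
The same threshold rule can also be written in terms of the fraction of missing values per column, which avoids depending on the `missing_values` counts computed in an earlier cell. This is a small sketch on a synthetic frame; the helper name `columns_above_missing_threshold` and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def columns_above_missing_threshold(df, threshold=0.5):
    """Return names of columns whose missing fraction exceeds `threshold`."""
    # The mean of a boolean mask is the fraction of True values (missing cells)
    missing_fraction = df.isnull().mean()
    return missing_fraction[missing_fraction > threshold].index.tolist()

# Synthetic frame for illustration only
toy = pd.DataFrame({
    "loan_amnt": [2500, 30000, 5000, 4000],
    "member_id": [np.nan, np.nan, np.nan, np.nan],   # entirely missing
    "annual_inc": [55000.0, np.nan, 59280.0, 92000.0],  # 25% missing
})
sparse_cols = columns_above_missing_threshold(toy)
toy_reduced = toy.drop(columns=sparse_cols)
print(sparse_cols)
```

Here only `member_id` crosses the 50% threshold, so `annual_inc` is kept and left for imputation.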

In [None]:
# Impute missing values for numerical columns with the median
numerical_cols = df.select_dtypes(include=[np.number]).columns
df[numerical_cols] = df[numerical_cols].apply(lambda x: x.fillna(x.median()))

In [None]:
# Impute missing values for categorical columns with the mode
categorical_cols = df.select_dtypes(include=['object']).columns
df[categorical_cols] = df[categorical_cols].apply(lambda x: x.fillna(x.mode()[0]))
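
To make the effect of median and mode imputation concrete, the sketch below applies both to a small synthetic frame. It relies on `fillna` accepting a Series of per-column fill values, so no `lambda` is needed; the toy values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic frame for illustration only
toy = pd.DataFrame({
    "annual_inc": [40000.0, np.nan, 60000.0, 100000.0],
    "home_ownership": ["RENT", "RENT", None, "OWN"],
})

num_cols = toy.select_dtypes(include=[np.number]).columns
cat_cols = toy.select_dtypes(exclude=[np.number]).columns

# Median of each numeric column, aligned to the columns by name
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())
# mode() returns a DataFrame; its first row holds the most frequent value per column
toy[cat_cols] = toy[cat_cols].fillna(toy[cat_cols].mode().iloc[0])

print(toy)
```

The missing income becomes 60000.0 (the median of the three observed values) and the missing ownership status becomes "RENT" (the most frequent value).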

In [None]:
# Convert relevant columns to the appropriate data types
df['term'] = df['term'].str.extract(r'(\d+)', expand=False).astype(int)  # e.g. '36 months' -> 36
# int_rate and revol_util are percentages; strip a trailing '%' only if they were read as strings
for col in ['int_rate', 'revol_util']:
    if df[col].dtype == object:
        df[col] = df[col].str.rstrip('%').astype(float)
    df[col] = df[col] / 100.0  # Express as a fraction, e.g. 13.56 -> 0.1356
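
The `emp_length` column could be converted to a number in a similar way. The sketch below assumes the column is stored as strings such as `'10+ years'`, `'3 years'`, and `'< 1 year'`, which is not shown in the output above, and it treats `'< 1 year'` as 0 years, which is a modelling choice rather than a fact about the data. The helper name `emp_length_to_years` is hypothetical.

```python
import pandas as pd

def emp_length_to_years(series):
    """Map strings like '10+ years', '3 years', '< 1 year' to numeric years."""
    text = series.astype(str)
    # Pull out the first run of digits; unmatched entries become NaN
    years = pd.to_numeric(text.str.extract(r"(\d+)", expand=False), errors="coerce")
    # Assumption: '< 1 year' is counted as 0 years of employment
    years[text.str.contains("<", regex=False)] = 0.0
    return years

# Synthetic values for illustration only
toy = pd.Series(["10+ years", "3 years", "< 1 year", "1 year"])
years = emp_length_to_years(toy)
print(years.tolist())
```

Note that `'10+ years'` is capped at 10 under this mapping, so the resulting variable is best read as an ordinal scale rather than an exact tenure.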

In [None]:
# Remove duplicates
df = df.drop_duplicates()

In [None]:
# Confirm that all missing values are handled
print("Remaining Missing Values:\n", df.isnull().sum().sum())  # Should be 0

## Univariate Analysis

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Descriptive statistics for numerical variables
print("Descriptive Statistics:\n", df.describe())

In [None]:
# Univariate analysis for a numerical variable (loan_amnt)
plt.figure(figsize=(10, 5))
sns.histplot(df['loan_amnt'], bins=30, kde=True, color='blue')
plt.title('Distribution of Loan Amounts')
plt.xlabel('Loan Amount')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Univariate analysis for a categorical variable (loan_status)
plt.figure(figsize=(10, 5))
sns.countplot(data=df, x='loan_status', order=df['loan_status'].value_counts().index)
plt.title('Loan Status Distribution')
plt.xlabel('Loan Status')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Univariate analysis for interest rate
plt.figure(figsize=(10, 5))
sns.histplot(df['int_rate'], bins=30, kde=True, color='green')
plt.title('Distribution of Interest Rates')
plt.xlabel('Interest Rate')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Univariate analysis for employment length
plt.figure(figsize=(10, 5))
sns.countplot(data=df, x='emp_length', order=df['emp_length'].value_counts().index)
plt.title('Employment Length Distribution')
plt.xlabel('Employment Length')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

## Bivariate Analysis

In [None]:
# Bivariate analysis between loan amount and interest rate
plt.figure(figsize=(10, 5))
sns.scatterplot(data=df, x='loan_amnt', y='int_rate', alpha=0.5)
plt.title('Loan Amount vs. Interest Rate')
plt.xlabel('Loan Amount')
plt.ylabel('Interest Rate')
plt.show()

In [None]:
# Bivariate analysis between annual income and loan amount
plt.figure(figsize=(10, 5))
sns.scatterplot(data=df, x='annual_inc', y='loan_amnt', alpha=0.5)
plt.title('Annual Income vs. Loan Amount')
plt.xlabel('Annual Income')
plt.ylabel('Loan Amount')
plt.xlim(0, 200000)  # Limit for visualization purposes
plt.show()

In [None]:
# Bivariate analysis between loan status and interest rate
plt.figure(figsize=(10, 5))
sns.boxplot(data=df, x='loan_status', y='int_rate', order=df['loan_status'].value_counts().index)
plt.title('Loan Status vs. Interest Rate')
plt.xlabel('Loan Status')
plt.ylabel('Interest Rate')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Bivariate analysis between debt-to-income ratio and loan status
plt.figure(figsize=(10, 5))
sns.boxplot(data=df, x='loan_status', y='dti', order=df['loan_status'].value_counts().index)
plt.title('Loan Status vs. Debt-to-Income Ratio')
plt.xlabel('Loan Status')
plt.ylabel('Debt-to-Income Ratio')
plt.xticks(rotation=45)
plt.show()
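
The box plots by loan status can be complemented with a numeric summary, which is easier to compare across many status categories. The following is a sketch on synthetic data; the helper name `median_by_status` and the toy values are assumptions made for illustration.

```python
import pandas as pd

def median_by_status(df, columns):
    """Median of each column in `columns`, one row per loan_status."""
    return df.groupby("loan_status")[columns].median()

# Synthetic frame for illustration only (rates expressed as fractions)
toy = pd.DataFrame({
    "loan_status": ["Fully Paid", "Fully Paid", "Charged Off", "Charged Off"],
    "int_rate": [0.08, 0.10, 0.18, 0.20],
    "dti": [10.0, 14.0, 22.0, 30.0],
})
medians = median_by_status(toy, ["int_rate", "dti"])
print(medians)
```

On the real data, the same call with `df` in place of `toy` would show whether charged-off loans have higher median interest rates and DTI ratios than fully paid ones.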

In [None]:
# Correlation matrix for numerical variables (non-numeric columns are excluded before calling corr)
plt.figure(figsize=(12, 10))
sns.heatmap(df.select_dtypes(include=[np.number]).corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Numerical Variables')
plt.show()
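
With many numerical columns, an annotated heatmap of the full matrix can be hard to read. One option is to restrict the correlation matrix to the variables named in the hypotheses. The sketch below assumes that approach; the helper name `correlation_subset` and the toy values are hypothetical, and columns that are absent from the frame are skipped.

```python
import pandas as pd

def correlation_subset(df, columns):
    """Pearson correlation matrix restricted to the requested columns that exist."""
    present = [c for c in columns if c in df.columns]
    return df[present].corr()

# Synthetic frame for illustration only
toy = pd.DataFrame({
    "loan_amnt": [1000, 2000, 3000, 4000],
    "installment": [30.0, 60.0, 90.0, 120.0],  # exactly proportional to loan_amnt
    "dti": [10.0, 5.0, 20.0, 1.0],
})
corr = correlation_subset(toy, ["loan_amnt", "installment", "dti", "fico_range_low"])
print(corr.round(2))
```

The resulting matrix can be passed to `sns.heatmap` in place of the full one.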