Step One: Load and Preprocess the Dataset



In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('credit_risk_dataset.csv')

In [None]:
# Identify numerical columns
numerical_cols = ['person_age', 'person_income', 'person_emp_length', 'cb_person_cred_hist_length']

# Identify categorical columns
categorical_cols = ['person_home_ownership', 'loan_intent', 'loan_grade', 'cb_person_default_on_file']

# Separate numerical and categorical data
numerical_data = df[numerical_cols]
categorical_data = df[categorical_cols]

In [None]:
# Fill missing values in numerical columns with the median
numerical_data = numerical_data.fillna(numerical_data.median())

# Fill missing values in categorical columns with the mode (most frequent value)
categorical_data = categorical_data.apply(lambda x: x.fillna(x.mode()[0]))
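
As a quick sanity check, the imputation logic above can be tried on a tiny made-up frame. This is only a sketch: the values below are invented for illustration and are not taken from credit_risk_dataset.csv, and the names `toy`, `toy_num` and `toy_cat` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing entry per column
toy = pd.DataFrame({
    'person_age': [22.0, 35.0, np.nan, 41.0],
    'loan_grade': ['A', 'B', 'A', None],
})

# Same strategy as above: median for numeric, mode for categorical
toy_num = toy[['person_age']].fillna(toy[['person_age']].median())
toy_cat = toy[['loan_grade']].apply(lambda x: x.fillna(x.mode()[0]))

print(toy_num['person_age'].tolist())
print(toy_cat['loan_grade'].tolist())
```

The missing age is replaced by the median of the observed ages, and the missing grade by the most frequent grade.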

In [None]:
# Combine numerical and categorical data back into the DataFrame
df[numerical_cols] = numerical_data
df[categorical_cols] = categorical_data

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize the numerical columns
scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
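
To make the standardization step concrete, the sketch below compares StandardScaler with a hand-computed z-score on a few invented income values. The array `income` is hypothetical and only serves as an illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical income values, for illustration only
income = np.array([[30000.0], [45000.0], [60000.0], [120000.0]])

scaled = StandardScaler().fit_transform(income)

# Manual z-score; np.std defaults to ddof=0 (population standard deviation),
# which is what StandardScaler uses as well
manual = (income - income.mean(axis=0)) / income.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, each column has mean 0 and unit variance, so no single feature dominates the Euclidean distances that K-Means relies on.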

Step Two: Select Relevant Numerical Columns for Clustering

In [None]:
# Use the standardized numerical columns as the clustering features
X = df[numerical_cols]

Step Three: Perform the K-Means Clustering

In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Calculate the within-cluster sum of squares (WCSS) for k = 1 to 10
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plot the Elbow graph
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
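
The elbow can be hard to read by eye, so a second criterion is sometimes useful. The sketch below is not part of the original analysis: it computes the silhouette score for several values of k on synthetic data. `X_demo` is a made-up stand-in for `X`, and `scores` and `best_k` are hypothetical names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X: three well-separated blobs
X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=0.5, random_state=42)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# The k with the highest silhouette score is the candidate to compare with the elbow
best_k = max(scores, key=scores.get)
print(best_k)
```

On a large dataset the pairwise distances behind the silhouette score can be expensive; `silhouette_score` accepts a `sample_size` argument to evaluate on a random subset instead.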

In [None]:
# Fit the final model with 3 clusters and label each row
optimal_clusters = 3
kmeans = KMeans(n_clusters=optimal_clusters, random_state=42)
df['cluster'] = kmeans.fit_predict(X)

Step Four: Interpret the Clusters

In [None]:
# Select relevant numerical columns along with the 'cluster' column
numerical_cols_with_cluster = numerical_cols + ['cluster']

# Group by 'cluster' and calculate the mean for numerical columns only.
# These columns were standardized above, so the means are z-scores, not raw units.
clusters = df[numerical_cols_with_cluster].groupby('cluster').mean()
print(clusters)
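
Because the clustering features were standardized, cluster means are easier to interpret once mapped back to the original units. The sketch below shows one way to do that with `inverse_transform` on made-up data. The frame `raw` and the names `demo_scaler`, `demo_kmeans` and `centers` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with two obvious groups
raw = pd.DataFrame({
    'person_age': [22, 24, 23, 50, 52, 51],
    'person_income': [30000, 32000, 31000, 90000, 95000, 92000],
})

demo_scaler = StandardScaler()
scaled = demo_scaler.fit_transform(raw)

demo_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(scaled)

# Map the centroids from z-scores back to the original units
centers = pd.DataFrame(
    demo_scaler.inverse_transform(demo_kmeans.cluster_centers_),
    columns=raw.columns,
).sort_values('person_age').reset_index(drop=True)
print(centers)
```

In this notebook the same idea would presumably be `scaler.inverse_transform(kmeans.cluster_centers_)`, since `scaler` and `kmeans` were both fit on the columns in `numerical_cols`.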

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Loan Amount by Cluster
sns.boxplot(x='cluster', y='loan_amnt', data=df)
plt.title('Loan Amount by Cluster')
plt.show()

# Income by Cluster (standardized units, since person_income was scaled above)
sns.boxplot(x='cluster', y='person_income', data=df)
plt.title('Income by Cluster')
plt.show()

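# Loan Status by Cluster: a bar plot of the mean is easier to read than a
# box plot here, assuming loan_status is a 0/1 flag
sns.barplot(x='cluster', y='loan_status', data=df)
plt.title('Mean Loan Status by Cluster')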
plt.show()

# Employment Length by Cluster (standardized units)
sns.boxplot(x='cluster', y='person_emp_length', data=df)
plt.title('Employment Length by Cluster')
plt.show()
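
Alongside the plots, a small table of cluster sizes and default rates can back up the interpretation with numbers. The sketch below uses invented labels and assumes `loan_status` is coded 0 for non-default and 1 for default. The names `demo` and `summary` are hypothetical.

```python
import pandas as pd

# Hypothetical labels; assumes loan_status is coded 0 = non-default, 1 = default
demo = pd.DataFrame({
    'cluster':     [0, 0, 0, 1, 1, 2, 2, 2, 2],
    'loan_status': [0, 0, 1, 1, 1, 0, 0, 0, 1],
})

# Number of loans and share of defaults in each cluster
summary = demo.groupby('cluster')['loan_status'].agg(n_loans='count', default_rate='mean')
print(summary)
```

Applied to `df`, the same `groupby` would give the default rate for each of the three clusters.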

Interpretation of Clusters

**Cluster 0:**

Income: High income levels.

Employment Length: Longest job tenures.

Loan Amount: Higher loan amounts.

Loan Status: Higher likelihood of non-defaults.

Summary: Financially stable borrowers with high incomes, long job tenures, and higher loan amounts, who are more likely to repay.

**Cluster 1:**

Income: Moderate to high income levels.

Employment Length: Varied employment histories.

Loan Amount: Moderate loan amounts.

Loan Status: Mixed default and non-default statuses.

Summary: Moderate- to high-income borrowers with varied job stability and moderate loan amounts, leading to mixed loan outcomes.

**Cluster 2:**

Income: Lower income levels.

Employment Length: Shorter job tenures.

Loan Amount: Moderate loan amounts.

Loan Status: Balanced default and non-default statuses.

Summary: Younger, lower-income borrowers with shorter employment and credit histories, moderate loan amounts, and balanced loan outcomes.

**Microsoft Copilot was used to help generate this code**