## **Objective**

This assignment focuses on two key machine learning techniques: clustering analysis and classification using logistic regression. You will analyze the provided dataset (credit_risk_dataset.csv) and draw meaningful insights from your models.

**Data Explanation**

person_age: Age

person_income: Annual Income

person_home_ownership: Home ownership

person_emp_length: Employment length (in years)

loan_intent: Loan intent

loan_grade: Loan grade

loan_amnt: Loan amount

loan_int_rate: Interest rate

loan_status: Loan status (0 is non-default, 1 is default)

loan_percent_income: Percent income

cb_person_default_on_file: Historical default

cb_person_cred_hist_length: Credit history length


#Group 1: Noah Severin, Sufyan Haroon, Jay Capozzoli


# **Problem 1: Clustering Analysis**
**Task**

Perform a clustering analysis on the dataset to identify groups of loan applicants based on numerical variables.

**Steps**

Load the dataset and preprocess it: Handle missing values appropriately.

Standardize numerical variables.

Select relevant numerical columns for clustering (for example, person_age, person_income, person_emp_length, cb_person_cred_hist_length, etc.).

Use the K-Means algorithm to perform clustering.

Determine the optimal number of clusters using the Elbow Method.

Fit the K-Means model and assign cluster labels.

**Interpret the clusters:**


*   What patterns do you observe in the clusters?


*   How do different clusters compare in terms of loan characteristics (e.g., loan amount, income, loan status)?

# **Problem 1 Solution: Clustering Analysis**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from IPython.display import display  # Import display for dataframe visualization

# File path
file_path = "/content/credit_risk_dataset.csv"
df = pd.read_csv(file_path)

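# Handle missing values by filling with the median (robust to skewed numerical values).
# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated chained assignment in recent pandas versions.
df['person_emp_length'] = df['person_emp_length'].fillna(df['person_emp_length'].median())
df['loan_int_rate'] = df['loan_int_rate'].fillna(df['loan_int_rate'].median())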

# Select relevant numerical columns for clustering
numerical_cols = [
    'person_age', 'person_income', 'person_emp_length',
    'loan_amnt', 'loan_int_rate', 'loan_percent_income',
    'cb_person_cred_hist_length'
]
df_cluster = df[numerical_cols]

# Standardize the numerical variables
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_cluster)

# Determine the optimal number of clusters using the Elbow Method
distortions = []
K = range(1, 11)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(df_scaled)
    distortions.append(sum(np.min(cdist(df_scaled, kmeans.cluster_centers_, 'euclidean'), axis=1)) / df_scaled.shape[0])

# Plot the Elbow Method chart
plt.figure(figsize=(8, 5))
plt.plot(K, distortions, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Average Distortion')
plt.title('Elbow Method for Optimal Clusters')
plt.show()

# Based on the elbow method, let's choose an optimal k (e.g., 4) for clustering
optimal_k = 4

# Fit the K-Means model with the chosen number of clusters
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
df['cluster'] = kmeans.fit_predict(df_scaled)

# Analyze the clusters by summarizing key characteristics
cluster_summary = df.groupby('cluster')[numerical_cols].mean()

# Display cluster analysis summary using display()
print("Cluster Analysis Summary:")
display(cluster_summary)  # Use display() to show the dataframe
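
Reading the elbow off a plot is a judgment call, so a second criterion can help confirm the choice of k. The sketch below uses the silhouette score, which is not part of the original solution, on synthetic blobs rather than the credit data. The names `X_demo`, `X_demo_scaled`, and `silhouette_by_k` are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the standardized loan features: three well-separated blobs.
centers = [[0, 0], [10, 10], [-10, 10]]
X_demo, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# The silhouette score is undefined for a single cluster, so start at k = 2.
silhouette_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo_scaled)
    silhouette_by_k[k] = silhouette_score(X_demo_scaled, labels)

# Higher is better: points sit close to their own cluster and far from the others.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
for k, score in silhouette_by_k.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Best k by silhouette:", best_k)
```

On the real data, the same loop could be run on `df_scaled`; if the silhouette peak and the elbow disagree, that is worth mentioning in the interpretation.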

# ***Characteristic Comparison:***
The clusters reveal distinct borrower profiles based on age, income, credit history, and loan characteristics (all figures are cluster means). Cluster 0 contains the youngest borrowers (25.6 years) with the lowest income (5498) and the shortest credit history (4.46 years). They receive the smallest loans (6,288) at the highest interest rate (12.94%), which is consistent with a higher-risk profile.

Cluster 1 includes slightly older borrowers (26.3 years) with higher incomes (68,437) who take on the largest loans (17,309) and carry the highest loan-to-income ratio (0.29). Their credit history is slightly longer (4.91 years), and their interest rate (11.98%) is lower than that of Cluster 0.

Cluster 2 is similar to Cluster 1 in age and income but borrows conservatively (7,211), with the lowest loan-to-income ratio (0.12). These borrowers receive the lowest interest rate (8.05%), which marks them as the safest group.

Cluster 3 is the most financially stable group, with the oldest borrowers (40.9 years), the highest income (86,989), the longest employment (6.35 years), and the longest credit history (14.27 years). They take on moderate loan amounts (9,784) at an interest rate of 11.18%.

# ***Key patterns:***

*Loan amount:* Cluster 1 borrows the most, Cluster 0 the least.

*Interest rates:* Cluster 2 enjoys the lowest, Cluster 0 the highest.

*Loan-to-income ratio:* Cluster 1 is the most leveraged, Cluster 2 the most conservative.

*Credit history:* Cluster 3 is the strongest, Cluster 0 the weakest.

Overall, Cluster 2 looks the safest, Cluster 1 the most leveraged and therefore the riskiest, and Cluster 3 the most established.
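
The risk ranking above is inferred from interest rates and leverage. Since `loan_status` was not used for clustering, the observed default rate per cluster gives a direct check. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration). On the real data, the same `groupby` would be applied to `df` after the `cluster` column is assigned.

```python
import pandas as pd

# Hypothetical toy data standing in for df with its 'cluster' column (values are invented).
toy = pd.DataFrame({
    "cluster":     [0, 0, 0, 1, 1, 2, 2, 2, 2, 3],
    "loan_status": [1, 0, 1, 1, 0, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of defaults in each cluster.
cluster_risk = (
    toy.groupby("cluster")["loan_status"]
       .agg(n_borrowers="size", default_rate="mean")
       .sort_values("default_rate", ascending=False)
)
print(cluster_risk)
```

Reporting the cluster size next to the rate matters, because a high default rate in a very small cluster is weaker evidence than the same rate in a large one.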

# **Problem 2: Classification Using Logistic Regression**
**Task**

Choose one meaningful categorical variable from the dataset and build a logistic regression model to classify loan status (default or non-default).

**Steps**

1. Select a categorical variable (e.g., cb_person_default_on_file, loan_grade, or loan_intent).

2. **Preprocess data:** Convert categorical variables into numerical form (e.g., one-hot encoding or label encoding).

3. Handle missing values if applicable.

4. Split the data into training and testing sets.

5. Train a logistic regression model to predict loan_status (0 or 1).

6. **Evaluate model performance using:**

    *   Confusion matrix
    *   Accuracy, Precision, Recall, and F1-score

7. **Interpret results:**

    *   What insights can you gain from the logistic regression model?
    *   What features are most important for predicting loan defaults?
    *   How reliable is the model for making predictions?


# **Problem 2 Solution: Classification Using Logistic Regression**


In [None]:
#Import
import kagglehub
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    roc_auc_score,
    roc_curve,
    accuracy_score
)
import seaborn as sns

df = pd.read_csv('credit_risk_dataset.csv')
df.head(5)

In [None]:
# Check the data columns
df.columns

In [None]:
# Check data info
df.info()

In [None]:
# Strip whitespace from column names
df.columns = df.columns.str.strip()

# Identify categorical columns
categorical_cols = ['person_home_ownership', 'loan_intent','loan_grade', 'cb_person_default_on_file']

# present all unique values per categorical columns
for col in categorical_cols:
    print(f'for column {col}, the unique values are')
    print(df[col].unique())

# Apply one-hot encoding to categorical columns
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

In [None]:
df_encoded.columns

In [None]:
# Categorical variable chosen: defaults on file ('cb_person_default_on_file')
# Drop columns for selected target and categorical variables not chosen
selected_features = [col for col in df_encoded.columns
                     if col not in ["loan_status",
                                     "person_home_ownership_OTHER",
                                     "person_home_ownership_OWN",
                                     "person_home_ownership_RENT",
                                     "loan_intent_EDUCATION",
                                     "loan_intent_HOMEIMPROVEMENT",
                                     "loan_intent_MEDICAL",
                                     "loan_intent_PERSONAL",
                                     "loan_intent_VENTURE",
                                     "loan_grade_B",
                                     "loan_grade_C",
                                     "loan_grade_D",
                                     "loan_grade_E",
                                     "loan_grade_F",
                                     "loan_grade_G",
                                    ]]
target = "loan_status"

# Features and target
X = df_encoded[selected_features]
y = df_encoded[target]

# Compute correlation between features and loan_status (default risk)
correlation_matrix = df_encoded.corr()

# Extract correlations related to loan_status
loan_correlation = correlation_matrix["loan_status"].drop("loan_status").sort_values(ascending=False)
print("\nFeature Correlation with Default Risk:\n", loan_correlation)

# Plot correlation heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(correlation_matrix, cmap="coolwarm", annot=False, linewidths=0.5)
plt.title("Feature Correlation Heatmap")
plt.show()


# Split data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21, stratify=y)

# Initialize logistic regression model
log_reg = LogisticRegression()

# Drop rows with any missing values in the training and testing sets
X_train = X_train.dropna()
y_train = y_train[X_train.index]  # Make sure y_train aligns with X_train after dropping rows
X_test = X_test.dropna()
y_test = y_test[X_test.index]  # Make sure y_test aligns with X_test after dropping rows

# Train the model
log_reg.fit(X_train, y_train)

# Generate predictions and probabilities
y_pred = log_reg.predict(X_test)
y_pred_prob = log_reg.predict_proba(X_test)[:, 1]

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
class_labels = ["Non-default (0)", "Default (1)"]  # loan_status classes
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_labels, yticklabels=class_labels)
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# Classification Report for precision, recall, F1-score
print("Classification Report:\n", classification_report(y_test, y_pred))

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)



# ***Analysis***

The logistic regression model provides useful insight into predicting loan defaults, but it also highlights areas for improvement. With an accuracy of 82.4%, the model performs well overall, particularly in identifying non-defaulters. However, the confusion matrix and classification report reveal a significant discrepancy in predictive performance between the two classes. The model achieves 96% recall for non-defaulters, meaning that it correctly identifies most borrowers who repay their loans. In comparison, it captures only 35% of actual defaulters, so a substantial share of defaulters are incorrectly classified as non-defaulters. This misclassification is a major concern in loan risk assessment, because failing to identify high-risk borrowers can lead to financial losses.
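
One way to explore the recall gap described above is to lower the decision threshold applied to the predicted default probability, instead of using the default cutoff of 0.5. The sketch below is an illustration on synthetic imbalanced data from `make_classification`, not on the credit dataset. The names `X_demo`, `y_demo`, and `results` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data (roughly 80% class 0), standing in for the loan data.
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=6, n_informative=3, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=21, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob_default = model.predict_proba(X_te)[:, 1]  # probability of class 1

# Lowering the threshold flags more cases as defaults: recall rises, precision usually falls.
results = {}
for threshold in (0.5, 0.4, 0.3, 0.2):
    y_hat = (prob_default >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, y_hat, zero_division=0),
        recall_score(y_te, y_hat),
    )
    print(f"threshold={threshold:.1f}  precision={results[threshold][0]:.3f}  "
          f"recall={results[threshold][1]:.3f}")
```

Which threshold is appropriate depends on the relative cost of a missed default versus a rejected good borrower, which the assignment data does not specify.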

In terms of feature importance, the key factors associated with loan default are the loan-to-income ratio, the loan interest rate, and the loan grade. Borrowers with a higher loan-to-income ratio (0.379 correlation with default) are more likely to default, as heavier debt burdens make repayment difficult. Similarly, higher interest rates (0.335 correlation) are associated with greater default risk, since they raise monthly payments. Loan grade also plays a significant role: lower-grade loans show strong positive correlations with default, indicating that riskier loans are more prone to failure. Conversely, higher income (-0.144 correlation), homeownership (-0.102), and longer employment history (-0.082) are associated with lower default risk, as they indicate stronger financial security. These figures are pairwise correlations computed on the full encoded dataset; the loan grade and home ownership dummies were excluded from the fitted model, so they describe the data rather than the model's coefficients. These factors should be prioritized when assessing loan risk.
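
Correlations describe each feature on its own, while the model's coefficients describe each feature's contribution with the others held fixed. A hedged sketch of how coefficients could be compared is shown below: the features are standardized first so that coefficient magnitudes are on a common scale. The data is synthetic, and the column names `ratio_like`, `rate_like`, and `noise` are hypothetical stand-ins, not columns of the credit dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 5000

# Hypothetical features on very different scales.
demo = pd.DataFrame({
    "ratio_like": rng.normal(0.2, 0.1, n),
    "rate_like": rng.normal(11, 3, n),
    "noise": rng.normal(50000, 20000, n),
})

# Simulated outcome: strong effect of ratio_like, weak effect of rate_like, none of noise.
logit = -1.5 + 1.5 * (demo["ratio_like"] - 0.2) / 0.1 + 0.5 * (demo["rate_like"] - 11) / 3
y_demo = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardizing inside the pipeline puts all coefficients on a per-standard-deviation scale.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(demo, y_demo)

coefs = pd.Series(pipe.named_steps["logisticregression"].coef_[0], index=demo.columns)
coefs = coefs.sort_values(key=np.abs, ascending=False)
print(coefs)
```

Applied to the assignment, `demo` and `y_demo` would be replaced by `X_train` and `y_train`. Only features actually passed to the model receive a coefficient.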

Despite its high overall accuracy, the model is not entirely reliable for predicting defaults because of its low recall for defaulters. This suggests a bias toward the non-default class, which is likely due to class imbalance: the majority of borrowers in the dataset are non-defaulters. Overall, while the model effectively identifies non-defaulters, its low recall for defaulters makes it less reliable for financial decision-making, particularly in risk assessment. If used as is, lenders would continue to underestimate default probabilities, leading to potential financial exposure.
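
If class imbalance is indeed the cause, one common mitigation is to reweight the classes during training. The sketch below compares a plain logistic regression with one using scikit-learn's `class_weight="balanced"` option on synthetic imbalanced data. It is an illustration under assumed data, not a result for the credit dataset, and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data (roughly 80% class 0), standing in for the loan data.
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=6, n_informative=3,
    weights=[0.8, 0.2], class_sep=0.8, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=21, stratify=y_demo
)

# Same model twice; the second weights each class inversely to its frequency.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_balanced = recall_score(y_te, balanced.predict(X_te))
print(f"Recall for class 1, plain:    {recall_plain:.3f}")
print(f"Recall for class 1, balanced: {recall_balanced:.3f}")
```

Reweighting typically trades precision and overall accuracy for recall on the minority class, so the full classification report should be compared before and after.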
