HW 2
Group 3
Problem 1

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    roc_auc_score,
    roc_curve,
    accuracy_score
)
import seaborn as sns

In [None]:
df = pd.read_csv('credit_risk_dataset.csv')
df.head()

In [None]:
df.columns

In [None]:
df.info()

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize the numeric features so each one contributes equally to the K-Means distance
numerical_cols = ['person_age', 'person_income','person_emp_length', 'loan_amnt', 'loan_int_rate','cb_person_cred_hist_length']

scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
df.head()

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Drop rows with missing values, since K-Means cannot handle NaN
df = df.dropna()
df.columns = df.columns.str.strip()
columns_for_clustering = ['person_age', 'person_income', 'person_emp_length', 'loan_amnt', 'loan_int_rate', 'cb_person_cred_hist_length']
# Work on a copy so later changes to X do not affect df
X = df[columns_for_clustering].copy()

In [None]:
# Elbow Method
inertia = []
silhouette_scores = []
range_n_clusters = range(2, 11)  # Test for 2 to 10 clusters

for k in range_n_clusters:
    kmeans = KMeans(n_clusters=k, random_state=20).fit(X)
    inertia.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X, kmeans.labels_))

# Plot Inertia and Silhouette Score
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(range_n_clusters, inertia, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')

plt.subplot(1, 2, 2)
plt.plot(range_n_clusters, silhouette_scores, marker='o', color='green')
plt.title('Silhouette Score')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')

plt.tight_layout()
plt.show()

We select 7 clusters as the optimal number based on the elbow method: it is the point where the decrease in inertia slows down significantly.
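One way to make the elbow choice less of a judgment call is to print how much the inertia falls at each step and to check which k has the highest silhouette score. The sketch below is only an illustration on synthetic data from `make_blobs` (an assumption standing in for the scaled feature matrix); the names ending in `_demo` are hypothetical and not part of the analysis above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled features (assumption: 4 true groups, 6 features)
X_demo, _ = make_blobs(n_samples=600, centers=4, n_features=6,
                       cluster_std=1.0, random_state=20)

ks = list(range(2, 11))  # same range as the elbow cell: 2 to 10 clusters
inertia_demo, silhouette_demo = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=20).fit(X_demo)
    inertia_demo.append(km.inertia_)
    silhouette_demo.append(silhouette_score(X_demo, km.labels_))

# Relative drop in inertia when moving from k-1 to k clusters
rel_drop = -np.diff(inertia_demo) / np.array(inertia_demo[:-1])
for k, drop in zip(ks[1:], rel_drop):
    print(f"k={k}: inertia falls by {drop:.1%}")

# The k with the highest silhouette score is a second opinion on the elbow
best_k_silhouette = ks[int(np.argmax(silhouette_demo))]
print("k with the highest silhouette score:", best_k_silhouette)
```

If the relative drop flattens out near the same k where the silhouette score peaks, the two criteria agree. If they disagree, the choice is worth a sentence of justification.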

In [None]:
optimal_clusters = 7

# K-Means Clustering
kmeans = KMeans(n_clusters=optimal_clusters, random_state=20).fit(X)

# Attach the labels to df only; adding them to X would turn the label into a feature if this cell is re-run
df['Cluster'] = kmeans.labels_

# Per-cluster mean of each feature, plus the number of records in each cluster
cluster_analysis = df.groupby('Cluster').agg(
    person_age=('person_age', 'mean'),
    person_income=('person_income', 'mean'),
    person_emp_length=('person_emp_length', 'mean'),
    loan_amnt=('loan_amnt', 'mean'),
    loan_int_rate=('loan_int_rate', 'mean'),
    cb_person_cred_hist_length=('cb_person_cred_hist_length', 'mean'),
    num_comp=('person_age', 'count'),
).reset_index()

print("Cluster Analysis (Means and Record Counts):\n", cluster_analysis)

Question 4: The cluster sizes vary widely, from 2,320 to 7,484 records. The largest cluster has the lowest interest rates and is the second-youngest cluster. Cluster 4 is by far the oldest and has the longest credit history. Cluster 5 has the highest income and the highest loan amounts. Overall, the clusters make sense: the model grouped similar borrowers together.
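The cluster means printed above are computed from the standardized columns, so they are in standard-deviation units rather than years or dollars. A minimal sketch of converting K-Means centers back to original units with `StandardScaler.inverse_transform` follows. It uses a small synthetic table (an assumption, not the credit data), and `raw`, `scaler_demo`, and `centers_original` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three of the numeric columns (assumption, not the real data)
rng = np.random.default_rng(20)
raw = pd.DataFrame({
    "person_age": rng.integers(20, 70, size=300),
    "person_income": rng.normal(60000, 15000, size=300),
    "loan_amnt": rng.normal(10000, 3000, size=300),
})
cols = list(raw.columns)

# Scale, then cluster in the scaled space (same order as the notebook)
scaler_demo = StandardScaler()
scaled = scaler_demo.fit_transform(raw[cols])
km = KMeans(n_clusters=3, n_init=10, random_state=20).fit(scaled)

# Undo the scaling on the cluster centers to read them in original units
centers_original = pd.DataFrame(
    scaler_demo.inverse_transform(km.cluster_centers_), columns=cols
)
centers_original["num_records"] = np.bincount(km.labels_, minlength=3)
print(centers_original.round(1))
```

Reading the centers in original units makes statements such as "the oldest cluster" or "the highest income" easier to check against actual ages and incomes.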

Problem 2 - Logistic Regression

In [None]:
df = pd.read_csv('credit_risk_dataset.csv')
categorical_cols = ['loan_intent']

# Print the unique values of each categorical column
for col in categorical_cols:
    print(f'for column {col}, the unique values are')
    print(df[col].unique())

# Apply one-hot encoding to the categorical columns
selected_columns = ['loan_intent', 'loan_status']
new_df = df[selected_columns]
df_encoded = pd.get_dummies(new_df, columns=categorical_cols, drop_first=True)


In [None]:
df_encoded.columns

In [None]:
grouped_averages = df_encoded.groupby("loan_status").mean()

# Display the grouped averages
grouped_averages.head()


In [None]:
selected_features = df_encoded.columns.drop("loan_status")
target = "loan_status"
# Features and target
X = df_encoded[selected_features]
y = df_encoded[target]

# Split data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21, stratify=y)

# Initialize logistic regression model
log_reg = LogisticRegression()

# Train the model
log_reg.fit(X_train, y_train)

# Generate predictions and probabilities
y_pred = log_reg.predict(X_test)
y_pred_prob = log_reg.predict_proba(X_test)[:, 1]

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
# Label the axes with the target classes (loan_status), not the DataFrame name
class_labels = ["loan_status = 0", "loan_status = 1"]
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_labels, yticklabels=class_labels)
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# Classification Report for precision, recall, F1-score
print("Classification Report:\n", classification_report(y_test, y_pred))

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# AUC (Area Under the Curve)
auc_score = roc_auc_score(y_test, y_pred_prob)
print("AUC:", auc_score)

# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"ROC Curve (AUC = {auc_score:.2f})", color="blue")
plt.plot([0, 1], [0, 1], "k--", label="Random Classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend(loc="best")
plt.grid()
plt.show()

Using only the loan intent variable, our model predicted the same outcome, loan_status = 0 (which we interpret as a denied loan), for every applicant. This is not a useful model: it misclassified 1,422 test records, which are exactly the records in the other class. It reports 78% accuracy, but that only reflects the data set, because about 78% of the test records belong to class 0. The AUC is about 0.5, which is no better than random guessing and confirms that the model is not working and is simply denying all the loans. The F1 score is 0.88 for denying loans but 0 for approving loans. The default 0.5 probability threshold may be too high for this data, which would explain why every loan is denied. Based on this model, we cannot determine which variable is most important, and it is not reliable for making predictions.
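The behavior described above can be reproduced on synthetic data: when every category has a positive rate below 0.5, a plain logistic regression on that single one-hot feature predicts class 0 for everyone, and its accuracy equals the majority-class share. The sketch below is an illustration under assumptions. The rates in `rates` are hypothetical, not taken from the credit data, and `class_weight="balanced"` is shown as one possible remedy rather than the method used in this homework.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
import pandas as pd

rng = np.random.default_rng(21)
# Hypothetical class-1 rates per loan intent, all below 0.5 (assumption)
rates = {"EDUCATION": 0.17, "MEDICAL": 0.27, "VENTURE": 0.15,
         "PERSONAL": 0.20, "DEBTCONSOLIDATION": 0.29, "HOMEIMPROVEMENT": 0.26}
intent = rng.choice(list(rates), size=5000)
p = np.array([rates[i] for i in intent])
status = (rng.random(5000) < p).astype(int)

X_demo = pd.get_dummies(pd.DataFrame({"loan_intent": intent}), drop_first=True)
y_demo = pd.Series(status, name="loan_status")
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=21, stratify=y_demo
)

plain = LogisticRegression().fit(X_tr, y_tr)
balanced = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

# Accuracy of always predicting the majority class
majority_baseline = max(y_te.mean(), 1 - y_te.mean())
plain_acc = accuracy_score(y_te, plain.predict(X_te))
plain_recall = recall_score(y_te, plain.predict(X_te))
balanced_recall = recall_score(y_te, balanced.predict(X_te))

print("Majority-class baseline accuracy:", round(majority_baseline, 3))
print("Plain model accuracy:", round(plain_acc, 3))
print("Plain model recall for class 1:", round(plain_recall, 3))
print("Balanced model recall for class 1:", round(balanced_recall, 3))
```

Comparing accuracy against the majority-class baseline is a quick check for this failure mode. Reweighting the classes, or lowering the decision threshold applied to `predict_proba`, trades some accuracy for a nonzero recall on the minority class.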

Part 3 Group Survey:
Course Rating - 4, Difficulty Level - 4
Pros: The content is interesting and applicable. We appreciate the guidance with providing the code. Guest speakers are also interesting.
Cons: The lectures are very content heavy, and it can be difficult to digest it all. For those who don't have much coding experience, it can be difficult to deal with errors in the code.
Suggestions: Take more time to go through the code in detail so we can better understand what you are doing.