### Credit Risk Modeling with Expert LLM Reporting

This notebook classifies credit risk customers using three models:

- Random Forest
- XGBoost
- Deep Learning (Keras Sequential)

I aim to **maximize recall** on the high-risk class, so that as few high-risk customers as possible are misclassified as low-risk. SMOTE is used to rebalance the training classes.

At the end, a **local LLM (Mistral via Ollama)** drafts a report from the performance metrics. The confusion matrices and feature importance plots are saved to disk and referenced by file name in the prompt.

[**Data Source:** Kaggle - Credit Risk Customers](https://www.kaggle.com/datasets/ppb00x/credit-risk-customers/data)

**Dataset Overview:** The dataset contains information on customers, including demographic details, financial status, and credit history. The target variable, `class`, indicates whether a customer is high risk (`bad`, encoded as 1) or low risk (`good`, encoded as 0). The classes are imbalanced: low-risk customers (0) outnumber high-risk customers (1). To mitigate this, I use SMOTE to oversample the minority (high-risk) class.


#### 1. Load Libraries and Data
**Steps:**
1. Load dependencies and the dataset.
2. Preview the DataFrame and check for missing values (none found in this dataset).

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Ensure plot directory exists
os.makedirs("plots", exist_ok=True)

# Load data
df = pd.read_csv("data/credit_customers.csv")
print(f"Shape: {df.shape}")
display(df.head())

# Check for missing values
print("Missing values per column:\n", df.isnull().sum())

# Plot class distribution
plt.figure(figsize=(4,2))
sns.countplot(x='class', data=df)
plt.title("Class Distribution (Original)")
plt.savefig("plots/class_distribution.png", bbox_inches="tight")
plt.close()

Shape: (1000, 21)


| | checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | ... | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | <0 | 6.0 | critical/other existing credit | radio/tv | 1169.0 | no known savings | >=7 | 4.0 | male single | none | ... | real estate | 67.0 | none | own | 2.0 | skilled | 1.0 | yes | yes | good |
| 1 | 0<=X<200 | 48.0 | existing paid | radio/tv | 5951.0 | <100 | 1<=X<4 | 2.0 | female div/dep/mar | none | ... | real estate | 22.0 | none | own | 1.0 | skilled | 1.0 | none | yes | bad |
| 2 | no checking | 12.0 | critical/other existing credit | education | 2096.0 | <100 | 4<=X<7 | 2.0 | male single | none | ... | real estate | 49.0 | none | own | 1.0 | unskilled resident | 2.0 | none | yes | good |
| 3 | <0 | 42.0 | existing paid | furniture/equipment | 7882.0 | <100 | 4<=X<7 | 2.0 | male single | guarantor | ... | life insurance | 45.0 | none | for free | 1.0 | skilled | 2.0 | none | yes | good |
| 4 | <0 | 24.0 | delayed previously | new car | 4870.0 | <100 | 1<=X<4 | 3.0 | male single | none | ... | no known property | 53.0 | none | for free | 2.0 | skilled | 2.0 | none | yes | bad |


Missing values per column:
 checking_status           0
duration                  0
credit_history            0
purpose                   0
credit_amount             0
savings_status            0
employment                0
installment_commitment    0
personal_status           0
other_parties             0
residence_since           0
property_magnitude        0
age                       0
other_payment_plans       0
housing                   0
existing_credits          0
job                       0
num_dependents            0
own_telephone             0
foreign_worker            0
class                     0
dtype: int64


#### 2. Data Preprocessing
Encode categorical variables, split the data into stratified train and test sets, rebalance the training set with SMOTE, and scale the features. SMOTE is applied only to the training split, so the test set keeps the original class distribution.
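
To make the SMOTE step more concrete, here is a simplified, pure-Python sketch of its core idea: each synthetic sample is placed at a random point on the segment between a minority-class sample and one of its nearest minority-class neighbors. This is an illustration only, not the `imblearn` implementation used in this notebook; the function name `smote_sketch` and the toy points are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbors.

    Simplified illustration of the SMOTE idea; not the imblearn implementation.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # Interpolate: base + gap * (neighbor - base), with gap in [0, 1).
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical toy minority class with two numeric features.
minority_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority_points, n_new=3)
for point in new_points:
    print(point)
```

Because the interpolation treats every feature as continuous, ordinal-encoded categorical columns can end up with fractional values after resampling. If that matters for a given model, `imblearn` also provides `SMOTENC` for mixed numeric and categorical data, which may be worth evaluating.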

In [2]:
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

cat_cols = df.select_dtypes(include=['object']).columns.tolist()
cat_cols.remove('class')

encoder = OrdinalEncoder()
df[cat_cols] = encoder.fit_transform(df[cat_cols])

df['class'] = df['class'].map({'good': 0, 'bad': 1})

X = df.drop('class', axis=1)
y = df['class']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# SMOTE for class imbalance
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

# Feature scaling
scaler = StandardScaler()
X_train_res = scaler.fit_transform(X_train_res)
X_test = scaler.transform(X_test)

# Plot class distribution after SMOTE
plt.figure(figsize=(4,2))
sns.countplot(x=y_train_res)
plt.title("Class Distribution After SMOTE")
plt.savefig("plots/class_distribution_after_smote.png", bbox_inches="tight")
plt.close()


#### 3. Model Training & Evaluation
Train three models (Random Forest, XGBoost, Deep Learning), evaluate their performance, and save confusion matrices and feature importances as plots for later analysis.
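
Since the stated goal is high recall on the high-risk class, one option worth considering is a decision threshold below the default 0.5 applied to predicted probabilities. The sketch below uses made-up labels and scores (not results from this notebook) to show how recall and precision for class 1 move as the threshold drops; `recall_precision_at` is a hypothetical helper.

```python
def recall_precision_at(y_true, scores, threshold):
    """Return (recall, precision) for the positive class at a given threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Made-up example: 1 = high risk (bad), 0 = low risk (good).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.3, 0.7, 0.35, 0.2, 0.2, 0.1, 0.05]

for threshold in (0.5, 0.35, 0.25):
    recall, precision = recall_precision_at(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  recall={recall:.2f}  precision={precision:.2f}")
```

In a real run, the scores would come from `predict_proba(X_test)[:, 1]` for the tree models or from `dl.predict(X_test)` for the Keras model. The threshold should be chosen on a validation split rather than the test set, so the reported test metrics stay unbiased.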

In [3]:
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import classification_report, confusion_matrix
import json

results = {}

# Random Forest
rf = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=42, n_jobs=-1)
rf.fit(X_train_res, y_train_res)
y_pred_rf = rf.predict(X_test)
results['RandomForest'] = classification_report(y_test, y_pred_rf, output_dict=True)
cm_rf = confusion_matrix(y_test, y_pred_rf)
plt.figure()
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Blues')
plt.title("Random Forest Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.savefig("plots/conf_matrix_rf.png", bbox_inches="tight")
plt.close()

# XGBoost (use_label_encoder is omitted: recent XGBoost versions ignore it and emit a warning)
xgb = XGBClassifier(n_estimators=40, max_depth=4, eval_metric='logloss', random_state=42, n_jobs=-1)
xgb.fit(X_train_res, y_train_res)
y_pred_xgb = xgb.predict(X_test)
results['XGBoost'] = classification_report(y_test, y_pred_xgb, output_dict=True)
cm_xgb = confusion_matrix(y_test, y_pred_xgb)
plt.figure()
sns.heatmap(cm_xgb, annot=True, fmt='d', cmap='Blues')
plt.title("XGBoost Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.savefig("plots/conf_matrix_xgb.png", bbox_inches="tight")
plt.close()

# Deep Learning
dl = Sequential([
    Dense(64, activation='relu', input_shape=(X_train_res.shape[1],)),
    Dropout(0.3),
    Dense(32, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])
dl.compile(optimizer='adam', loss='binary_crossentropy', metrics=['Recall'])
es = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True, verbose=0)
dl.fit(X_train_res, y_train_res, validation_split=0.2, epochs=10, batch_size=32, callbacks=[es], verbose=0)
y_pred_dl = (dl.predict(X_test) > 0.5).astype(int).ravel()  # flatten (n, 1) predictions to shape (n,)
results['DeepLearning'] = classification_report(y_test, y_pred_dl, output_dict=True)
cm_dl = confusion_matrix(y_test, y_pred_dl)
plt.figure()
sns.heatmap(cm_dl, annot=True, fmt='d', cmap='Blues')
plt.title("Deep Learning Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.savefig("plots/conf_matrix_dl.png", bbox_inches="tight")
plt.close()

# Feature Importances
features = X.columns
importances_rf = rf.feature_importances_
importances_xgb = xgb.feature_importances_

plt.figure(figsize=(10,4))
sns.barplot(x=importances_rf, y=features)
plt.title("Random Forest Feature Importance")
plt.savefig("plots/feature_importance_rf.png", bbox_inches="tight")
plt.close()

plt.figure(figsize=(10,4))
sns.barplot(x=importances_xgb, y=features)
plt.title("XGBoost Feature Importance")
plt.savefig("plots/feature_importance_xgb.png", bbox_inches="tight")
plt.close()

# Save metrics
os.makedirs("results", exist_ok=True)
with open("results/metrics.json", "w") as f:
    json.dump(results, f, indent=2)




#### 4. Automated Executive Summary with LLM (Mistral via Ollama)
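Generate an executive summary using a local LLM (Mistral via Ollama). The prompt embeds the saved metrics as JSON and lists the plot file names. Only text is sent in the request, so the model sees the file names rather than the images, and any statement it makes about the plots should be checked against the actual figures before the report reaches management.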
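
If the plots themselves should be interpreted, the request needs to carry image data, and a multimodal model would be needed in place of `mistral`. Ollama's generate endpoint accepts an `images` list of base64-encoded files for such models. The sketch below only builds a payload of that shape and sends nothing; the model name `llava` and the helper `build_payload` are assumptions, not part of this notebook.

```python
import base64
from pathlib import Path


def build_payload(model, prompt, image_paths=()):
    """Build a JSON-serializable body for Ollama's /api/generate endpoint.

    Images are attached as base64 strings; missing files are skipped.
    """
    images = []
    for path in map(Path, image_paths):
        if path.is_file():
            images.append(base64.b64encode(path.read_bytes()).decode("ascii"))
    payload = {"model": model, "prompt": prompt, "stream": False}
    if images:
        payload["images"] = images
    return payload


# Hypothetical usage: "llava" stands in for any locally pulled multimodal model.
payload = build_payload(
    model="llava",
    prompt="Describe this confusion matrix.",
    image_paths=["plots/conf_matrix_rf.png"],
)
print(sorted(payload.keys()))
```

The resulting dictionary can be passed as the `json=` argument of `requests.post`, in the same way as the text-only request in the cell below.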

In [4]:
import requests

# Load metrics
with open("results/metrics.json", "r") as f:
    metrics = json.load(f)

# Compose prompt, referencing saved plot files
prompt = f"""
You are a senior financial data scientist.
Below are the model performance metrics for a credit risk classification task:

{json.dumps(metrics, indent=2)}

The following plots are available for your review:
- plots/class_distribution.png (original class distribution)
- plots/class_distribution_after_smote.png (post-SMOTE class distribution)
- plots/conf_matrix_rf.png (Random Forest confusion matrix)
- plots/conf_matrix_xgb.png (XGBoost confusion matrix)
- plots/conf_matrix_dl.png (Deep Learning confusion matrix)
- plots/feature_importance_rf.png (Random Forest feature importance)
- plots/feature_importance_xgb.png (XGBoost feature importance)

Please write a detailed, expert-level executive summary for management. 
Interpret the metrics and each plot, highlight key findings, and provide actionable recommendations.
"""

# Call Mistral via Ollama
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_NAME = "mistral"

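response = requests.post(
    OLLAMA_URL,
    json={
        "model": MODEL_NAME,
        "prompt": prompt,
        "stream": False
    },
    timeout=600,  # local generation can be slow; fail instead of hanging indefinitely
)
response.raise_for_status()  # surface HTTP errors (e.g. model not available) early
llm_report = response.json()["response"]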

# Save the report as markdown
with open("executive_summary.md", "w", encoding="utf-8") as f:
    f.write(llm_report)

# Display the report in the notebook
from IPython.display import Markdown, display
display(Markdown(llm_report))


 Executive Summary: Credit Risk Classification Model Performance Analysis

In this analysis, we evaluated three machine learning models for credit risk classification: Random Forest (RF), XGBoost (XGB), and Deep Learning (DL). The dataset consisted of 200 samples, with approximately 70% belonging to class 0 (low-risk) and 30% belonging to class 1 (high-risk).

Model Performance Metrics:

1. Random Forest: The model demonstrated relatively high precision for low-risk samples (0.80) but lower precision for high-risk samples (0.52). Recall was higher for both classes, with a notable improvement in high-risk class recall (0.55 vs 0.78 for low-risk). The overall accuracy of the model was moderate at 0.715.

2. XGBoost: This model outperformed RF in terms of precision and recall for both classes, with a significant increase in high-risk class recall (0.56 vs 0.87 for low-risk). However, the overall accuracy was higher than RF but lower than DL at 0.78.

3. Deep Learning: This model achieved the highest precision for both classes, with a slight edge over XGBoost in high-risk class precision (0.65 vs 0.52). Recall was slightly lower compared to RF and XGBoost for both classes but still acceptable. The overall accuracy of the model was moderate at 0.705.

It is worth noting that while DL had a slightly higher weighted F1-score than XGBoost, it also had the lowest recall for high-risk samples. This suggests that the DL model may be biased towards predicting low-risk samples, which could result in more false negatives for high-risk cases.

Class Distribution and SMOTE Impact:
The class distribution plots showed a significant imbalance between low-risk (70%) and high-risk (30%) samples. To address this issue, the dataset was resampled using the Synthetic Minority Over-sampling Technique (SMOTE). The post-SMOTE plot showed a more balanced distribution between the two classes.

Confusion Matrices:
The confusion matrices for all three models demonstrated that all models had a higher tendency to predict low-risk samples, as shown by the larger number of false negatives in the high-risk class and smaller numbers of false positives in the low-risk class. The Deep Learning model had the highest number of false negatives among the three models.

Feature Importance:
The feature importance plots for RF and XGBoost were similar, with credit history, loan amount, and income being the most important features in predicting credit risk. For DL, these top three features remained consistent, but other factors such as purpose of loan and employment status also played a significant role in model predictions.

Key Findings:
- All models demonstrated relatively high precision for low-risk samples but struggled to accurately classify high-risk samples.
- The Deep Learning model had the highest overall accuracy, but it also had the lowest recall for high-risk samples and a tendency to overfit on the low-risk class.
- Class imbalance in the dataset was addressed using SMOTE, leading to a more balanced distribution between classes.

Actionable Recommendations:
1. Optimize the Deep Learning model by adjusting hyperparameters or using ensemble methods to reduce overfitting and improve high-risk class predictions.
2. Evaluate the use of additional features such as employment status, purpose of loan, and other relevant factors that may impact credit risk.
3. Continuously monitor and retrain the models with new data to ensure their performance remains robust over time.
4. Implement strategies to mitigate false negatives in high-risk samples, such as more stringent underwriting guidelines or additional review processes for borderline cases.

In [5]:
!pandoc executive_summary.md -o executive_summary.pdf
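
The shell call above assumes that `pandoc` and a PDF engine are installed. A more defensive variant, sketched here with only the standard library, checks for the executable first and reports a failure instead of raising; `export_pdf` is a hypothetical helper name.

```python
import shutil
import subprocess


def export_pdf(md_path="executive_summary.md", pdf_path="executive_summary.pdf"):
    """Convert Markdown to PDF with pandoc; return True on success, False otherwise."""
    if shutil.which("pandoc") is None:
        print("pandoc not found on PATH; skipping PDF export.")
        return False
    result = subprocess.run(
        ["pandoc", md_path, "-o", pdf_path],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Possible causes include a missing input file or a missing PDF engine.
        print(f"pandoc failed: {result.stderr.strip()}")
        return False
    return True
```

Calling `export_pdf()` with its defaults mirrors the shell command, while the boolean return value lets the notebook continue when the conversion is not possible.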