In [1]:
# Agent Role Definition and Task for German Credit Risk Classifier

# 1. Role: Data Science Consultant specializing in Binary Classification and Risk Assessment.
# 2. Goal: Build a high-accuracy Classification model to predict credit risk (Good or Bad).
# 3. Data Source: German Credit Data (Statlog), loaded below from the UCI Machine Learning Repository via ucimlrepo.

# Agent, your primary task is to:
# 1. Load the German Credit Data.
# 2. Perform comprehensive Exploratory Data Analysis (EDA), focusing on distribution and relationships to the 'risk' target variable.
# 3. Handle categorical variables (e.g., one-hot encoding).
# 4. Train two classification models: Logistic Regression (as a baseline) and a more advanced model (e.g., Support Vector Machine or Random Forest).
# 5. Evaluate the models using standard classification metrics (Accuracy, Precision, Recall, F1-Score).
# 6. Provide clear code, comments, and a final conclusion on which model is best for financial risk assessment.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# We will use the Agent to suggest the best model and further steps.

In [3]:
# Agent Task 1: Load the German Credit Data and display its first 5 rows and information (info()).

import pandas as pd
from ucimlrepo import fetch_ucirepo

# Fetch dataset 
# The German Credit Data is readily available through the UCI Machine Learning Repository
try:
    german_credit = fetch_ucirepo(id=144)
    
    # Data (features) and Target variable
    X = german_credit.data.features
    y = german_credit.data.targets
    
    # Combine for easier EDA
    df = pd.concat([X, y], axis=1)
    
    # Rename the target column for clarity. Its values are still the original 1 (Good Risk) and 2 (Bad Risk) at this point.
    # The original target column is often named 'class' or similar, so the last column is renamed by position.
    df.columns = list(df.columns[:-1]) + ['Credit_Risk']
    
    print("--- First 5 Rows of Data ---")
    print(df.head())
    print("\n--- Data Information (Types and Missing Values) ---")
    df.info()

except Exception as e:
    print(f"Error loading data. Trying alternative source (if applicable) or check library installation: {e}") 

--- First 5 Rows of Data ---
  Attribute1  Attribute2 Attribute3 Attribute4  Attribute5 Attribute6  \
0        A11           6        A34        A43        1169        A65   
1        A12          48        A32        A43        5951        A61   
2        A14          12        A34        A46        2096        A61   
3        A11          42        A32        A42        7882        A61   
4        A11          24        A33        A40        4870        A61   

  Attribute7  Attribute8 Attribute9 Attribute10  ...  Attribute12 Attribute13  \
0        A75           4        A93        A101  ...         A121          67   
1        A73           2        A92        A101  ...         A121          22   
2        A74           2        A93        A101  ...         A121          49   
3        A74           2        A93        A103  ...         A122          45   
4        A73           3        A93        A101  ...         A124          53   

   Attribute14 Attribute15 Attribute16  Attri
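
The role definition asks for EDA that relates features to the risk target. A minimal sketch of one such check follows, assuming a toy frame rather than the real data: the category codes resemble those printed above, but the rows and labels in `toy` are invented for illustration.

```python
import pandas as pd

# Hypothetical toy rows; the codes resemble the head() output, the labels are made up.
toy = pd.DataFrame({
    "Attribute1": ["A11", "A11", "A12", "A12", "A14", "A14", "A14", "A11"],
    "Credit_Risk": [2, 1, 2, 1, 1, 1, 1, 2],  # 1 = good, 2 = bad (original UCI coding)
})

# Share of good/bad outcomes within each category (each row sums to 1).
risk_by_category = pd.crosstab(
    toy["Attribute1"], toy["Credit_Risk"], normalize="index"
)
print(risk_by_category)
```

On the real `df`, the same call with any categorical column in place of `Attribute1` would give the per-category share of each risk class.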

In [4]:
# Agent Task 2: Data Preprocessing and Target Variable Analysis

# 1. Analyze the distribution of the target variable (Credit_Risk).
# 2. Convert the 'object' type categorical columns into numerical form using One-Hot Encoding (pd.get_dummies).
# 3. Remap the target variable's values from the original (1 and 2) to binary (0 and 1), where 0 = Good Risk and 1 = Bad Risk, so that 1 marks the event of interest.
# 4. Display the shapes of the original and processed DataFrame.

# Analyze Target Distribution
print("--- Credit Risk Distribution (Before Renaming) ---")
print(df['Credit_Risk'].value_counts())
print("\n")

# Per the UCI documentation, the original classes are 1 (Good Risk) and 2 (Bad Risk).
# Remap 1 -> 0 (Good) and 2 -> 1 (Bad) so that 1 represents the event of interest (Bad Risk), as is common in ML practice.
# The distribution is printed again below to confirm the remapping.
df['Credit_Risk'] = df['Credit_Risk'].replace({1: 0, 2: 1})

print("--- Remapped Credit Risk Distribution (0: Good Risk, 1: Bad Risk) ---")
print(df['Credit_Risk'].value_counts())
# The .2% format spec converts each proportion (e.g. 0.30) to a percentage.
print(f"Total Bad Risk (1): {df['Credit_Risk'].value_counts()[1]} ({df['Credit_Risk'].value_counts(normalize=True)[1]:.2%})")
print(f"Total Good Risk (0): {df['Credit_Risk'].value_counts()[0]} ({df['Credit_Risk'].value_counts(normalize=True)[0]:.2%})")

# Handle Categorical Features using One-Hot Encoding
df_processed = pd.get_dummies(df, columns=df.select_dtypes(include=['object']).columns, drop_first=True)

print("\n--- Processed Data Shape ---")
print(f"Original Shape: {df.shape}")
print(f"Processed Shape (after One-Hot Encoding): {df_processed.shape}") 

--- Credit Risk Distribution (Before Renaming) ---
Credit_Risk
1    700
2    300
Name: count, dtype: int64


--- Remapped Credit Risk Distribution (0: Good Risk, 1: Bad Risk) ---
Credit_Risk
0    700
1    300
Name: count, dtype: int64
Total Bad Risk (1): 300 (30.00%)
Total Good Risk (0): 700 (70.00%)

--- Processed Data Shape ---
Original Shape: (1000, 21)
Processed Shape (after One-Hot Encoding): (1000, 49)
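
With `drop_first=True`, a categorical column with k distinct values contributes k-1 indicator columns, which is how the column count grows after encoding. A small sketch of that counting, assuming a three-column frame built from the first four rows shown in the `head()` output (the names `demo`, `cat_cols`, and `expected_width` are illustrative):

```python
import pandas as pd

# First four rows of three columns, copied from the head() output above.
demo = pd.DataFrame({
    "Attribute5": [1169, 5951, 2096, 7882],        # numeric
    "Attribute4": ["A43", "A43", "A46", "A42"],    # 3 distinct categories
    "Attribute6": ["A65", "A61", "A61", "A61"],    # 2 distinct categories
})

cat_cols = ["Attribute4", "Attribute6"]
encoded = pd.get_dummies(demo, columns=cat_cols, drop_first=True)

# Expected width: 1 numeric column + (3 - 1) + (2 - 1) indicator columns.
expected_width = 1 + sum(demo[c].nunique() - 1 for c in cat_cols)
print(encoded.shape, expected_width)
print(list(encoded.columns))
```

The dropped level in each column is the first one in sorted order, so it becomes the implicit reference category.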


In [5]:
# Agent Task 3: Model Training and Evaluation

# 1. Separate features (X) and target (y).
# 2. Split the data into training and testing sets (80% train, 20% test).
# 3. Train two models: Logistic Regression and Random Forest Classifier.
# 4. Evaluate both models and print a detailed classification report for each.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# 1. Separate Features and Target
X = df_processed.drop('Credit_Risk', axis=1)
y = df_processed['Credit_Risk']

# 2. Split Data (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Note: stratify=y ensures the 70/30 imbalance is maintained in both train and test sets.

print("--- Data Splitting Complete ---")
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
print("\n")


# 3a. Train Logistic Regression (Baseline Model)
logreg_model = LogisticRegression(max_iter=500, random_state=42)
logreg_model.fit(X_train, y_train)
y_pred_logreg = logreg_model.predict(X_test)

# 4a. Evaluate Logistic Regression
print("--- Evaluation: Logistic Regression ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred_logreg):.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_logreg))
print("\n")


# 3b. Train Random Forest Classifier (Advanced Model)
# Random Forest often handles complex relationships and is good with high-dimensional data (after one-hot encoding).
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# 4b. Evaluate Random Forest
print("--- Evaluation: Random Forest Classifier ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_rf)) 

--- Data Splitting Complete ---
Training set size: 800 samples
Testing set size: 200 samples




STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


--- Evaluation: Logistic Regression ---
Accuracy: 0.7850
Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.89      0.85       140
           1       0.68      0.53      0.60        60

    accuracy                           0.79       200
   macro avg       0.75      0.71      0.73       200
weighted avg       0.78      0.79      0.78       200



--- Evaluation: Random Forest Classifier ---
Accuracy: 0.7700
Classification Report:
              precision    recall  f1-score   support

           0       0.78      0.94      0.85       140
           1       0.72      0.38      0.50        60

    accuracy                           0.77       200
   macro avg       0.75      0.66      0.68       200
weighted avg       0.76      0.77      0.75       200
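
The convergence warning above suggests scaling the features. One common remedy, sketched here on synthetic data rather than the notebook's split, is to standardize inside a pipeline so the scaler is fit on the training portion only. The `make_classification` data, the inflated first column, and all variable names below are assumptions for illustration; the metrics above would need to be rerun on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with a roughly 70/30 class split (not the German Credit data).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.7, 0.3], random_state=42
)
X_demo[:, 0] *= 5000  # mimic a large-magnitude column such as a credit amount

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling happens inside the pipeline, so test rows never influence the scaler.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
scaled_logreg.fit(X_tr, y_tr)
demo_pred = scaled_logreg.predict(X_te)
print(demo_pred.shape)
```

Random Forest is insensitive to feature scale, so only the logistic regression baseline would need this change.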



In [6]:
# Agent Task 4: Final Conclusion and Documentation Summary

from sklearn.metrics import f1_score

print("--- FINAL CLASSIFICATION CONCLUSION ---")
print("1. Data Imbalance: The data is moderately imbalanced (70% Good Risk, 30% Bad Risk).")
print("2. Evaluation Focus: Given the financial context, the primary focus must be on identifying 'Bad Risk' (Class 1) correctly, which means prioritizing Recall and F1-Score for Class 1.")
print("\n")

# Comparative Analysis
# Compute the Bad Risk (Class 1) F1-scores from the test predictions rather than hard-coding them.
logreg_f1 = round(f1_score(y_test, y_pred_logreg, pos_label=1), 2)
rf_f1 = round(f1_score(y_test, y_pred_rf, pos_label=1), 2)

print(f"Logistic Regression F1-Score (Bad Risk): {logreg_f1}")
print(f"Random Forest F1-Score (Bad Risk): {rf_f1}")

if logreg_f1 > rf_f1:
    print("\nCONCLUSION: The Logistic Regression model demonstrates better overall performance, especially in terms of F1-Score for the minority class (Bad Risk). Although Recall for Bad Risk (53%) is still a concern, it is the superior model between the two tested. Further optimization (e.g., using techniques like SMOTE or cost-sensitive learning) is recommended to improve the identification of high-risk customers.")
else:
    print("\nCONCLUSION: The Random Forest Classifier demonstrates better overall performance, particularly in managing the complexity introduced by One-Hot Encoding. This model is recommended for deployment, but further tuning is needed to increase the Recall for the Bad Risk class.")

# Save the Notebook for GitHub
# Ensure you save your Notebook as 'German_Credit_Risk_Classifier.ipynb' 

--- FINAL CLASSIFICATION CONCLUSION ---
1. Data Imbalance: The data is moderately imbalanced (70% Good Risk, 30% Bad Risk).
2. Evaluation Focus: Given the financial context, the primary focus must be on identifying 'Bad Risk' (Class 1) correctly, which means prioritizing Recall and F1-Score for Class 1.


Logistic Regression F1-Score (Bad Risk): 0.6
Random Forest F1-Score (Bad Risk): 0.5

CONCLUSION: The Logistic Regression model demonstrates better overall performance, especially in terms of F1-Score for the minority class (Bad Risk). Although Recall for Bad Risk (53%) is still a concern, it is the superior model between the two tested. Further optimization (e.g., using techniques like SMOTE or cost-sensitive learning) is recommended to improve the identification of high-risk customers.
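
The cost-sensitive learning mentioned in the conclusion can be approximated in scikit-learn with `class_weight="balanced"`, which weights each class inversely to its frequency. The sketch below only shows how those weights come out for a 700/300 split; it does not rerun the notebook's models, and `y_demo` is a hypothetical label vector built to match the class counts above.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels matching the 700 good (0) / 300 bad (1) counts.
y_demo = np.array([0] * 700 + [1] * 300)

# "balanced" uses n_samples / (n_classes * count_per_class).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)
print(dict(zip([0, 1], weights.round(3))))
```

Passing `class_weight="balanced"` to `LogisticRegression` or `RandomForestClassifier` applies the same weighting during fitting, which typically trades some precision on the Bad Risk class for higher recall.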
