# 🎓 ADVANCED MACHINE LEARNING FOR CREDIT RISK CLASSIFICATION

---

## 📋 COVER PAGE

<div style="text-align: center; padding: 60px 20px;">

### **LONDON METROPOLITAN UNIVERSITY**

---

#### **Advanced Machine Learning Project**

#### **Credit Risk Classification using Ensemble Methods**

---

### **Student Information**

**Name:** SUJAN PAUDEL

**LondonMet ID:** 23050272

---

### **Project Title:**
### Advanced Machine Learning for Credit Risk Classification with Feature Engineering and Ensemble Optimization

---

### **Objective:**
To develop an optimized machine learning system for credit risk classification using advanced feature engineering, SMOTE-based class balancing, hyperparameter optimization, and ensemble voting techniques to improve prediction accuracy and AUC-ROC scores.

---

### **Key Techniques:**
- Advanced Feature Engineering (16 engineered features)
- SMOTE Oversampling for Class Imbalance
- RobustScaler Preprocessing (Outlier-Resistant)
- RandomizedSearchCV Hyperparameter Optimization
- Ensemble Voting Classifier with Weighted Voting
- Threshold Optimization for Best Classification Performance

---

### **Dataset:** South German Credit Dataset (1000 samples, 20 original features)

---

**Date:** 2024

---

</div>

## 1. Import Required Libraries
Import all necessary libraries for data processing, machine learning, and visualization.

In [None]:
# Core libraries
import pandas as pd
import numpy as np
import warnings
import time

# Machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier

# Preprocessing and pipeline utilities
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import RobustScaler
from sklearn.compose import ColumnTransformer
from imblearn.over_sampling import SMOTE

# Hyperparameter optimization
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint

# Evaluation metrics
from sklearn.metrics import (
    accuracy_score, roc_auc_score, f1_score, confusion_matrix, 
    classification_report, roc_curve, precision_score, recall_score
)

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('ignore')

print("✓ All libraries imported successfully")

## 2. Load and Explore Data
Load the South German Credit dataset and perform initial exploratory analysis.

In [None]:
# Define column names
column_names = [
    'Status_Checking_Account', 'Duration_Months', 'Credit_History', 'Purpose', 
    'Credit_Amount', 'Savings_Account', 'Employment_Since', 'Installment_Rate', 
    'Gender_Status', 'Other_Debtors', 'Residence_Years', 'Property', 'Age', 
    'Other_Installments', 'Housing', 'Existing_Credits', 'Job', 'Dependents', 
    'Telephone', 'Foreign_Worker', 'Credit_Risk'
]

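# Load dataset: the file is whitespace-delimited; header=0 discards the file's
# own header row so that the descriptive names above are used instead.
# sep=r'\s+' replaces delim_whitespace=True, which is deprecated in recent pandas.
df = pd.read_csv('SouthGermanCredit.asc', sep=r'\s+', header=0, names=column_names)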
# Note: the UCI documentation codes the target as 1 = good credit, 0 = bad credit.
# The plots below label 0 as Good Risk and 1 as Bad Risk, so confirm the mapping against the file.

print(f"Dataset shape: {df.shape}")
print(f"\nTarget distribution:")
print(df['Credit_Risk'].value_counts())
print(f"\nClass balance: {df['Credit_Risk'].value_counts(normalize=True).round(3)}")
print(f"\nFirst few rows:")
df.head()

In [None]:
# Dataset info
print("Dataset Information:")
df.info()
print("\nStatistical Summary:")
df.describe()
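
Beyond `info()` and `describe()`, a few explicit integrity checks can catch loading problems early. The block below is a minimal sketch rather than part of the original notebook: `summarize_frame` is a hypothetical helper, and the small `toy` frame uses made-up values so that the example runs without the data file.

```python
import pandas as pd

def summarize_frame(frame: pd.DataFrame, target: str) -> dict:
    """Collect basic integrity facts about a loaded table (hypothetical helper)."""
    return {
        "n_rows": len(frame),
        "n_missing": int(frame.isna().sum().sum()),      # total missing cells
        "n_duplicates": int(frame.duplicated().sum()),   # fully repeated rows
        "target_values": sorted(frame[target].unique().tolist()),
    }

# Tiny synthetic stand-in for the real data (values are made up)
toy = pd.DataFrame({
    "Duration_Months": [12, 24, 12, 36],
    "Credit_Amount": [1000, 2500, 1000, None],
    "Credit_Risk": [0, 1, 0, 1],
})

summary = summarize_frame(toy, target="Credit_Risk")
print(summary)
```

On the real data, the same call would be `summarize_frame(df, target='Credit_Risk')`; a binary target should report exactly two values.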

## 3. Data Visualization and Analysis

### 3.1 Target Distribution Visualization
Bar chart and pie chart to quantify the class imbalance between Good Risk and Bad Risk loans.

In [None]:
# Target Distribution - Bar Chart and Pie Chart
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot (sort by class label so that counts, colors, and tick labels line up)
credit_counts = df['Credit_Risk'].value_counts().sort_index()
colors_target = ['#2ecc71', '#e74c3c']  # Green for Good, Red for Bad
bars = axes[0].bar(credit_counts.index, credit_counts.values, color=colors_target, alpha=0.8, edgecolor='black', linewidth=1.5)
axes[0].set_xlabel('Credit Risk Class', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Count', fontsize=12, fontweight='bold')
axes[0].set_title('Target Distribution: Credit Risk Classes', fontsize=14, fontweight='bold')
axes[0].set_xticks([0, 1])  # fix tick positions before assigning labels
axes[0].set_xticklabels(['Good Risk (0)', 'Bad Risk (1)'])
axes[0].grid(alpha=0.3, axis='y')

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    axes[0].text(bar.get_x() + bar.get_width()/2., height,
                f'{int(height)}', ha='center', va='bottom', fontweight='bold')

# Percentage distribution
percentages = (credit_counts / len(df) * 100).round(2)
colors_pie = ['#2ecc71', '#e74c3c']
wedges, texts, autotexts = axes[1].pie(percentages.values, labels=['Good Risk', 'Bad Risk'], 
                                        autopct='%1.1f%%', colors=colors_pie, startangle=90,
                                        textprops={'fontsize': 11, 'fontweight': 'bold'})
axes[1].set_title('Class Imbalance Distribution', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print(f"\n✓ CLASS IMBALANCE ANALYSIS:")
print(f"  Good Risk (0): {credit_counts[0]} ({percentages[0]:.2f}%)")
print(f"  Bad Risk (1): {credit_counts[1]} ({percentages[1]:.2f}%)")
print(f"  Imbalance Ratio: 1:{credit_counts[0]/credit_counts[1]:.2f}")
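
One practical consequence of an uneven target is that a plain random split can shift the class shares between the training and test partitions. The following sketch, which assumes a 70/30 class mix and uses synthetic labels rather than the real `Credit_Risk` column, illustrates how `stratify` in `train_test_split` keeps the shares aligned.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 70/30 mix (made-up stand-in for Credit_Risk)
y_toy = np.array([0] * 700 + [1] * 300)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

# stratify=y_toy keeps the class shares (almost) identical in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

train_share = y_tr.mean()  # share of class 1 in the training partition
test_share = y_te.mean()   # share of class 1 in the test partition
print(f"class-1 share: train={train_share:.3f}, test={test_share:.3f}")
```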

### 3.2 Numerical Features Distribution Analysis
Histograms with KDE overlays for Credit Amount, Duration, Age, and Installment Rate to detect skewness and judge whether scaling or capping is needed.

In [None]:
# Numerical Features Distribution - Histograms
numerical_features = ['Credit_Amount', 'Duration_Months', 'Age', 'Installment_Rate']

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, feature in enumerate(numerical_features):
    # Histogram with KDE
    axes[idx].hist(df[feature], bins=30, color='steelblue', alpha=0.7, edgecolor='black', density=True)
    df[feature].plot(kind='kde', ax=axes[idx], secondary_y=False, color='red', linewidth=2.5, label='KDE')
    
    axes[idx].set_xlabel(feature, fontsize=11, fontweight='bold')
    axes[idx].set_ylabel('Density', fontsize=11, fontweight='bold')
    axes[idx].set_title(f'Distribution of {feature}', fontsize=12, fontweight='bold')
    axes[idx].grid(alpha=0.3)
    
    # Calculate skewness
    skewness = df[feature].skew()
    axes[idx].text(0.98, 0.97, f'Skewness: {skewness:.3f}', 
                   transform=axes[idx].transAxes, fontsize=10,
                   verticalalignment='top', horizontalalignment='right',
                   bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.show()

print("\n✓ NUMERICAL FEATURES SKEWNESS ANALYSIS:")
for feature in numerical_features:
    skewness = df[feature].skew()
    skewness_type = "Highly Skewed" if abs(skewness) > 1 else "Moderately Skewed" if abs(skewness) > 0.5 else "Nearly Symmetric"
    print(f"  {feature}: {skewness:.4f} ({skewness_type})")
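
To make the link between skewness and the scaling/capping decision concrete, the sketch below compares two common remedies on a right-skewed series: a `log1p` transform and percentile capping. The data are synthetic log-normal draws, and the 1st/99th percentile bounds are an assumption chosen for illustration, not values taken from this project.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed amounts (made-up stand-in for Credit_Amount)
rng = np.random.default_rng(42)
amounts = pd.Series(rng.lognormal(mean=8.0, sigma=0.8, size=1000))

skew_raw = amounts.skew()
skew_log = np.log1p(amounts).skew()  # log transform compresses the long right tail

# Percentile capping (winsorizing); the 1%/99% bounds are an assumed choice
lo, hi = amounts.quantile([0.01, 0.99])
capped = amounts.clip(lower=lo, upper=hi)
skew_capped = capped.skew()

print(f"raw: {skew_raw:.2f}, log1p: {skew_log:.2f}, capped: {skew_capped:.2f}")
```

The log transform changes the shape of the whole distribution, whereas capping only limits the extremes, so the two options are not interchangeable.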

### 3.3 Categorical Features Distribution Analysis
Horizontal bar charts of category counts for Checking Account Status, Purpose, Savings Account, and Housing.

In [None]:
categorical_features_viz = [
    'Status_Checking_Account',
    'Purpose',
    'Savings_Account',
    'Housing'
]

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.flatten()

for idx, feature in enumerate(categorical_features_viz):
    counts = df[feature].value_counts().sort_values(ascending=True)
    colors = plt.cm.Set3(np.linspace(0, 1, len(counts)))

    axes[idx].barh(
        counts.index.astype(str),
        counts.values,
        color=colors,
        edgecolor='black',
        linewidth=1.2
    )

    axes[idx].set_xlabel('Count', fontsize=11, fontweight='bold')
    axes[idx].set_title(f'Distribution of {feature}', fontsize=12, fontweight='bold')
    axes[idx].grid(axis='x', alpha=0.3)

    # Add value labels
    for i, v in enumerate(counts.values):
        axes[idx].text(v + 2, i, str(v), va='center', fontweight='bold')

plt.tight_layout()
plt.show()
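
Category counts show how common each level is but not how it relates to the target. A possible complement, sketched here on a made-up toy frame, is a row-normalized cross-tabulation that gives the share of each target class within every category.

```python
import pandas as pd

# Toy data (made up); column names mirror the notebook's naming
toy = pd.DataFrame({
    "Housing": [1, 1, 2, 2, 2, 3],
    "Credit_Risk": [0, 1, 0, 0, 1, 1],
})

# normalize="index" turns each row into shares that sum to 1
rates = pd.crosstab(toy["Housing"], toy["Credit_Risk"], normalize="index")
print(rates.round(3))
```

Applied to the real data, `pd.crosstab(df[feature], df['Credit_Risk'], normalize='index')` would yield one such table per categorical feature.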

### 3.4 Correlation and Multicollinearity Check
Heatmap for numerical features to reveal potential multicollinearity issues.

In [None]:
# Correlation Heatmap - Multicollinearity Check
numerical_cols = ['Duration_Months', 'Credit_Amount', 'Age', 'Residence_Years', 
                  'Installment_Rate', 'Existing_Credits', 'Dependents']

correlation_matrix = df[numerical_cols].corr()

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=1.5, cbar_kws={'label': 'Correlation'},
            vmin=-1, vmax=1, ax=ax)
ax.set_title('Correlation Matrix - Numerical Features\n(Multicollinearity Check)', 
             fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Identify high correlations
print("\n✓ MULTICOLLINEARITY ANALYSIS:")
print("\nHigh Correlations (|r| > 0.7):")
high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.7:
            high_corr_pairs.append((correlation_matrix.columns[i], 
                                   correlation_matrix.columns[j], 
                                   correlation_matrix.iloc[i, j]))
            print(f"  {correlation_matrix.columns[i]} <-> {correlation_matrix.columns[j]}: {correlation_matrix.iloc[i, j]:.4f}")

if not high_corr_pairs:
    print("  No high correlations detected (good for model stability)")
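
Pairwise correlations can miss multicollinearity that involves three or more variables. The variance inflation factor (VIF) covers that case, and for standardized predictors it equals the corresponding diagonal entry of the inverse correlation matrix. The sketch below is illustrative only: `vif_from_corr` is a hypothetical helper, and the data are synthetic, built so that no pair exceeds |r| = 0.7 while one column is almost a linear combination of the others.

```python
import numpy as np
import pandas as pd

def vif_from_corr(frame: pd.DataFrame) -> pd.Series:
    """VIF per column, read off the diagonal of the inverse correlation matrix."""
    corr = frame.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=frame.columns, name="VIF")

# Synthetic data: c is nearly a + b + d, yet each pairwise correlation stays moderate
rng = np.random.default_rng(0)
n = 2000
a, b, d = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
c = a + b + d + rng.normal(scale=0.1, size=n)
toy = pd.DataFrame({"a": a, "b": b, "d": d, "c": c})

vif = vif_from_corr(toy)
print(toy.corr().round(2))
print(vif.round(1))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a warning sign, although the exact cutoff is a convention rather than a fixed rule.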

### 3.5 Predictive Power Analysis
Box plots showing distribution of numerical features stratified by target classes (Good vs Bad Risk).

In [None]:
# Predictive Power Analysis - Box Plots by Target Class
features_for_boxplot = ['Credit_Amount', 'Duration_Months', 'Age', 'Installment_Rate']

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.flatten()

for idx, feature in enumerate(features_for_boxplot):
    # Split the feature by target class (0 = Good Risk, 1 = Bad Risk)
    good_values = df.loc[df['Credit_Risk'] == 0, feature]
    bad_values = df.loc[df['Credit_Risk'] == 1, feature]

    bp = axes[idx].boxplot([good_values, bad_values],
                            labels=['Good Risk', 'Bad Risk'],
                            patch_artist=True,
                            widths=0.6,
                            showmeans=True,
                            meanprops=dict(marker='D', markerfacecolor='red', markersize=8))
    
    # Color the boxes
    colors = ['#2ecc71', '#e74c3c']
    for patch, color in zip(bp['boxes'], colors):
        patch.set_facecolor(color)
        patch.set_alpha(0.6)
    
    axes[idx].set_ylabel(feature, fontsize=11, fontweight='bold')
    axes[idx].set_title(f'{feature} Distribution by Credit Risk', fontsize=12, fontweight='bold')
    axes[idx].grid(alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n✓ PREDICTIVE POWER ANALYSIS (Mean Values):")
for feature in features_for_boxplot:
    good_mean = df[df['Credit_Risk'] == 0][feature].mean()
    bad_mean = df[df['Credit_Risk'] == 1][feature].mean()
    difference = abs(good_mean - bad_mean)
    print(f"\n{feature}:")
    print(f"  Good Risk Mean: {good_mean:.2f}")
    print(f"  Bad Risk Mean: {bad_mean:.2f}")
    print(f"  Difference: {difference:.2f} (Discriminative Power: {'HIGH' if difference > df[feature].std()*0.5 else 'MODERATE'})")
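
The mean comparison above scales the difference by the overall standard deviation. A closely related and widely used measure is Cohen's d, which divides the mean difference by the pooled within-class standard deviation. The block below is a small sketch with made-up numbers; `cohens_d` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def cohens_d(x0: pd.Series, x1: pd.Series) -> float:
    """Standardized mean difference (x1 minus x0) using the pooled standard deviation."""
    n0, n1 = len(x0), len(x1)
    pooled_var = ((n0 - 1) * x0.var(ddof=1) + (n1 - 1) * x1.var(ddof=1)) / (n0 + n1 - 2)
    return float((x1.mean() - x0.mean()) / np.sqrt(pooled_var))

# Toy durations per class (values are made up)
good = pd.Series([10, 12, 14, 16, 18])
bad = pd.Series([20, 22, 24, 26, 28])

d = cohens_d(good, bad)
print(f"Cohen's d: {d:.3f}")
```

For a feature in the real data, the two inputs would be `df.loc[df['Credit_Risk'] == 0, feature]` and `df.loc[df['Credit_Risk'] == 1, feature]`.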

### 3.6 Outlier Identification and Analysis
Box plots for all numeric variables to identify outliers and assess their impact.