<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Final%20DNN%20Code%20Examples/German%20Credit%20Data/German%20Credit%20Data%20-%20SMOTE%20Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# German Credit Data - Imbalanced Classification with Lift Analysis

This notebook demonstrates the **Universal ML Workflow** applied to credit risk classification, with special focus on **Lift Curves** to demonstrate model value even when accuracy improvements are modest.

## Learning Objectives

By the end of this notebook, you will be able to:
- Apply neural networks to **imbalanced binary classification**
- Understand why **accuracy alone can be misleading** for imbalanced datasets
- Use **Lift Curves** and **Cumulative Gains** to demonstrate model value
- Handle class imbalance using **SMOTE** (Synthetic Minority Over-sampling)
- Apply the Universal ML Workflow with **K-Fold Cross-Validation**

---

## Why Lift Curves Matter

In credit scoring, a model may not dramatically improve accuracy over the naive baseline (predicting all "Good"). However, the model's **ranking ability** provides significant business value:

| Metric | What It Shows | Business Value |
|--------|---------------|----------------|
| **Accuracy** | Overall correct predictions | Can be misleading with imbalance |
| **Lift** | How much better than random at each decile | Prioritise high-risk applicants |
| **Cumulative Gains** | % of positives captured at each threshold | Resource allocation |

**Example:** If the top 30% of applicants (ranked by risk score) contain 60% of the bad credits, the model provides **2x lift** - extremely valuable for credit decisions.
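
As a quick check of the arithmetic in the example above, cumulative lift is the share of bad credits captured divided by the share of applicants reviewed. The snippet below is only an illustrative sketch; `cumulative_lift` is a hypothetical helper and not part of the notebook's own code.

```python
def cumulative_lift(captured_fraction, population_fraction):
    """Cumulative lift = share of positives captured / share of population reviewed."""
    if not 0 < population_fraction <= 1:
        raise ValueError("population_fraction must be in (0, 1]")
    return captured_fraction / population_fraction


# Figures from the example: top 30% of applicants contain 60% of bad credits
lift_top_30 = cumulative_lift(captured_fraction=0.60, population_fraction=0.30)
print(f"Cumulative lift at 30% of applicants: {lift_top_30:.1f}x")
```

A random ranking would capture about 30% of bad credits in the top 30%, which corresponds to a lift of 1.0.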

---

## Dataset Overview

| Attribute | Description |
|-----------|-------------|
| **Source** | [UCI German Credit Data](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)) |
| **Problem Type** | Binary Classification (Good/Bad Credit Risk) |
| **Samples** | 1,000 applicants |
| **Class Distribution** | 70% Good (700), 30% Bad (300) |
| **Imbalance Ratio** | 2.33:1 |
| **Features** | 7 numerical + 13 categorical variables |

---

## Technique Scope

This notebook uses only techniques from **Chapters 1–4** of *Deep Learning with Python* (Chollet, 2021):

| Technique | Status | Rationale |
|-----------|--------|-----------|
| **Dense layers (DNN)** | ✓ Used | Core building block (Ch. 3-4) |
| **Dropout** | ✓ Used | Regularisation technique (Ch. 4) |
| **L2 regularisation** | ✓ Used | Weight penalty (Ch. 4) |
| **SMOTE** | ✓ Used | Handles class imbalance |
| **Early stopping** | ✗ Not used | Introduced in Ch. 7 |

---

## 1. Defining the Problem and Assembling a Dataset

**Problem Statement:** Classify credit applicants as good or bad credit risks based on their financial and personal attributes.

**Business Context:**
- **False Negative (FN):** Approving a bad credit → Financial loss from default
- **False Positive (FP):** Rejecting a good credit → Lost business opportunity
- **Cost asymmetry:** FN is typically 5-10x more costly than FP
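
To make the cost asymmetry concrete, the sketch below compares two hypothetical models by total misclassification cost. The 5:1 cost ratio is an assumption taken from the lower end of the range above, and the error counts are made up purely for illustration.

```python
def total_misclassification_cost(n_false_negatives, n_false_positives,
                                 fn_cost=5.0, fp_cost=1.0):
    """Total cost in arbitrary units; the default 5:1 ratio is an assumption."""
    return n_false_negatives * fn_cost + n_false_positives * fp_cost


# Made-up error counts for two hypothetical models scored on the same applicants
cost_lenient = total_misclassification_cost(n_false_negatives=30, n_false_positives=10)
cost_strict = total_misclassification_cost(n_false_negatives=15, n_false_positives=40)

print(f"Lenient model: 40 errors, cost = {cost_lenient:.0f} units")
print(f"Strict model:  55 errors, cost = {cost_strict:.0f} units")
```

Under these assumed numbers the stricter model makes more errors overall yet costs less, which is why raw accuracy is a poor guide when the two error types are priced differently.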

**Key Insight:** Even if overall accuracy is similar to baseline, a model that **ranks bad credits higher** provides significant value through prioritisation.

## 2. Choosing a Measure of Success

### Data-Driven Metric Selection

| Criterion | This Dataset | Decision |
|-----------|--------------|----------|
| **Class Balance** | 70% Good, 30% Bad | Imbalanced (2.33:1) |
| **Imbalance Ratio** | 2.33:1 | Below 3:1 threshold |
| **Primary Metric** | **AUC** | Standard for credit scoring |
| **Secondary Metrics** | Lift, Accuracy, Gini | Business interpretation |

### Why AUC (not F1-Score) for Credit Scoring?

| Metric | Why AUC is Preferred |
|--------|---------------------|
| **Threshold-independent** | F1 depends on a specific threshold; AUC evaluates across ALL thresholds |
| **Measures ranking** | Credit scoring is fundamentally a ranking problem - who is riskier? |
| **Probability interpretation** | AUC = P(random bad credit ranked higher than random good credit) |
| **Gini relationship** | Gini = 2×AUC - 1; Gini is the industry-standard metric for scorecards |

### Metric Relationships

```
AUC = 0.50  →  Gini = 0.00  →  Random model (no discrimination)
AUC = 0.70  →  Gini = 0.40  →  Acceptable scorecard
AUC = 0.80  →  Gini = 0.60  →  Good scorecard
AUC = 0.90  →  Gini = 0.80  →  Excellent scorecard
```
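
The probability interpretation of AUC and its link to Gini can be checked by brute force on a toy example. This is a minimal sketch with invented scores; `pairwise_auc` and `gini_from_auc` are hypothetical helper names, and in practice `roc_auc_score` from scikit-learn would be used instead.

```python
from itertools import product


def pairwise_auc(y_true, scores):
    """AUC as P(random bad credit scores higher than random good credit).

    Ties count as half a win, which matches the usual AUC convention.
    """
    bad = [s for y, s in zip(y_true, scores) if y == 1]
    good = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if b > g else 0.5 if b == g else 0.0
               for b, g in product(bad, good))
    return wins / (len(bad) * len(good))


def gini_from_auc(auc):
    """Gini coefficient from AUC: Gini = 2 * AUC - 1."""
    return 2 * auc - 1


# Toy data (invented): 1 = bad credit, 0 = good credit
y_toy = [1, 1, 0, 1, 0, 0]
risk_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

toy_auc = pairwise_auc(y_toy, risk_scores)
toy_gini = gini_from_auc(toy_auc)
print(f"AUC = {toy_auc:.3f}, Gini = {toy_gini:.3f}")
```

Here 8 of the 9 (bad, good) pairs are ordered correctly, so the AUC is 8/9.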

### Lift Curve Interpretation

| Lift Value | Meaning |
|------------|---------|
| **Lift = 1.0** | No better than random |
| **Lift = 2.0** | 2x better than random at that decile |
| **Lift = 3.0** | 3x better than random at that decile |

**Key Insight:** Even with moderate AUC improvement over baseline, the **Lift Curve** shows the model's practical value for prioritising high-risk applicants.

## 3. Deciding on an Evaluation Protocol

### Data-Driven Protocol Selection

| Criterion | This Dataset | Decision |
|-----------|--------------|----------|
| **Sample Size** | 1,000 samples | Below 10,000 threshold |
| **Recommendation** | K-Fold Cross-Validation | More robust estimates |
| **K Value** | 5 folds | ~180 samples per fold (900-sample training pool) |

**Critical Point for SMOTE:**
- **Train** on SMOTE-balanced data (synthetic minority samples added)
- **Validate/Test** on original imbalanced data (reflects real-world distribution)

```
Original Data (1000 samples: 700 Good, 300 Bad)
├── Test Set (10% = 100 samples) - Original distribution
└── Training Pool (90% = 900 samples)
    └── 5-Fold Stratified Cross-Validation
        ├── Each fold: Apply SMOTE to training portion only
        └── Validate on original imbalanced fold
```
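
The protocol in the diagram can be sketched as a loop in which resampling happens inside each fold, on the training portion only. This is a hedged illustration rather than the notebook's own procedure: it runs on synthetic data, uses `LogisticRegression` as a lightweight stand-in for the neural network, and uses plain random oversampling (`random_oversample`, a hypothetical helper) as a stand-in for SMOTE.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def random_oversample(X, y, seed=0):
    """Stand-in for SMOTE: duplicate minority rows until the classes are balanced."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    extra_idx = rng.choice(np.flatnonzero(y == minority), size=n_extra, replace=True)
    return np.vstack([X, X[extra_idx]]), np.concatenate([y, y[extra_idx]])


def cross_validate_with_resampling(X, y, resample_fn, n_splits=5, seed=0):
    """Resample the training portion of each fold; score on the untouched fold."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_aucs = []
    for train_idx, val_idx in skf.split(X, y):
        # Resampling sees only the training indices, never the validation fold
        X_res, y_res = resample_fn(X[train_idx], y[train_idx])
        model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
        val_scores = model.predict_proba(X[val_idx])[:, 1]
        fold_aucs.append(roc_auc_score(y[val_idx], val_scores))
    return np.array(fold_aucs)


# Synthetic data with roughly 30% positives (not the German Credit data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(900, 5))
y_demo = (X_demo[:, 0] + rng.normal(size=900) > 0.75).astype(int)

fold_aucs = cross_validate_with_resampling(X_demo, y_demo, random_oversample)
print(f"AUC per fold: {np.round(fold_aucs, 3)}")
print(f"Mean AUC: {fold_aucs.mean():.3f} +/- {fold_aucs.std():.3f}")
```

With imbalanced-learn installed, passing `lambda X, y: SMOTE(random_state=0).fit_resample(X, y)` as `resample_fn` should give the SMOTE variant of the same loop.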

## 4. Preparing Your Data

### 4.1 Import Libraries

In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# SMOTE for handling class imbalance
%pip install -q imbalanced-learn
from imblearn.over_sampling import SMOTE

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Keras Tuner for hyperparameter search
%pip install -q -U keras-tuner
import keras_tuner as kt

import matplotlib.pyplot as plt

# ============================================================
# RANDOM SEED - Set once, use everywhere
# ============================================================
SEED = 204

tf.random.set_seed(SEED)
np.random.seed(SEED)

import warnings
warnings.filterwarnings('ignore')

In [None]:
# ============================================================
# LOAD DATASET
# ============================================================
FILE_PATH = 'http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data'

HEADERS = ['checking_account', 'duration', 'credit_history', 'purpose', 'credit_amount',
           'savings_account', 'employment', 'installment_rate', 'personal_status',
           'other_debtors', 'residence_since', 'property', 'age', 'other_installments',
           'housing', 'existing_credits', 'job', 'num_dependents', 'telephone', 
           'foreign_worker', 'credit_risk']

df = pd.read_csv(FILE_PATH, sep=" ", header=None, names=HEADERS)

print(f"Dataset shape: {df.shape}")
df.head()

In [None]:
# ============================================================
# DEFINE FEATURE TYPES
# ============================================================
NUMERICAL_VARIABLES = ['duration', 'credit_amount', 'installment_rate',
                       'residence_since', 'age', 'existing_credits', 'num_dependents']

CATEGORICAL_VARIABLES = ['checking_account', 'credit_history', 'purpose',
                         'savings_account', 'employment', 'personal_status',
                         'other_debtors', 'property', 'other_installments',
                         'housing', 'job', 'telephone', 'foreign_worker']

TARGET_VARIABLE = 'credit_risk'

print(f"Numerical features: {len(NUMERICAL_VARIABLES)}")
print(f"Categorical features: {len(CATEGORICAL_VARIABLES)}")
print(f"Total features: {len(NUMERICAL_VARIABLES) + len(CATEGORICAL_VARIABLES)}")

In [None]:
# ============================================================
# DATA-DRIVEN ANALYSIS: Class Balance and Dataset Size
# ============================================================
features = df[NUMERICAL_VARIABLES + CATEGORICAL_VARIABLES]
target = df[TARGET_VARIABLE]

# Target: 1 = Good, 2 = Bad (in original data)
# Convert to: 0 = Good, 1 = Bad (for binary classification)
target_binary = (target == 2).astype(int)

# Class distribution
n_good = (target_binary == 0).sum()
n_bad = (target_binary == 1).sum()
imbalance_ratio = n_good / n_bad

# Dataset size
n_samples = len(df)
HOLDOUT_THRESHOLD = 10000

print("=" * 60)
print("DATA-DRIVEN CONFIGURATION")
print("=" * 60)
print(f"\n1. DATASET SIZE: {n_samples:,} samples")
print(f"   Threshold: {HOLDOUT_THRESHOLD:,} samples")
print(f"   Decision: {'Hold-Out' if n_samples > HOLDOUT_THRESHOLD else 'K-Fold Cross-Validation'}")

print(f"\n2. CLASS DISTRIBUTION:")
print(f"   Good Credit (0): {n_good} ({100*n_good/n_samples:.1f}%)")
print(f"   Bad Credit (1):  {n_bad} ({100*n_bad/n_samples:.1f}%)")
print(f"   Imbalance Ratio: {imbalance_ratio:.2f}:1")

print(f"\n3. IMBALANCE HANDLING:")
print(f"   Ratio < 3:1, but using SMOTE to improve minority class learning")

print("\n" + "=" * 60)
print("PRIMARY METRIC: AUC (threshold-independent ranking)")
print("VALIDATION: 5-Fold Stratified Cross-Validation")
print("=" * 60)

### 4.2 Train/Test Split

In [None]:
# ============================================================
# TRAIN/TEST SPLIT (90%/10%)
# ============================================================
TEST_SIZE = 0.10

X_train_raw, X_test_raw, y_train_full, y_test = train_test_split(
    features, target_binary,
    test_size=TEST_SIZE,
    stratify=target_binary,
    random_state=SEED,
    shuffle=True
)

print(f"Training pool: {len(X_train_raw):,} samples")
print(f"Test set: {len(X_test_raw):,} samples")

### 4.3 Preprocessing with ColumnTransformer

In [None]:
# ============================================================
# PREPROCESSING PIPELINE
# ============================================================
preprocessor = ColumnTransformer([
    ('one-hot-encoder', OneHotEncoder(handle_unknown="ignore"), CATEGORICAL_VARIABLES),
    ('standard_scaler', StandardScaler(), NUMERICAL_VARIABLES)
])

# Fit on training data only (prevent data leakage)
preprocessor.fit(X_train_raw)

# Transform both sets
X_train_full = preprocessor.transform(X_train_raw)
X_test = preprocessor.transform(X_test_raw)

# Convert to numpy arrays
y_train_full = y_train_full.values
y_test = y_test.values

print(f"Preprocessed training shape: {X_train_full.shape}")
print(f"Preprocessed test shape: {X_test.shape}")

### 4.4 K-Fold Cross-Validation Setup

In [None]:
# ============================================================
# K-FOLD CROSS-VALIDATION SETUP
# ============================================================
N_FOLDS = 5

skfold = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

print(f"K-Fold Configuration:")
print(f"  Number of folds: {N_FOLDS}")
print(f"  Training pool: {X_train_full.shape[0]:,} samples")
print(f"  Samples per fold: ~{X_train_full.shape[0] // N_FOLDS:,}")
print(f"  Test set (held out): {X_test.shape[0]:,} samples")

# For initial model development, we use the first fold
first_fold = list(skfold.split(X_train_full, y_train_full))[0]
train_idx, val_idx = first_fold

X_train = X_train_full[train_idx]
X_val = X_train_full[val_idx]
y_train = y_train_full[train_idx]
y_val = y_train_full[val_idx]

print(f"\nFirst fold (for initial development):")
print(f"  Training: {X_train.shape[0]:,} samples")
print(f"  Validation: {X_val.shape[0]:,} samples")

### 4.5 Apply SMOTE to Balance Training Data

**How SMOTE Works:**
1. For each minority sample, find its k nearest neighbors
2. Randomly select one neighbor
3. Create a synthetic sample on the line between original and neighbor
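
The three steps above can be sketched in NumPy as follows. This is a simplified illustration of the interpolation idea only, assuming purely numerical features and Euclidean distance; `smote_like_samples` is a hypothetical helper, and the notebook itself relies on `imblearn.over_sampling.SMOTE`.

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Generate synthetic minority samples by interpolating towards neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Step 1: k nearest neighbours of every minority sample (excluding itself)
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Step 2: pick a base sample and one of its neighbours at random
    base_idx = rng.integers(0, n, size=n_new)
    neighbour_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]

    # Step 3: place the new point somewhere on the segment between the two
    gaps = rng.random((n_new, 1))
    base, neighbour = X_minority[base_idx], X_minority[neighbour_idx]
    return base + gaps * (neighbour - base)


# Toy minority class in two dimensions (invented values)
X_minority_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                            [1.0, 1.0], [0.5, 0.5], [2.0, 2.0]])
synthetic_demo = smote_like_samples(X_minority_demo, n_new=10, k=3, seed=42)
print(synthetic_demo.shape)
```

Because every synthetic point is a convex combination of two real minority samples, it always falls inside the range already spanned by the minority class.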

**Critical:** Only apply SMOTE to training data, never to validation or test data!

In [None]:
# ============================================================
# APPLY SMOTE TO TRAINING DATA ONLY
# ============================================================
smote = SMOTE(sampling_strategy='auto', random_state=SEED)

# Apply SMOTE to training fold only (NOT validation)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print("Before SMOTE:")
print(f"  Good (0): {(y_train == 0).sum()}, Bad (1): {(y_train == 1).sum()}")

print("\nAfter SMOTE:")
print(f"  Good (0): {(y_train_smote == 0).sum()}, Bad (1): {(y_train_smote == 1).sum()}")

print("\nValidation set (unchanged - original distribution):")
print(f"  Good (0): {(y_val == 0).sum()}, Bad (1): {(y_val == 1).sum()}")

In [None]:
# ============================================================
# LIFT CURVE AND CUMULATIVE GAINS FUNCTIONS
# ============================================================
def calculate_lift_curve(y_true, y_pred_proba, n_bins=10):
    """
    Calculate lift curve data.
    
    Parameters:
    -----------
    y_true : array-like
        True binary labels (0/1)
    y_pred_proba : array-like
        Predicted probabilities for positive class
    n_bins : int
        Number of deciles/bins
    
    Returns:
    --------
    dict : Contains deciles, lift values, and cumulative gains
    """
    # Create dataframe for sorting
    data = pd.DataFrame({
        'y_true': y_true,
        'y_pred': y_pred_proba.flatten()
    })
    
    # Sort by predicted probability (descending - highest risk first)
    data = data.sort_values('y_pred', ascending=False).reset_index(drop=True)
    
    # Calculate metrics for each decile
    total_positives = data['y_true'].sum()
    total_samples = len(data)
    base_rate = total_positives / total_samples
    
    deciles = []
    lifts = []
    cum_gains = []
    cum_positives = 0
    
    bin_size = total_samples // n_bins
    
    for i in range(n_bins):
        start_idx = i * bin_size
        end_idx = (i + 1) * bin_size if i < n_bins - 1 else total_samples
        
        bin_data = data.iloc[start_idx:end_idx]
        bin_positives = bin_data['y_true'].sum()
        bin_rate = bin_positives / len(bin_data)
        
        # Lift = bin_rate / base_rate
        lift = bin_rate / base_rate
        
        # Cumulative gains
        cum_positives += bin_positives
        cum_gain = cum_positives / total_positives
        
        deciles.append(i + 1)
        lifts.append(lift)
        cum_gains.append(cum_gain)
    
    return {
        'deciles': deciles,
        'lifts': lifts,
        'cum_gains': cum_gains,
        'base_rate': base_rate,
        'total_positives': total_positives
    }


def plot_lift_analysis(y_true, y_pred_proba, title='Model'):
    """
    Plot Lift Curve and Cumulative Gains side by side.
    """
    results = calculate_lift_curve(y_true, y_pred_proba)
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot 1: Lift Curve
    ax1 = axes[0]
    ax1.bar(results['deciles'], results['lifts'], color='steelblue', edgecolor='black')
    ax1.axhline(y=1.0, color='red', linestyle='--', linewidth=2, label='Random (Lift=1)')
    ax1.set_xlabel('Decile (1=Highest Risk)')
    ax1.set_ylabel('Lift')
    ax1.set_title(f'Lift Curve - {title}')
    ax1.set_xticks(results['deciles'])
    ax1.legend()
    ax1.grid(axis='y', alpha=0.3)
    
    # Plot 2: Cumulative Gains
    ax2 = axes[1]
    decile_pct = [d * 10 for d in results['deciles']]
    cum_gains_pct = [g * 100 for g in results['cum_gains']]
    
    ax2.plot(decile_pct, cum_gains_pct, 'b-o', linewidth=2, markersize=8, label='Model')
    ax2.plot([0, 100], [0, 100], 'r--', linewidth=2, label='Random')
    ax2.fill_between(decile_pct, cum_gains_pct, [d for d in decile_pct], alpha=0.3)
    ax2.set_xlabel('% of Population (Sorted by Risk Score)')
    ax2.set_ylabel('% of Bad Credits Captured')
    ax2.set_title(f'Cumulative Gains Curve - {title}')
    ax2.legend()
    ax2.grid(alpha=0.3)
    ax2.set_xlim([0, 100])
    ax2.set_ylim([0, 100])
    
    plt.tight_layout()
    plt.show()
    
    # Print summary
    print("\nLift Analysis Summary:")
    print("-" * 50)
    print(f"{'Decile':<10} {'Lift':>10} {'Cum Gain':>15}")
    print("-" * 50)
    for i, (d, l, g) in enumerate(zip(results['deciles'], results['lifts'], results['cum_gains'])):
        print(f"{d:<10} {l:>10.2f} {g*100:>14.1f}%")
    print("-" * 50)
    
    return results

In [None]:
# ============================================================
# HELPER FUNCTION: Plot Training History
# ============================================================
def plot_training_history(history, title='Training History'):
    """Plot training and validation loss/AUC curves."""
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Loss
    axes[0].plot(history.history['loss'], 'b-', label='Training Loss')
    axes[0].plot(history.history['val_loss'], 'r-', label='Validation Loss')
    axes[0].set_title('Training and Validation Loss')
    axes[0].set_xlabel('Epoch')
    axes[0].set_ylabel('Loss')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # AUC
    axes[1].plot(history.history['auc'], 'b-', label='Training AUC')
    axes[1].plot(history.history['val_auc'], 'r-', label='Validation AUC')
    axes[1].set_title('Training and Validation AUC')
    axes[1].set_xlabel('Epoch')
    axes[1].set_ylabel('AUC')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.suptitle(title, fontsize=14)
    plt.tight_layout()
    plt.show()

## 5. Developing a Model That Does Better Than a Baseline

**Baseline Metrics:**
- **Accuracy baseline:** 70% (predict all "Good")
- **AUC baseline:** 0.50 (random ranking)
- **Lift baseline:** 1.0 (no better than random at any decile)

**Goal:** Beat the AUC baseline and demonstrate positive lift in top deciles.

In [None]:
# ============================================================
# MODEL CONFIGURATION
# ============================================================
INPUT_DIMENSION = X_train_full.shape[1]
OUTPUT_DIMENSION = 1  # Binary classification

OPTIMIZER = 'adam'
LOSS_FUNC = 'binary_crossentropy'
METRICS = ['accuracy', tf.keras.metrics.AUC(name='auc')]

# Training configuration
BATCH_SIZE = 32  # Smaller batch for small dataset
EPOCHS_BASELINE = 100
EPOCHS_REGULARIZED = 150

print(f"Input dimension: {INPUT_DIMENSION}")
print(f"Output dimension: {OUTPUT_DIMENSION}")
print(f"Batch size: {BATCH_SIZE}")

In [None]:
# ============================================================
# ESTABLISH BASELINES
# ============================================================
# Accuracy baseline: predict all "Good" (majority class)
accuracy_baseline = n_good / n_samples

# AUC baseline: random ranking
auc_baseline = 0.5

# Lift baseline: 1.0 (no better than random)
lift_baseline = 1.0

print("Baseline Metrics:")
print(f"  Accuracy (predict all Good): {accuracy_baseline:.2%}")
print(f"  AUC (random ranking): {auc_baseline:.2f}")
print(f"  Lift (random): {lift_baseline:.1f}")

In [None]:
# ============================================================
# SINGLE LAYER PERCEPTRON (SLP) - Simplest model
# ============================================================
slp_model = Sequential(name='Single_Layer_Perceptron')
slp_model.add(layers.Input(shape=(INPUT_DIMENSION,)))
slp_model.add(Dense(OUTPUT_DIMENSION, activation='sigmoid'))
slp_model.compile(optimizer=OPTIMIZER, loss=LOSS_FUNC, metrics=METRICS)

slp_model.summary()

In [None]:
# Train SLP on SMOTE-balanced data
slp_history = slp_model.fit(
    X_train_smote, y_train_smote,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS_BASELINE,
    validation_data=(X_val, y_val),  # Validate on original distribution
    verbose=0
)

# Evaluate on original (imbalanced) validation set
slp_preds = slp_model.predict(X_val, verbose=0)
slp_auc = roc_auc_score(y_val, slp_preds)
slp_acc = accuracy_score(y_val, (slp_preds > 0.5).astype(int))

print(f"SLP Results (Original Validation Distribution):")
print(f"  Accuracy: {slp_acc:.2%} (baseline: {accuracy_baseline:.2%})")
print(f"  AUC: {slp_auc:.4f} (baseline: {auc_baseline:.2f})")
print(f"  Gini: {2*slp_auc - 1:.4f}")

In [None]:
# Plot SLP training history
plot_training_history(slp_history, 'Single Layer Perceptron')

## 6. Scaling Up: Developing a Model That Overfits

Adding a hidden layer to learn more complex patterns in credit risk data.

In [None]:
# ============================================================
# MULTI-LAYER PERCEPTRON (MLP) - Standard architecture
# ============================================================
HIDDEN_NEURONS = 64

mlp_model = Sequential(name='Multi_Layer_Perceptron')
mlp_model.add(layers.Input(shape=(INPUT_DIMENSION,)))
mlp_model.add(Dense(HIDDEN_NEURONS, activation='relu'))
mlp_model.add(Dense(OUTPUT_DIMENSION, activation='sigmoid'))
mlp_model.compile(optimizer=OPTIMIZER, loss=LOSS_FUNC, metrics=METRICS)

mlp_model.summary()

In [None]:
# Train MLP on SMOTE-balanced data
mlp_history = mlp_model.fit(
    X_train_smote, y_train_smote,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS_BASELINE,
    validation_data=(X_val, y_val),
    verbose=0
)

# Evaluate on original validation set
mlp_preds = mlp_model.predict(X_val, verbose=0)
mlp_auc = roc_auc_score(y_val, mlp_preds)
mlp_acc = accuracy_score(y_val, (mlp_preds > 0.5).astype(int))

print(f"MLP Results (Original Validation Distribution):")
print(f"  Accuracy: {mlp_acc:.2%} (baseline: {accuracy_baseline:.2%})")
print(f"  AUC: {mlp_auc:.4f} (baseline: {auc_baseline:.2f})")
print(f"  Gini: {2*mlp_auc - 1:.4f}")

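In [None]:
# Balanced accuracy of the MLP on the original (imbalanced) validation fold
from sklearn.metrics import balanced_accuracy_score

mlp_bal_acc = balanced_accuracy_score(y_val, (mlp_preds > 0.5).astype('int32').ravel())
print('Balanced Accuracy (Validation): {:.2f} (baseline = 0.5)'.format(mlp_bal_acc))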

Balanced Accuracy (Validation): 0.70 (baseline = 0.5)


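In [None]:
# ============================================================
# HELPER FUNCTION: Plot Training History (configurable monitors)
# ============================================================
def plot_training_history(history, title=None, monitors=('loss', 'auc')):
    """Plot training vs. validation curves for each monitored metric.

    Redefines the earlier helper with a compatible signature. Each entry in
    `monitors` must be a key of `history.history` (for example 'loss', 'auc').
    """
    fig, axs = plt.subplots(1, len(monitors), sharex='all', figsize=(15, 5), squeeze=False)

    for ax, monitor in zip(axs.flat, monitors):
        train_values = history.history[monitor]
        val_values = history.history['val_' + monitor]
        label = monitor.upper() if monitor == 'auc' else monitor.capitalize()
        epochs = range(1, len(train_values) + 1)

        ax.plot(epochs, train_values, 'b.', label='Training ' + label)
        ax.plot(epochs, val_values, 'r.', label='Validation ' + label)
        ax.set_xlim([0, len(train_values)])
        ax.set_title('Training and Validation ' + label)
        ax.set_xlabel('Epochs')
        ax.set_ylabel(label)
        ax.legend()
        ax.grid()

    if title:
        fig.suptitle(title, fontsize=14)
    plt.tight_layout()
    plt.show()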

In [None]:
plot_training_history(mlp_history, monitors=['loss', 'auc'])

Testing whether a deeper model can capture more complex patterns in the SMOTE-balanced data.

In [None]:
# Deeper model to test for overfitting (3 hidden layers, no dropout)
mlp_model = Sequential(name='Multi_Layer_Perceptron')
mlp_model.add(layers.Input(shape=(INPUT_DIMENSION,)))
mlp_model.add(Dense(8, activation='relu'))
mlp_model.add(Dense(8, activation='relu'))
mlp_model.add(Dense(8, activation='relu'))
mlp_model.add(Dense(OUTPUT_DIMENSION, activation='sigmoid'))
mlp_model.compile(optimizer=OPTIMIZER, loss=LOSS_FUNC, metrics=METRICS)

mlp_model.summary()

# Train on SMOTE-balanced data; validate on the original (imbalanced) fold
history_mlp = mlp_model.fit(X_train_smote, y_train_smote, batch_size=BATCH_SIZE, epochs=EPOCHS_BASELINE,
                            validation_data=(X_val, y_val), verbose=0)
# evaluate() returns [loss, accuracy, auc]; drop the loss
val_score_mlp = mlp_model.evaluate(X_val, y_val, verbose=0)[1:]

In [None]:
plot_training_history(history_mlp, monitors=['loss', 'auc'])

In [None]:
# val_score_mlp holds [accuracy, auc], in the same order as METRICS
val_acc_mlp, val_auc_mlp = val_score_mlp
print('Accuracy (Validation, model.evaluate): {:.2f}'.format(val_acc_mlp))
print('AUC (Validation, model.evaluate): {:.2f}'.format(val_auc_mlp))

In [None]:
preds = mlp_model.predict(X_val, verbose=0)

print('Accuracy (Imbalanced Validation): {:.2f} (baseline=0.7)'.format(accuracy_score(y_val, (preds > 0.5).astype('int32'))))
print('Precision (Imbalanced Validation): {:.2f}'.format(precision_score(y_val, (preds > 0.5).astype('int32'))))
print('Recall (Imbalanced Validation): {:.2f}'.format(recall_score(y_val, (preds > 0.5).astype('int32'))))
print('AUC (Imbalanced Validation): {:.2f}'.format(roc_auc_score(y_val, preds)))
print('Balanced Accuracy (Validation): {:.2f} (baseline = 0.5)'.format(balanced_accuracy_score(y_val, (preds > 0.5).astype('int32'))))

## 7. Regularizing Your Model and Tuning Hyperparameters

Using **Hyperband** for efficient hyperparameter tuning with a frozen architecture.

### Why Hyperband?

**Hyperband** is more efficient than grid search because it:
1. Starts training many configurations for a few epochs
2. Eliminates poor performers early
3. Allocates more resources to promising configurations
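
The three steps above describe successive halving, the routine that Hyperband runs repeatedly with different starting budgets. The sketch below shows a single round of that idea under simplifying assumptions: `toy_validation_score` is a hypothetical stand-in for "train this configuration for a number of epochs and return its validation AUC", and none of this is KerasTuner's actual implementation.

```python
def successive_halving(configs, evaluate, min_epochs=1, factor=3):
    """Keep the top 1/factor of configurations each round, with a larger budget."""
    survivors = list(configs)
    epochs = min_epochs
    while len(survivors) > 1:
        # Score every surviving configuration at the current budget
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, epochs), reverse=True)
        # Drop the weakest performers, then give the rest more epochs
        survivors = ranked[:max(1, len(ranked) // factor)]
        epochs *= factor
    return survivors[0]


def toy_validation_score(config, epochs):
    """Hypothetical score: best at dropout 0.2, improving with more epochs."""
    return 0.80 - abs(config['dropout'] - 0.2) - 0.1 / epochs


candidate_configs = [{'dropout': i / 10} for i in range(9)]  # dropout 0.0 to 0.8
best_config = successive_halving(candidate_configs, toy_validation_score)
print(f"Selected configuration: {best_config}")
```

With nine candidates and `factor=3`, the search evaluates nine configurations at 1 epoch, then three at 3 epochs, so most of the budget goes to the promising ones.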

In [None]:
# Hyperband Model Builder for Binary Classification (SMOTE)
def build_model_hyperband(hp):
    """
    Build German Credit model with FROZEN architecture (2 layers: 16 -> 8 neurons).
    Only tunes regularization (Dropout) and learning rate.
    """
    model = keras.Sequential()
    model.add(layers.Input(shape=(INPUT_DIMENSION,)))

    # Fixed architecture: 2 hidden layers with 16 and 8 neurons
    # Layer 1: 16 neurons
    model.add(layers.Dense(16, activation='relu'))
    drop_0 = hp.Float('drop_0', 0.0, 0.5, step=0.1)
    model.add(layers.Dropout(drop_0))

    # Layer 2: 8 neurons
    model.add(layers.Dense(8, activation='relu'))
    drop_1 = hp.Float('drop_1', 0.0, 0.5, step=0.1)
    model.add(layers.Dropout(drop_1))

    # Output layer for binary classification
    model.add(layers.Dense(OUTPUT_DIMENSION, activation='sigmoid'))

    lr = hp.Float('lr', 1e-4, 1e-2, sampling='log')
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=lr),
        loss=LOSS_FUNC,
        metrics=METRICS
    )
    return model

In [None]:
# Configure Hyperband tuner
tuner = kt.Hyperband(
    build_model_hyperband,
    # State the direction explicitly so the tuner maximises validation AUC
    objective=kt.Objective('val_auc', direction='max'),
    max_epochs=20,
    factor=3,
    directory='german_credit_hyperband',
    project_name='german_credit_tuning'
)

# Search on SMOTE-balanced training data; validate on the original (imbalanced) fold
tuner.search(
    X_train_smote, y_train_smote,
    validation_data=(X_val, y_val),
    epochs=20,
    batch_size=BATCH_SIZE
)

In [None]:
# Get best hyperparameters and build best model
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
print("Best hyperparameters:")
print(f"  Dropout Layer 1: {best_hp.get('drop_0')}")
print(f"  Dropout Layer 2: {best_hp.get('drop_1')}")
print(f"  Learning Rate: {best_hp.get('lr')}")

opt_model = tuner.hypermodel.build(best_hp)
opt_model.summary()

In [None]:
# Train the best model on SMOTE-balanced data; validate on the original fold
history_opt = opt_model.fit(
    X_train_smote, y_train_smote,
    validation_data=(X_val, y_val),
    epochs=50,
    batch_size=BATCH_SIZE,
    verbose=1
)

trained_opt_model = {
    'model': opt_model,
    # evaluate() returns [loss, accuracy, auc]; keep the metrics only
    'val_score': opt_model.evaluate(X_val, y_val, verbose=0)[1:],
    'history': history_opt
}

In [None]:
# ============================================================
# FINAL EVALUATION WITH LIFT ANALYSIS
# ============================================================
# Get predictions from the tuned model on original (imbalanced) validation data
final_preds = opt_model.predict(X_val, verbose=0)

# Standard metrics
final_auc = roc_auc_score(y_val, final_preds)
final_gini = 2 * final_auc - 1
final_acc = accuracy_score(y_val, (final_preds > 0.5).astype(int))

print("=" * 60)
print("FINAL EVALUATION - ORIGINAL VALIDATION DISTRIBUTION")
print("=" * 60)
print(f"\n1. STANDARD METRICS:")
print(f"   Accuracy: {final_acc:.2%} (baseline: {accuracy_baseline:.2%})")
print(f"   AUC: {final_auc:.4f} (baseline: {auc_baseline:.2f})")
print(f"   Gini: {final_gini:.4f}")

print(f"\n2. LIFT ANALYSIS:")
print("   (See charts below)")
print("=" * 60)

# Plot lift analysis
lift_results = plot_lift_analysis(y_val, final_preds, title='Final Model')

# Key business insight
top_3_decile_gain = lift_results['cum_gains'][2]  # First 3 deciles (30%)
print(f"\n3. BUSINESS INSIGHT:")
print(f"   By reviewing the top 30% of applicants (ranked by risk score),")
print(f"   the model captures {top_3_decile_gain*100:.1f}% of bad credits.")
print(f"   This is {top_3_decile_gain/0.3:.1f}x better than random selection.")

---

## 8. Key Takeaways

### Decision Framework Summary

| Decision | Threshold | This Dataset | Choice | Reference |
|----------|-----------|--------------|--------|-----------|
| **Hold-Out vs K-Fold** | > 10,000 samples | 1,000 samples | **K-Fold (5 folds)** | Kohavi (1995) |
| **Primary Metric** | Credit scoring | Ranking problem | **AUC** | Industry standard |
| **Imbalance Handling** | 2.33:1 ratio | Moderate imbalance | **SMOTE** | Chawla et al. (2002) |

### Why Lift Curves Matter for Credit Scoring

1. **Accuracy can be misleading:** A model predicting all "Good" achieves 70% accuracy but provides zero business value.

2. **Lift shows ranking ability:** Even with modest AUC improvement, significant lift in top deciles demonstrates the model's value for prioritising high-risk applicants.

3. **Business interpretation:** "By reviewing the top 30% of applicants (ranked by risk score), we capture 60% of bad credits" is actionable for credit analysts.

4. **Threshold-independent:** Unlike accuracy (threshold=0.5), lift and AUC evaluate the model across all possible thresholds.

### Lessons Learned

1. **Train on SMOTE-balanced, Validate on Imbalanced:** This reflects real-world performance while teaching the model to recognize minority patterns.

2. **AUC is the Standard for Credit Scoring:** It measures ranking ability and relates directly to Gini coefficient (Gini = 2×AUC - 1).

3. **Lift Complements AUC:** While AUC gives a single number, lift curves show WHERE the model excels (typically in top deciles).

4. **K-Fold for Small Datasets:** With only 1,000 samples, K-Fold cross-validation provides more robust performance estimates.

### References

- Chawla, N.V. et al. (2002) 'SMOTE: Synthetic Minority Over-sampling Technique', *Journal of Artificial Intelligence Research*, 16, pp. 321-357.

- Chollet, F. (2021) *Deep learning with Python*. 2nd edn. Shelter Island, NY: Manning Publications.

- Kohavi, R. (1995) 'A study of cross-validation and bootstrap for accuracy estimation and model selection', *IJCAI*, 2, pp. 1137–1145.

- Siddiqi, N. (2017) *Intelligent Credit Scoring: Building and Implementing Better Credit Risk Scorecards*. 2nd edn. Wiley.

---

## Appendix: Modular Helper Functions

The following modular functions can be used for more complex experiments or reusable workflows. These functions encapsulate the model creation and training logic demonstrated inline above.

In [None]:
# =============================================================================
# Modular Helper Functions for DNN Classification
# =============================================================================

def create_binary_classifier(input_dim, hidden_layers, neurons_per_layer, 
                             dropout_rate=None, activation='relu',
                             optimizer='rmsprop', loss='binary_crossentropy', 
                             metrics=None, name=None):
    """
    Create a binary classification neural network with customizable architecture.
    
    Parameters:
    -----------
    input_dim : int
        Number of input features
    hidden_layers : int
        Number of hidden layers
    neurons_per_layer : int or list
        Neurons per hidden layer (int for uniform, list for varying)
    dropout_rate : float, optional
        Dropout rate after each hidden layer (None = no dropout)
    activation : str
        Activation function for hidden layers
    optimizer : str
        Optimizer for model compilation
    loss : str
        Loss function for model compilation
    metrics : list
        Metrics to track during training
    name : str, optional
        Model name
    
    Returns:
    --------
    keras.Sequential : Compiled model
    
    Example:
    --------
    # model = create_binary_classifier(
    #     input_dim=61, hidden_layers=2, neurons_per_layer=16,
    #     dropout_rate=0.25, metrics=['accuracy']
    # )
    """
    model = Sequential(name=name)
    
    # Handle uniform vs varying neurons per layer
    if isinstance(neurons_per_layer, int):
        neurons_list = [neurons_per_layer] * hidden_layers
    else:
        neurons_list = neurons_per_layer
    
    # Declare the input shape once, consistent with the models built above
    model.add(layers.Input(shape=(input_dim,)))

    for neurons in neurons_list:
        model.add(Dense(neurons, activation=activation))
        if dropout_rate is not None:
            model.add(Dropout(dropout_rate))
    
    # Output layer for binary classification
    model.add(Dense(1, activation='sigmoid'))
    
    model.compile(optimizer=optimizer, loss=loss, metrics=metrics or ['accuracy'])
    return model


def train_model(model, X_train, y_train, X_val, y_val, 
                batch_size=32, epochs=100, verbose=0):
    """
    Train a model and return training history and validation scores.
    
    Note: This function does NOT use class_weights - use with SMOTE-balanced data.
    
    Parameters:
    -----------
    model : keras.Model
        Compiled Keras model
    X_train, y_train : array-like
        Training data (should be SMOTE-balanced for imbalanced problems)
    X_val, y_val : array-like
        Validation data (keep original imbalanced distribution)
    batch_size : int
        Training batch size
    epochs : int
        Number of training epochs
    verbose : int
        Verbosity level (0=silent, 1=progress bar, 2=one line per epoch)
    
    Returns:
    --------
    tuple : (history, val_scores)
        - history: Training history object
        - val_scores: Validation metric scores (excluding loss)
    
    Example:
    --------
    # history, val_scores = train_model(
    #     model, X_train_smote, y_train_smote, X_val, y_val,
    #     batch_size=32, epochs=100
    # )
    """
    history = model.fit(
        X_train, y_train,
        batch_size=batch_size,
        epochs=epochs,
        validation_data=(X_val, y_val),
        verbose=verbose
    )
    val_scores = model.evaluate(X_val, y_val, verbose=0)[1:]
    return history, val_scores