# Enhanced Deep Learning for Fraud Detection

## Tutorial 5: Advanced Neural Networks for Imbalanced Classification

In this tutorial, you'll master advanced deep learning techniques specifically designed for fraud detection:
- **Focal Loss**: Focus learning on hard examples
- **Weighted Binary Cross-Entropy**: Traditional imbalance handling
- **Autoencoder Anomaly Detection**: Unsupervised fraud detection
- **Modern Architecture Design**: BatchNorm, Dropout, and more

## Learning Objectives

By the end of this tutorial, you'll understand:

1. **Deep Learning for Imbalanced Data**: Why standard approaches fail
2. **Focal Loss**: Mathematical foundation and implementation
3. **Autoencoder Anomaly Detection**: Unsupervised fraud detection
4. **Modern Neural Architecture**: Best practices for stable training
5. **Training Strategies**: Learning rate scheduling, early stopping
6. **Evaluation Techniques**: Comprehensive model assessment
7. **Production Considerations**: GPU optimization and deployment

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    confusion_matrix, classification_report, roc_curve, auc,
    precision_recall_curve, average_precision_score
)
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Check for GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if device.type == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

## Part 1: Understanding the Challenge - Class Imbalance in Fraud Detection

### The Problem

Fraud detection presents unique challenges for deep learning:
- **Extreme imbalance**: Only 0.1-2% of transactions are fraudulent
- **Cost asymmetry**: Missing fraud is much more expensive than false alarms
- **Evolving patterns**: Fraudsters constantly adapt their methods

Standard loss functions such as unweighted cross-entropy struggle on this kind of data because they:
- Are dominated by the majority class
- Don't focus on hard examples
- Don't account for business costs
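
To see why the majority class dominates, consider a trivial classifier that labels every transaction as normal. The sketch below uses made-up counts (`n_normal`, `n_fraud`) chosen to mimic a fraud rate of about 0.17%; they are illustrative and not read from the dataset.

```python
# Hypothetical counts mimicking a ~0.17% fraud rate (not taken from the dataset)
n_normal = 99_830
n_fraud = 170

# A "model" that predicts every transaction as normal
true_negatives = n_normal
false_negatives = n_fraud
true_positives = 0
false_positives = 0

accuracy = (true_positives + true_negatives) / (n_normal + n_fraud)
recall = true_positives / (true_positives + false_negatives)

print(f"Accuracy: {accuracy:.4f}")
print(f"Recall:   {recall:.4f}")
```

The accuracy looks excellent even though not a single fraud case is caught, which is why the rest of this tutorial relies on recall, precision, and related metrics instead.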

In [None]:
# Load and examine the data
df = pd.read_csv('creditcard.csv')
print(f"Dataset shape: {df.shape}")
print(f"Fraud rate: {df['Class'].mean()*100:.3f}%")
print(f"Number of fraud cases: {df['Class'].sum():,}")
print(f"Number of normal cases: {len(df) - df['Class'].sum():,}")

# Visualize class distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Class distribution
class_counts = df['Class'].value_counts()
ax1.bar(['Normal', 'Fraud'], class_counts.values, color=['lightblue', 'salmon'])
ax1.set_ylabel('Number of Transactions')
ax1.set_title('Class Distribution (Log Scale)')
ax1.set_yscale('log')

# Fraud percentage
fraud_percentage = df['Class'].mean() * 100
# Two decimals are needed: at one decimal the fraud share rounds to 0.2%
ax2.pie([100 - fraud_percentage, fraud_percentage], 
        labels=['Normal', 'Fraud'],
        colors=['lightblue', 'salmon'],
        autopct='%1.2f%%')
ax2.set_title('Class Distribution (Percentage)')

plt.tight_layout()
plt.show()

print(f"\nKey Challenge: How can we train a model to detect the {df['Class'].mean() * 100:.2f}% of transactions that are fraudulent?")

## Part 2: Focal Loss - Focusing on Hard Examples

### Theory Behind Focal Loss

Focal Loss was introduced by Facebook AI Research to address class imbalance in object detection. The key insight:
- **Down-weight easy examples**: Well-classified examples contribute far less to the loss, so the abundant majority class cannot dominate it
- **Focus on hard examples**: Misclassified examples keep almost their full cross-entropy loss

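The formula is:
$$FL(p_t) = -\alpha_t (1-p_t)^\gamma \log(p_t)$$

Where:
- $p_t$ = model's predicted probability for the true class
- $\alpha_t$ = class weighting factor: $\alpha$ for the fraud class and $1-\alpha$ for the normal class (typical $\alpha$: 0.25-1.0)
- $\gamma$ = focusing parameter (typically 1-5); $\gamma = 0$ recovers $\alpha$-weighted cross-entropy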
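
A quick worked example may help make the formula concrete. The sketch below evaluates the loss for one easy and one hard example in plain Python; the helper names and the two probabilities are illustrative assumptions, and the weighting factor is fixed at 0.75 for both examples.

```python
import math

def cross_entropy_term(p_t):
    """Cross-entropy for one example, given the probability of its true class."""
    return -math.log(p_t)

def focal_loss_term(p_t, alpha_t=0.75, gamma=2.0):
    """Focal loss for one example: cross-entropy scaled by alpha_t * (1 - p_t)^gamma."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# Hypothetical predictions: one confident and correct, one badly wrong
for name, p_t in [("easy", 0.95), ("hard", 0.10)]:
    ce = cross_entropy_term(p_t)
    fl = focal_loss_term(p_t)
    print(f"{name}: p_t={p_t:.2f}  CE={ce:.4f}  FL={fl:.6f}  FL/CE={fl / ce:.6f}")
```

The ratio FL/CE equals $\alpha_t (1-p_t)^\gamma$: about 0.0019 for the easy example and about 0.61 for the hard one, so the easy example is suppressed roughly 300 times more strongly.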

### Implementation

In [None]:
class FocalLoss(nn.Module):
    """
    Implementation of Focal Loss for addressing class imbalance.
    
    Key benefits:
    - Reduces loss contribution from easy examples
    - Focuses learning on hard examples
    - Naturally handles class imbalance
    """
    
    def __init__(self, alpha=0.75, gamma=2.0):
        super(FocalLoss, self).__init__()
        self.alpha = alpha
        self.gamma = gamma
    
    def forward(self, inputs, targets):
        # Apply sigmoid to get probabilities
        probs = torch.sigmoid(inputs)
        
        # Calculate binary cross entropy
        bce_loss = nn.functional.binary_cross_entropy_with_logits(
            inputs, targets, reduction='none'
        )
        
        # Calculate p_t
        p_t = probs * targets + (1 - probs) * (1 - targets)
        
        # Calculate alpha_t
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        
        # Calculate focal loss
        focal_loss = alpha_t * (1 - p_t) ** self.gamma * bce_loss
        
        return focal_loss.mean()

# Let's visualize how focal loss works
def visualize_focal_loss():
    # Create probability range
    p = np.linspace(0.01, 0.99, 100)
    
    # Standard cross-entropy
    ce_loss = -np.log(p)
    
    # Focal loss with different gamma values
    focal_gamma_1 = -(1 - p) ** 1 * np.log(p)
    focal_gamma_2 = -(1 - p) ** 2 * np.log(p)
    focal_gamma_5 = -(1 - p) ** 5 * np.log(p)
    
    plt.figure(figsize=(10, 6))
    plt.plot(p, ce_loss, label='Cross-Entropy', linewidth=2)
    plt.plot(p, focal_gamma_1, label='Focal Loss (γ=1)', linewidth=2)
    plt.plot(p, focal_gamma_2, label='Focal Loss (γ=2)', linewidth=2)
    plt.plot(p, focal_gamma_5, label='Focal Loss (γ=5)', linewidth=2)
    
    plt.xlabel('Model Confidence (p)')
    plt.ylabel('Loss')
    plt.title('Focal Loss vs Cross-Entropy: Focus on Hard Examples')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
    
    print("Key Insight: Focal Loss reduces loss for high-confidence predictions")
    print("This prevents easy examples from dominating the training process")

visualize_focal_loss()

## Part 3: Modern Neural Network Architecture

### Best Practices for Stable Training

Modern deep learning requires careful architecture design:
- **Batch Normalization**: Stabilizes training and allows higher learning rates
- **Dropout**: Reduces overfitting, which matters here because so few fraud examples are available
- **Proper Weight Initialization**: Prevents vanishing/exploding gradients
- **ReLU Activation**: Helps with gradient flow
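
As a rough illustration of what batch normalization does during training, the sketch below standardizes one feature across a mini-batch in plain Python. The helper `batch_norm_1d` and the sample values are hypothetical. The learnable scale and shift that `nn.BatchNorm1d` also applies are left at their initial values (1 and 0), and the running statistics used at inference time are omitted.

```python
import math

def batch_norm_1d(values, eps=1e-5):
    """Standardize one feature across a mini-batch (training-mode statistics)."""
    n = len(values)
    mean = sum(values) / n
    # Biased (divide-by-n) variance, as used for normalization during training
    var = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / math.sqrt(var + eps) for v in values]

# Hypothetical activations with one extreme value
activations = [1.0, 2.0, 3.0, 1000.0]
normalized = batch_norm_1d(activations)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
print([round(v, 3) for v in normalized])
print(f"mean={norm_mean:.6f}, variance={norm_var:.6f}")
```

Whatever the scale of the incoming activations, the next layer sees values with mean near 0 and variance near 1, which is the property that tends to make training less sensitive to the learning rate.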

In [None]:
class FraudDetectionNN(nn.Module):
    """
    Modern neural network architecture for fraud detection.
    
    Features:
    - Batch normalization for stable training
    - Dropout for regularization
    - Progressive dimension reduction
    """
    
    def __init__(self, input_size, hidden_dims=(128, 64, 32), dropout_rate=0.3):
        super(FraudDetectionNN, self).__init__()
        
        layers = []
        prev_dim = input_size
        
        # Build hidden layers
        for i, hidden_dim in enumerate(hidden_dims):
            # Linear layer
            layers.append(nn.Linear(prev_dim, hidden_dim))
            
            # Batch normalization
            layers.append(nn.BatchNorm1d(hidden_dim))
            
            # Activation
            layers.append(nn.ReLU())
            
            # Dropout (not on last layer)
            if i < len(hidden_dims) - 1:
                layers.append(nn.Dropout(dropout_rate))
            
            prev_dim = hidden_dim
        
        # Output layer
        layers.append(nn.Linear(prev_dim, 1))
        
        self.network = nn.Sequential(*layers)
        
        # Initialize weights
        self._initialize_weights()
    
    def _initialize_weights(self):
        """Initialize weights using Xavier initialization"""
        for module in self.modules():
            if isinstance(module, nn.Linear):
                nn.init.xavier_uniform_(module.weight)
                nn.init.constant_(module.bias, 0)
    
    def forward(self, x):
        # Squeeze only the last dimension so a batch of size 1 keeps its batch axis
        return self.network(x).squeeze(-1)

# Let's also create a weighted BCE loss for comparison
class WeightedBCELoss(nn.Module):
    """
    Weighted Binary Cross-Entropy Loss for handling class imbalance.
    Traditional approach using a positive-class weight (pos_weight).
    """
    
    def __init__(self, pos_weight):
        super(WeightedBCELoss, self).__init__()
        self.pos_weight = pos_weight
    
    def forward(self, inputs, targets):
        return nn.functional.binary_cross_entropy_with_logits(
            inputs, targets, pos_weight=self.pos_weight
        )

print("Modern neural network architecture implemented!")
print("Key features:")
print("- Batch normalization for stable training")
print("- Dropout for regularization")
print("- Xavier weight initialization")
print("- Progressive dimension reduction")

## Part 4: Autoencoder for Anomaly Detection

### Unsupervised Fraud Detection

Autoencoders offer a different approach to fraud detection:
- **Train on normal transactions only**: Learn patterns of legitimate behavior
- **Reconstruction error**: Fraud transactions should have high reconstruction error
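- **No labeled fraud examples needed**: Training needs only transactions known (or assumed) to be legitimate, so the approach is semi-supervised rather than fully unsupervised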
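
One way to turn reconstruction errors into decisions is to set a threshold from the errors observed on held-out normal transactions. The sketch below is a minimal plain-Python illustration under that assumption: the helpers `mse_per_row` and `nearest_rank_percentile`, the 90th-percentile cutoff, and all numbers are hypothetical.

```python
def mse_per_row(rows, reconstructions):
    """Mean squared reconstruction error for each row (one transaction per row)."""
    return [
        sum((a - b) ** 2 for a, b in zip(row, rec)) / len(row)
        for row, rec in zip(rows, reconstructions)
    ]

def nearest_rank_percentile(values, percent):
    """Smallest value with at least `percent`% of `values` at or below it."""
    ordered = sorted(values)
    rank = (percent * len(ordered) + 99) // 100  # integer ceiling of percent * n / 100
    return ordered[max(rank, 1) - 1]

# Hypothetical reconstruction errors on held-out normal transactions
normal_errors = [0.02, 0.03, 0.05, 0.04, 0.01, 0.06, 0.03, 0.02, 0.04, 0.05]
threshold = nearest_rank_percentile(normal_errors, 90)

# Hypothetical new transactions and their reconstructions
new_rows = [[0.1, -0.2], [2.0, 1.0], [0.0, 0.3]]
new_recs = [[0.1, -0.1], [0.5, 0.2], [0.1, 0.2]]
new_errors = mse_per_row(new_rows, new_recs)

# Flag a transaction as anomalous when its error exceeds the threshold
flags = [err > threshold for err in new_errors]
print(f"threshold={threshold}, flags={flags}")
```

The percentile directly controls the expected false-alarm rate on normal traffic, so it can be chosen from the review capacity available rather than from labeled fraud.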

### Architecture Design

In [None]:
class FraudAutoencoder(nn.Module):
    """
    Autoencoder for fraud detection using anomaly detection.
    
    Key principle:
    - Train only on normal transactions
    - Fraud transactions should have high reconstruction error
    - Use reconstruction error as anomaly score
    """
    
    def __init__(self, input_size, encoding_dims=(64, 32, 16, 8)):
        super(FraudAutoencoder, self).__init__()
        
        # Encoder
        encoder_layers = []
        prev_dim = input_size
        
        for dim in encoding_dims:
            encoder_layers.extend([
                nn.Linear(prev_dim, dim),
                nn.BatchNorm1d(dim),
                nn.ReLU(),
                nn.Dropout(0.1)
            ])
            prev_dim = dim
        
        self.encoder = nn.Sequential(*encoder_layers)
        
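        # Decoder (mirror of the encoder)
        decoder_layers = []
        decoding_dims = list(reversed(encoding_dims[:-1])) + [input_size]
        
        for i, dim in enumerate(decoding_dims):
            decoder_layers.append(nn.Linear(prev_dim, dim))
            # The final layer stays purely linear (no BatchNorm, ReLU, or Dropout) so the
            # reconstruction can take any real value, matching the standardized inputs.
            # Checking the index rather than `dim != input_size` also stays correct
            # when a hidden width happens to equal the input size.
            if i < len(decoding_dims) - 1:
                decoder_layers.extend([
                    nn.BatchNorm1d(dim),
                    nn.ReLU(),
                    nn.Dropout(0.1)
                ])
            prev_dim = dim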
        
        self.decoder = nn.Sequential(*decoder_layers)
        
        # Initialize weights
        self._initialize_weights()
    
    def _initialize_weights(self):
        for module in self.modules():
            if isinstance(module, nn.Linear):
                nn.init.xavier_uniform_(module.weight)
                nn.init.constant_(module.bias, 0)
    
    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
    
    def get_reconstruction_error(self, x):
        """Calculate reconstruction error for anomaly detection"""
        with torch.no_grad():
            reconstructed = self.forward(x)
            error = torch.mean((x - reconstructed) ** 2, dim=1)
        return error

# Visualize autoencoder architecture
def visualize_autoencoder_architecture():
    fig, ax = plt.subplots(figsize=(12, 6))
    
    # Define architecture
    layers = [30, 64, 32, 16, 8, 16, 32, 64, 30]
    layer_names = ['Input', 'Hidden 1', 'Hidden 2', 'Hidden 3', 'Bottleneck', 
                   'Hidden 4', 'Hidden 5', 'Hidden 6', 'Output']
    
    # Plot architecture
    x_pos = np.arange(len(layers))
    bars = ax.bar(x_pos, layers, color=['lightblue' if i < 4 else 'red' if i == 4 else 'lightgreen' for i in range(len(layers))])
    
    ax.set_xlabel('Layer')
    ax.set_ylabel('Number of Neurons')
    ax.set_title('Autoencoder Architecture for Fraud Detection')
    ax.set_xticks(x_pos)
    ax.set_xticklabels(layer_names, rotation=45)
    
    # Add annotations
    ax.annotate('Encoder\n(Compression)', xy=(1.5, 40), xytext=(1.5, 60),
                arrowprops=dict(arrowstyle='->', color='blue'), ha='center')
    ax.annotate('Bottleneck\n(Compressed Rep.)', xy=(4, 20), xytext=(4, 40),
                arrowprops=dict(arrowstyle='->', color='red'), ha='center')
    ax.annotate('Decoder\n(Reconstruction)', xy=(6.5, 40), xytext=(6.5, 60),
                arrowprops=dict(arrowstyle='->', color='green'), ha='center')
    
    plt.tight_layout()
    plt.show()
    
    print("Autoencoder Principle:")
    print("1. Train only on normal transactions")
    print("2. Normal transactions should reconstruct well (low error)")
    print("3. Fraud transactions should reconstruct poorly (high error)")
    print("4. Use reconstruction error as anomaly score")

visualize_autoencoder_architecture()

## Part 5: Data Preparation and Training Pipeline

### Preparing Data for Deep Learning

In [None]:
# Prepare data for deep learning
def prepare_data(df, test_size=0.2, batch_size=512):
    """
    Prepare data for deep learning training.
    
    Steps:
    1. Stratified train-test split
    2. Feature scaling (scaler fit on the training split only)
    3. Convert to PyTorch tensors
    4. Create data loaders
    """
    # Separate features and labels
    X = df.drop('Class', axis=1).values
    y = df['Class'].values
    
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=42
    )
    
    # Feature scaling
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # Convert to PyTorch tensors
    X_train_tensor = torch.FloatTensor(X_train_scaled).to(device)
    X_test_tensor = torch.FloatTensor(X_test_scaled).to(device)
    y_train_tensor = torch.FloatTensor(y_train).to(device)
    y_test_tensor = torch.FloatTensor(y_test).to(device)
    
    # Create data loaders
    train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
    test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
    
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
    
    # Calculate class weights for weighted loss
    fraud_count = np.sum(y_train)
    normal_count = len(y_train) - fraud_count
    pos_weight = torch.FloatTensor([normal_count / fraud_count]).to(device)
    
    return {
        'train_loader': train_loader,
        'test_loader': test_loader,
        'X_train': X_train_tensor,
        'X_test': X_test_tensor,
        'y_train': y_train_tensor,
        'y_test': y_test_tensor,
        'pos_weight': pos_weight,
        'scaler': scaler,
        'input_size': X_train_scaled.shape[1]
    }

# Prepare the data
data = prepare_data(df)

print(f"Data preparation complete:")
print(f"- Input features: {data['input_size']}")
print(f"- Training samples: {len(data['train_loader'].dataset):,}")
print(f"- Test samples: {len(data['test_loader'].dataset):,}")
print(f"- Positive class weight: {data['pos_weight'].item():.1f}")
print(f"- Device: {device}")

## Part 6: Training with Focal Loss

### Advanced Training Techniques

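Training deep networks reliably usually involves several techniques:
- **Learning Rate Scheduling**: Reduce the learning rate as training progresses
- **Early Stopping**: Stop when the validation loss stops improving, to limit overfitting
- **Gradient Clipping**: Cap the gradient norm to prevent exploding gradients
- **Mixed Precision**: Faster training on modern GPUs (not used in the training loop below)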
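
The first two techniques are simple enough to sketch without any framework. The plain-Python example below assumes a step schedule that halves the learning rate every 15 epochs and a patience-based stopping rule. The names `step_lr` and `EarlyStopping` and the sample validation losses are hypothetical and only illustrate the logic.

```python
def step_lr(initial_lr, epoch, step_size=15, gamma=0.5):
    """Learning rate after `epoch` completed epochs under a step decay schedule."""
    return initial_lr * gamma ** (epoch // step_size)

class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def update(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Learning rate at selected epochs for a 50-epoch run
schedule = {epoch: step_lr(0.001, epoch) for epoch in (0, 14, 15, 30, 45)}
print(schedule)

# Hypothetical validation losses: improvement stalls after the third epoch
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate([0.9, 0.7, 0.6, 0.61, 0.62, 0.63], start=1):
    if stopper.update(loss):
        stopped_at = epoch
        break
print(f"Stopped at epoch {stopped_at}, best loss {stopper.best}")
```

In a real run the best weights are saved whenever the loss improves, so stopping late costs only compute and not model quality.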

In [None]:
def train_model_with_focal_loss(data, epochs=50, learning_rate=0.001):
    """
    Train fraud detection model using Focal Loss.
    
    Features:
    - Focal Loss for handling class imbalance
    - Learning rate scheduling
    - Early stopping
    - Comprehensive logging
    """
    # Initialize model
    model = FraudDetectionNN(data['input_size']).to(device)
    
    # Initialize loss function and optimizer
    criterion = FocalLoss(alpha=0.75, gamma=2.0)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate, weight_decay=1e-5)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.5)
    
    # Training history
    history = {'train_loss': [], 'val_loss': [], 'learning_rate': []}
    best_val_loss = float('inf')
    patience = 10
    patience_counter = 0
    
    print("Training with Focal Loss...")
    print("="*50)
    
    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss = 0.0
        
        for batch_X, batch_y in data['train_loader']:
            optimizer.zero_grad()
            
            # Forward pass
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            
            # Backward pass
            loss.backward()
            
            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            
            optimizer.step()
            train_loss += loss.item()
        
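        # Validation phase
        # NOTE: the test loader doubles as the validation set here, so early stopping
        # sees the test data and the final test metrics will be optimistic. A separate
        # validation split avoids this.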
        model.eval()
        val_loss = 0.0
        
        with torch.no_grad():
            for batch_X, batch_y in data['test_loader']:
                outputs = model(batch_X)
                loss = criterion(outputs, batch_y)
                val_loss += loss.item()
        
        # Calculate average losses
        avg_train_loss = train_loss / len(data['train_loader'])
        avg_val_loss = val_loss / len(data['test_loader'])
        
        # Record the learning rate used this epoch, then update it for the next one
        current_lr = optimizer.param_groups[0]['lr']
        scheduler.step()
        
        # Save history
        history['train_loss'].append(avg_train_loss)
        history['val_loss'].append(avg_val_loss)
        history['learning_rate'].append(current_lr)
        
        # Early stopping
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            patience_counter = 0
            # Save best model
            torch.save(model.state_dict(), 'best_focal_model.pth')
        else:
            patience_counter += 1
        
        # Print progress
        if epoch % 5 == 0 or epoch == epochs - 1:
            print(f"Epoch {epoch+1:2d}: Train Loss={avg_train_loss:.4f}, "
                  f"Val Loss={avg_val_loss:.4f}, LR={current_lr:.2e}")
        
        # Early stopping check
        if patience_counter >= patience:
            print(f"\nEarly stopping at epoch {epoch+1}")
            break
    
    # Load best model
    model.load_state_dict(torch.load('best_focal_model.pth', map_location=device))
    
    return model, history

# Train the model
focal_model, focal_history = train_model_with_focal_loss(data)

# Visualize training history
def plot_training_history(history, title="Training History"):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    
    # Loss curves
    epochs = range(1, len(history['train_loss']) + 1)
    ax1.plot(epochs, history['train_loss'], 'b-', label='Training Loss')
    ax1.plot(epochs, history['val_loss'], 'r-', label='Validation Loss')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Loss')
    ax1.set_title(f'{title} - Loss')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Learning rate schedule
    ax2.plot(epochs, history['learning_rate'], 'g-', linewidth=2)
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Learning Rate')
    ax2.set_title(f'{title} - Learning Rate Schedule')
    ax2.set_yscale('log')
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

plot_training_history(focal_history, "Focal Loss Training")

## Part 7: Training with Weighted Binary Cross-Entropy
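
Before training, it may help to see what the positive-class weight does to a single example. The sketch below evaluates weighted binary cross-entropy on a logit by hand, multiplying only the positive-class term by `pos_weight`. The helper `weighted_bce` and the weight of 500 are illustrative assumptions; the actual weight is computed from the training split as the ratio of normal to fraud cases.

```python
import math

def weighted_bce(logit, target, pos_weight):
    """Weighted BCE on a logit: the positive-class term is scaled by pos_weight."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * target * math.log(p) + (1 - target) * math.log(1 - p))

# An undecided model (logit 0, probability 0.5) with a hypothetical weight of 500
fraud_loss = weighted_bce(0.0, 1, pos_weight=500.0)
normal_loss = weighted_bce(0.0, 0, pos_weight=500.0)

print(f"Loss on a fraud example:  {fraud_loss:.2f}")
print(f"Loss on a normal example: {normal_loss:.4f}")
print(f"Ratio: {fraud_loss / normal_loss:.1f}")
```

Each fraud example therefore counts as much as 500 normal ones, which balances the total contribution of the two classes but, unlike focal loss, does not distinguish easy examples from hard ones.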

In [None]:
def train_model_with_weighted_bce(data, epochs=50, learning_rate=0.001):
    """
    Train fraud detection model using Weighted Binary Cross-Entropy.
    This is the traditional approach for handling class imbalance.
    """
    # Initialize model
    model = FraudDetectionNN(data['input_size']).to(device)
    
    # Initialize loss function and optimizer
    criterion = WeightedBCELoss(data['pos_weight'])
    optimizer = optim.Adam(model.parameters(), lr=learning_rate, weight_decay=1e-5)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.5)
    
    # Training history
    history = {'train_loss': [], 'val_loss': [], 'learning_rate': []}
    best_val_loss = float('inf')
    patience = 10
    patience_counter = 0
    
    print("Training with Weighted BCE...")
    print("="*50)
    
    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss = 0.0
        
        for batch_X, batch_y in data['train_loader']:
            optimizer.zero_grad()
            
            # Forward pass
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            
            # Backward pass
            loss.backward()
            
            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            
            optimizer.step()
            train_loss += loss.item()
        
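        # Validation phase
        # NOTE: the test loader doubles as the validation set here, so early stopping
        # sees the test data and the final test metrics will be optimistic. A separate
        # validation split avoids this.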
        model.eval()
        val_loss = 0.0
        
        with torch.no_grad():
            for batch_X, batch_y in data['test_loader']:
                outputs = model(batch_X)
                loss = criterion(outputs, batch_y)
                val_loss += loss.item()
        
        # Calculate average losses
        avg_train_loss = train_loss / len(data['train_loader'])
        avg_val_loss = val_loss / len(data['test_loader'])
        
        # Record the learning rate used this epoch, then update it for the next one
        current_lr = optimizer.param_groups[0]['lr']
        scheduler.step()
        
        # Save history
        history['train_loss'].append(avg_train_loss)
        history['val_loss'].append(avg_val_loss)
        history['learning_rate'].append(current_lr)
        
        # Early stopping
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            patience_counter = 0
            # Save best model
            torch.save(model.state_dict(), 'best_weighted_model.pth')
        else:
            patience_counter += 1
        
        # Print progress
        if epoch % 5 == 0 or epoch == epochs - 1:
            print(f"Epoch {epoch+1:2d}: Train Loss={avg_train_loss:.4f}, "
                  f"Val Loss={avg_val_loss:.4f}, LR={current_lr:.2e}")
        
        # Early stopping check
        if patience_counter >= patience:
            print(f"\nEarly stopping at epoch {epoch+1}")
            break
    
    # Load best model
    model.load_state_dict(torch.load('best_weighted_model.pth', map_location=device))
    
    return model, history

# Train the model
weighted_model, weighted_history = train_model_with_weighted_bce(data)

plot_training_history(weighted_history, "Weighted BCE Training")

## Part 8: Training the Autoencoder for Anomaly Detection

In [None]:
def train_autoencoder(data, epochs=50, learning_rate=0.001):
    """
    Train autoencoder for anomaly detection.
    
    Key principle: Train only on normal transactions!
    """
    # Initialize model
    model = FraudAutoencoder(data['input_size']).to(device)
    
    # Create dataset with only normal transactions
    normal_mask = data['y_train'] == 0
    X_normal = data['X_train'][normal_mask]
    
    # Create data loader for normal transactions only
    normal_dataset = TensorDataset(X_normal, X_normal)  # Input = Output for autoencoder
    normal_loader = DataLoader(normal_dataset, batch_size=512, shuffle=True)
    
    # Initialize loss function and optimizer
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate, weight_decay=1e-5)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.5)
    
    # Training history
    history = {'train_loss': [], 'learning_rate': []}
    
    print(f"Training Autoencoder on {len(X_normal):,} normal transactions...")
    print("="*50)
    
    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss = 0.0
        
        for batch_X, _ in normal_loader:
            optimizer.zero_grad()
            
            # Forward pass
            reconstructed = model(batch_X)
            loss = criterion(reconstructed, batch_X)
            
            # Backward pass
            loss.backward()
            
            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            
            optimizer.step()
            train_loss += loss.item()
        
        # Calculate average loss
        avg_train_loss = train_loss / len(normal_loader)
        
        # Record the learning rate used this epoch, then update it for the next one
        current_lr = optimizer.param_groups[0]['lr']
        scheduler.step()
        
        # Save history
        history['train_loss'].append(avg_train_loss)
        history['learning_rate'].append(current_lr)
        
        # Print progress
        if epoch % 5 == 0 or epoch == epochs - 1:
            print(f"Epoch {epoch+1:2d}: Reconstruction Loss={avg_train_loss:.4f}, "
                  f"LR={current_lr:.2e}")
    
    return model, history

# Train the autoencoder
autoencoder, autoencoder_history = train_autoencoder(data)

# Visualize autoencoder training
def plot_autoencoder_history(history):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    
    epochs = range(1, len(history['train_loss']) + 1)
    
    # Reconstruction loss
    ax1.plot(epochs, history['train_loss'], 'b-', linewidth=2)
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Reconstruction Loss (MSE)')
    ax1.set_title('Autoencoder Training - Reconstruction Loss')
    ax1.grid(True, alpha=0.3)
    
    # Learning rate schedule
    ax2.plot(epochs, history['learning_rate'], 'g-', linewidth=2)
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Learning Rate')
    ax2.set_title('Autoencoder Training - Learning Rate Schedule')
    ax2.set_yscale('log')
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

plot_autoencoder_history(autoencoder_history)

## Part 9: Model Evaluation and Comparison

### Comprehensive Evaluation Metrics

For fraud detection, we need to evaluate multiple aspects:
- **Classification Performance**: Precision, Recall, F1-Score, AUC
- **Business Impact**: Cost analysis, false positive rates
- **Calibration**: How well do probabilities match actual frequencies?
- **Threshold Analysis**: Finding optimal decision boundaries
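
To illustrate how classification metrics and business cost can disagree, the sketch below compares two operating points of the same hypothetical model. The confusion-matrix counts and the costs (100 per missed fraud, 1 per false alarm) are invented for illustration, as is the helper `summarize`.

```python
def summarize(tp, fp, fn, tn, cost_fn=100.0, cost_fp=1.0):
    """Precision, recall, F1, false positive rate, and total cost from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        "cost": fn * cost_fn + fp * cost_fp,
    }

# Hypothetical results at a high threshold (few alerts) and a low threshold (many alerts)
conservative = summarize(tp=70, fp=10, fn=30, tn=99_890)
aggressive = summarize(tp=90, fp=400, fn=10, tn=99_500)

for name, m in [("conservative", conservative), ("aggressive", aggressive)]:
    print(f"{name:>12}: precision={m['precision']:.3f} recall={m['recall']:.3f} "
          f"f1={m['f1']:.3f} cost={m['cost']:.0f}")
```

With these assumed costs, the aggressive threshold has the lower F1-score but less than half the total cost, so the metric used to pick a model or threshold should reflect the cost asymmetry described in Part 1.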

In [None]:
def evaluate_classification_model(model, data, model_name):
    """
    Comprehensive evaluation of classification model.
    """
    model.eval()
    
    with torch.no_grad():
        # Get predictions
        outputs = model(data['X_test'])
        probabilities = torch.sigmoid(outputs).cpu().numpy()
        predictions = (probabilities > 0.5).astype(int)
        
        # True labels
        y_true = data['y_test'].cpu().numpy()
        
        # Calculate metrics
        from sklearn.metrics import (
            accuracy_score, precision_score, recall_score, f1_score,
            roc_auc_score, confusion_matrix, classification_report
        )
        
        metrics = {
            'accuracy': accuracy_score(y_true, predictions),
            'precision': precision_score(y_true, predictions),
            'recall': recall_score(y_true, predictions),
            'f1_score': f1_score(y_true, predictions),
            'roc_auc': roc_auc_score(y_true, probabilities)
        }
        
        # Confusion matrix
        cm = confusion_matrix(y_true, predictions)
        tn, fp, fn, tp = cm.ravel()
        
        metrics.update({
            'true_negatives': tn,
            'false_positives': fp,
            'false_negatives': fn,
            'true_positives': tp
        })
        
        return metrics, probabilities, predictions

def evaluate_autoencoder(model, data, model_name):
    """
    Evaluate autoencoder using reconstruction error for anomaly detection.
    """
    model.eval()
    
    with torch.no_grad():
        # Get reconstruction errors
        reconstruction_errors = model.get_reconstruction_error(data['X_test']).cpu().numpy()
        y_true = data['y_test'].cpu().numpy()
        
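        # Find the threshold that maximizes Youden's J statistic (TPR - FPR).
        # NOTE: the threshold is tuned on the test labels, so the metrics below are
        # optimistic; in practice, tune it on a separate validation set.
        fpr, tpr, thresholds = roc_curve(y_true, reconstruction_errors)
        optimal_idx = np.argmax(tpr - fpr)
        optimal_threshold = thresholds[optimal_idx]
        
        # Make predictions using the optimal threshold
        # (>= matches roc_curve, which counts scores >= threshold as positive)
        predictions = (reconstruction_errors >= optimal_threshold).astype(int)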
        
        # Calculate metrics
        from sklearn.metrics import (
            accuracy_score, precision_score, recall_score, f1_score,
            roc_auc_score, confusion_matrix
        )
        
        metrics = {
            'accuracy': accuracy_score(y_true, predictions),
            'precision': precision_score(y_true, predictions),
            'recall': recall_score(y_true, predictions),
            'f1_score': f1_score(y_true, predictions),
            'roc_auc': roc_auc_score(y_true, reconstruction_errors),
            'optimal_threshold': optimal_threshold
        }
        
        # Confusion matrix
        cm = confusion_matrix(y_true, predictions)
        tn, fp, fn, tp = cm.ravel()
        
        metrics.update({
            'true_negatives': tn,
            'false_positives': fp,
            'false_negatives': fn,
            'true_positives': tp
        })
        
        return metrics, reconstruction_errors, predictions

# Evaluate all models
print("Evaluating all models...")
print("="*60)

# Classification models
focal_metrics, focal_probs, focal_preds = evaluate_classification_model(focal_model, data, "Focal Loss")
weighted_metrics, weighted_probs, weighted_preds = evaluate_classification_model(weighted_model, data, "Weighted BCE")

# Autoencoder
autoencoder_metrics, ae_errors, ae_preds = evaluate_autoencoder(autoencoder, data, "Autoencoder")

# Create comparison table
comparison_data = {
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC', 'True Positives', 'False Positives', 'False Negatives'],
    'Focal Loss': [
        f"{focal_metrics['accuracy']:.4f}",
        f"{focal_metrics['precision']:.4f}",
        f"{focal_metrics['recall']:.4f}",
        f"{focal_metrics['f1_score']:.4f}",
        f"{focal_metrics['roc_auc']:.4f}",
        f"{focal_metrics['true_positives']}",
        f"{focal_metrics['false_positives']}",
        f"{focal_metrics['false_negatives']}"
    ],
    'Weighted BCE': [
        f"{weighted_metrics['accuracy']:.4f}",
        f"{weighted_metrics['precision']:.4f}",
        f"{weighted_metrics['recall']:.4f}",
        f"{weighted_metrics['f1_score']:.4f}",
        f"{weighted_metrics['roc_auc']:.4f}",
        f"{weighted_metrics['true_positives']}",
        f"{weighted_metrics['false_positives']}",
        f"{weighted_metrics['false_negatives']}"
    ],
    'Autoencoder': [
        f"{autoencoder_metrics['accuracy']:.4f}",
        f"{autoencoder_metrics['precision']:.4f}",
        f"{autoencoder_metrics['recall']:.4f}",
        f"{autoencoder_metrics['f1_score']:.4f}",
        f"{autoencoder_metrics['roc_auc']:.4f}",
        f"{autoencoder_metrics['true_positives']}",
        f"{autoencoder_metrics['false_positives']}",
        f"{autoencoder_metrics['false_negatives']}"
    ]
}

comparison_df = pd.DataFrame(comparison_data)
print("\nModel Comparison:")
print(comparison_df.to_string(index=False))

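# Determine best model by F1-score
# (the classifiers use a fixed 0.5 threshold while the autoencoder threshold was tuned
# on the test set, so treat this ranking as indicative only)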
f1_scores = {
    'Focal Loss': focal_metrics['f1_score'],
    'Weighted BCE': weighted_metrics['f1_score'],
    'Autoencoder': autoencoder_metrics['f1_score']
}

best_model_name = max(f1_scores, key=f1_scores.get)
print(f"\nBest Model: {best_model_name} (F1-Score: {f1_scores[best_model_name]:.4f})")

## Part 10: Advanced Visualizations and Analysis

In [None]:
# Create comprehensive visualization
def create_comprehensive_evaluation_plots(data, focal_probs, weighted_probs, ae_errors):
    """
    Create comprehensive evaluation plots for all models.
    """
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    
    y_true = data['y_test'].cpu().numpy()
    
    # ROC Curves
    fpr_focal, tpr_focal, _ = roc_curve(y_true, focal_probs)
    fpr_weighted, tpr_weighted, _ = roc_curve(y_true, weighted_probs)
    fpr_ae, tpr_ae, _ = roc_curve(y_true, ae_errors)
    
    axes[0, 0].plot(fpr_focal, tpr_focal, label=f'Focal Loss (AUC={auc(fpr_focal, tpr_focal):.3f})', linewidth=2)
    axes[0, 0].plot(fpr_weighted, tpr_weighted, label=f'Weighted BCE (AUC={auc(fpr_weighted, tpr_weighted):.3f})', linewidth=2)
    axes[0, 0].plot(fpr_ae, tpr_ae, label=f'Autoencoder (AUC={auc(fpr_ae, tpr_ae):.3f})', linewidth=2)
    axes[0, 0].plot([0, 1], [0, 1], 'k--', alpha=0.5)
    axes[0, 0].set_xlabel('False Positive Rate')
    axes[0, 0].set_ylabel('True Positive Rate')
    axes[0, 0].set_title('ROC Curves')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    
    # Precision-Recall Curves
    precision_focal, recall_focal, _ = precision_recall_curve(y_true, focal_probs)
    precision_weighted, recall_weighted, _ = precision_recall_curve(y_true, weighted_probs)
    precision_ae, recall_ae, _ = precision_recall_curve(y_true, ae_errors)
    
    axes[0, 1].plot(recall_focal, precision_focal, label=f'Focal Loss (AP={average_precision_score(y_true, focal_probs):.3f})', linewidth=2)
    axes[0, 1].plot(recall_weighted, precision_weighted, label=f'Weighted BCE (AP={average_precision_score(y_true, weighted_probs):.3f})', linewidth=2)
    axes[0, 1].plot(recall_ae, precision_ae, label=f'Autoencoder (AP={average_precision_score(y_true, ae_errors):.3f})', linewidth=2)
    axes[0, 1].set_xlabel('Recall')
    axes[0, 1].set_ylabel('Precision')
    axes[0, 1].set_title('Precision-Recall Curves')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    
    # Score distributions
    normal_mask = y_true == 0
    fraud_mask = y_true == 1
    
    axes[0, 2].hist(focal_probs[normal_mask], bins=50, alpha=0.7, label='Normal', density=True)
    axes[0, 2].hist(focal_probs[fraud_mask], bins=50, alpha=0.7, label='Fraud', density=True)
    axes[0, 2].set_xlabel('Predicted Probability')
    axes[0, 2].set_ylabel('Density')
    axes[0, 2].set_title('Focal Loss - Score Distribution')
    axes[0, 2].legend()
    axes[0, 2].grid(True, alpha=0.3)
    
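    # Confusion matrices
    # The two classifiers are thresholded at 0.5; the autoencoder uses the optimal
    # threshold stored in the global `autoencoder_metrics` from the evaluation cell.
    cm_focal = confusion_matrix(y_true, (focal_probs > 0.5).astype(int))
    cm_weighted = confusion_matrix(y_true, (weighted_probs > 0.5).astype(int))
    cm_ae = confusion_matrix(y_true, (ae_errors > autoencoder_metrics['optimal_threshold']).astype(int))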
    
    # Plot confusion matrices
    im1 = axes[1, 0].imshow(cm_focal, interpolation='nearest', cmap=plt.cm.Blues)
    axes[1, 0].set_title('Focal Loss - Confusion Matrix')
    axes[1, 0].set_ylabel('True Label')
    axes[1, 0].set_xlabel('Predicted Label')
    
    # Add text annotations
    for i in range(2):
        for j in range(2):
            axes[1, 0].text(j, i, format(cm_focal[i, j], 'd'),
                          ha="center", va="center", color="white" if cm_focal[i, j] > cm_focal.max() / 2 else "black")
    
    im2 = axes[1, 1].imshow(cm_weighted, interpolation='nearest', cmap=plt.cm.Blues)
    axes[1, 1].set_title('Weighted BCE - Confusion Matrix')
    axes[1, 1].set_ylabel('True Label')
    axes[1, 1].set_xlabel('Predicted Label')
    
    for i in range(2):
        for j in range(2):
            axes[1, 1].text(j, i, format(cm_weighted[i, j], 'd'),
                          ha="center", va="center", color="white" if cm_weighted[i, j] > cm_weighted.max() / 2 else "black")
    
    im3 = axes[1, 2].imshow(cm_ae, interpolation='nearest', cmap=plt.cm.Blues)
    axes[1, 2].set_title('Autoencoder - Confusion Matrix')
    axes[1, 2].set_ylabel('True Label')
    axes[1, 2].set_xlabel('Predicted Label')
    
    for i in range(2):
        for j in range(2):
            axes[1, 2].text(j, i, format(cm_ae[i, j], 'd'),
                          ha="center", va="center", color="white" if cm_ae[i, j] > cm_ae.max() / 2 else "black")
    
    plt.tight_layout()
    plt.show()

# Create the comprehensive evaluation plots
create_comprehensive_evaluation_plots(data, focal_probs, weighted_probs, ae_errors)

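# Analyze autoencoder reconstruction errors
from sklearn.metrics import f1_score  # needed for the threshold sweep; not in the top-level import list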
def analyze_autoencoder_performance(data, autoencoder, ae_errors):
    """
    Analyze autoencoder performance in detail.
    """
    y_true = data['y_test'].cpu().numpy()
    normal_mask = y_true == 0
    fraud_mask = y_true == 1
    
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    
    # Reconstruction error distribution
    axes[0].hist(ae_errors[normal_mask], bins=50, alpha=0.7, label='Normal', density=True, color='blue')
    axes[0].hist(ae_errors[fraud_mask], bins=50, alpha=0.7, label='Fraud', density=True, color='red')
    axes[0].axvline(autoencoder_metrics['optimal_threshold'], color='green', linestyle='--', 
                   label=f'Optimal Threshold: {autoencoder_metrics["optimal_threshold"]:.4f}')
    axes[0].set_xlabel('Reconstruction Error')
    axes[0].set_ylabel('Density')
    axes[0].set_title('Reconstruction Error Distribution')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Box plot of reconstruction errors
    box_data = [ae_errors[normal_mask], ae_errors[fraud_mask]]
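    axes[1].boxplot(box_data)
    # Set tick labels separately: the `labels` keyword of boxplot is deprecated in recent Matplotlib
    axes[1].set_xticklabels(['Normal', 'Fraud'])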
    axes[1].set_ylabel('Reconstruction Error')
    axes[1].set_title('Reconstruction Error by Class')
    axes[1].grid(True, alpha=0.3)
    
    # Threshold analysis
    thresholds = np.linspace(ae_errors.min(), ae_errors.max(), 100)
    f1_scores = []
    
    for threshold in thresholds:
        predictions = (ae_errors > threshold).astype(int)
        f1 = f1_score(y_true, predictions)
        f1_scores.append(f1)
    
    axes[2].plot(thresholds, f1_scores, 'b-', linewidth=2)
    axes[2].axvline(autoencoder_metrics['optimal_threshold'], color='red', linestyle='--', 
                   label=f'Optimal: {autoencoder_metrics["optimal_threshold"]:.4f}')
    axes[2].set_xlabel('Threshold')
    axes[2].set_ylabel('F1-Score')
    axes[2].set_title('F1-Score vs Threshold')
    axes[2].legend()
    axes[2].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Statistics
    print("\nAutoencoder Analysis:")
    print(f"Normal transactions - Mean error: {ae_errors[normal_mask].mean():.4f}, Std: {ae_errors[normal_mask].std():.4f}")
    print(f"Fraud transactions - Mean error: {ae_errors[fraud_mask].mean():.4f}, Std: {ae_errors[fraud_mask].std():.4f}")
    print(f"Separation ratio: {ae_errors[fraud_mask].mean() / ae_errors[normal_mask].mean():.2f}x")

analyze_autoencoder_performance(data, autoencoder, ae_errors)

## Practice Exercises

Now it's your turn! Try these exercises to deepen your understanding:

### Exercise 1: Hyperparameter Tuning
Experiment with different Focal Loss parameters (α, γ) and see how they affect performance. Try:
- α = [0.25, 0.5, 0.75, 1.0]
- γ = [0.5, 1.0, 2.0, 5.0]

Which combination works best for your dataset?
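
Before training a full grid, it can help to see what γ does to individual examples. The snippet below is a minimal NumPy sketch, assuming the common convention where positives are weighted by α and negatives by (1 − α); the `focal_loss` helper here is illustrative and may differ from the loss class defined earlier in this tutorial. It compares the loss of one easy negative with one hard fraud case as γ grows.

```python
import numpy as np

def focal_loss(probs, targets, alpha=0.25, gamma=2.0, eps=1e-7):
    """Per-example binary focal loss (assumed convention: alpha on positives)."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    targets = np.asarray(targets, dtype=float)
    # p_t is the probability the model assigns to the true class
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    # alpha_t weights positives by alpha and negatives by (1 - alpha)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified examples
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# One easy negative (predicted 0.05) and one hard fraud case (predicted 0.10)
probs = np.array([0.05, 0.10])
targets = np.array([0, 1])

for gamma in [0.0, 0.5, 1.0, 2.0, 5.0]:
    easy, hard = focal_loss(probs, targets, alpha=0.5, gamma=gamma)
    print(f"gamma={gamma:<4} easy={easy:.6f}  hard={hard:.6f}  hard/easy={hard / easy:.1f}")
```

With γ = 0 and α = 0.5 this reduces to half the ordinary binary cross-entropy, which makes a convenient sanity check for any implementation you tune.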

In [None]:
# Your code here
# Hint: Create a parameter grid and train models with different α and γ values
# Compare their F1-scores to find the best combination

### Exercise 2: Architecture Experimentation
Modify the neural network architecture:
- Try different numbers of layers
- Experiment with different layer sizes
- Add residual connections
- Try different activation functions

How does architecture affect performance?
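
As a possible starting point for the residual-connection variant, here is a small PyTorch sketch. The class names `ResidualBlock` and `ResidualFraudNet`, the hidden size, and `input_dim=30` are all placeholders chosen for illustration, and the network is assumed to return raw logits (pair it with a logit-based loss, or add a sigmoid if your training code expects probabilities).

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two linear layers wrapped in a skip connection (illustrative helper)."""
    def __init__(self, dim, dropout=0.3):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim),
            nn.BatchNorm1d(dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(dim, dim),
            nn.BatchNorm1d(dim),
        )
        self.activation = nn.ReLU()

    def forward(self, x):
        # Skip connection: add the input back before the final activation
        return self.activation(x + self.block(x))

class ResidualFraudNet(nn.Module):
    """Input projection, a stack of residual blocks, and a single-logit head."""
    def __init__(self, input_dim, hidden_dim=64, num_blocks=2, dropout=0.3):
        super().__init__()
        self.input_layer = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.blocks = nn.Sequential(
            *[ResidualBlock(hidden_dim, dropout) for _ in range(num_blocks)]
        )
        self.output_layer = nn.Linear(hidden_dim, 1)  # one logit per transaction

    def forward(self, x):
        return self.output_layer(self.blocks(self.input_layer(x))).squeeze(-1)

# Quick shape check on random data (30 features is a placeholder)
model = ResidualFraudNet(input_dim=30)
model.eval()
with torch.no_grad():
    logits = model(torch.randn(8, 30))
print(logits.shape)
```

Because the skip connection adds tensors element-wise, the block keeps the width fixed; changing width between blocks would need an extra projection on the skip path.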

In [None]:
# Your code here
# Hint: Create a new class inheriting from nn.Module
# Experiment with different architectures and compare results

### Exercise 3: Ensemble Methods
Create an ensemble of your three models:
- Use voting (majority rule)
- Use weighted averaging based on model confidence
- Use stacking with a meta-learner

Does the ensemble perform better than individual models?
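
One wrinkle is that the autoencoder produces reconstruction errors rather than probabilities, so its scores need rescaling before they can be averaged with the classifiers. The sketch below shows voting and weighted averaging on small made-up arrays; the helper names, the min-max rescaling, and the weights are assumptions for illustration, and in the notebook you would substitute `focal_probs`, `weighted_probs`, and `ae_errors`.

```python
import numpy as np

def min_max_scale(scores):
    """Rescale arbitrary scores to [0, 1] so they are comparable to probabilities."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

def majority_vote(*binary_preds):
    """Flag a transaction when more than half of the models flag it."""
    votes = np.sum(binary_preds, axis=0)
    return (votes > len(binary_preds) / 2).astype(int)

def weighted_average(score_list, weights):
    """Average model scores using weights normalized to sum to 1."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return np.average(np.vstack(score_list), axis=0, weights=weights)

# Made-up scores for four transactions
focal = np.array([0.9, 0.2, 0.6, 0.1])
weighted = np.array([0.8, 0.4, 0.3, 0.2])
ae_err = np.array([5.0, 1.0, 4.0, 0.5])   # reconstruction errors, not probabilities
ae_scaled = min_max_scale(ae_err)

vote = majority_vote(focal > 0.5, weighted > 0.5, ae_scaled > 0.5)
avg = weighted_average([focal, weighted, ae_scaled], weights=[0.4, 0.4, 0.2])
print("Majority vote:", vote)
print("Weighted average:", np.round(avg, 3))
```

To avoid leaking test information, the scaling range and the weights would normally be fitted on a validation split rather than on the test set.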

In [None]:
# Your code here
# Hint: Combine predictions from all three models using different strategies
# Compare ensemble performance with individual models

## Key Takeaways

### 1. Class Imbalance Solutions
- **Focal Loss**: Focuses on hard examples, reduces impact of easy negatives
- **Weighted BCE**: Traditional approach using class weights
- **Autoencoder**: Unsupervised approach learning normal patterns

### 2. Modern Architecture Design
- **Batch Normalization**: Stabilizes training and often allows higher learning rates
- **Dropout**: Reduces overfitting, especially helpful on small datasets
- **Proper Initialization**: Xavier/He initialization helps avoid vanishing or exploding gradients
- **Progressive Reduction**: Shrinking layer widths gradually tends to work better than an abrupt bottleneck

### 3. Training Best Practices
- **Learning Rate Scheduling**: Adaptive learning rates improve convergence
- **Early Stopping**: Prevents overfitting, saves computational resources
- **Gradient Clipping**: Prevents exploding gradients
- **Mixed Precision**: Faster training on modern GPUs
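
The first three practices fit together in a single loop. The following is a minimal sketch on synthetic data, not the training function used earlier in this tutorial: the tiny model, the full-batch updates, and the values for `patience`, `max_norm`, and the scheduler settings are all assumptions picked to keep the example short. Mixed precision is left out so the code also runs on CPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in data: 10 features, roughly 10% positives
X_train, y_train = torch.randn(512, 10), (torch.rand(512) < 0.1).float()
X_val, y_val = torch.randn(128, 10), (torch.rand(128) < 0.1).float()

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
# Halve the learning rate when validation loss stops improving for 2 epochs
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2
)

best_val, best_state, bad_epochs, patience = float("inf"), None, 0, 5
history = []

for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(X_train).squeeze(-1), y_train)
    loss.backward()
    # Gradient clipping: cap the global gradient norm before the update
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_val).squeeze(-1), y_val).item()
    scheduler.step(val_loss)
    history.append(val_loss)

    # Early stopping: keep the best weights, stop after `patience` bad epochs
    if val_loss < best_val - 1e-4:
        best_val, bad_epochs = val_loss, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break

model.load_state_dict(best_state)
print(f"Stopped after {len(history)} epochs, best validation loss {best_val:.4f}")
```

Restoring `best_state` at the end matters: without it, the model keeps the weights from the last (possibly worse) epoch rather than the best one.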

### 4. Evaluation Strategies
- **Multiple Metrics**: Don't rely on accuracy alone
- **Threshold Analysis**: Find optimal decision boundaries
- **Business Context**: Consider false positive/negative costs
- **Visualization**: ROC curves, PR curves, score distributions
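
Threshold analysis and business context can be combined by picking the threshold that minimizes expected cost instead of maximizing F1. The sketch below assumes scores in [0, 1] and uses invented costs (a missed fraud costing 50 times a false alarm); `pick_threshold_by_cost` is a hypothetical helper, and real costs would come from your own business data.

```python
import numpy as np

def pick_threshold_by_cost(y_true, scores, cost_fp, cost_fn, n_thresholds=101):
    """Return the threshold in [0, 1] with the lowest total misclassification cost."""
    y_true = np.asarray(y_true).astype(int)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    costs = []
    for t in thresholds:
        preds = (scores > t).astype(int)
        fp = np.sum((preds == 1) & (y_true == 0))  # legitimate transactions flagged
        fn = np.sum((preds == 0) & (y_true == 1))  # fraud cases missed
        costs.append(fp * cost_fp + fn * cost_fn)
    costs = np.array(costs)
    best = int(np.argmin(costs))
    return thresholds[best], costs[best]

# Made-up labels and scores for six transactions
y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.4, 0.6, 0.55, 0.9])

threshold, cost = pick_threshold_by_cost(y_true, scores, cost_fp=1, cost_fn=50)
print(f"Chosen threshold: {threshold:.2f}, total cost: {cost}")
```

For autoencoder reconstruction errors, the same idea applies if the threshold grid spans the observed error range instead of [0, 1].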

### 5. Model Selection
- **Focal Loss**: Well suited to extreme imbalance (as in fraud detection)
- **Weighted BCE**: Good baseline, easy to implement
- **Autoencoder**: Useful when labeled fraud data is limited
- **Ensemble**: Often provides the best overall performance, at the cost of extra complexity

### 6. Production Considerations
- **GPU Optimization**: Use appropriate batch sizes and memory management
- **Model Serving**: Consider inference speed and resource requirements
- **Monitoring**: Track model performance and drift over time
- **Retraining**: Update models as fraud patterns evolve

## Next Steps

In the next tutorial, we'll explore:
- Graph Neural Networks for fraud detection
- Capturing relationships between transactions
- Advanced feature engineering with graph structures
- Heterogeneous graph neural networks

Remember: The best model is the one that best serves your specific business needs and constraints!