# ML Practice Questions - Part 9: Deep Learning Architectures

This notebook covers advanced deep learning architectures and specialized neural network designs.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, load_digits, make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import accuracy_score, mean_squared_error
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import warnings
warnings.filterwarnings('ignore')

plt.style.use('default')
sns.set_palette("husl")

## Question 1: Convolutional Neural Networks (CNNs)

**Question**: Implement a CNN from scratch and explain the mathematical foundations of convolution, pooling, and feature mapping. Compare different architectures like LeNet and analyze the role of receptive fields.

### Theory: Convolutional Neural Networks

CNNs are specialized neural networks for processing grid-like data such as images. The key operations are:

**1. Convolution Operation:**
$$S(i,j) = (I * K)(i,j) = \sum_m \sum_n I(m,n)K(i-m, j-n)$$

Where:
- $S(i,j)$ = output feature map at position $(i,j)$
- $I$ = input image
- $K$ = convolution kernel/filter
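
The formula above flips the kernel, whereas most deep learning layers compute cross-correlation (no flip). The following is a minimal sketch of the difference, assuming "valid" mode (no padding) and a single channel; the helper names `cross_correlate2d` and `convolve2d` are illustrative and not part of any library.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel without flipping it."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve2d(image, kernel):
    """Valid-mode 2D convolution: cross-correlation with the kernel flipped on both axes."""
    return cross_correlate2d(image, kernel[::-1, ::-1])

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])

corr = cross_correlate2d(image, kernel)  # every entry is -5
conv = convolve2d(image, kernel)         # every entry is +5
print(corr)
print(conv)
```

For a kernel that is symmetric under a 180-degree rotation the two operations coincide. Because the kernel weights are learned, the flip makes no practical difference to what a CNN can represent.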

**2. Pooling Operation:**
- Max pooling: $P(i,j) = \max_{(m,n) \in \mathcal{R}(i,j)} I(m,n)$
- Average pooling: $P(i,j) = \frac{1}{|\mathcal{R}(i,j)|} \sum_{(m,n) \in \mathcal{R}(i,j)} I(m,n)$
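
As a small worked example of both pooling rules, the sketch below pools a 4x4 array with non-overlapping 2x2 regions. It assumes the stride equals the window size and that the input dimensions are divisible by it; `pool2d` is a hypothetical helper name.

```python
import numpy as np

def pool2d(x, size, mode="max"):
    """Non-overlapping pooling (stride == size) over a 2D array."""
    h, w = x.shape
    assert h % size == 0 and w % size == 0, "sketch assumes dimensions divisible by size"
    # Reshape so that axes 1 and 3 index positions inside each pooling region
    blocks = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 10, 13, 14],
              [11, 12, 15, 16]], dtype=float)

max_pooled = pool2d(x, 2, "max")  # [[4, 8], [12, 16]]
avg_pooled = pool2d(x, 2, "avg")  # [[2.5, 6.5], [10.5, 14.5]]
print(max_pooled)
print(avg_pooled)
```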

**3. Receptive Field:**
$$RF_l = RF_{l-1} + (K_l - 1) \prod_{i=1}^{l-1} S_i$$

Here $RF_l$ is the receptive field at layer $l$ (with $RF_0 = 1$ for a single input pixel), $K_l$ is the kernel size of layer $l$, and $S_i$ is the stride of layer $i$.
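
The recurrence can be checked by hand on a LeNet-style stack. This is a minimal sketch, assuming square kernels and a layer list given as hypothetical `(name, kernel_size, stride)` tuples:

```python
def receptive_fields(layers):
    """Return the receptive field after each (name, kernel_size, stride) layer."""
    rf, jump = 1, 1  # jump = product of the strides of all earlier layers
    result = []
    for name, kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        result.append((name, rf))
    return result

lenet_like = [("conv1", 5, 1), ("pool1", 2, 2), ("conv2", 5, 1), ("pool2", 2, 2)]
fields = receptive_fields(lenet_like)
print(fields)  # [('conv1', 5), ('pool1', 6), ('conv2', 14), ('pool2', 16)]
```

Each unit after the second pooling layer therefore sees a 16x16 patch of the input, even though no single kernel is larger than 5x5.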

In [None]:
class Conv2DCustom:
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        
        # Initialize weights using Xavier initialization
        fan_in = in_channels * kernel_size * kernel_size
        fan_out = out_channels * kernel_size * kernel_size
        std = np.sqrt(2.0 / (fan_in + fan_out))
        
        self.weights = np.random.normal(0, std, 
                                      (out_channels, in_channels, kernel_size, kernel_size))
        self.bias = np.zeros(out_channels)
        
        # For backpropagation
        self.last_input = None
        
    def add_padding(self, x):
        if self.padding == 0:
            return x
        return np.pad(x, ((0, 0), (0, 0), (self.padding, self.padding), 
                         (self.padding, self.padding)), mode='constant')
    
    def forward(self, x):
        """Forward pass through convolution layer"""
        self.last_input = x.copy()
        batch_size, in_channels, in_height, in_width = x.shape
        
        # Add padding
        x_padded = self.add_padding(x)
        
        # Calculate output dimensions
        out_height = (in_height + 2 * self.padding - self.kernel_size) // self.stride + 1
        out_width = (in_width + 2 * self.padding - self.kernel_size) // self.stride + 1
        
        # Initialize output
        output = np.zeros((batch_size, self.out_channels, out_height, out_width))
        
        # Convolution operation
        for b in range(batch_size):
            for c_out in range(self.out_channels):
                for h in range(out_height):
                    for w in range(out_width):
                        h_start = h * self.stride
                        h_end = h_start + self.kernel_size
                        w_start = w * self.stride
                        w_end = w_start + self.kernel_size
                        
                        # Extract receptive field
                        receptive_field = x_padded[b, :, h_start:h_end, w_start:w_end]
                        
                        # Cross-correlation (no kernel flip), as in most deep learning libraries
                        output[b, c_out, h, w] = np.sum(receptive_field * self.weights[c_out]) + self.bias[c_out]
        
        return output

class MaxPool2DCustom:
    def __init__(self, kernel_size, stride=None):
        self.kernel_size = kernel_size
        self.stride = stride if stride is not None else kernel_size
        self.last_input = None
        self.mask = None
    
    def forward(self, x):
        """Forward pass through max pooling layer"""
        self.last_input = x.copy()
        batch_size, channels, in_height, in_width = x.shape
        
        # Calculate output dimensions
        out_height = (in_height - self.kernel_size) // self.stride + 1
        out_width = (in_width - self.kernel_size) // self.stride + 1
        
        # Initialize output and mask for backpropagation
        output = np.zeros((batch_size, channels, out_height, out_width))
        self.mask = np.zeros_like(x)
        
        # Max pooling operation
        for b in range(batch_size):
            for c in range(channels):
                for h in range(out_height):
                    for w in range(out_width):
                        h_start = h * self.stride
                        h_end = h_start + self.kernel_size
                        w_start = w * self.stride
                        w_end = w_start + self.kernel_size
                        
                        # Extract pooling region
                        pool_region = x[b, c, h_start:h_end, w_start:w_end]
                        
                        # Find max value and its position
                        max_val = np.max(pool_region)
                        output[b, c, h, w] = max_val
                        
                        # Create mask for backpropagation
                        mask_region = (pool_region == max_val)
                        self.mask[b, c, h_start:h_end, w_start:w_end] = mask_region
        
        return output

class CNNCustom:
    def __init__(self, input_shape, num_classes):
        self.input_shape = input_shape  # (channels, height, width)
        self.num_classes = num_classes
        
        # LeNet-like architecture
        self.conv1 = Conv2DCustom(input_shape[0], 6, kernel_size=5, padding=0)
        self.pool1 = MaxPool2DCustom(kernel_size=2)
        self.conv2 = Conv2DCustom(6, 16, kernel_size=5, padding=0)
        self.pool2 = MaxPool2DCustom(kernel_size=2)
        
        # Calculate flattened size
        self.flattened_size = self._calculate_flattened_size()
        
        # Fully connected layers
        self.fc1_weights = np.random.randn(self.flattened_size, 120) * 0.1
        self.fc1_bias = np.zeros(120)
        self.fc2_weights = np.random.randn(120, 84) * 0.1
        self.fc2_bias = np.zeros(84)
        self.fc3_weights = np.random.randn(84, num_classes) * 0.1
        self.fc3_bias = np.zeros(num_classes)
    
    def _calculate_flattened_size(self):
        """Calculate the size after conv and pooling layers"""
        # Simulate forward pass to get dimensions
        dummy_input = np.zeros((1, *self.input_shape))
        x = self.conv1.forward(dummy_input)
        x = self.pool1.forward(x)
        x = self.conv2.forward(x)
        x = self.pool2.forward(x)
        return np.prod(x.shape[1:])
    
    def relu(self, x):
        return np.maximum(0, x)
    
    def softmax(self, x):
        exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)
    
    def forward(self, x):
        """Forward pass through the CNN"""
        # Convolutional layers
        x = self.conv1.forward(x)
        x = self.relu(x)
        x = self.pool1.forward(x)
        
        x = self.conv2.forward(x)
        x = self.relu(x)
        x = self.pool2.forward(x)
        
        # Flatten
        batch_size = x.shape[0]
        x = x.reshape(batch_size, -1)
        
        # Fully connected layers
        x = np.dot(x, self.fc1_weights) + self.fc1_bias
        x = self.relu(x)
        
        x = np.dot(x, self.fc2_weights) + self.fc2_bias
        x = self.relu(x)
        
        x = np.dot(x, self.fc3_weights) + self.fc3_bias
        
        return self.softmax(x)
    
    def calculate_receptive_field(self):
        """Calculate receptive field at each layer"""
        receptive_fields = []
        
        # Initial receptive field
        rf = 1
        stride_product = 1
        
        # Conv1: kernel=5, stride=1
        rf = rf + (5 - 1) * stride_product
        receptive_fields.append(('Conv1', rf))
        
        # Pool1: kernel=2, stride=2
        rf = rf + (2 - 1) * stride_product
        stride_product *= 2
        receptive_fields.append(('Pool1', rf))
        
        # Conv2: kernel=5, stride=1
        rf = rf + (5 - 1) * stride_product
        receptive_fields.append(('Conv2', rf))
        
        # Pool2: kernel=2, stride=2
        rf = rf + (2 - 1) * stride_product
        stride_product *= 2
        receptive_fields.append(('Pool2', rf))
        
        return receptive_fields

# Test with simple image data
print("Testing CNN Implementation...")

# Create synthetic image data
np.random.seed(42)
batch_size = 10
input_shape = (1, 28, 28)  # Single channel, 28x28 images
num_classes = 10

# Generate random data
X = np.random.randn(batch_size, *input_shape)
y = np.random.randint(0, num_classes, batch_size)

# Initialize CNN
cnn = CNNCustom(input_shape, num_classes)

# Forward pass
output = cnn.forward(X)
print(f"Input shape: {X.shape}")
print(f"Output shape: {output.shape}")
print(f"Output probabilities sum: {np.sum(output, axis=1)[:5]}")

# Calculate receptive fields
receptive_fields = cnn.calculate_receptive_field()
print("\nReceptive Field Analysis:")
for layer, rf in receptive_fields:
    print(f"{layer}: {rf}x{rf}")

In [None]:
# Compare with PyTorch implementation
class LeNetPyTorch(nn.Module):
    def __init__(self, num_classes=10):
        super(LeNetPyTorch, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)
        self.pool1 = nn.MaxPool2d(kernel_size=2)
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)
        self.pool2 = nn.MaxPool2d(kernel_size=2)
        self.fc1 = nn.Linear(16 * 4 * 4, 120)  # Assumes 28x28 input: 28 -> 24 -> 12 -> 8 -> 4
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, num_classes)
    
    def forward(self, x):
        x = self.pool1(F.relu(self.conv1(x)))
        x = self.pool2(F.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return F.softmax(x, dim=1)

# Test PyTorch version
lenet_pytorch = LeNetPyTorch(num_classes=10)
X_torch = torch.randn(batch_size, *input_shape)
with torch.no_grad():
    output_pytorch = lenet_pytorch(X_torch)

print(f"PyTorch output shape: {output_pytorch.shape}")
print(f"PyTorch probabilities sum: {torch.sum(output_pytorch, dim=1)[:5]}")

# Visualize feature maps
def visualize_feature_maps(model, input_data, layer_name):
    """Visualize feature maps from a specific layer"""
    if layer_name == 'conv1':
        features = model.conv1.forward(input_data)
        features = np.maximum(0, features)  # ReLU activation
    elif layer_name == 'conv2':
        x = model.conv1.forward(input_data)
        x = np.maximum(0, x)
        x = model.pool1.forward(x)
        features = model.conv2.forward(x)
        features = np.maximum(0, features)
    else:
        # Fail early instead of hitting an undefined `features` below
        raise ValueError(f"Unsupported layer_name: {layer_name!r}")
    
    # Plot first 6 feature maps
    fig, axes = plt.subplots(2, 3, figsize=(12, 8))
    fig.suptitle(f'Feature Maps from {layer_name}')
    
    for i in range(6):
        row = i // 3
        col = i % 3
        if i < features.shape[1]:
            axes[row, col].imshow(features[0, i], cmap='viridis')
            axes[row, col].set_title(f'Filter {i+1}')
            axes[row, col].axis('off')
        else:
            axes[row, col].axis('off')
    
    plt.tight_layout()
    plt.show()

# Visualize feature maps
sample_input = X[:1]  # Take first sample
visualize_feature_maps(cnn, sample_input, 'conv1')
visualize_feature_maps(cnn, sample_input, 'conv2')

## Question 2: Recurrent Neural Networks (RNNs) and LSTM

**Question**: Implement RNNs and LSTM from scratch, explaining the vanishing gradient problem and how LSTM gates solve it. Demonstrate sequence prediction and analyze gradient flow.

### Theory: RNNs and LSTM

**1. Vanilla RNN:**
$$h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t + b_h)$$
$$y_t = W_{hy}h_t + b_y$$

**2. LSTM Gates:**
- Forget gate: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
- Input gate: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
- Candidate values: $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
- Cell state: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
- Output gate: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
- Hidden state: $h_t = o_t \odot \tanh(C_t)$

Here $\sigma$ is the logistic sigmoid and $\odot$ denotes element-wise multiplication.
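
Because the cell-state update is additive, the direct dependence of $C_t$ on $C_{t-1}$ is an element-wise scaling by $f_t$. The sketch below checks this with finite differences. It is a simplification: the gate activations are hypothetical fixed values, so the indirect dependence of the gates on $h_{t-1}$ is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 4, 20

# Hypothetical gate activations, held fixed for illustration
forget = rng.uniform(0.9, 1.0, size=(steps, hidden_size))
inp = rng.uniform(0.0, 1.0, size=(steps, hidden_size))
cand = rng.uniform(-1.0, 1.0, size=(steps, hidden_size))

def run_cell(c0):
    """Apply C_t = f_t * C_{t-1} + i_t * C~_t for all steps."""
    c = c0
    for t in range(steps):
        c = forget[t] * c + inp[t] * cand[t]
    return c

c0 = np.zeros(hidden_size)
eps = 1e-6
# Finite-difference sensitivity of the final cell state to the initial one
numeric = (run_cell(c0 + eps) - run_cell(c0)) / eps
# Along the direct path the sensitivity is the product of the forget gates
analytic = np.prod(forget, axis=0)
print(numeric)
print(analytic)
```

With forget gates close to 1 the sensitivity stays well away from zero after 20 steps, which is the mechanism usually credited for the LSTM's better long-range gradient flow.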

**3. Vanishing Gradient Problem:**
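$$\frac{\partial L}{\partial W} = \sum_{t=1}^T \frac{\partial L_t}{\partial W} = \sum_{t=1}^T \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t} \left(\prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}\right) \frac{\partial^{+} h_k}{\partial W}$$

Here $\frac{\partial^{+} h_k}{\partial W}$ is the immediate derivative of $h_k$ with respect to $W$, holding $h_{k-1}$ fixed. For the tanh RNN above, each factor is $\frac{\partial h_j}{\partial h_{j-1}} = \text{diag}(1 - h_j^2)\,W_{hh}$. The product $\prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}$ therefore shrinks exponentially in $t-k$ when the norm of these factors is below 1 (vanishing gradients) and can grow exponentially when it is above 1 (exploding gradients).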
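
The product of step Jacobians $\frac{\partial h_k}{\partial h_{k-1}}$ can also be measured directly rather than approximated. The following is a minimal sketch, assuming a tanh RNN in column-vector form with small random weights (standard deviation 0.1) and random inputs; all sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 8, 30
W_hh = rng.normal(0.0, 0.1, (hidden_size, hidden_size))
W_xh = rng.normal(0.0, 0.1, (hidden_size, hidden_size))

h = np.zeros(hidden_size)
jacobian_product = np.eye(hidden_size)
norms = []
for _ in range(steps):
    x = rng.normal(size=hidden_size)
    h = np.tanh(W_hh @ h + W_xh @ x)
    # d h_k / d h_{k-1} = diag(1 - h_k^2) @ W_hh for a tanh RNN
    step_jacobian = (1.0 - h ** 2)[:, None] * W_hh
    jacobian_product = step_jacobian @ jacobian_product
    # Spectral norm: the largest factor by which the product can scale a gradient
    norms.append(np.linalg.norm(jacobian_product, 2))

print(f"after 1 step:   {norms[0]:.3e}")
print(f"after {steps} steps: {norms[-1]:.3e}")
```

With weights this small the norm collapses by many orders of magnitude over 30 steps. Scaling `W_hh` up until its spectral norm exceeds 1 is a simple way to explore the exploding regime instead.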

In [None]:
class VanillaRNNCustom:
    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        # Initialize weights
        self.Wxh = np.random.randn(input_size, hidden_size) * 0.1
        self.Whh = np.random.randn(hidden_size, hidden_size) * 0.1
        self.Why = np.random.randn(hidden_size, output_size) * 0.1
        
        self.bh = np.zeros((1, hidden_size))
        self.by = np.zeros((1, output_size))
        
        # For gradient analysis
        self.hidden_states = []
        self.gradients = []
    
    def tanh(self, x):
        return np.tanh(x)
    
    def forward(self, inputs, h_prev=None):
        """Forward pass through RNN"""
        seq_len, batch_size = inputs.shape[0], inputs.shape[1]
        
        if h_prev is None:
            h_prev = np.zeros((batch_size, self.hidden_size))
        
        self.hidden_states = [h_prev]
        outputs = []
        
        h = h_prev
        for t in range(seq_len):
            # RNN step
            h = self.tanh(np.dot(inputs[t], self.Wxh) + np.dot(h, self.Whh) + self.bh)
            y = np.dot(h, self.Why) + self.by
            
            self.hidden_states.append(h)
            outputs.append(y)
        
        return np.array(outputs), h
    
    def analyze_gradients(self, sequence_length=20):
        """Analyze gradient magnitudes to demonstrate vanishing gradient problem"""
        # Simulate gradient flow backward through time
        gradients = []
        
        # Start with unit gradient at final timestep
        grad = 1.0
        
        for t in range(sequence_length):
            # Approximate gradient flow: grad *= |dh/dh_prev|
            # For tanh activation the derivative is (1 - tanh²(x)), which lies in (0, 1]
            # The spectral norm of Whh bounds how much the recurrent weights can scale a vector
            tanh_derivative = 1.0 - 0.5  # Assumed average derivative (illustrative, not measured)
            weight_norm = np.linalg.norm(self.Whh, 2)
            
            grad *= tanh_derivative * weight_norm
            gradients.append(grad)
        
        return gradients

class LSTMCustom:
    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        # Initialize weights for gates
        concat_size = input_size + hidden_size
        
        # Forget gate
        self.Wf = np.random.randn(concat_size, hidden_size) * 0.1
        self.bf = np.zeros((1, hidden_size))
        
        # Input gate
        self.Wi = np.random.randn(concat_size, hidden_size) * 0.1
        self.bi = np.zeros((1, hidden_size))
        
        # Candidate gate
        self.Wc = np.random.randn(concat_size, hidden_size) * 0.1
        self.bc = np.zeros((1, hidden_size))
        
        # Output gate
        self.Wo = np.random.randn(concat_size, hidden_size) * 0.1
        self.bo = np.zeros((1, hidden_size))
        
        # Output projection
        self.Why = np.random.randn(hidden_size, output_size) * 0.1
        self.by = np.zeros((1, output_size))
        
        # For analysis
        self.gate_activations = {'forget': [], 'input': [], 'output': []}
        self.cell_states = []
        self.hidden_states = []
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
    
    def tanh(self, x):
        return np.tanh(np.clip(x, -500, 500))
    
    def forward(self, inputs, h_prev=None, c_prev=None):
        """Forward pass through LSTM"""
        seq_len, batch_size = inputs.shape[0], inputs.shape[1]
        
        if h_prev is None:
            h_prev = np.zeros((batch_size, self.hidden_size))
        if c_prev is None:
            c_prev = np.zeros((batch_size, self.hidden_size))
        
        # Reset tracking
        self.gate_activations = {'forget': [], 'input': [], 'output': []}
        self.cell_states = [c_prev]
        self.hidden_states = [h_prev]
        
        outputs = []
        h, c = h_prev, c_prev
        
        for t in range(seq_len):
            # Concatenate input and previous hidden state
            concat = np.concatenate([inputs[t], h], axis=1)
            
            # Forget gate
            f = self.sigmoid(np.dot(concat, self.Wf) + self.bf)
            
            # Input gate
            i = self.sigmoid(np.dot(concat, self.Wi) + self.bi)
            
            # Candidate values
            c_tilde = self.tanh(np.dot(concat, self.Wc) + self.bc)
            
            # Update cell state
            c = f * c + i * c_tilde
            
            # Output gate
            o = self.sigmoid(np.dot(concat, self.Wo) + self.bo)
            
            # Update hidden state
            h = o * self.tanh(c)
            
            # Output projection
            y = np.dot(h, self.Why) + self.by
            
            # Store for analysis
            self.gate_activations['forget'].append(f)
            self.gate_activations['input'].append(i)
            self.gate_activations['output'].append(o)
            self.cell_states.append(c)
            self.hidden_states.append(h)
            
            outputs.append(y)
        
        return np.array(outputs), h, c
    
    def analyze_gradient_flow(self, sequence_length=20):
        """Analyze how LSTM maintains gradient flow"""
        gradients = []
        
        # Heuristic: track only the direct path through the cell state
        grad = 1.0
        
        for t in range(sequence_length):
            # Along that path dC_t/dC_{t-1} = f_t, so each step scales the gradient
            # by the forget gate alone (no weight matrix or tanh derivative)
            forget_gate_avg = 0.7  # Assumed forget gate activation (illustrative, not measured)
            grad *= forget_gate_avg  # Decays more slowly than the RNN estimate when f_t is close to 1
            gradients.append(grad)
        
        return gradients

# Test RNN implementations
print("Testing RNN and LSTM Implementations...")

# Create sequence data
np.random.seed(42)
seq_len = 10
batch_size = 5
input_size = 3
hidden_size = 4
output_size = 2

# Generate random sequence
X_seq = np.random.randn(seq_len, batch_size, input_size)

# Test Vanilla RNN
rnn = VanillaRNNCustom(input_size, hidden_size, output_size)
outputs_rnn, final_h_rnn = rnn.forward(X_seq)

print(f"RNN - Input shape: {X_seq.shape}")
print(f"RNN - Output shape: {outputs_rnn.shape}")
print(f"RNN - Final hidden state shape: {final_h_rnn.shape}")

# Test LSTM
lstm = LSTMCustom(input_size, hidden_size, output_size)
outputs_lstm, final_h_lstm, final_c_lstm = lstm.forward(X_seq)

print(f"\nLSTM - Output shape: {outputs_lstm.shape}")
print(f"LSTM - Final hidden state shape: {final_h_lstm.shape}")
print(f"LSTM - Final cell state shape: {final_c_lstm.shape}")

# Analyze gradient flow
rnn_gradients = rnn.analyze_gradients(sequence_length=20)
lstm_gradients = lstm.analyze_gradient_flow(sequence_length=20)

print(f"\nGradient Analysis:")
print(f"RNN gradient after 20 steps: {rnn_gradients[-1]:.2e}")
print(f"LSTM gradient after 20 steps: {lstm_gradients[-1]:.2e}")

In [None]:
# Visualize gradient flow comparison
plt.figure(figsize=(12, 8))

# Plot gradient magnitudes
plt.subplot(2, 2, 1)
timesteps = range(1, len(rnn_gradients) + 1)
plt.semilogy(timesteps, rnn_gradients, 'r-', label='Vanilla RNN', linewidth=2)
plt.semilogy(timesteps, lstm_gradients, 'b-', label='LSTM', linewidth=2)
plt.xlabel('Timesteps Back')
plt.ylabel('Gradient Magnitude (log scale)')
plt.title('Gradient Flow Comparison')
plt.legend()
plt.grid(True)

# Plot LSTM gate activations
if lstm.gate_activations['forget']:
    forget_gates = np.array(lstm.gate_activations['forget'])
    input_gates = np.array(lstm.gate_activations['input'])
    output_gates = np.array(lstm.gate_activations['output'])
    
    plt.subplot(2, 2, 2)
    timesteps = range(forget_gates.shape[0])
    plt.plot(timesteps, np.mean(forget_gates, axis=(1, 2)), 'g-', label='Forget Gate', linewidth=2)
    plt.plot(timesteps, np.mean(input_gates, axis=(1, 2)), 'r-', label='Input Gate', linewidth=2)
    plt.plot(timesteps, np.mean(output_gates, axis=(1, 2)), 'b-', label='Output Gate', linewidth=2)
    plt.xlabel('Timestep')
    plt.ylabel('Average Gate Activation')
    plt.title('LSTM Gate Activations')
    plt.legend()
    plt.grid(True)

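# Compare with PyTorch LSTM
# Note: the PyTorch weights are initialized independently and nn.LSTM returns hidden
# states (no output projection), so the curves below are not expected to match
lstm_pytorch = nn.LSTM(input_size, hidden_size, batch_first=False)
X_torch = torch.from_numpy(X_seq).float()  # Feed both models the same sequence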

with torch.no_grad():
    outputs_pytorch, (h_final, c_final) = lstm_pytorch(X_torch)

plt.subplot(2, 2, 3)
plt.plot(outputs_lstm[:, 0, 0], 'b-', label='Custom LSTM', linewidth=2)
plt.plot(outputs_pytorch[:, 0, 0].numpy(), 'r--', label='PyTorch LSTM', linewidth=2)
plt.xlabel('Timestep')
plt.ylabel('Output Value')
plt.title('Output Comparison (First Feature)')
plt.legend()
plt.grid(True)

# Hidden state evolution
if lstm.hidden_states:
    hidden_norms = [np.linalg.norm(h, axis=1).mean() for h in lstm.hidden_states[1:]]
    plt.subplot(2, 2, 4)
    plt.plot(range(len(hidden_norms)), hidden_norms, 'purple', linewidth=2)
    plt.xlabel('Timestep')
    plt.ylabel('Average Hidden State Norm')
    plt.title('Hidden State Evolution')
    plt.grid(True)

plt.tight_layout()
plt.show()

print(f"\nComparison Summary:")
print(f"Custom LSTM output range: [{outputs_lstm.min():.3f}, {outputs_lstm.max():.3f}]")
print(f"PyTorch LSTM output range: [{outputs_pytorch.min():.3f}, {outputs_pytorch.max():.3f}]")

## Question 3: Attention Mechanisms and Transformers

**Question**: Implement self-attention and multi-head attention from scratch. Explain the mathematical foundations of the Transformer architecture and demonstrate its advantages over RNNs for sequence modeling.

### Theory: Attention and Transformers

**1. Scaled Dot-Product Attention:**
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:
- $Q$ = queries matrix $(n \times d_k)$
- $K$ = keys matrix $(m \times d_k)$
- $V$ = values matrix $(m \times d_v)$
- $d_k$ = dimension of keys/queries
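
To make the shapes concrete, here is a minimal single-sequence sketch of the formula (no batch or head axes, no mask). The sizes $n = 3$, $m = 5$, $d_k = 4$, $d_v = 2$ are arbitrary illustrative choices.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-sequence attention: Q is (n, d_k), K is (m, d_k), V is (m, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, m)
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ V, weights                      # (n, d_v), (n, m)

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # n = 3 queries
K = rng.normal(size=(5, 4))   # m = 5 keys
V = rng.normal(size=(5, 2))   # d_v = 2
output, weights = scaled_dot_product_attention(Q, K, V)

print(output.shape, weights.shape)
print(weights.sum(axis=-1))
```

Each output row is a convex combination of the rows of $V$, so it always lies within the range spanned by the values.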

**2. Multi-Head Attention:**
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$

Where:
$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
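
In practice the per-head projections $QW_i^Q$ are often computed as one fused matrix product followed by a reshape. The sketch below, using hypothetical small sizes, checks that the two formulations give the same result for the query projection; the same argument applies to keys and values.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 8, 2
d_k = d_model // num_heads

X = rng.normal(size=(seq_len, d_model))
W_q_heads = rng.normal(size=(num_heads, d_model, d_k))   # one W_i^Q per head

# Per-head projections, as written in the formula
per_head = [X @ W_q_heads[i] for i in range(num_heads)]   # each (seq_len, d_k)

# Equivalent fused projection: place the head matrices side by side
W_q_fused = np.concatenate(list(W_q_heads), axis=1)       # (d_model, num_heads * d_k)
fused = (X @ W_q_fused).reshape(seq_len, num_heads, d_k)  # split the last axis into heads

for i in range(num_heads):
    print(i, np.allclose(fused[:, i, :], per_head[i]))
```

The fused form replaces a Python loop over heads with a single matrix multiplication, which is why it is the usual choice in optimized implementations.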

**3. Positional Encoding:**
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
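
One consequence of pairing sine and cosine at each frequency is that the dot product between two encodings depends only on the offset between the positions. The following is a minimal sketch that checks this numerically, assuming an even $d_{model}$; `positional_encoding` is an illustrative helper name.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding; assumes d_model is even."""
    position = np.arange(max_len)[:, None]
    freqs = 1.0 / (10000.0 ** (np.arange(0, d_model, 2) / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(position * freqs)
    pe[:, 1::2] = np.cos(position * freqs)
    return pe

pe = positional_encoding(max_len=50, d_model=16)

# Dot products depend only on the offset between positions
same_offset_a = pe[3] @ pe[8]     # offset 5
same_offset_b = pe[20] @ pe[25]   # offset 5
print(np.isclose(same_offset_a, same_offset_b))

# Every encoding has squared norm d_model / 2, since sin^2 + cos^2 = 1 per frequency
squared_norms = (pe ** 2).sum(axis=1)
print(np.allclose(squared_norms, 8.0))
```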

**4. Self-Attention Benefits:**
- Parallelizable computation
- Direct connections between all positions
- Short, constant-length paths between any two positions, so gradients do not have to pass through a long chain of recurrent steps

In [None]:
class AttentionCustom:
    def __init__(self, d_model, d_k=None, d_v=None):
        self.d_model = d_model
        self.d_k = d_k if d_k is not None else d_model
        self.d_v = d_v if d_v is not None else d_model
        self.scale = np.sqrt(self.d_k)
        
        # Store attention weights for visualization
        self.attention_weights = None
    
    def softmax(self, x, axis=-1):
        """Numerically stable softmax"""
        x_max = np.max(x, axis=axis, keepdims=True)
        exp_x = np.exp(x - x_max)
        return exp_x / np.sum(exp_x, axis=axis, keepdims=True)
    
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        """Scaled dot-product attention"""
        # Compute attention scores
        scores = np.matmul(Q, K.transpose(0, 1, 3, 2)) / self.scale
        
        # Apply mask if provided
        if mask is not None:
            scores = np.where(mask == 0, -1e9, scores)
        
        # Apply softmax
        attention_weights = self.softmax(scores, axis=-1)
        self.attention_weights = attention_weights
        
        # Apply attention to values
        output = np.matmul(attention_weights, V)
        
        return output, attention_weights

class MultiHeadAttentionCustom:
    def __init__(self, d_model, num_heads):
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.d_v = d_model // num_heads
        
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        # Weight matrices for each head
        self.W_q = np.random.randn(num_heads, d_model, self.d_k) * 0.1
        self.W_k = np.random.randn(num_heads, d_model, self.d_k) * 0.1
        self.W_v = np.random.randn(num_heads, d_model, self.d_v) * 0.1
        
        # Output projection
        self.W_o = np.random.randn(d_model, d_model) * 0.1
        
        # Individual attention mechanisms
        self.attention = AttentionCustom(d_model, self.d_k, self.d_v)
        
        # Store attention weights for visualization
        self.all_attention_weights = []
    
    def forward(self, Q, K, V, mask=None):
        """Multi-head attention forward pass"""
        batch_size, seq_len = Q.shape[0], Q.shape[1]
        
        # Store attention weights for each head
        self.all_attention_weights = []
        head_outputs = []
        
        # Process each attention head
        for i in range(self.num_heads):
            # Linear projections for this head
            Q_head = np.matmul(Q, self.W_q[i])  # (batch_size, seq_len, d_k)
            K_head = np.matmul(K, self.W_k[i])  # (batch_size, seq_len, d_k)
            V_head = np.matmul(V, self.W_v[i])  # (batch_size, seq_len, d_v)
            
            # Add a singleton head axis so the tensors match the 4D layout expected by the attention routine
            Q_head = Q_head[:, np.newaxis, :, :]  # (batch_size, 1, seq_len, d_k)
            K_head = K_head[:, np.newaxis, :, :]  # (batch_size, 1, seq_len, d_k)
            V_head = V_head[:, np.newaxis, :, :]  # (batch_size, 1, seq_len, d_v)
            
            # Compute attention for this head
            head_output, attention_weights = self.attention.scaled_dot_product_attention(
                Q_head, K_head, V_head, mask
            )
            
            # Remove the extra dimension
            head_output = head_output[:, 0, :, :]  # (batch_size, seq_len, d_v)
            
            head_outputs.append(head_output)
            self.all_attention_weights.append(attention_weights[:, 0, :, :])  # (batch_size, seq_len, seq_len)
        
        # Concatenate all heads
        multi_head_output = np.concatenate(head_outputs, axis=-1)  # (batch_size, seq_len, d_model)
        
        # Final linear projection
        output = np.matmul(multi_head_output, self.W_o)
        
        return output

class PositionalEncodingCustom:
    def __init__(self, d_model, max_seq_len=5000):
        self.d_model = d_model
        self.max_seq_len = max_seq_len
        
        # Create positional encoding matrix
        pe = np.zeros((max_seq_len, d_model))
        position = np.arange(0, max_seq_len)[:, np.newaxis]
        
        # Compute the positional encodings
        div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
        
        pe[:, 0::2] = np.sin(position * div_term)
        pe[:, 1::2] = np.cos(position * div_term)
        
        self.pe = pe
    
    def forward(self, x):
        """Add positional encoding to input"""
        seq_len = x.shape[1]
        return x + self.pe[:seq_len, :]

class TransformerBlockCustom:
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_ff = d_ff
        
        # Multi-head attention
        self.self_attention = MultiHeadAttentionCustom(d_model, num_heads)
        
        # Feed-forward network
        self.ff_W1 = np.random.randn(d_model, d_ff) * 0.1
        self.ff_b1 = np.zeros(d_ff)
        self.ff_W2 = np.random.randn(d_ff, d_model) * 0.1
        self.ff_b2 = np.zeros(d_model)
        
        # Layer normalization parameters (simplified)
        self.ln1_weight = np.ones(d_model)
        self.ln1_bias = np.zeros(d_model)
        self.ln2_weight = np.ones(d_model)
        self.ln2_bias = np.zeros(d_model)
    
    def layer_norm(self, x, weight, bias, eps=1e-6):
        """Layer normalization"""
        mean = np.mean(x, axis=-1, keepdims=True)
        var = np.var(x, axis=-1, keepdims=True)
        normalized = (x - mean) / np.sqrt(var + eps)
        return normalized * weight + bias
    
    def relu(self, x):
        return np.maximum(0, x)
    
    def feed_forward(self, x):
        """Position-wise feed-forward network"""
        # First linear layer + ReLU
        hidden = self.relu(np.matmul(x, self.ff_W1) + self.ff_b1)
        # Second linear layer
        output = np.matmul(hidden, self.ff_W2) + self.ff_b2
        return output
    
    def forward(self, x, mask=None):
        """Forward pass through transformer block"""
        # Self-attention with residual connection and layer norm
        attention_output = self.self_attention.forward(x, x, x, mask)
        x = self.layer_norm(x + attention_output, self.ln1_weight, self.ln1_bias)
        
        # Feed-forward with residual connection and layer norm
        ff_output = self.feed_forward(x)
        x = self.layer_norm(x + ff_output, self.ln2_weight, self.ln2_bias)
        
        return x

# Test implementations
print("Testing Attention and Transformer Implementations...")

# Create test data
np.random.seed(42)
batch_size = 2
seq_len = 8
d_model = 64
num_heads = 8
d_ff = 256

# Generate random input
X = np.random.randn(batch_size, seq_len, d_model)

# Test Multi-Head Attention
mha = MultiHeadAttentionCustom(d_model, num_heads)
attention_output = mha.forward(X, X, X)

print(f"Input shape: {X.shape}")
print(f"Multi-head attention output shape: {attention_output.shape}")
print(f"Number of attention heads: {len(mha.all_attention_weights)}")
print(f"Attention weights shape per head: {mha.all_attention_weights[0].shape}")

# Test Positional Encoding
pe = PositionalEncodingCustom(d_model)
X_with_pe = pe.forward(X)

print(f"\nInput with positional encoding shape: {X_with_pe.shape}")
print(f"Positional encoding added successfully")

# Test Transformer Block
transformer_block = TransformerBlockCustom(d_model, num_heads, d_ff)
transformer_output = transformer_block.forward(X_with_pe)

print(f"\nTransformer block output shape: {transformer_output.shape}")
print(f"Transformer processing completed")

In [None]:
# Visualize attention patterns and positional encodings
plt.figure(figsize=(15, 10))

# Visualize positional encodings
plt.subplot(2, 3, 1)
pe_matrix = pe.pe[:50, :50]  # First 50 positions and dimensions
plt.imshow(pe_matrix, aspect='auto', cmap='RdBu')
plt.title('Positional Encoding Pattern')
plt.xlabel('Dimension')
plt.ylabel('Position')
plt.colorbar()

# Visualize attention weights for different heads
for head_idx in range(min(4, num_heads)):
    plt.subplot(2, 3, head_idx + 2)
    attention_matrix = mha.all_attention_weights[head_idx][0]  # First batch
    plt.imshow(attention_matrix, cmap='Blues')
    plt.title(f'Attention Head {head_idx + 1}')
    plt.xlabel('Key Position')
    plt.ylabel('Query Position')
    plt.colorbar()

# Compare computational complexity
plt.subplot(2, 3, 6)
sequence_lengths = np.arange(10, 200, 10)
rnn_complexity = sequence_lengths * d_model ** 2  # Recurrent layer: O(n * d²)
attention_complexity = sequence_lengths ** 2 * d_model  # Self-attention layer: O(n² * d)

plt.loglog(sequence_lengths, rnn_complexity, 'r-', label='RNN O(n·d²)', linewidth=2)
plt.loglog(sequence_lengths, attention_complexity, 'b-', label='Attention O(n²·d)', linewidth=2)
plt.xlabel('Sequence Length')
plt.ylabel('Computational Cost')
plt.title('Complexity Comparison')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()

# Compare with PyTorch implementation
class TransformerPyTorch(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
    
    def forward(self, x):
        # Self-attention
        attn_output, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_output)
        
        # Feed-forward
        ff_output = self.ff(x)
        x = self.norm2(x + ff_output)
        
        return x

# Test PyTorch version on the same input as the custom block
transformer_pytorch = TransformerPyTorch(d_model, num_heads, d_ff)
X_torch = torch.as_tensor(X_with_pe, dtype=torch.float32)

with torch.no_grad():
    output_pytorch = transformer_pytorch(X_torch)

# The two models have independently initialized weights, so only output
# shapes and value ranges are comparable, not individual values
print(f"\nComparison with PyTorch:")
print(f"Custom transformer output range: [{transformer_output.min():.3f}, {transformer_output.max():.3f}]")
print(f"PyTorch transformer output range: [{output_pytorch.min():.3f}, {output_pytorch.max():.3f}]")

# Analyze attention patterns
print(f"\nAttention Analysis:")
for i, attention_weights in enumerate(mha.all_attention_weights[:3]):
    # Calculate attention entropy (measure of how spread out the attention is)
    entropy = -np.sum(attention_weights * np.log(attention_weights + 1e-9), axis=-1)
    print(f"Head {i+1} - Average attention entropy: {entropy.mean():.3f}")
    print(f"Head {i+1} - Max attention weight: {attention_weights.max():.3f}")

print("\nAdvantages of Transformer over RNN:")
print("1. Parallelization: All positions processed simultaneously")
print("2. Direct connections: No information bottleneck through hidden states")
print("3. Short gradient paths: Any two positions are one attention step apart, which eases long-range learning")
print("4. Flexible attention: Can focus on relevant positions regardless of distance")

## Summary: Deep Learning Architectures

This notebook covered three fundamental deep learning architectures:

### 1. Convolutional Neural Networks (CNNs)
- **Key Concepts**: Convolution, pooling, feature maps, receptive fields
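- **Mathematical Foundation**: Convolution slides shared local filters over the input, so each output value depends only on a spatial neighborhood and spatial relationships are preserved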
- **Applications**: Image processing, computer vision, spatial pattern recognition
- **Architecture**: LeNet-style with conv → pool → conv → pool → FC layers
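
As a worked instance of the receptive field recurrence $RF_l = RF_{l-1} + (K_l - 1) \prod_{i=1}^{l-1} S_i$, the sketch below tracks the receptive field through a conv → pool → conv → pool stack. The kernel sizes and strides are assumptions (the classic LeNet-5 values), not necessarily those used in this notebook's implementation, and `receptive_field` is a hypothetical helper name.

```python
def receptive_field(layers):
    """Return the receptive field after each layer.

    `layers` is a list of (kernel_size, stride) pairs, ordered input to output.
    """
    rf, jump = 1, 1  # jump = product of the strides of all earlier layers
    history = []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
        history.append(rf)
    return history


# Assumed LeNet-style stack: 5x5 conv, 2x2 pool, 5x5 conv, 2x2 pool
lenet_like = [(5, 1), (2, 2), (5, 1), (2, 2)]
print(receptive_field(lenet_like))  # [5, 6, 14, 16]
```

Under these assumptions, each unit after the second pooling layer sees a 16×16 patch of the input.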

### 2. Recurrent Neural Networks and LSTM
- **Key Concepts**: Sequential processing, hidden states, vanishing gradients
- **Mathematical Foundation**: RNNs process sequences step by step; LSTMs use gates to control information flow
- **Problem Solved**: LSTM gates mitigate the vanishing gradient problem in long sequences
- **Applications**: Natural language processing, time series, sequential data
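
The vanishing gradient problem can be illustrated with a deliberately simplified scalar RNN, $h_t = \tanh(w\,h_{t-1} + x_t)$. This is a minimal sketch, not the notebook's RNN implementation: the function name, the weight `w = 0.9`, and the random inputs are all assumptions chosen for illustration. By the chain rule, $\partial h_T / \partial h_0$ is a product of $T$ factors $w(1 - h_t^2)$, each with magnitude at most $|w|$, so the gradient shrinks at least geometrically when $|w| < 1$.

```python
import numpy as np


def rnn_gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = np.tanh(w * h + x)
        grad *= w * (1.0 - h ** 2)  # chain rule factor d h_t / d h_{t-1}
    return grad


rng = np.random.default_rng(0)
for T in (5, 20, 50):
    xs = rng.normal(scale=0.5, size=T)
    grad = rnn_gradient_through_time(0.9, xs)
    print(f"T={T:2d}  |d h_T / d h_0| = {abs(grad):.2e}  (bound 0.9^T = {0.9 ** T:.2e})")
```

LSTM gating addresses this by giving the cell state an additive update path, so the gradient is not forced through a squashing nonlinearity at every step.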

### 3. Attention Mechanisms and Transformers
- **Key Concepts**: Self-attention, multi-head attention, positional encoding
- **Mathematical Foundation**: Scaled dot-product attention enables direct connections between all positions
- **Advantages**: Parallelizable across positions, short gradient paths between distant positions, flexible attention patterns
- **Applications**: Machine translation, language modeling, sequence-to-sequence tasks
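
For reference, scaled dot-product attention, $\text{softmax}(QK^\top / \sqrt{d_k})\,V$, can be sketched in a few lines of NumPy. This is a standalone illustration rather than the `MultiHeadAttentionCustom` code above; the function name and the mask convention (`True` where attention is allowed) are assumptions.

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V for arrays shaped (batch, seq, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    if mask is not None:
        # Assumed convention: mask is True where attention is allowed
        scores = np.where(mask, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 16))
causal_mask = np.tril(np.ones((8, 8), dtype=bool))  # each position sees itself and earlier ones
output, weights = scaled_dot_product_attention(x, x, x, mask=causal_mask)
print(output.shape, weights.shape)  # (2, 8, 16) (2, 8, 8)
```

Each row of `weights` sums to 1, and with the causal mask every entry above the diagonal is zero.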

### Architecture Evolution
1. **RNNs**: Sequential processing, information bottleneck
2. **LSTMs**: Better long-term dependencies, still sequential
3. **Transformers**: Parallel processing, direct global connections
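
The trade-off behind this progression can be made concrete with rough per-layer cost estimates for sequence length $n$ and model width $d$: a recurrent layer costs about $n \cdot d^2$ operations but needs $n$ sequential steps, while self-attention costs about $n^2 \cdot d$ operations in a single parallel step. The sketch below drops all constant factors, and `layer_costs` is a hypothetical helper, so the numbers indicate scaling only.

```python
def layer_costs(n, d):
    """Rough per-layer costs (constants dropped) for sequence length n and width d."""
    return {
        "recurrent": {"operations": n * d ** 2, "sequential_steps": n, "max_path_length": n},
        "self_attention": {"operations": n ** 2 * d, "sequential_steps": 1, "max_path_length": 1},
    }


for n in (16, 64, 256):
    costs = layer_costs(n, d=64)
    print(
        f"n={n:3d}  recurrent ops={costs['recurrent']['operations']:>9,}  "
        f"attention ops={costs['self_attention']['operations']:>9,}"
    )
```

With $d = 64$, the operation counts cross at $n = d$: attention is cheaper for shorter sequences and more expensive for longer ones, while always keeping the path between any two positions at a single step.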

Each architecture has specific strengths and is suited for different types of data and tasks.