# PyTorch Modules & Parameters: Neural Network Building Blocks

## 🎯 Introduction

Welcome to the architectural world of PyTorch! This notebook will transform you from someone who writes individual tensor operations into someone who designs elegant, reusable neural network components. Understanding modules and parameters is the key to building scalable deep learning systems.

### 🧠 What You'll Master

This comprehensive guide covers:
- **Module hierarchy**: How PyTorch organizes neural network components
- **Parameter management**: Tracking, initializing, and optimizing learnable weights
- **Module composition**: Building complex architectures from simple components
- **Memory efficiency**: Understanding parameter sharing and storage
- **Advanced patterns**: Custom modules, hooks, and state management

### 🎓 Prerequisites

- Solid understanding of tensors and basic operations
- Familiarity with autograd and gradient computation
- Basic knowledge of neural network concepts (layers, weights, biases)

### 🚀 Why Module Design Matters

Proper module design enables:
- **Modularity**: Reusable components that compose elegantly
- **Automatic differentiation**: Parameters tracked and optimized automatically
- **GPU acceleration**: Seamless device transfers and distributed training
- **Save/load functionality**: Model persistence and deployment
- **Debugging**: Clear component boundaries and parameter inspection

---

## 📚 Table of Contents

1. **[Module Fundamentals](#module-fundamentals)** - Understanding the nn.Module base class
2. **[Parameter Management](#parameter-management)** - How PyTorch tracks and optimizes parameters
3. **[Module Composition Patterns](#module-composition-patterns)** - Building complex architectures
4. **[Parameter Initialization Strategies](#parameter-initialization-strategies)** - Setting up weights for successful training
5. **[Advanced Module Patterns](#advanced-module-patterns)** - Custom components and optimization tricks

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

# Set seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## Module Fundamentals

### 🏗️ The nn.Module Foundation

Every neural network in PyTorch inherits from `nn.Module`. This base class provides the infrastructure for parameter tracking, device management, training state, and much more. Let's explore what makes modules so powerful!
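
As a minimal sketch of that infrastructure, the snippet below shows what gets registered when attributes are assigned inside `__init__`. `TinyModule` and its attribute names are hypothetical and exist only for illustration: an `nn.Parameter` and a submodule are tracked, while a plain tensor attribute is not.

```python
import torch
import torch.nn as nn


class TinyModule(nn.Module):
    """Hypothetical module used only to illustrate attribute registration."""

    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(3))  # registered as a parameter
        self.proj = nn.Linear(3, 2)               # registered as a submodule
        self.plain = torch.zeros(3)               # ordinary attribute, not tracked

    def forward(self, x):
        return self.proj(x * self.scale)


tiny = TinyModule()
param_names = [name for name, _ in tiny.named_parameters()]
child_names = [name for name, _ in tiny.named_children()]
plain_in_state = "plain" in tiny.state_dict()

print(param_names)
print(child_names)
print(f"'plain' saved in state_dict: {plain_in_state}")

# .eval() flips the training flag on the module and on every submodule
tiny.eval()
print(tiny.training, tiny.proj.training)
```

Because `plain` is not registered, it would also be left behind by `.to(device)`; registering it with `register_buffer` is the usual remedy.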

In [None]:
# =============================================================================
# UNDERSTANDING nn.MODULE ARCHITECTURE
# =============================================================================

print("🏗️ The Power of nn.Module")
print("=" * 50)

# Create a simple linear layer to explore module functionality
linear_layer = nn.Linear(4, 2)  # Input: 4 features → Output: 2 features

print(f"Module type: {type(linear_layer)}")
print(f"Module representation: {linear_layer}")

# nn.Module automatically tracks parameters
print(f"\nParameters found by PyTorch:")
for name, param in linear_layer.named_parameters():
    print(f"  {name}: shape {param.shape}, requires_grad={param.requires_grad}")

print(f"\nTotal parameters: {sum(p.numel() for p in linear_layer.parameters()):,}")

print(f"\n🔍 Module State Information")
print("=" * 50)

# Modules track training vs evaluation state
print(f"Training mode: {linear_layer.training}")
print(f"Device: {next(linear_layer.parameters()).device}")

# Submodules (empty for simple linear layer)
print(f"Named submodules: {list(linear_layer.named_children())}")

# Built-in methods for parameter management
print(f"\nBuilt-in parameter methods:")
print(f"  .parameters() - iterator over all parameters")
print(f"  .named_parameters() - parameter names and tensors")
print(f"  .state_dict() - complete state for saving/loading")
print(f"  .train()/.eval() - switch between training/inference modes")
print(f"  .to(device) - move all parameters to device")
print(f"  .zero_grad() - clear gradients of all parameters")

print(f"\n🎯 Why This Design is Brilliant")
print("=" * 50)
print("• Parameters and submodules registered when assigned as attributes")
print("• .to(device) moves every parameter and buffer in one call")
print("• .train()/.eval() propagate the mode flag to all submodules")
print("• Save/load functionality built-in")
print("• Gradient computation integrated seamlessly")
print("• Module composition works recursively")
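
The `.state_dict()` method listed above is the basis for saving and loading. A minimal sketch of a round trip follows; it writes to an in-memory buffer instead of a file purely so the example is self-contained, and the names `source` and `target` are illustrative.

```python
import io

import torch
import torch.nn as nn

source = nn.Linear(4, 2)
target = nn.Linear(4, 2)  # same architecture, different random weights

# Serialize only the state (tensors keyed by name), not the module object
buffer = io.BytesIO()
torch.save(source.state_dict(), buffer)
buffer.seek(0)

# Load the saved tensors into a freshly constructed module
target.load_state_dict(torch.load(buffer))

weights_match = torch.equal(source.weight, target.weight)
bias_match = torch.equal(source.bias, target.bias)
print(f"Weights match after loading: {weights_match and bias_match}")
```

In practice the buffer would be a file path, and the target module must be constructed with the same architecture before `load_state_dict` is called.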

## Parameter Management

### 🎛️ How PyTorch Tracks and Optimizes Parameters

Parameters are the heart of neural networks - they're the learnable weights that get updated during training. PyTorch's parameter management system is designed to make this process seamless and efficient.
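
To make "updated during training" concrete, here is a minimal sketch of a single optimization step on one linear layer. It assumes plain SGD without momentum, so the update should reduce to `w_new = w_old - lr * grad`; the data is random and only serves to produce a gradient.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
layer = nn.Linear(3, 1)
lr = 0.1
optimizer = optim.SGD(layer.parameters(), lr=lr)

x = torch.randn(8, 3)
y = torch.randn(8, 1)

weight_before = layer.weight.detach().clone()

optimizer.zero_grad()                       # clear any stale gradients
loss = nn.functional.mse_loss(layer(x), y)  # forward pass
loss.backward()                             # fills layer.weight.grad

# What plain SGD should produce for the weight
expected = weight_before - lr * layer.weight.grad

optimizer.step()                            # in-place parameter update

update_matches = torch.allclose(layer.weight, expected)
print(f"Update equals w - lr * grad: {update_matches}")
```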

In [None]:
# =============================================================================
# DEEP DIVE INTO PARAMETER MECHANICS
# =============================================================================

print("🎛️ Parameter Deep Dive")
print("=" * 50)

# Create a module to examine parameter behavior
class SimpleModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # These become registered parameters automatically
        self.fc1 = nn.Linear(input_size, hidden_size)    # weight + bias parameters
        self.fc2 = nn.Linear(hidden_size, output_size)   # weight + bias parameters
        
        # Manual parameter registration (less common but useful)
        self.manual_param = nn.Parameter(torch.randn(hidden_size, 1))
        
        # Non-parameter tensors (not optimized)
        self.register_buffer('running_mean', torch.zeros(hidden_size))
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Instantiate model to explore parameter structure
model = SimpleModel(input_size=5, hidden_size=10, output_size=3)

print(f"Model architecture:")
print(model)

print(f"\n📊 Parameter Analysis")
print("=" * 50)

total_params = 0
trainable_params = 0

print("Layer-by-layer parameter breakdown:")
for name, param in model.named_parameters():
    param_count = param.numel()
    total_params += param_count
    if param.requires_grad:
        trainable_params += param_count
    
    print(f"  {name:15} | Shape: {str(param.shape):15} | Count: {param_count:6,} | Trainable: {param.requires_grad}")

print(f"\nParameter Summary:")
print(f"  Total parameters: {total_params:,}")
print(f"  Trainable parameters: {trainable_params:,}")
print(f"  Non-trainable parameters: {total_params - trainable_params:,}")

# Memory usage calculation (element_size() returns bytes per element, 4 for float32)
param_memory_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / (1024**2)
print(f"  Memory usage: {param_memory_mb:.2f} MB")

print(f"\n🔍 Buffers vs Parameters")
print("=" * 50)

print("Registered buffers (not optimized):")
for name, buffer in model.named_buffers():
    print(f"  {name}: shape {buffer.shape}, device {buffer.device}")

print("\nKey differences:")
print("• Parameters: learnable, returned by .parameters(), receive gradients")
print("• Buffers: non-learnable state, saved in state_dict, moved with the model, never passed to the optimizer")
print("• Use buffers for: running statistics, lookup tables, constants")

print(f"\n⚙️ Parameter State Management")
print("=" * 50)

# Demonstrate state dictionary functionality
state_dict = model.state_dict()
print(f"State dict keys: {list(state_dict.keys())}")

# Show how parameters can be frozen/unfrozen
print(f"\nFreezing/unfreezing parameters:")
print("Before freezing fc1:")
for name, param in model.named_parameters():
    if 'fc1' in name:
        print(f"  {name}: requires_grad = {param.requires_grad}")

# Freeze fc1 parameters (common in transfer learning)
for param in model.fc1.parameters():
    param.requires_grad = False

print("\nAfter freezing fc1:")
for name, param in model.named_parameters():
    if 'fc1' in name:
        print(f"  {name}: requires_grad = {param.requires_grad}")

# Count trainable parameters after freezing
trainable_after_freeze = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTrainable parameters after freezing fc1: {trainable_after_freeze:,}")
print(f"Reduction: {total_params - trainable_after_freeze:,} parameters frozen")
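
Freezing only pays off if the optimizer respects it. The sketch below, using a small hypothetical `nn.Sequential` model, passes just the trainable parameters to the optimizer and checks that the frozen layer is untouched after one step.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(5, 10), nn.ReLU(), nn.Linear(10, 3))

# Freeze the first layer, as one might freeze a pretrained backbone
for p in net[0].parameters():
    p.requires_grad = False

# Hand only the trainable parameters to the optimizer
trainable = [p for p in net.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable, lr=0.1)

frozen_before = net[0].weight.detach().clone()
head_before = net[2].weight.detach().clone()

loss = net(torch.randn(4, 5)).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

frozen_unchanged = torch.equal(net[0].weight, frozen_before)
head_changed = not torch.equal(net[2].weight, head_before)
frozen_has_no_grad = net[0].weight.grad is None

print(f"Frozen layer unchanged: {frozen_unchanged}")
print(f"Head layer updated:     {head_changed}")
print(f"Frozen layer has no gradient: {frozen_has_no_grad}")
```

Filtering on `requires_grad` also keeps the optimizer from allocating state for parameters it will never update.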

## Module Composition Patterns

### 🧩 Building Complex Architectures

The real power of PyTorch modules comes from composition - building complex networks from simple, reusable components. This is how modern architectures like transformers are constructed!

In [None]:
# =============================================================================
# ADVANCED MODULE COMPOSITION PATTERNS
# =============================================================================

print("🧩 Module Composition Mastery")
print("=" * 50)

# Pattern 1: Sequential composition (most common)
class MLPBlock(nn.Module):
    """A reusable MLP block with configurable activation and dropout."""
    
    def __init__(self, input_dim, output_dim, dropout_rate=0.1, activation='relu'):
        super().__init__()
        self.linear = nn.Linear(input_dim, output_dim)
        self.dropout = nn.Dropout(dropout_rate)
        
        # Flexible activation selection
        if activation == 'relu':
            self.activation = nn.ReLU()
        elif activation == 'gelu':
            self.activation = nn.GELU()
        elif activation == 'tanh':
            self.activation = nn.Tanh()
        else:
            raise ValueError(f"Unknown activation: {activation}")
    
    def forward(self, x):
        # Standard pattern: linear → activation → dropout
        x = self.linear(x)
        x = self.activation(x)
        x = self.dropout(x)
        return x

# Pattern 2: Container-based composition
class FlexibleMLP(nn.Module):
    """MLP with variable depth using nn.ModuleList."""
    
    def __init__(self, input_dim, hidden_dims, output_dim, dropout_rate=0.1):
        super().__init__()
        
        # Input projection
        self.input_layer = MLPBlock(input_dim, hidden_dims[0], dropout_rate)
        
        # Hidden layers using ModuleList for dynamic composition
        self.hidden_layers = nn.ModuleList([
            MLPBlock(hidden_dims[i], hidden_dims[i+1], dropout_rate)
            for i in range(len(hidden_dims) - 1)
        ])
        
        # Output projection (no activation/dropout typically)
        self.output_layer = nn.Linear(hidden_dims[-1], output_dim)
        
        print(f"Created FlexibleMLP with {len(hidden_dims)} hidden layers")
        print(f"Architecture: {input_dim} → {' → '.join(map(str, hidden_dims))} → {output_dim}")
    
    def forward(self, x):
        # Process through all layers sequentially
        x = self.input_layer(x)
        
        for layer in self.hidden_layers:
            x = layer(x)
        
        x = self.output_layer(x)
        return x

# Demonstrate flexible architecture creation
print("Example 1: Simple 2-layer MLP")
simple_mlp = FlexibleMLP(
    input_dim=10,
    hidden_dims=[64, 32],  # Two hidden layers
    output_dim=5,
    dropout_rate=0.2
)

print(f"\nExample 2: Deep 5-layer MLP")
deep_mlp = FlexibleMLP(
    input_dim=50,
    hidden_dims=[256, 128, 64, 32, 16],  # Five hidden layers  
    output_dim=10,
    dropout_rate=0.1
)

print(f"\n📊 Parameter Comparison")
print("=" * 50)

simple_params = sum(p.numel() for p in simple_mlp.parameters())
deep_params = sum(p.numel() for p in deep_mlp.parameters())

print(f"Simple MLP parameters: {simple_params:,}")
print(f"Deep MLP parameters: {deep_params:,}")
print(f"Parameter ratio (deep/simple): {deep_params / simple_params:.1f}x")

# Pattern 3: Residual connections (transformer-style)
class ResidualBlock(nn.Module):
    """Block with residual connection - foundation of modern architectures."""
    
    def __init__(self, dim, expansion_factor=4, dropout_rate=0.1):
        super().__init__()
        hidden_dim = dim * expansion_factor
        
        # Two-layer feedforward with expansion and contraction
        self.ffn = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),  # Modern activation choice
            nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout_rate)
        )
        
        # Layer normalization (applied before FFN in modern architectures)
        self.norm = nn.LayerNorm(dim)
    
    def forward(self, x):
        # Pre-norm residual connection pattern
        # This is the pattern used in GPT, T5, and other modern models
        residual = x
        x = self.norm(x)        # Normalize first
        x = self.ffn(x)         # Apply transformation
        x = x + residual        # Add residual connection
        return x

# Pattern 4: Transformer-style stacking
class MiniTransformerBlock(nn.Module):
    """Simplified transformer block showing composition patterns."""
    
    def __init__(self, d_model, num_layers=3, expansion_factor=4, dropout_rate=0.1):
        super().__init__()
        
        # Stack multiple residual blocks
        self.layers = nn.ModuleList([
            ResidualBlock(d_model, expansion_factor, dropout_rate)
            for _ in range(num_layers)
        ])
        
        # Final normalization
        self.final_norm = nn.LayerNorm(d_model)
    
    def forward(self, x):
        # Process through each layer
        for layer in self.layers:
            x = layer(x)
        
        # Final normalization
        x = self.final_norm(x)
        return x

# Test the transformer-style architecture
print(f"\n🏗️ Advanced Architecture Example")
print("=" * 50)

mini_transformer = MiniTransformerBlock(d_model=128, num_layers=4)

print(f"Mini-transformer parameters: {sum(p.numel() for p in mini_transformer.parameters()):,}")

# Test with realistic input
batch_size, seq_len, d_model = 2, 10, 128
test_input = torch.randn(batch_size, seq_len, d_model)
output = mini_transformer(test_input)

print(f"Input shape: {test_input.shape}")
print(f"Output shape: {output.shape}")
print(f"✓ Shape preserved through all layers (essential for residual connections)")

print(f"\n💡 Composition Principles")
print("=" * 50)
print("1. **Modularity**: Each component has a single, clear responsibility")
print("2. **Reusability**: Blocks can be used in different contexts")
print("3. **Composability**: Complex architectures built from simple parts")
print("4. **Independent instances**: Same block type, separate parameters per instance (reuse one instance to share weights)")
print("5. **Shape consistency**: Outputs match expected inputs for next layer")
print("6. **Gradient flow**: Residual connections enable deep networks")
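
The patterns above store their layers in `nn.ModuleList` rather than a plain Python list, and the difference matters. A minimal sketch with two hypothetical classes, `WithPlainList` and `WithModuleList`, shows that layers held in a plain list are invisible to `.parameters()`, and therefore to the optimizer, `.to(device)`, and `state_dict()`.

```python
import torch.nn as nn


class WithPlainList(nn.Module):
    """Hypothetical: layers kept in a plain list are NOT registered."""

    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(4, 4) for _ in range(3)]


class WithModuleList(nn.Module):
    """Hypothetical: layers kept in nn.ModuleList ARE registered."""

    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])


plain_count = sum(p.numel() for p in WithPlainList().parameters())
registered_count = sum(p.numel() for p in WithModuleList().parameters())

print(f"Plain list:  {plain_count} parameters visible")
print(f"ModuleList:  {registered_count} parameters visible")
```

Each `nn.Linear(4, 4)` holds 16 weights and 4 biases, so three registered layers account for 60 parameters.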

## Parameter Initialization Strategies

In [None]:
# Different initialization strategies
def xavier_init(m):
    """Xavier (Glorot) initialization - good for tanh/sigmoid activations"""
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Embedding):
        nn.init.xavier_uniform_(m.weight)

def kaiming_init(m):
    """Kaiming (He) initialization - good for ReLU activations"""
    if isinstance(m, nn.Linear):
        nn.init.kaiming_uniform_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Embedding):
        nn.init.normal_(m.weight, std=0.1)

def normal_init(m):
    """Simple normal initialization"""
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0, std=0.01)
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Embedding):
        nn.init.normal_(m.weight, std=0.1)

# Compare initialization effects
def compare_initializations():
    """Compare different initialization strategies"""
    
    # Create three models with the same architecture
    # (SimpleModel is the class defined in the Parameter Management section)
    model_xavier = SimpleModel(100, 50, 10)
    model_kaiming = SimpleModel(100, 50, 10)
    model_normal = SimpleModel(100, 50, 10)
    
    # Apply different initializations
    model_xavier.apply(xavier_init)
    model_kaiming.apply(kaiming_init)
    model_normal.apply(normal_init)
    
    models = {
        'Xavier (Glorot)': model_xavier,
        'Kaiming (He)': model_kaiming,
        'Normal (0.01)': model_normal
    }
    
    # Test with random input
    x = torch.randn(32, 100)  # Batch of 32 samples
    
    print("Initialization comparison:")
    print("=" * 60)
    
    for name, model in models.items():
        model.eval()  # Set to eval mode for consistent comparison
        
        with torch.no_grad():
            output = model(x)
            
        # Analyze weight statistics
        fc1_weight = model.fc1.weight
        fc2_weight = model.fc2.weight
        
        print(f"\n{name}:")
        print(f"  FC1 weight stats: mean={fc1_weight.mean().item():.6f}, std={fc1_weight.std().item():.6f}")
        print(f"  FC2 weight stats: mean={fc2_weight.mean().item():.6f}, std={fc2_weight.std().item():.6f}")
        print(f"  Output stats: mean={output.mean().item():.6f}, std={output.std().item():.6f}")
        print(f"  Output range: [{output.min().item():.6f}, {output.max().item():.6f}]")

compare_initializations()

# Demonstrate the impact of bad initialization
print("\n" + "=" * 60)
print("Effect of bad initialization:")

def bad_init(m):
    """Intentionally bad initialization"""
    if isinstance(m, nn.Linear):
        nn.init.constant_(m.weight, 10.0)  # Too large!
        if m.bias is not None:
            nn.init.constant_(m.bias, 0.0)

model_bad = SimpleModel(10, 20, 5)
model_bad.apply(bad_init)

x = torch.randn(5, 10)
with torch.no_grad():
    output_bad = model_bad(x)
    
print(f"Bad initialization output stats:")
print(f"  Mean: {output_bad.mean().item():.2f}")
print(f"  Std: {output_bad.std().item():.2f}")
print(f"  Range: [{output_bad.min().item():.2f}, {output_bad.max().item():.2f}]")
print(f"  Contains NaN: {torch.isnan(output_bad).any().item()}")
print("\nNote: Large values can lead to vanishing/exploding gradients!")
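
The weight statistics printed above can be checked against closed-form values. A uniform distribution on `[-a, a]` has standard deviation `a / sqrt(3)`, which gives `sqrt(2 / (fan_in + fan_out))` for Xavier uniform and `sqrt(2 / fan_in)` for Kaiming uniform with a ReLU gain. The sketch below assumes the PyTorch defaults (gain of 1 for Xavier, `mode='fan_in'` for Kaiming) and uses a larger matrix than the cells above so that the sample estimate is tight.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)
fan_in, fan_out = 512, 256
weight = torch.empty(fan_out, fan_in)  # same layout as nn.Linear.weight

# Xavier uniform: std = sqrt(2 / (fan_in + fan_out))
nn.init.xavier_uniform_(weight)
xavier_expected = math.sqrt(2.0 / (fan_in + fan_out))
xavier_measured = weight.std().item()

# Kaiming uniform (ReLU gain sqrt(2), fan_in mode): std = sqrt(2 / fan_in)
nn.init.kaiming_uniform_(weight, nonlinearity='relu')
kaiming_expected = math.sqrt(2.0 / fan_in)
kaiming_measured = weight.std().item()

print(f"Xavier  std: expected {xavier_expected:.4f}, measured {xavier_measured:.4f}")
print(f"Kaiming std: expected {kaiming_expected:.4f}, measured {kaiming_measured:.4f}")
```

Kaiming depends only on `fan_in`, which is why it comes out wider than Xavier whenever `fan_out` is positive.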

In [None]:
# Visualize weight distributions after initialization
def visualize_weight_distributions():
    # Create models with different initializations
    model_xavier = nn.Linear(100, 100)
    model_kaiming = nn.Linear(100, 100)
    model_normal = nn.Linear(100, 100)
    
    nn.init.xavier_uniform_(model_xavier.weight)
    nn.init.kaiming_uniform_(model_kaiming.weight, nonlinearity='relu')
    nn.init.normal_(model_normal.weight, std=0.01)
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    weights = {
        'Xavier': model_xavier.weight.detach().numpy().flatten(),
        'Kaiming': model_kaiming.weight.detach().numpy().flatten(),
        'Normal (0.01)': model_normal.weight.detach().numpy().flatten()
    }
    
    for i, (name, weight) in enumerate(weights.items()):
        axes[i].hist(weight, bins=50, alpha=0.7, density=True)
        axes[i].set_title(f'{name} Initialization')
        axes[i].set_xlabel('Weight Value')
        axes[i].set_ylabel('Density')
        axes[i].grid(True, alpha=0.3)
        
        # Add statistics
        mean, std = weight.mean(), weight.std()
        axes[i].axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.4f}')
        axes[i].legend()
        axes[i].set_title(f'{name}\nMean: {mean:.4f}, Std: {std:.4f}')
    
    plt.tight_layout()
    plt.show()
    
    print("Key insights:")
    print("• Xavier: Balanced for tanh/sigmoid activations")
    print("• Kaiming: Wider distribution for ReLU activations")
    print("• Normal (0.01): Very narrow, might cause vanishing gradients")

visualize_weight_distributions()

## Advanced Module Patterns

In [None]:
# Custom module with learnable parameters
import math

class CustomLinear(nn.Module):
    """Custom linear layer to demonstrate parameter creation"""
    
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        
        # Create learnable parameters (values are set in reset_parameters)
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        
        if bias:
            self.bias = nn.Parameter(torch.empty(out_features))
        else:
            # Register as None so it doesn't appear in parameters()
            self.register_parameter('bias', None)
        
        # Initialize parameters
        self.reset_parameters()
    
    def reset_parameters(self):
        # Same default scheme that nn.Linear uses
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            # fan_in of an (out_features, in_features) weight is in_features
            bound = 1 / math.sqrt(self.in_features)
            nn.init.uniform_(self.bias, -bound, bound)
    
    def forward(self, x):
        # Manual implementation of linear transformation
        output = torch.matmul(x, self.weight.t())
        if self.bias is not None:
            output = output + self.bias
        return output
    
    def extra_repr(self):
        # Custom string representation
        return f'in_features={self.in_features}, out_features={self.out_features}, bias={self.bias is not None}'

# Test custom linear layer

custom_linear = CustomLinear(10, 5)
builtin_linear = nn.Linear(10, 5)

print("Custom linear layer:")
print(custom_linear)
print(f"Parameters: {sum(p.numel() for p in custom_linear.parameters())}")

# Copy the built-in layer's weights so both layers compute the same function
with torch.no_grad():
    custom_linear.weight.copy_(builtin_linear.weight)
    custom_linear.bias.copy_(builtin_linear.bias)

x = torch.randn(3, 10)
output_custom = custom_linear(x)
output_builtin = builtin_linear(x)

print(f"\nOutput shapes - Custom: {output_custom.shape}, Built-in: {output_builtin.shape}")
print(f"Outputs match: {torch.allclose(output_custom, output_builtin, atol=1e-6)}")

In [None]:
# Module with submodules and parameter sharing
class ModularNet(nn.Module):
    """Demonstrate modular architecture and parameter sharing"""
    
    def __init__(self, input_size, hidden_size, num_layers):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # Input projection
        self.input_proj = nn.Linear(input_size, hidden_size)
        
        # Shared transformation (parameter sharing)
        self.shared_transform = nn.Linear(hidden_size, hidden_size)
        
        # Layer-specific transformations
        self.layer_transforms = nn.ModuleList([
            nn.Linear(hidden_size, hidden_size) for _ in range(num_layers)
        ])
        
        # Layer normalization for each layer
        self.layer_norms = nn.ModuleList([
            nn.LayerNorm(hidden_size) for _ in range(num_layers)
        ])
        
        # Output projection
        self.output_proj = nn.Linear(hidden_size, input_size)
        
    def forward(self, x):
        # Input projection
        x = F.relu(self.input_proj(x))
        
        # Process through layers
        for i in range(self.num_layers):
            residual = x
            
            # Apply shared transformation (parameter sharing across layers)
            x = self.shared_transform(x)
            
            # Apply layer-specific transformation
            x = self.layer_transforms[i](x)
            
            # Residual connection and layer norm
            x = self.layer_norms[i](x + residual)
            x = F.relu(x)
        
        # Output projection
        x = self.output_proj(x)
        return x

# Create and analyze modular network
modular_net = ModularNet(input_size=64, hidden_size=128, num_layers=4)

print("Modular Network Architecture:")
print(modular_net)

print("\nParameter analysis:")
total_params = 0
for name, param in modular_net.named_parameters():
    print(f"{name:30s}: {str(param.shape):20s} {param.numel():>8,d}")
    total_params += param.numel()

print(f"\nTotal parameters: {total_params:,}")

# Note: shared_transform parameters are used across all layers
print(f"\nShared parameters (used {modular_net.num_layers} times): {modular_net.shared_transform.weight.numel() + modular_net.shared_transform.bias.numel():,}")

# Test forward pass
x = torch.randn(5, 64)
output = modular_net(x)
print(f"\nForward pass: {x.shape} -> {output.shape}")
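
Sharing works because the same module object, and therefore the same parameter tensors, is reused. As a minimal sketch (the names `shared` and `stack` are illustrative), placing one `nn.Linear` instance at two positions of an `nn.Sequential` shows that `.parameters()` reports the shared tensors only once and that gradients from both uses accumulate into the same `.grad`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
shared = nn.Linear(8, 8)

# The same instance appears twice, so positions 0 and 2 share weights
stack = nn.Sequential(shared, nn.ReLU(), shared)

unique_params = sum(p.numel() for p in stack.parameters())
same_object = stack[0] is stack[2]
same_storage = stack[0].weight.data_ptr() == stack[2].weight.data_ptr()

# One backward pass sends gradient from both uses into a single tensor
stack(torch.randn(4, 8)).sum().backward()
grad_shape = tuple(shared.weight.grad.shape)

print(f"Unique parameters: {unique_params}")
print(f"Same module object: {same_object}, same storage: {same_storage}")
print(f"Shared gradient shape: {grad_shape}")
```

Two independent 8-by-8 layers would hold 144 parameters; the shared version holds 72.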

In [None]:
# Hooks for monitoring activations and gradients
class MonitoredNet(nn.Module):
    """Network with built-in monitoring capabilities"""
    
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, output_size)
        
        # Storage for activations and gradients
        self.activations = {}
        self.gradients = {}
        
        # Register hooks
        self._register_hooks()
    
    def _register_hooks(self):
        def save_activation(name):
            def hook(module, input, output):
                self.activations[name] = output.detach()
            return hook
        
        def save_gradient(name):
            def hook(module, grad_input, grad_output):
                if grad_output[0] is not None:
                    self.gradients[name] = grad_output[0].detach()
            return hook
        
        # Register forward and backward hooks
        self.fc1.register_forward_hook(save_activation('fc1'))
        self.fc2.register_forward_hook(save_activation('fc2'))
        self.fc3.register_forward_hook(save_activation('fc3'))
        
        # register_backward_hook is deprecated; register_full_backward_hook replaces it
        self.fc1.register_full_backward_hook(save_gradient('fc1'))
        self.fc2.register_full_backward_hook(save_gradient('fc2'))
        self.fc3.register_full_backward_hook(save_gradient('fc3'))
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
    
    def get_activation_stats(self):
        """Get statistics about activations"""
        stats = {}
        for name, activation in self.activations.items():
            stats[name] = {
                'mean': activation.mean().item(),
                'std': activation.std().item(),
                'min': activation.min().item(),
                'max': activation.max().item(),
                'zeros': (activation == 0).float().mean().item()  # Sparsity for ReLU
            }
        return stats
    
    def get_gradient_stats(self):
        """Get statistics about gradients"""
        stats = {}
        for name, gradient in self.gradients.items():
            stats[name] = {
                'mean': gradient.mean().item(),
                'std': gradient.std().item(),
                'norm': gradient.norm().item()
            }
        return stats

# Test monitored network
monitored_net = MonitoredNet(20, 50, 5)
optimizer = optim.SGD(monitored_net.parameters(), lr=0.01)
criterion = nn.MSELoss()

# Forward and backward pass
x = torch.randn(10, 20)
y_true = torch.randn(10, 5)

# Forward pass
y_pred = monitored_net(x)
loss = criterion(y_pred, y_true)

# Backward pass
optimizer.zero_grad()
loss.backward()

# Analyze activations and gradients
print("Activation Statistics:")
print("-" * 50)
activation_stats = monitored_net.get_activation_stats()
for layer, stats in activation_stats.items():
    print(f"{layer}:")
    print(f"  Mean: {stats['mean']:.4f}, Std: {stats['std']:.4f}")
    print(f"  Range: [{stats['min']:.4f}, {stats['max']:.4f}]")
    print(f"  Sparsity (zeros): {stats['zeros']:.2%}")

print("\nGradient Statistics:")
print("-" * 50)
gradient_stats = monitored_net.get_gradient_stats()
for layer, stats in gradient_stats.items():
    print(f"{layer}:")
    print(f"  Mean: {stats['mean']:.6f}, Std: {stats['std']:.6f}")
    print(f"  Norm: {stats['norm']:.6f}")

print("\n🎉 Module exploration completed!")
print("\nKey takeaways:")
print("• Use nn.Module as base class for all models")
print("• nn.Parameter attributes and submodules are registered automatically on assignment")
print("• Proper initialization is crucial for training success")
print("• ModuleList and ModuleDict help organize complex architectures")
print("• Hooks enable monitoring and debugging of model internals")
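
One practical note on the hooks used in `MonitoredNet`: every `register_*_hook` call returns a handle, and a hook stays attached until that handle's `remove()` is called. A minimal sketch of attaching a forward hook temporarily, with the illustrative names `record_output` and `captured`:

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
captured = []


def record_output(module, inputs, output):
    """Forward hook: store a detached copy of the layer output."""
    captured.append(output.detach())


handle = layer.register_forward_hook(record_output)
layer(torch.randn(3, 4))   # hook fires
handle.remove()            # detach the hook
layer(torch.randn(3, 4))   # hook no longer fires

calls_recorded = len(captured)
print(f"Hook fired {calls_recorded} time(s)")
```

Keeping the handles makes it possible to switch monitoring off for production inference, where stored activations would only cost memory.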