# Module 11: Building Neural Networks in PyTorch

**From Scratch to Production-Ready Models**

---

## Objectives

By the end of this notebook, you will:
- Master the nn.Module class
- Build networks with nn.Linear and nn.Sequential
- Understand the forward method
- Work with parameters and buffers
- Create custom layers

**Prerequisites:** [Module 03 - PyTorch Fundamentals](../03_pytorch_fundamentals/03_pytorch_fundamentals.ipynb)

---

In [13]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt

torch.manual_seed(42)

<torch._C.Generator at 0x7e1c24016070>

---

# Part 1: nn.Module - The Building Block

---

## 1.1 What is nn.Module?

`nn.Module` is the base class for all neural network modules in PyTorch. It provides:
- Parameter management
- GPU transfer
- Saving/loading
- Training/eval modes

In [14]:
# Simplest possible module
class SimpleLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()  # ALWAYS call super().__init__()

        # Define learnable parameters using nn.Parameter
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # Define the forward pass
        return x @ self.weight.T + self.bias

# Test
layer = SimpleLinear(3, 2)
x = torch.randn(4, 3)  # Batch of 4 samples, 3 features each
output = layer(x)  # layer(x) goes through __call__, which runs forward() plus any registered hooks

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")

Input shape: torch.Size([4, 3])
Output shape: torch.Size([4, 2])

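Why call `layer(x)` rather than `layer.forward(x)`? A minimal sketch using a forward hook shows the difference. The names `hooked`, `seen_shapes`, and `sample` are hypothetical and exist only for this illustration.

```python
import torch
import torch.nn as nn

hooked = nn.Linear(3, 2)
seen_shapes = []

# The hook records the output shape each time it fires; returning None leaves the output unchanged
handle = hooked.register_forward_hook(
    lambda module, inputs, output: seen_shapes.append(tuple(output.shape))
)

sample = torch.randn(4, 3)
hooked(sample)          # goes through __call__, so the hook fires
hooked.forward(sample)  # bypasses __call__, so the hook does not fire
handle.remove()

print(seen_shapes)  # [(4, 2)]
```

Only one shape is recorded, which is why calling the module directly is the usual convention.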

In [15]:
# Inspect parameters
print("Parameters:")
for name, param in layer.named_parameters():
    print(f"  {name}: shape={param.shape}, requires_grad={param.requires_grad}")

Parameters:
  weight: shape=torch.Size([2, 3]), requires_grad=True
  bias: shape=torch.Size([2]), requires_grad=True

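Registration depends on wrapping the tensor in `nn.Parameter`. As a small sketch (the class `RegistrationDemo` and its attribute names are made up for illustration), a plain tensor attribute is stored on the object but never shows up in `named_parameters()` or `state_dict()`:

```python
import torch
import torch.nn as nn

class RegistrationDemo(nn.Module):  # hypothetical class, for illustration only
    def __init__(self):
        super().__init__()
        # Wrapped in nn.Parameter: registered, so optimizers and state_dict see it
        self.registered = nn.Parameter(torch.randn(2, 3))
        # Plain tensor attribute: kept on the object but NOT registered
        self.unregistered = torch.randn(2, 3, requires_grad=True)

demo_reg = RegistrationDemo()
param_names = [name for name, _ in demo_reg.named_parameters()]
print(param_names)                   # ['registered']
print(list(demo_reg.state_dict()))   # ['registered']
```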

## 1.2 Using nn.Linear

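PyTorch provides `nn.Linear`, which computes the same `x @ W.T + b` as `SimpleLinear` but initializes the weight and bias from a uniform distribution scaled by the input size, `U(-1/sqrt(in_features), 1/sqrt(in_features))`, instead of unscaled `torch.randn`.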

In [16]:
# nn.Linear is the standard way
linear = nn.Linear(in_features=3, out_features=2)

print(f"Weight shape: {linear.weight.shape}")
print(f"Bias shape: {linear.bias.shape}")

# Without bias
linear_no_bias = nn.Linear(3, 2, bias=False)
print(f"\nWithout bias: bias = {linear_no_bias.bias}")

Weight shape: torch.Size([2, 3])
Bias shape: torch.Size([2])

Without bias: bias = None

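The two claims above (same computation, scaled initialization) can be checked directly. This is a rough sketch; the names `lin`, `inputs`, and `bound` are chosen for the example, and the bound follows the default initialization described in the `nn.Linear` documentation.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
lin = nn.Linear(in_features=3, out_features=2)
inputs = torch.randn(4, 3)

# nn.Linear computes x @ W^T + b, the same expression used in SimpleLinear
manual = inputs @ lin.weight.T + lin.bias
same_result = torch.allclose(lin(inputs), manual, atol=1e-6)

# Default init draws weight and bias from U(-bound, bound) with bound = 1/sqrt(in_features)
bound = 1 / math.sqrt(lin.in_features)
weight_in_bound = bool((lin.weight.abs() <= bound).all())
bias_in_bound = bool((lin.bias.abs() <= bound).all())

print(same_result, weight_in_bound, bias_in_bound)  # True True True
```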

---

# Part 2: Building Networks

---

## 2.1 Class-Based Approach

In [17]:
class MLP(nn.Module):
    """Multi-Layer Perceptron with configurable layers."""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()

        # Define layers
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, output_size)

        # Activation function: a module with no learnable parameters, so one instance can be reused
        self.relu = nn.ReLU()

    def forward(self, x):
        # Layer 1
        x = self.fc1(x)
        x = self.relu(x)

        # Layer 2
        x = self.fc2(x)
        x = self.relu(x)

        # Output layer (no activation for logits)
        x = self.fc3(x)
        return x

# Create and test
model = MLP(input_size=784, hidden_size=128, output_size=10)
x = torch.randn(32, 784)  # Batch of 32 images (flattened 28x28)
output = model(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")

Input shape: torch.Size([32, 784])
Output shape: torch.Size([32, 10])

Total parameters: 118,282

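The total of 118,282 can be reproduced by hand: each `nn.Linear` contributes `in_features * out_features` weights plus `out_features` biases. A short sketch in plain Python (the helper `linear_param_count` is a hypothetical name, not a PyTorch function):

```python
def linear_param_count(in_features, out_features, bias=True):
    """Weights plus optional biases of one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

# Layer sizes of MLP(input_size=784, hidden_size=128, output_size=10)
sizes = [784, 128, 128, 10]
per_layer = [linear_param_count(i, o) for i, o in zip(sizes[:-1], sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [100480, 16512, 1290]
print(total)      # 118282
```

Almost all of the parameters sit in the first layer, because it multiplies the largest input dimension.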

## 2.2 Using nn.Sequential

In [18]:
# nn.Sequential: Quick way to build simple feedforward networks
model_seq = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

output = model_seq(x)
print(f"Output shape: {output.shape}")

# Access layers by index
print(f"\nFirst layer: {model_seq[0]}")

Output shape: torch.Size([32, 10])

First layer: Linear(in_features=784, out_features=128, bias=True)


In [19]:
# Named Sequential with OrderedDict
from collections import OrderedDict

model_named = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(784, 128)),
    ('relu1', nn.ReLU()),
    ('fc2', nn.Linear(128, 128)),
    ('relu2', nn.ReLU()),
    ('fc3', nn.Linear(128, 10))
]))

# Access by name
print(f"Access by name: {model_named.fc1}")

Access by name: Linear(in_features=784, out_features=128, bias=True)

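Besides integer and name access, an `nn.Sequential` can be sliced. The sketch below (with the illustrative names `seq`, `feature_extractor`, and `batch`) takes the first four layers as a feature extractor; the slice is a new `nn.Sequential` that shares the same layer objects rather than copying them.

```python
import torch
import torch.nn as nn

seq = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

# Everything except the final output layer
feature_extractor = seq[:4]
batch = torch.randn(32, 784)
features = feature_extractor(batch)

print(features.shape)                  # torch.Size([32, 128])
print(feature_extractor[0] is seq[0])  # True: layers are shared, not copied

# Unnamed Sequential layers get their index as a name
child_names = [name for name, _ in seq.named_children()]
print(child_names)                     # ['0', '1', '2', '3', '4']
```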

## 2.3 Hybrid Approach (Recommended)

In [20]:
class FlexibleMLP(nn.Module):
    """MLP with variable number of hidden layers."""

    def __init__(self, input_size, hidden_sizes, output_size, dropout=0.0):
        super().__init__()

        # Build layers dynamically
        layers = []
        in_size = input_size

        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(in_size, hidden_size))
            layers.append(nn.ReLU())
            if dropout > 0:
                layers.append(nn.Dropout(dropout))
            in_size = hidden_size

        layers.append(nn.Linear(in_size, output_size))

        # Wrap in Sequential
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)

# Create networks with different architectures
model_shallow = FlexibleMLP(784, [128], 10)
model_deep = FlexibleMLP(784, [256, 128, 64], 10, dropout=0.2)

print(f"Shallow: {sum(p.numel() for p in model_shallow.parameters()):,} params")
print(f"Deep: {sum(p.numel() for p in model_deep.parameters()):,} params")

Shallow: 101,770 params
Deep: 242,762 params

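`FlexibleMLP` wraps its dynamically built list in `nn.Sequential`, and that wrapping step matters. As an illustration (both classes below are hypothetical and define no `forward`), layers kept in a plain Python list are not registered, while `nn.ModuleList` registers them:

```python
import torch.nn as nn

class PlainListNet(nn.Module):  # hypothetical: shows the pitfall
    def __init__(self):
        super().__init__()
        # A plain list hides the layers from nn.Module
        self.layers = [nn.Linear(4, 4), nn.Linear(4, 2)]

class ModuleListNet(nn.Module):  # hypothetical: shows the fix
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers every layer it holds
        self.layers = nn.ModuleList([nn.Linear(4, 4), nn.Linear(4, 2)])

def n_params(module):
    return sum(p.numel() for p in module.parameters())

plain_count = n_params(PlainListNet())
registered_count = n_params(ModuleListNet())
print(plain_count)       # 0
print(registered_count)  # 30  (4*4 + 4 + 4*2 + 2)
```

With zero registered parameters, an optimizer built from `PlainListNet().parameters()` would have nothing to update, and `.to(device)` would not move the layers.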

---

# Part 3: Parameters and Buffers

---

## 3.1 Parameters vs Buffers

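- **Parameters**: Learnable tensors returned by `model.parameters()`, so the optimizer updates them
- **Buffers**: Non-learnable state that is still saved in `state_dict()` and moved by `.to(device)` (e.g., running mean in BatchNorm)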

In [21]:
class LayerWithBuffer(nn.Module):
    def __init__(self, size):
        super().__init__()

        # Learnable parameter
        self.weight = nn.Parameter(torch.randn(size))

        # Non-learnable buffer (still saved/loaded, moves to GPU)
        self.register_buffer('running_mean', torch.zeros(size))
        self.register_buffer('count', torch.tensor(0))

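    def forward(self, x):
        # Update running statistics during training only
        if self.training:
            with torch.no_grad():  # keep buffer updates out of the autograd graph
                batch_mean = x.mean(dim=0)  # per-feature mean, assuming x has shape (batch, size)
                self.running_mean.mul_(0.9).add_(0.1 * batch_mean)
                self.count += 1
        return x * self.weight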

layer = LayerWithBuffer(3)
print("Parameters:")
for name, p in layer.named_parameters():
    print(f"  {name}")

print("\nBuffers:")
for name, b in layer.named_buffers():
    print(f"  {name}")

Parameters:
  weight

Buffers:
  running_mean
  count

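To see how buffers differ from parameters in practice, here is a small sketch. The class `BufferDemo` and the buffer name `scratch` are invented for this example; `persistent=False` is the `register_buffer` option that keeps a buffer out of `state_dict()`.

```python
import torch
import torch.nn as nn

class BufferDemo(nn.Module):  # hypothetical class, for illustration only
    def __init__(self, size):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(size))
        # Persistent buffer (default): saved and loaded with the model
        self.register_buffer('running_mean', torch.zeros(size))
        # Non-persistent buffer: moves with the module but is not saved
        self.register_buffer('scratch', torch.zeros(size), persistent=False)

buf_demo = BufferDemo(3)
state_keys = list(buf_demo.state_dict().keys())
buffer_names = [name for name, _ in buf_demo.named_buffers()]

# The optimizer only ever receives parameters, never buffers
optimizer = torch.optim.SGD(buf_demo.parameters(), lr=0.1)
n_optimized = sum(len(group['params']) for group in optimizer.param_groups)

print(state_keys)    # ['weight', 'running_mean']
print(buffer_names)  # ['running_mean', 'scratch']
print(n_optimized)   # 1
```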

## 3.2 Training vs Evaluation Mode

In [22]:
model = FlexibleMLP(784, [128], 10, dropout=0.5)

# Training mode (default)
model.train()
print(f"Training mode: {model.training}")

# Evaluation mode
model.eval()
print(f"Training mode: {model.training}")

# This affects:
# - Dropout: disabled in eval mode
# - BatchNorm: uses running stats in eval mode
# Note: eval() does not turn off gradient tracking; wrap inference in torch.no_grad() for that

Training mode: True
Training mode: False

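The effect of the mode switch is easiest to see on a bare `nn.Dropout`. In this sketch (the names `drop` and `ones` are illustrative), training mode zeroes roughly half the entries and scales the survivors by `1 / (1 - p)`, while eval mode passes the input through unchanged:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
ones = torch.ones(1000)

drop.train()
train_out = drop(ones)  # each entry is either 0 or 1 / (1 - 0.5) = 2.0

drop.eval()
eval_out = drop(ones)   # dropout is a no-op in eval mode

print(torch.unique(train_out))      # tensor([0., 2.])
print(torch.equal(eval_out, ones))  # True
```

The scaling in training mode keeps the expected activation the same in both modes, so no rescaling is needed at inference time.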

---

# Part 4: Custom Layers

---

In [23]:
class Swish(nn.Module):
    """Swish activation: x * sigmoid(x). Also available built in as nn.SiLU."""
    def forward(self, x):
        return x * torch.sigmoid(x)

class ResidualBlock(nn.Module):
    """Residual block: output = F(x) + x"""

    def __init__(self, size):
        super().__init__()
        self.fc1 = nn.Linear(size, size)
        self.fc2 = nn.Linear(size, size)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        out = out + identity  # Skip connection!
        out = self.relu(out)
        return out

# Stack the residual blocks into a network
class ResNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, n_blocks=3):
        super().__init__()
        self.input_layer = nn.Linear(input_size, hidden_size)
        self.blocks = nn.Sequential(*[ResidualBlock(hidden_size) for _ in range(n_blocks)])
        self.output_layer = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.input_layer(x)
        x = self.blocks(x)
        x = self.output_layer(x)
        return x

model = ResNet(784, 128, 10, n_blocks=3)
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")

Total parameters: 200,842

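A quick sanity check for a custom layer is to compare it with a built-in equivalent when one exists. The sketch below redefines `Swish` so it runs on its own and compares it with `F.silu`, which computes the same function; the name `grid` is only for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Swish(nn.Module):
    """Swish activation: x * sigmoid(x)"""
    def forward(self, x):
        return x * torch.sigmoid(x)

grid = torch.linspace(-5, 5, steps=11)
matches_builtin = torch.allclose(Swish()(grid), F.silu(grid), atol=1e-6)

print(matches_builtin)  # True
```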

---

# Part 5: Useful Patterns

---

In [24]:
# Moving to GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

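# Saving and loading (uncomment to run; writes 'model.pth' to disk)
# Save only the state_dict (parameters and buffers), not the whole model object
# torch.save(model.state_dict(), 'model.pth')

# Load into a model built with the same architecture
# model.load_state_dict(torch.load('model.pth', map_location=device))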

# Count parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Trainable parameters: {count_parameters(model):,}")

Trainable parameters: 200,842

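The save/load pattern can be tried without touching the disk by writing to an in-memory buffer. This is a minimal sketch under that assumption; the names `source`, `target`, `mem_file`, and `probe` are made up for the example, and a real workflow would use a file path as in the commented lines above.

```python
import io
import torch
import torch.nn as nn

torch.manual_seed(0)
source = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
target = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))  # same architecture, different random init

# Save the state_dict to an in-memory file instead of 'model.pth'
mem_file = io.BytesIO()
torch.save(source.state_dict(), mem_file)
mem_file.seek(0)

# Load it back and copy the values into the second model
state = torch.load(mem_file, map_location='cpu')
target.load_state_dict(state)  # strict=True by default: keys must match exactly

probe = torch.randn(3, 4)
same_output = torch.equal(source(probe), target(probe))
print(same_output)  # True
```

Because only tensors are stored, the receiving model must be constructed with the same architecture before `load_state_dict` is called.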

---

# Key Points Summary

---

## nn.Module
- Base class for all neural networks
- Always call `super().__init__()` in constructor
- Define layers in `__init__`, computation in `forward`

## Building Networks
- Use `nn.Linear`, `nn.ReLU`, etc. for standard layers
- Use `nn.Sequential` for simple feedforward networks
- Use class-based approach for complex architectures

## Parameters
- Use `nn.Parameter` for learnable weights
- Use `register_buffer` for non-learnable state
- `model.train()` and `model.eval()` for mode switching

---

# Interview Tips

---

**Q: What is nn.Module?**
A: The base class for all neural networks in PyTorch. It handles parameter registration, GPU transfer, serialization, and training/eval modes.

**Q: When to use nn.Sequential vs class-based?**
A: Use Sequential for simple feedforward networks. Use class-based for custom logic like skip connections, multiple outputs, or conditional execution.

**Q: What's the difference between parameters and buffers?**
A: Parameters are learnable (have gradients). Buffers are non-learnable state that should still be saved/loaded (e.g., running statistics in BatchNorm).

---

## Next Module: [12 - Training Pipeline](../12_training_pipeline/12_training_pipeline.ipynb)

Now let's learn how to properly train these networks.