# Week 2: Multi-Layer Perceptrons for Molecular Property Prediction

## Key Concepts

### 1. Limitations of Linear Models
`nn.Linear(5, 1)` can only learn a weighted sum of its inputs plus a bias, i.e. a straight-line (affine) relationship between inputs and outputs.
Stacking multiple linear layers without activation functions reduces to a single linear transformation — depth adds nothing.
Real molecular properties involve non-linear relationships between descriptors that a single linear layer cannot capture.
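
To make the "depth adds nothing" point concrete, the sketch below folds two stacked linear layers into a single equivalent layer and checks that both give the same output. The layer sizes and the names `stacked` and `collapsed` are chosen here purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two linear layers stacked with no activation in between
stacked = nn.Sequential(nn.Linear(5, 16), nn.Linear(16, 1))
first, second = stacked[0], stacked[1]

# Fold them into one layer: W = W2 @ W1 and b = W2 @ b1 + b2
collapsed = nn.Linear(5, 1)
with torch.no_grad():
    collapsed.weight.copy_(second.weight @ first.weight)
    collapsed.bias.copy_(second.weight @ first.bias + second.bias)

x = torch.rand(10, 5)
with torch.no_grad():
    same = torch.allclose(stacked(x), collapsed(x), atol=1e-5)
print(same)  # the two models agree up to floating-point error
```

The 16 hidden neurons add parameters but no expressive power: any function the stacked model represents is already representable by `nn.Linear(5, 1)`.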

### 2. Activation Functions
Activation functions introduce non-linearity between linear layers, allowing the network to learn complex patterns.

**ReLU** (Rectified Linear Unit):
```
ReLU(x) = max(0, x)
```
Negative inputs become zero, positive inputs pass through unchanged.
Linear layers rotate and scale data; ReLU bends it. With enough neurons, repeated bending lets the network approximate a very wide range of continuous relationships.
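
As a small illustration of "bending", the sketch below applies ReLU to a few values and then combines two ReLU units to build the absolute-value function, a V-shaped curve that no single linear layer can represent. The input values are arbitrary.

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# Negative inputs become zero, positive inputs pass through unchanged
y = torch.relu(x)
print(y.tolist())  # [0.0, 0.0, 0.0, 0.5, 2.0]

# Two ReLU units are enough to build |x|: one bend at zero
abs_via_relu = torch.relu(x) + torch.relu(-x)
print(abs_via_relu.tolist())  # [2.0, 0.5, 0.0, 0.5, 2.0]
```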

### 3. Dropout
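Dropout is a regularisation technique that reduces overfitting: it randomly deactivates neurons during training, so the network cannot rely on any single neuron and the remaining neurons must learn the relationship independently.
- **During training:** each neuron is zeroed with probability `p`, and the surviving activations are scaled by `1 / (1 - p)` so the expected output stays the same
- **During inference:** all neurons are active and nothing is rescaled

PyTorch switches between the two behaviours via `model.train()` and `model.eval()`. The switch is not automatic: call `model.eval()` before validation or inference, and `model.train()` before resuming training.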
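
The two modes can be compared directly. This is a minimal sketch using a standalone `nn.Dropout` layer with `p=0.5` on a tensor of ones; the tensor size and the seed are arbitrary choices for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()          # training mode: dropout is active
train_out = drop(x)   # entries are either 0 or 1 / (1 - 0.5) = 2.0

drop.eval()           # inference mode: dropout is a no-op
eval_out = drop(x)

zero_fraction = (train_out == 0).float().mean().item()
print(f"zeroed in train mode: {zero_fraction:.2f}")   # close to 0.5
print("eval output unchanged:", torch.equal(eval_out, x))
```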

### 4. Batch Normalisation
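Normalises each feature of a layer's output to mean ~0 and standard deviation ~1, using statistics computed across the batch, before passing it to the next layer.
This keeps activations in a stable range throughout the network, which helps prevent gradients from growing too large or too small.
In `model.eval()` mode the layer uses running averages of the mean and variance collected during training instead of the current batch statistics.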

PyTorch offers three variants depending on the dimensionality of the input data:

| Variant | Input Shape | Use Case |
|---|---|---|
| `nn.BatchNorm1d` | `(batch_size, features)` | Flat feature vectors — molecular descriptors, fingerprints |
| `nn.BatchNorm2d` | `(batch_size, channels, height, width)` | Image feature maps — CNNs |
| `nn.BatchNorm3d` | `(batch_size, channels, depth, height, width)` | Volumetric data — 3D protein structures |

For MLPs working with molecular descriptors, always use `nn.BatchNorm1d`.
`nn.BatchNorm2d` will appear later when working with CNNs for image-based molecular representations.
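
The sketch below checks the per-feature behaviour of `nn.BatchNorm1d` on three synthetic columns with very different scales (the scales and offsets are made up for the demonstration), and shows that the output changes once the layer is switched to evaluation mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)

# 64 samples, 3 features on very different scales (illustrative values)
x = torch.randn(64, 3) * torch.tensor([1.0, 10.0, 100.0]) + torch.tensor([0.0, 5.0, 300.0])

with torch.no_grad():
    bn.train()
    train_out = bn(x)   # normalised with the statistics of this batch
    col_mean = train_out.mean(dim=0)
    col_std = ((train_out - col_mean) ** 2).mean(dim=0).sqrt()

    bn.eval()
    eval_out = bn(x)    # normalised with the running estimates instead

print(col_mean)  # each column close to 0
print(col_std)   # each column close to 1
```

After a single training batch the running estimates are still far from the true statistics, so `eval_out` differs from `train_out`; the two converge as more batches are seen.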


Typical layer block in a modern MLP:
```
Linear → BatchNorm → ReLU → Dropout
```
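
One way to avoid repeating this pattern by hand is a small factory function. The helper name `mlp_block` and its default dropout rate are assumptions made for this sketch, not part of PyTorch.

```python
import torch
import torch.nn as nn

def mlp_block(in_features, out_features, p_drop=0.2):
    """Hypothetical helper: one Linear -> BatchNorm -> ReLU -> Dropout block."""
    return nn.Sequential(
        nn.Linear(in_features, out_features),
        nn.BatchNorm1d(out_features),
        nn.ReLU(),
        nn.Dropout(p_drop),
    )

block = mlp_block(5, 16)
out = block(torch.rand(8, 5))  # batch of 8 samples with 5 features
print(out.shape)  # torch.Size([8, 16])
```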

### 5. Input Normalisation
Molecular descriptors naturally span very different scales (MW: 200-500, HBD: 0-5).
Without normalisation, large-valued descriptors dominate the gradients, causing unstable training.
Normalising to mean 0 and std 1 puts all descriptors on equal footing.

**Critical rule:** always normalise validation and test data using training set statistics:
```python
mean = X_train.mean(dim=0)
std = X_train.std(dim=0)

X_train = (X_train - mean) / std
X_val = (X_val - mean) / std   # training stats, not val stats
X_test = (X_test - mean) / std # training stats, not test stats
```
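At deployment, only training statistics are available, so this is the only feasible approach. Computing statistics from validation or test data would also leak information about those sets into the evaluation.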
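
A runnable version of the rule, under stated assumptions: the descriptor matrix is synthetic (one column in a molecular-weight-like range, one count-like column, one constant column), and the `clamp_min` guard is an addition that keeps a constant descriptor from causing a division by zero.

```python
import torch

torch.manual_seed(0)

# Synthetic descriptors for illustration only
X = torch.stack([
    torch.rand(100) * 300 + 200,            # MW-like values in [200, 500)
    torch.randint(0, 6, (100,)).float(),    # HBD-like counts in 0..5
    torch.ones(100),                        # constant descriptor (std = 0)
], dim=1)
X_train, X_val = X[:80], X[80:]

mean = X_train.mean(dim=0)
std = X_train.std(dim=0).clamp_min(1e-8)    # guard against zero std

X_train_n = (X_train - mean) / std
X_val_n = (X_val - mean) / std              # training stats, not val stats

print(X_train_n.mean(dim=0))  # close to 0 for every column
print(X_val_n.mean(dim=0))    # near 0 but not exactly, as expected
```

Dropping constant descriptors before training is a reasonable alternative to clamping, since they carry no information.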

## Part 1: MLP the Hard Way — Custom nn.Module

Before using `nn.Sequential`, we build a 3-layer MLP manually using a custom `nn.Module` class.
This makes the forward pass explicit, so we can see exactly how data flows through the network
before abstracting it away.

Architecture:
- Input layer: 5 → 16 neurons + ReLU
- Hidden layer: 16 → 16 neurons + ReLU
- Output layer: 16 → 1 neuron

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """3-layer MLP: 5 -> 16 -> 16 -> 1.

    Parameter count:
        Layer 1: (5 weights + 1 bias) * 16 neurons = 96 parameters
        Layer 2: (16 weights + 1 bias) * 16 neurons = 272 parameters
        Layer 3: (16 weights + 1 bias) * 1 neuron = 17 parameters
        Total parameters = 96 + 272 + 17 = 385 parameters
    """

    def __init__(self):
        super().__init__()  # initialise nn.Module so layers and parameters are registered
        self.layer1 = nn.Linear(5, 16)   # input layer: 5 features -> 16 hidden neurons
        self.relu = nn.ReLU()            # stateless, so one instance is reused after each layer
        self.layer2 = nn.Linear(16, 16)  # hidden layer: 16 -> 16 neurons
        self.layer3 = nn.Linear(16, 1)   # output layer: 16 -> 1 neuron

    def forward(self, x):
        """x: input tensor of shape (batch_size, 5); returns shape (batch_size, 1)."""
        x = self.layer1(x)  # 16 neurons, each a weighted sum of 5 inputs plus a bias
        x = self.relu(x)
        x = self.layer2(x)  # 16 neurons, each a weighted sum of 16 inputs plus a bias
        x = self.relu(x)
        x = self.layer3(x)  # 1 neuron: a weighted sum of 16 inputs plus a bias, no activation
        return x
```


```python
# Dummy input to test the model
model = MLP()
x = torch.rand(10, 5)  # batch of 10 samples with 5 features each
out = model(x)
print(out.shape)
print("Total number of parameters: ", sum(p.numel() for p in model.parameters()))
```

Output:
```
torch.Size([10, 1])
Total number of parameters:  385
```


## Part 2: MLP the Easy Way — nn.Sequential

Refactoring the same architecture using `nn.Sequential`.
The forward pass becomes implicit — data flows through layers in the order they are defined.

```python
simplified_model = nn.Sequential(
    nn.Linear(5, 16),   # input layer with 5 features and 16 hidden neurons
    nn.ReLU(),
    nn.Linear(16, 16),  # hidden layer with 16 neurons
    nn.ReLU(),
    nn.Linear(16, 1),   # output layer with 1 neuron
)

x = torch.rand(10, 5)
out = simplified_model(x)
print(out.shape)
print("Total number of parameters: ", sum(p.numel() for p in simplified_model.parameters()))
```

Output:
```
torch.Size([10, 1])
Total number of parameters:  385
```
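
Matching parameter counts suggest the two models are equivalent, but they do not prove it. The sketch below copies the weights from a custom-module MLP into an `nn.Sequential` version and checks that both produce the same output. The class is restated in compact form so the block runs on its own; the variable names `manual` and `sequential` are illustrative.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(5, 16)
        self.layer2 = nn.Linear(16, 16)
        self.layer3 = nn.Linear(16, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.layer1(x))
        x = self.relu(self.layer2(x))
        return self.layer3(x)

manual = MLP()
sequential = nn.Sequential(
    nn.Linear(5, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 1),
)

# Sequential indexes its children 0, 1, 2, ...; the Linear layers sit at 0, 2 and 4
sequential[0].load_state_dict(manual.layer1.state_dict())
sequential[2].load_state_dict(manual.layer2.state_dict())
sequential[4].load_state_dict(manual.layer3.state_dict())

x = torch.rand(10, 5)
with torch.no_grad():
    outputs_match = torch.allclose(manual(x), sequential(x))
print(outputs_match)
```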


## Part 3: Full MLP for Molecular Property Prediction

Scaling up to the B1 project architecture with:
- Input: 200 features (RDKit descriptors)
- Hidden layer 1: 512 neurons + BatchNorm + ReLU + Dropout(0.2)
- Hidden layer 2: 256 neurons + BatchNorm + ReLU + Dropout(0.2)
- Hidden layer 3: 128 neurons + BatchNorm + ReLU + Dropout(0.2)
- Output: 1 neuron


Order for setting up a layer: Linear → BatchNorm → ReLU → Dropout

```python
# Building model

full_model = nn.Sequential(
    # --- Block 1 ---
    # Projects 200 raw input features (data, not neurons) into 512 neurons
    nn.Linear(200, 512),
    nn.BatchNorm1d(512),  # normalises each of the 512 neuron outputs across the batch
    nn.ReLU(),            # introduces non-linearity
    nn.Dropout(0.2),      # zeroes each activation with probability 0.2 during training

    # --- Block 2 ---
    # Hidden layer: 512 neurons → 256 neurons
    nn.Linear(512, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Dropout(0.2),

    # --- Block 3 ---
    # Hidden layer: 256 neurons → 128 neurons
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Dropout(0.2),

    # --- Output ---
    # No activation, BatchNorm, or Dropout: raw value for regression
    nn.Linear(128, 1),
)
```

```python
x = torch.randn(16, 200)  # batch of 16 molecules with 200 descriptors each
out = full_model(x)
print(out.shape)  # should be torch.Size([16, 1])
print("Total parameters:", sum(p.numel() for p in full_model.parameters()))  # 269057 parameters
```

Output:
```
torch.Size([16, 1])
Total parameters: 269057
```
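
The forward pass above runs in training mode, which is the default for a freshly built model. For actual predictions the model should be in evaluation mode. The sketch below uses a smaller stand-in network (the 64-neuron hidden layer is an arbitrary choice) to show two consequences: BatchNorm refuses a batch of one in training mode, and predictions become deterministic in evaluation mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in model with the same kinds of layers as the full MLP
model = nn.Sequential(
    nn.Linear(200, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(64, 1),
)

single = torch.randn(1, 200)  # descriptors for one molecule

# Training mode: batch statistics cannot be computed from a single sample
model.train()
try:
    model(single)
    train_mode_failed = False
except ValueError:
    train_mode_failed = True

# Evaluation mode: running statistics are used and dropout is disabled
model.eval()
with torch.no_grad():
    pred_a = model(single)
    pred_b = model(single)

print(train_mode_failed)            # True
print(torch.equal(pred_a, pred_b))  # True: repeated predictions are identical
```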


## Parameter Count

Total learnable parameters: 269,057

### Breakdown

| Layer | Calculation | Parameters |
|---|---|---|
| `Linear(200, 512)` | 512 × 200 + 512 (bias) | 102,912 |
| `BatchNorm1d(512)` | 512 × 2 (gamma + beta) | 1,024 |
| `Linear(512, 256)` | 256 × 512 + 256 (bias) | 131,328 |
| `BatchNorm1d(256)` | 256 × 2 (gamma + beta) | 512 |
| `Linear(256, 128)` | 128 × 256 + 128 (bias) | 32,896 |
| `BatchNorm1d(128)` | 128 × 2 (gamma + beta) | 256 |
| `Linear(128, 1)` | 1 × 128 + 1 (bias) | 129 |
| **Total** | | **269,057** |

### Notes
- `nn.Linear(in, out)` parameters: `out × in` (weights) + `out` (biases)
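- `nn.BatchNorm1d(n)` parameters: `n × 2`, one **gamma** (scale) and one **beta** (shift) per feature
- BatchNorm normalises to mean 0 and std 1, then applies the learned scale and shift (gamma and beta), so the network can adjust or undo the normalisation where that helps
- BatchNorm also stores a running mean and running variance per feature; these are buffers updated during training, not learnable parameters, so they are not included in the count above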
- `nn.ReLU()` and `nn.Dropout()` have no learnable parameters
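
The breakdown can be reproduced programmatically. The sketch below rebuilds the same architecture in compact form, counts the learnable parameters per layer, and separately counts the BatchNorm buffers, which `parameters()` does not include.

```python
import torch.nn as nn

full_model = nn.Sequential(
    nn.Linear(200, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(128, 1),
)

# Learnable parameters per layer, skipping layers that have none
per_layer = []
for index, layer in enumerate(full_model):
    n_params = sum(p.numel() for p in layer.parameters())
    if n_params > 0:
        per_layer.append(n_params)
        print(f"{index:2d} {type(layer).__name__:12s} {n_params:>8,}")

total = sum(p.numel() for p in full_model.parameters())
print("Total learnable parameters:", total)  # 269057

# Buffers: running_mean, running_var and num_batches_tracked for each BatchNorm layer
n_buffers = sum(b.numel() for b in full_model.buffers())
print("Non-learnable buffer elements:", n_buffers)  # 1795
```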