# Mini Project: Linear Binding Affinity Predictor

## Goal
Build a linear binding affinity predictor using multiple descriptors — extending everything from this week into one cohesive pipeline.

## Specifications
- 100 molecules, 5 descriptors (MW, logP, TPSA, HBD, HBA — dummy values for now)
- True relationship: `affinity = 0.3*MW - 0.5*logP + 0.1*TPSA + 0.2*HBD - 0.4*HBA + noise`
- A `Dataset` subclass (implemented below as `BindingAffinityDataset`) + `DataLoader` with batch size 16
- `nn.Linear(5, 1)` model
- SGD optimizer, MSELoss
- 100 epochs, print epoch loss every 10
- At the end, print learned weights vs true weights

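**Challenge:** with 5 input features, `nn.Linear(5, 1)` has 5 weights, one per descriptor, plus a single bias term. The true relationship has no intercept, so the learned bias should end up close to zero.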
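
As a quick check on the parameter count, the sketch below inspects a freshly created `nn.Linear(5, 1)`. The variable name `n_params` is only illustrative and is not used elsewhere in this notebook.

```python
import torch.nn as nn

model = nn.Linear(5, 1)  # 5 descriptors in, 1 affinity out

# weight has one row per output and one column per input feature
print(model.weight.shape)  # torch.Size([1, 5])
print(model.bias.shape)    # torch.Size([1])

# Total trainable parameters: 5 weights + 1 bias
n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # 6
```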

---

## Key Reminders

### 1. Check tensor shapes before computing loss
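`model(descriptors)` outputs shape `(batch_size, 1)` while labels are `(batch_size,)`.
Use `.squeeze(1)` to align shapes before passing to the loss function; otherwise the two tensors broadcast to `(batch_size, batch_size)` and the loss is computed over mismatched pairs.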
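
The shape issue can be seen on random stand-in tensors. This is a minimal sketch: `preds` and `labels` are hypothetical placeholders for `model(descriptors)` and a batch of affinities, not variables from the training loop.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
criterion = nn.MSELoss()

preds = torch.randn(16, 1)   # stand-in for model(descriptors)
labels = torch.randn(16)     # stand-in for a batch of affinities

# Without squeezing, (16, 1) and (16,) broadcast to (16, 16):
# every prediction is compared against every label.
diff = preds - labels
print(diff.shape)  # torch.Size([16, 16])

# After squeeze(1), predictions and labels pair up one-to-one.
aligned = preds.squeeze(1)
loss = criterion(aligned, labels)
print(aligned.shape)  # torch.Size([16])
```

Using `.squeeze(1)` rather than a bare `.squeeze()` only removes the feature dimension, so a final batch containing a single molecule keeps its batch dimension.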

### 2. Divide epoch loss by number of batches, not batch size
`epoch_loss` accumulates the sum of per-batch losses across the epoch:
```
epoch_loss = loss_batch_1 + loss_batch_2 + ... + loss_batch_N
```
Each batch loss is already averaged over the molecules in that batch by `nn.MSELoss()`.
Dividing by `len(dataloader)` (the number of batches, here 7 for 100 molecules at batch size 16) gives the mean batch loss across the epoch: a stable number that is comparable between epochs.
Dividing by batch size has no clean interpretation.
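
The sketch below illustrates the averaging on random stand-in data. `TensorDataset`, `weighted_sum`, and `per_molecule_loss` are assumptions introduced for this example only. It also shows one subtlety: because the last batch holds 4 molecules instead of 16, the mean batch loss is close to, but not exactly, the per-molecule mean. Weighting each batch loss by its size recovers the exact per-molecule value.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
x = torch.randn(100, 5)   # stand-in descriptors
y = torch.randn(100)      # stand-in affinities
loader = DataLoader(TensorDataset(x, y), batch_size=16)
print(len(loader))  # 7 batches: six of 16 molecules and one of 4

criterion = torch.nn.MSELoss()
model = torch.nn.Linear(5, 1)

epoch_loss = 0.0     # sum of per-batch mean losses
weighted_sum = 0.0   # sum of batch losses weighted by batch size
with torch.no_grad():
    for xb, yb in loader:
        batch_loss = criterion(model(xb).squeeze(1), yb).item()
        epoch_loss += batch_loss
        weighted_sum += batch_loss * len(yb)

mean_batch_loss = epoch_loss / len(loader)              # what the training loop reports
per_molecule_loss = weighted_sum / len(loader.dataset)  # exact mean over all 100 molecules
full = criterion(model(x).squeeze(1), y).item()         # same quantity, computed in one pass
print(mean_batch_loss, per_molecule_loss, full)
```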

In [1]:
# Dataset Class for Batching
from torch.utils.data import Dataset

class BindingAffinityDataset(Dataset):
    """Pairs each molecule's descriptor vector with its binding affinity."""
    def __init__(self, descriptors, affinities):
        self.descriptors = descriptors
        self.affinities = affinities

    def __len__(self):
        return len(self.descriptors)

    def __getitem__(self, idx):
        return self.descriptors[idx], self.affinities[idx]

In [39]:
# Initialize dummy descriptors

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

torch.manual_seed(42)

# 100 molecules, 5 descriptors each: MW, logP, TPSA, HBD, HBA
dummy_descriptors = torch.randn(100, 5)

# True relationship: affinity = 0.3*MW - 0.5*logP + 0.1*TPSA + 0.2*HBD - 0.4*HBA + noise
mw, logp, tpsa, hbd, hba = 0.3, -0.5, 0.1, 0.2, -0.4
true_weights = torch.tensor([mw, logp, tpsa, hbd, hba])
noise = 0.1 * torch.randn(100)

affinities = dummy_descriptors @ true_weights + noise

# Create Dataset and DataLoader
dataset = BindingAffinityDataset(dummy_descriptors, affinities)
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)
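
Because the model is linear and the loss is MSE, an ordinary least-squares fit on the same data gives a reference point that the SGD-trained weights should approach. The sketch below is a sanity check, not part of the project specification: it regenerates the data with the same seed, and the names `X_aug`, `w_ls`, and `b_ls` are assumptions for illustration.

```python
import torch

torch.manual_seed(42)
X = torch.randn(100, 5)
true_w = torch.tensor([0.3, -0.5, 0.1, 0.2, -0.4])
y = X @ true_w + 0.1 * torch.randn(100)

# Append a column of ones so the fit includes an intercept, like nn.Linear's bias.
X_aug = torch.cat([X, torch.ones(100, 1)], dim=1)

# Solve min ||X_aug @ w - y||^2 in closed form; the right-hand side must be 2-D.
solution = torch.linalg.lstsq(X_aug, y.unsqueeze(1)).solution.squeeze(1)
w_ls, b_ls = solution[:5], solution[5]

print("least-squares weights:", w_ls)
print("least-squares bias:", b_ls)
```

If the trained model's weights differ noticeably from `w_ls` after 100 epochs, the learning rate or the number of epochs is the more likely cause than the data.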


In [47]:
# Set up a linear regression model via neural network
model = nn.Linear(5, 1) # 5 input features, 1 output (binding affinity)

# Define loss function and Stochastic Gradient Descent (SGD) optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training loop
num_epochs = 100

for epoch in range(num_epochs):
    epoch_loss = 0
    for batch in dataloader:
        descriptors, affinities_true = batch
        affinities_pred = model(descriptors)
        # affinities_true: [batch], affinities_pred: [batch, 1]
        # (batch is 16, except for the final batch of 4)
        affinities_pred = affinities_pred.squeeze(1) # Reshape to [batch]
        # print(affinities_true.shape, affinities_pred.shape) # Check shapes
        loss = criterion(affinities_pred, affinities_true)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    if epoch % 10 == 0:
        # epoch_loss = loss_batch_1 + loss_batch_2 + ... + loss_batch_7 (len(dataloader) == 7)
        print(f'Epoch {epoch}, Loss: {epoch_loss/len(dataloader):.4f}')
        # Unpack learned weights and compare to true weights
        # detach() avoids the discouraged .data attribute; tolist() yields plain floats
        mw_, logP_, tpsa_, hbd_, hba_ = model.weight.detach()[0].tolist()
        print(f'Learned weights: MW={mw_:.4f}, logP={logP_:.4f}, TPSA={tpsa_:.4f}, HBD={hbd_:.4f}, HBA={hba_:.4f}')
        print(f'True weights: MW={mw:.4f}, logP={logp:.4f}, TPSA={tpsa:.4f}, HBD={hbd:.4f}, HBA={hba:.4f}')


Epoch 0, Loss: 0.5504
Learned weights: MW=0.2631, logP=0.0708, TPSA=-0.2832, HBD=0.0474, HBA=-0.4226
True weights: MW=0.3000, logP=-0.5000, TPSA=0.1000, HBD=0.2000, HBA=-0.4000
Epoch 10, Loss: 0.0582
Learned weights: MW=0.2412, logP=-0.3571, TPSA=-0.0525, HBD=0.1548, HBA=-0.3961
True weights: MW=0.3000, logP=-0.5000, TPSA=0.1000, HBD=0.2000, HBA=-0.4000
Epoch 20, Loss: 0.0131
Learned weights: MW=0.2740, logP=-0.4651, TPSA=0.0405, HBD=0.1868, HBA=-0.3928
True weights: MW=0.3000, logP=-0.5000, TPSA=0.1000, HBD=0.2000, HBA=-0.4000
Epoch 30, Loss: 0.0093
Learned weights: MW=0.2901, logP=-0.4983, TPSA=0.0731, HBD=0.1948, HBA=-0.3948
True weights: MW=0.3000, logP=-0.5000, TPSA=0.1000, HBD=0.2000, HBA=-0.4000
Epoch 40, Loss: 0.0082
Learned weights: MW=0.2954, logP=-0.5101, TPSA=0.0874, HBD=0.1989, HBA=-0.3936
True weights: MW=0.3000, logP=-0.5000, TPSA=0.1000, HBD=0.2000, HBA=-0.4000
Epoch 50, Loss: 0.0086
Learned weights: MW=0.2976, logP=-0.5131, TPSA=0.0880, HBD=0.1962, HBA=-0.3921
True wei