# Intro to PyTorch — Hands‑on Notebook (with NLP example)
*Generated on: 2025-09-24*

This notebook is a practical, end‑to‑end tour of PyTorch, focusing on:
- Core tensors, autograd, and modules (`torch`, `torch.nn`, `torch.optim`, `torch.utils.data`)
- Building neural networks in multiple styles (Sequential, subclassed `nn.Module`, functional API)
- Training pipelines (datasets, dataloaders, training/eval loops, saving/loading)
- Useful building blocks: initialization, regularization, scheduling, mixed precision, `torch.compile` (PyTorch 2.x)
- NLP mini‑project: sentiment classification on a tiny in‑notebook dataset using embeddings + RNN/Transformer

> **Tip:** Run the cells one by one. If PyTorch is not installed, install it with the cell below (adjust CUDA/CPU as needed).


## 1. Environment Setup

In [None]:
# If PyTorch is not installed, uncomment one of these (choose the right one for your machine).
# CPU-only (works everywhere):
# !pip install torch --index-url https://download.pytorch.org/whl/cpu

# If you have CUDA 12.1:
# !pip install torch --index-url https://download.pytorch.org/whl/cu121

# Optional utils used later:
# !pip install matplotlib scikit-learn


## 2. Imports & Version Check

In [None]:
import sys, platform
print('Python:', sys.version)
print('Platform:', platform.platform())

import torch
print('PyTorch:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device


## 3. Tensors 101
Core creation, shapes, dtypes, basic math, broadcasting, indexing, and moving to devices.


In [None]:
import torch

# Creation
x = torch.tensor([[1., 2.],[3., 4.]])
a = torch.arange(10)
z = torch.zeros((2,3))
o = torch.ones((3,2))
r = torch.randn((2,2))

print('x:', x, x.shape, x.dtype)
print('a:', a)
print('z:', z)
print('o:', o)
print('r:', r)

# Basic ops & broadcasting
b = torch.randn(1,2)
print('b:', b)
print('x + b:', x + b)
print('x @ x.T:', x @ x.T)

# Indexing & slicing
print('a[2:7]:', a[2:7])
print('x[:, 1]:', x[:, 1])

# Devices
x_gpu = x.to(device)
print('x on device:', x_gpu.device)
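
The broadcasting and reshaping rules used above can be sketched on a slightly larger example. The tensor names `m`, `row`, and `col` are illustrative and do not appear in the cell above; the point is only that a size-1 dimension is stretched to match its partner.

```python
import torch

m = torch.arange(6.).reshape(2, 3)      # shape [2, 3]
row = torch.tensor([[10., 20., 30.]])   # shape [1, 3], stretched along dim 0
col = torch.tensor([[1.], [2.]])        # shape [2, 1], stretched along dim 1

added_row = m + row                     # [2, 3] + [1, 3] -> [2, 3]
added_col = m + col                     # [2, 3] + [2, 1] -> [2, 3]
outer = col + row                       # [2, 1] + [1, 3] -> [2, 3]

# reshape regroups the same elements in order; permute swaps the axes.
reshaped = m.reshape(3, 2)
permuted = m.permute(1, 0)              # same as m.T for a 2-D tensor

print(added_row.shape, added_col.shape, outer.shape)
print(reshaped)
print(permuted)
```

Note that `reshaped` and `permuted` have the same shape but different contents, which is a common source of silent bugs when rearranging dimensions.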


## 4. Autograd (Automatic Differentiation)
Track gradients with `requires_grad=True`, compute loss, and backprop using `.backward()`.


In [None]:
w = torch.randn(3, requires_grad=True)
b = torch.randn(1, requires_grad=True)

x = torch.randn(5, 3)
y = torch.randn(5, 1)

# Simple linear model: y_hat = x @ w + b
y_hat = x @ w.unsqueeze(1) + b
loss = ((y_hat - y)**2).mean()
print('Loss:', loss.item())

loss.backward()
print('w.grad:', w.grad)
print('b.grad:', b.grad)

# Always zero grads before next backward pass in real training loops
w.grad.zero_()
b.grad.zero_()
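
As a sanity check on what `.backward()` computes, the sketch below compares autograd's result with the hand-derived gradient of the mean squared error, and then shows why gradients need zeroing. It re-creates its own small tensors with a fixed seed (an assumption made for reproducibility) and wraps the loss in a helper named `mse_loss`, which is not part of the cell above.

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, 3)
y = torch.randn(5, 1)
w = torch.randn(3, requires_grad=True)
b = torch.randn(1, requires_grad=True)

def mse_loss(w, b):
    y_hat = x @ w.unsqueeze(1) + b
    return ((y_hat - y) ** 2).mean()

mse_loss(w, b).backward()

# Analytic gradients of the mean squared error, computed without autograd:
# dL/dw = (2/N) * X^T (y_hat - y),  dL/db = (2/N) * sum(y_hat - y)
with torch.no_grad():
    residual = x @ w.unsqueeze(1) + b - y                       # [5, 1]
    grad_w = (2.0 / x.size(0)) * (x.T @ residual).squeeze(1)    # [3]
    grad_b = (2.0 / x.size(0)) * residual.sum().reshape(1)      # [1]

print(torch.allclose(w.grad, grad_w, atol=1e-5))
print(torch.allclose(b.grad, grad_b, atol=1e-5))

# Gradients accumulate: a second backward() without zeroing doubles them.
first = w.grad.clone()
mse_loss(w, b).backward()
accumulated = torch.allclose(w.grad, 2 * first, atol=1e-5)
print(accumulated)
```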


## 5. `torch.nn` Modules
Three ways to define models:
1. **Sequential API**
2. **Subclass `nn.Module`**
3. **Functional API** with layers in `forward`


In [None]:
import torch.nn as nn
import torch.nn.functional as F

# 1) Sequential
seq_model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 1)
)
print(seq_model)

# 2) Subclassing nn.Module
class MLP(nn.Module):
    def __init__(self, in_dim=10, hidden=32, out_dim=1):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)
    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

mlp = MLP()
print(mlp)

# 3) Functional: often similar to subclassing; you call F.* in forward as above.
# The MLP already demonstrates functional usage via F.relu in forward.
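
Whichever style you pick, the resulting module exposes the same parameters. A minimal sketch for inspecting them follows; the helper name `count_parameters` is made up for this example.

```python
import torch
import torch.nn as nn

def count_parameters(module: nn.Module) -> int:
    """Number of trainable scalar parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

for name, p in net.named_parameters():
    print(f"{name:10s} {tuple(p.shape)}")

# (10*32 + 32) + (32*1 + 1) = 352 + 33 = 385
total = count_parameters(net)
print("trainable parameters:", total)

out = net(torch.randn(4, 10))   # a batch of 4 inputs gives 4 outputs
print(out.shape)
```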


## 6. Optimizers & A Minimal Training Loop
Wire a loss and an optimizer into the core cycle: zero‑grad → backward → step.


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

# Dummy regression data
X = torch.randn(256, 10)
true_w = torch.randn(10, 1)
y = X @ true_w + 0.1*torch.randn(256,1)

model = nn.Sequential(nn.Linear(10,64), nn.ReLU(), nn.Linear(64,1)).to(device)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-2)

X, y = X.to(device), y.to(device)

for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    preds = model(X)
    loss = criterion(preds, y)
    loss.backward()
    optimizer.step()

    if (epoch+1) % 50 == 0:
        print(f'Epoch {epoch+1:03d} | loss={loss.item():.4f}')

# Evaluate
model.eval()
with torch.no_grad():
    mse = criterion(model(X), y).item()
print('Final MSE:', mse)
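
To see what `optimizer.step()` does, the sketch below applies one update by hand and compares it with the optimizer's result. It swaps in plain `optim.SGD` rather than the Adam optimizer used above, because SGD's update rule fits on one line; the models and data here are throwaway stand-ins.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(16, 4)
y = torch.randn(16, 1)
lr = 0.1

model_a = nn.Linear(4, 1)
model_b = copy.deepcopy(model_a)      # identical starting weights

# (a) Let the optimizer apply the update.
opt = torch.optim.SGD(model_a.parameters(), lr=lr)
opt.zero_grad()
nn.functional.mse_loss(model_a(X), y).backward()
opt.step()

# (b) Apply the same update by hand: p <- p - lr * grad.
model_b.zero_grad()
nn.functional.mse_loss(model_b(X), y).backward()
with torch.no_grad():
    for p in model_b.parameters():
        p -= lr * p.grad

same = all(
    torch.allclose(pa, pb, atol=1e-6)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print("manual step matches optim.SGD:", same)
```

Adam follows the same zero-grad, backward, step cycle but keeps running statistics of the gradients, so its update cannot be reproduced with this one-liner.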


## 7. Data Pipeline: `Dataset` & `DataLoader`
- Use built‑in datasets or **create your own** by subclassing `torch.utils.data.Dataset`.
- Load with `DataLoader` for batching, shuffling, and parallel workers.


In [None]:
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
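    def __init__(self, n=500, in_dim=10, seed=0):
        super().__init__()
        # Use a local generator so building a dataset does not reset the global RNG.
        g = torch.Generator().manual_seed(seed)
        self.X = torch.randn(n, in_dim, generator=g)
        # Label is 1 when the features of a row sum to a positive number.
        self.y = (self.X.sum(dim=1, keepdim=True) > 0).float()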
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

train_ds = ToyDataset(n=1000, in_dim=20)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)

for batch_X, batch_y in train_loader:
    print(batch_X.shape, batch_y.shape)
    break
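
When the data already lives in tensors, the built-in `TensorDataset` can stand in for a custom class, and `random_split` produces non-overlapping subsets. The following is a minimal sketch with made-up sizes (100 rows, an 80/20 split); the names `features`, `labels`, `train_part`, and `val_part` are illustrative.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

g = torch.Generator().manual_seed(0)       # local RNG; leaves the global seed alone
features = torch.randn(100, 20, generator=g)
labels = (features.sum(dim=1, keepdim=True) > 0).float()

full_ds = TensorDataset(features, labels)
train_part, val_part = random_split(full_ds, [80, 20], generator=g)

loader = DataLoader(train_part, batch_size=32, shuffle=True, generator=g)
batch_sizes = [xb.size(0) for xb, _ in loader]
print(len(train_part), len(val_part), batch_sizes)
```

The last batch is smaller because 80 is not a multiple of 32; pass `drop_last=True` to the `DataLoader` if a fixed batch size matters.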


## 8. Put It Together: Train a Classifier with DataLoader
Includes accuracy computation and an evaluation loop.


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

class Classifier(nn.Module):
    def __init__(self, in_dim=20, hidden=64, out_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim)
        )
    def forward(self, x):
        return self.net(x)

def accuracy_from_logits(logits, y):
    probs = torch.sigmoid(logits)
    preds = (probs > 0.5).float()
    return (preds.eq(y).float().mean().item())

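# Split one dataset so that train and validation rows do not overlap.
# (Two ToyDataset instances built from the same seed would share samples.)
full_ds = ToyDataset(n=2400, in_dim=20)
train_ds, val_ds = torch.utils.data.random_split(
    full_ds, [2000, 400], generator=torch.Generator().manual_seed(0)
)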

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
val_loader   = DataLoader(val_ds,   batch_size=128, shuffle=False)

model = Classifier(in_dim=20).to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

best_val = 0.0
for epoch in range(10):
    # Train
    model.train()
    total_loss, total_acc, total_cnt = 0.0, 0.0, 0
    for Xb, yb in train_loader:
        Xb, yb = Xb.to(device), yb.to(device)
        optimizer.zero_grad()
        logits = model(Xb)
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()*Xb.size(0)
        total_acc  += accuracy_from_logits(logits, yb)*Xb.size(0)
        total_cnt  += Xb.size(0)
    tr_loss = total_loss/total_cnt
    tr_acc  = total_acc/total_cnt

    # Val
    model.eval()
    val_loss, val_acc, val_cnt = 0.0, 0.0, 0
    with torch.no_grad():
        for Xb, yb in val_loader:
            Xb, yb = Xb.to(device), yb.to(device)
            logits = model(Xb)
            loss = criterion(logits, yb)
            val_loss += loss.item()*Xb.size(0)
            val_acc  += accuracy_from_logits(logits, yb)*Xb.size(0)
            val_cnt  += Xb.size(0)

    val_loss /= val_cnt
    val_acc  /= val_cnt
    print(f'Epoch {epoch+1:02d} | train_loss={tr_loss:.4f} tr_acc={tr_acc:.3f} | val_loss={val_loss:.4f} val_acc={val_acc:.3f}')

    best_val = max(best_val, val_acc)
print('Best Val Acc:', best_val)
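
The loop above feeds raw logits to `nn.BCEWithLogitsLoss` and only applies a sigmoid when computing accuracy. The sketch below, using hand-picked logits and targets, checks that the fused loss matches a sigmoid followed by `nn.BCELoss`, and that thresholding probabilities at 0.5 is the same as thresholding logits at 0.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[-3.0], [-0.2], [0.0], [1.5], [4.0]])
targets = torch.tensor([[0.0], [1.0], [0.0], [1.0], [1.0]])

fused = nn.BCEWithLogitsLoss()(logits, targets)
two_step = nn.BCELoss()(torch.sigmoid(logits), targets)
print(fused.item(), two_step.item())

# sigmoid(z) > 0.5 exactly when z > 0, so both thresholds give the same predictions.
preds_from_probs = (torch.sigmoid(logits) > 0.5).float()
preds_from_logits = (logits > 0).float()
accuracy = preds_from_logits.eq(targets).float().mean().item()
print(accuracy)
```

The fused version is generally preferred because it is numerically more stable for large positive or negative logits.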


## 9. Saving & Loading Models
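Save the `state_dict` (parameters and buffers) rather than the whole model object; it stays loadable when the surrounding code is refactored, as long as the architecture matches.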


In [None]:
# Save
torch.save(model.state_dict(), 'classifier_state_dict.pt')

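# Load (example): rebuild the architecture, then restore the weights.
# weights_only=True restricts unpickling to tensors and plain containers.
loaded = Classifier(in_dim=20).to(device)
loaded.load_state_dict(torch.load('classifier_state_dict.pt', map_location=device, weights_only=True))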
loaded.eval()
print('Loaded model ready.')
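
To resume training rather than just run inference, it is common to save the optimizer state and a progress counter as well. This is a minimal sketch under assumptions: the dictionary keys `epoch`, `model`, and `optimizer` are arbitrary names chosen here, and the file goes to a temporary directory.

```python
import os
import tempfile
import torch
import torch.nn as nn

net = nn.Linear(20, 1)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# One dummy step so the optimizer has state worth saving.
net(torch.randn(8, 20)).sum().backward()
opt.step()

# 'epoch', 'model', and 'optimizer' are illustrative key names.
checkpoint = {"epoch": 3, "model": net.state_dict(), "optimizer": opt.state_dict()}
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save(checkpoint, path)

# Later: rebuild the objects, then restore their state.
restored_net = nn.Linear(20, 1)
restored_opt = torch.optim.Adam(restored_net.parameters(), lr=1e-3)
state = torch.load(path, map_location="cpu")
restored_net.load_state_dict(state["model"])
restored_opt.load_state_dict(state["optimizer"])
start_epoch = state["epoch"] + 1
print("resume from epoch", start_epoch)
```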


## 10. Useful Building Blocks
- **Initialization**: Xavier/He init
- **Regularization**: Weight decay (L2), dropout
- **Schedulers**: `StepLR`, `CosineAnnealingLR`, etc.
- **Mixed Precision**: `torch.cuda.amp`
- **PyTorch 2.x**: `torch.compile` for speedups (if supported)
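
For the dropout item in the list above, the difference between training and evaluation mode is easy to verify directly. The sketch uses `p=0.5` and a vector of ones purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
y_train = drop(x)          # roughly half the entries zeroed, the rest scaled by 1/(1-p) = 2
drop.eval()
y_eval = drop(x)           # identity in eval mode

zero_fraction = (y_train == 0).float().mean().item()
kept_values = y_train[y_train != 0].unique()
print(zero_fraction, kept_values, torch.equal(y_eval, x))
```

This is why the training loops in this notebook call `model.train()` before optimizing and `model.eval()` before validating.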


In [None]:
import torch.nn as nn
import torch.optim as optim

# Example: custom init
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

demo = nn.Sequential(nn.Linear(32,64), nn.ReLU(), nn.Linear(64,10))
demo.apply(init_weights);

# Dropout example
drop_model = nn.Sequential(
    nn.Linear(32, 128),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(128, 10)
)

# Schedulers
optimizer = optim.Adam(demo.parameters(), lr=1e-3, weight_decay=1e-4)  # weight_decay adds an L2 penalty; optim.AdamW decouples it
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

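# AMP mixed precision (requires CUDA); the scaler is a no-op when disabled.
# Recent PyTorch releases deprecate this spelling in favor of torch.amp.GradScaler('cuda', ...).
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())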

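# torch.compile (PyTorch 2.x). Wrapping is cheap; the actual compilation is
# deferred to the first forward call, so backend errors may only appear there.
try:
    compiled_demo = torch.compile(demo)
except Exception as e:
    compiled_demo = demo  # fall back to the eager model
    print('torch.compile not available:', e)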
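
The scheduler only changes the learning rate when `scheduler.step()` is called, typically once per epoch after the optimizer has stepped. The sketch below traces the learning rate for the same `StepLR(step_size=10, gamma=0.5)` settings; it uses plain SGD on a throwaway layer as a stand-in, since the schedule does not depend on the optimizer type.

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 1)
opt = torch.optim.SGD(net.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)

lrs = []
for epoch in range(25):
    # ... a real loop would run the training batches here ...
    opt.step()                 # optimizer first,
    sched.step()               # then the scheduler, once per epoch
    lrs.append(sched.get_last_lr()[0])

print(lrs[0], lrs[9], lrs[19])
```

The rate is halved after every 10 calls to `sched.step()`, so the three printed values correspond to the initial rate, half of it, and a quarter of it.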


## 11. NLP Mini‑Project — Tiny Sentiment Classifier
We'll build a minimal text classification pipeline **without external downloads**:

**Steps**
1. Build a tiny in‑notebook dataset (positive/negative sentences)
2. Tokenize (basic), build a vocabulary, numericalize
3. Create a `collate_fn` to pad batches
4. Model 1: Embedding + Average pooling classifier (fast baseline)
5. Model 2: Embedding + LSTM/GRU
6. (Optional) Model 3: Tiny Transformer encoder

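We'll measure accuracy on a simple train/val split. This is **illustrative** (not SOTA): with 16 sentences and a 75/25 split, validation holds only 4 examples, so validation accuracy moves in steps of 0.25.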


In [None]:
import re, random, math, torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

random.seed(0); torch.manual_seed(0)

# 1) Tiny dataset (toy)
pos = [
    'i love this movie',
    'this film was awesome',
    'such a great experience',
    'absolutely fantastic acting',
    'i really enjoyed it',
    'highly recommend this',
    'brilliant and heartwarming',
    'super fun and engaging',
]
neg = [
    'i hate this movie',
    'this film was terrible',
    'such a bad experience',
    'absolutely awful acting',
    'i really disliked it',
    'do not recommend this',
    'boring and disappointing',
    'super dull and messy',
]

all_text = [(t,1) for t in pos] + [(t,0) for t in neg]
random.shuffle(all_text)

# 2) Tokenize + vocab
def basic_tokenize(s):
    s = s.lower()
    s = re.sub(r'[^a-z0-9\s]+','', s)
    return s.strip().split()

# build vocab
from collections import Counter
counter = Counter()
for t,_ in all_text:
    counter.update(basic_tokenize(t))

itos = ['<pad>','<unk>'] + [w for w,cnt in counter.items() if cnt>=1]
stoi = {w:i for i,w in enumerate(itos)}

def numericalize(tokens):
    return [stoi.get(tok, stoi['<unk>']) for tok in tokens]

# 3) Dataset with numericalized text
class TextDataset(Dataset):
    def __init__(self, pairs):
        self.items = []
        for txt, lbl in pairs:
            toks = basic_tokenize(txt)
            ids = numericalize(toks)
            self.items.append((ids, lbl))
    def __len__(self):
        return len(self.items)
    def __getitem__(self, idx):
        return self.items[idx]

def pad_collate(batch, pad_idx=0):
    # batch: list of (ids, label)
    max_len = max(len(ids) for ids,_ in batch)
    padded = []
    labels = []
    for ids, lbl in batch:
        arr = ids + [pad_idx]*(max_len - len(ids))
        padded.append(arr)
        labels.append(lbl)
    return torch.tensor(padded), torch.tensor(labels).float().unsqueeze(1)

# Split
split = int(0.75 * len(all_text))
train_pairs = all_text[:split]
val_pairs   = all_text[split:]

train_ds = TextDataset(train_pairs)
val_ds   = TextDataset(val_pairs)

train_dl = DataLoader(train_ds, batch_size=4, shuffle=True, collate_fn=pad_collate)
val_dl   = DataLoader(val_ds,   batch_size=4, shuffle=False, collate_fn=pad_collate)

vocab_size = len(itos)
pad_idx = stoi['<pad>']
vocab_size, pad_idx, itos[:10]
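
The hand-written `pad_collate` above pads with Python lists. PyTorch also ships `torch.nn.utils.rnn.pad_sequence`, which does the same job on tensors. A minimal sketch with hypothetical token ids follows (the ids are invented and do not come from the vocabulary built above).

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical token-id sequences of different lengths (0 is reserved for <pad>).
seqs = [[5, 2, 9], [7, 3], [4, 8, 6, 1]]
tensors = [torch.tensor(s, dtype=torch.long) for s in seqs]

padded = pad_sequence(tensors, batch_first=True, padding_value=0)   # [B, T_max]
lengths = torch.tensor([len(s) for s in seqs])                      # true lengths

print(padded)
print(lengths)
```

Keeping `lengths` alongside the padded batch is useful later, for example for masking or for packing sequences before an RNN.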


In [None]:
class MeanPoolClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, pad_idx=0):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.fc  = nn.Linear(emb_dim, 1)
    def forward(self, x):  # x: [B, T]
        e = self.emb(x)    # [B, T, E]
        mask = (x != self.emb.padding_idx).unsqueeze(-1)  # [B, T, 1], True for real tokens
        summed = (e * mask).sum(dim=1) # [B, E]
        count  = mask.sum(dim=1).clamp(min=1) # [B, 1]
        avg    = summed / count
        return self.fc(avg)

def train_eval(model, train_dl, val_dl, epochs=20, lr=1e-2, device='cpu'):
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    def accuracy(logits, y):
        preds = (torch.sigmoid(logits) > 0.5).float()
        return (preds.eq(y).float().mean().item())

    for ep in range(epochs):
        model.train()
        total_loss, total_acc, total_n = 0.0, 0.0, 0
        for xb, yb in train_dl:
            xb, yb = xb.to(device), yb.to(device)
            opt.zero_grad()
            logits = model(xb)
            loss = loss_fn(logits, yb)
            loss.backward()
            opt.step()
            n = xb.size(0)
            total_loss += loss.item()*n
            total_acc  += accuracy(logits, yb)*n
            total_n    += n
        tr_loss = total_loss/total_n
        tr_acc  = total_acc/total_n

        model.eval()
        val_loss, val_acc, val_n = 0.0, 0.0, 0
        with torch.no_grad():
            for xb, yb in val_dl:
                xb, yb = xb.to(device), yb.to(device)
                logits = model(xb)
                loss = loss_fn(logits, yb)
                n = xb.size(0)
                val_loss += loss.item()*n
                val_acc  += accuracy(logits, yb)*n
                val_n    += n
        val_loss /= val_n
        val_acc  /= val_n
        if (ep+1)%5==0 or ep==0:
            print(f'Epoch {ep+1:02d} | train_loss={tr_loss:.4f} tr_acc={tr_acc:.3f} | val_loss={val_loss:.4f} val_acc={val_acc:.3f}')

# Train baseline
model = MeanPoolClassifier(vocab_size, emb_dim=32, pad_idx=pad_idx)
train_eval(model, train_dl, val_dl, epochs=20, lr=5e-3, device=device)
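
The masked average inside `MeanPoolClassifier` can be checked by hand on a tiny batch. In this sketch the "embeddings" are a made-up stand-in (token id `i` maps to the vector `[i, 2*i]`) so that the expected numbers are easy to verify.

```python
import torch

pad_idx = 0
ids = torch.tensor([[3, 4, 0, 0],
                    [5, 6, 7, 8]])                 # [B=2, T=4]
# Stand-in "embeddings": each token id i maps to the vector [i, 2*i].
emb = torch.stack([ids, 2 * ids], dim=-1).float()  # [B, T, E=2]

mask = (ids != pad_idx).unsqueeze(-1)              # [B, T, 1]
summed = (emb * mask).sum(dim=1)                   # [B, E]
count = mask.sum(dim=1).clamp(min=1)               # [B, 1]
masked_mean = summed / count

naive_mean = emb.mean(dim=1)                       # divides by T, pads included
print(masked_mean)
print(naive_mean)
```

For the first row the masked mean averages over the 2 real tokens, while the naive mean divides by 4 and so shrinks the representation of short sentences.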


In [None]:
class RNNClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden=64, rnn_type='lstm', pad_idx=0):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        if rnn_type == 'gru':
            self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)
        else:
            self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)
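    def forward(self, x):  # x: [B, T]
        e = self.emb(x)    # [B, T, E]
        # Pack so the final hidden state comes from the last real token, not from padding.
        lengths = (x != self.emb.padding_idx).sum(dim=1).clamp(min=1).cpu()
        packed = nn.utils.rnn.pack_padded_sequence(e, lengths, batch_first=True, enforce_sorted=False)
        _, h = self.rnn(packed)
        if isinstance(h, tuple):  # LSTM returns (h_n, c_n)
            h = h[0]
        last = h[-1]  # [B, hidden]
        return self.fc(last)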

rnn_model = RNNClassifier(vocab_size, emb_dim=64, hidden=64, rnn_type='lstm', pad_idx=pad_idx)
train_eval(rnn_model, train_dl, val_dl, epochs=20, lr=5e-3, device=device)
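
With right-padded batches, an RNN keeps updating its hidden state on the pad positions unless it is told the true lengths. The sketch below illustrates this on a single sequence with a small GRU, using `pack_padded_sequence`; the layer sizes and token ids are arbitrary.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
pad_idx = 0
emb = nn.Embedding(10, 8, padding_idx=pad_idx)
gru = nn.GRU(8, 16, batch_first=True)

short = torch.tensor([[3, 4]])                 # one sequence, no padding
padded = torch.tensor([[3, 4, 0, 0]])          # same sequence, right-padded

with torch.no_grad():
    _, h_ref = gru(emb(short))                 # reference: state after the last real token
    _, h_naive = gru(emb(padded))              # keeps updating through the pad steps

    lengths = (padded != pad_idx).sum(dim=1).clamp(min=1)
    packed = pack_padded_sequence(emb(padded), lengths.cpu(),
                                  batch_first=True, enforce_sorted=False)
    _, h_packed = gru(packed)                  # stops at each sequence's true length

print(torch.allclose(h_packed, h_ref, atol=1e-5))
print(torch.allclose(h_naive, h_ref, atol=1e-5))
```

Even though the pad embedding is a zero vector, the recurrent weights and biases still change the hidden state, which is why the unpacked result drifts away from the reference.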


In [None]:
class TinyTransformerClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, nhead=4, num_layers=2, pad_idx=0):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        encoder_layer = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.fc = nn.Linear(emb_dim, 1)
        self.pad_idx = pad_idx

    def forward(self, x):
        e = self.emb(x)  # [B,T,E]
        # Key padding mask: True marks pad positions that attention should ignore
        src_key_padding_mask = (x == self.pad_idx)  # [B,T]
        enc = self.encoder(e, src_key_padding_mask=src_key_padding_mask)
        # Mean pool over non-pad tokens
        mask = (~src_key_padding_mask).unsqueeze(-1)  # [B,T,1]
        summed = (enc * mask).sum(dim=1)
        count  = mask.sum(dim=1).clamp(min=1)
        avg    = summed / count
        return self.fc(avg)

tx_model = TinyTransformerClassifier(vocab_size, emb_dim=64, nhead=4, num_layers=2, pad_idx=pad_idx)
train_eval(tx_model, train_dl, val_dl, epochs=25, lr=3e-3, device=device)
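
The classifier above feeds token embeddings straight into the encoder, so it has no information about word order: `nn.TransformerEncoder` does not add positions itself, and mean pooling is order-insensitive. One common remedy, sketched here as an optional extension and not as part of the notebook's model, is to add a fixed sinusoidal position table to the embeddings. The function name `sinusoidal_positions` is hypothetical.

```python
import math
import torch

def sinusoidal_positions(seq_len: int, dim: int) -> torch.Tensor:
    """Fixed sin/cos position table of shape [seq_len, dim]; dim must be even."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)            # [T, 1]
    freq = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                     * (-math.log(10000.0) / dim))                           # [dim/2]
    table = torch.zeros(seq_len, dim)
    table[:, 0::2] = torch.sin(pos * freq)    # even columns
    table[:, 1::2] = torch.cos(pos * freq)    # odd columns
    return table

tokens = torch.zeros(2, 5, 64)                 # stand-in for embeddings [B, T, E]
pe = sinusoidal_positions(5, 64)
with_positions = tokens + pe.unsqueeze(0)      # broadcast over the batch dimension
print(with_positions.shape)
```

In a model, the addition would sit between the embedding lookup and the encoder call; a learned `nn.Embedding` over positions is a frequently used alternative.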


## 12. Inference Utility
Test your trained model on custom sentences.


In [None]:
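def predict_sentence(model, sentence, threshold=0.5):
    model.eval()
    model_device = next(model.parameters()).device  # run wherever the model lives
    with torch.no_grad():
        toks = basic_tokenize(sentence)
        # Fall back to a single <unk> so empty input still yields a valid [1, T] batch.
        ids  = torch.tensor([numericalize(toks) or [stoi['<unk>']]], dtype=torch.long)
        logits = model(ids.to(model_device))
        prob = torch.sigmoid(logits).item()
        label = 'POSITIVE' if prob >= threshold else 'NEGATIVE'
        return {'prob_positive': prob, 'label': label}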

# Example (after training)
print(predict_sentence(tx_model, "i really loved this fantastic film"))
print(predict_sentence(tx_model, "this movie was dull and terrible"))


## 13. Exercises (Try on Your Own)
1. Replace `nn.ReLU()` with `nn.LeakyReLU()` or `nn.GELU()` and compare accuracy.
2. Add **dropout** layers and observe overfitting changes.
3. Implement a **character‑level** tokenizer and compare to word‑level.
4. Add a **learning rate scheduler** to the NLP training function.
5. Use **weight decay** and compare validation performance.
6. Try **GRU** instead of LSTM (`rnn_type='gru'`).
7. Increase dataset size with more handcrafted sentences; does the transformer help more?
8. Add **early stopping** based on validation loss.
9. Save the best model’s `state_dict` and reload it for inference.
10. If on PyTorch 2.x with CUDA, try wrapping the model with `torch.compile` and compare speed.


## 14. Appendix — Quick Cheatsheet
- **Tensors**: `torch.tensor`, `torch.arange`, `.to(device)`, `.view()/.reshape()`, `.permute()`
- **Autograd**: `requires_grad`, `.backward()`, `with torch.no_grad()`
- **Modules**: `nn.Linear`, `nn.Conv2d`, `nn.RNN/LSTM/GRU`, `nn.Embedding`, `nn.Dropout`, `nn.LayerNorm`
- **Losses**: `nn.MSELoss`, `nn.CrossEntropyLoss`, `nn.BCEWithLogitsLoss`
- **Optim**: `optim.SGD/Adam/AdamW`, `optimizer.zero_grad()`, `loss.backward()`, `optimizer.step()`
- **Data**: `Dataset`, `DataLoader`, `collate_fn`, `num_workers`
- **Utilities**: `torch.save`, `torch.load`, `state_dict`
- **Advanced**: `torch.cuda.amp` for mixed precision, `torch.compile` (2.x), LR schedulers
