# Learning PyTorch
Here I document my process of learning PyTorch, a deep learning framework. These notes cover:
- **Tensors** – the core data structure in PyTorch
- **Autograd** – automatic differentiation for training
- **Building Models with nn.Module** – how to define neural network architectures
- **Common Layers & Activation Functions** – essential building blocks for networks
- **Loss Functions** – how to quantify model performance
- **Optimizers** – algorithms to update model parameters
- **Training Loop Structure** – putting it all together to train a model
- **Datasets & DataLoaders** – loading and batching data for training
- **Using GPUs** – leveraging CUDA for faster computation
- **Debugging & Common Pitfalls** – tips to avoid or fix common errors
## Tensors
**Tensors** are the fundamental data structure for storing and manipulating data in PyTorch.

**Tensor attributes**: Every tensor has a **shape** (its dimensions), a **dtype** (data type, e.g. `float32`, `int64`), and a **device** (CPU or GPU) where it is stored.

In [1]:
import torch
import numpy as np

# 1. Directly from data (e.g. list or nested lists)
data = [[1, 2], [3, 4]]
x_data = torch.tensor(data) # infers dtype automatically

# 2. From a NumPy array
np_array = np.array(data)
x_np = torch.from_numpy(np_array) # shares memory with the NumPy array (no copy is made)

# 3. Using built-in initializers
x_ones = torch.ones_like(x_data) # tensor of ones with same shape as x_data
x_rand = torch.rand_like(x_data, dtype=torch.float32) # random values; dtype is overridden because x_data is int64

# 4. With specific shapes and values
shape = (2, 3)
rand_tensor = torch.rand(shape) # random values in [0,1)
ones_tensor = torch.ones(shape) # all ones
zeros_tensor = torch.zeros(shape) # all zeros

tensor = torch.rand(3, 4)
print("Shape:", tensor.shape)
print("Datatype:", tensor.dtype)
print("Device:", tensor.device)

Shape: torch.Size([3, 4])
Datatype: torch.float32
Device: cpu
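The memory sharing mentioned for `torch.from_numpy` can be seen directly. The following is a minimal sketch (the variable names `shared` and `copied` are my own) contrasting it with `torch.tensor`, which copies the data:

```python
import numpy as np
import torch

np_array = np.array([[1, 2], [3, 4]])
shared = torch.from_numpy(np_array)   # uses the same underlying buffer as np_array
copied = torch.tensor(np_array)       # copies the data into a new tensor

np_array[0, 0] = 100                  # modify the NumPy array in place

shared_value = shared[0, 0].item()    # sees the change: 100
copied_value = copied[0, 0].item()    # unaffected: 1
print(shared_value, copied_value)
```

This is worth keeping in mind when preprocessing data in NumPy: an in-place change to the array silently changes any tensor created from it with `from_numpy`.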


In [3]:
# Basic Operations

# Indexing and slicing
tensor = torch.ones(4, 4)
print(tensor[0]) # First row
print(tensor[:, 0]) # First column

# Elementwise operations
tensor = torch.tensor([[1.0, 2.0],[3.0, 4.0]])
tensor = tensor * 2 + 1
print(tensor)

# Matrix multiplication
A = torch.rand(2, 3)
B = torch.rand(3, 4)
C = A @ B # matrix product resulting in shape (2,4)
print(C)

tensor([1., 1., 1., 1.])
tensor([1., 1., 1., 1.])
tensor([[3., 5.],
        [7., 9.]])
tensor([[1.7984, 1.0619, 1.2744, 0.6247],
        [0.5072, 0.3412, 0.4018, 0.2094]])


## Automatic Differentiation (Autograd)
Autograd frees you from manually computing gradients. It records operations on tensors to build a computational graph, and can then backpropagate gradients through this graph for you.

If a tensor is created with `requires_grad=True`, PyTorch tracks all operations on it. When you call `.backward()` on a scalar output, it computes the gradient of that output with respect to every tensor that has `requires_grad=True` and contributed to it. Those gradients are then stored in the `.grad` attribute of each tensor.

By default, tensors do not track gradients; you need to explicitly indicate which tensors require them. Model parameters (weights and biases in neural networks) are the exception: they require gradients by default when you use `nn.Module` layers.


In [6]:
# Create a tensor and enable gradient tracking
x = torch.tensor([3.0], requires_grad=True) # a tensor with value 3.0
print(x.requires_grad) # True
# Define a simple function of x
y = x**2 + 2*x + 1 # y = x^2 + 2x + 1
# Compute gradient dy/dx by backpropagation
y.backward() # y is a scalar (1-element tensor), so we can call backward directly
print(x.grad) # prints the gradient (dy/dx) at x=3.0 (2*x + 2)

True
tensor([8.])


In [9]:
w = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([1.0], requires_grad=True)
# data point
X = torch.tensor([3.0])
Y = torch.tensor([12.0])
# Forward pass: compute prediction and loss
pred = w * X + b # model prediction
loss = (pred - Y)**2 # squared error (equals the mean squared error for a single point)
# Backward pass: compute gradients
loss.backward() # compute dloss/dw and dloss/db
print(w.grad) # gradient of loss w.r.t. w
print(b.grad) # gradient of loss w.r.t. b

tensor([-30.])
tensor([-10.])
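These values can be checked by hand with the chain rule. For loss = (wX + b - Y)^2, the derivatives are dloss/dw = 2(wX + b - Y)X and dloss/db = 2(wX + b - Y). A small sketch comparing autograd against those formulas (the names `manual_dw` and `manual_db` are my own):

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([1.0], requires_grad=True)
X = torch.tensor([3.0])
Y = torch.tensor([12.0])

loss = ((w * X + b - Y) ** 2).sum()
loss.backward()

# Analytic gradients, computed without recording them in the graph
with torch.no_grad():
    residual = w * X + b - Y        # 7 - 12 = -5
    manual_dw = 2 * residual * X    # 2 * (-5) * 3 = -30
    manual_db = 2 * residual        # 2 * (-5) = -10

grads_match = torch.allclose(w.grad, manual_dw) and torch.allclose(b.grad, manual_db)
print(grads_match)
```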


## Building Neural Networks with `nn.Module`
`nn.Module` is the base class for all neural network layers and models, and your custom models will inherit from it as well. Modules provide a convenient way to encapsulate learnable parameters, layers, and the forward computation. When you subclass `nn.Module`, you define layers as attributes in `__init__` and implement a `forward` method defining how data flows through those layers. PyTorch automatically collects your parameters and provides utility methods such as `parameters()`, `to()`, and `state_dict()`.

In [10]:
import torch.nn as nn
class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleNet, self).__init__() # initialize base class
        # Define layers
        self.hidden = nn.Linear(input_size, hidden_size) # linear layer 1
        self.relu = nn.ReLU() # activation
        self.output = nn.Linear(hidden_size, output_size) # linear layer 2 (output)
    def forward(self, x):
        x = self.hidden(x) # apply first linear layer
        x = self.relu(x)   # apply ReLU activation
        x = self.output(x) # apply second linear layer
        return x


In `__init__`, after calling `super().__init__()`, we define our layers: two `nn.Linear`
layers and an `nn.ReLU` activation. Each of these is itself an `nn.Module`. By assigning
them to `self.hidden`, `self.relu`, etc., we let PyTorch register them as sub-modules of our
model. This means all their parameters (weights, biases) are now part of `SimpleNet`’s
parameters.

In the `forward` method, we define how the input tensor flows through the network: first
through the hidden linear layer, then a ReLU, then the output linear layer. The return value of
`forward` is the model’s output. Important: you don’t call `forward` directly;
instead, you call the model instance on an input, like `model(x)`, which internally invokes
`forward` along with any registered hooks.
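To make the difference between `model(x)` and `model.forward(x)` concrete, here is a small sketch. It re-declares `SimpleNet` so that it runs on its own, and uses a forward hook plus a `hook_calls` list (both added only for illustration) to show that hooks run only when the model instance is called:

```python
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.output = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        return self.output(self.relu(self.hidden(x)))

model = SimpleNet(10, 5, 2)

hook_calls = []  # illustrative only: records each time the hook fires
model.register_forward_hook(
    lambda module, inputs, output: hook_calls.append(output.shape)
)

x = torch.rand(4, 10)      # batch of 4 samples with 10 features each
out = model(x)             # runs forward() and then the registered hook
_ = model.forward(x)       # runs forward() only; the hook is skipped

# Parameters of both linear layers were registered automatically:
# (10*5 + 5) + (5*2 + 2) = 67
num_params = sum(p.numel() for p in model.parameters())
print(out.shape, len(hook_calls), num_params)
```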

**nn.Module and nested modules**: You can compose modules within modules. For example,
`nn.Sequential` is a handy container module that stacks layers in order without
requiring you to write the layer-by-layer forward logic yourself. In fact, our `SimpleNet` above could alternatively be written using `nn.Sequential`:

In [17]:
class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, output_size)
        )
    def forward(self, x):
        return self.net(x)

model = SimpleNet(10, 5, 2)
print(model)


SimpleNet(
  (net): Sequential(
    (0): Linear(in_features=10, out_features=5, bias=True)
    (1): ReLU()
    (2): Linear(in_features=5, out_features=2, bias=True)
  )
)


**Moving the model to a device**: You can call `model.to(device)` to move all of its parameters and buffers to a device, for example the GPU if `device = torch.device('cuda')`.
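One detail that is easy to miss: `.to(device)` moves a module in place, but for a tensor it returns a new tensor, so the result has to be reassigned. A minimal sketch, using a single `nn.Linear` as a stand-in model:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2)
model.to(device)            # modules are moved in place

x = torch.rand(4, 10)
x = x.to(device)            # tensors are not: keep the returned tensor

param_devices = {p.device.type for p in model.parameters()}
out = model(x)
print(param_devices, out.device.type)
```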

## Common Layers and Activation Functions
- **Linear (Fully Connected) Layer**: `nn.Linear(in_features, out_features)` – Applies a linear transformation. It has learnable weights of shape `(out_features, in_features)` and biases of shape `(out_features,)`. For example, `nn.Linear(28*28, 512)` creates a layer that takes an input vector of size 784 (like a flattened 28×28 image) and produces an output of size 512.
- **Convolutional Layer**: `nn.Conv2d(in_channels, out_channels, kernel_size, ...)` – Convolves learned kernels over input feature maps. There are analogous `nn.Conv1d` and `nn.Conv3d` layers for 1D (sequence) and 3D (volumetric) data. Convolutional layers are key for image recognition tasks.
- **Recurrent Layers**: `nn.LSTM`, `nn.GRU`, `nn.RNN` – Layers for sequence data (text, time series). These maintain internal state and are a bit more complex to use, but PyTorch’s implementations handle a lot of details for you. If you have a sequence of word embeddings, for example, an LSTM can process them and output a hidden state sequence.
- **Embedding Layer**: `nn.Embedding(num_embeddings, embedding_dim)` – A trainable lookup table for discrete inputs (e.g. word indices to dense vectors). Used in NLP tasks to map token IDs to vectors.
- **Dropout**: `nn.Dropout(p)` – Randomly zeroes out some fraction p of elements in the input (each forward pass) to help prevent overfitting. Dropout is only active in training mode (`model.train()`) and automatically de-activates in evaluation mode (`model.eval()`).
- **BatchNorm**: `nn.BatchNorm1d`, `nn.BatchNorm2d` – Normalize the activations of the previous layer to have stable mean and variance, which can help training. Like Dropout, BatchNorm behaves differently in training vs. evaluation (during training it uses batch statistics, during eval it uses the running statistics accumulated during training).
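A quick way to get familiar with the layers above is to push dummy tensors through them and look at the output shapes. The sizes below (batch of 8, vocabulary of 1000, and so on) are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

linear = nn.Linear(28 * 28, 512)
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=32)
lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
dropout = nn.Dropout(p=0.5)

images = torch.rand(8, 1, 28, 28)           # (batch, channels, height, width)
tokens = torch.randint(0, 1000, (8, 20))    # (batch, sequence length)

linear_out = linear(images.flatten(start_dim=1))   # (8, 512)
conv_out = conv(images)                            # (8, 16, 28, 28)
embedded = embedding(tokens)                       # (8, 20, 32)
lstm_out, (h_n, c_n) = lstm(embedded)              # lstm_out: (8, 20, 64)

# Dropout depends on the mode of the module
ones = torch.ones(1000)
dropout.train()
dropped_train = dropout(ones)   # some entries zeroed, the rest scaled by 1 / (1 - p)
dropout.eval()
dropped_eval = dropout(ones)    # identity in evaluation mode

print(linear_out.shape, conv_out.shape, embedded.shape, lstm_out.shape)
```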



**Activation Functions**: Non-linear activations introduce the non-linearity needed for neural networks to learn complex patterns. PyTorch provides many common activations as modules in `torch.nn` and as functions in `torch.nn.functional`. Some widely used ones:
- **ReLU**: `nn.ReLU()` – A very common default activation for hidden layers.
- **Sigmoid**: `nn.Sigmoid()` – Useful for probabilities in binary classification.
- **Tanh**: `nn.Tanh()` – Used historically in some networks.
- **Softmax**: `nn.Softmax(dim)` – Often used to interpret _multi-class_ outputs as probabilities. Note that for training classification models, you typically do not put a Softmax at the end if you’re using `nn.CrossEntropyLoss`, as that loss function expects raw logits and internally applies a LogSoftmax.
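As a small illustration of these activations, the sketch below applies each one to the same hand-picked input row:

```python
import torch
import torch.nn as nn

x = torch.tensor([[-2.0, 0.0, 3.0]])

relu_out = nn.ReLU()(x)              # negative values are clipped to 0
sigmoid_out = nn.Sigmoid()(x)        # each value is squashed into (0, 1)
tanh_out = nn.Tanh()(x)              # each value is squashed into (-1, 1)
softmax_out = nn.Softmax(dim=1)(x)   # values along dim 1 become probabilities summing to 1

print(relu_out)
print(sigmoid_out)
print(tanh_out)
print(softmax_out, softmax_out.sum(dim=1))
```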

## Loss Functions
A loss function quantifies how far the model’s predictions are from the targets, and the goal of training is to minimize this loss. PyTorch provides many standard loss functions in `torch.nn`, typically as `nn.XLoss` classes whose instances can be called like functions.
- **Mean Squared Error (MSE) Loss**: `nn.MSELoss` – Used for regression tasks (continuous output). It computes the average of squared differences between predicted and actual values.
- **Binary Cross-Entropy Loss**: `nn.BCELoss` – Used for binary classification (often with a Sigmoid output). It calculates the binary cross-entropy between target and output probabilities. (There’s also `nn.BCEWithLogitsLoss`, which combines a sigmoid and binary cross-entropy in one step and is more numerically stable.)
- **Negative Log-Likelihood Loss**: `nn.NLLLoss` – Often used for classification in combination with a LogSoftmax output. It’s basically the negative log of the probability of the correct class.
- **Cross-Entropy Loss**: `nn.CrossEntropyLoss` – The most commonly used loss for multi-class classification. It combines a log-softmax and `NLLLoss` in a single class. Important: if you use `CrossEntropyLoss`, your model’s output should be raw, unnormalized scores (a.k.a. logits). `nn.CrossEntropyLoss` internally applies `F.log_softmax` and computes `NLLLoss`. In other words, do not apply Softmax to your output before feeding it to `CrossEntropyLoss`, or you’ll effectively be applying softmax twice.
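The relationships between these losses can be checked numerically. The sketch below uses random logits and made-up targets to confirm that `CrossEntropyLoss` on logits matches `NLLLoss` on log-softmax outputs, and that `BCEWithLogitsLoss` matches `BCELoss` applied after a sigmoid:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Multi-class case: 4 samples, 3 classes
logits = torch.randn(4, 3)                 # raw, unnormalized scores
targets = torch.tensor([0, 2, 1, 2])       # class indices

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

# Binary case: 4 samples, one logit each
binary_logits = torch.randn(4)
binary_targets = torch.tensor([1.0, 0.0, 1.0, 0.0])

bce_with_logits = nn.BCEWithLogitsLoss()(binary_logits, binary_targets)
bce = nn.BCELoss()(torch.sigmoid(binary_logits), binary_targets)

print(torch.allclose(ce, nll), torch.allclose(bce_with_logits, bce, atol=1e-6))
```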

## Optimizers and Updating Parameters
Once `loss.backward()` has computed the gradients, the next step is to use them to update the model’s parameters in order to reduce the loss. This is done by an optimizer. PyTorch provides many optimization algorithms in the `torch.optim` package. The most basic one is _Stochastic Gradient Descent_ (SGD), but there are others like _Adam_, _RMSprop_, _Adagrad_, etc.

**Setting up an optimizer**: To create an optimizer, you need to specify which parameters it should
update and what learning rate (step size) to use. Typically, you pass `model.parameters()` to the optimizer, which grabs all the learnable parameters of the model.


In [20]:
import torch.optim as optim
model = SimpleNet(10, 5, 2)
optimizer = optim.SGD(model.parameters(), lr=0.01)

With an optimizer in place, each training iteration has four steps:

1. **Forward pass**: compute model predictions and loss.
2. **Backward pass**: `loss.backward()` to compute gradients.
3. **Update parameters**: `optimizer.step()` to adjust weights by gradients.
4. **Zero gradients**: `optimizer.zero_grad()` to clear accumulated gradients for the next iteration.

It’s **crucial** to zero the gradients between updates; otherwise, gradients from subsequent backward calls accumulate (sum) on top of the current ones. Why do gradients accumulate? Because PyTorch, by default, adds each new gradient to any existing `.grad` value. This is useful for some algorithms (like accumulating gradients over multiple mini-batches, or truncated backpropagation through time for RNNs), but in standard training you want to update using the gradient of the current batch only, so you must zero the gradients before computing the next batch’s gradient.
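The accumulation behavior is easy to reproduce on a single tensor. In this sketch the function is simply 3w, so each backward call contributes a gradient of 3:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(3 * w).sum().backward()
first = w.grad.clone()          # tensor([3.])

(3 * w).sum().backward()        # no zeroing in between
accumulated = w.grad.clone()    # tensor([6.]): the new gradient was added to the old one

w.grad.zero_()                  # manual reset; optimizer.zero_grad() clears .grad for all its parameters
(3 * w).sum().backward()
after_zero = w.grad.clone()     # tensor([3.])

print(first, accumulated, after_zero)
```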

**Learning Rate Schedulers**: PyTorch also provides schedulers (`torch.optim.lr_scheduler`) to
adjust the learning rate during training (e.g., reduce it if validation loss plateaus). This can improve training.
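As a minimal sketch of how a scheduler is wired in, the example below uses `StepLR`, which multiplies the learning rate by `gamma` every `step_size` epochs. The values here are arbitrary, and the training work is reduced to a placeholder `optimizer.step()`. For the plateau case mentioned above, `ReduceLROnPlateau` is the usual choice, and its `step()` takes the monitored metric as an argument.

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

lrs = []
for epoch in range(6):
    # ... one epoch of training would go here ...
    optimizer.step()                             # placeholder for the real parameter updates
    lrs.append(optimizer.param_groups[0]["lr"])  # learning rate used during this epoch
    scheduler.step()                             # advance the schedule once per epoch

print(lrs)   # the rate halves after every second epoch
```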

## Training Loop Structure
1. Preparation: Decide number of epochs to train, ensure your model, optimizer, and data loader are ready. Optionally prepare a validation/test data loader for evaluation.
2. Set the model to training mode: `model.train()`. This makes dropout and batchnorm behave in training mode (if you use them).
3. Loop over the training dataset:
- For each batch, load the data (inputs and targets). If using a GPU, move them to the GPU device with `inputs, targets = inputs.to(device), targets.to(device)` so that they match the model’s device.
- Forward pass: `outputs = model(inputs)`.
- Compute loss: `loss = loss_fn(outputs, targets)`.
- Backward pass: `optimizer.zero_grad()` (clear old gradients), then `loss.backward()` (compute new gradients).
- Update params: `optimizer.step()` (adjust model weights).
4. Evaluate on a validation set or test set to monitor performance:
- Set the model to evaluation mode: `model.eval()`. This tells layers like dropout and batchnorm to fix their behavior (e.g., not dropping units, using running stats).
- Disable gradient tracking: inside a `with torch.no_grad():` block, loop over the validation DataLoader, compute outputs and loss/accuracy, and accumulate metrics.
- Print or record the validation loss/accuracy. You might also save a model checkpoint if it’s the best so far.
5. Adjust the learning rate if using a scheduler (optional).
6. End of training: save the final model.

In [22]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

BATCH_SIZE = 128
EPOCHS = 5
LR = 1e-3
SEED = 42
torch.manual_seed(SEED)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)

transform = transforms.Compose([
    transforms.ToTensor(),                  # [0,255] -> [0,1], shape: (1,28,28)
    transforms.Normalize((0.1307,), (0.3081,))  # standard MNIST normalization
])

full_train = datasets.MNIST(root="./data", train=True, download=True, transform=transform)

test_ds = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

# Split train into train/val
train_size = int(0.9 * len(full_train))
val_size = len(full_train) - train_size
train_ds, val_ds = random_split(full_train, [train_size, val_size])

train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, pin_memory=True)
val_loader   = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True)
test_loader  = DataLoader(test_ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True)

class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  # 28x28 -> 14x14

            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                 # (N,32,7,7) -> (N, 1568)
            nn.Linear(32 * 7 * 7, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, 10)            # logits for 10 classes
        )
    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

model = SmallCNN().to(device)

loss_fn = nn.CrossEntropyLoss()  # expects logits + class indices
optimizer = optim.Adam(model.parameters(), lr=LR)

def run_eval(loader, model):
    model.eval()
    total_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():
        for x, y in loader:
            x = x.to(device, non_blocking=True)
            y = y.to(device, non_blocking=True)

            logits = model(x)
            loss = loss_fn(logits, y)

            total_loss += loss.item() * x.size(0)
            preds = logits.argmax(dim=1)
            correct += (preds == y).sum().item()
            total += y.size(0)

    avg_loss = total_loss / total
    acc = correct / total
    return avg_loss, acc

def train_one_epoch(loader, model):
    model.train()
    total_loss = 0.0
    total = 0

    for x, y in loader:
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)

        optimizer.zero_grad(set_to_none=True)
        logits = model(x)
        loss = loss_fn(logits, y)
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * x.size(0)
        total += y.size(0)

    return total_loss / total

best_val_acc = 0.0

for epoch in range(1, EPOCHS + 1):
    train_loss = train_one_epoch(train_loader, model)
    val_loss, val_acc = run_eval(val_loader, model)

    print(f"Epoch {epoch:02d}/{EPOCHS} | train loss: {train_loss:.4f} | val loss: {val_loss:.4f} | val acc: {val_acc*100:.2f}%")

    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), "mnist_smallcnn_best.pth")

print(f"\nBest val acc: {best_val_acc*100:.2f}%")
print("Saved best model to: mnist_smallcnn_best.pth")

model.load_state_dict(torch.load("mnist_smallcnn_best.pth", map_location=device))
test_loss, test_acc = run_eval(test_loader, model)
print(f"Test loss: {test_loss:.4f} | Test acc: {test_acc*100:.2f}%")

Device: cpu




Epoch 01/5 | train loss: 0.2420 | val loss: 0.0888 | val acc: 97.13%
Epoch 02/5 | train loss: 0.0697 | val loss: 0.0549 | val acc: 98.45%
Epoch 03/5 | train loss: 0.0493 | val loss: 0.0496 | val acc: 98.47%
Epoch 04/5 | train loss: 0.0387 | val loss: 0.0411 | val acc: 98.82%
Epoch 05/5 | train loss: 0.0323 | val loss: 0.0416 | val acc: 98.72%

Best val acc: 98.82%
Saved best model to: mnist_smallcnn_best.pth
Test loss: 0.0318 | Test acc: 98.91%


## Using GPUs for Acceleration
**Moving tensors/models to GPU**: To use a GPU for computation, you must explicitly move your data to it. This includes model parameters and any tensors used in computations. Important: all inputs to a model must be on the same device as the model’s parameters. If your model is on the GPU and you pass it a CPU tensor, you’ll get a runtime error such as “_Expected all tensors to be on the same device_”. So always move your inputs (and targets) to the GPU as well.


**Multiple GPUs**: PyTorch supports data parallelism (via `nn.DataParallel` or better, `DistributedDataParallel`). If you have one GPU, cuda:0 is the typical device. If multiple, you might specify device IDs or use DistributedDataParallel for best performance.


By writing `device = torch.device("cuda" if torch.cuda.is_available() else "cpu")` and using `.to(device)`, your code automatically falls back to the CPU if no GPU is present. This is good practice for making your code portable.
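One way to avoid device mismatches is to move each batch to wherever the model’s parameters live. The helper `to_model_device` below is a hypothetical convenience function of my own, not part of PyTorch:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(4, 2).to(device)

def to_model_device(batch, model):
    """Hypothetical helper: move every tensor in a batch to the model's device."""
    target = next(model.parameters()).device
    return tuple(t.to(target) for t in batch)

batch = (torch.rand(8, 4), torch.randint(0, 2, (8,)))   # created on the CPU
inputs, targets = to_model_device(batch, model)
outputs = model(inputs)

same_device = inputs.device == outputs.device == next(model.parameters()).device
print(outputs.shape, same_device)
```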


## Debugging and Common Pitfalls
- Forgetting to switch between training/evaluation mode
- Not zeroing gradients
- Mixing up devices
- Calling `model.forward()` instead of `model()`
- Incorrect loss usage
- Not using `torch.no_grad()` correctly
- No improvement in training (model not learning)
- Learning rate issues
- Gradient issues
- Overfitting a single batch
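The last item can also be read as a sanity check: if a model cannot drive the loss down on one small, fixed batch, something in the model, loss, or optimizer setup is probably wrong. A minimal sketch with random data and an arbitrary small network (all sizes and hyperparameters here are my own choices):

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-2)

# One fixed batch of 16 random samples with random labels
x = torch.randn(16, 10)
y = torch.randint(0, 3, (16,))

model.train()
losses = []
for step in range(200):
    optimizer.zero_grad()            # clear gradients from the previous step
    loss = loss_fn(model(x), y)      # forward pass on the same batch every time
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

If the final loss is not clearly lower than the first, that points to a bug (wrong loss inputs, missing `optimizer.step()`, frozen parameters) rather than to a lack of data.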