# Introducing PyTorch

This section walks through the parts of the PyTorch API you will use most often when building neural networks.
We start with how PyTorch works for an image classification task and then move on to how the same building blocks are used in LLMs.
The same concepts will be used to build the first LLM in this course.


In [None]:
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

PyTorch offers many datasets that can be used for training and testing. The `torchvision.datasets` module contains `Dataset` objects for many
real-world vision datasets such as CIFAR and COCO ([full list
here](https://pytorch.org/vision/stable/datasets.html)). In this
tutorial, we use the FashionMNIST dataset. Every TorchVision `Dataset`
accepts two arguments, `transform` and `target_transform`, to modify the
samples and labels respectively.

For this example, the images are converted to tensors with `ToTensor()`. No `target_transform` is passed, so the labels stay as integer class indices, which is the target format `nn.CrossEntropyLoss` expects for this classification task.
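
As a side note on `target_transform`: the cells in this section pass only `transform`, so the labels reach the model as integer class indices. If one-hot targets were wanted, a callable along the following lines could be supplied. This is a minimal sketch; `to_one_hot` and the default `num_classes=10` are illustrative choices, not part of TorchVision.

```python
import torch
import torch.nn.functional as F


def to_one_hot(label, num_classes=10):
    """Turn an integer class label into a one-hot float vector (hypothetical helper)."""
    return F.one_hot(torch.tensor(label), num_classes=num_classes).float()


encoded = to_one_hot(3)
print(encoded)  # a length-10 vector with a single 1.0 at index 3
```

Such a function could be passed as `target_transform=to_one_hot` when constructing the dataset. It is not needed for the training code in this section, since `nn.CrossEntropyLoss` works directly with integer class indices.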


In [None]:
training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor(),
)

We pass the `Dataset` as an argument to `DataLoader`. This wraps an
iterable over our dataset, and supports automatic batching, sampling,
shuffling and multiprocess data loading. Here we define a batch size of
64, i.e. each element in the dataloader iterable will return a batch of
64 features and labels.
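
To make the batching behaviour concrete without downloading anything, here is a small sketch on synthetic data. The tensor shapes mimic FashionMNIST images, and the sample count of 150 is an arbitrary choice made so that the last batch comes out smaller than the others.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for an image dataset: 150 samples shaped like FashionMNIST images.
features = torch.randn(150, 1, 28, 28)
labels = torch.randint(0, 10, (150,))
loader = DataLoader(TensorDataset(features, labels), batch_size=64)

# Collect the number of samples in each batch.
batch_sizes = [len(X) for X, y in loader]
print(batch_sizes)  # [64, 64, 22]
print(len(loader))  # 3 batches per pass over the data
```

The final batch holds the 22 leftover samples. Passing `drop_last=True` to `DataLoader` would discard it instead.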


In [None]:
batch_size = 64

train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

for X, y in test_dataloader:
    print(f"Shape of X [N, C, H, W]: {X.shape}")
    print(f"Shape of y: {y.shape} {y.dtype}")
    break

---


# Creating Models

When we are creating a neural network to train on a task in PyTorch, we create a class that inherits
from
[nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html).
We define the layers of the network in the `__init__` function and
specify how data will pass through the network in the `forward`
function. To accelerate operations in the neural network, we move it to
the
[accelerator](https://pytorch.org/docs/stable/torch.html#accelerators)
such as CUDA, MPS, MTIA, or XPU. If the current accelerator is
available, we will use it. Otherwise, we use the CPU.
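
The `torch.accelerator` namespace is only present in recent PyTorch releases. On an older install, a fallback along these lines may be needed. This is a minimal sketch that only checks for CUDA and MPS; `pick_device` is a hypothetical helper, not a PyTorch function.

```python
import torch


def pick_device():
    """Return the name of the best available device (hypothetical helper)."""
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps may be missing on very old releases, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(f"Using {device} device")
```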


In [None]:
device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"
print(f"Using {device} device")

# Define model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(device)
print(model)

---


# Optimizing the Model Parameters

To train a model, we need a [loss
function](https://pytorch.org/docs/stable/nn.html#loss-functions) and an
[optimizer](https://pytorch.org/docs/stable/optim.html).
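
To see what the cross-entropy loss used in this section measures, the sketch below compares `nn.CrossEntropyLoss` with the same quantity computed by hand. The logits and targets are made-up values for illustration.

```python
import torch
from torch import nn

# Raw model outputs (logits) for 2 samples and 3 classes.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])  # integer class indices, not one-hot vectors

loss = nn.CrossEntropyLoss()(logits, targets)

# The same quantity by hand: mean negative log-probability of the true class.
log_probs = torch.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(len(targets)), targets].mean()

print(loss.item(), manual.item())
```

Note that the loss takes raw logits; the softmax is applied internally, which is why the model's final layer has no activation.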


In [None]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

In a single training loop, the model makes predictions on the training
dataset (fed to it in batches), and backpropagates the prediction error
to adjust the model's parameters.
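
The roles of `loss.backward()`, `optimizer.step()`, and `optimizer.zero_grad()` can be checked on a toy problem. The sketch below uses a single weight vector `w` as a stand-in for model parameters; all values are made up. It assumes plain SGD without momentum or weight decay, where the update rule is `w = w - lr * grad`.

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)  # stand-in for model parameters
x = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor(2.0)
lr = 0.1
optimizer = torch.optim.SGD([w], lr=lr)


def squared_error():
    return (w @ x - target) ** 2


# backward() fills w.grad with the gradient of the loss with respect to w.
squared_error().backward()
grad_once = w.grad.clone()

# Without zero_grad(), a second backward() adds to the stored gradient.
squared_error().backward()
accumulated = torch.allclose(w.grad, 2 * grad_once)

# Reset, recompute, and compare optimizer.step() with the plain SGD rule.
optimizer.zero_grad()
squared_error().backward()
expected = w.detach() - lr * w.grad
optimizer.step()
step_matches = torch.allclose(w.detach(), expected)
optimizer.zero_grad()

print(accumulated, step_matches)
```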


In [None]:
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    # A batch is a set of images and labels processed together. Instead of
    # computing the loss and gradient for each image individually, we compute
    # them for the whole batch at once, which speeds up training on hardware
    # that can process many samples in parallel (such as GPUs).
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)  # move the batch to the same device as the model

        # Compute prediction error for the batch
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation (covered in more detail later in the course; aim for the general intuition here)
        loss.backward()  # compute the gradient of the loss with respect to all model parameters
        optimizer.step()  # update the model weights to reduce the loss
        # Zero the gradients after each iteration so they are computed afresh next time.
        # Otherwise the gradients accumulate across batches and the model will not learn properly.
        optimizer.zero_grad()

        if batch % 100 == 0:  # print the loss only every 100 batches to avoid cluttering the output
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

We also check the model's performance against the test dataset to
confirm that it is learning and generalizing rather than overfitting the training data.
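
The accuracy computation used for evaluation boils down to comparing the index of the largest logit with the true label. A small sketch with made-up logits for four samples and three classes:

```python
import torch

# Made-up logits: 4 samples, 3 classes.
pred = torch.tensor([[0.1, 2.0, 0.3],
                     [1.5, 0.2, 0.1],
                     [0.0, 0.1, 0.9],
                     [0.3, 0.2, 0.1]])
y = torch.tensor([1, 0, 1, 0])  # true class indices

# argmax(1) picks the predicted class per row; the comparison yields one boolean per sample.
correct = (pred.argmax(1) == y).type(torch.float).sum().item()
accuracy = correct / len(y)
print(correct, accuracy)  # 3.0 0.75
```

The third sample is predicted as class 2 but labelled 1, so three of the four predictions are correct.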


In [None]:
def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

The training process is conducted over several iterations (_epochs_).
During each epoch, the model learns parameters to make better
predictions. We print the model's accuracy and loss at each epoch;
we'd like to see the accuracy increase and the loss decrease with every
epoch.


In [None]:
epochs = 5
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train(train_dataloader, model, loss_fn, optimizer)
    test(test_dataloader, model, loss_fn)
print("Done!")

---


# Saving Models

A common way to save a model is to serialize the internal state
dictionary (containing the model parameters).


In [None]:
torch.save(model.state_dict(), "model.pth")
print("Saved PyTorch Model State to model.pth")

# Loading Models

The process for loading a model includes re-creating the model structure
and loading the state dictionary into it.
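
The save and load round trip can be verified on a tiny model. The sketch below uses a single `nn.Linear` layer and a temporary directory so that it leaves no files behind; the file name `tiny.pth` is arbitrary.

```python
import os
import tempfile

import torch
from torch import nn

torch.manual_seed(0)
original = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.pth")
    torch.save(original.state_dict(), path)

    # Re-create the same architecture, then load the saved parameters into it.
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(path, weights_only=True))

x = torch.randn(3, 4)
same_output = torch.equal(original(x), restored(x))
print(same_output)  # True
```

The state dictionary stores only parameter tensors, not the class definition, which is why the model structure has to be re-created before loading.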


In [None]:
model = NeuralNetwork().to(device)
model.load_state_dict(torch.load("model.pth", weights_only=True))

This model can now be used to make predictions.


In [None]:
classes = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
]

model.eval()  # switch to evaluation mode; do this whenever the model is used for inference rather than training
x, y = test_data[0][0], test_data[0][1]
with torch.no_grad():
    x = x.to(device)
    pred = model(x)
    predicted, actual = classes[pred[0].argmax(0)], classes[y]
    print(f'Predicted: "{predicted}", Actual: "{actual}"')

# MLPs for Regression Example

In this section, we'll explore how MLPs (Multi-Layer Perceptrons) can be used for regression tasks. Unlike the classification example above where we predicted discrete categories, regression predicts continuous values. This concept is fundamental to understanding how Language Models work and how we can adapt them for different tasks.

MLPs for regression follow the same basic architecture as those for classification, with a few key differences:
1. The output layer typically has a single neuron (for single-value prediction)
2. There's no activation function in the final layer (to allow for unbounded continuous output)
3. We use a different loss function, like Mean Squared Error (MSE) instead of Cross Entropy

This flexibility in the output layer design is especially relevant when we later explore Language Models, where we can replace or adapt the "head" of the model for different downstream tasks while keeping the transformer backbone intact.
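
For reference, Mean Squared Error is the average of the squared differences between predictions and targets. The sketch below checks `F.mse_loss` against that definition on made-up values.

```python
import torch
import torch.nn.functional as F

predictions = torch.tensor([2.5, 0.0, 2.0])
targets = torch.tensor([3.0, -0.5, 2.0])

mse = F.mse_loss(predictions, targets)
manual = ((predictions - targets) ** 2).mean()  # (0.25 + 0.25 + 0.0) / 3

print(mse.item(), manual.item())
```

Predictions and targets should have the same shape. Comparing an `[N, 1]` output against an `[N]` target broadcasts to `[N, N]` and gives a misleading loss, so a regression model's output is typically flattened first.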

## Dataset

In [None]:
import torch

X_train = torch.tensor(
    [258.0, 270.0, 294.0, 320.0, 342.0, 368.0, 396.0, 446.0, 480.0, 586.0]
).view(-1, 1)

y_train = torch.tensor(
    [236.4, 234.4, 252.8, 298.6, 314.2, 342.2, 360.8, 368.0, 391.2, 390.8]
)

In [None]:
import matplotlib.pyplot as plt

plt.scatter(X_train, y_train)
plt.xlabel("Feature variable")
plt.ylabel("Target variable")
plt.show()

## Multilayer Perceptron
- No architecture changes besides the output unit

When adapting neural networks for regression, the key change is in the output layer. Notice how the PyTorchMLP class below uses a single output unit with no activation function, allowing it to predict continuous values instead of class probabilities.

This pattern of adapting the "head" (final layers) of a neural network while keeping the main architecture intact is essential in Language Model fine-tuning, where we often:
1. Keep the pre-trained transformer backbone frozen (containing the learned knowledge)
2. Replace or modify only the output layers for specific tasks (classification, regression, etc.)
3. Train only these new layers, or fine-tune the entire model with a small learning rate
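
The `PyTorchMLP` definition itself does not appear in this section. A minimal sketch consistent with how the class is used later (`PyTorchMLP(num_features=1)`, with outputs compared against one-dimensional targets) might look like this. The hidden layer sizes (50 and 25) and the attribute name `all_layers` are assumptions.

```python
import torch


class PyTorchMLP(torch.nn.Module):
    def __init__(self, num_features):
        super().__init__()
        self.all_layers = torch.nn.Sequential(
            # 1st hidden layer (size is an assumption)
            torch.nn.Linear(num_features, 50),
            torch.nn.ReLU(),
            # 2nd hidden layer (size is an assumption)
            torch.nn.Linear(50, 25),
            torch.nn.ReLU(),
            # output layer: a single unit with no activation
            torch.nn.Linear(25, 1),
        )

    def forward(self, x):
        # Flatten [N, 1] to [N] so predictions match the shape of the targets.
        return self.all_layers(x).flatten()


demo_model = PyTorchMLP(num_features=1)
out = demo_model(torch.randn(10, 1))
print(out.shape)  # torch.Size([10])
```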

### Normalize data

Normalization is a critical step for both regression tasks and language modeling. In regression, we normalize inputs and targets to improve convergence and performance. Similarly, in language models:

1. Input text tokens are embedded and normalized within the model
2. Layer normalization is used throughout transformer architectures
3. Output logits may be scaled or normalized depending on the task

The normalization process below converts our data to have zero mean and unit variance, making the training process more stable and efficient. This same concept applies when working with language model outputs, where proper scaling helps maintain numerical stability.
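
As a quick check of what standardization does, the sketch below normalizes the feature values from the dataset above and then inverts the transformation. The variable names are local to this example.

```python
import torch

x = torch.tensor(
    [258.0, 270.0, 294.0, 320.0, 342.0, 368.0, 396.0, 446.0, 480.0, 586.0]
)
mean, std = x.mean(), x.std()

x_norm = (x - mean) / std    # zero mean, unit standard deviation
x_back = x_norm * std + mean  # inverse transform recovers the original scale

print(x_norm.mean().item(), x_norm.std().item())
```

The same inverse transform, applied with the target statistics, is what turns normalized model outputs back into the original units.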

In [None]:
x_mean, x_std = X_train.mean(), X_train.std()
y_mean, y_std = y_train.mean(), y_train.std()

X_train_norm = (X_train - x_mean) / x_std
y_train_norm = (y_train - y_mean) / y_std

### Set up DataLoader

In [None]:
from torch.utils.data import DataLoader, Dataset


class MyDataset(Dataset):
    def __init__(self, X, y):

        self.features = X
        self.targets = y

    def __getitem__(self, index):
        x = self.features[index]
        y = self.targets[index]
        return x, y

    def __len__(self):
        return self.targets.shape[0]


train_ds = MyDataset(X_train_norm, y_train_norm)

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=20,
    shuffle=True,
)

### Train Model

The training loop for regression models follows the same pattern used for classification and language models:

1. Forward pass: Generate predictions from inputs
2. Calculate loss: Measure the error between predictions and targets (using MSE for regression)
3. Backward pass: Compute gradients with respect to model parameters
4. Update weights: Adjust parameters to minimize the loss

When fine-tuning language models, we follow this exact same process, with these key differences:
- We typically use a language modeling loss (like cross-entropy over token predictions)
- We might freeze certain layers to preserve pretrained knowledge
- We often use specialized optimizers like AdamW with weight decay and learning rate schedulers

The principle remains consistent: iteratively update model parameters to minimize prediction error. This fundamental approach works across all neural network architectures, from simple MLPs to complex transformer-based language models.
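
The fine-tuning variations listed above (freezing layers, using AdamW) fit into the same loop. Here is a minimal sketch of one training step with a frozen backbone. The tiny `backbone` and `head` modules are stand-ins for a pretrained network and a new output layer, and the data and hyperparameters are made up.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU())  # stand-in for a pretrained network
head = nn.Linear(16, 1)                                # new task-specific output layer

# Freeze the backbone so that no gradients are computed for its parameters.
for param in backbone.parameters():
    param.requires_grad = False

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3, weight_decay=0.01)

features, targets = torch.randn(4, 8), torch.randn(4)
backbone_before = backbone[0].weight.detach().clone()
head_before = head.weight.detach().clone()

# Forward pass, loss, backward pass, update: the same four steps as before.
predictions = head(backbone(features)).flatten()
loss = F.mse_loss(predictions, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(torch.equal(backbone[0].weight, backbone_before))  # backbone is untouched
print(torch.equal(head.weight, head_before))             # head has been updated
```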

In [None]:
import torch.nn.functional as F

torch.manual_seed(1)
model = PyTorchMLP(num_features=1)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

num_epochs = 30

loss_list = []  # per-batch training loss, recorded for later inspection
for epoch in range(num_epochs):

    model = model.train()
    for batch_idx, (features, targets) in enumerate(train_loader):

        logits = model(features)
        loss = F.mse_loss(logits, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if not batch_idx % 250:
            ### LOGGING
            print(
                f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
                f" | Batch {batch_idx:03d}/{len(train_loader):03d}"
                f" | Train Loss: {loss:.2f}"
            )
        loss_list.append(loss.item())

### Normalize and Generate Predictions

A key aspect of working with both regression models and language models is handling data at inference time. For regression, we need to:

1. Normalize new inputs using the same statistics from training
2. Run the inference in evaluation mode (`model.eval()`)
3. Convert the normalized predictions back to the original scale

For language models, a similar process applies when generating text or making predictions:
1. We transform input text to token IDs
2. We run the model in evaluation mode, often with `torch.no_grad()` for efficiency
3. We may use sampling strategies like temperature scaling or top-k filtering on the output logits

This step of properly preparing inputs and processing outputs is crucial for applying models to real-world data. In later weeks, we'll see how this inference process works with language models when generating text or extracting embeddings for downstream tasks.
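
To give a flavour of the sampling strategies mentioned above, the sketch below applies temperature scaling and top-k filtering to a made-up logit vector over a five-token vocabulary. The values of `temperature` and `k` are arbitrary, and this is one common way to combine the two steps rather than a fixed recipe.

```python
import torch

torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])  # made-up scores for 5 tokens
temperature = 0.7
k = 2

# Temperature below 1 sharpens the distribution; above 1 flattens it.
scaled = logits / temperature

# Keep only the k largest logits; set the rest to -inf so softmax gives them zero probability.
top_values, top_indices = torch.topk(scaled, k)
filtered = torch.full_like(scaled, float("-inf"))
filtered[top_indices] = top_values

probs = torch.softmax(filtered, dim=-1)
next_token = torch.multinomial(probs, num_samples=1).item()  # sample one token ID

print(probs, next_token)
```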

In [None]:
model.eval()

X_range = torch.arange(150, 800, 0.1).view(-1, 1)
X_range_norm = (X_range - x_mean) / x_std

# predict
with torch.no_grad():
    y_mlp_norm = model(X_range_norm)

# The MLP returns normalized predictions, so
# undo the normalization before plotting
y_mlp = y_mlp_norm * y_std + y_mean

In [None]:
# plot results
plt.scatter(X_train, y_train, label="Training points")
plt.plot(X_range, y_mlp, color="C1", label="MLP fit", linestyle="-")


plt.xlabel("Feature variable")
plt.ylabel("Target variable")
plt.legend()
# plt.savefig("mlp.pdf")
plt.show()

## Connection to Language Model Heads

The regression example above demonstrates several concepts that directly transfer to working with language models:

1. **Adaptable output layers**: Just as we modified the MLP's output layer for regression, language models can be adapted with different "heads" for various tasks:
   - Text generation: Uses the standard language modeling head
   - Classification: Replaces the head with a classifier (like sentiment analysis)
   - Regression: Uses a head similar to our example (for tasks like text quality scoring)
   - Embedding extraction: Often removes the head entirely to get vector representations

2. **Transfer learning pattern**: The workflow follows a consistent pattern:
   - Start with a model, ideally a pre-trained one (such as a transformer; our MLP was trained from scratch)
   - Adapt the output layer for the specific task
   - Train on task-specific data, possibly freezing earlier layers
   - Apply the model to new data with proper pre/post-processing

3. **PyTorch patterns**: The same PyTorch patterns apply across all neural network types:
   - Building models by subclassing `nn.Module`
   - Using `DataLoader` for efficient batch processing
   - Following the forward-loss-backward-update training loop
   - Setting `model.eval()` for inference
   - Saving and loading model weights
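
The idea of swapping heads on a shared backbone can be sketched in a few lines. `BackboneWithHead` is a hypothetical toy class, and the small linear backbone stands in for a transformer; the layer sizes are arbitrary.

```python
import torch
from torch import nn


class BackboneWithHead(nn.Module):
    """A feature extractor followed by a replaceable output layer (hypothetical)."""

    def __init__(self, head):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
        self.head = head

    def forward(self, x):
        return self.head(self.backbone(x))


x = torch.randn(4, 8)

model = BackboneWithHead(nn.Linear(16, 3))  # classification head: 3 class logits
class_out = model(x)

model.head = nn.Linear(16, 1)               # regression head: 1 continuous value
reg_out = model(x)

model.head = nn.Identity()                  # no head: raw 16-dimensional features
embeddings = model(x)

print(class_out.shape, reg_out.shape, embeddings.shape)
```

The backbone weights are untouched by each swap; only the final mapping from features to outputs changes.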

In the coming weeks, we'll build on these foundations to work with transformer-based language models, where the principles remain the same but the architectures become more sophisticated.