# **Deep Learning**

The term **deep learning** refers to a subset of machine learning techniques that utilize artificial **neural networks** with multiple layers — hence "deep" — to model and understand complex patterns in data.

<div align="center">
    <img src="https://github.com/segusantos/deep-learning-notebook/blob/master/assets/ai-ml-dl.svg?raw=1" height="300" />
    <p><em>Deep learning (DL) is a subset of machine learning (ML), which is a subset of artificial intelligence (AI).</em></p>
</div>

Just like classical machine learning, deep learning algorithms fall into three main categories:
- **Supervised learning**: the model is trained on a labeled dataset, meaning that each input data point is paired with the correct output. The model learns to map inputs to outputs by minimizing the error between its predictions and the actual labels. Common applications include image classification, speech recognition, and natural language processing.
- **Unsupervised learning**: the model is trained on unlabeled data. The goal is to discover hidden patterns or structures within the data without predefined labels. Techniques such as autoencoders and generative adversarial networks (GANs) are often used for tasks like clustering, dimensionality reduction, and data generation.
- **Reinforcement learning**: models learn to make decisions by interacting with an environment. The model, often referred to as an agent, receives feedback in the form of rewards or penalties based on its actions. Deep reinforcement learning has been successfully applied in areas such as game playing (e.g., AlphaGo), robotics, and autonomous systems.

In particular, we'll motivate deep learning from an **image classification** perspective, which is a classic supervised learning task where the goal is to assign a label to an input image based on its content.

## **Handwriting Recognition**

Suppose we've been tasked with building a function that takes a $28 \times 28$ grayscale image of a handwritten digit ($0$ to $9$) as input and outputs the corresponding digit label.

<div align="center">
    <img src="https://github.com/segusantos/deep-learning-notebook/blob/master/assets/handwritten-digits.png?raw=1" height="300" />
    <p><em>Handwritten digits</em></p>
</div>

Let $x \in \mathbb{R}^{28 \times 28}$ be the input image, and $y \in \mathbb{Z}_{10}$ be the output label, where $\mathbb{Z}_{10} = \{0, 1, \ldots, 9\}$. Then, we want to write a function $f$ such that:

$$
f(x) = y.
$$

Similarly, from a programming perspective, we want to implement a function that takes an array as input and returns an integer as output:

```python
import numpy as np


def f(x: np.ndarray) -> int:
    digit = ...  # placeholder: the logic that recognizes the digit goes here
    return digit
```

As human programmers, we can easily recognize handwritten digits by looking at the images. However, developing an algorithm that can accurately perform this task is no trivial endeavor.

We consider the set of all functions $f$ that map grayscale images to digits, that is, matrices to scalars:

$$
f: \mathbb{R}^{28 \times 28} \to \mathbb{Z}_{10},
$$

so that the function $f^\ast$ we are looking for, that is, the one that correctly maps images to the digits they display, is guaranteed to belong to this set.

In order to make searching for $f^\ast$ tractable, we reframe our problem as a **deep supervised learning** task.

## **Neural Networks**

Since the space of all functions that map images to digits is uncountably infinite, the **deep supervised learning** paradigm restricts the search to a smaller subset of the functions $f: \mathbb{R}^{28 \times 28} \to \mathbb{Z}_{10}$, specified by a given **neural network** architecture. From now on, $f$ denotes a family of functions parameterized by a set of weights $\theta$, according to the chosen architecture:

$$
f(x; \theta) = y.
$$

<div align="center">
    <img src="https://github.com/segusantos/deep-learning-notebook/blob/master/assets/nn.png?raw=1" height="300" />
    <p><em>Feedforward Neural Network (FNN)</em></p>
</div>

Now that our neural net depends on both the input $x$ and the parameters $\theta$, we need a way to find the parameters $\theta^\ast$ such that $f(x; \theta^\ast)$ approximates $f^\ast(x)$ as closely as possible for all possible inputs $x$, that is, all possible $28 \times 28$ images of handwritten digits.

### **Dataset**

First, since optimizing $\theta$ over all possible inputs is infeasible, we rely on a finite **dataset** of $N$ labeled examples:

$$
\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N,
$$

where each $x_i$ is a $28 \times 28$ grayscale image of a handwritten digit, and $y_i$ is the corresponding label.

For our **handwriting recognition** task, we use the public [**MNIST dataset**](http://yann.lecun.com/exdb/mnist/).

<div align="center">
    <img src="https://github.com/segusantos/deep-learning-notebook/blob/master/assets/mnist-dataset.png?raw=1" height="300" />
    <p><em>MNIST dataset samples</em></p>
</div>

### **Loss Function**

Second, we need a measure of how well our function $f(x; \theta)$ approximates the true function $f^\ast(x)$ on our dataset $\mathcal{D}$. For that, we use a **loss function** that quantifies the difference between the predicted output $f(x; \theta)$ and the true label $y$:

$$
\mathcal{L}(f(x; \theta), y).
$$

Since we model our handwriting recognition task as a **regression** problem, predicting a continuous output that we round to the nearest integer to obtain the digit label, we use the squared error as the per-example loss. Averaged over the dataset, it yields the **mean squared error (MSE)**:

$$
\mathcal{L}(f(x; \theta), y) = [f(x; \theta) - y]^2.
$$

<div align="center">
    <img src="https://github.com/segusantos/deep-learning-notebook/blob/master/assets/loss.png?raw=1" height="300" />
    <p><em>Loss function illustration</em></p>
</div>
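As a small illustration of the squared error above, the sketch below evaluates it for a single made-up prediction and rounds that prediction to a digit. The helper names `squared_error` and `to_digit` are hypothetical, and clamping the rounded value to the range 0 to 9 is an added assumption rather than something the notebook specifies.

```python
def squared_error(prediction: float, label: int) -> float:
    """Squared difference between a continuous prediction and the true label."""
    return (prediction - label) ** 2


def to_digit(prediction: float) -> int:
    """Round a continuous prediction to the nearest integer, clamped to 0..9 (assumption)."""
    return min(9, max(0, round(prediction)))


prediction, label = 6.6, 7          # hypothetical network output and true label
loss = squared_error(prediction, label)
digit = to_digit(prediction)
print(f"loss={loss:.2f} digit={digit}")
```

A prediction close to the label produces a small loss, and the loss grows quadratically as the prediction moves away from it.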

## **Optimization**

Having defined our dataset $\mathcal{D}$ and loss function $\mathcal{L}$, we can now formulate the optimization problem that will allow us to find the optimal parameters $\theta^\ast$ for our neural network:

$$
\begin{align*}
\theta^\ast &= \arg \min_\theta \frac{1}{N} \sum_{i=1}^N \mathcal{L}(f(x_i; \theta), y_i) \\
&= \arg \min_\theta g(\theta).
\end{align*}
$$

Here, $g(\theta)$ denotes the average loss over the dataset. Therefore, given a fixed **network architecture** and **dataset**, our goal is to find the parameters $\theta$ that minimize this average loss.
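To make the objective concrete, here is a minimal sketch that evaluates $g(\theta)$ for a toy model. The linear function `f`, the three-point dataset, and the parameter values are all illustrative assumptions standing in for a real network and for MNIST.

```python
def f(x: float, theta: tuple[float, float]) -> float:
    """Toy stand-in for a network: a line with slope w and intercept b."""
    w, b = theta
    return w * x + b


def g(theta: tuple[float, float], dataset: list[tuple[float, float]]) -> float:
    """Average squared-error loss of f over the dataset."""
    return sum((f(x, theta) - y) ** 2 for x, y in dataset) / len(dataset)


dataset = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # points generated by y = 2x + 1

good = g((2.0, 1.0), dataset)   # parameters that fit every example
poor = g((0.0, 0.0), dataset)   # a poor guess gives a larger objective
print(good, poor)
```

Minimizing $g$ means searching for the `theta` that drives this average as low as possible.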

### **Differentiation**

Finding the **neural network** $f(x; \theta^\ast)$ that best approximates the true function $f^\ast(x)$ on our dataset therefore amounts to minimizing $g(\theta)$. In principle, candidate minima can be found by taking the **gradient** of $g$ with respect to the parameters $\theta$ and setting it to zero:

$$
\nabla_\theta g(\theta) = 0.
$$

However, since neural networks are **non-linear** functions with potentially millions of parameters, solving this system of equations analytically is infeasible. We must therefore resort to **numerical optimization** techniques to find an approximate solution.

#### **Gradient Descent**

One of the most common numerical optimization techniques used in deep learning is **gradient descent**. This iterative algorithm updates the parameters $\theta$ in the direction of the negative gradient of the loss function, scaled by a learning rate $\alpha$:

$$
\theta_{t+1} = \theta_t - \alpha \nabla_\theta g(\theta_t).
$$

Hence, we now need an algorithm to efficiently compute the gradient of the loss function with respect to the parameters of the neural network at each iteration $t$.

<div align="center">
    <img src="https://github.com/segusantos/deep-learning-notebook/blob/master/assets/gradient-descent.png?raw=1" height="300" />
    <p><em>Gradient descent</em></p>
</div>
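The update rule can be sketched on a one-dimensional problem where the answer is known in advance. The objective $g(\theta) = (\theta - 3)^2$, the starting point, the learning rate, and the number of steps are all assumptions chosen for illustration.

```python
def g(theta: float) -> float:
    """Toy objective with its minimum at theta = 3."""
    return (theta - 3.0) ** 2


def grad_g(theta: float) -> float:
    """Analytic gradient of the toy objective."""
    return 2.0 * (theta - 3.0)


def gradient_descent(theta0: float, alpha: float, steps: int) -> float:
    """Repeatedly apply theta <- theta - alpha * grad_g(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad_g(theta)
    return theta


theta = gradient_descent(theta0=0.0, alpha=0.1, steps=100)
print(f"theta={theta:.6f} g(theta)={g(theta):.2e}")
```

With this learning rate, each step shrinks the distance to the minimum by a constant factor of 0.8, so the iterates converge towards 3. A learning rate above 1.0 would make the same iteration diverge on this objective.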

#### **Backpropagation**

First, we need to understand why **symbolic** and **numerical differentiation** methods are not suitable for computing the gradient of the loss function in deep learning:
- **Symbolic differentiation**: while it provides exact gradients, it is impractical for deep neural networks because the resulting expressions grow extremely large, making symbolic manipulation slow and memory-intensive.
- **Numerical differentiation**: although it is straightforward to implement, it requires at least one extra evaluation of the loss per parameter, which is computationally expensive in high-dimensional parameter spaces, and it suffers from truncation and round-off errors.
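The cost of numerical differentiation can be seen in a short sketch based on central differences. The function `numerical_gradient`, the two-parameter objective, and the step size `h` are illustrative assumptions. The counter only exists to show how many times the objective is evaluated.

```python
def numerical_gradient(g, theta: list[float], h: float = 1e-5) -> list[float]:
    """Central-difference estimate of the gradient of g at theta."""
    grad = []
    for i in range(len(theta)):
        up, down = theta.copy(), theta.copy()
        up[i] += h
        down[i] -= h
        # Two evaluations of g for every single parameter.
        grad.append((g(up) - g(down)) / (2 * h))
    return grad


counter = {"calls": 0}


def g(theta: list[float]) -> float:
    """Toy objective: theta_0^2 + 3 * theta_0 * theta_1."""
    counter["calls"] += 1
    return theta[0] ** 2 + 3.0 * theta[0] * theta[1]


approx = numerical_gradient(g, [1.0, 2.0])
exact = [2.0 * 1.0 + 3.0 * 2.0, 3.0 * 1.0]  # analytic gradient at (1, 2)
print(approx, exact, counter["calls"])
```

Two parameters already require four evaluations of the objective. For a network with tens of thousands of parameters, each gradient would need tens of thousands of full passes over the data.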

Instead, if we leverage the fact that neural networks are composed of a sequence of **layers**, which are differentiable operations, we can use the **backpropagation** algorithm to efficiently compute the gradient of the loss function with respect to the network parameters.

Backpropagation is an instance of reverse-mode **automatic differentiation**, which records the operations performed in the neural network in a **computational graph** and computes exact gradients efficiently by systematically applying the **chain rule** of calculus to them.

<div align="center">
    <img src="https://github.com/segusantos/deep-learning-notebook/blob/master/assets/computational-graph.png?raw=1" height="300" />
    <p><em>Computational graph</em></p>
</div>

In particular, given that **neural nets** are compositions of elementary functions such as sums, products and non-linear activations, we can define the **forward** and **backward** passes of these operations and let the **backpropagation** algorithm compute the gradients automatically by composing the local gradients using the **chain rule** for any arbitrary neural network architecture.
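Before automating anything, it may help to see the forward and backward passes written out by hand for a single neuron with loss $L = (\mathrm{ReLU}(w x + b) - y)^2$. The numeric values below are made up for illustration. Each backward line multiplies the upstream gradient by one local derivative, which is the chain rule applied step by step.

```python
w, x, b, y = 2.0, 3.0, -1.0, 4.0   # hypothetical weight, input, bias and label

# Forward pass: compute and keep every intermediate value.
z = w * x + b          # pre-activation
a = max(0.0, z)        # ReLU
d = a - y              # error
loss = d * d           # squared error

# Backward pass: walk the same operations in reverse order.
dloss_dd = 2.0 * d                                 # d(d^2)/dd
dloss_da = dloss_dd * 1.0                          # d(a - y)/da = 1
dloss_dz = dloss_da * (1.0 if z > 0 else 0.0)      # ReLU passes gradient only if z > 0
dloss_dw = dloss_dz * x                            # d(w*x + b)/dw = x
dloss_db = dloss_dz * 1.0                          # d(w*x + b)/db = 1

print(loss, dloss_dw, dloss_db)
```

An autodiff engine performs the same bookkeeping, except that the list of operations and their local derivatives is recorded automatically while the forward pass runs.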

In order to illustrate this point, in what follows we manually implement a simple **autodiff engine** with no dependencies and train a simple neural network for the **handwriting recognition** task. In this [notebook's GitHub repo](https://github.com/segusantos/deep-learning-notebook) you can find a more complete implementation with additional primitives and tests that validate its correctness against [PyTorch](https://pytorch.org/).

## **Autodiff Engine**

For simplicity, instead of defining a `Tensor` class as in PyTorch, we define a `Scalar` class that represents a scalar value and its gradient. This class will support basic arithmetic operations and will keep track of the computational graph to enable backpropagation. The `_backward` method refers to the local gradient computation for each operation, whereas `_prev` stores references to the parent nodes in the computational graph.

In [None]:
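from __future__ import annotations

from typing import Callable


class Scalar:
    def __init__(self,
                 data: int | float,
                 _backward: Callable[[], None] = lambda: None,
                 _prev: list[Scalar] | None = None) -> None:
        self.data: float = float(data)
        self.grad: float = 0.0
        # Closure that adds this node's gradient contribution to its parents.
        self._backward: Callable[[], None] = _backward
        # Parent nodes in the graph; a fresh list avoids a shared mutable default.
        self._prev: list[Scalar] = _prev if _prev is not None else []

    def backward(self) -> None:
        # Depth-first search that lists every node after all of its parents.
        topological_order = []
        visited = set()
        def dfs(scalar: Scalar) -> None:
            if scalar not in visited:
                visited.add(scalar)
                for prev in scalar._prev:
                    dfs(prev)
                topological_order.append(scalar)
        dfs(self)
        # Seed with d(self)/d(self) = 1, then visit nodes from the output back to the inputs.
        self.grad = 1.0
        for scalar in reversed(topological_order):
            scalar._backward()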

    def __repr__(self) -> str:
        return f"Scalar(data={self.data}, grad={self.grad})"
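The ordering used by `backward` can be examined in isolation. The sketch below runs the same depth-first search on a tiny graph described with plain strings. The graph, for the expression `loss = (a * b) + a`, and the helper name `topological_order` are assumptions made for illustration.

```python
def topological_order(root: str, parents: dict[str, list[str]]) -> list[str]:
    """List every node reachable from root after all of its parents."""
    order: list[str] = []
    visited: set[str] = set()

    def dfs(node: str) -> None:
        if node not in visited:
            visited.add(node)
            for parent in parents.get(node, []):
                dfs(parent)
            order.append(node)

    dfs(root)
    return order


# loss = (a * b) + a, so "a" feeds two different operations.
parents = {"loss": ["prod", "a"], "prod": ["a", "b"]}

order = topological_order("loss", parents)
backward_order = list(reversed(order))
print(order)
print(backward_order)
```

In the reversed order, a node is always processed before its parents. By the time a node's local backward step runs, every operation that consumed it has already contributed to its gradient.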

Now, we define a base `Function` abstract class that will serve as a blueprint for all operations in our autodiff engine. Each operation will inherit from this class and implement the `forward` and `backward` methods.

In [None]:
from abc import ABC, abstractmethod


class Function(ABC):
    @staticmethod
    @abstractmethod
    def forward(*inputs: Scalar) -> float:
        pass

    @staticmethod
    @abstractmethod
    def backward(*inputs: Scalar, output: Scalar) -> None:
        pass

    @classmethod
    def apply(cls, *args: int | float | Scalar) -> Scalar:
        inputs = [
            arg if isinstance(arg, Scalar) else Scalar(arg)
            for arg in args
        ]
        output = Scalar(
            cls.forward(*inputs),
            _backward=lambda: cls.backward(*inputs, output),
            _prev=inputs
        )
        return output

### **Addition**

First, we implement the addition operation as a subclass of `Function`, where the sum of two `Scalar` objects is computed in the `forward` method, and the gradients are propagated in the `backward` method.

$$
\begin{align*}
\text{forward}(a, b) &= a + b, \\
\text{backward}() &: \frac{\partial \text{output}}{\partial a} = 1, \quad \frac{\partial \text{output}}{\partial b} = 1.
\end{align*}
$$

In [None]:
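class Add(Function):
    @staticmethod
    def forward(a: Scalar, b: Scalar) -> float:
        return a.data + b.data

    @staticmethod
    def backward(a: Scalar, b: Scalar, output: Scalar) -> None:
        # Both local derivatives are 1, so the upstream gradient passes through unchanged.
        a.grad += 1.0 * output.grad
        b.grad += 1.0 * output.grad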


Scalar.__add__ = lambda self, other: Add.apply(self, other)
Scalar.__radd__ = lambda self, other: Add.apply(other, self)

### **Multiplication**

Similarly, we implement the multiplication operation and define its `forward` and `backward` methods.

$$
\begin{align*}
\text{forward}(a, b) &= a \times b, \\
\text{backward}() &: \frac{\partial \text{output}}{\partial a} = b, \quad \frac{\partial \text{output}}{\partial b} = a.
\end{align*}
$$

In [None]:
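class Mul(Function):
    @staticmethod
    def forward(a: Scalar, b: Scalar) -> float:
        return a.data * b.data

    @staticmethod
    def backward(a: Scalar, b: Scalar, output: Scalar) -> None:
        # Each input receives the upstream gradient scaled by the other input's value.
        a.grad += b.data * output.grad
        b.grad += a.data * output.grad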


Scalar.__mul__ = lambda self, other: Mul.apply(self, other)
Scalar.__rmul__ = lambda self, other: Mul.apply(other, self)
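One detail in the `backward` methods above is easy to miss: gradients are accumulated with `+=` rather than assigned. The sketch below illustrates why, using plain dictionaries instead of `Scalar` objects. The helper `mul_backward` and the value `a = 3` are hypothetical.

```python
def mul_backward(grads: dict[str, float], values: dict[str, float],
                 a: str, b: str, upstream: float) -> None:
    """Local backward step of a product, accumulating into both inputs."""
    grads[a] += values[b] * upstream
    grads[b] += values[a] * upstream


values = {"a": 3.0}
grads = {"a": 0.0}

# y = a * a: both inputs of the product are the same variable.
mul_backward(grads, values, "a", "a", upstream=1.0)
print(grads["a"])  # dy/da = 2a
```

Because both input slots refer to the same variable, the two contributions add up to $2a$. Plain assignment would keep only the last contribution and report $a$ instead.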

### **ReLU**

Finally, in order to implement a neural network, we need a non-linear activation function. We choose the **ReLU (Rectified Linear Unit)** activation function, which outputs the input directly if it is positive; otherwise, it outputs zero.

$$
\begin{align*}
\text{forward}(a) &= \max(0, a), \\
\text{backward}() &: \frac{\partial \text{output}}{\partial a} =
\begin{cases}
1 & \text{if } a > 0, \\
0 & \text{otherwise}.
\end{cases}
\end{align*}
$$

In [None]:
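class ReLU(Function):
    @staticmethod
    def forward(a: Scalar) -> float:
        return max(0.0, a.data)

    @staticmethod
    def backward(a: Scalar, output: Scalar) -> None:
        # The local derivative is 1 for positive inputs and 0 otherwise.
        a.grad += (1.0 if a.data > 0 else 0.0) * output.grad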


Scalar.relu = lambda self: ReLU.apply(self)

### **Neural Network**

Now that we have implemented our basic autodiff engine with addition, multiplication, and ReLU operations, we can proceed to build a simple **neural network** for the handwriting recognition task using these components.

In [None]:
class Module:
    def parameters(self) -> list[Scalar]:
        return []

    def zero_grad(self) -> None:
        for param in self.parameters():
            param.grad = 0.0

In [None]:
import random


class Neuron(Module):
    def __init__(self, nin: int, nonlin: bool = True) -> None:
        self.w = [Scalar(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Scalar(0.0)
        self.nonlin = nonlin

    def __call__(self, x: list[Scalar]) -> Scalar:
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return act.relu() if self.nonlin else act

    def parameters(self) -> list[Scalar]:
        return self.w + [self.b]

In [None]:
class Layer(Module):
    def __init__(self, nin: int, nout: int, **kwargs) -> None:
        self.neurons = [Neuron(nin, **kwargs) for _ in range(nout)]

    def __call__(self, x: list[Scalar]) -> list[Scalar]:
        return [neuron(x) for neuron in self.neurons]

    def parameters(self) -> list[Scalar]:
        params = []
        for neuron in self.neurons:
            params.extend(neuron.parameters())
        return params

In [None]:
class MLP(Module):
    def __init__(self, nin: int, nouts: list[int]) -> None:
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i+1], nonlin=i != len(nouts) - 1) for i in range(len(nouts))]

    def __call__(self, x: list[Scalar]) -> list[Scalar]:
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self) -> list[Scalar]:
        params = []
        for layer in self.layers:
            params.extend(layer.parameters())
        return params

## **Training**

Now that we have defined our `MLP` class, we can train a simple neural network on the MNIST dataset using our autodiff engine.

In [None]:
from pathlib import Path
from tqdm import tqdm

import torch
from torch import nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets
from torchvision.transforms import ToTensor

from lightning.pytorch.callbacks import EarlyStopping
from lightning.pytorch import Trainer
from lightning.pytorch import LightningModule
from lightning.pytorch import Callback


seed = 42
random.seed(seed)
torch.manual_seed(seed)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)



cuda


In [None]:
training_data = datasets.MNIST(
    root="data",
    train=True,
    download=True,
)
test_data = datasets.MNIST(
    root="data",
    train=False,
    download=True,
)

### **Manual Training Loop**

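We train the handmade autodiff MLP on a small MNIST subset to keep the example lightweight while still demonstrating gradient-based updates. Unlike the single rounded output described earlier, the network here has ten outputs, one per digit: the loss is the sum of squared errors against a one-hot encoding of the label, and the predicted digit is the index of the largest output.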
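A rough parameter count helps explain why only a small subset is used. The sketch below counts the weights and biases of a fully connected network with layer sizes 784, 64, 32 and 10, which are the sizes passed to `MLP` in the training cell. The helper `count_parameters` is hypothetical.

```python
def count_parameters(nin: int, nouts: list[int]) -> int:
    """Weights plus biases of a fully connected network with the given layer sizes."""
    sizes = [nin] + nouts
    return sum(sizes[i] * sizes[i + 1] + sizes[i + 1] for i in range(len(nouts)))


n_params = count_parameters(28 * 28, [64, 32, 10])
print(n_params)  # 52650
```

In the scalar engine, every one of these parameters is a separate Python object. Each weight also creates a multiplication node and an addition node on every forward pass, so processing a single image already builds a graph with more than a hundred thousand nodes.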

In [None]:
def _vectorize_image(image):
    tensor = ToTensor()(image).view(-1)
    return [Scalar(float(value)) for value in tensor]

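def _mse_loss(outputs: list[Scalar], label: int) -> Scalar:
    # One-hot target: 1.0 at the index of the true digit, 0.0 elsewhere.
    target = [Scalar(1.0 if idx == label else 0.0) for idx in range(len(outputs))]
    # Subtraction is written as output + (-1 * t) because Scalar only defines + and *.
    diffs = [output + (Scalar(-1.0) * t) for output, t in zip(outputs, target)]
    squared_errors = [diff * diff for diff in diffs]
    # Sum (rather than average) of the squared errors over the outputs.
    return sum(squared_errors, Scalar(0.0))

def _manual_step(model: MLP, image, label: int, learning_rate: float) -> tuple[float, int]:
    # Reset the gradients accumulated during the previous step.
    model.zero_grad()
    inputs = _vectorize_image(image)
    outputs = model(inputs)
    scores = [output.data for output in outputs]
    loss = _mse_loss(outputs, label)
    # Backpropagate, then take one gradient descent step on every parameter.
    loss.backward()
    for param in model.parameters():
        param.data -= learning_rate * param.grad
    # Predicted digit: index of the largest score computed before the update.
    prediction = max(range(len(scores)), key=scores.__getitem__)
    return loss.data, prediction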

def _manual_predict(model: MLP, image) -> int:
    outputs = model(_vectorize_image(image))
    scores = [output.data for output in outputs]
    return max(range(len(scores)), key=scores.__getitem__)
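The same loss and prediction logic can be followed with plain floats, without the `Scalar` machinery. The sketch below uses four classes instead of ten for brevity. The helper names and the output values are assumptions for illustration.

```python
def one_hot(label: int, num_classes: int) -> list[float]:
    """Vector with 1.0 at the label index and 0.0 elsewhere."""
    return [1.0 if idx == label else 0.0 for idx in range(num_classes)]


def sum_squared_error(outputs: list[float], label: int) -> float:
    """Sum of squared differences between the outputs and the one-hot target."""
    target = one_hot(label, len(outputs))
    return sum((o - t) ** 2 for o, t in zip(outputs, target))


def argmax(scores: list[float]) -> int:
    """Index of the largest score, used as the predicted class."""
    return max(range(len(scores)), key=scores.__getitem__)


outputs = [0.1, 0.0, 0.7, 0.2]   # hypothetical network outputs for 4 classes
label = 2

loss = sum_squared_error(outputs, label)
prediction = argmax(outputs)
print(f"loss={loss:.2f} prediction={prediction}")
```

The loss is zero only when the output for the true class is exactly 1 and every other output is exactly 0. The prediction only depends on which output is largest.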

In [30]:
manual_model = MLP(28 * 28, [64, 32, 10])
learning_rate = 0.05
epochs = 5
subset_size = 512
train_subset = [training_data[i] for i in range(subset_size)]

for epoch in range(epochs):
    epoch_loss = 0.0
    correct = 0
    for image, label in train_subset:
        loss_value, prediction = _manual_step(manual_model, image, label, learning_rate)
        epoch_loss += loss_value
        correct += int(prediction == label)
    avg_loss = epoch_loss / len(train_subset)
    accuracy = correct / len(train_subset)
    print(f"epoch {epoch + 1}: loss={avg_loss:.4f} accuracy={accuracy:.3f}")

test_subset = [test_data[i] for i in range(256)]
test_correct = 0
for image, label in test_subset:
    prediction = _manual_predict(manual_model, image)
    test_correct += int(prediction == label)
test_accuracy = test_correct / len(test_subset)
print(f"manual model test accuracy on {len(test_subset)} samples: {test_accuracy:.3f}")


### **PyTorch Lightning Training Loop**

For comparison, we train a small neural network with PyTorch Lightning using dataloaders on manageable MNIST subsets.

In [None]:
transform = ToTensor()
train_dataset_pt = datasets.MNIST(root="data", train=True, download=False, transform=transform)
val_dataset_pt = datasets.MNIST(root="data", train=False, download=False, transform=transform)
train_subset_indices = list(range(4096))
val_subset_indices = list(range(1024))
train_loader = DataLoader(Subset(train_dataset_pt, train_subset_indices), batch_size=64, shuffle=True)
val_loader = DataLoader(Subset(val_dataset_pt, val_subset_indices), batch_size=256)

In [None]:
class LitMNIST(LightningModule):
    def __init__(self, lr: float = 1e-2) -> None:
        super().__init__()
        self.save_hyperparameters()
        self.model = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
        )
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.model(x)

    def training_step(self, batch, batch_idx: int):
        x, y = batch
        logits = self(x)
        loss = self.loss_fn(logits, y)
        preds = logits.argmax(dim=1)
        acc = (preds == y).float().mean()
        self.log("train_loss", loss, prog_bar=True, on_step=False, on_epoch=True)
        self.log("train_acc", acc, prog_bar=True, on_step=False, on_epoch=True)
        return loss

    def validation_step(self, batch, batch_idx: int) -> None:
        x, y = batch
        logits = self(x)
        loss = self.loss_fn(logits, y)
        preds = logits.argmax(dim=1)
        acc = (preds == y).float().mean()
        self.log("val_loss", loss, prog_bar=True, on_step=False, on_epoch=True)
        self.log("val_acc", acc, prog_bar=True, on_step=False, on_epoch=True)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.hparams.lr)
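Note that the Lightning model is trained with `nn.CrossEntropyLoss` rather than the squared error used by the handmade network. As a rough plain-Python sketch of what that loss computes for a single example (a log-softmax over the raw scores followed by the negative log-likelihood of the true class), consider the following. The helper `cross_entropy` is hypothetical and is not PyTorch's implementation, which also averages over the batch by default.

```python
import math


def cross_entropy(logits: list[float], label: int) -> float:
    """Negative log of the softmax probability assigned to the true class."""
    # Subtracting the maximum keeps the exponentials numerically stable.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


uniform = cross_entropy([0.0] * 10, 3)          # no preference among 10 classes
confident = cross_entropy([0.0, 5.0, 0.0], 1)   # high score on the true class
wrong = cross_entropy([0.0, 5.0, 0.0], 0)       # high score on another class
print(uniform, confident, wrong)
```

With equal scores over ten classes the loss equals $\log 10$. The loss shrinks as the score of the true class grows relative to the others and increases when a wrong class receives the highest score.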

In [None]:
lit_model = LitMNIST(lr=1e-2)
early_stopping = EarlyStopping(monitor="val_loss", patience=2, mode="min")
trainer = Trainer(
    max_epochs=3,
    accelerator="auto",
    devices=1,
    logger=False,
    enable_checkpointing=False,
    callbacks=[early_stopping],
)
trainer.fit(lit_model, train_loader, val_loader)