# 3 | Deep Learning with PyTorch



## Tensors

### The Creation of Tensors


In [2]:
import torch as th
import numpy as np

a = th.FloatTensor(3, 2)
a

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])

The script imports PyTorch and NumPy and creates a new float tensor of size 3×2.
PyTorch now initialises memory with zeros, a different behaviour from previous versions.
Originally it just allocated memory and kept it uninitialised: faster but less safe.
This behaviour might change again in the future, so it is good practice to always initialise tensors explicitly:

In [8]:
th.zeros(3, 4)

tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])

Alternatively, call the in-place tensor modification method:

In [7]:
th.zero_(a)

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])

Notice that there are two types of operation for tensors: **inplace** and **functional**.
Inplace operations have an underscore appended to their name, operate on the tensor's content, and then return the modified tensor itself.
The functional equivalent creates a copy of the tensor's content, applies the operation and returns the copy, leaving the original tensor untouched.
Inplace operations are usually faster and use less memory, but they are less safe: the same tensor may be shared in different places where changes are not expected.
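
The difference between the two kinds of operation can be seen in a minimal sketch. The tensor values and the variable names (`alias`, `b`, `c`) are illustrative assumptions, and `abs`/`abs_` is just one functional/inplace pair among many:

```python
import torch as th

a = th.tensor([1.0, -2.0, 3.0])
alias = a                      # a second name for the same tensor object

# Functional: returns a new tensor, `a` keeps its original values.
b = a.abs()
a_after_functional = a.clone()

# Inplace (trailing underscore): modifies `a` and returns `a` itself.
c = a.abs_()

print(b is a)                  # False: the functional call produced a copy
print(c is a)                  # True: the inplace call returned the same object
print(alias)                   # the alias sees the inplace change as well
```

The last line shows why inplace operations need care: any other reference to the same tensor observes the modification.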

Another way to create a tensor is to pass a Python iterable (e.g. a list or tuple) to the tensor constructor:

In [9]:
th.FloatTensor([[1,2,3],[3,2,1]])

tensor([[1., 2., 3.],
        [3., 2., 1.]])

A tensor can also be created from a NumPy array. First create the array:

In [10]:
n = np.zeros(shape=(3, 2))
n

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

Then pass the NumPy array to `th.tensor`:

In [12]:
n = np.zeros(shape=(3, 2))
b = th.tensor(n)
b


tensor([[0., 0.],
        [0., 0.],
        [0., 0.]], dtype=torch.float64)
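
The resulting tensor is `float64` because NumPy arrays default to 64-bit floats, whereas most deep learning code works with 32-bit floats. The following sketch shows two options under that assumption: requesting a dtype explicitly, and using `th.from_numpy`, which shares memory with the array instead of copying it. The variable names are illustrative.

```python
import numpy as np
import torch as th

n = np.zeros(shape=(3, 2))              # NumPy defaults to float64

b64 = th.tensor(n)                      # copies the data, keeps float64
b32 = th.tensor(n, dtype=th.float32)    # copies the data, converts to float32
shared = th.from_numpy(n)               # no copy: shares memory with `n`

n[0, 0] = 1.0                           # modify the NumPy array afterwards

print(b64.dtype, b32.dtype)
print(b64[0, 0].item())                 # the copy is unaffected
print(shared[0, 0].item())              # the shared tensor sees the change
```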

Computing a gradient in deep learning means calculating how much a model’s output (typically a loss) changes with respect to each model parameter (weights and biases). Formally, it’s the vector of partial derivatives of the loss with respect to the parameters. These gradients tell us the direction and magnitude to adjust parameters to reduce the loss during training. In practice, frameworks use automatic differentiation (backpropagation) to compute these efficiently.

Key ideas:
- **Loss function**: A scalar measuring model error (e.g., cross-entropy, MSE).
- **Parameters**: Trainable tensors (weights/biases) the model learns.
- **Gradient**: For each parameter θ, compute ∂Loss/∂θ.
- **Backpropagation**: Apply the chain rule from outputs back to inputs/parameters to get all gradients.
- **Optimiser step**: Update parameters, e.g., θ ← θ − η ∂Loss/∂θ with learning rate η.

Below are concise Python examples showing how gradients are computed and used.

PyTorch: automatic differentiation and a simple training step
```python
import torch
import torch.nn as nn
import torch.optim as optim

# Simple linear model: y_hat = xW + b
torch.manual_seed(0)
model = nn.Linear(in_features=3, out_features=1)  # parameters: weight (1x3), bias (1)
criterion = nn.MSELoss()
optimiser = optim.SGD(model.parameters(), lr=0.1)

# Dummy data
x = torch.randn(5, 3)          # 5 samples, 3 features
y = torch.randn(5, 1)          # targets

# Forward pass: compute predictions and loss (computational graph is built)
y_hat = model(x)
loss = criterion(y_hat, y)

print("Loss (forward):", loss.item())

# Backward pass: compute gradients dLoss/dParam for all Params in the graph
optimiser.zero_grad()          # clear previous gradients
loss.backward()                # backprop: populates .grad fields
for name, param in model.named_parameters():
    print(name, "grad shape:", param.grad.shape)

# Parameter update: gradient descent step
optimiser.step()
```

What’s happening:
- During the forward pass, PyTorch builds a graph of operations.
- `loss.backward()` applies the chain rule to compute gradients for every leaf parameter.
- Each parameter tensor gets a `.grad` attribute holding its gradient.
- The optimiser updates parameters using these gradients.
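
One detail worth illustrating is why the gradients are cleared before each backward pass: PyTorch adds new gradients to whatever is already stored in `.grad`. The sketch below uses a hypothetical two-element tensor `w` rather than a full model to make the accumulation visible.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d(sum(w^2))/dw = 2w
(w ** 2).sum().backward()
first = w.grad.clone()

# Second backward pass without clearing: gradients are summed
(w ** 2).sum().backward()
accumulated = w.grad.clone()

# Clearing the gradient restores the expected value
w.grad.zero_()
(w ** 2).sum().backward()
after_reset = w.grad.clone()

print(first, accumulated, after_reset)
```

In a training loop, `optimiser.zero_grad()` performs this reset for every parameter the optimiser manages.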

PyTorch: checking autograd against an analytic gradient for a simple function
```python
import torch

# f(w) = (w^2).mean(), compute df/dw at some w
w = torch.tensor([[1.0, -2.0], [0.5, 3.0]], requires_grad=True)
f = (w ** 2).mean()
f.backward()  # df/dw = (2w)/N where N is number of elements

print("w:", w)
print("f:", f.item())
print("df/dw:", w.grad)  # should equal the analytic derivative 2w / N
print("matches analytic:", torch.allclose(w.grad, 2 * w.detach() / w.numel()))
```

TensorFlow (Keras): using GradientTape
```python
import tensorflow as tf

tf.random.set_seed(0)
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(3,))])
loss_fn = tf.keras.losses.MeanSquaredError()
optimiser = tf.keras.optimizers.SGD(learning_rate=0.1)

x = tf.random.normal((5, 3))
y = tf.random.normal((5, 1))

with tf.GradientTape() as tape:
    y_hat = model(x, training=True)
    loss = loss_fn(y, y_hat)

# Compute gradients of loss w.r.t. model trainable variables
grads = tape.gradient(loss, model.trainable_variables)
for var, g in zip(model.trainable_variables, grads):
    print(var.name, "grad shape:", g.shape)

# Apply gradients (parameter update)
optimiser.apply_gradients(zip(grads, model.trainable_variables))
```

Small, from-scratch example: manual backprop for a single neuron
```python
import numpy as np

# y_hat = ReLU(x @ w + b), loss = 0.5 * (y_hat - y)**2
x = np.array([0.2, -0.5, 1.0])
y = 0.8
w = np.array([0.1, -0.3, 0.5])
b = 0.0
lr = 0.1

# Forward
z = x.dot(w) + b
y_hat = max(0.0, z)  # ReLU
loss = 0.5 * (y_hat - y)**2

# Backward (chain rule)
dL_dyhat = (y_hat - y)
dyhat_dz = 1.0 if z > 0 else 0.0
dL_dz = dL_dyhat * dyhat_dz
dL_dw = dL_dz * x          # vector, same shape as w
dL_db = dL_dz

# Update
w -= lr * dL_dw
b -= lr * dL_db
```
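
As a sanity check, the same single-neuron computation can be repeated with PyTorch autograd and compared against the hand-derived chain rule. This is a sketch that reuses the values above; the names `manual_dw` and `manual_db` are introduced here for illustration only.

```python
import torch

x = torch.tensor([0.2, -0.5, 1.0])
y = torch.tensor(0.8)
w = torch.tensor([0.1, -0.3, 0.5], requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

# Forward pass: same neuron and loss as the NumPy version
y_hat = torch.relu(x @ w + b)
loss = 0.5 * (y_hat - y) ** 2

# Autograd computes dLoss/dw and dLoss/db
loss.backward()

# Hand-derived gradients; ReLU passes the gradient through because z > 0 here
manual_dw = (y_hat.detach() - y) * x
manual_db = y_hat.detach() - y

print(torch.allclose(w.grad, manual_dw), torch.allclose(b.grad, manual_db))
```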

Common practical points:
- **Retaining graphs**: In PyTorch, the computation graph is freed after `backward()` unless `retain_graph=True`, which you rarely need for standard training loops.
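- **No grad mode**: Wrap evaluation/inference in `torch.no_grad()` (or `torch.inference_mode()`) to skip gradient tracking for speed and memory; in TensorFlow, gradients are only recorded inside a `tf.GradientTape` context, so inference code simply runs outside one.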
- **Exploding/vanishing gradients**: Use normalisation, residual connections, appropriate initialisation, or gradient clipping.
- **Non-scalar losses**: Reduce to a scalar (e.g., mean) before calling `backward()`; PyTorch allows a vector-Jacobian product if you pass `gradient=...`.
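
Some of these points can be sketched in a few lines of PyTorch. The tensor `w` and the clipping threshold `max_norm=1.0` are arbitrary values chosen for illustration, not recommendations.

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# No grad mode: operations inside the block are not tracked
with torch.no_grad():
    untracked = w * 2
print(untracked.requires_grad)          # False

# Non-scalar output: pass `gradient=` to get a vector-Jacobian product.
# With a vector of ones this is equivalent to y.sum().backward().
y = w ** 2
y.backward(gradient=torch.ones_like(y))
grad_before_clip = w.grad.clone()       # equals 2w

# Gradient clipping: rescale gradients in place so their norm is at most max_norm
total_norm = torch.nn.utils.clip_grad_norm_([w], max_norm=1.0)
print(grad_before_clip, w.grad, total_norm)
```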


## GAN on Atari Images

Instead of relying on the MNIST dataset to show the power of DL, this section will use Generative Adversarial Networks (GANs) to generate screenshots of various Atari games.

GANs use two competing neural networks.
One is a **generator** that creates synthetic data, a _"cheater"_.
The other is a **discriminator** that tries to tell real data from fake, a _"detective"_.
Through this adversarial process, the generator learns to produce increasingly realistic samples.
Conversely, the discriminator becomes better at spotting fakes.
Over time, both models improve by challenging each other.

GANs are used to improve image quality, generate realistic images, and support feature learning for downstream tasks.
In this case, the example has little practical value beyond reinforcing what we have learned so far about PyTorch.
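
Before moving to images, the adversarial idea can be sketched on toy data. The following is a minimal illustration of one discriminator update and one generator update. It is not the Atari model: the layer sizes, the constants `LATENT_DIM`, `DATA_DIM` and `BATCH`, and the random stand-in for "real" data are all assumptions made for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

LATENT_DIM = 4   # hypothetical size of the generator's noise input
DATA_DIM = 2     # hypothetical size of one "real" sample
BATCH = 8

# Generator ("cheater"): noise -> fake sample
gen = nn.Sequential(nn.Linear(LATENT_DIM, 16), nn.ReLU(), nn.Linear(16, DATA_DIM))
# Discriminator ("detective"): sample -> probability that it is real
dis = nn.Sequential(nn.Linear(DATA_DIM, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

bce = nn.BCELoss()
gen_opt = torch.optim.Adam(gen.parameters(), lr=1e-3)
dis_opt = torch.optim.Adam(dis.parameters(), lr=1e-3)

real = torch.randn(BATCH, DATA_DIM) + 3.0   # random stand-in for real data
real_labels = torch.ones(BATCH, 1)
fake_labels = torch.zeros(BATCH, 1)

# 1) Discriminator step: learn to label real samples 1 and fake samples 0.
#    detach() stops gradients from flowing into the generator here.
noise = torch.randn(BATCH, LATENT_DIM)
fake = gen(noise)
dis_opt.zero_grad()
dis_loss = bce(dis(real), real_labels) + bce(dis(fake.detach()), fake_labels)
dis_loss.backward()
dis_opt.step()

# 2) Generator step: try to make the discriminator label fakes as real.
gen_opt.zero_grad()
gen_loss = bce(dis(fake), real_labels)
gen_loss.backward()
gen_opt.step()

print("dis_loss:", dis_loss.item(), "gen_loss:", gen_loss.item())
```

Repeating these two steps over many batches is what drives both networks to improve against each other.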


In [13]:
import gymnasium as gym
from gymnasium import spaces

In [14]:
class InputWrapper(gym.ObservationWrapper):
    """
    Preprocessing of the input NumPy array:
    1. resize the image to a predefined size
    2. move the color channel axis to the first position
    """
    def __init__(self, *args):
        super(InputWrapper, self).__init__(*args)
        old_space = self.observation_space
        assert isinstance(old_space, spaces.Box)

        self.observation_space = spaces.Box(
            self.observation(old_space.low),
            self.observation(old_space.high),
            dtype=np.float32
        )