# Hands-On Deep Learning - Introduction Session

This session will introduce you to the basic concepts of differentiable programming and training neural networks from scratch. We will be using the popular deep learning framework PyTorch.


In [8]:
import torch
import torch.nn as nn


We will first talk about the bread-and-butter data type of deep learning - tensors. Once done, we will do a quick introduction to differentiable programs. On a simple example, we will show how to compute gradients in PyTorch, and then lead you towards an implementation of a naive gradient descent training loop.

Having got a better feel of how one trains the parameters of a differentiable program, we will introduce neural networks. We will provide you with a full implementation of a training loop, and help you train an approximation of a Boolean function.

We will then steer away from illustrative examples and get started with practical, application-oriented deep learning. We will train both shallow and deep neural networks for image classification, tune the parameters of training and the sizes of architectures, and observe how the individual properties of our training setup influence the quality of the learning outcomes.

We will conclude this session with the introduction of convolutional layers. The challenge of the day will be to use convolutional neural networks to correctly classify greyscale images of fashion articles. 

## Prelude: Tensors

For training of deep models, `torch` uses a special data type: `torch.Tensor`, typically created with the function `torch.tensor`. A tensor is an encapsulation of uniform nested lists, and it can be made *trainable*, meaning that its values can be figured out by training on data.

Before we can move on to do anything more exciting, one has to know a bit about tensors. All you need to know is that....

#### Tensors are an enhanced, uniform variant of multi-dimensional lists that torch operations can eat.

In [9]:
A = [[ 0.0, 1.0], [ 1.0 , 0.0]]

try:
  torch.matmul(A, A) # throws a TypeError
except TypeError as error:
  print(f"An error occurred in {error}")

An error occurred in matmul(): argument 'input' (position 1) must be Tensor, not list


In [10]:
A_tensor = torch.tensor(A)

torch.matmul(A_tensor, A_tensor)

tensor([[1., 0.],
        [0., 1.]])

Notice that nested lists that are candidates for tensors must be uniform in every dimension:

In [11]:
B = [
    [[1, 2, 3], [4, 5, 6]],
    [[0, 0], [1, 1]]
]


try:
  B_tensor = torch.tensor(B) # throws a ValueError
except ValueError as error:
  print(f"Could not form a tensor: {error}")

Could not form a tensor: expected sequence of length 3 at dim 2 (got 2)


**Exercise.** Create a tensor `I_tensor`, which is a 3x3 identity matrix.

In [12]:
# Write your code here

I = None 

I_tensor = None

#### Tensors have a multi-dimensional `size`, also known as `shape`
The shape of a tensor describes the sizes of its individual tensor dimensions (also known as *axes*).

In [13]:
A_tensor.shape

torch.Size([2, 2])

In [14]:
A_tensor.size()

torch.Size([2, 2])

Notice that `A` consists of two lists containing two elements each.
Correspondingly, `A_tensor` has a size of `[2, 2]`, meaning that `A_tensor` consists of two sub-tensors, namely `A_tensor[0]` and `A_tensor[1]`, containing two elements each.

In [15]:
A_tensor[0]

tensor([0., 1.])

If we take `B` such that `B` contains two lists of lists, such as in

In [16]:
B = [
    [[1, 2, 3], [4, 5, 6]],
    [[0, 0, 0], [1, 1, 1]]
]

Then we can turn it into a `B_tensor` of size `[2,2,3]`,

In [17]:
B_tensor = torch.tensor(B)
B_tensor.shape

torch.Size([2, 2, 3])

... meaning that `B_tensor` consists of two two-dimensional sub-tensors, `B_tensor[0]` and `B_tensor[1]`.
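
As a quick sanity check of how shapes and sub-tensors relate, here is a minimal sketch. The tensor `T` and its shape `[2, 3, 4]` are illustrative choices and not part of the session material.

```python
import torch

# a hypothetical tensor with three axes of sizes 2, 3 and 4
T = torch.zeros(2, 3, 4)

outer_shape = T.shape          # sizes of all three axes
sub_shape = T[0].shape         # indexing the first axis drops it
sub_sub_shape = T[0][1].shape  # indexing again drops the next axis
axis_count = T.dim()           # number of axes

print(outer_shape, sub_shape, sub_sub_shape, axis_count)
```

Each level of indexing removes the leading axis, so the shape of a sub-tensor is the shape of its parent without the first entry.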

**Exercise.** What should be the shape of `I_tensor`?

In [18]:
I_tensor_shape_intended = torch.Size( [  ] ) # modify this line with your guess

**Exercise.** Retrieve the shape of `I_tensor`.

In [19]:
I_tensor_shape = None # modify this line with the correct code

**Exercise.** Are they the same?

In [20]:
# Run this code block.

if I_tensor_shape_intended == I_tensor_shape:
  print("I_tensor has shape as intended.")
else:
  print("I_tensor does not have shape as intended.")

I_tensor does not have shape as intended.


#### You can access arbitrary sub-tensors of every tensor

For this, you can use Python's usual slicing notation. For example, to get the second element of each of the deepest lists of `B` in the corresponding tensor, one can simply write

In [21]:
B_tensor[:,:,1]

tensor([[2, 5],
        [0, 1]])
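
A few more slices on the same data may help build intuition. This is only a sketch: `B_tensor` is rebuilt here so that the snippet runs on its own, and the names `first_block`, `last_entries` and `first_two` are illustrative.

```python
import torch

B_tensor = torch.tensor([
    [[1, 2, 3], [4, 5, 6]],
    [[0, 0, 0], [1, 1, 1]],
])

first_block = B_tensor[0]          # the first two-dimensional sub-tensor
last_entries = B_tensor[:, :, -1]  # last element of each innermost list
first_two = B_tensor[:, 0, :2]     # first two entries of the first row of each block

print(first_block.shape)
print(last_entries)
print(first_two)
```

A colon keeps an axis whole, a single index removes the axis, and a range such as `:2` keeps the axis but shortens it.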

#### Tensors can be either trainable or non-trainable

The trainable tensors are the ones that have `requires_grad` set to `True`.

In [22]:
A_trainable = torch.tensor(A, requires_grad=True)

print(f"A_tensor is trainable: {A_tensor.requires_grad}")
print(f"A_trainable is trainable: {A_trainable.requires_grad}")

A_tensor is trainable: False
A_trainable is trainable: True


#### Tensors can be used in computations element-wise, as long as the dimensions match

In [23]:
B_tensor + B_tensor ** 2 - 0.3 * B_tensor

tensor([[[ 1.7000,  5.4000, 11.1000],
         [18.8000, 28.5000, 40.2000]],

        [[ 0.0000,  0.0000,  0.0000],
         [ 1.7000,  1.7000,  1.7000]]])
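
To see what "as long as the dimensions match" means in practice, the sketch below combines tensors of equal and of incompatible shapes. The tensors `a`, `b`, `c` and `row` are illustrative. The last part relies on PyTorch's broadcasting, which lets a lower-dimensional tensor be reused along the missing axes.

```python
import torch

a = torch.ones(2, 3)
b = torch.ones(2, 3) * 2
c = torch.ones(4, 5)

same_shape_sum = a + b  # fine: both operands have shape [2, 3]

try:
    a + c  # shapes [2, 3] and [4, 5] cannot be combined element-wise
    mismatch_failed = False
except RuntimeError as error:
    mismatch_failed = True
    print(f"Shapes do not match: {error}")

# broadcasting: a tensor of shape [3] is reused for every row of `a`
row = torch.tensor([10.0, 20.0, 30.0])
broadcast_sum = a + row
print(broadcast_sum)
```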

**Exercise.** With the help of PyTorch documentation online, find the square root of $B^3$.

In [24]:
# your solution


## Programs

You are certainly familiar with the notion of a classical computer *program*. For our purposes, a program $f$ is an information processing device that takes some inputs $x$ and produces outputs $f(x)$.

Programs can be *pure*, meaning they have no side effects

In [25]:
def is_large(x):
  threshold = 10
  if x > threshold:
    return True
  else:
    return False

... or *impure*, meaning that executing them alters some fixed memory state in the computer.

In [26]:
large_number_count = 0

def impure_is_large(x):
  global large_number_count  # required to modify the module-level counter
  threshold = 10
  if x > threshold:
    large_number_count += 1
    return True
  else:
    return False

Notice that the variable `threshold` in both `is_large` and `impure_is_large` does not really encode a state of the program, but is a parameter determining which numbers will and which numbers won't be considered "large".

Throughout this session, we will only be dealing with pure programs.

## Differentiable Programs and Why They Are So Special

The entire world is now interested in a particular sub-class of programs, called *differentiable programs*.

A differentiable program is a program $f$ such that $f$ is differentiable with respect to its parameters. Here is an example of a differentiable program $f$ taking $x$ as input and multiplying it by a parameter $p$.

In [27]:
p = torch.tensor([ 1.0 ], requires_grad=True)

def f(x):
  global p
  return p * x

Apart from forcing software engineers to dust off their high-school calculus knowledge, what are these differentiable programs actually good for? Why has the entire software engineering and data science world gone crazy over them?

We won't keep you in suspense, here's the "secret":

> Given input-output data, differentiable programs can be taught, through trial and error, to use the right parameters.

So, in the example of `f` above, we could train the program to learn the "true" value of `p`.

The method enabling this is gradient descent training. It has been covered well in many lectures and online resources. If you need a quick refresher, have a look at the corresponding videos from the Computational Thinking course, available [here](https://).

### Gradient Computation in PyTorch
Gradient-descent training in PyTorch is powered by the `autograd` module: `torch.autograd` is PyTorch's automatic differentiation engine.

Suppose you take some trainable tensor $x$ and pass it through $f$.

In [28]:
x = torch.tensor([ 5.2 ], requires_grad=True)
output = f(x)
output

tensor([5.2000], grad_fn=<MulBackward0>)

Notice that `output` now has an additional field, `grad_fn`, that was set by the `autograd` system to keep track of what operations have been performed on `x` to arrive at `output`.

Now, given some expected output for $f(x)$, say $1$, `autograd` allows you to compute an indication of how `p` needs to be changed in order for $f(x)$ to eventually yield the correct outputs. This is done in a process called *backward pass*.

In [29]:
expected_output = torch.tensor([1.0])
loss = (output - expected_output) ** 2
loss.backward(retain_graph=True)

This indication can then be inspected by asking `p` what its gradient is, that is, by reading `p.grad`.

In [30]:
p.grad

tensor([43.6800])

This indication can be interpreted as

> Decreasing `p` by some small $\epsilon$ will decrease the loss by approximately $43.68\epsilon$.

And, as you already know, the negative of the gradient indicates the direction in which one should modify the parameters in order to descend towards lower values of the loss. Gradient descent training is simply the routine in which one iteratively computes the gradient of the loss function with respect to the parameters of the computation and then applies it to minimise the loss.
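
The value `43.68` reported by `p.grad` can be cross-checked in two independent ways. The sketch below is an illustration, not part of the original session: it recomputes the gradient with `autograd`, derives it by hand from $\frac{d}{dp}(px - y)^2 = 2(px - y)x$, and estimates it with a finite difference. The names `loss_at`, `epsilon` and the `*_gradient` variables are assumptions introduced for this example.

```python
import torch

x_value, target = 5.2, 1.0

# gradient as computed by autograd
p = torch.tensor([1.0], requires_grad=True)
loss = (p * torch.tensor([x_value]) - target) ** 2
loss.backward()
autograd_gradient = p.grad.item()

# the same derivative by hand: d/dp (p*x - y)^2 = 2 * (p*x - y) * x
analytic_gradient = 2 * (1.0 * x_value - target) * x_value

# a finite-difference estimate: nudge p by a small epsilon and compare losses
def loss_at(p_value):
    return (p_value * x_value - target) ** 2

epsilon = 1e-4
numeric_gradient = (loss_at(1.0 + epsilon) - loss_at(1.0 - epsilon)) / (2 * epsilon)

print(autograd_gradient, analytic_gradient, numeric_gradient)
```

All three numbers should agree up to floating-point precision, which is a useful habit to adopt whenever a gradient looks suspicious.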

**Exercise.** Fill in the code below to compute the gradients for `p` equal to `0.75`, `0.5`, `0.25`, and `0.20`. What do you observe?

In [31]:
# case p = 0.75
p = torch.tensor([ 0.75 ], requires_grad=True)

#  - you want something gradienty here :)

print(f"For p = 0.75, the gradient is {p.grad}")

# case p = 0.50
p = torch.tensor([ 0.50 ], requires_grad=True)

#  - also here

print(f"For p = 0.50, the gradient is {p.grad}")

# case p = 0.25
p = torch.tensor([ 0.25 ], requires_grad=True)

#  - ...

print(f"For p = 0.25, the gradient is {p.grad}")

# case p = 0.20
p = torch.tensor([ 0.20 ], requires_grad=True)

print(f"For p = 0.20, the gradient is {p.grad}")

For p = 0.75, the gradient is None
For p = 0.50, the gradient is None
For p = 0.25, the gradient is None
For p = 0.20, the gradient is None


### A Basic Training Loop
As hinted at by the above example, modifying the parameters of a differentiable program in the direction opposite to the gradient of the loss (i.e. in the direction in which the loss decreases most rapidly) generally guides the differentiable program towards a minimum of the loss.

This process can be repeated iteratively, to form what is called a *training loop*. A typical training procedure of a differentiable program looks as follows:


1. Initialise the parameters of the differentiable program according to an appropriate scheme.
2. Take the inputs provided and perform a *forward pass* -- apply the program to the inputs.
3. Compute the loss between the expected outputs and the actual outputs of the program.
4. Compute the gradient of the loss with respect to the program's parameters.
5. Scale the gradients by the desired pace of descent -- *the learning rate* -- and update the parameters accordingly.
6. If not done yet, go back to step 2.
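
To make the six steps concrete without giving away any exercise, here is a sketch on a different, hypothetical program $g(x) = x + b$ with a single parameter `b`. It is written in plain Python and the gradient is worked out by hand rather than computed by `autograd`. The data, the learning rate and the number of iterations are illustrative choices.

```python
# hypothetical toy data generated by g(x) = x + b with b = 3.0
xs = [0.0, 1.0, 2.0, 3.0]
ys = [x + 3.0 for x in xs]

b = 0.0  # step 1: initialise the parameter
learning_rate = 0.1
losses = []

for iteration in range(100):
    # step 2: forward pass
    outputs = [x + b for x in xs]
    # step 3: mean squared error between actual and expected outputs
    loss = sum((o - y) ** 2 for o, y in zip(outputs, ys)) / len(xs)
    # step 4: gradient of the loss with respect to b, derived by hand
    gradient = sum(2 * (o - y) for o, y in zip(outputs, ys)) / len(xs)
    # step 5: scale by the learning rate and update
    b = b - learning_rate * gradient
    # step 6: record the loss and repeat
    losses.append(loss)

print(f"learned b = {b:.4f}, final loss = {losses[-1]:.6f}")
```

With `autograd`, step 4 is replaced by a call to `backward()`, and the update in step 5 is wrapped in `torch.no_grad()` so that it is not itself recorded for differentiation.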


You already possess all the basic ingredients necessary to implement such a training procedure yourself. Let's do that.

**Exercise.** Fill in the code below to arrive at a working implementation of a gradient descent training loop for $f$.


In [32]:
def train_f(x, y, learning_rate: float = 0.01, number_of_iterations: int = 50):
  global p
  # TODO: initialise p to a random tensor between 0 and 1 (hint: use torch.rand)
  # remember that you need p to be trainable!

  for iteration in range(1, number_of_iterations+1):
    # TODO: perform a "forward pass" (apply f to x)
    output = None
    
    # compute the loss
    loss = torch.sum((output - y) ** 2) / output.size(0)

    # TODO: compute the gradient of `loss` given p

    # TODO: subtract learning_rate*(gradient of p) from p ...
    with torch.no_grad():
      # ... here
      pass  # placeholder so that the template parses; replace it with your update of p

    # finally, erase the gradients for the next iteration
    p.grad.zero_()

**Exercise.** Choose a value of `p_true` -- the parameter value for $f$ to be learned. Then, run the code below to check the correctness of your training loop from above. If you struggle to get the right answer, consider decreasing the learning rate and increasing the number of iterations.

In [None]:
p_true = None # the parameter value of p to be learned, as a one-element tensor (the last line calls p_true.item())
datapoint_count = None # the number of datapoints to use for training in every iteration of `train_f`

x = torch.rand((datapoint_count,))
y = x * p_true

train_f(x, y)
print(f"The true value is {p_true.item()}, the value learned by gradient descent is {p.item()}")

## Introducing Neural Networks
Neural networks are a particular class of differentiable programs, for which it has been theoretically proven that they can approximate an arbitrary integrable function arbitrarily well, as long as they are given enough *representational power*.

*Here is where the deep learning black magic begins.* 

Classical programs consist of a sequence of specific operations such as addition or conditional value assignment. Neural networks are differentiable programs that consist of a sequence of amenable elementary building blocks, traditionally referred to as *layers*, that can ultimately perform a wide variety of operations. The "bigger" these layers are, the more complex behaviour they can learn to exhibit.

There exist several popular types of neural network building blocks, including the trainable *linear layer*, or the non-trainable *activation*, *softmax*, and *dropout layers*, to name but a few. The combination of a linear layer and an activation layer is sometimes referred to as a *dense layer* and is the basic building block of a *deep neural network*.

The amount of *representational power* a network has is determined by the sizes of its trainable layers. Linear layers have a "width" (the number of constituent neurons). The wider the layer, the more fine-grained the operation it is capable of representing. Whether it can learn to represent this operation is, however, an entirely different question.

### Constructing Neural Networks
Without further ado, let us use these building blocks to form neural networks.

Knowing the format of the operation of individual layers, you could go ahead and implement them directly. To avoid unnecessary detail, we will instead use the ready-made implementations of these layers from the `torch.nn` module.

In general, the [documentation](https://pytorch.org/docs/stable/nn.html) of the `torch.nn` module is what you want to turn to to understand a new layer type.

We walk you through creating instances of various layer types in the code below. We directly use the instances to operate on input data.

#### Linear Layers

Let us begin with the most basic layer in deep learning, the Linear layer.

In [None]:
# construct a linear layer that takes a tensor of size (3,) and produces a 
#  tensor of size (5,)
linear_layer = torch.nn.Linear(3, 5)

# pass [1, 2, 3] through the layer
example_input = torch.tensor([ 1, 2, 3 ], dtype=torch.float)
linear_layer(example_input)

Notice that as promised in the call to `nn.Linear(3, 5)`, the output tensor has 5 entries. Its output values are the result of an internal state (the layer *weights*) that has been initialised at random.
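
To see what the layer does with its internal state, the sketch below reproduces its output by hand. It relies on the fact that `nn.Linear` stores a `weight` matrix and a `bias` vector and computes $Wx + b$. The seed is set only to make the random initialisation repeatable, and the names `layer_output` and `manual_output` are illustrative.

```python
import torch

torch.manual_seed(0)  # only to make the random initialisation repeatable

linear_layer = torch.nn.Linear(3, 5)
example_input = torch.tensor([1, 2, 3], dtype=torch.float)

layer_output = linear_layer(example_input)

# the same computation written out with the layer's own parameters
manual_output = linear_layer.weight @ example_input + linear_layer.bias

weight_shape = linear_layer.weight.shape  # one row of weights per output entry
bias_shape = linear_layer.bias.shape      # one bias per output entry

print(weight_shape, bias_shape)
print(torch.allclose(layer_output, manual_output))
```

Both `weight` and `bias` are trainable tensors, so they are exactly the parameters that gradient descent will adjust.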

#### Activation Layers

Several types of activation layers exist, most notably the logistic sigmoid, rectified linear unit (ReLU), and the hyperbolic tangent. Each of these has a layer in `torch.nn`.

In [None]:
# construct a ReLU layer
relu_layer = torch.nn.ReLU()

# pass [ -3, -2, -1, 0, +1, +2, +3 ] through the ReLU layer
example_input = torch.tensor([ -3, -2, -1, 0, +1, +2, +3 ], dtype=torch.float)
relu_layer(example_input)

In [None]:
# construct a sigmoid layer
sigmoid_layer = torch.nn.Sigmoid()

# pass [ -3, -2, -1, 0, +1, +2, +3 ] through the sigmoid layer
sigmoid_layer(example_input)

In [None]:
# construct a tanh layer
tanh_layer = torch.nn.Tanh()

# pass [ -3, -2, -1, 0, +1, +2, +3 ] through the tanh layer
tanh_layer(example_input)

As you can see, as the input goes from minus infinity towards infinity, the ReLU stays at constant 0 for negative inputs and switches to linear behaviour at 0, the logistic sigmoid climbs from 0 towards 1, and tanh climbs from -1 towards +1.
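
The three activations have simple closed forms: $\mathrm{ReLU}(x) = \max(0, x)$, $\sigma(x) = 1 / (1 + e^{-x})$, and $\tanh(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x})$. The sketch below writes them out by hand and compares the results with the `torch.nn` layers. The `*_by_hand` and `*_matches` names are illustrative.

```python
import torch

example_input = torch.tensor([-3, -2, -1, 0, 1, 2, 3], dtype=torch.float)

relu_by_hand = torch.clamp(example_input, min=0)       # max(0, x)
sigmoid_by_hand = 1 / (1 + torch.exp(-example_input))  # 1 / (1 + e^(-x))
tanh_by_hand = (torch.exp(example_input) - torch.exp(-example_input)) / (
    torch.exp(example_input) + torch.exp(-example_input)
)

relu_matches = torch.allclose(torch.nn.ReLU()(example_input), relu_by_hand)
sigmoid_matches = torch.allclose(torch.nn.Sigmoid()(example_input), sigmoid_by_hand)
tanh_matches = torch.allclose(torch.nn.Tanh()(example_input), tanh_by_hand)

print(relu_matches, sigmoid_matches, tanh_matches)
```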

#### Dropout Layers

It is sometimes to the advantage of model training to "drop out" some of the incoming values at random. To this end, `torch.nn` provides the `Dropout` layer, which can be parametrised at construction with the probability of an input value being dropped out.

In [None]:
# construct a dropout layer with probability 0.5
dropout_layer = torch.nn.Dropout(p=0.5)

# pass [ -3, -2, -1, 0, +1, +2, +3 ] through the dropout layer
dropout_layer(example_input)

# with p=0.5, roughly half of the inputs should be dropped out on average, 
#  and the remaining outputs are scaled up by 1/(1-p) == 2 
# run this snippet multiple times to observe the effects of random dropout
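
Dropout behaves differently depending on whether the layer is in training or evaluation mode. The sketch below illustrates this on a vector of ones, chosen so that the scaling by $1/(1-p)$ is easy to spot. The seed and the vector length of 1000 are arbitrary choices made for this example.

```python
import torch

torch.manual_seed(0)  # only to make the random mask repeatable

dropout_layer = torch.nn.Dropout(p=0.5)
ones = torch.ones(1000)

dropout_layer.train()  # training mode: dropout is active
train_output = dropout_layer(ones)
kept_values = train_output[train_output != 0]
dropped_fraction = (train_output == 0).float().mean().item()

dropout_layer.eval()  # evaluation mode: the layer passes its input through unchanged
eval_output = dropout_layer(ones)

print(f"fraction dropped in training mode: {dropped_fraction:.3f}")
print(f"surviving values are all 2.0: {bool(torch.all(kept_values == 2.0))}")
print(f"evaluation mode leaves the input unchanged: {torch.equal(eval_output, ones)}")
```

Layers and networks are in training mode by default, so remember to switch a network to evaluation mode before measuring its performance if it contains dropout.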

#### Softmax
Sometimes we wish to interpret an $n$-dimensional vector of real values as scores in favour of a single one of $n$ discrete elements possessing a certain property. To this end, we often use the "softmax" layer.

The softmax layer takes the $n$-dimensional vector of real values and produces an $n$-dimensional vector of values between $0$ and $1$, whose individual entries sum up to $1$.

The bigger an entry of the input vector is relative to other entries, the closer its corresponding value in the output vector is to $1$.

In [None]:
# construct a softmax layer
softmax_layer = torch.nn.Softmax(dim=0)

# pass [ -3, -2, -1, 0, +1, +2, +3 ] through the softmax layer
softmax_layer(example_input)

In [None]:
# pass [ -1, 2, 5, 100 ] through the softmax layer
softmax_layer(torch.tensor([ -1, 2, 5, 100 ], dtype=torch.float))
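
Softmax itself is a short formula: each entry is exponentiated and divided by the sum of all exponentials, $\mathrm{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}$. The sketch below checks this against the `torch.nn` layer. The names `scores`, `softmax_by_hand` and `softmax_from_layer` are illustrative.

```python
import torch

scores = torch.tensor([-3, -2, -1, 0, 1, 2, 3], dtype=torch.float)

softmax_by_hand = torch.exp(scores) / torch.exp(scores).sum()
softmax_from_layer = torch.nn.Softmax(dim=0)(scores)

total = softmax_from_layer.sum().item()                  # entries sum up to 1
largest_index = torch.argmax(softmax_from_layer).item()  # position of the largest score

print(torch.allclose(softmax_by_hand, softmax_from_layer), total, largest_index)
```

Because of the exponential, a score that is much larger than the rest takes almost all of the probability mass.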

### Putting the Layers Together -- Implementing a DNN


Neural networks are graphs of layers. In PyTorch, we generally tend to implement our neural networks as classes whose constructors construct the constituent parts of the network, and whose `forward` function passes the data through these parts.

Below is an example implementation of a shallow neural network. This network is *shallow* as it contains only one hidden trainable layer (=layer that is not an input or output layer).


In [None]:
class ShallowNeuralNet(nn.Module):
    def __init__(self, input_width: int, hidden_layer_width: int, output_width: int):
        super().__init__()
        self.hidden_layer = nn.Linear(input_width, hidden_layer_width)
        self.hidden_relu = nn.ReLU()
        self.output_layer = nn.Linear(hidden_layer_width, output_width)

    def forward(self, input):
        hidden_trainable_output = self.hidden_layer(input)
        hidden_relu_output = self.hidden_relu(hidden_trainable_output)
        output = self.output_layer(hidden_relu_output)
        return output

Once the network behaviour has been described in this fashion, we can create an instance of the entire network at once and use it to process data in exactly the same way as we would use layers.

In [None]:
shallow_nn_instance = ShallowNeuralNet(5, 10, 2)

example_input = torch.ones(5)
shallow_nn_instance(example_input)
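
For a simple chain of layers like this one, PyTorch also offers `nn.Sequential`, which applies its layers one after another. The sketch below builds a network with the same layer sizes (5, 10, 2) and counts its trainable values. It is an alternative way of writing the same architecture, not the approach taken in this session, and `sequential_net` is an illustrative name.

```python
import torch
from torch import nn

sequential_net = nn.Sequential(
    nn.Linear(5, 10),  # hidden layer: 5*10 weights + 10 biases
    nn.ReLU(),
    nn.Linear(10, 2),  # output layer: 10*2 weights + 2 biases
)

output = sequential_net(torch.ones(5))
parameter_count = sum(p.numel() for p in sequential_net.parameters())

print(output.shape, parameter_count)
```

Writing the network as a class with an explicit `forward` becomes necessary as soon as the data flow is not a plain chain, as in the two-tower networks discussed later.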

**Exercise.** Fill in the snippet below to arrive at an implementation of a deep ReLU neural network with a layer profile given by a list of integers.

In [None]:
class DeepNeuralNet(nn.Module):
  def __init__(self, input_width, hidden_layer_profile, output_width, output_activation=None):
    super().__init__()
    self.layers = nn.ModuleList()

    # create the first hidden layer
    self.layers.append(nn.Linear(input_width, hidden_layer_profile[0]))
    self.layers.append(nn.ReLU())

    # create the internal hidden layers
    for in_width, out_width in zip(hidden_layer_profile[0:-1], hidden_layer_profile[1:]):
      self.layers.append(nn.Linear(in_width, out_width))
      self.layers.append(nn.ReLU())

    # create the output layer
    self.output_layer = nn.Linear(hidden_layer_profile[-1], output_width)
    self.output_activation = nn.Identity() if not output_activation else output_activation
  
  def forward(self, input):
    x = input

    # loop through the layers to produce the output of the hidden network
    for layer in self.layers:
      # TODO: pass the intermediate output of the previous layer through the current layer
      x = None

    # TODO: produce the output of the network from the intermediate output of the last hidden layer
    output_before_activation = None

    # TODO: engage the optional activation in self.output_activation on the output_before_activation
    output = None

    return output

**Exercise.** Test the class for generic deep neural networks below. Does everything work as expected?


In [None]:
# try passing a tensor of random numbers through the network
input1 = torch.rand((10,))
deep_nn_instance = DeepNeuralNet(input_width=10, hidden_layer_profile=[10, 7, 5], output_width=1)

deep_nn_instance(input1)

A typical neural network design pattern appearing in search and recommender systems is that of "two towers". Explained in brief, the network consists of two separate sub-networks, one for queries and one for results. In some special cases when the modalities of the queries and results are the same, the towers can be made to "share weights" (use the same architecture and parameters to process their respective inputs). In such a case, one might talk of the "twin tower" architecture being used.

**Exercise (Weight Sharing).** Using the class `DeepNeuralNet` you implemented above, fill in the code below to produce an implementation of the twin tower architecture.

In [None]:
class TwinDeepNeuralNet(nn.Module):
  def __init__(self, input_width, hidden_layer_profile, output_width):
    super().__init__()
    
    # TODO: use DeepNeuralNet to construct a network that can perform the function of a "twin tower" network

  def forward(self, input):
    # identify the query and the value as sub-tensors of the input tensor
    input_query = input[0,:]
    input_value = input[1,:]

    # TODO: use the layer(s) or sub-network(s) initialised in the constructor to implement the functionality of a "twin tower" network
    output_query = None
    output_value = None


    # form a single output tensor as a disjoint union of the query and value tensors
    output_query = output_query.unsqueeze(0)
    output_value = output_value.unsqueeze(0)
    output = torch.cat([output_query, output_value], dim=0)

    return output

On the simple example below, test whether your implementation works.

In [None]:
testing_input = torch.ones((2, 5))

twin_nn_instance = TwinDeepNeuralNet(5, [10, 10], 1)
testing_output = twin_nn_instance(testing_input)

equality = torch.all(testing_output[0] == testing_output[1])

print(f"The outputs of the twins are {'equal' if equality else 'not equal'}")

### A Training Loop for Neural Networks
In a previous section concerning differentiable programs, we introduced the intuition for using the gradient information due to a choice of loss function to find optimal parameters of a differentiable program. 

This is exactly what we do for neural networks as well, in order to train them towards the behaviour we desire.

Below, we give code for the optimisation of a neural network `net` with a particular loss function `loss_fn` on a dataset loaded by a `dataloader`, and comment on it step by step.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

def training_loop(dataloader, net, loss_fn, optimiser, verbosity=3):
    # make sure the network lives on the same device as the data
    net.to(device)
    size = len(dataloader.dataset)
    last_print_point = 0
    current = 0

    acc_loss = 0
    acc_count = 0

    # for every slice (X, y) of the training dataset
    for batch, (X, y) in enumerate(dataloader):
        X = X.to(device)
        y = y.to(device)

        # perform a forward pass to compute the outputs of the net
        pred = net(X)

        # calculate the loss between the outputs of the net and the desired outputs
        loss_val = loss_fn(pred, y)
        acc_loss += loss_val.item()
        acc_count += 1

        # zero the gradients computed in the previous step 
        optimiser.zero_grad()

        # calculate the gradients of the parameters of the net
        loss_val.backward()

        # use the gradients to update the weights of the network
        optimiser.step()

        # compute (approximately) how many datapoints have already been used for training
        current = (batch + 1) * len(X)

        # report on the training progress roughly every 10% of the progress
        if verbosity >= 3 and (current - last_print_point) / size >= 0.1:
            loss_val = loss_val.item()
            last_print_point = current
            print(f"loss: {loss_val:>7f}  [{current:>5d}/{size:>5d}]")

    return acc_loss / acc_count
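
The core of the loop is the pattern `zero_grad()`, `backward()`, `step()`. The sketch below shows that pattern in isolation on a hypothetical toy problem (fitting $y = 2x + 1$ with a single linear layer). The `toy_*` names, the `MSELoss` loss, the `SGD` optimiser and all hyperparameters are assumptions made for this example and are not the settings used later in this session.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # only to make the run repeatable

# hypothetical toy data: y = 2x + 1
toy_x = torch.linspace(-1, 1, 64).unsqueeze(1)
toy_y = 2 * toy_x + 1
toy_loader = DataLoader(TensorDataset(toy_x, toy_y), batch_size=8, shuffle=True)

toy_net = nn.Linear(1, 1)
toy_loss_fn = nn.MSELoss()
toy_optimiser = torch.optim.SGD(toy_net.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(50):
    total = 0.0
    for X, y in toy_loader:
        loss_val = toy_loss_fn(toy_net(X), y)  # forward pass and loss
        toy_optimiser.zero_grad()              # forget the previous gradients
        loss_val.backward()                    # compute new gradients
        toy_optimiser.step()                   # update the parameters
        total += loss_val.item()
    epoch_losses.append(total / len(toy_loader))

print(f"first epoch loss {epoch_losses[0]:.4f}, last epoch loss {epoch_losses[-1]:.6f}")
print(f"learned weight {toy_net.weight.item():.3f}, learned bias {toy_net.bias.item():.3f}")
```

The optimiser object hides the manual parameter update from the basic training loop earlier in the session, and makes it easy to swap plain gradient descent for a more elaborate update rule.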

We now possess all the tools necessary for constructing simple neural networks and for training them towards some particular behaviour by gradient descent loss minimisation. Let us put these tools to good use.

### Learning a Boolean Function



Let us consider the problem of learning a random Boolean function $f: \left\{ 0,1 \right\}^n \to \left\{ 0,1 \right\}^n$. It might sound a bit dry at first, but bear with us, it is a very natural and tractable example for the examination of the representational power of various types of neural networks.

In [None]:
def make_binary_array(number: int, length: int) -> list:
	return [ (number>>k)&1 for k in range(0, length) ]

data_x_list = []

for i in range(0, 256):
  data_x_list.append(make_binary_array(i, 8))

data_x = torch.tensor(data_x_list, dtype=torch.float)
data_y = torch.randint(low=0, high=2, size=(256, 8)).type(torch.float)
dataset = torch.utils.data.TensorDataset(data_x, data_y)

The above code generates a dataset of $(x, y)$ pairs where $x$ ranges over all $n=8$-bit inputs and $y = f(x)$, with $f$ chosen uniformly at random.
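
To make the input encoding tangible, the sketch below repeats `make_binary_array` and inspects a few of its outputs. Note that the bits come out least significant first. The names `five`, `largest`, `recovered` and `distinct_inputs` are illustrative.

```python
def make_binary_array(number: int, length: int) -> list:
    # bit k of `number`, least significant bit first
    return [(number >> k) & 1 for k in range(0, length)]

five = make_binary_array(5, 8)
largest = make_binary_array(255, 8)

# reading the bits back recovers the original number
recovered = sum(bit << k for k, bit in enumerate(five))

# every number from 0 to 255 gets its own distinct 8-bit input
all_inputs = [make_binary_array(i, 8) for i in range(256)]
distinct_inputs = len({tuple(bits) for bits in all_inputs})

print(five, largest, recovered, distinct_inputs)
```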

Given these examples, can we learn a neural network that performs the function of $f$? Yes! Just run the code below.

In [None]:
net = DeepNeuralNet(input_width=8, hidden_layer_profile=[256], output_width=8, output_activation=nn.Sigmoid())

In [None]:
def testing_loop(dataloader, net):
  # make sure the network lives on the same device as the data
  net.to(device)
  size = len(dataloader.dataset)
  last_print_point = 0
  current = 0

  acc_correct = 0
  acc_count = 0

  # for every slice (X, y) of the dataset
  with torch.no_grad():
    for batch, (X, y) in enumerate(dataloader):
        X = X.to(device)
        y = y.to(device)
        
        # perform a forward pass to compute the outputs of the net
        pred = net(X)

        # round the predictions (0 - 0.5 towards zero, >0.5 towards one)
        pred_rounded = torch.round(pred)

        # compute the number of correct entries
        acc_correct += torch.count_nonzero(pred_rounded == y).item()
        acc_count += y.numel()

  return acc_correct / acc_count

In [None]:
def train(dataloader, net, loss_fn, optimiser, epochs, verbosity=3):
  least_loss = None

  for t in range(epochs):
    if verbosity >= 3:
      print(f"Epoch {t+1}\n-------------------------------")
    
    mean_loss = training_loop(dataloader, net, loss_fn, optimiser, verbosity=verbosity)
    accuracy = testing_loop(dataloader, net)
    if least_loss is None or mean_loss < least_loss:
      least_loss = mean_loss
    if verbosity >= 2:
      print((f"Epoch {t+1}: " if verbosity >= 3 else "") + f"mean loss {mean_loss}, accuracy on the given data {accuracy:.2%}")
    if verbosity >= 3:
      print("\n")
  
  if verbosity >= 1:
    print(f"Training complete, least loss {least_loss}, final accuracy on the given data {accuracy:.2%}")
  return least_loss

In [None]:
training_dataloader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True)
loss_fn = nn.BCELoss()
optimiser = torch.optim.Adam(net.parameters(), lr=5e-3)

least_loss = train(training_dataloader, net, loss_fn, optimiser, epochs=250, verbosity=2)


Okay, this is encouraging. We are getting a loss on the order of $10^{-3}$ (which is relatively little for mean binary cross-entropy) and 100% accuracy. You will notice that as the training progresses, the loss tends to decrease and the accuracy increases. You will also notice that our neural network has only one hidden layer of 256 neurons. But are all those neurons really necessary?
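
For reference, the mean binary cross-entropy of predictions $\hat{y}_i$ against targets $y_i$ is $-\frac{1}{N}\sum_i \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$. The sketch below evaluates this formula by hand on a few made-up predictions and compares the result with `nn.BCELoss`. The numbers in `predictions` and `targets` are purely illustrative.

```python
import torch
from torch import nn

# made-up predictions (probabilities) and their binary targets
predictions = torch.tensor([0.9, 0.2, 0.6, 0.99])
targets = torch.tensor([1.0, 0.0, 1.0, 1.0])

bce_by_hand = -(
    targets * torch.log(predictions)
    + (1 - targets) * torch.log(1 - predictions)
).mean()

bce_from_torch = nn.BCELoss()(predictions, targets)

print(bce_by_hand.item(), bce_from_torch.item())
```

Confident correct predictions contribute almost nothing to the loss, while hesitant ones (such as `0.6` here) dominate it.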

Let's push things to an extreme and consider a network that has exactly one neuron in its hidden layer. In other words, all of the information that the output neurons have about the input must be contained in exactly one activated number, and the hidden layer of such a network is an information bottleneck. All other parameters being equal, what sort of loss values and accuracies will we be getting under such circumstances?

In [None]:
slender_net = DeepNeuralNet(input_width=8, hidden_layer_profile=[ 1 ], output_width=8, output_activation=nn.Sigmoid())

training_dataloader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True)
loss_fn = nn.BCELoss()
optimiser = torch.optim.Adam(slender_net.parameters(), lr=5e-3)

least_slender_loss = train(training_dataloader, slender_net, loss_fn, optimiser, epochs=250, verbosity=1)

We see that with one hidden neuron, we learn to predict the values of $f(x)$ only marginally better than a coin flip, and that this corresponds to a relatively large value of the binary cross-entropy loss.

Okay. So there is a number of hidden neurons (256) that is sufficient for learning $f$ with 100% accuracy, and there is a number of hidden neurons (1) that is clearly insufficient to learn anything but some rough indication of the correct output. 

*   With one hidden neuron, we have starved the network of representational power to the extent that it is only slightly better than tossing a fair coin at predicting $f$.
*   With 256 hidden neurons, we have given the network enough representational power to learn $f$. Perhaps even too much.
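
One rough way to compare the two networks is to count their trainable values. The sketch below does so for a shallow network with one hidden layer, using a hypothetical helper `shallow_parameter_count`. Parameter count is only a crude proxy for representational power, so treat the numbers as orientation rather than as a precise measure.

```python
def shallow_parameter_count(input_width: int, hidden_width: int, output_width: int) -> int:
    hidden = input_width * hidden_width + hidden_width   # weights + biases of the hidden layer
    output = hidden_width * output_width + output_width  # weights + biases of the output layer
    return hidden + output

wide_count = shallow_parameter_count(8, 256, 8)
slender_count = shallow_parameter_count(8, 1, 8)
bits_to_memorise = 256 * 8  # number of output bits in the truth table of f

print(wide_count, slender_count, bits_to_memorise)
```

The wide network has more trainable values than there are bits in the truth table of $f$, whereas the slender one has far fewer.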

What happens in between these two extremes? And is there a threshold - a number of neurons - below which the network fails to learn $f$ correctly, but at which $f$ can still be learned?

**Exercise.** Find the least number of neurons $w_1$ such that a shallow neural network with $w_1$ hidden neurons can still be trained to execute $f$ with a loss of at most $0.001$. Remember that you can adjust the learning rate and the number of epochs to get finer and more resource-efficient training. You can also set `verbosity=1` to avoid long listings of losses, though a verbosity above `1` might help with investigating whether the training losses plateau.

In [None]:
w_1 = None # modify this line with your guess
net = DeepNeuralNet(input_width=8, hidden_layer_profile=[ w_1 ], output_width=8, output_activation=nn.Sigmoid())

training_dataloader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True)
loss_fn = nn.BCELoss()
optimiser = torch.optim.Adam(net.parameters(), lr=5e-3)
least_loss = train(training_dataloader, net, loss_fn, optimiser, epochs=250, verbosity=2)

**Exercise.** It has been shown theoretically that, for certain functions, deep neural networks (that is, networks with more than one hidden layer) can represent the same function as shallow neural nets while using comparatively fewer neurons and trainable weights. Can you find a `w_2`, a minimal number of neurons sufficient to learn the function $f$ with a loss of at most $0.001$, such that the $w_2$ neurons are distributed across multiple hidden layers? The number of layers you end up using is up to you.

In [None]:
net = DeepNeuralNet(input_width=8, hidden_layer_profile=[ None, None, None, ], output_width=8, output_activation=nn.Sigmoid())

training_dataloader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True)
loss_fn = nn.BCELoss()
optimiser = torch.optim.Adam(net.parameters(), lr=5e-3)
least_loss = train(training_dataloader, net, loss_fn, optimiser, epochs=250, verbosity=2)

### Section Takeaway

We have seen that neural networks can be constructed as directed graphs of more elementary building blocks (shallow and deep nets) and other networks (two-tower nets).

We have also introduced the training loop for a neural network that uses much of PyTorch machinery to perform gradient descent.

We have used the above to train networks that learn a Boolean function. We observed that not all networks can learn all functions, and that the number of neurons and their arrangement (or, more precisely, trainable weights and their role within the network) influence the ability of a network to learn to approximate a function. Experimenting, we got an intuitive feel for the notion of *representational power*.

Using everything we have learned, we can now go and train networks that are perhaps more suitable for real-world applications.

## Image Classification with DNNs


Most machine learning workflows involve working with data, creating models, optimising model parameters, and saving the trained models. This section introduces you to a complete ML workflow implemented in PyTorch.

We will use the FashionMNIST dataset to train a neural network that predicts if an input image belongs to one of the following classes: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, or Ankle boot.

### Working with Data

PyTorch has two primitives to work with data: `torch.utils.data.DataLoader` and `torch.utils.data.Dataset`. `Dataset` stores the samples and their corresponding labels, and `DataLoader` wraps an iterable around the `Dataset`.
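
To make the division of labour between the two primitives concrete, here is a minimal sketch of a custom `Dataset` wrapped in a `DataLoader`. The class `ToyDataset` and its contents are hypothetical and are not part of this session's material; the sketch only assumes that a map-style dataset needs `__len__` and `__getitem__`.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical dataset: 10 samples with 3 features each, labelled 0 or 1."""

    def __init__(self):
        self.samples = torch.arange(30, dtype=torch.float32).reshape(10, 3)
        self.labels = torch.arange(10) % 2

    def __len__(self):
        # number of samples stored in the dataset
        return len(self.samples)

    def __getitem__(self, index):
        # a single (sample, label) pair
        return self.samples[index], self.labels[index]


toy_dataset = ToyDataset()
toy_loader = DataLoader(toy_dataset, batch_size=4)

# the loader stacks individual pairs into batches of up to 4
toy_batch_shapes = [(tuple(X.shape), tuple(y.shape)) for X, y in toy_loader]
print(toy_batch_shapes)
```

The dataset only knows how to return one sample at a time; grouping samples into batches is left entirely to the loader.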

In [None]:
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import Compose, ToTensor, Lambda
import matplotlib.pyplot as plt

PyTorch offers domain-specific libraries such as TorchText, TorchVision, and TorchAudio, all of which include datasets. For this tutorial, we will be using a TorchVision dataset.

The `torchvision.datasets` module contains `Dataset` objects for many real-world vision datasets, such as CIFAR and COCO. In this tutorial, we use the `FashionMNIST` dataset. Every TorchVision `Dataset` accepts two arguments, `transform` and `target_transform`, which modify the samples and the labels respectively.

In [None]:
# Download training data from open datasets.
training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=Compose([
      ToTensor(),
      Lambda(lambda x: torch.flatten(x, start_dim=0))
    ]),
)

# Download test data from open datasets.
test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=Compose([
      ToTensor(),
      Lambda(lambda x: torch.flatten(x, start_dim=0))
    ]),
)
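
To see what the `transform` above does to a single sample, the following sketch applies the same flattening to a random tensor. The tensor `fake_image` is a made-up stand-in for the output of `ToTensor()` on a 28x28 greyscale image, so nothing needs to be downloaded.

```python
import torch

# Stand-in for ToTensor() applied to a 28x28 greyscale image:
# one channel, values in [0, 1).
fake_image = torch.rand(1, 28, 28)

# The same function that is wrapped in Lambda(...) in the cell above.
flattened = torch.flatten(fake_image, start_dim=0)

print(fake_image.shape)  # torch.Size([1, 28, 28])
print(flattened.shape)   # torch.Size([784])
```

Flattening turns every image into a vector of 784 values, which is the input width the fully connected networks of this section expect.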

We pass the `Dataset` as an argument to `DataLoader`. This wraps an iterable over our dataset, and supports automatic batching, sampling, shuffling and multiprocess data loading. Here we define a batch size of 64, i.e. each element in the dataloader iterable will return a batch of 64 features and labels.

In [None]:
batch_size = 64

# Create data loaders.
train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

for X, y in test_dataloader:
    print(f"Shape of X [N, 28 * 28]: {X.shape}")  # images were flattened by the transform
    print(f"Shape of y: {y.shape} {y.dtype}")
    break
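
The effect of the batch size on the number of batches can be checked on synthetic data. The sketch below uses a hypothetical dataset of 1000 zero-valued "images" built with `TensorDataset`; the names prefixed with `demo_` are illustrative only.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for a dataset: 1000 flattened "images" with integer labels.
demo_features = torch.zeros(1000, 28 * 28)
demo_labels = torch.zeros(1000, dtype=torch.long)
demo_loader = DataLoader(TensorDataset(demo_features, demo_labels), batch_size=64)

demo_batch_sizes = [len(y) for _, y in demo_loader]

print(len(demo_loader))                    # number of batches: ceil(1000 / 64)
print(demo_batch_sizes[-1])                # the final batch is smaller than 64
print(math.ceil(1000 / 64), 1000 - 15 * 64)
```

By the same arithmetic, the 60,000 FashionMNIST training images with a batch size of 64 give 938 batches, the last of which holds 32 images.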

We can also have a quick peek at the data.

In [None]:
images, labels = next(iter(train_dataloader))
print('Shape of input tensor:', list(images.shape))
ii = torch.reshape(images[0],(28,28))
plt.imshow(ii, cmap='gray')
plt.show()
print('Label: ', int(labels[0]))

### Creating Models
As we did in previous sections when learning Boolean functions, to define a neural network in PyTorch we create a class that inherits from `nn.Module`. We define the layers of the network in the `__init__` function (the constructor) and specify how data will pass through the network in the `forward` function. To accelerate operations in the neural network, we move it to the GPU if one is available.

In [None]:
# Get cpu or gpu device for training.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

net = DeepNeuralNet(input_width=28 * 28, hidden_layer_profile=[512, 512], output_width=10).to(device)
print(net)
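
As a reminder of the `nn.Module` pattern described above, here is a minimal sketch of a network with a single hidden layer. `TinyMLP` is a hypothetical class written for this illustration; it is not the `DeepNeuralNet` used in this session and may differ from it in its details.

```python
import torch
from torch import nn


class TinyMLP(nn.Module):
    """Hypothetical network with one hidden layer; not the session's DeepNeuralNet."""

    def __init__(self, input_width, hidden_width, output_width):
        super().__init__()
        # layers are declared in the constructor
        self.hidden = nn.Linear(input_width, hidden_width)
        self.activation = nn.ReLU()
        self.output = nn.Linear(hidden_width, output_width)

    def forward(self, x):
        # forward describes how a batch flows through the layers
        return self.output(self.activation(self.hidden(x)))


tiny_device = "cuda" if torch.cuda.is_available() else "cpu"
tiny_net = TinyMLP(28 * 28, 32, 10).to(tiny_device)

# a batch of 5 flattened images yields 5 rows of 10 raw scores
tiny_logits = tiny_net(torch.rand(5, 28 * 28, device=tiny_device))
print(tiny_logits.shape)
```

Note that the input batch has to live on the same device as the network, which is why the random batch is created with `device=tiny_device`.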

### Optimising Model Parameters

As illustrated before, to train a model, we need a loss function and an optimiser.

In [None]:
loss_fn = nn.CrossEntropyLoss()
optimiser = torch.optim.SGD(net.parameters(), lr=1e-3)
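
It may help to see what `nn.CrossEntropyLoss` computes. The sketch below compares it with a hand-written version on made-up scores; all tensor values and names are illustrative. It relies on the fact that `nn.CrossEntropyLoss` takes raw, unnormalised scores (logits) together with integer class indices.

```python
import torch
from torch import nn

# Raw scores (logits) for a batch of 2 samples and 3 classes; values are made up.
example_logits = torch.tensor([[2.0, 0.5, -1.0],
                               [0.1, 0.2, 3.0]])
example_targets = torch.tensor([0, 2])  # index of the correct class per sample

ce_loss = nn.CrossEntropyLoss()(example_logits, example_targets)

# The same quantity by hand: mean of -log(softmax(logits)[correct class]).
log_probs = torch.log_softmax(example_logits, dim=1)
manual_loss = -log_probs[torch.arange(2), example_targets].mean()

print(ce_loss.item(), manual_loss.item())
```

The loss is small when the score of the correct class dominates the others, and grows as probability mass moves to the wrong classes.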

The training loop from above will serve us well even in our current tasks. We also check the model’s performance against the test dataset to ensure it is learning -- in order to do so, we re-define the `testing_loop` function we used to learn Boolean functions in the new context of image classification.

In [None]:
def testing_loop(dataloader, net):
    size = len(dataloader.dataset)
    net.eval()
    correct = 0

    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = net(X)
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    
    return correct / size
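
The accuracy computation inside `testing_loop` can be traced on a tiny example. The scores and labels below are made up for illustration.

```python
import torch

# Made-up scores for 4 samples and 3 classes, with their true labels.
acc_scores = torch.tensor([[0.1, 0.7, 0.2],
                           [0.9, 0.05, 0.05],
                           [0.2, 0.3, 0.5],
                           [0.6, 0.3, 0.1]])
acc_labels = torch.tensor([1, 0, 1, 0])

acc_predicted = acc_scores.argmax(1)        # highest-scoring class in each row
acc_hits = acc_predicted == acc_labels      # element-wise comparison with the labels
acc_value = acc_hits.type(torch.float).sum().item() / len(acc_labels)

print(acc_predicted.tolist(), acc_value)    # [1, 0, 2, 0] 0.75
```

Three of the four predictions agree with the labels, hence the accuracy of 0.75.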

The training process is conducted over several iterations (epochs). During each epoch, the model learns parameters to make better predictions. We print the model's accuracy and loss at each epoch; we would like to see the accuracy increase and the loss decrease with every epoch.

In [None]:
epochs = 5
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    training_loop(train_dataloader, net, loss_fn, optimiser)
    # note: this accuracy is measured on the training data (train_dataloader)
    validation_accuracy = testing_loop(train_dataloader, net)
    print(f"Validation Accuracy: {validation_accuracy:.2%}\n")
print("Training Done!")

testing_accuracy = testing_loop(test_dataloader, net)
print(f"\nTest Accuracy: {testing_accuracy:.2%}")

**Exercise.** Play around with different learning setups. Modifying just the learning rate and the number of epochs, how high can you take the validation accuracy (which the loop above measures on the training data)? While doing so, do you also observe similar improvements in the test accuracy? (Remember that you can adjust the verbosity level to avoid reading through long listings.)

In [None]:
train(train_dataloader, net, loss_fn, optimiser, epochs, verbosity=3)

**Exercise.** Play around with the architecture of the neural network that you train. Adding more layers and more neurons, can you take the test performance even higher than in the previous exercise?

In [None]:
# define the architecture of your neural network
best_net = DeepNeuralNet(input_width=28 * 28, hidden_layer_profile=[ None ], output_width=10).to(device)
print(best_net)

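# create an optimiser for the new network; the earlier one only updates `net`
best_optimiser = torch.optim.SGD(best_net.parameters(), lr=1e-3)

# train it
train(train_dataloader, best_net, loss_fn, best_optimiser, epochs, verbosity=3)

# test it
testing_accuracy = testing_loop(test_dataloader, best_net)
print(f"\nTest Accuracy: {testing_accuracy:.2%}")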

### Saving and Loading Models
Quite often you want to save your model, either to be later deployed in practice (on a website or in a mobile device, for example), or to be able to evaluate it later, in a different workflow. A common way to save a model is to serialise the internal state dictionary (containing the model parameters).

In [None]:
torch.save(net.state_dict(), "model.pth")
print("Saved PyTorch Model State to model.pth")

The process for loading a model includes re-creating the model structure and loading the state dictionary into it.

In [None]:
model = DeepNeuralNet(28 * 28, [512, 512], 10)
model.load_state_dict(torch.load("model.pth"))
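
The save-and-load round trip can be checked on a small layer. The sketch below writes the state dictionary to an in-memory buffer rather than to a file, purely to keep the example self-contained; the layer sizes and names are arbitrary.

```python
import io
import torch
from torch import nn

torch.manual_seed(0)
original_layer = nn.Linear(4, 2)

# Serialise the state dictionary into an in-memory buffer instead of a file.
state_buffer = io.BytesIO()
torch.save(original_layer.state_dict(), state_buffer)
state_buffer.seek(0)

# Re-create the same structure, then load the saved parameters into it.
restored_layer = nn.Linear(4, 2)
restored_layer.load_state_dict(torch.load(state_buffer))

# Both layers should now compute exactly the same outputs.
probe = torch.rand(3, 4)
outputs_match = torch.equal(original_layer(probe), restored_layer(probe))
print(outputs_match)
```

If a model was saved on a GPU and is loaded on a machine without one, `torch.load` accepts a `map_location` argument (for example `map_location="cpu"`) to place the tensors on an available device.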

This model can now be used to make predictions.

In [None]:
classes = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
]

model.eval()
x, y = test_data[0][0], test_data[0][1]
with torch.no_grad():
    # add a batch dimension so that pred has shape [1, 10]
    pred = model(x.unsqueeze(0))
    predicted, actual = classes[pred[0].argmax(0)], classes[y]
    print(f'Predicted: "{predicted}", Actual: "{actual}"')
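
The raw outputs of a classifier trained with `nn.CrossEntropyLoss` are unnormalised scores. If class probabilities are wanted, one option is to apply a softmax, as in the sketch below; the scores in `demo_logits` are made up.

```python
import torch

# Made-up scores for one image over the ten classes.
demo_logits = torch.tensor([[1.0, 0.0, 2.0, 0.5, 0.0, -1.0, 3.0, 0.0, 0.2, -0.5]])

demo_probs = torch.softmax(demo_logits, dim=1)   # each row now sums to 1
top_prob, top_class = demo_probs.max(dim=1)      # most probable class per row

print(int(top_class[0]), float(top_prob[0]))
```

Softmax preserves the ordering of the scores, so the most probable class is the same one that `argmax` picks from the raw outputs.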

### MNIST from Scratch

You have now seen the entire pipeline, going from the exploration of training data to the model evaluation.

The final task for today is to use your new knowledge to build a working model that classifies the handwritten digits of the MNIST dataset.

In [None]:
train_data = datasets.MNIST(
    root='data',
    train=True,
    download=True,
    transform=Compose([
      ToTensor(),
      Lambda(lambda x: torch.flatten(x, start_dim=0))
    ]),
)

test_data = datasets.MNIST(
    root='data',
    train=False,
    download=True,
    transform=Compose([
      ToTensor(),
      Lambda(lambda x: torch.flatten(x, start_dim=0))
    ]),
)

In [None]:
# Hint: you can follow and re-use the above code step-by-step.

Once you have a working instance, try varying the learning rate, batch size, and the sizes and numbers of individual layers in your network in order to get the best possible training accuracy.
