<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons Licence" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">COMP5611M - Data Structures in Pytorch</span> by <span xmlns:cc="http://creativecommons.org/ns#" property="cc:attributionName">Marc de Kamps and University of Leeds</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.

## Objectives

In this notebook we explore data structures in PyTorch and highlight some of the commonalities and differences with NumPy arrays. We also touch on *autograd*, a package for automatic differentiation that lets gradient-based
optimisation rules be derived automatically, rather than by the modeller. This implies that novel loss functions and architectures can be implemented easily without any need to adapt backpropagation algorithms.

At the end of the notebook you should be able to
- Define data structures commonly used in neural networks as PyTorch tensors
- Explain the main commonalities and differences between PyTorch tensors and NumPy arrays
- Define mathematical expressions commonly used in neural networks in PyTorch
- Combine the expressions in network classes
- Define a simple training loop

### PyTorch introductory materials

Much of the material presented here is a condensed version of the official PyTorch introduction, which can be found
here: https://pytorch.org/tutorials/beginner/nlp/pytorch_tutorial.html

This is also the authoritative source: changes to PyTorch are likely to find their way into that material.

### Tensors

Take a few minutes to go through this page:
https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html

We will repeat some of the information below and then use tensors to implement a network that solves the XOR problem. XOR is a logical gate. As with AND, learning the values of its truth table can be considered
a data classification problem, but unlike AND it cannot be learnt by a single perceptron.

The truth table is as follows:


| $x_1$ | $x_2$| o |
| ------| -----|---|
|  0    |    0 | 0 |
|  0    |    1 | 1 |
|  1    |    0 | 1 |
|  1    |    1 | 0 |
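The claim that a single perceptron cannot reproduce this table can be probed numerically. The sketch below is a brute-force search over a coarse grid of weights; the grid and the helper names `perceptron` and `realisable` are illustrative choices, not part of the original notebook. It finds weights for AND but none for XOR. A grid search is of course not a proof, only a sanity check.

```python
from itertools import product

PATTERNS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [0, 0, 0, 1]
XOR = [0, 1, 1, 0]

def perceptron(w0, w1, w2, x1, x2):
    """Step-function perceptron with bias weight w0."""
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0

def realisable(targets, grid):
    """Return the first weight triple on the grid that reproduces targets, or None."""
    for w in product(grid, repeat=3):
        if all(perceptron(*w, *x) == t for x, t in zip(PATTERNS, targets)):
            return w
    return None

# illustrative grid: -2.0, -1.75, ..., 2.0
grid = [i / 4 for i in range(-8, 9)]
and_weights = realisable(AND, grid)
xor_weights = realisable(XOR, grid)
print('AND:', and_weights)
print('XOR:', xor_weights)
```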

To create a network that is capable of handling the XOR problem we use two input nodes, two hidden nodes and one output node. Moreover, we need an extra node in the input layer to represent the bias, and likewise in the hidden layer. This creates an architecture of (3, 3, 1): three nodes in the input layer, three in the hidden layer and one in the output layer.

This in turn means that we need a $2 \times 3$ matrix to connect the input layer to the hidden layer (one row for each of the two computed hidden nodes; the third hidden node is clamped to 1 and needs no incoming weights) and a $1 \times 3$ matrix to connect the hidden layer to the output layer.

It is not very difficult to design a network by hand that solves this problem.


<img src="xor.png" width="200" height="200" />

The two decision lines are $x+y = \frac{1}{2}$ and $x+y = \frac{3}{2}$. We construct a hidden layer of three nodes.
One is a perceptron that implements the first decision line, the second implements the second decision line. The
third is clamped to 1, so as to implement a bias for the output layer.

We can then build an *OR* gate on top of the hidden layer ('the input lies below $x+y = \frac{1}{2}$ or
above $x+y=\frac{3}{2}$'); we can implement 'below' by multiplying all weights of the decision line by -1.

## Constructing a Network by Hand

Below, we will construct a network by hand, based on the solution principle sketched above. It is generally not a good idea to create neural networks by hand, but this example shows that the essential calculations can be performed in PyTorch data structures that are not very different from NumPy arrays.

In [1]:
import torch
import numpy as np
torch.manual_seed(1)

# we adopt the convention that node 0 is the threshold
tensor_v = torch.tensor([[0.5, -1., -1.],[-1.5, 1., 1.]])
tensor_w = torch.tensor([[0.5, -1., -1.]])


One of the most important operations in neural networks is matrix-vector multiplication. Let us calculate the hidden layer values for a given input pattern. We will start with the pattern (0,0). Since we clamp node 0 to the value 1 in order to implement a bias, we create (1,0,0). *torch.matmul*, which is also available as the `@` operator used below, provides matrix-vector multiplication.

In [2]:
input = torch.tensor([1.,0.,0.])
hi=tensor_v@input
print(hi)

tensor([ 0.5000, -1.5000])
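The same product can be evaluated for all four input patterns at once by stacking them into a matrix. The following is a minimal sketch; the names `patterns` and `hidden_inputs` are ours and do not appear elsewhere in the notebook.

```python
import torch

tensor_v = torch.tensor([[0.5, -1., -1.], [-1.5, 1., 1.]])

# each row is one input pattern, with node 0 clamped to 1
patterns = torch.tensor([[1., 0., 0.],
                         [1., 0., 1.],
                         [1., 1., 0.],
                         [1., 1., 1.]])

# '@' and torch.matmul are the same operation
single = tensor_v @ patterns[0]
assert torch.equal(single, torch.matmul(tensor_v, patterns[0]))

# transpose so that patterns become columns: (2, 3) @ (3, 4) -> (2, 4)
hidden_inputs = tensor_v @ patterns.T
print(hidden_inputs.shape)
print(hidden_inputs)
```

Each column of `hidden_inputs` holds the two hidden-node inputs for one pattern; the first column reproduces the result printed above.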


On this output, we need to apply the step, or Heaviside, function.


**Exercise**: Research the PyTorch documentation to find out what role the argument *values* plays.

In [3]:
values = torch.tensor([0.5])
h=torch.heaviside(hi, values)
print(h)

tensor([1., 0.])


Now let's investigate whether our hidden layer comes out correctly: we have four input patterns
and therefore four cases. You should verify that we expect the following:

|input pattern|desired hidden layer response|
|-------------|-----------------------------|
| (1, 0, 0)   |         (1,0)               |
| (1, 0, 1)   |         (0,0)               |
| (1, 1, 0)   |         (0,0)               |
| (1, 1, 1)   |         (0,1)               |

We can interpret the hidden layer information as follows: of the two nodes, the first one expresses
'below the line $x+y = \frac{1}{2}$'. The second node expresses 'above the line $x+y = \frac{3}{2}$'.

We will now implement the hidden layer as a function for a given input pattern, and see if it predicts the correct value for each of the input patterns.

In [4]:
def hidden(input, tensor_v):
    hi=tensor_v@input
    values = torch.tensor([0.5])
    h=torch.heaviside(hi, values)
    return h

input=torch.Tensor([1., 0., 0.])
print(hidden(input,tensor_v))
input=torch.Tensor([1., 0., 1.])
print(hidden(input,tensor_v))
input=torch.Tensor([1., 1., 0.])
print(hidden(input,tensor_v))
input=torch.Tensor([1., 1., 1.])
print(hidden(input,tensor_v))


tensor([1., 0.])
tensor([0., 0.])
tensor([0., 0.])
tensor([0., 1.])


Great! We have a hidden layer that already implicitly codes all the information we need. However, we want a single output that represents the output of the XOR gate. We can get what we want by starting from an OR gate, which can be implemented by a single perceptron:
$$
\mathcal{H}(w_0\cdot 1 + w_1 x_1 + w_2 x_2)
$$
You should verify that $w_0 = -\frac{1}{2}, w_1 = 1, w_2 =1$ implements an OR gate. Note that the OR of the two hidden nodes is 1 for the patterns (0,0) and (1,1), which is exactly where XOR is 0. The output node therefore needs the negated gate, which we obtain by multiplying all weights by -1; this is why *tensor_w* was defined as $(\frac{1}{2}, -1, -1)$.
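These weights can also be checked mechanically. The sketch below uses plain Python and a hypothetical helper `step_perceptron`. It evaluates the perceptron on all four binary inputs, once with the OR weights quoted above and once with the same weights multiplied by -1, which are the values stored in `tensor_w`.

```python
def step_perceptron(weights, x1, x2):
    """Perceptron with bias weight weights[0] and a step activation."""
    w0, w1, w2 = weights
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0

pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]

or_weights = (-0.5, 1.0, 1.0)    # weights quoted in the text
neg_weights = (0.5, -1.0, -1.0)  # the values used in tensor_w

or_out = [step_perceptron(or_weights, *p) for p in pairs]
neg_out = [step_perceptron(neg_weights, *p) for p in pairs]
print(or_out)   # [0, 1, 1, 1]
print(neg_out)  # [1, 0, 0, 0]
```

Applied to the two hidden nodes, the sign-flipped gate outputs 1 exactly when neither hidden node fires, which matches the XOR column of the truth table.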


We need to extend the hidden node vector. We have calculated two hidden nodes, but node 0 is clamped to 1 by definition. And observe that we already took this into account when we defined the *w* tensor, which expects three nodes. Simply applying *matmul* to *h* will produce an error.

In [5]:
try:
    o=tensor_w@h
except RuntimeError:
    print('Pytorch picks up on the size mismatch')
    

Pytorch picks up on the size mismatch


Several solutions are conceivable. One is to pre-create the hidden layer as a tensor of shape (3,), filled with ones, and
to update its last two nodes by using appropriate slicing.

In [6]:
h=torch.ones(3)
print(h)
h[1:]=torch.heaviside(hi,values)
print(h)

tensor([1., 1., 1.])
tensor([1., 1., 0.])
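Another conceivable solution, sketched here as an alternative rather than the route this notebook takes, is to prepend the clamped node with `torch.cat` instead of writing into a pre-allocated tensor. The names `x`, `h_two` and `h_full` are ours.

```python
import torch

tensor_v = torch.tensor([[0.5, -1., -1.], [-1.5, 1., 1.]])
values = torch.tensor([0.5])

x = torch.tensor([1., 0., 0.])   # pattern (0, 0) with the bias node in front
h_two = torch.heaviside(tensor_v @ x, values)

# prepend the bias node, clamped to 1, to the two computed hidden nodes
h_full = torch.cat((torch.ones(1), h_two))
print(h_full)  # tensor([1., 1., 0.])
```

This avoids in-place modification, at the cost of allocating a new tensor on every call.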


Now we can evaluate the last layer.

In [7]:
output = torch.heaviside(tensor_w@h,values)
print(output)

tensor([0.])


In [8]:
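# named xor_net rather than nn, so that it does not clash with torch.nn later on
def xor_net(input):
    # hidden layer: node 0 is the bias node, nodes 1 and 2 are computed
    h=torch.ones(3)
    h[1:]=torch.heaviside(tensor_v@input, values)
    # output layer
    return torch.heaviside(tensor_w@h, values)


In [9]:
input1 = torch.tensor([1.,0.,0.])
print(xor_net(input1))
input2 = torch.tensor([1.,0.,1.])
print(xor_net(input2))
input3 = torch.tensor([1.,1.,0.])
print(xor_net(input3))
input4 = torch.tensor([1.,1.,1.])
print(xor_net(input4))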

tensor([0.])
tensor([1.])
tensor([1.])
tensor([0.])
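The four calls above can also be folded into a single batched computation. The following is a sketch under the same weights; the function name `xor_batch` is ours.

```python
import torch

tensor_v = torch.tensor([[0.5, -1., -1.], [-1.5, 1., 1.]])
tensor_w = torch.tensor([[0.5, -1., -1.]])
values = torch.tensor([0.5])

def xor_batch(patterns):
    """Evaluate the hand-built network on a batch of shape (N, 3)."""
    n = patterns.shape[0]
    h = torch.ones(n, 3)   # column 0 is the bias node
    h[:, 1:] = torch.heaviside(patterns @ tensor_v.T, values)
    return torch.heaviside(h @ tensor_w.T, values).squeeze(1)

patterns = torch.tensor([[1., 0., 0.],
                         [1., 0., 1.],
                         [1., 1., 0.],
                         [1., 1., 1.]])
outputs = xor_batch(patterns)
print(outputs)  # tensor([0., 1., 1., 0.])
```

Putting the patterns in rows, with the weight matrices transposed, is the same layout that `nn.Linear` uses for batches of inputs.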


### Tensors and the GPU

We absolutely don't recommend that you try to create networks like this. Setting weights by hand is impractical and you don't really want your networks to consist of a loosely coupled set of tensors. We will improve on these things below. Nonetheless, these calculations are representative of what happens under the hood of PyTorch. And simply using tensors already allows you to involve the GPU, if you have one.

In [10]:
print(torch.cuda.is_available())
if torch.cuda.is_available():
    # report which CUDA hardware PyTorch can see
    print('Current device index:', torch.cuda.current_device())
    print('Device count:', torch.cuda.device_count())
    print('Device name:', torch.cuda.get_device_name(0))

# setting device on GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

torch.rand(10).to(device)     # move a tensor to device
torch.rand(10, device=device) # create a tensor on device

False
Using device: cpu



tensor([0.6387, 0.5247, 0.6826, 0.3051, 0.4635, 0.4550, 0.5725, 0.4980, 0.9371,
        0.6556])
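The matrix-vector product from the hand-built network can run on whichever device is available, provided both operands live on the same device. The following is a minimal sketch; the variable names are ours.

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# move the weight matrix to the chosen device and create the input there
tensor_v = torch.tensor([[0.5, -1., -1.], [-1.5, 1., 1.]]).to(device)
x = torch.tensor([1., 0., 0.], device=device)

hi = tensor_v @ x   # computed on `device`
print(hi.device)

# bring the result back to the CPU before handing it to NumPy
hi_cpu = hi.cpu()
print(hi_cpu.numpy())
```

On a machine without a GPU the code runs unchanged on the CPU, which is the point of selecting `device` once and reusing it.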

## A Sigmoid Version of the XOR Network

### Using Torch to represent a network

Having a network represented by a collection of matrices without any further organisation is undesirable. Torch offers a neural network module and it is good practice to derive from that. The code below is adapted from here:
https://gist.githubusercontent.com/user01/68514db1127eb007f24d28bfd11dd60e/raw/f9d19c595aaf43dbd23ee1c1457eecbf3be59ae1/torch.xor.py

In [11]:
import torch.nn as nn


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(2, 3, True)
        self.fc2 = nn.Linear(3, 1, True)

    def forward(self, x):
        x = torch.sigmoid(self.fc1(x))
        x = self.fc2(x)
        return x

net = Net()

This network derives from nn.Module, the base class for neural networks in PyTorch. If you haven't seen subclassing in Python before, just consider it a bit of extra magic that you have to include, as is the call to the base class initialisation method, which is done through the *super* statement. Once these statements are included, building the network is straightforward enough.

This network is able to process input patterns of the right shape. Although the forward method is not called explicitly, it is used when the network is applied to an input pattern.

**Exercise**: Think of a way to convince yourself (and possibly others) that forward is being called in the statement below.

In [12]:
input = torch.Tensor([1.,0.])
net(input)

tensor([0.8218], grad_fn=<AddBackward0>)

If we look at the documentation for nn.Linear, which is here:
https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear,
we find that Linear applies the following transformation:
$$
\boldsymbol{y} = \boldsymbol{x} \boldsymbol{A}^T + \boldsymbol{b}
$$

The dimensions are specified in the call to Linear:
> <code>torch.nn.Linear(in_features, out_features, bias=True, device=None, dtype=None)</code>

This call specifies the dimensions of both $\boldsymbol{A}$ and $\boldsymbol{b}$. Next, the documentation states something very important:

The shape of the input is $(N, *, H_{in})$, and the output shape is $(N, *, H_{out})$, where $H_{in}$ is equal to
<code>in_features</code> and $H_{out}$ is equal to <code>out_features</code>. 

In our particular example this means that nn.Linear(2,3) is, as expected, something that takes a two-dimensional
pattern and produces a three-dimensional one.

In [13]:
layer = nn.Linear(2,3)
input = torch.Tensor([0.,0.])
output = layer(input)
print(output)

tensor([-0.3286,  0.6938, -0.2992], grad_fn=<AddBackward0>)


But it also works on a *set* of input patterns of arbitrary size!

In [14]:
inputs = torch.Tensor([[0.,0.],[0.,1.],[1.,0.],[1.,1.]])
output = layer(inputs)
print(output)

tensor([[-0.3286,  0.6938, -0.2992],
        [-1.0009,  1.3147,  0.0034],
        [-0.9729,  0.3528, -0.4170],
        [-1.6452,  0.9737, -0.1143]], grad_fn=<AddmmBackward>)


Where did we actually set the numerical values of this linear layer? We didn't! The documentation states that the numerical values are set by sampling from a uniform distribution when we create the layers. This makes perfect sense in the context of neural networks where we usually initialise weights by random values.

Can we get access to the values though? Yes:

In [15]:
print(layer.weight)
print(layer.bias)

Parameter containing:
tensor([[-0.6443, -0.6723],
        [-0.3411,  0.6209],
        [-0.1178,  0.3026]], requires_grad=True)
Parameter containing:
tensor([-0.3286,  0.6938, -0.2992], requires_grad=True)
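Two details are worth checking here. The weight matrix is stored with shape `(out_features, in_features)`, which is why the transformation is written with $\boldsymbol{A}^T$. The PyTorch documentation also states that the initial values are drawn from $\mathcal{U}(-\sqrt{k}, \sqrt{k})$ with $k = 1/\text{in\_features}$. The sketch below checks both on a fresh layer; the names `probe` and `bound` are ours, and the small tolerance is an assumption to guard against floating-point rounding.

```python
import math
import torch
import torch.nn as nn

probe = nn.Linear(2, 3)     # a fresh layer, so its values differ from those above
print(probe.weight.shape)   # torch.Size([3, 2])
print(probe.bias.shape)     # torch.Size([3])

# bound documented for nn.Linear: sqrt(1 / in_features)
bound = math.sqrt(1.0 / probe.in_features)
tol = 1e-6                  # slack for float32 rounding
with torch.no_grad():
    weights_ok = bool((probe.weight.abs() <= bound + tol).all())
    bias_ok = bool((probe.bias.abs() <= bound + tol).all())
print(bound, weights_ok, bias_ok)
```

For `in_features=2` the bound is roughly 0.707, and all the values printed above do fall inside that interval.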


**Exercise**: Verify by hand on an input pattern of your choice that the calculation of output patterns is as you expect.

### Gradient information

Just in passing, we mention that the data structures carry gradient information. In another notebook, we will discuss PyTorch's *autograd* functionality. Here, we just note that objects such as the linear layer,
whose parameters have been created with <code>requires_grad=True</code>, carry gradient information, and that compound objects such as a neural network carry gradient information as well. In other words, the neural network we created is not only capable of calculating its output in response to an input pattern, but can also evaluate the gradient. This explains why, in the code below, optimisers like the <code>SGD</code> object can implement stochastic gradient descent almost for free.
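As a small illustration of what carrying gradient information means, consider the scalar loss $L = (\boldsymbol{w}\cdot\boldsymbol{x} - t)^2$, whose gradient with respect to $\boldsymbol{w}$ is $2(\boldsymbol{w}\cdot\boldsymbol{x} - t)\,\boldsymbol{x}$. The sketch below is independent of the network above, and the variable names are ours. It lets PyTorch compute this gradient and compares it with the analytic expression.

```python
import torch

w = torch.tensor([0.5, -1.0], requires_grad=True)  # parameters to differentiate
x = torch.tensor([1.0, 2.0])                       # fixed input
t = torch.tensor(1.0)                              # target

loss = (w @ x - t) ** 2
print(loss.grad_fn is not None)   # the result remembers how it was computed

loss.backward()                   # fills w.grad with dL/dw

# analytic gradient for comparison: 2 * (w.x - t) * x
with torch.no_grad():
    expected = 2 * (w @ x - t) * x
print(w.grad, expected)
```

Here $\boldsymbol{w}\cdot\boldsymbol{x} - t = -2.5$, so both printed tensors should equal $(-5, -10)$.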

In [16]:

inputs =  [torch.Tensor(input) for input in [
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
] ]
           
targets = [ torch.Tensor(output) for output in [
    [0],
    [1],
    [1],
    [0]
] ]


### The Training Loop

The training loop is very simple. In the notebook on *autograd* we will explain the use of *forward* and *backward* calculations, but the upshot is that <code>loss.backward()</code> calculates the gradient of the loss function, which the optimizer then uses to perform a training step. When the optimizer is constructed, the network parameters are passed in as the quantities to be optimised and a learning rate is set.



In [17]:
import torch.optim as optim

EPOCHS_TO_TRAIN = 5000

criterion = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.05)

print("Training loop:")
for idx in range(0, EPOCHS_TO_TRAIN):
    for input, target in zip(inputs, targets):
        optimizer.zero_grad()   # zero the gradient buffers
        output = net(input)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()    # Does the update
    if idx % 500 == 0:
        # this is the loss on the last pattern of the epoch, not an epoch average
        print("Epoch {: >8} Loss: {}".format(idx, loss.item()))


Training loop:
Epoch        0 Loss: 0.5276403427124023
Epoch      500 Loss: 0.2909414768218994
Epoch     1000 Loss: 0.22681206464767456
Epoch     1500 Loss: 0.0005891557666473091
Epoch     2000 Loss: 4.389457686215792e-10
Epoch     2500 Loss: 6.417089082333405e-12
Epoch     3000 Loss: 6.417089082333405e-12
Epoch     3500 Loss: 6.417089082333405e-12
Epoch     4000 Loss: 6.417089082333405e-12
Epoch     4500 Loss: 6.417089082333405e-12
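To make the role of `optimizer.step()` concrete: for plain SGD without momentum or weight decay, each parameter $p$ is updated as $p \leftarrow p - \text{lr} \cdot \partial L / \partial p$. The sketch below compares one optimizer step with the same update written out by hand on an identical copy of a small layer. The names `layer_a` and `layer_b` are ours, and the layer is a stand-in for the network above.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
lr = 0.05
layer_a = nn.Linear(2, 1)
layer_b = copy.deepcopy(layer_a)   # identical starting weights

x = torch.tensor([0., 1.])
target = torch.tensor([1.])
criterion = nn.MSELoss()

# one step with the optimizer
optimizer = optim.SGD(layer_a.parameters(), lr=lr)
optimizer.zero_grad()
criterion(layer_a(x), target).backward()
optimizer.step()

# the same step written out by hand
criterion(layer_b(x), target).backward()
with torch.no_grad():
    for p in layer_b.parameters():
        p -= lr * p.grad

same = all(torch.allclose(pa, pb)
           for pa, pb in zip(layer_a.parameters(), layer_b.parameters()))
print(same)  # True
```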


In [18]:
print("Final results:")
for inputpattern, target in zip(inputs, targets):
    output = net(inputpattern)
    print(inputpattern.data.numpy())
    print("Input:[{},{}] Target:[{}] Predicted:[{}] Error:[{}]".format(
        int(inputpattern.data.numpy()[0]),
        int(inputpattern.data.numpy()[1]),
        int(target.data.numpy()[0]),
        round(float(output.data.numpy()[0]), 4),
        round(float(abs(target.data.numpy()[0] - output.data.numpy()[0])), 4)
    ))

Final results:
[0. 0.]
Input:[0,0] Target:[0] Predicted:[0.0] Error:[0.0]
[0. 1.]
Input:[0,1] Target:[1] Predicted:[1.0] Error:[0.0]
[1. 0.]
Input:[1,0] Target:[1] Predicted:[1.0] Error:[0.0]
[1. 1.]
Input:[1,1] Target:[0] Predicted:[0.0] Error:[0.0]


### Disclaimer

XOR is not a very realistic data science problem. And as mentioned in the main text, we would not normally use the MSE loss function in classification. The purpose of this exercise is to display the data structures that you may encounter in PyTorch. Because steepest gradient descent is theoretically easy to understand on this problem (you could even implement it in a spreadsheet), it is easy to see what is going on.