In [1]:
%matplotlib widget
from matplotlib.pyplot import *
from matplotlib import animation
import ipywidgets as widgets
import math, random, torch, numpy

# Neural network

A neural network is a flexible parameterised function used for fitting data. Typically it consists of a number of layers, and each layer usually performs a linear transformation on its input followed by a nonlinear function. The input to a layer is an $N$ dimensional vector $\mathbf{x}$, and the output is an $M$ dimensional vector $\mathbf{y}$. The transformation is usually of the form:
$$
\mathbf{y} = S(\mathbf{W}\mathbf{x} + \mathbf{b})
$$
where $\mathbf{W}$ is an $M\times N$ matrix of weights, $\mathbf{b}$ is an $M$ dimensional vector of biases and $S$ applies a nonlinear function independently to each element of its input. A neural network consists of several (or many) such layers. The function $S$ might vary between layers, and often $\mathbf{W}$ has various special forms. This structure, consisting of a stack of layers, is known as a *feedforward neural network*.
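
As a rough illustration of a single layer, the sketch below evaluates $S(\mathbf{W}\mathbf{x} + \mathbf{b})$ for one input vector. The sizes ($N = 3$, $M = 2$), the numbers and the choice of ReLU for $S$ are assumptions picked purely for the example, and $\mathbf{W}$ is stored with shape `[M, N]` so that the product with $\mathbf{x}$ is defined.

```python
import torch

# Hypothetical sizes for illustration: N = 3 inputs, M = 2 outputs
W = torch.tensor([[1.0, 0.0, -1.0],
                  [0.5, 0.5, 0.5]])   # shape [M, N]
b = torch.tensor([0.0, -2.0])         # shape [M]
x = torch.tensor([1.0, 2.0, 3.0])     # shape [N]

z = W @ x + b        # linear part: tensor([-2., 1.])
y = torch.relu(z)    # S applied to each element: tensor([0., 1.])
print(z)
print(y)
```

The negative element of the linear part is clipped to zero by ReLU, while the positive one passes through unchanged.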

This is a bit abstract so we'll work through one from scratch.

First let's make some noisy simulated data, which we will then fit with a neural network:

In [2]:
x = torch.tensor(range(0,4000))/4000
y = torch.tensor([math.sin(i*32*2/math.pi) + i*8 + random.random() for i in x])
figure()
plot(x, y)
show()


Now for the linear transform, the $\mathbf{W}\mathbf{x} + \mathbf{b}$ part above.

First an important thing to note. For this to be practical we need to write a function that operates on multiple data elements in one go. This is known as a batch of data. In PyTorch, by convention, the first dimension of a tensor is the batch dimension, if applicable.

Beyond that, we can do the computation with a combination of adding dimensions, duplicating the data, elementwise multiplication and addition of tensors, and summing along a dimension. The duplication is done with `expand`, which is free in PyTorch and crucially keeps derivatives; in general you need to be careful with gradients when copying data.
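
To make the point about `expand` concrete, here is a small sketch with made-up sizes ($C = 2$, $D = 3$). It checks that the expanded dimension has a stride of 0, meaning the data is viewed repeatedly rather than copied, and that gradients still flow back to the original tensor.

```python
import torch

v = torch.tensor([1.0, 2.0], requires_grad=True)   # shape [C] with C = 2
v2 = v.unsqueeze(1).expand([2, 3])                  # shape [C, D] with D = 3

# A stride of 0 in the expanded dimension means no data was copied
print(v2.stride())

# Gradients flow back through expand: each element of v is used D = 3 times,
# so the gradient of the sum with respect to each element is 3
v2.sum().backward()
print(v.grad)
```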

In [14]:
def linear_transform(x, weights, bias):
    # B is the batch size, C is the number of channels in, D is the number of channels out
    # x.size() == [B,C]
    # weights.size() == [C,D]
    # bias.size() == [D]
    assert x.ndim == 2
    assert weights.ndim == 2
    assert bias.ndim == 1
    [B, C] = x.size()
    [D] = bias.size()
    assert list(weights.size()) == [C,D]
    
    # x1.size() == [B,C,1]
    x1 = x.unsqueeze(2)
    
    # x2.size() == [B, C, D], where x is effectively duplicated D times
    x2 = x1.expand([B, C, D])
    
    #Ignoring the batch dimension, if x = [1, 2] and D is 3 then x2 is
    #
    # [[1 1 1]
    #  [2 2 2]]
    #
    # Note that I've drawn the last dimension as a row. So a 1D tensor is 
    # just a row. A 2D tensor has the row indexed first, then the column within the row
    
    # Duplicate weights and biases across the batch dimension
    weights2 = weights.unsqueeze(0).expand([B, C, D])
    # Ignoring the batch, the weights, w, may be:
    # [[1 2 3]
    #  [4 5 6]]
    
    bias2 = bias.unsqueeze(0).expand([B, D])
    
    # x2 * weights2 will then be:
    # [1  2  3]
    # [8 10 12]
    # We then sum down dimension 1, which since we're ignoring the batch gives:
    # [9 12 15]
    linear = (x2 * weights2).sum(1) + bias2
    return linear
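
As a sanity check on the broadcast-and-sum approach, the sketch below compares a compact version of the same computation against an ordinary matrix product. The sizes are hypothetical and the data is random; with weights stored as `[C, D]`, the equivalent matrix expression is `x @ weights + bias`.

```python
import torch

B, C, D = 4, 2, 3   # hypothetical batch size, channels in, channels out
x = torch.rand([B, C])
weights = torch.rand([C, D])
bias = torch.rand([D])

# Same computation as the expand, multiply and sum version, relying on
# broadcasting to do the duplication: [B, C, 1] * [1, C, D] -> [B, C, D]
by_hand = (x.unsqueeze(2) * weights.unsqueeze(0)).sum(1) + bias

# Matrix product: [B, C] @ [C, D] -> [B, D]
by_matmul = x @ weights + bias

print(torch.allclose(by_hand, by_matmul))
```

In practice the matrix product is the usual way to write this; the longer form is only useful for seeing what happens along each dimension.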
    

Now that I've made a basic linear transform layer, the next step is to use it to create a network. For this we'll need the parameters (in this case, the weights and biases of each layer) and of course a function that applies the layers and a nonlinear transform. I've used two different nonlinear transforms, ReLU and sigmoid:


In [15]:
xval = torch.tensor(numpy.arange(-10,10,0.01))  # numpy was imported without the np alias
figure()
#sigmoids map all values from -inf < z < +inf to 0 < a < 1
plot(xval, torch.sigmoid(xval), 'r', label="Sigmoid")
#relus map all <0 values to 0 and leave the rest unchanged
plot(xval, torch.relu(xval), 'b', label="relu")
axis([-10, 10, 0, 4])
legend()
show()


There are many other types too. 
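
For instance, three other commonly used nonlinearities can be evaluated at a few points like this. The choice of functions and the `negative_slope` value are just examples, not something the rest of this notebook relies on.

```python
import torch

z = torch.tensor([-2.0, 0.0, 2.0])

# tanh squashes values into the range -1 < a < 1
print(torch.tanh(z))

# leaky ReLU scales negative values by a small slope instead of zeroing them
print(torch.nn.functional.leaky_relu(z, negative_slope=0.1))

# softplus is a smooth approximation to ReLU
print(torch.nn.functional.softplus(z))
```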

The most convenient way to build the network is to use a class to store the parameters. I've defined a `forward` function to apply the net to an input and `parameters` to return all the parameters in a list, so I can pass them to an optimizer. I've defined a 4 layer network (the inputs are conventionally called a layer too). The last layer doesn't have a nonlinearity applied, since I don't want to limit the range of the output.

In [16]:
class network:
    def __init__(self):
        #Input one channel, output 40 channels
        self.layer1_weights = torch.rand([1, 40])-.5
        self.layer1_bias = torch.rand([40])-.5
        
        #Input 40 channels, output 60
        self.layer2_weights = torch.rand([40, 60])-.5
        self.layer2_bias = torch.rand([60])-.5
        
        #input 60 channels, output 1
        self.layer3_weights = torch.rand([60, 1])-.5
        self.layer3_bias = torch.rand([1])-.5
        
        
        for i in self.parameters():
            i.requires_grad = True
    def parameters(self):
        return [
            self.layer1_weights, self.layer1_bias,
            self.layer2_weights, self.layer2_bias,
            self.layer3_weights, self.layer3_bias,
        ]
    
    def forward(self, x):
        # Each datapoint is scalar, so x is a 1D tensor containing a batch of B datapoints
        # linear_transform expects a 2D input, so this transforms it into a 2D, Bx1 tensor
        l0 = x.unsqueeze(1) 
        l1 = torch.relu(linear_transform(l0, self.layer1_weights, self.layer1_bias))
        l2 = torch.sigmoid(linear_transform(l1, self.layer2_weights, self.layer2_bias))
        l3 = (linear_transform(l2, self.layer3_weights, self.layer3_bias))
        # We want the output to be scalar for each datapoint, so we drop the last dimension
        # of the 2D Bx1 tensor giving a 1D B dimensional tensor
        return l3.squeeze(1)
    

Now to use the network.

I've implemented optimization using Adam, a variant of Stochastic Gradient Descent (SGD). With SGD, optimization proceeds on only a small subset of the data at each iteration. This has two advantages: one is that you don't need all the data at once, which helps if the dataset is large; the other is that it tends to be much more resistant to getting stuck in local minima. I've cheated and drawn a fresh random subset at each iteration, so the same points can be picked again in later iterations. In reality you would pick random subsets without replacement until all the data is exhausted (that's called an epoch) and then repeat for many epochs.

In [17]:
net = network()

optimizer = torch.optim.Adam(net.parameters(), lr=0.1)

figure()
def update_graph(_):
    global net, x, y, optimizer
    clf()
    
 
    #Gradient descent is slow, so run for many iterations on each button press 
    for i in range(100):
        # pick a random subset of the data
        ind = random.sample(list(range(len(x))), 100)
        xr = x[ind]
        yr = y[ind]
        
        optimizer.zero_grad()
        y_pred = net.forward(xr)
        # Least squares fitting
        loss = ((yr - y_pred)**2).sum() / len(yr)
        loss.backward()
        optimizer.step()
    
    title(f"loss = {loss.item():.4f}")
    plot(x, y)
    plot(x, net.forward(x).detach().numpy(), 'r')
    

    
update_graph(None)
button = widgets.Button(description="Click me a lot")
output = widgets.Output()
display(button, output)
button.on_click(update_graph)
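
For comparison, the epoch-based scheme mentioned before this cell (random subsets without replacement until the data is exhausted) could be sketched by hand roughly as follows. The sizes are tiny and hypothetical, and the optimizer step is left as a comment so that the index handling is the focus.

```python
import torch

n_data, batch_size = 10, 4   # hypothetical sizes, kept tiny for illustration

batches_per_epoch = []
for epoch in range(2):
    # A fresh random ordering of all the indices for each epoch
    perm = torch.randperm(n_data)
    batches = []
    for start in range(0, n_data, batch_size):
        ind = perm[start:start + batch_size]   # the last batch may be smaller
        batches.append(ind)
        # Here you would take xr = x[ind], yr = y[ind] and do one optimizer step
    batches_per_epoch.append(batches)

# Within an epoch, every index appears exactly once
print(sorted(torch.cat(batches_per_epoch[0]).tolist()))
```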


## And now in PyTorch

Much like gradients and optimizers, PyTorch has a lot of tools and features to make this process easier and take out the repetitive code. The first part is dealing with the data. I avoided writing a proper SGD loop above because it's a bit awkward to do by hand. With PyTorch's tools, I'll do it properly.

### Data

There are two important concepts:
* Datasets
* Data Loaders

A *dataset* is simply something you can index to get an item of data. A list could be a dataset. In a case like ours where we're fitting a function, a datum has two elements, the $x$ and $y$ coordinates. Since they are going to be plumbed into PyTorch, you want them to come in as tensors. We could turn `x` and `y` into a dataset like this, where each datum is a tuple containing x and y, each as a tensor:

In [18]:
dataset = [ (xe, ye) for (xe, ye) in zip(x, y)]
print(dataset[0])

(tensor(0.), tensor(0.8140))


It probably won't come as much surprise that pytorch can already assemble data held in tensors into a dataset:

In [19]:
dataset = torch.utils.data.TensorDataset(x, y)
print(dataset[0])

(tensor(0.), tensor(0.8140))


There's a good chance you will need to write your own dataset class at some point.
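
A minimal sketch of what that might look like is below. The class name `SineDataset` and the function it samples are hypothetical; the only parts PyTorch needs for a map-style dataset are `__len__` and `__getitem__`.

```python
import math
import torch

class SineDataset(torch.utils.data.Dataset):
    """Hypothetical dataset that computes (x, y) pairs on demand."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        x = torch.tensor(i / self.n)
        y = torch.sin(2 * math.pi * x)
        return x, y

ds = SineDataset(8)
print(len(ds), ds[2])

# A custom dataset plugs into a DataLoader like any other
xb, yb = next(iter(torch.utils.data.DataLoader(ds, batch_size=4)))
print(xb.shape, yb.shape)
```

Computing items on demand like this is handy when the data is too large to hold in memory, or has to be read from files one item at a time.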

The next concept is a *data loader*. This is responsible for loading data in a useful way: it fetches data from the provided dataset, assembles it into batches, and can optionally randomize the order for a proper implementation of SGD (as well as many other things). You can even perform transformations on your data, e.g. converting it to tensors if the dataset is in the wrong format.

In [20]:
loader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, drop_last=True)
print(next(iter(loader))) #This prints the first item that would be fetched by a for loop

[tensor([0.0027, 0.2025, 0.6432, 0.5550, 0.8813, 0.5742, 0.5347, 0.1065, 0.7265,
        0.1200]), tensor([0.2987, 1.0748, 5.7999, 4.1302, 6.9558, 4.3736, 3.4775, 1.7983, 7.4662,
        1.7759])]


You can see it's loaded 10 random $x$ values (sampled without replacement) and the corresponding $y$ values. Since the individual data elements were tuples of tensors, the result is a list of tensors with a batch dimension added. `drop_last` tells the loader not to give you a partial batch at the end if the amount of data is not divisible by the batch size.
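
The effect of `drop_last` is easiest to see with a dataset whose size is not a multiple of the batch size. The sketch below uses 25 made-up points and a batch size of 10.

```python
import torch

# 25 hypothetical data points, which does not divide evenly into batches of 10
data = torch.utils.data.TensorDataset(torch.arange(25.0), torch.arange(25.0) * 2)

keep = torch.utils.data.DataLoader(data, batch_size=10, drop_last=False)
drop = torch.utils.data.DataLoader(data, batch_size=10, drop_last=True)

# Number of batches per epoch: 3 with the partial batch kept, 2 without
print(len(keep), len(drop))

# Batch sizes when the partial batch is kept: [10, 10, 5]
print([len(xb) for xb, yb in keep])
```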

### Layers and Networks

The useful concept here is a helper class called `Module`. This essentially defines `parameters()` for you: it automatically searches through member variables and collects all the parameters in them so you can pass them to the optimizer. Since it can get all the parameters easily, it also gives you helper functions for loading and saving trained networks.

PyTorch defines many ready-made modules for you, e.g. `Linear` which implements the linear transform you have already seen.

In [21]:
import torch.nn
class network2(torch.nn.Module):
    def __init__(self):
        super(network2, self).__init__() #Python has the best OO syntax...
        
        self.linear1 = torch.nn.Linear(1, 40)
        self.linear2 = torch.nn.Linear(40, 60)
        self.linear3 = torch.nn.Linear(60, 1)
        
    def forward(self, x):
        l1 = torch.relu(self.linear1(x.unsqueeze(1)))
        l2 = torch.sigmoid(self.linear2(l1))
        l3 = self.linear3(l2).squeeze(1)
        return l3
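
As a rough illustration of what `Module` gives you, the sketch below lists the parameters it finds automatically and round-trips them through `state_dict()`. The class `Tiny` is a hypothetical stand-in so the example is self-contained, and an in-memory buffer is used in place of a file on disk.

```python
import io
import torch

class Tiny(torch.nn.Module):   # hypothetical two-layer module for illustration
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(1, 4)
        self.linear2 = torch.nn.Linear(4, 1)

    def forward(self, x):
        return self.linear2(torch.relu(self.linear1(x)))

net_a = Tiny()

# Parameters are discovered from the member variables, with names attached
names = [name for name, p in net_a.named_parameters()]
print(names)

# Save the parameters, then load them into a freshly initialised network
buffer = io.BytesIO()
torch.save(net_a.state_dict(), buffer)
buffer.seek(0)
net_b = Tiny()
net_b.load_state_dict(torch.load(buffer))

# Both networks now compute the same function
xin = torch.rand([5, 1])
print(torch.equal(net_a(xin), net_b(xin)))
```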

In [24]:
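try:
    from tqdm import tqdm #For a nice progress bar
except ImportError:
    tqdm = lambda iterable: iterable #Fall back to a plain loop if tqdm is not installed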
loader = torch.utils.data.DataLoader(dataset, batch_size=100, shuffle=True, drop_last=True)
net2 = network2()
optimizer = torch.optim.Adam(net2.parameters(), lr=.1)
for epoch in tqdm(range(100)):
    for xdata, ydata in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(net2.forward(xdata), ydata)
        loss.backward()
        optimizer.step()

figure()
plot(x, y, 'b')
plot(x, net2.forward(x).detach(), 'r')
show()


The results may be different from before due to differences in the random initialisation and sampling. Try rerunning a few times.
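
If you would rather get the same result on every run, one option is to seed the random number generators before building the network and the loader. The helper `make_layer` below is hypothetical and only demonstrates that a fixed seed reproduces the same initial weights.

```python
import random
import torch

def make_layer(seed):
    # Seed both PyTorch's generator and Python's random module
    torch.manual_seed(seed)
    random.seed(seed)
    return torch.nn.Linear(1, 3)

a = make_layer(0)
b = make_layer(0)

# The same seed gives identical initial weights and biases
print(torch.equal(a.weight, b.weight), torch.equal(a.bias, b.bias))
```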