In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import torch
import torch.nn as nn
import numpy as np
import pandas as pd

Intro to CNNs (a must-read if you need clarification on any of the terminology): https://arxiv.org/pdf/1603.07285.pdf

https://github.com/vdumoulin/conv_arithmetic/blob/master/README.md

In the last decade, we witnessed great strides in the field of computer vision through the utilization and development of deep convolutional neural network architectures. Thus, much attention has been given to the development and manipulation of these convolutional models both in terms of theory and ease of implementation/prototyping (think PyTorch, TensorFlow, etc).

However, time series modeling can use the same machinery, and a 1D convolution can really just be thought of as a simpler version of its 2D counterpart.

This tutorial will focus on providing a top-down approach to building a 1D convolutional model in PyTorch.  We will walk through the arithmetic of the different hyperparameters of such models in order to provide a solid foundation for more sophisticated tuning of these hyperparameters when building complex and powerful neural network architectures for time series analysis.

## PyTorch's `nn.Conv1d` Class

We will focus on a code-first, top-down approach and fill in theoretical details as needed. The hope is that by the end of this you will feel comfortable working with all the levers made available by modern, open-source machine learning libraries. Thus, we begin by noting that PyTorch provides a native class for a 1D convolutional layer:

`torch.nn.Conv1d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros')`

We begin with the required arguments (`in_channels`, `out_channels`, `kernel_size`) of the `torch.nn.Conv1d` class before moving on to the keyword arguments.  The documentation defines them in the following way:
```
in_channels (python:int) – Number of channels in the input "image"   <----(come on, you guys!!)
out_channels (python:int) – Number of channels produced by the convolution
kernel_size (python:int or tuple) – Size of the convolving kernel
```
Notice the slight Freudian slip of the architects assuming that the input would be an "image."  Since we're undyingly grateful for the work they've done, we'll let this slide and instead focus on discussing the implication of being able to specify the number of channels in a "so-called" 1D model. 

## The Univariate Problem: 1 channel in, 1 channel out

Again, we're taking a build-first approach so let's just start with a simple example: 1 channel in, 1 channel out, and a kernel_size of 5.  This is a realistic starting point since, oftentimes, a time series problem is framed such that a univariate (single channel input) dataset is used to predict the future value of itself (single channel output).  (Aside: we will also remove the bias term from the model for now).

In [3]:
kernel_size = 5
conv1d_layer = torch.nn.Conv1d(in_channels = 1, out_channels = 1, 
                               kernel_size =kernel_size, bias = False)
conv1d_layer

Conv1d(1, 1, kernel_size=(5,), stride=(1,), bias=False)

And that's it, we've just created our first 1D convolutional layer that inherently leverages all the power of PyTorch's autograd functionality, GPU utilization, etc.  But... what was actually created? Well let's take a look at the kernel (or weight matrix if you prefer):

In [4]:
conv1d_layer.weight

Parameter containing:
tensor([[[-0.0071,  0.1819,  0.4029, -0.4130,  0.0206]]], requires_grad=True)

Unsurprisingly, PyTorch has randomly generated 5 numbers to act as our initial weight tensor.  However, if you'd like to use your own weight initialization logic, you can override this by accessing the values of the tensor through the `data` attribute.  Let's say we'd previously optimized this model and knew that it should simply take an unweighted moving average of the last 5 data points. We would set all the weights to 1/5 like so:

In [5]:
# nn.Conv1d exposes `weight` directly (there is no `.conv` sub-module)
conv1d_layer.weight.data = torch.ones(1, 1, kernel_size) / kernel_size
conv1d_layer.weight

Parameter containing:
tensor([[[0.2000, 0.2000, 0.2000, 0.2000, 0.2000]]], requires_grad=True)

Of course, you can use similar logic to apply any number of weight initialization protocols.
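As a rough sketch of what "similar logic" could look like, the block below sets the weights in two other ways: with the built-in `torch.nn.init.constant_` helper, and with an in-place `copy_` inside `torch.no_grad()`. The layer names (`avg_layer`, `ramp_layer`) and the linearly increasing "ramp" weights are illustrative assumptions, not something used elsewhere in this tutorial.

```python
import torch
import torch.nn as nn

kernel_size = 5

# Option 1: fill every weight with the same constant (an unweighted moving average).
avg_layer = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=kernel_size, bias=False)
nn.init.constant_(avg_layer.weight, 1.0 / kernel_size)

# Option 2: copy arbitrary values in place. Wrapping the copy in no_grad()
# keeps the assignment itself out of the autograd graph.
# The ramp values are hypothetical and only serve as an example.
ramp_layer = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=kernel_size, bias=False)
ramp = torch.linspace(0.1, 0.5, kernel_size).view(1, 1, kernel_size)
with torch.no_grad():
    ramp_layer.weight.copy_(ramp)

print(avg_layer.weight)
print(ramp_layer.weight)
```

Both variants leave `requires_grad=True` on the parameter, so the layer can still be trained afterwards.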

OK, so we have a model (which takes an unweighted average over the 5 most recent data points). Let's create a fake stream of time series data and pass it in to see what happens (of course, reshaping it to the expected dimensions of `[batch_size, in_channels, seq_length]` and making sure that the type is `float32`).

In [20]:
data = torch.tensor([4, 9, 3, 6, 4, 7, 9, 2, 6, 5, 6, 8, 9, 1, 4]).float()
# reformat data [batch, channels, seq_len]
X = data.view(1,1,-1)
print('input shape :',X.shape)

# pass into our convolutional layer
output = conv1d_layer(X)
print('output shape:',output.shape)
output

input shape : torch.Size([1, 1, 15])
output shape: torch.Size([1, 1, 11])


tensor([[[5.2000, 5.8000, 5.8000, 5.6000, 5.6000, 5.8000, 5.6000, 5.4000,
          6.8000, 5.8000, 5.6000]]], grad_fn=<SqueezeBackward1>)

Ahh, the simplicity of PyTorch.  All we needed to do was pass our input data as an argument to the model we created, and it generates the output series based on the stored weights (or the optimized weights, as the case may be).  But you may notice that there are only 11 values in the output series whereas our original dataset had 15.  Recall that we are taking a moving average of 5 data points, which means that, in order to make an inference at a particular point in time, we need to know the previous 4 values.  Thus, our first output can only come once we have "seen" 5 data points. The figure provides a visual of how a 1D convolutional kernel slides along the time axis of our input series to provide a single output value at every timestep after skipping the first 4.  Again, PyTorch handles all of this logic internally for us, but understanding the expected output dimensions is still very important for model scoring, prototyping, etc.
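To make the sliding-window arithmetic explicit, here is a small sketch that reproduces the same 11 values without the convolutional layer, using `Tensor.unfold` to extract every length-5 window. The variable names (`windows`, `manual`) are just for this illustration.

```python
import torch

data = torch.tensor([4, 9, 3, 6, 4, 7, 9, 2, 6, 5, 6, 8, 9, 1, 4]).float()
kernel_size = 5
weights = torch.ones(kernel_size) / kernel_size

# unfold(dimension, size, step) returns every window of length 5 along the
# time axis: 15 - 5 + 1 = 11 windows, so the shape is (11, 5).
windows = data.unfold(0, kernel_size, 1)

# Each output is the dot product of one window with the kernel weights.
manual = windows @ weights

# Compare against a Conv1d layer carrying the same weights.
layer = torch.nn.Conv1d(in_channels=1, out_channels=1, kernel_size=kernel_size, bias=False)
with torch.no_grad():
    layer.weight.copy_(weights.view(1, 1, -1))
conv_out = layer(data.view(1, 1, -1)).flatten()

print(windows.shape)                      # torch.Size([11, 5])
print(torch.allclose(manual, conv_out))   # True
```

The first window is `[4, 9, 3, 6, 4]`, whose mean is 5.2, which matches the first value of the layer's output.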

##### Picture of sliding conv on time axis

Figure placeholder: 11 small panels, each shaded to show where the kernel is focused, with one panel zoomed in to show the math. Layout from top to bottom: input, kernel, output.

Note the directionality of the slide along the time axis.

## The Multivariate Problem: `n` input channels, `m` output channels





For example, let's say that we were interested in predicting the future value of a particular stock and only wanted to use past values of that stock to do so.  That would mean that both `in_channels` and `out_channels` would be set to 1.  However, if we wanted to use the stock prices of the 5 most similar companies to predict the value of a single stock, then `out_channels` would stay at 1 while `in_channels` would grow to match the number of input series.

Ok enough talk, let's start building...

A natural question is why the input of a 1D model would ever have more than one channel.

"The bread and butter of neural networks is affine transformations: a vector
is received as input and is multiplied with a matrix to produce an output (to
which a bias vector is usually added before passing the result through a nonlinearity)." https://arxiv.org/pdf/1603.07285.pdf
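The quoted idea carries over directly to the 1D case: with stride 1, no padding, and no bias, a convolution over a length-15 series is the same as multiplying the series by a banded matrix whose rows each hold a shifted copy of the kernel. The sketch below builds that matrix by hand (the name `W` and the random inputs are assumptions for illustration) and checks it against `torch.nn.functional.conv1d`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, kernel_size = 15, 5
out_len = seq_len - kernel_size + 1   # 11 valid positions

x = torch.randn(seq_len)       # one univariate series
w = torch.randn(kernel_size)   # one kernel

# Row t of W contains the kernel in columns t .. t + kernel_size - 1 and
# zeros elsewhere, so (W @ x)[t] is the kernel applied to the window at t.
W = torch.zeros(out_len, seq_len)
for t in range(out_len):
    W[t, t:t + kernel_size] = w

affine_out = W @ x
conv_out = F.conv1d(x.view(1, 1, -1), w.view(1, 1, -1)).flatten()

print(torch.allclose(affine_out, conv_out, atol=1e-5))   # True
```

The convolutional layer never materializes `W`; it reuses the same 5 weights at every position, which is where the parameter savings come from.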

## Basic 1d Conv

Look at what the basic output looks like

1 channel in and 1 channel out, and a kernel length of 5

no extra striding (stride = 1), no extra padding (padding = 0), no dilation (dilation = 1)

Note that we are only required to give `in_channels, out_channels, kernel_size`. The others are left in place with their defaults (except bias, which we're turning off for now).

In [3]:
model = torch.nn.Conv1d(in_channels = 1, out_channels = 1, kernel_size =5,
                        stride=1, padding=0, dilation=1, groups=1, bias=False, padding_mode='zeros')

print(model.weight)

Parameter containing:
tensor([[[ 0.2755,  0.1295, -0.0642,  0.2511,  0.3457]]], requires_grad=True)


Let's create some simple data: 1 channel with a sequence length of 15 (obviously not very interesting data here).

In [7]:
data = torch.ones((1,15))
data

tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

Let's grab the first channel and add 2 dummy dimensions to it (batch and channel), since that's what's expected.

In [5]:
inp = data[0].view(1,1,-1)
# inp[0,0,-1] = 4
# inp[0,0,4] = 4
inp

tensor([[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]])

Pass it into the model and get 11 numbers out, which makes sense because the first four inputs don't have enough past history to make a prediction (kernel size = 5), so it's only at the 5th element that we get our first prediction.

In [6]:
out = model(inp)
print(out.shape)
out

torch.Size([1, 1, 11])


tensor([[[0.9376, 0.9376, 0.9376, 0.9376, 0.9376, 0.9376, 0.9376, 0.9376,
          0.9376, 0.9376, 0.9376]]], grad_fn=<SqueezeBackward1>)

As for the value, it's unsurprisingly constant, but it's also just the sum of all the elements in the kernel since our data is simply ones ($\sum_{i=0}^{k-1} w_i$, where $k$ is the kernel length and $w_i$ refers to each weight element in the kernel).

One can access the kernel weights with the `weight` attribute of the layer in PyTorch.

In [61]:
model.weight.sum()

tensor(0.9376, grad_fn=<SumBackward0>)

### Kernel length does not evenly divide the sequence length

What happens if we change the kernel length so that it does not evenly divide the sequence length (k = 4 and seq_len = 15)?

Not a problem: since our stride is 1 and we don't have any padding, it just follows the same pattern as above, where the first k-1 points don't receive predictions.

In [11]:
model = torch.nn.Conv1d(in_channels = 1, out_channels = 1, kernel_size =4,
                        stride=1, padding=0, dilation=1, groups=1, bias=False, padding_mode='zeros')

print(model.weight)

Parameter containing:
tensor([[[-0.4973, -0.2145,  0.1587, -0.0901]]], requires_grad=True)


Again, since the kernel is overlaid on top of the sequence, the first 3 values in the sequence don't have corresponding outputs from the model (since each output requires the 3 preceding values in addition to the current one).

In [12]:
out = model(inp)
print(out.shape)
out

torch.Size([1, 1, 12])


tensor([[[-0.6433, -0.6433, -0.6433, -0.6433, -0.6433, -0.6433, -0.6433,
          -0.6433, -0.6433, -0.6433, -0.6433, -0.6433]]],
       grad_fn=<SqueezeBackward1>)
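The 11 and 12 seen so far both follow from the output-length formula given in the PyTorch `Conv1d` documentation, $L_{out} = \lfloor (L_{in} + 2 \cdot padding - dilation \cdot (kernel\_size - 1) - 1) / stride + 1 \rfloor$. The helper below is a small sketch of that formula (the function name `conv1d_output_length` is made up for this example) checked against actual layers.

```python
import torch

def conv1d_output_length(seq_len, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of nn.Conv1d, following the formula in the PyTorch docs."""
    return (seq_len + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

seq_len = 15
for k in (4, 5):
    layer = torch.nn.Conv1d(in_channels=1, out_channels=1, kernel_size=k, bias=False)
    actual = layer(torch.ones(1, 1, seq_len)).shape[-1]
    print(k, conv1d_output_length(seq_len, k), actual)
# prints:
# 4 12 12
# 5 11 11
```

With stride 1, no padding, and no dilation, the formula reduces to `seq_len - kernel_size + 1`, which is exactly the "first k-1 points don't receive predictions" pattern.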

## Multiple inputs

So what happens when we change the number of input channels but still expect 1 channel in the output? We see that the weight tensor now has 2 kernels of length 5 (so the shape is (1,2,5)).

In [14]:
data = torch.ones((2,15))
data[1] += 1
data

tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.]])

In [15]:
model = torch.nn.Conv1d(in_channels = 2, out_channels = 1, kernel_size =5,
                        stride=1, padding=0, dilation=1, groups=1, bias=False, padding_mode='zeros')

print(model.weight)

Parameter containing:
tensor([[[ 5.5815e-02,  6.2070e-02,  1.1655e-01, -1.6826e-01, -2.7581e-01],
         [ 2.0035e-01, -1.8993e-04,  1.6400e-01,  4.2376e-02,  1.4429e-01]]],
       requires_grad=True)
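More generally, with the default `groups=1` the weight tensor has shape `(out_channels, in_channels, kernel_size)`: one kernel per (output channel, input channel) pair. The quick sketch below prints the shape for a few channel combinations; the particular combinations are arbitrary choices for illustration.

```python
import torch

shapes = {}
for n_in, n_out in [(1, 1), (2, 1), (5, 1), (2, 3)]:
    layer = torch.nn.Conv1d(in_channels=n_in, out_channels=n_out,
                            kernel_size=5, bias=False)
    shapes[(n_in, n_out)] = tuple(layer.weight.shape)
    print(f"in={n_in}, out={n_out} -> weight shape {shapes[(n_in, n_out)]}")
# in=1, out=1 -> weight shape (1, 1, 5)
# in=2, out=1 -> weight shape (1, 2, 5)
# in=5, out=1 -> weight shape (1, 5, 5)
# in=2, out=3 -> weight shape (3, 2, 5)
```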


Add 1 dummy (batch) dimension to the tensor since that's what's expected by torch (shape = (1,2,15))

In [17]:
inp = data.view(1,2,15)
inp

tensor([[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         [2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.]]])

Pass it into the model

In [18]:
model(inp)

tensor([[[0.8920, 0.8920, 0.8920, 0.8920, 0.8920, 0.8920, 0.8920, 0.8920,
          0.8920, 0.8920, 0.8920]]], grad_fn=<SqueezeBackward1>)

The results are constant again (as expected) and are simply the sum of kernel1 convolving the first series (of 1s) and kernel2 convolving the second series (of 2s).  Again, the first convolution is just the sum of the weight elements in kernel1 ($\sum_{i=0}^{k-1} 1 \cdot w_i$), whereas the second is twice the sum of the weight elements in kernel2 ($\sum_{i=0}^{k-1} 2 \cdot w_i$).

Here $k$ is the kernel length and $w_i$ refers to each weight in the respective kernel. The factors 1 and 2 appear because the first series is all ones [1,1,1,...,1] and the second is all twos [2,2,2,...,2].

In [27]:
kernel1 = model.weight[0,0]
kernel2 = model.weight[0,1]

kernel1.sum() * 1 + kernel2.sum() * 2

tensor(0.8920, grad_fn=<AddBackward0>)
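The same decomposition holds for non-constant data: a multi-channel convolution with a single output channel is the sum of one single-channel convolution per input channel. The sketch below checks this with random inputs, using `torch.nn.functional.conv1d` for the per-channel pieces (the random data and variable names are assumptions for illustration).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 2, 15)   # [batch, channels, seq_len]
layer = torch.nn.Conv1d(in_channels=2, out_channels=1, kernel_size=5, bias=False)

# Full two-channel convolution.
full = layer(x)

# Convolve each input channel with its own kernel, then add the results.
per_channel = [
    F.conv1d(x[:, c:c + 1, :], layer.weight[:, c:c + 1, :])
    for c in range(2)
]
summed = per_channel[0] + per_channel[1]

print(full.shape)                                # torch.Size([1, 1, 11])
print(torch.allclose(full, summed, atol=1e-5))   # True
```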

In [19]:
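# assumption: `a` is not defined in any surviving cell; the output below
# matches the (2, 15) `data` tensor flattened into a single channel
a = data
a.view(1,1,-1)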

tensor([[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 2.,
          2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.]]])

Stride is how many time steps the kernel advances before making a new calculation (if you set it to 0, the kernel would just stay in one spot, so that is not a usable setting).
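As a small illustration of stride on its own (dilation left at its default of 1), the sketch below runs a 5-point moving average over the series 0, 1, ..., 14 with strides of 1, 2, and 3. The input series and the `outputs` dictionary are assumptions made for this example.

```python
import torch

x = torch.arange(15, dtype=torch.float32).view(1, 1, -1)   # 0, 1, ..., 14

outputs = {}
for stride in (1, 2, 3):
    layer = torch.nn.Conv1d(in_channels=1, out_channels=1, kernel_size=5,
                            stride=stride, bias=False)
    with torch.no_grad():
        layer.weight.fill_(1.0 / 5)   # unweighted moving average
    outputs[stride] = layer(x).flatten().detach()
    print(stride, outputs[stride].shape[0], outputs[stride].tolist())
```

With stride 1 there are 11 outputs, with stride 2 there are 6, and with stride 3 there are 4: the kernel still covers 5 points each time, it simply skips ahead by `stride` positions between calculations.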

In [46]:
model = torch.nn.Conv1d(in_channels = 1, out_channels = 1, kernel_size =5,
                stride=1, padding=0, dilation=2, groups=1, 
                bias=False, padding_mode='zeros')

print(model.weight)
model(a.view(1,1,-1))

Parameter containing:
tensor([[[-0.4267,  0.4077, -0.1108,  0.2961,  0.1737]]], requires_grad=True)


tensor([[[0.8097, 0.8097, 0.8097, 0.8097, 0.8097, 0.6989, 0.6989, 0.6989,
          0.6989, 0.6989]]], grad_fn=<SqueezeBackward1>)

In [39]:
model.weight[0,0,1:].sum()

tensor(0.4398, grad_fn=<SumBackward0>)

In [33]:
# the index was left empty in the original cell; summing all weights is an assumption
model.weight.sum() * 2


tensor(-0.0668, grad_fn=<MulBackward0>)

In [47]:
torch.tensor([ 0.2745, -0.1809,  0.0806, -0.3044, -0.2521]).sum()

tensor(-0.3823)

In [48]:
torch.tensor([ 0.2901,  0.1385, -0.2273, -0.1210,  0.2087]).sum()

tensor(0.2890)

In [50]:
-0.3823 + 0.2890*2

0.19569999999999999

In [54]:
torch.tensor([0.0807,  0.1279, -0.1774, -0.0982, -0.0345]).sum() + 2 * torch.tensor([-0.0052,  0.0016,  0.1569,  0.2386,  0.1325]).sum()

tensor(0.9473)