<h1> Convolutional Arithmetic for Multivariate, Multi-Output Time Series in PyTorch</h1>
<h2> Parameterizing the Causal Autoencoder Problem Statement</h2>

Explain autoencoder use

figure 1: consider a 1d series. Show that the first value of the target isn't in the future (the target needs to be shifted back to be causal) and that, because of the shift, the last value of the input doesn't have a future value to predict. Draw a line to show where to lop off both series and note that it's "1" value at the moment.

In a convolution, you need to pad only the front of the input (which we'll show you how to do) to make sure you are causal. The amount of padding depends on the receptive field of the layer, which in turn depends on the dilation and kernel size: `padding = (kernel_size - 1) * dilation`, so a single layer sees `padding + 1` time steps.
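To make the padding arithmetic concrete, here is a minimal sketch in plain Python. The helper names `causal_padding` and `receptive_field` are mine (not part of pytorch), and the stacked-layer formula assumes every layer uses stride 1.

```python
def causal_padding(kernel_size: int, dilation: int = 1) -> int:
    """Number of zeros to prepend so output t only sees inputs at or before t."""
    return (kernel_size - 1) * dilation


def receptive_field(layers) -> int:
    """Receptive field of stacked stride-1 layers, given (kernel_size, dilation) pairs."""
    return 1 + sum(causal_padding(k, d) for k, d in layers)


print(causal_padding(3, 1))                        # 2
print(causal_padding(5, 3))                        # 12
print(receptive_field([(2, 1), (2, 2), (2, 4)]))   # 8
```

The first two calls use kernel/dilation pairs that show up later in this article, so you can compare them against the number of leading zeros in the padded inputs.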

figure 2: show some greyed out zeros on the image above and preds are now added (maybe a single kernel grabbing inputs and pointing to pred it creates)

Note that because of this, you might want to add a little priming during training before you start scoring the results: cut off the predictions that were influenced by padding, if you so choose. This is easy to do because the same number of values (the "padding" below) gets cut from both the targets and the preds.
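A rough sketch of what that priming cut could look like. The function name `drop_primed_steps` and the variable `n_prime` are made up for illustration, the tensors are random placeholders, and it assumes stride 1 so that prediction `t` lines up with input `t`.

```python
import torch


def drop_primed_steps(preds, targets, n_prime):
    """Discard the first n_prime time steps (dim 2) from both tensors."""
    return preds[:, :, n_prime:], targets[:, :, n_prime:]


kernel_size, dilation = 3, 1
# predictions whose window touched at least one padded zero
n_prime = (kernel_size - 1) * dilation

preds = torch.randn(4, 2, 10)     # (batch, channels, time), placeholder values
targets = torch.randn(4, 2, 10)
scored_preds, scored_targets = drop_primed_steps(preds, targets, n_prime)
loss = torch.nn.functional.mse_loss(scored_preds, scored_targets)
print(scored_preds.shape)  # torch.Size([4, 2, 8])
```

Because the slice is on dim 2, the same helper also works on the 4 dimensional `(batch, channels, time, out_timesteps)` tensors built later on.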

Now remember as we do this that it can easily be extended to multivariate (multi-channel) data, and of course pytorch can handle batches as well. So as we work through one example, just realize that the blocks could extend into the page like so (I just don't feel like drawing it). In theory the number of channels in is the same as the number of channels out. Technically you could want, say, only 1 out channel for 5 in channels, but of course for an autoencoder that doesn't work the other way around (although if you did have a target that was different from the input, the logic in this article should help you think through the indexing of that as well).

figure 3: show that the channels extend into board (maybe add another one)

we may want to guess farther into future (say next month if these are values taken weekly), so let's see what that does to the image below

figure 4: modify fig 2 show guessing 5? into the future (again with the lines and showing how much to lop off begin/end of respective series), greyed out bits that aren't used, note that inputs no longer have targets at the end because they got shifted, note that predictions don't need to be modified (as you cut the input len, predictions follow)
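Until the figure exists, here is a small plain-Python sketch of the lopping described above. The function name `align_for_horizon` and the horizon of 4 (a month of weekly values) are just placeholders for illustration.

```python
def align_for_horizon(series, horizon):
    """Pair each input value with the value `horizon` steps later (horizon >= 1)."""
    inputs = series[:-horizon]   # the last `horizon` inputs have no target
    targets = series[horizon:]   # the first `horizon` values are never targets
    return inputs, targets


weekly = [0, 10, 20, 30, 40, 50, 60, 70]
inputs, targets = align_for_horizon(weekly, horizon=4)
print(inputs)   # [0, 10, 20, 30]
print(targets)  # [40, 50, 60, 70]
```

Both lists come out the same length, which is the whole point: whatever you cut from the end of the inputs, you cut from the front of the targets.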

or you may want to predict each week for the next 4 weeks

figure 5: same as fig 2 but now outputs have been stacked

or you may want to combine these (either to make it harder or because you have an actual use for the information)

figure 6: both fig 4 and 5

Finally, a limited use case I imagine, but just in case it comes up: what if you stride your kernel? This is basically just subsampling the outputs, which means that with no modification of your input series your preds will have fewer values, so you also need to remove the corresponding values from the targets array.
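The counting behind that claim can be sketched in plain Python. `conv1d_out_len` follows the output-length formula from the `torch.nn.Conv1d` docs, except that `padding` is counted once because we only pad the front; both helper names are mine.

```python
def conv1d_out_len(length, kernel_size, stride=1, dilation=1, padding=0):
    """Output length of a 1-D convolution, with `padding` counted once (front only)."""
    return (length + padding - dilation * (kernel_size - 1) - 1) // stride + 1


def n_strided_targets(n_targets, stride):
    """Number of targets kept by slicing with [::stride]."""
    return len(range(0, n_targets, stride))


seq_len, k, d = 11, 3, 1
pad = (k - 1) * d
for stride in (1, 2, 3):
    n_preds = conv1d_out_len(seq_len, k, stride, d, padding=pad)
    print(stride, n_preds, n_strided_targets(seq_len, stride))
# 1 11 11
# 2 6 6
# 3 4 4
```

With causal padding the kernel size and dilation cancel out of the count, so slicing the targets with `[::stride]` keeps exactly as many values as the strided convolution produces.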

figure 7: show stride 2 and 3 on both of the above (or just one example and say trust me it works the same for figures 2,4,5,6 equally)

OK, so using this reshaping logic, you should have the framework you need to reshape and index properly to create a causal autoencoder. Now I'm going to show you how this works in pytorch. I'm going to be laborious about the indexing because it's really important to know how to access things when you're training, rebuilding, adding to, or evaluating a model.

but first here are the answers (note that I want to leave padding for the individual layers)

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import torch
import torch.nn as nn
import numpy as np
import pandas as pd

Let's make an autoencoder class out of `Conv1d`. Typically this means you use the past to predict the next value, or the next several values, in a time series (or in several series if your data is multivariate). Typically autoencoders are used to pretrain several layers of another project, but sometimes they are useful just as is (think stock market forecasting, supply/demand predictions, etc.). Regardless of your application, you may not want to stop 

However, deep learning aside, keeping the indexing logic straight is something I never want to do again after doing it once, so I'm writing this to keep track of it and develop the logic piecemeal. In the end we'll have something that was built from scratch, so we know what each piece does, and when you (or I) go back and want to change bits of it (say, add layers), no one ever has to think through this logic again.

Unfortunately we know that most of the deep learning field has been driven by computer vision related work so we have to modify a bit but most of the tools in pytorch are satisfactory, we just need to wrap them

We know that to make it causal, we pad the front bit of the input before sending into the forward pass (see wavenet implementation https://github.com/NVIDIA/nv-wavenet/blob/master/pytorch/wavenet.py#L23)

In [3]:
class Causal_1dConv(torch.nn.Module):
    """
    A convolution with the option to be causal and use xavier initialization
    """
    def __init__(self, in_channels, out_channels, kernel_size=1, stride=1,
                 dilation=1, bias=True, w_init_gain='linear', is_causal=True):
        super(Causal_1dConv, self).__init__()
        self.is_causal = is_causal
        self.kernel_size = kernel_size
        self.dilation = dilation

        self.conv = torch.nn.Conv1d(in_channels, out_channels,
                                    kernel_size=kernel_size, stride=stride,
                                    dilation=dilation, bias=bias)

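        # Optional Xavier initialization (left disabled here). Note the in-place
        # `xavier_uniform_`, which replaces the deprecated `xavier_uniform`.
#         torch.nn.init.xavier_uniform_(
#             self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain))

    def forward(self, signal):
        if self.is_causal:
            # left-pad the time axis only, so output t never sees inputs after t
            padding = (int((self.kernel_size - 1) * (self.dilation)), 0)
            signal = torch.nn.functional.pad(signal, padding)
        return self.conv(signal)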

As a reminder, the `padding` argument of `torch.nn.Conv1d` pads both sides, which doesn't make sense for us, so we still use the class but add the padding onto the front of the input ourselves (it turns out the function `torch.nn.functional.pad` lets you specify how much to put on each end with a `tuple` argument).

In [5]:
signal = torch.arange(10)+5
print('Original:',signal)
padding = (2,0)
signal = torch.nn.functional.pad(signal, padding)
print('Padded:  ',signal)

Original: tensor([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
Padded:   tensor([ 0,  0,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
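One way to convince yourself that front-only padding really is causal is to change the "future" of a signal and check that earlier outputs don't move. This is just a sketch with a plain `torch.nn.Conv1d` and random weights; the helper `causal_forward` and the cut point at time step 8 are arbitrary choices of mine.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
kernel_size, dilation = 3, 2
conv = torch.nn.Conv1d(2, 4, kernel_size, dilation=dilation, bias=False)


def causal_forward(x):
    # left-pad the time axis only, then convolve
    return conv(F.pad(x, ((kernel_size - 1) * dilation, 0)))


x = torch.randn(1, 2, 12)
x_changed = x.clone()
x_changed[:, :, 8:] += 1.0   # alter only time steps 8 and later

y, y_changed = causal_forward(x), causal_forward(x_changed)
past_unchanged = torch.allclose(y[:, :, :8], y_changed[:, :, :8])
print(y.shape)          # torch.Size([1, 4, 12])
print(past_unchanged)   # True for a causal layer
```

If you padded both sides instead (for example with `Conv1d(..., padding=2)`), outputs before step 8 would see the altered values and this check would be expected to fail.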


We also know that the kernel will be of the shape `out_channels, in_channels, kernel_size` because what it really does is create multiple (`out_channels`) kernels of size `in_channels, kernel_size` and so your output is of the shape `batch_size, out_channels, seq_len`

In [73]:
kernel_size = 3
dilation = 1
stride = 1
out_channels = 4
in_channels = 2

conv1d_layer = Causal_1dConv(in_channels = in_channels, 
                             out_channels = out_channels,
                             kernel_size = kernel_size, 
                             bias = False,
                             dilation = dilation,
                            stride = stride)

# make weights be an averaging function or simple sum
conv1d_layer.conv.weight.data = torch.ones(out_channels,in_channels,kernel_size)
conv1d_layer.conv.weight

Parameter containing:
tensor([[[1., 1., 1.],
         [1., 1., 1.]],

        [[1., 1., 1.],
         [1., 1., 1.]],

        [[1., 1., 1.],
         [1., 1., 1.]],

        [[1., 1., 1.],
         [1., 1., 1.]]], requires_grad=True)

Now let's change the kernel of the second output channel so that its output is twice as great as the others (remember our kernel is all 1's right now, so it basically just sums up everything within it). Note that these are definitely not what the kernels of any trained model would look like, but their simplicity lets our human brains quickly figure out what the output should be and thus get a sense for what the model is doing under the hood. For example, a kernel of 1's simply means that the 6 numbers it touches (2 channels times 3 kernel positions) will be added together. This way we can make sure that the results are, say, causal.

In [74]:
conv1d_layer.conv.weight.data[1] = 2
conv1d_layer.conv.weight

Parameter containing:
tensor([[[1., 1., 1.],
         [1., 1., 1.]],

        [[2., 2., 2.],
         [2., 2., 2.]],

        [[1., 1., 1.],
         [1., 1., 1.]],

        [[1., 1., 1.],
         [1., 1., 1.]]], requires_grad=True)

To see the effect on that, we have to create some fake data. Let's create two channels that basically just count up from 0 by 10s. The first channel will do that but the second will just be offset by 5.  Finally, to show that pytorch can handle (and in fact, expects) batches, we can create another example that is simply shifted up by 100 and when we combine the two we have a "mini batch"

In [75]:
seq_len = 6
channels = 2

example1 = torch.zeros((1,channels,seq_len)) + torch.arange(seq_len).float() * 10
example1[:,1] = example1[:,1] + 5
print(f'1st Example:\n{example1}\n{example1.shape}\n')

example2 = example1 + 100
print(f'2nd Example:\n{example2}\n{example2.shape}\n')

examples = torch.cat((example1, example2), 0)
print(f'Batched Examples:\n{examples}\n{examples.shape}')

1st Example:
tensor([[[ 0., 10., 20., 30., 40., 50.],
         [ 5., 15., 25., 35., 45., 55.]]])
torch.Size([1, 2, 6])

2nd Example:
tensor([[[100., 110., 120., 130., 140., 150.],
         [105., 115., 125., 135., 145., 155.]]])
torch.Size([1, 2, 6])

Batched Examples:
tensor([[[  0.,  10.,  20.,  30.,  40.,  50.],
         [  5.,  15.,  25.,  35.,  45.,  55.]],

        [[100., 110., 120., 130., 140., 150.],
         [105., 115., 125., 135., 145., 155.]]])
torch.Size([2, 2, 6])


Before passing it through, let's remind ourselves of what the model is going to see after padding. Notice that pytorch has no problem padding multiple channels and multiple examples: if you pass a tuple of length 2, it pads only the final dimension (which for us is time). The 0 as the second element of the padding tuple tells it we don't need to pad the end of the signal.

In [76]:
signal = examples
print(f'Original:\n{signal}\n')
padding = (int((kernel_size - 1) * (dilation)), 0)
signal = torch.nn.functional.pad(signal, padding)
print(f'Padded Input:\n{signal}')

Original:
tensor([[[  0.,  10.,  20.,  30.,  40.,  50.],
         [  5.,  15.,  25.,  35.,  45.,  55.]],

        [[100., 110., 120., 130., 140., 150.],
         [105., 115., 125., 135., 145., 155.]]])

Padded Input:
tensor([[[  0.,   0.,   0.,  10.,  20.,  30.,  40.,  50.],
         [  0.,   0.,   5.,  15.,  25.,  35.,  45.,  55.]],

        [[  0.,   0., 100., 110., 120., 130., 140., 150.],
         [  0.,   0., 105., 115., 125., 135., 145., 155.]]])


Again, you can see that we just add the numbers up as the kernel observes them and because of the padding, our output `seq_len` matches our input `seq_len`. In addition, we can see that since we modified the second channel's kernel to be 2's instead of 1's, it is like adding the numbers and then multiplying by 2 for the second channel.

In [77]:
output = conv1d_layer(examples)
output

tensor([[[   5.,   30.,   75.,  135.,  195.,  255.],
         [  10.,   60.,  150.,  270.,  390.,  510.],
         [   5.,   30.,   75.,  135.,  195.,  255.],
         [   5.,   30.,   75.,  135.,  195.,  255.]],

        [[ 205.,  430.,  675.,  735.,  795.,  855.],
         [ 410.,  860., 1350., 1470., 1590., 1710.],
         [ 205.,  430.,  675.,  735.,  795.,  855.],
         [ 205.,  430.,  675.,  735.,  795.,  855.]]],
       grad_fn=<SqueezeBackward1>)
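If you want to double-check those numbers without pytorch, the all-ones kernel can be re-derived by hand. This is a plain-Python sketch (the function name `causal_sum` is mine): it pads the front with zeros and adds up everything the kernel touches at each time step.

```python
def causal_sum(channels, kernel_size, dilation=1):
    """Output of an all-ones causal kernel: sum over channels and kernel positions."""
    pad = (kernel_size - 1) * dilation
    padded = [[0.0] * pad + [float(v) for v in ch] for ch in channels]
    n_steps = len(channels[0])
    return [
        sum(ch[t + j * dilation] for ch in padded for j in range(kernel_size))
        for t in range(n_steps)
    ]


example1 = [[0, 10, 20, 30, 40, 50], [5, 15, 25, 35, 45, 55]]
print(causal_sum(example1, kernel_size=3))
# [5.0, 30.0, 75.0, 135.0, 195.0, 255.0]
```

That matches the rows produced by the 1's kernels for the first example, and doubling each value gives the row produced by the 2's kernel.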

Ok now that all that is established, let's look at having a variable-length output for each timestep of the model (i.e., instead of guessing just one value per channel, presumably the next value, can it guess the next `p` time steps for each channel?). So now we're hoping to have an output that is 4 dimensional instead of 3.

To back up a bit, here is what we expect from a multivariate output that is hoping to guess only the next data point (i.e. `p=1`) which for the first example looks like this (Note that we don't know the future of the series so the input of [50,55] won't have a "target" value in training - because... how could you possibly guess what comes next in this pattern?!?!). 

In [86]:
print(f'INPUT:\n{example1}\n')

print(f'OUTPUT:\n{example1[:,:,1:]}')

INPUT:
tensor([[[ 0., 10., 20., 30., 40., 50.],
         [ 5., 15., 25., 35., 45., 55.]]])

OUTPUT:
tensor([[[10., 20., 30., 40., 50.],
         [15., 25., 35., 45., 55.]]])


So we want the output at the first time step for the two channels to be [10,15] when given the history leading up to and including the current [0,5] input (again, remember we have a bunch of zeros padded in front of the input time dimension so that we can get a prediction corresponding to every input timestep).

That's the paradigm we've been working with thus far, but what if we wanted to give the model the ability to guess the next p timesteps at each point. This means the model would have to be a lot smarter because it would need to guess farther into the future as well as the underlying patterns of the dataset (think underlying trend or frequency response of the data).  So instead of our output being the above, it would be some sort of rolling window like this (note that this is how I create `y_true` in the final model):

In [94]:
def torch_running_view(tensor,window,axis=-1):
    shape = list(tensor.shape)
    shape[axis] -= (window -1)
    assert(shape[axis]>0)
    return torch.as_strided(tensor, shape + [window], 
                            tensor.stride() + (tensor.stride()[axis],))

print(f'INPUT:\n{example1}\n')

out_timesteps = 3
rolling_output = torch_running_view(example1, out_timesteps)
print(f'OUTPUT:\n{rolling_output[:,:,1:]}')

INPUT:
tensor([[[ 0., 10., 20., 30., 40., 50.],
         [ 5., 15., 25., 35., 45., 55.]]])

OUTPUT:
tensor([[[[10., 20., 30.],
          [20., 30., 40.],
          [30., 40., 50.]],

         [[15., 25., 35.],
          [25., 35., 45.],
          [35., 45., 55.]]]])
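As a cross-check on the stride trick, the same rolling windows can also be built with the built-in `Tensor.unfold(dimension, size, step)`. This is an alternative to `torch_running_view`, not what the rest of the article uses; the sketch rebuilds `example1` so it runs on its own.

```python
import torch

example1 = torch.zeros((1, 2, 6)) + torch.arange(6).float() * 10
example1[:, 1] += 5

out_timesteps = 3
# windows of length 3, step 1, taken along the last (time) dimension
windows = example1.unfold(-1, out_timesteps, 1)
targets = windows[:, :, 1:]   # drop the window that starts at the current value
print(targets.shape)   # torch.Size([1, 2, 3, 3])
print(targets[0, 0])   # rows: [10, 20, 30], [20, 30, 40], [30, 40, 50]
```

Like `as_strided`, `unfold` returns a view and puts the window dimension last, so the result has the same `batch_size, channels, seq_len, out_timesteps` layout as the output above.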


While the output dimensions of `torch.nn.Conv1d` are `batch_size, out_channels, seq_len`, the dimensions of the above correspond to `batch_size, out_channels, seq_len, out_timesteps` (so `[:,:,1:]` still cuts the first value of the sequence, forcing the model to guess "next" and not the current state). For example, starting with [0,5] and padding, we want the first output of channel 1 to be [10, 20, 30] (its contribution was the 0) and the first output of channel 2 to be [15, 25, 35] (its contribution was the 5). Notice that instead of `Tin - 1` like before, the output sequence is now of length `Tin - p` (where `Tin` is the `seq_len` of the input and `p` is the number of expected output timesteps at each step). Thus, we might as well not even feed the last `p` values of the input sequence into the model, because their predictions cannot be scored.

So how do we make our `Causal_1dConv` class give us a 4th dimension when it naturally wants to give us 3?  Easy, we just make it guess extra channels and then reshape the output.

In [110]:
kernel_size = 3
dilation = 1
stride = 1
in_channels = 2
out_channels = in_channels # autoencoder remember
# new
out_timesteps = 3

multi_output_conv1d_layer = Causal_1dConv(in_channels = in_channels, 
                                         # new
                                         out_channels = out_channels * out_timesteps,
                                         kernel_size = kernel_size, 
                                         bias = False,
                                         dilation = dilation,
                                        stride = stride)

# make weights be an averaging function or simple sum
multi_output_conv1d_layer.conv.weight.data = torch.ones(out_channels * out_timesteps,in_channels,kernel_size)
multi_output_conv1d_layer.conv.weight

Parameter containing:
tensor([[[1., 1., 1.],
         [1., 1., 1.]],

        [[1., 1., 1.],
         [1., 1., 1.]],

        [[1., 1., 1.],
         [1., 1., 1.]],

        [[1., 1., 1.],
         [1., 1., 1.]],

        [[1., 1., 1.],
         [1., 1., 1.]],

        [[1., 1., 1.],
         [1., 1., 1.]]], requires_grad=True)

Let's make the 2nd kernel be 2's instead of 1's to get a sense of how the output changes (as seen above, we have `out_channels * out_timesteps` kernels of shape `in_channels, kernel_size`; in a practical situation they will all be different and create different output channels)

In [111]:
multi_output_conv1d_layer.conv.weight.data[1] = 2
multi_output_conv1d_layer.conv.weight

Parameter containing:
tensor([[[1., 1., 1.],
         [1., 1., 1.]],

        [[2., 2., 2.],
         [2., 2., 2.]],

        [[1., 1., 1.],
         [1., 1., 1.]],

        [[1., 1., 1.],
         [1., 1., 1.]],

        [[1., 1., 1.],
         [1., 1., 1.]],

        [[1., 1., 1.],
         [1., 1., 1.]]], requires_grad=True)

Just to remind us what example1 looks like with a little padding (i.e. what the model will see on the forward pass)

In [109]:
signal = example1
print(f'Original:\n{signal}\n')
padding = (int((kernel_size - 1) * (dilation)), 0)
signal = torch.nn.functional.pad(signal, padding)
print(f'Padded Input:\n{signal}')

Original:
tensor([[[ 0., 10., 20., 30., 40., 50.],
         [ 5., 15., 25., 35., 45., 55.]]])

Padded Input:
tensor([[[ 0.,  0.,  0., 10., 20., 30., 40., 50.],
         [ 0.,  0.,  5., 15., 25., 35., 45., 55.]]])


So when we convolve the padded input with those 6 kernels, we get 6 output series that are all the same length (`Tin`) as the input series.  We can also see that the second kernel has, predictably, created an output channel with values that are double the magnitude of the others (again, the kernels that are simply 1's are essentially just adding numbers from the input together).

In [104]:
raw_multi_output = multi_output_conv1d_layer(example1)
raw_multi_output

tensor([[[  5.,  30.,  75., 135., 195., 255.],
         [ 10.,  60., 150., 270., 390., 510.],
         [  5.,  30.,  75., 135., 195., 255.],
         [  5.,  30.,  75., 135., 195., 255.],
         [  5.,  30.,  75., 135., 195., 255.],
         [  5.,  30.,  75., 135., 195., 255.]]], grad_fn=<SqueezeBackward1>)

In [136]:
T = example1.shape[-1]
p = out_timesteps
Tin = T - p
a = multi_output_conv1d_layer(example1[:,:,:Tin])
a

tensor([[[  5.,  30.,  75.],
         [ 10.,  60., 150.],
         [  5.,  30.,  75.],
         [  5.,  30.,  75.],
         [  5.,  30.,  75.],
         [  5.,  30.,  75.]]], grad_fn=<SqueezeBackward1>)

Because we padded, the output sequences are all the same length as the input but we remember that we won't be able to score the last `p` predictions from each output channel.  We will remove them after reshaping the output in two steps:

First, we split the channel dimension (of size `out_channels * out_timesteps`) into `(out_channels, out_timesteps)`, which leaves `seq_len` as the final dimension

In [113]:
Tin = example1.shape[-1]  # length of input series
modified_output = raw_multi_output.view(1,in_channels,out_timesteps,Tin)
modified_output

tensor([[[[  5.,  30.,  75., 135., 195., 255.],
          [ 10.,  60., 150., 270., 390., 510.],
          [  5.,  30.,  75., 135., 195., 255.]],

         [[  5.,  30.,  75., 135., 195., 255.],
          [  5.,  30.,  75., 135., 195., 255.],
          [  5.,  30.,  75., 135., 195., 255.]]]], grad_fn=<ViewBackward>)

Technically that's sufficient, but I think it's weird to have the dimensions be `batch_size, out_channels, out_timesteps, seq_len`, so I flip the last two dimensions (this also lines it up with the rolling-window output I showed above, which comes from the function I use to create the `y_true`).

In [127]:
reshaped_output = modified_output.permute((0,1,3,2)) # flipping 2nd and 3rd dims
print(reshaped_output.shape)
reshaped_output

torch.Size([1, 2, 6, 3])


tensor([[[[  5.,  10.,   5.],
          [ 30.,  60.,  30.],
          [ 75., 150.,  75.],
          [135., 270., 135.],
          [195., 390., 195.],
          [255., 510., 255.]],

         [[  5.,   5.,   5.],
          [ 30.,  30.,  30.],
          [ 75.,  75.,  75.],
          [135., 135., 135.],
          [195., 195., 195.],
          [255., 255., 255.]]]], grad_fn=<PermuteBackward>)

Again, remember that we don't have future values for all the timesteps in the input series. Since we can't score them, let's remove the last `p` entries from the `seq_len` dim (the second-to-last one now). In practice this gets handled by trimming the inputs before the forward pass, but I wanted to point it out here.

In [128]:
# length of output series: the number of timesteps that have a full target window
Tout = Tin - out_timesteps 
final_output = reshaped_output[:,:,:Tout,:]
print(final_output.shape)
final_output

torch.Size([1, 2, 3, 3])


tensor([[[[  5.,  10.,   5.],
          [ 30.,  60.,  30.],
          [ 75., 150.,  75.]],

         [[  5.,   5.,   5.],
          [ 30.,  30.,  30.],
          [ 75.,  75.,  75.]]]], grad_fn=<SliceBackward>)

Ok sanity check, those two clusters above represent the output of two channels.  Let's look at the second channel because none of its values were generated with the 2's kernel. Thus, at each time step, the model generates a string of 3 equivalent values (first: 5's, second: 30s, and finally: 75s).  That should pass our sanity check because each kernel is identical which means that the outputs should be identical if it's seeing the same inputs (if you want to check this, remember that the values the kernel sees are simply added when the kernel is all 1's).
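Since every kernel above is nearly identical, it's hard to see which conv channel ends up where after the reshape. Here is a small sketch with a stand-in tensor in which conv channel `k` is filled with the value `k`; it also reads the batch size from a variable instead of hard-coding 1 (all names and sizes here are illustrative).

```python
import torch

batch_size, out_channels, out_timesteps, seq_len = 2, 2, 3, 6

# stand-in for the conv output: channel k is filled with the value k
raw = torch.arange(out_channels * out_timesteps).float()
raw = raw.view(1, -1, 1).expand(batch_size, -1, seq_len)

reshaped = (
    raw.reshape(batch_size, out_channels, out_timesteps, seq_len)
       .permute(0, 1, 3, 2)   # -> (batch, out_channels, seq_len, out_timesteps)
)
print(reshaped.shape)      # torch.Size([2, 2, 6, 3])
print(reshaped[0, 1, 0])   # tensor([3., 4., 5.])
```

So conv channel `c * out_timesteps + j` becomes prediction step `j` of series channel `c`: the first three conv channels belong to the first series and the next three to the second. (`reshape` is used here only because the expanded stand-in isn't contiguous; on a real conv output `view` works the same way.)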

Let's put it all together in a class that would allow us to 

In [190]:
kernel_size = 1
dilation = 1
stride = 3
in_channels = 2
out_channels = in_channels # autoencoder remember
# new
out_timesteps = 3

test1 = Causal_1dConv(in_channels = in_channels, 
                                         # new
                                         out_channels = out_channels * out_timesteps,
                                         kernel_size = kernel_size, 
                                         bias = False,
                                         dilation = dilation,
                                        stride = 1)

test2 = Causal_1dConv(in_channels = in_channels, 
                                         # new
                                         out_channels = out_channels * out_timesteps,
                                         kernel_size = kernel_size, 
                                         bias = False,
                                         dilation = dilation,
                                        stride = stride)

# make weights be an averaging function or simple sum
test1.conv.weight.data = torch.ones(out_channels * out_timesteps,in_channels,kernel_size)
test2.conv.weight.data = torch.ones(out_channels * out_timesteps,in_channels,kernel_size)
# test.conv.weight

In [205]:
seq_len =14
channels = 2

example3 = torch.zeros((1,channels,seq_len)) + torch.arange(seq_len).float() * 10
example3[:,1] = example3[:,1] + 5
print(f'1st Example:\n{example3}\n{example3.shape}\n')

o1 = test1(example3[:,:,:-out_timesteps])[:,:,::stride]
o2 = test2(example3[:,:,:-out_timesteps])
print(o1[:,0],'\n',o1.shape)
print(o2[:,0],'\n',o2.shape)

1st Example:
tensor([[[  0.,  10.,  20.,  30.,  40.,  50.,  60.,  70.,  80.,  90., 100.,
          110., 120., 130.],
         [  5.,  15.,  25.,  35.,  45.,  55.,  65.,  75.,  85.,  95., 105.,
          115., 125., 135.]]])
torch.Size([1, 2, 14])

tensor([[  5.,  65., 125., 185.]], grad_fn=<SelectBackward>) 
 torch.Size([1, 6, 4])
tensor([[  5.,  65., 125., 185.]], grad_fn=<SelectBackward>) 
 torch.Size([1, 6, 4])


In [207]:
def shape_inputs(signal,out_timesteps,kernel_size,dilation):
    # pad so that it's causal
    padding = (int((kernel_size - 1) * (dilation)), 0)
    X = torch.nn.functional.pad(signal, padding)
    
    # remove timesteps that we won't have predictions for
    X = X[:,:,:-out_timesteps]
    
    return X

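def shape_preds(preds, X, out_timesteps):
    batch_size = preds.shape[0]  # read from the tensor instead of hard-coding 1
    in_channels = X.shape[1]     # autoencoder: out_channels == in_channels
    # split conv channels into (channel, out_timesteps), then put time before out_timesteps
    preds = preds.view(batch_size,in_channels,out_timesteps,-1)
    preds = preds.permute((0,1,3,2))
    return preds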

def shape_targets(signal,out_timesteps,stride):
    future_vals = signal[:,:,1:] # all but the first
    targets = torch_running_view(future_vals,out_timesteps) # reshape to have a 4th dimension
    targets = targets[:,:,::stride] # sample according to striding
    return targets

X = shape_inputs(example3,out_timesteps,kernel_size,dilation)
preds = test2(X)
preds = shape_preds(preds, X, out_timesteps)
targets = shape_targets(example3,out_timesteps,stride)

print(f'Input Series:\n{example3}\n{example3.shape}\n')
print(f'X:\n{X}\n{X.shape}\n')
print(f'preds:\n{preds}\n{preds.shape}\n')
print(f'targets:\n{targets}\n{targets.shape}')

Input Series:
tensor([[[  0.,  10.,  20.,  30.,  40.,  50.,  60.,  70.,  80.,  90., 100.,
          110., 120., 130.],
         [  5.,  15.,  25.,  35.,  45.,  55.,  65.,  75.,  85.,  95., 105.,
          115., 125., 135.]]])
torch.Size([1, 2, 14])

X:
tensor([[[  0.,  10.,  20.,  30.,  40.,  50.,  60.,  70.,  80.,  90., 100.],
         [  5.,  15.,  25.,  35.,  45.,  55.,  65.,  75.,  85.,  95., 105.]]])
torch.Size([1, 2, 11])

preds:
tensor([[[[  5.,   5.,   5.],
          [ 65.,  65.,  65.],
          [125., 125., 125.],
          [185., 185., 185.]],

         [[  5.,   5.,   5.],
          [ 65.,  65.,  65.],
          [125., 125., 125.],
          [185., 185., 185.]]]], grad_fn=<PermuteBackward>)
torch.Size([1, 2, 4, 3])

targets:
tensor([[[[ 10.,  20.,  30.],
          [ 40.,  50.,  60.],
          [ 70.,  80.,  90.],
          [100., 110., 120.]],

         [[ 15.,  25.,  35.],
          [ 45.,  55.,  65.],
          [ 75.,  85.,  95.],
          [105., 115., 125.]]]])
torch.Size([1, 2, 4, 3])

In a class

In [222]:
class Conv1dAutoencoder(torch.nn.Module):
    """
    A causal implementation of Conv1d
    """
    def __init__(self, in_channels, out_channels, out_timesteps = 1,
                 kernel_size=1, stride=1, dilation=1, bias=True):
        super(Conv1dAutoencoder, self).__init__()
        
        self.kernel_size = kernel_size
        self.dilation = dilation
        self.stride = stride
        
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.out_timesteps = out_timesteps

        self.conv = torch.nn.Conv1d(in_channels= in_channels, 
                                    out_channels = out_channels * out_timesteps,
                                    kernel_size=kernel_size, 
                                    stride=stride,
                                    dilation=dilation, 
                                    bias=bias)

    def forward(self, signal):
        X = self.shape_inputs(signal)
        
        targets = self.shape_targets(signal)

        preds = self.conv(X)
        preds = self.reshape_preds(preds)

        return X, preds, targets
    
    def shape_inputs(self,signal):
        # pad so that it's causal
        padding = (int((self.kernel_size - 1) * (self.dilation)), 0)
        X = torch.nn.functional.pad(signal, padding)
        # remove timesteps that we won't have predictions for
        X = X[:,:,:-self.out_timesteps]
        return X  
    
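    def reshape_preds(self,preds):
        # split conv channels into (out_channels, out_timesteps); the batch size is
        # read from the tensor so batches larger than 1 also work
        preds = preds.view(preds.shape[0],self.out_channels,self.out_timesteps,-1)
        preds = preds.permute((0,1,3,2))
        return preds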
    
    def shape_targets(self,signal):
        future_vals = signal[:,:,1:] # all but the first
        targets = torch_running_view(future_vals,self.out_timesteps) # reshape to have a 4th dimension
        targets = targets[:,:,::self.stride] # sample according to striding
        return targets

In [227]:
kernel_size = 2
dilation = 2
stride = 3
in_channels = 2
out_channels = in_channels # autoencoder remember
# new
out_timesteps = 3

model = Conv1dAutoencoder(in_channels = in_channels, 
                         # new
                         out_channels = out_channels,
                          out_timesteps = out_timesteps,
                         kernel_size = kernel_size, 
                         bias = False,
                         dilation = dilation,
                         stride = stride)

model.conv.weight.data = torch.ones(out_channels * out_timesteps,in_channels,kernel_size)

seq_len = 10
example4 = torch.zeros((1,2,seq_len)) + torch.arange(seq_len).float() * 10
example4[:,1] = example4[:,1] + 5
print(f'4th Example:\n{example4}\n{example4.shape}\n')

X, preds, targets = model(example4)



print(f'X:\n{X}\n{X.shape}\n')
print(f'preds:\n{preds}\n{preds.shape}\n')
print(f'targets:\n{targets}\n{targets.shape}')


4th Example:
tensor([[[ 0., 10., 20., 30., 40., 50., 60., 70., 80., 90.],
         [ 5., 15., 25., 35., 45., 55., 65., 75., 85., 95.]]])
torch.Size([1, 2, 10])

X:
tensor([[[ 0.,  0.,  0., 10., 20., 30., 40., 50., 60.],
         [ 0.,  0.,  5., 15., 25., 35., 45., 55., 65.]]])
torch.Size([1, 2, 9])

preds:
tensor([[[[  5.,   5.,   5.],
          [ 90.,  90.,  90.],
          [210., 210., 210.]],

         [[  5.,   5.,   5.],
          [ 90.,  90.,  90.],
          [210., 210., 210.]]]], grad_fn=<PermuteBackward>)
torch.Size([1, 2, 3, 3])

targets:
tensor([[[[10., 20., 30.],
          [40., 50., 60.],
          [70., 80., 90.]],

         [[15., 25., 35.],
          [45., 55., 65.],
          [75., 85., 95.]]]])
torch.Size([1, 2, 3, 3])
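The thing that matters most in the output above is that `preds` and `targets` have the same shape. Here is a self-contained sketch that sweeps a handful of settings and collects any mismatches. It re-implements the shaping steps with a plain `Conv1d` and `Tensor.unfold` rather than calling the class, so treat it as a cross-check under those assumptions; `pred_and_target_shapes` is a name I made up.

```python
import itertools
import torch
import torch.nn.functional as F


def pred_and_target_shapes(seq_len, kernel_size, dilation, stride, out_timesteps, channels=2):
    """Return (preds.shape, targets.shape) for one parameter setting."""
    signal = torch.randn(1, channels, seq_len)
    conv = torch.nn.Conv1d(channels, channels * out_timesteps, kernel_size,
                           stride=stride, dilation=dilation, bias=False)
    # inputs: left-pad, then drop the steps that have no full target window
    x = F.pad(signal, ((kernel_size - 1) * dilation, 0))[:, :, :-out_timesteps]
    preds = conv(x).view(1, channels, out_timesteps, -1).permute(0, 1, 3, 2)
    # targets: rolling windows over the future values, subsampled by the stride
    targets = signal[:, :, 1:].unfold(-1, out_timesteps, 1)[:, :, ::stride]
    return tuple(preds.shape), tuple(targets.shape)


mismatches = []
for k, d, s, p in itertools.product((1, 2, 5), (1, 3), (1, 2, 4), (1, 3, 10)):
    pred_shape, target_shape = pred_and_target_shapes(30, k, d, s, p)
    if pred_shape != target_shape:
        mismatches.append((k, d, s, p))
print(mismatches)  # expected to be empty
```

The shapes agree because the causal padding cancels the kernel size and dilation out of the conv output length, leaving both sides with one entry per `stride` steps over `seq_len - out_timesteps` positions.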


Let's see if we can break it... doesn't seem like it

In [233]:
kernel_size = 5
dilation = 3
stride = 4
in_channels = 2
out_channels = in_channels # autoencoder remember
# new
out_timesteps = 10

model = Conv1dAutoencoder(in_channels = in_channels, 
                         # new
                         out_channels = out_channels,
                          out_timesteps = out_timesteps,
                         kernel_size = kernel_size, 
                         bias = False,
                         dilation = dilation,
                         stride = stride)

model.conv.weight.data = torch.ones(out_channels * out_timesteps,in_channels,kernel_size)

seq_len = 30
example4 = torch.zeros((1,2,seq_len)) + torch.arange(seq_len).float() * 10
example4[:,1] = example4[:,1] + 5
print(f'4th Example:\n{example4}\n{example4.shape}\n')

X, preds, targets = model(example4)



print(f'X:\n{X}\n{X.shape}\n')
print(f'preds:\n{preds}\n{preds.shape}\n')
print(f'targets:\n{targets}\n{targets.shape}')


4th Example:
tensor([[[  0.,  10.,  20.,  30.,  40.,  50.,  60.,  70.,  80.,  90., 100.,
          110., 120., 130., 140., 150., 160., 170., 180., 190., 200., 210.,
          220., 230., 240., 250., 260., 270., 280., 290.],
         [  5.,  15.,  25.,  35.,  45.,  55.,  65.,  75.,  85.,  95., 105.,
          115., 125., 135., 145., 155., 165., 175., 185., 195., 205., 215.,
          225., 235., 245., 255., 265., 275., 285., 295.]]])
torch.Size([1, 2, 30])

X:
tensor([[[  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
            0.,   0.,  10.,  20.,  30.,  40.,  50.,  60.,  70.,  80.,  90.,
          100., 110., 120., 130., 140., 150., 160., 170., 180., 190.],
         [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
            0.,   5.,  15.,  25.,  35.,  45.,  55.,  65.,  75.,  85.,  95.,
          105., 115., 125., 135., 145., 155., 165., 175., 185., 195.]]])
torch.Size([1, 2, 32])

preds:
tensor([[[[   5.,    5.,    5.,    5.,    5.,    5.,    5

In [5]:
class Autoencoder():
    def __init__(self, in_channels, out_channels, output_len,
                 kernel_size=1, stride=1,dilation=1):
        
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.output_len = output_len
        self.kernel_size = kernel_size
        self.dilation = dilation
        
        
        self.model = Causal_1dConv(in_channels = in_channels,
                                   out_channels = out_channels * output_len,
                                   kernel_size = kernel_size, 
                                   bias = False, stride = stride,
                                   dilation = dilation)
        
    def predict(self,x):
        
        batch_size = x.shape[0]
        assert x.shape[1] == self.in_channels, 'not proper number of channels'
        seq_len = x.shape[2]
        
        # should be 3 dim
        preds = self.model(x)
        print(preds.shape)
        # needs to be expanded to 4 dims in order to overlay on top of y_true:
        # batch_size, channels, seq_len, output_len
        # TODO: uncomment the line below
        # preds = preds.view(batch_size,self.out_channels,-1,self.output_len)
        
        # padding is independent of dilation 
        padding = (int(self.kernel_size - 1 ), 0)
        x = torch.nn.functional.pad(x, padding)
        y_true = torch_running_view(x,self.output_len)
        return preds, y_true

In [18]:
data = torch.zeros((2,2,8)) + torch.arange(8).float() *10
data[:,1] = data[:,1] + 5
data[1] = data[1] + 100
print(data)
print('shape:   ',data.shape)


chans =  data.shape[1]
kernel_size = 2
output_len = 4
dilation = 1

autoencoder = Autoencoder(in_channels = chans, stride=1,
                          out_channels = chans,
                          output_len = output_len,
                          kernel_size = kernel_size,
                         dilation=dilation)


autoencoder.model.conv.weight.data = torch.ones(chans*output_len,chans,kernel_size)
# autoencoder.model.conv.weight.data[1,:,:] = 2
preds, y_true = autoencoder.predict(data[0].view(1,2,-1))

preds.shape, y_true.shape

tensor([[[  0.,  10.,  20.,  30.,  40.,  50.,  60.,  70.],
         [  5.,  15.,  25.,  35.,  45.,  55.,  65.,  75.]],

        [[100., 110., 120., 130., 140., 150., 160., 170.],
         [105., 115., 125., 135., 145., 155., 165., 175.]]])
shape:    torch.Size([2, 2, 8])
torch.Size([1, 8, 8])


(torch.Size([1, 8, 8]), torch.Size([1, 2, 6, 4]))

In [19]:
autoencoder.model.conv.weight.data

tensor([[[1., 1.],
         [1., 1.]],

        [[1., 1.],
         [1., 1.]],

        [[1., 1.],
         [1., 1.]],

        [[1., 1.],
         [1., 1.]],

        [[1., 1.],
         [1., 1.]],

        [[1., 1.],
         [1., 1.]],

        [[1., 1.],
         [1., 1.]],

        [[1., 1.],
         [1., 1.]]])

In [14]:
autoencoder.model.conv.weight.data.shape

torch.Size([8, 2, 2])

In [15]:
signal = data[0]
padding = (int((kernel_size-1)*dilation),0)
signal = torch.nn.functional.pad(signal, padding)
signal

tensor([[ 0.,  0., 10., 20., 30., 40., 50., 60., 70.],
        [ 0.,  5., 15., 25., 35., 45., 55., 65., 75.]])

In [16]:
# batch, chans, seqlen, output_len
preds[0,0,:,:]#.shape

IndexError: too many indices for tensor of dimension 3

# it's repeating information because every weight is a 1

In [20]:
preds

tensor([[[  5.,  30.,  70., 110., 150., 190., 230., 270.],
         [  5.,  30.,  70., 110., 150., 190., 230., 270.],
         [  5.,  30.,  70., 110., 150., 190., 230., 270.],
         [  5.,  30.,  70., 110., 150., 190., 230., 270.],
         [  5.,  30.,  70., 110., 150., 190., 230., 270.],
         [  5.,  30.,  70., 110., 150., 190., 230., 270.],
         [  5.,  30.,  70., 110., 150., 190., 230., 270.],
         [  5.,  30.,  70., 110., 150., 190., 230., 270.]]],
       grad_fn=<SqueezeBackward1>)

In [40]:
# add an extra dim to split the channels from output dim (remember we're having the conv layer give us num_chan * output_dim)
test = preds.view(1,chans,output_len,data.shape[-1])#.permute((0,1,3,2))
test

tensor([[[[  5.,  30.,  70., 110., 150., 190., 230., 270.],
          [  5.,  30.,  70., 110., 150., 190., 230., 270.],
          [  5.,  30.,  70., 110., 150., 190., 230., 270.],
          [  5.,  30.,  70., 110., 150., 190., 230., 270.]],

         [[  5.,  30.,  70., 110., 150., 190., 230., 270.],
          [  5.,  30.,  70., 110., 150., 190., 230., 270.],
          [  5.,  30.,  70., 110., 150., 190., 230., 270.],
          [  5.,  30.,  70., 110., 150., 190., 230., 270.]]]],
       grad_fn=<ViewBackward>)

In [42]:
# batch, chan, output_index, current time
test[0,0,:,0]

tensor([5., 5., 5., 5.], grad_fn=<SelectBackward>)

In [44]:
# to match the shape of y_true we need to swap the axis (using permute) 
# each row is all the same because all four of the kernels (or 8 if you consider both channels) are seeing
# the same input at that time step and all the kernels are also identical
# ideally the output kernels would learn to be different based on what the input is (which right now is only 2 time steps of two channels,
# so not a whole lot of information for it to expand to 8 values)

# I guess if we had multiple layers, only the output would need to receive this reshaping treatment
test2 = test.permute((0,1,3,2))
test2

tensor([[[[  5.,   5.,   5.,   5.],
          [ 30.,  30.,  30.,  30.],
          [ 70.,  70.,  70.,  70.],
          [110., 110., 110., 110.],
          [150., 150., 150., 150.],
          [190., 190., 190., 190.],
          [230., 230., 230., 230.],
          [270., 270., 270., 270.]],

         [[  5.,   5.,   5.,   5.],
          [ 30.,  30.,  30.,  30.],
          [ 70.,  70.,  70.,  70.],
          [110., 110., 110., 110.],
          [150., 150., 150., 150.],
          [190., 190., 190., 190.],
          [230., 230., 230., 230.],
          [270., 270., 270., 270.]]]], grad_fn=<PermuteBackward>)
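
The view-then-permute step above is easier to check when every row of the conv output is distinguishable. The following is a minimal sketch, assuming the stacked output axis is laid out channel-major (row `c * horizon + h` holds channel `c`, forecast step `h`), which is what the `view` call above implies. The names `batch`, `chans`, `horizon` and `seq_len` are illustrative, with `horizon` playing the role of `output_len`.

```python
import torch

batch, chans, horizon, seq_len = 1, 2, 4, 8

# Stand-in for the conv output: row r of the stacked axis is filled with the
# value r, so we can see where each row ends up after reshaping.
rows = torch.arange(chans * horizon).float().view(1, -1, 1)
preds = rows.expand(batch, chans * horizon, seq_len).contiguous()

# Split the stacked axis into (channel, horizon), then move horizon last so
# the layout is (batch, channel, time, horizon), matching y_true.
split = preds.view(batch, chans, horizon, seq_len)
aligned = split.permute(0, 1, 3, 2)

print(aligned.shape)
print(aligned[0, 1, 0])  # rows 4..7 of the stacked axis belong to channel 1
```

If the layer stacked its outputs horizon-major instead, the `view` arguments would need to be swapped, so it is worth running a check like this against the real model.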

In [48]:
signal = data[0]
padding = (int((kernel_size-1)*dilation),0)
signal = torch.nn.functional.pad(signal, padding)
signal

tensor([[ 0.,  0., 10., 20., 30., 40., 50., 60., 70.],
        [ 0.,  5., 15., 25., 35., 45., 55., 65., 75.]])

In [45]:
y_true

tensor([[[[ 0.,  0., 10., 20.],
          [ 0., 10., 20., 30.],
          [10., 20., 30., 40.],
          [20., 30., 40., 50.],
          [30., 40., 50., 60.],
          [40., 50., 60., 70.]],

         [[ 0.,  5., 15., 25.],
          [ 5., 15., 25., 35.],
          [15., 25., 35., 45.],
          [25., 35., 45., 55.],
          [35., 45., 55., 65.],
          [45., 55., 65., 75.]]]])

In [21]:
(batch_size,self.out_channels,-1,self.output_len) 

NameError: name 'batch_size' is not defined

In [26]:
preds.view(1,2,data.shape[-1],-1).transpose(3,4)

IndexError: Dimension out of range (expected to be in range of [-4, 3], but got 4)

In [24]:
x = torch.cat((torch.ones(1, 24, 24),
               torch.ones(1, 24, 24)*2,
               torch.ones(1, 24, 24)*3), 0)

permute = [2, 1, 0]
x = x[permute, :, :]
y = x.numpy()
c, h, w = y.shape
y = y.transpose(1, 2, 0)
y

array([[[3., 2., 1.],
        [3., 2., 1.],
        [3., 2., 1.],
        ...,
        [3., 2., 1.],
        [3., 2., 1.],
        [3., 2., 1.]],

       [[3., 2., 1.],
        [3., 2., 1.],
        [3., 2., 1.],
        ...,
        [3., 2., 1.],
        [3., 2., 1.],
        [3., 2., 1.]],

       [[3., 2., 1.],
        [3., 2., 1.],
        [3., 2., 1.],
        ...,
        [3., 2., 1.],
        [3., 2., 1.],
        [3., 2., 1.]],

       ...,

       [[3., 2., 1.],
        [3., 2., 1.],
        [3., 2., 1.],
        ...,
        [3., 2., 1.],
        [3., 2., 1.],
        [3., 2., 1.]],

       [[3., 2., 1.],
        [3., 2., 1.],
        [3., 2., 1.],
        ...,
        [3., 2., 1.],
        [3., 2., 1.],
        [3., 2., 1.]],

       [[3., 2., 1.],
        [3., 2., 1.],
        [3., 2., 1.],
        ...,
        [3., 2., 1.],
        [3., 2., 1.],
        [3., 2., 1.]]], dtype=float32)

In [25]:
x

tensor([[[3., 3., 3.,  ..., 3., 3., 3.],
         [3., 3., 3.,  ..., 3., 3., 3.],
         [3., 3., 3.,  ..., 3., 3., 3.],
         ...,
         [3., 3., 3.,  ..., 3., 3., 3.],
         [3., 3., 3.,  ..., 3., 3., 3.],
         [3., 3., 3.,  ..., 3., 3., 3.]],

        [[2., 2., 2.,  ..., 2., 2., 2.],
         [2., 2., 2.,  ..., 2., 2., 2.],
         [2., 2., 2.,  ..., 2., 2., 2.],
         ...,
         [2., 2., 2.,  ..., 2., 2., 2.],
         [2., 2., 2.,  ..., 2., 2., 2.],
         [2., 2., 2.,  ..., 2., 2., 2.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.]]])

In [190]:
y_true

tensor([[[[ 0.,  0.],
          [ 0., 10.],
          [10., 20.],
          [20., 30.],
          [30., 40.],
          [40., 50.],
          [50., 60.],
          [60., 70.]],

         [[ 0.,  5.],
          [ 5., 15.],
          [15., 25.],
          [25., 35.],
          [35., 45.],
          [45., 55.],
          [55., 65.],
          [65., 75.]]]])

In [11]:
def torch_running_view(tensor,window,axis=-1):
    shape = list(tensor.shape)
    shape[axis] -= (window -1)
    assert(shape[axis]>0)
    return torch.as_strided(tensor, shape + [window], 
                            tensor.stride() + (tensor.stride()[axis],))

# data = torch.arange(10).view(2,1,-1)
data = torch.zeros((2,2,8)) + torch.arange(8).float() *10
data[:,1] = data[:,1] + 5
data[1] = data[1] + 100
print(data)
print('shape:   ',data.shape)

reshaped_data = torch_running_view(data,3)
print("reshaped:",reshaped_data.shape)
reshaped_data[0,0]

tensor([[[  0.,  10.,  20.,  30.,  40.,  50.,  60.,  70.],
         [  5.,  15.,  25.,  35.,  45.,  55.,  65.,  75.]],

        [[100., 110., 120., 130., 140., 150., 160., 170.],
         [105., 115., 125., 135., 145., 155., 165., 175.]]])
shape:    torch.Size([2, 2, 8])
reshaped: torch.Size([2, 2, 6, 3])


tensor([[ 0., 10., 20.],
        [10., 20., 30.],
        [20., 30., 40.],
        [30., 40., 50.],
        [40., 50., 60.],
        [50., 60., 70.]])
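
As a cross-check on the strided view, the same sliding windows can be produced with the built-in `Tensor.unfold(dimension, size, step)`. This is a small sketch, not a replacement for the helper above; `torch_running_view` is repeated here only so the block runs on its own.

```python
import torch

def torch_running_view(tensor, window, axis=-1):
    # Same strided-view trick as above: shrink the chosen axis by window - 1
    # and append a new last axis that walks along it with the same stride.
    shape = list(tensor.shape)
    shape[axis] -= (window - 1)
    assert shape[axis] > 0
    return torch.as_strided(tensor, shape + [window],
                            tensor.stride() + (tensor.stride()[axis],))

data = torch.zeros((2, 2, 8)) + torch.arange(8).float() * 10
data[:, 1] = data[:, 1] + 5
data[1] = data[1] + 100

strided = torch_running_view(data, 3)
unfolded = data.unfold(2, 3, 1)  # windows of size 3, step 1, along the time axis

print(strided.shape, unfolded.shape)
print(torch.equal(strided, unfolded))
```

Both results are views of the original storage, so writing into either one also changes `data`.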

In [5]:
data = torch.arange(11).float()
# reformat data
X = data.view(1,1,-1)
y_true = data.squeeze()

# pass into our convolutional layer
output = conv1d_layer(X)

print(f'     X: {X.squeeze()}\ny_true: {y_true}\ny_pred: {(output.view(-1))}')
print('\n')
output.shape, X.shape, y_true.shape


     X: tensor([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.])
y_true: tensor([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.])
y_pred: tensor([ 0.,  1.,  3.,  6., 10., 15., 20., 25., 30., 35., 40.],
       grad_fn=<ViewBackward>)




(torch.Size([1, 1, 11]), torch.Size([1, 1, 11]), torch.Size([11]))

In [8]:
data = torch.arange(11).float()

data *10 +5

tensor([  5.,  15.,  25.,  35.,  45.,  55.,  65.,  75.,  85.,  95., 105.])

In [34]:
data = torch.zeros((2,2,8)) + torch.arange(8).float() *10
data[:,1] = data[:,1] + 5
data[1] = data[1] + 100
data[0]

tensor([[ 0., 10., 20., 30., 40., 50., 60., 70.],
        [ 5., 15., 25., 35., 45., 55., 65., 75.]])

In [22]:
data[1]

tensor([[100., 110., 120., 130., 140., 150., 160., 170.],
        [105., 115., 125., 135., 145., 155., 165., 175.]])

In [24]:
torch.as_strided(data[0,0],(2,-1),1)

TypeError: as_strided(): argument 'stride' (position 3) must be tuple of ints, not int

In [25]:
def running_view(arr, window, axis=-1):
    """
    return a running view of length 'window' over 'axis'
    the returned array has an extra last dimension, which spans the window
    """
    shape = list(arr.shape)
    shape[axis] -= (window-1)
    assert(shape[axis]>0)
    return np.lib.stride_tricks.as_strided(
        arr,
        shape + [window],
        arr.strides + (arr.strides[axis],))

In [32]:
data.strides

(8,)

In [33]:
def pytorch_rolling_window(x, window_size, step_size=1):
    # unfold dimension to make our rolling window
    return x.unfold(0,window_size,step_size)

In [41]:
data = torch.arange(10).view(2,-1)

pytorch_rolling_window(data,2,2)

tensor([[[0, 5],
         [1, 6],
         [2, 7],
         [3, 8],
         [4, 9]]])

In [44]:
data

tensor([[0, 1, 2, 3, 4],
        [5, 6, 7, 8, 9]])

In [43]:
data.stride()

(5, 1)

In [31]:
data = np.arange(10)
data

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [29]:
running_view(data,2)

array([[0, 1],
       [1, 2],
       [2, 3],
       [3, 4],
       [4, 5],
       [5, 6],
       [6, 7],
       [7, 8],
       [8, 9]])

Intro to CNN (must read if need clarification on some of the terminology) https://arxiv.org/pdf/1603.07285.pdf

https://github.com/vdumoulin/conv_arithmetic/blob/master/README.md

In the last decade, we witnessed great strides in the field of computer vision through the utilization and development of deep convolutional neural network architectures. Thus, much attention has been given to the development and manipulation of these convolutional models both in terms of theory and ease of implementation/prototyping (think PyTorch, TensorFlow, etc).

However, time series modeling can use the same machinery, and it can really just be thought of as a simpler 1D version of the 2D case.

This tutorial will focus on providing a top-down approach to building a 1D convolutional model in PyTorch.  We will walk through the arithmetic of the different hyperparameters of such models in order to provide a solid foundation for more sophisticated tuning of these parameters to build highly complex and powerful neural network architectures for time series analysis.

## PyTorch's `nn.Conv1d` Class

We will focus on a code-first, top-down approach and fill in theoretical details as needed. The hope is that by the end of this you will feel comfortable working with all the levers made available by modern, open-source machine learning libraries. Thus, we begin by noting that PyTorch provides a native class for a 1D convolutional layer:

`torch.nn.Conv1d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros')`

We begin with the required arguments (`in_channels`, `out_channels`, `kernel_size`) of the `torch.nn.Conv1d` class before moving on to the keyword arguments.  The documentation defines them in the following way:
```
in_channels (python:int) – Number of channels in the input image
out_channels (python:int) – Number of channels produced by the convolution
kernel_size (python:int or tuple) – Size of the convolving kernel
```
Notice the slight Freudian slip of the architects assuming that the input would be an "image."  Since we're undyingly grateful for the work they've done, we'll let this slide and instead focus on discussing the implication of being able to specify the number of channels in a "so-called" 1D model. 

## The Univariate Problem: 1 channel in, 1 channel out

Again, we're taking a build-first approach so let's just start with a simple example: 1 channel in, 1 channel out, and a kernel_size of 5.  This is a realistic starting point since, oftentimes, a time series problem is framed such that a univariate (single channel input) dataset is used to predict the future value of itself (single channel output).  (Aside: we will also remove the bias term from the model for now).

In [3]:
kernel_size =5
conv1d_layer = torch.nn.Conv1d(in_channels = 1, out_channels = 1, 
                               kernel_size =kernel_size, bias = False)
conv1d_layer

Conv1d(1, 1, kernel_size=(5,), stride=(1,), bias=False)

And that's it, we've just created our first 1D convolutional layer that inherently leverages all the power of PyTorch's autograd functionality, GPU utilization, etc.  But... what was actually created? Well let's take a look at the kernel (or weight matrix if you prefer):

In [4]:
conv1d_layer.weight

Parameter containing:
tensor([[[-0.0071,  0.1819,  0.4029, -0.4130,  0.0206]]], requires_grad=True)

Unsurprisingly, PyTorch has randomly generated 5 numbers to act as our initial weight tensor.  However, if you'd like to use your own weight initialization logic you can overlay this by accessing the values of the tensor using the `data` attribute.  Let's say we'd previously optimized this model and knew that it should simply take an unweighted moving average of the last 5 data points. We would modify all the weights to be 1/5 like so:

In [5]:
conv1d_layer.weight.data = torch.ones(1,1,5) / 5
conv1d_layer.weight

Parameter containing:
tensor([[[0.2000, 0.2000, 0.2000, 0.2000, 0.2000]]], requires_grad=True)

Of course, you can use similar logic to overlay any number of weight initialization protocols.
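
As one illustration of swapping in a different initialization, here is a minimal sketch using two in-place routes that avoid touching `.data` directly: `torch.nn.init.constant_` and a `copy_` under `torch.no_grad()`. The `layer` and `custom` names are made up for this example.

```python
import torch

# Hypothetical layer mirroring the one above: 1 channel in, 1 out, kernel of 5.
layer = torch.nn.Conv1d(in_channels=1, out_channels=1, kernel_size=5, bias=False)

# Option 1: fill the weight in place with a constant (1/5 gives the moving average).
torch.nn.init.constant_(layer.weight, 1 / 5)

# Option 2: copy any tensor of the right shape in place, without tracking gradients.
custom = torch.full((1, 1, 5), 0.2)
with torch.no_grad():
    layer.weight.copy_(custom)

weight_shape = tuple(layer.weight.shape)
all_fifths = bool(torch.allclose(layer.weight, torch.full((1, 1, 5), 0.2)))
print(weight_shape, all_fifths)
```

Either route leaves `requires_grad=True` on the parameter, so the layer can still be trained afterwards.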

OK, so we have a model (which takes an unweighted average over the 5 most recent data points). Let's create a fake stream of time series data and pass it in to see what happens (of course, reshaping to the expected dimensions of `[batch_size, channels_in, seq_length]` and making sure that the type is `float32`).

In [15]:
data = torch.tensor([4, 9, 3, 6, 4, 7, 9, 2, 6, 5, 6, 8, 9, 1, 4]).float()
# reformat data [batch, channels, seq_len]
X = data.view(1,1,-1)[:,:,:-1]
print(X.shape)

# pass into our convolutional layer
output = conv1d_layer(X)
print(output.squeeze().shape)
output

torch.Size([1, 1, 14])
torch.Size([10])


tensor([[[5.2000, 5.8000, 5.8000, 5.6000, 5.6000, 5.8000, 5.6000, 5.4000,
          6.8000, 5.8000]]], grad_fn=<SqueezeBackward1>)

# WHY ISN'T IT CAUSAL? Seems like it is if set up like this

The average of [0., 1., 2., 3., 4.] is 2, and likewise the average of [1., 2., 3., 4., 5.] is 3.  So clearly it's taking the first 5 values, then shifting 1 to the right, etc.  I can set y_true to be 5 and 6 and let it modify its weights, but to me this proves that it's doing what we hope it would, unless I'm missing something.

In [36]:
data = torch.arange(7).float()
# reformat data
X = data.view(1,1,-1)[:,:,:-1]
y_true = data.squeeze()[kernel_size:]

# pass into our convolutional layer
output = conv1d_layer(X)

print(f'     X: {X.squeeze()}\ny_true: {y_true}\ny_pred: {torch.round(output.view(-1))}')

     X: tensor([0., 1., 2., 3., 4., 5.])
y_true: tensor([5., 6.])
y_pred: tensor([2., 3.], grad_fn=<RoundBackward>)


In [44]:
kernel_size =5
conv1d_layer = torch.nn.Conv1d(in_channels = 1, out_channels = 1, 
                               kernel_size =kernel_size, bias = False,dilation=2)
conv1d_layer.weight.data = torch.ones(1,1,5) / 5
conv1d_layer.weight

Parameter containing:
tensor([[[0.2000, 0.2000, 0.2000, 0.2000, 0.2000]]], requires_grad=True)

WaveNet's implementation is easier; maybe don't introduce it until later

https://github.com/NVIDIA/nv-wavenet/blob/master/pytorch/wavenet.py#L23

In [62]:
class Conv(torch.nn.Module):
    """
    A convolution with the option to be causal and use xavier initialization
    """
    def __init__(self, in_channels, out_channels, kernel_size=1, stride=1,
                 dilation=1, bias=True, w_init_gain='linear', is_causal=False):
        super(Conv, self).__init__()
        self.is_causal = is_causal
        self.kernel_size = kernel_size
        self.dilation = dilation

        self.conv = torch.nn.Conv1d(in_channels, out_channels,
                                    kernel_size=kernel_size, stride=stride,
                                    dilation=dilation, bias=bias)

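        # xavier_uniform_ is the in-place initializer; the un-suffixed
        # xavier_uniform is deprecated
        torch.nn.init.xavier_uniform_(
            self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain))

    def forward(self, signal):
        if self.is_causal:
            # left-pad only, so an output never sees future time steps
            padding = (int((self.kernel_size - 1) * (self.dilation)), 0)
            signal = torch.nn.functional.pad(signal, padding)
        return self.conv(signal)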

In [75]:
kernel_size =5
conv1d_layer = Conv(in_channels = 1, out_channels = 1, 
                               kernel_size =kernel_size, bias = False,dilation=1,is_causal=True)
conv1d_layer.conv.weight.data = torch.ones(1,1,5) / 5
conv1d_layer.conv.weight



Parameter containing:
tensor([[[0.2000, 0.2000, 0.2000, 0.2000, 0.2000]]], requires_grad=True)

In [78]:
data = torch.arange(11).float()
# reformat data
X = data.view(1,1,-1)
y_true = data.squeeze()

# pass into our convolutional layer
output = conv1d_layer(X)

print(f'     X: {X.squeeze()}\ny_true: {y_true}\ny_pred: {(output.view(-1))}')
print('\n')
output.shape, X.shape, y_true.shape


     X: tensor([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.])
y_true: tensor([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.])
y_pred: tensor([0.0000, 0.2000, 0.6000, 1.2000, 2.0000, 3.0000, 4.0000, 5.0000, 6.0000,
        7.0000, 8.0000], grad_fn=<ViewBackward>)




(torch.Size([1, 1, 11]), torch.Size([1, 1, 11]), torch.Size([11]))
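
One way to convince yourself that front-only padding really is causal is to perturb the last input value and confirm that no earlier output moves. The sketch below assumes a left pad of `(kernel_size - 1) * dilation`, the same amount the `Conv` class uses; the `causal_forward` helper is made up for this check and is not part of the class above.

```python
import torch
import torch.nn.functional as F

kernel_size, dilation = 5, 2
conv = torch.nn.Conv1d(1, 1, kernel_size, dilation=dilation, bias=False)

def causal_forward(x):
    # Pad only the front so each output sees the current and earlier steps.
    left_pad = (kernel_size - 1) * dilation
    return conv(F.pad(x, (left_pad, 0)))

x = torch.arange(11).float().view(1, 1, -1)
y = causal_forward(x)

# Perturb the final time step; every earlier output should be unchanged.
x_perturbed = x.clone()
x_perturbed[0, 0, -1] += 100.0
y_perturbed = causal_forward(x_perturbed)

same_length = y.shape[-1] == x.shape[-1]
past_unchanged = bool(torch.allclose(y[..., :-1], y_perturbed[..., :-1]))
print(same_length, past_unchanged)
```

The same test with symmetric padding (`padding=` on `Conv1d`) would fail, because outputs near the end would then depend on later inputs.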

This article goes into how to restructure data if you have multivariate, stacked output: https://machinelearningmastery.com/how-to-develop-convolutional-neural-network-models-for-time-series-forecasting/

In [58]:
(np.array([0,2,4,6,8])+1).mean()

5.0

Ahh, the simplicity of PyTorch.  All we needed to do was pass our input data in as an argument to the model we created, and it generates the output series based on the weights stored (or optimized, as the case may be).  But you may notice that there are only 10 values in the output series, whereas the input we passed in had 14 (the code above drops the last of our 15 data points).  Recall that we are taking a moving average of 5 data points, which means that, in order to make an inference at a particular point in time, we need to know the previous 4 values.  Thus, our first output can only come once we have "seen" 5 data points. The figure provides a visual of how a 1D convolutional kernel slides along the time axis of our input series to provide a single output value at every timestep after skipping the first 4.  Again, PyTorch handles all of this logic internally for us, but understanding the expected output dimensions is still very important for model scoring, prototyping, etc.
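
The bookkeeping in the paragraph above follows the output-length formula given in the `torch.nn.Conv1d` documentation. Here is a small sketch that wraps it in a helper (`conv1d_out_len` is a made-up name) and compares it with what a real layer returns.

```python
import torch

def conv1d_out_len(seq_len, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of torch.nn.Conv1d, per the formula in the PyTorch docs."""
    return (seq_len + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# 14 inputs and a kernel of 5, as in the moving-average example.
x = torch.randn(1, 1, 14)
layer = torch.nn.Conv1d(1, 1, kernel_size=5, bias=False)

predicted = conv1d_out_len(14, kernel_size=5)
actual = layer(x).shape[-1]
print(predicted, actual)  # both are 10
```

With the defaults (`stride=1`, `padding=0`, `dilation=1`) the formula reduces to `seq_len - (kernel_size - 1)`, which is the "skip the first 4" rule described above.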

##### Picture of sliding conv on time axis

11 small pictures with shading to show where kernel is focusing, and then one of them will be a zoomed in to show the math

input
|
kernel
|
output

note the directionality of it

## The Multivariate Problem: `n` input channels, `m` output channels





For example, let's say that we were interested in predicting the future value of a particular stock and only wanted to use past values of that stock to do so.  That would mean that both `in_channels` and `out_channels` would be set to 1.  However, if we wanted to use the 5 most similar companies' stock prices to predict the value of a single stock, then `out_channels` would stay at 1 while `in_channels` would become 5.

Ok enough talk, let's start building...

the input of a 1D model would ever have more  why a time series would ever be

"The bread and butter of neural networks is affine transformations: a vector
is received as input and is multiplied with a matrix to produce an output (to
which a bias vector is usually added before passing the result through a nonlinearity)." https://arxiv.org/pdf/1603.07285.pdf

## Basic 1d Conv

Look at what the basic output looks like

1 channel in and 1 channel out, and a kernel length of 5,

no extra striding (stride = 1), no extra padding (padding = 0), no dilation (dilation = 1)

note that we are only required to give `in_channels, out_channels, kernel_size`. left the others there with their defaults (except bias, we're turning that off for now)

In [3]:
model = torch.nn.Conv1d(in_channels = 1, out_channels = 1, kernel_size =5,
                        stride=1, padding=0, dilation=1, groups=1, bias=False, padding_mode='zeros')

print(model.weight)

Parameter containing:
tensor([[[ 0.2755,  0.1295, -0.0642,  0.2511,  0.3457]]], requires_grad=True)


Let's create some simple data: 1 channel with a sequence length of 15 (obviously not very interesting data here)

In [7]:
data = torch.ones((1,15))
data

tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

Let's grab the first channel and add 2 dummy dimensions to it since that's what's expected

In [5]:
inp = data[0].view(1,1,-1)
# inp[0,0,-1] = 4
# inp[0,0,4] = 4
inp

tensor([[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]])

Pass it into the model and get 11 numbers out which makes sense because the first four inputs didn't have enough past history to make a prediction (kernel size = 5) so it's only at the 5th element that we get our first prediction

In [6]:
out = model(inp)
print(out.shape)
out

torch.Size([1, 1, 11])


tensor([[[0.9376, 0.9376, 0.9376, 0.9376, 0.9376, 0.9376, 0.9376, 0.9376,
          0.9376, 0.9376, 0.9376]]], grad_fn=<SqueezeBackward1>)

As for the value, it is unsurprisingly constant, and it's simply the sum of all the elements in the kernel since our data is all ones ($\sum_{i=1}^{k} w_i$, where $k$ is the kernel length and $w_i$ refers to each weight element in the kernel).

One can access the kernel weights with the `weight` attribute for the layer in pytorch

In [61]:
model.weight.sum()

tensor(0.6312, grad_fn=<SumBackward0>)

### Sequence length is not a multiple of the kernel length

What happens if we change the kernel length so that the sequence length is no longer a multiple of it (k = 4 and seq_len = 15)?

Not a problem: since our stride is 1 and we don't have any padding, it just follows the same pattern as above, where the first k-1 points don't receive predictions

In [11]:
model = torch.nn.Conv1d(in_channels = 1, out_channels = 1, kernel_size =4,
                        stride=1, padding=0, dilation=1, groups=1, bias=False, padding_mode='zeros')

print(model.weight)

Parameter containing:
tensor([[[-0.4973, -0.2145,  0.1587, -0.0901]]], requires_grad=True)


Again, since the kernel is overlaid on top of the sequence, the first 3 values in the sequence don't have corresponding outputs from the model (since it needs 3 previous values plus the current one to fill a kernel of length 4).

In [12]:
out = model(inp)
print(out.shape)
out

torch.Size([1, 1, 12])


tensor([[[-0.6433, -0.6433, -0.6433, -0.6433, -0.6433, -0.6433, -0.6433,
          -0.6433, -0.6433, -0.6433, -0.6433, -0.6433]]],
       grad_fn=<SqueezeBackward1>)

## Multiple inputs

So what happens when we change the number of input channels but still expect 1 channel in the output? We see that the weight tensor now has 2 kernels of length 5 (so the shape is (1,2,5))

In [14]:
data = torch.ones((2,15))
data[1] += 1
data

tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.]])

In [15]:
model = torch.nn.Conv1d(in_channels = 2, out_channels = 1, kernel_size =5,
                        stride=1, padding=0, dilation=1, groups=1, bias=False, padding_mode='zeros')

print(model.weight)

Parameter containing:
tensor([[[ 5.5815e-02,  6.2070e-02,  1.1655e-01, -1.6826e-01, -2.7581e-01],
         [ 2.0035e-01, -1.8993e-04,  1.6400e-01,  4.2376e-02,  1.4429e-01]]],
       requires_grad=True)


Add a dummy batch dimension to the tensor since that's what's expected by torch (shape = (1,2,15))

In [17]:
inp = data.view(1,2,15)
inp

tensor([[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
         [2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.]]])

Pass it into the model

In [18]:
model(inp)

tensor([[[0.8920, 0.8920, 0.8920, 0.8920, 0.8920, 0.8920, 0.8920, 0.8920,
          0.8920, 0.8920, 0.8920]]], grad_fn=<SqueezeBackward1>)

The results are constant again (as expected) and are simply the sum of kernel1 convolving the first series (of 1s) and kernel2 convolving the second series (of 2s).  Again, the first convolution is just the sum of the weight elements in kernel1 ($\sum_{i=1}^{k} w_i$), whereas the second is twice the sum of the weight elements in kernel2, since every value in that series is 2.

Note that for the first series it's just $\sum_{i=1}^{k} 1 \cdot w_i$, where $k$ is the kernel length and $w_i$ refers to each weight. It's multiplied by 1 because that series is just ones [1, 1, 1, ..., 1]

In [27]:
kernel1 = model.weight[0,0]
kernel2 = model.weight[0,1]

kernel1.sum() * 1 + kernel2.sum() * 2

tensor(0.8920, grad_fn=<AddBackward0>)
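
The same decomposition holds for arbitrary data, not just constant series: a multi-channel convolution is the sum of one single-channel convolution per input channel. The sketch below checks this with random inputs using `torch.nn.functional.conv1d`; the seed and tensor names are arbitrary choices for this example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
layer = torch.nn.Conv1d(in_channels=2, out_channels=1, kernel_size=5, bias=False)
x = torch.randn(1, 2, 15)

full = layer(x)

# Convolve each input channel with its own kernel, then add the results.
per_channel = sum(
    F.conv1d(x[:, c:c + 1], layer.weight[:, c:c + 1])
    for c in range(2)
)

matches = bool(torch.allclose(full, per_channel, atol=1e-6))
print(full.shape, matches)
```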

In [19]:
a.view(1,1,-1)

tensor([[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 2.,
          2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.]]])

Stride is how many steps the kernel takes along the sequence before making a new calculation (a stride of 0 would never move off the first position, so it is not a usable setting)
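
To see the effect of stride on its own, here is a minimal sketch that reuses the moving-average weights from earlier and only varies `stride`. The input series and the `lengths` dictionary are made up for this illustration.

```python
import torch

x = torch.arange(15).float().view(1, 1, -1)

lengths = {}
outputs = {}
for stride in (1, 2, 3):
    layer = torch.nn.Conv1d(1, 1, kernel_size=5, stride=stride, bias=False)
    torch.nn.init.constant_(layer.weight, 1 / 5)  # unweighted moving average
    out = layer(x).detach().view(-1)
    lengths[stride] = out.shape[0]
    outputs[stride] = out

print(lengths)
print(outputs[2])  # averages of the windows starting at t = 0, 2, 4, ...
```

A larger stride keeps the same kernel but evaluates it at fewer positions, so the output gets shorter without changing what each individual output value means.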

In [46]:
model = torch.nn.Conv1d(in_channels = 1, out_channels = 1, kernel_size =5,
                stride=1, padding=0, dilation=2, groups=1, 
                bias=False, padding_mode='zeros')

print(model.weight)
model(a.view(1,1,-1))

Parameter containing:
tensor([[[-0.4267,  0.4077, -0.1108,  0.2961,  0.1737]]], requires_grad=True)


tensor([[[0.8097, 0.8097, 0.8097, 0.8097, 0.8097, 0.6989, 0.6989, 0.6989,
          0.6989, 0.6989]]], grad_fn=<SqueezeBackward1>)

In [39]:
model.weight[0,0,1:].sum()

tensor(0.4398, grad_fn=<SumBackward0>)

In [33]:
model.weight[].sum()*2


tensor(-0.0668, grad_fn=<MulBackward0>)

In [47]:
torch.tensor([ 0.2745, -0.1809,  0.0806, -0.3044, -0.2521]).sum()

tensor(-0.3823)

In [48]:
torch.tensor([ 0.2901,  0.1385, -0.2273, -0.1210,  0.2087]).sum()

tensor(0.2890)

In [50]:
-0.3823 + 0.2890*2

0.19569999999999999

In [54]:
torch.tensor([0.0807,  0.1279, -0.1774, -0.0982, -0.0345]).sum() + 2 * torch.tensor([-0.0052,  0.0016,  0.1569,  0.2386,  0.1325]).sum()

tensor(0.9473)