
## Multivariate time series prediction using Transformer


### This document explains, to the best of my understanding, how a Transformer encoder network processes its input.
<i> Note: PatchTST is a popular time series forecasting model that uses a Transformer encoder to make predictions.</i>

Here I explain how multi-head attention works.

Let's suppose the input to the model is a sequence given by
$$[x_1, x_2, x_3, ..., x_{12}]$$ (these are tokens input to the model)

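We first split it into patches of some patch size, say 4, so that each patch holds 4 consecutive values, and transpose the matrix of patches. The result is the following matrix (the input from which the embeddings are computed).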
$$ Input (I) = \begin{bmatrix}
x_1 & x_2 & x_3 & x_4 \\
x_5 & x_6 & x_7 & x_8 \\
x_9 & x_{10} & x_{11} & x_{12} \\  
\end{bmatrix}^T = 
\begin{bmatrix}
x_1 & x_5 & x_9  \\
x_2 & x_6 & x_{10} \\
x_3 & x_7 & x_{11} \\
x_4 & x_8 & x_{12}
\end{bmatrix}  (4 \times 3)
$$
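The reshaping step can be sketched in a few lines of NumPy. This is only an illustration: the values 1 to 12 stand in for $x_1, ..., x_{12}$, and the variable names are my own.

```python
import numpy as np

# Placeholder values standing in for x_1 ... x_12
x = np.arange(1, 13)

patch_len = 4                       # patch size used in the text
patches = x.reshape(-1, patch_len)  # 3 patches of 4 consecutive values each
I = patches.T                       # transpose to the 4 x 3 layout shown above

print(patches)
print(I)
print(I.shape)
```

Each column of `I` is one patch, and each row collects the values that sit at the same offset within their patch.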

We then apply a fully connected (FC) layer with a weight matrix of shape $(3 \times 10)$, which projects $I$ to a $(4 \times 10)$ matrix. Positional encodings are then added to this projection so that the sequence order information is retained; this addition does not change the shape. Let's denote the resulting $(4 \times 10)$ matrix by $x_d$.

Now the output of the FC layer is passed through the multi-head attention block.
The number of heads is a hyperparameter that you choose. I will explain what happens in one of those heads.

Each head has three weight matrices; let's call them $W_Q, W_K, W_V$.
We compute the following.
$$ Q_h = x_d \times W_Q $$
$$ K_h = x_d \times W_K $$
$$ V_h = x_d \times W_V $$

Note that $Q_h$, $K_h$ and $V_h$ each have the same number of columns, some user-defined value $d_k$.

The attention value for a head is calculated as 
$$
O_h = softmax\left( \frac{Q_h K_h^T}{\sqrt{d_k}} \right) V_h
$$

The output $O_h$ has dimensions $(4 \times d_k)$. Let's suppose $d_k = 10$, so that $O_h$ is $(4 \times 10)$.
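To make the single-head computation concrete, here is a minimal NumPy sketch. The random matrices are stand-ins for learned weights, and the sizes (4 rows, model width 10, $d_k = 10$) are simply the ones from the running example, not values taken from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)

n_rows, d_model, d_k = 4, 10, 10          # sizes from the running example
x_d = rng.normal(size=(n_rows, d_model))  # stand-in for the FC layer output

# Randomly initialised projection matrices (these are learned in a real model)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

def softmax(z):
    # Subtract the row maximum for numerical stability, then normalise each row
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

Q_h, K_h, V_h = x_d @ W_Q, x_d @ W_K, x_d @ W_V
scores = Q_h @ K_h.T / np.sqrt(d_k)   # (4, 4): one score per pair of rows
weights = softmax(scores)             # every row sums to 1
O_h = weights @ V_h                   # (4, d_k)

print(weights.shape, O_h.shape)
```

Row $i$ of `weights` says how strongly row $i$ of the input attends to every other row, and `O_h` is the corresponding weighted average of the value vectors.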

In general there are multiple heads, say $H$ of them, so we have the attention head outputs $O_1, O_2, ..., O_H$. \
The head outputs are concatenated column-wise to get the multi-head attention output
$$O_{multi-head} = Concat(O_1, O_2, O_3, ... , O_H)$$

A weight matrix $W^o$ (the output projection matrix) is used to bring $O_{multi-head}$ back to the shape of $x_d$.
We compute

$O' = O_{multi-head} \times W^o$ \
and add the residual (skip) connection \
$O_{residual} = O' + x_d$

This $O_{residual}$ is then layer-normalized (in the standard encoder layer) and passed to the feedforward network to get the required output.
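The concatenation, output projection and residual addition can be sketched as follows. This is an illustrative NumPy example with random weights and an arbitrarily chosen 3 heads; layer normalization and the feedforward network are left out.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rows, d_model, n_heads, d_k = 4, 10, 3, 10   # illustrative sizes only

x_d = rng.normal(size=(n_rows, d_model))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(x, W_Q, W_K, W_V):
    """Scaled dot-product attention for one head."""
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

# Each head gets its own (random) W_Q, W_K, W_V
heads = []
for _ in range(n_heads):
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(attention_head(x_d, W_Q, W_K, W_V))

O_multi = np.concatenate(heads, axis=-1)          # (4, n_heads * d_k) = (4, 30)
W_o = rng.normal(size=(n_heads * d_k, d_model))   # output projection back to width d_model
O_prime = O_multi @ W_o                           # (4, 10), same shape as x_d
O_residual = O_prime + x_d                        # residual (skip) connection

print(O_multi.shape, O_prime.shape, O_residual.shape)
```

The projection $W^o$ is what makes the residual addition possible: without it, the concatenated output would be wider than $x_d$ and the two could not be summed.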



In [1]:
import torch
import numpy as np
import math

import pandas as pd
from sklearn.preprocessing import StandardScaler



A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/cuser/Documents/cybertraining/venv/lib/python3.12/site-packages/ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "/Users/cuser/Documents/cybertraining/venv/lib/python3.12/site-packages/traitlets/config/application.py", line 1075, in launch_instance
    app.start()
  File "/Users/cuser/Documents/cybertraining/venv/lib/python3.12/site-packages/ipykernel/kernelapp.py", line 7

In [2]:
path = '../dataset/chattahoochee_3hr_02336490.csv'

In [3]:
torch.cuda.is_available()

False

Here we define **RiverData**, a custom Dataset class for loading our dataset. It extends the PyTorch Dataset class.  
- We define the \_\_init__() function, which can be used for loading data from a file and, optionally, for data preprocessing.
- We then define the \_\_len__() function, which returns the length of the dataset.
- Finally we define the \_\_getitem__() function, which returns one (feature, label) tuple that can be used for model training.
  For our time series data, the feature is the window of past values fed to the model and the label is the future values to be predicted.
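The windowing logic behind \_\_getitem__() can be illustrated without pandas or PyTorch. The helper `window_bounds` below is hypothetical (it is not part of the notebook); it only shows which positions end up in the feature and which in the label for a given index.

```python
def window_bounds(idx, seq_len, pred_len):
    """Return (feature_start, feature_end, label_start, label_end) as half-open ranges."""
    feature_start = idx
    feature_end = idx + seq_len
    label_start = feature_end            # the label starts right after the feature window
    label_end = label_start + pred_len
    return feature_start, feature_end, label_start, label_end

series = list(range(20))   # stand-in for 20 time steps of a single variable
seq_len, pred_len = 8, 1   # same values as used later in the notebook

fs, fe, ls, le = window_bounds(3, seq_len, pred_len)
feature = series[fs:fe]    # 8 past values starting at position 3
label = series[ls:le]      # the single value that follows them

print(feature, label)
```

Consecutive indices give overlapping windows shifted by one time step, which is why the dataset length is close to the length of the series.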

In [4]:
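class RiverData(torch.utils.data.Dataset):
    """Sliding-window dataset: seq_len past rows as feature, pred_len future target values as label."""
    
    def __init__(self, df, target, datecol, seq_len, pred_len):
        self.df = df
        self.datecol = datecol
        self.target = target
        self.seq_len = seq_len
        self.pred_len = pred_len
        self.setIndex()
        

    def setIndex(self):
        # set_index returns a new frame, so the (possibly sliced) frame passed in is not modified in place
        self.df = self.df.set_index(self.datecol)
    

    def __len__(self):
        # number of window start positions served by __getitem__
        return len(self.df) - self.seq_len - self.pred_len


    def __getitem__(self, idx):
        if len(self.df) <= (idx + self.seq_len + self.pred_len):
            raise IndexError(f"Index {idx} is out of bounds for dataset of size {len(self.df)}")
        # feature: all columns for the seq_len rows starting at idx
        df_piece = self.df.iloc[idx:idx + self.seq_len].values
        feature = torch.tensor(df_piece, dtype=torch.float32)
        # label: the target column for the pred_len rows that follow the feature window
        label_piece = self.df[self.target].iloc[idx + self.seq_len:idx + self.seq_len + self.pred_len].values
        label = torch.tensor(label_piece, dtype=torch.float32)
        return (feature, label)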

### Normalize the data

In [5]:
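df = pd.read_csv(path)
# Scale only the numeric columns; the DATE column is re-attached afterwards
raw_df = df.drop('DATE', axis=1, inplace=False)
scaler = StandardScaler()

# Standardize each column to zero mean and unit variance.
# Note: the scaler is fit on the full series, so the validation and test rows also influence the scaling.
df_scaled = scaler.fit_transform(raw_df)

df_scaled = pd.DataFrame(df_scaled, columns=raw_df.columns)
df_scaled['DATE'] = df['DATE']
df = df_scaled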

Some advanced Python syntax is used here. \
`*common_args`: unpacks a Python list into the positional arguments of a function call \
`**common_args`: unpacks a Python dictionary into the keyword arguments of a function call
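A small standalone example of both forms of unpacking is shown below. The functions `describe` and `make_loader_config` and the values passed to them are made up for illustration; they only mimic the way the dataset and data loader arguments are passed in this notebook.

```python
def describe(target, datecol, seq_len, pred_len):
    return f"{target} indexed by {datecol}: {seq_len} steps in, {pred_len} out"

# A list is unpacked into positional arguments, in order
positional = ["level", "DATE", 8, 1]
message = describe(*positional)   # same as describe("level", "DATE", 8, 1)
print(message)

def make_loader_config(batch_size, shuffle):
    return {"batch_size": batch_size, "shuffle": shuffle}

# A dictionary is unpacked into keyword arguments, matched by name
keyword = {"batch_size": 512, "shuffle": False}
config = make_loader_config(**keyword)   # same as make_loader_config(batch_size=512, shuffle=False)
print(config)
```

This keeps the repeated arguments in one place, so the train, validation and test objects are guaranteed to be built with the same settings.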

In [6]:

train_size = int(0.7 * len(df))
test_size = int(0.2 * len(df))
val_size = len(df) - train_size - test_size

seq_len = 8
pred_len = 1
num_features = 7
num_layers = 1


common_args = ['gaze_height', 'DATE', seq_len, pred_len]
train_dataset = RiverData(df[:train_size], *common_args)
val_dataset = RiverData(df[train_size: train_size+val_size], *common_args)
test_dataset = RiverData(df[train_size+val_size : len(df)], *common_args)


In [7]:
# Important parameters

BATCH_SIZE = 512 # keep as big as can be handled by GPU and memory
SHUFFLE = False # we don't shuffle the time series data
DATA_LOAD_WORKERS = 1 # it depends on amount of data you need to load

In [8]:
from torch.utils.data import DataLoader

common_args = {'batch_size': BATCH_SIZE, 'shuffle': SHUFFLE}
train_loader = DataLoader(train_dataset, **common_args)
val_loader = DataLoader(val_dataset, **common_args)
test_loader = DataLoader(test_dataset, **common_args)

### Here we define our PyTorch model.

BasicTransformerNetwork is the model class; it extends the Module class provided by PyTorch.
- We define the \_\_init__() function, which sets up the layers and defines the model parameters.
- We also define the forward() function, which specifies how the forward pass is computed.
- We also implement the PositionalEncoding class, which is an important part of the Transformer.
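For reference, the sinusoidal positional encoding from the original Transformer formulation, which I assume the PositionalEncoding class is meant to follow, can be written as below. Here $pos$ is the position in the sequence, $i$ indexes pairs of embedding columns, and $d_{model}$ is the embedding size.

$$
PE_{(pos,\, 2i)} = \sin\left( \frac{pos}{10000^{2i/d_{model}}} \right), \qquad
PE_{(pos,\, 2i+1)} = \cos\left( \frac{pos}{10000^{2i/d_{model}}} \right)
$$

Even columns use the sine and odd columns the cosine of the same angle, and the wavelength grows geometrically with $i$, so every position gets a distinct pattern across the embedding columns.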

In [9]:
# The transformer encoder in PyTorch does not add positional information itself,
# so we implement the positional encoding, an essential part of the transformer model.

# Each position gets a d_model-long vector of sines (even columns) and cosines (odd columns)
# at geometrically spaced frequencies, which lets the encoder tell time steps apart.
class PositionalEncoding(torch.nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = torch.nn.Dropout(p=dropout)

        Xp = torch.zeros(max_len, d_model) # max_len x d_model
        position = torch.arange(0, max_len).unsqueeze(1) # max_len x 1
        # base 10000, as in the original Transformer formulation
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)) # length: d_model/2

        # Applying sine to even indices in the array; 2i
        Xp[:, 0::2] = torch.sin(position.float() * div_term)

        # Applying cosine to odd indices in the array; 2i + 1
        Xp[:, 1::2] = torch.cos(position.float() * div_term)

        # The encoder is used with batch_first=True, so add a leading batch axis: 1 x max_len x d_model
        Xp = Xp.unsqueeze(0)
        self.register_buffer('Xp', Xp)

    def forward(self, x):
        # x: batch x seq_len x d_model; take the first seq_len positions and broadcast over the batch
        x = x + self.Xp[:, :x.size(1)]
        return self.dropout(x)


class BasicTransformerNetwork(torch.nn.Module):
    
    def __init__(self, seq_len, pred_len):
        # call the constructor of the base class
        super().__init__()
        self.model_type = 'Transformer'
        self.seq_len = seq_len
        self.pred_len = pred_len
        # num_features and num_layers are read from the notebook-level variables defined earlier
        self.num_features = num_features

        # I don't think the embedding size should be this big. We will see.
        self.embedding_size = 128 # each time step's features are projected to a 128-dimensional embedding
        self.num_layers = num_layers
        self.pos_encoder = PositionalEncoding(self.embedding_size, 0.1, 10000)
        
        
        self.encLayer = torch.nn.TransformerEncoderLayer(d_model=self.embedding_size, nhead=8, 
                                                 dim_feedforward=256, dropout=0.1, activation="relu", 
                                                 layer_norm_eps=1e-05, batch_first=True, norm_first=False, bias=True, 
                                                 device=None, dtype=None)
        
        self.transformerEnc = torch.nn.TransformerEncoder(self.encLayer, num_layers=self.num_layers)

        self.input_fc = torch.nn.Linear(self.num_features, self.embedding_size)
        self.relu = torch.nn.ReLU() # currently not used in forward()
        
        self.output_fc1 = torch.nn.Linear(self.embedding_size, self.pred_len)
        self.output_fc2 = torch.nn.Linear(self.seq_len, 1)

    
    def forward(self, x):
        # x: batch x seq_len x num_features
        x = self.input_fc(x) * math.sqrt(self.embedding_size) # batch x seq_len x embedding_size
        x = self.pos_encoder(x)
        out = self.transformerEnc(x) # batch x seq_len x embedding_size
        out = self.output_fc1(out) # batch x seq_len x pred_len
        out = out.transpose(1,2) # batch x pred_len x seq_len
        out = self.output_fc2(out) # batch x pred_len x 1
        out = out.squeeze(-1) # batch x pred_len
        return out
# Note that the gradients are stored inside the parameters of the layer objects (e.g. the FC layers).
# After each training batch we need to reset these gradients so that they do not accumulate.

In [10]:
print(torch.__version__)

2.2.2


In [11]:
model = BasicTransformerNetwork(seq_len, pred_len)
loss = torch.nn.MSELoss()
learning_rate = 2e-2
optimizer = torch.optim.SGD(model.parameters(), lr = learning_rate)

In [12]:
for gen in model.parameters():
    print(gen.shape)

torch.Size([384, 128])
torch.Size([384])
torch.Size([128, 128])
torch.Size([128])
torch.Size([256, 128])
torch.Size([256])
torch.Size([128, 256])
torch.Size([128])
torch.Size([128])
torch.Size([128])
torch.Size([128])
torch.Size([128])
torch.Size([384, 128])
torch.Size([384])
torch.Size([128, 128])
torch.Size([128])
torch.Size([256, 128])
torch.Size([256])
torch.Size([128, 256])
torch.Size([128])
torch.Size([128])
torch.Size([128])
torch.Size([128])
torch.Size([128])
torch.Size([128, 7])
torch.Size([128])
torch.Size([1, 128])
torch.Size([1])
torch.Size([1, 8])
torch.Size([1])


In [13]:
for i, (f,l) in enumerate(train_loader):
    print('features shape: ', f.shape)
    print('labels shape: ', l.shape)
    break

features shape:  torch.Size([512, 8, 7])
labels shape:  torch.Size([512, 1])


In [14]:
# define metrics
epsilon = np.finfo(float).eps

def wape_function(y, y_pred):
    """Weighted Average Percentage Error in percent: 100 * sum|y - y_pred| / sum|y|.

    The value is >= 0 and can exceed 100 when the total error is larger than the total of |y|.
    """
    y = np.array(y)
    y_pred = np.array(y_pred)
    numerator = np.sum(np.abs(np.subtract(y, y_pred)))
    denominator = np.add(np.sum(np.abs(y)), epsilon) # epsilon avoids division by zero
    wape = np.divide(numerator, denominator) * 100.0
    return wape

def nse_function(y, y_pred):
    """Nash-Sutcliffe Efficiency: 1 is a perfect fit, 0 is no better than predicting the mean of y."""
    y = np.array(y)
    y_pred = np.array(y_pred)
    return (1-(np.sum((y_pred-y)**2)/np.sum((y-np.mean(y))**2)))


def evaluate_model(model, data_loader):
    # The following line puts the model in evaluation mode. It disables dropout, which is used in
    # the positional encoding and in the encoder layer, so the predictions are deterministic.

    model.eval()
    # Collect all batches so the metrics are computed over the whole split at once
    all_inputs = torch.empty((0, seq_len, num_features))
    all_labels = torch.empty(0, pred_len)
    for inputs, labels in data_loader:
        all_inputs = torch.vstack((all_inputs, inputs))
        all_labels = torch.vstack((all_labels, labels))
    
    # no_grad: gradients are not needed for evaluation
    with torch.no_grad():
        outputs = model(all_inputs)
        nse = nse_function(all_labels.numpy(), outputs.numpy())
        wape = wape_function(all_labels.numpy(), outputs.numpy())
        
    print(f'NSE : {nse}', end=' ')
    print(f'WAPE : {wape}')
    
    # switch back to training mode for the next epoch
    model.train()
    return nse, wape


In [15]:
num_epochs = 30

for epoch in range(num_epochs):
    epoch_loss = []
    for batch_idx, (inputs, labels) in enumerate(train_loader):
        outputs = model(inputs)
        loss_val = loss(outputs, labels)

        # calculate gradients for back propagation
        loss_val.backward()

        # update the weights based on the gradients
        optimizer.step()

        # reset the gradients, avoid gradient accumulation
        optimizer.zero_grad()
        epoch_loss.append(loss_val.item())
    
    print(f'Epoch {epoch+1}: {sum(epoch_loss)/len(epoch_loss)}', end=' ')
    nse, wape = evaluate_model(model, val_loader)
    
        



Epoch 1: 0.39835770916322183 

RuntimeError: Numpy is not available

In [16]:
evaluate_model(model, test_loader)

NSE : 0.8960729837417603 WAPE : 29.622249249767947


(np.float32(0.896073), np.float64(29.622249249767947))
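To make the two reported metrics easier to interpret, here is a tiny worked example on made-up numbers (they are not taken from the river dataset). It recomputes NSE and WAPE directly from their definitions.

```python
import numpy as np

y = np.array([2.0, 4.0, 6.0, 8.0])        # made-up observations
y_pred = np.array([2.5, 3.5, 6.0, 9.0])   # made-up predictions

# NSE: 1 - (sum of squared errors) / (sum of squared deviations from the observed mean)
sse = np.sum((y_pred - y) ** 2)     # 0.25 + 0.25 + 0 + 1 = 1.5
sst = np.sum((y - y.mean()) ** 2)   # 9 + 1 + 1 + 9 = 20
nse = 1 - sse / sst                 # 1 - 0.075 = 0.925

# WAPE: total absolute error as a percentage of the total absolute observed value
wape = np.sum(np.abs(y - y_pred)) / np.sum(np.abs(y)) * 100   # 2 / 20 * 100 = 10

print(nse, wape)
```

An NSE close to 1 means the squared error is small relative to the variance of the observations, while WAPE expresses the absolute error relative to the magnitude of the observations. Because the two metrics normalize the error differently, a model can score well on one and less well on the other.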

In [None]:
# Plot the results with the metrics inside it