<center><img src="../img/img_0.PNG"  width="1000" height="240"/></center>


This notebook demonstrate an apllication of SimMTM, a simple self-supervised learning framework for time series modeling. Self-supervised learning is a learning paradigm that allows model to learn a good representation from the input data itself. The learned representation will be beneficial to some downstream tasks such as forecasting, classification and outlier detection. 

Self-supervised learning has a lof of success and achieves state-of-the-art performance in some domains, especially in the image domain. In this demo, we will show a self-supervised learning method, SimMTM, in the time-series domain. SimMTM adopts both masked modeling and contrastive modeling to learn a good representation of the input data. By using the learned representation and finetuning it, we achieve a significant improvement compared to the model without self-supervised learning. 

In [1]:
import os
try:
    os.chdir('src')
except:
    pass
print(os.getcwd())

/home/shamvinc/ssl_time_series/mvts_transformer/src


In [2]:
%load_ext autoreload
%autoreload 2
%reload_ext autoreload


In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import numpy as np
from tqdm import tqdm
import copy

import math

from datasets.datasplit import split_dataset
from datasets.data import data_factory, Normalizer, TSRegressionArchive, CSVRegressionArchive
from datasets.datasplit import split_dataset
from datasets.dataset import collate_superv
from models.ts_transformer import model_factory
from models.loss import get_loss_module, contrastive_loss
from optimizers import get_optimizer

from options import Options
from running import setup


# Masked Modeling

Self-supervision via a ‘pretext task’ on input data combined with finetuning on labeled data is widely used for improving model performance in language and computer
vision. One of the popular self-supervision tasks on language data is masked modeling. Masking modeling is to mask some of the input entries randomly and predict those masked entries by using unmasked entries. By masked modeling, the model can learn the relationship through different features and different timesteps. 

<img src="../img/img_1.PNG"  width="900" height="240"/>

<img src="../img/img_2.PNG"  width="900" height="240"/>

# Masking Choice
### Random Masking

Random Masking is not a good choice to learn a good representation because the model can simply learn to take the average from the neighbour values. 



<img src="../img/img_3.PNG" width="600"/>

### Geometric Masking

Instead, we choose to use the geometric masking method, which is to mask a sequence of the input data randomly. The length of the sequence is followed by a geometric distribution. In this case, the model requires to recover a masked sequence from other unmasked input data. We suggest the expected length of a masked sequence is a half of the whole time series sequence.

In [4]:
def geom_noise_mask_single(L, lm, masking_ratio):
    """
    Randomly create a boolean mask of length `L`, consisting of subsequences of average length lm, masking with 0s a `masking_ratio`
    proportion of the sequence L. The length of masking subsequences and intervals follow a geometric distribution.
    Args:
        L: length of mask and sequence to be masked
        lm: average length of masking subsequences (streaks of 0s)
        masking_ratio: proportion of L to be masked

    Returns:
        (L,) boolean numpy array intended to mask ('drop') with 0s a sequence of length L
    """
    keep_mask = np.ones(L, dtype=bool)
    p_m = 1 / lm  # probability of each masking sequence stopping. parameter of geometric distribution.
    p_u = p_m * masking_ratio / (1 - masking_ratio)  # probability of each unmasked sequence stopping. parameter of geometric distribution.
    p = [p_m, p_u]

    # Start in state 0 with masking_ratio probability
    state = int(np.random.rand() > masking_ratio)  # state 0 means masking, 1 means not masking
    for i in range(L):
        keep_mask[i] = state  # here it happens that state and masking value corresponding to state are identical
        if np.random.rand() < p[state]:
            state = 1 - state

    return keep_mask

# SimMTM ultilizes both contrastive learning and mask modeling to learn the data representation.
## 1 - Contrastive Learning

when we mask the input time series data, we create many masked views of the input data. We expect that the distance between two views of the same time series sequence is minimized while maximizing the distance between two different sequences.

<img src="../img/img_5.png"/>

## The contrastive loss is the following: (Eq. 8 in the paper)

<center><img src="../img/img_6.PNG"/><center/>

In [5]:
def demo_contrastive_loss(s, batch_size, tau=0.05):
    s = s.squeeze(-1) 

    B = s.shape[0]
    v = s.reshape(B, -1)

    norm_v = torch.norm(v, p=2, dim=-1).unsqueeze(-1)
    v = v/norm_v
    u = torch.transpose(v, 0, 1)

    R = torch.matmul(v,u)

 
    R = torch.exp(R/tau) # (batch + mask size) x (batch + mask size)
    
    # number of masks
    M = B//batch_size
    mask = torch.eye(batch_size, device=R.device).repeat_interleave(M,dim=0).repeat_interleave(M,dim=1)

    denom = R * (torch.ones_like(R) - torch.eye(R.shape[0], device=R.device))

    denom = R.sum(-1).unsqueeze(-1)

    loss = torch.log(R/denom)
    

    loss = (loss * (mask - torch.eye(R.shape[0], device=R.device))).sum(1)/(M-1) # except no masked unit
    loss = loss.mean(0)
    
    return -loss


## 2 - Masked Modeling

SimMTM proposes to recover a time serie by the weighted sum of multiple masked points, which eases the reconstruction task by assembling ruined but complementary temporal variations.

<img src="../img/img_4.png"/>

In [6]:
from models.ts_transformer import LearnablePositionalEncoding, TransformerBatchNormEncoderLayer

class DemoSimMTMTransformerEncoder(nn.Module):
    
    def __init__(self, max_len, feat_dim, out_len, out_dim, d_model=16, n_heads=4, num_layers=2, dim_feedforward=32, dropout=0.2, temporal_unit=3):
        super(DemoSimMTMTransformerEncoder, self).__init__()

        self.max_len = max_len
        self.d_model = d_model
        self.n_heads = n_heads
        
        self.tau = 0.05
        self.mask_length = max_len//2
        self.mask_rate = 0.5

        self.project_inp = nn.Linear(feat_dim, d_model)
        self.projector_layer = nn.Linear(max_len, 1)
        self.pos_enc1 = LearnablePositionalEncoding(d_model, dropout=dropout, max_len=max_len)
        self.pos_enc2 = LearnablePositionalEncoding(d_model, dropout=dropout, max_len=out_len)
        
        self.act = F.gelu 

        # encoder_layer = nn.TransformerEncoderLayer(d_model, self.n_heads, dim_feedforward, dropout, activation='gelu')
        encoder_layer = TransformerBatchNormEncoderLayer(d_model, self.n_heads, dim_feedforward, dropout, activation='gelu')

        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers)

        self.output_layer = nn.Linear(d_model, feat_dim)

        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout2d(dropout)

        # self.predict_layer1 = nn.Conv1d(d_model, 512, 5, stride=1)
        self.predict_layer1 = nn.Linear(max_len, out_len)
        self.predict_layer2 = nn.Linear(d_model, out_dim)
        # self.bn = nn.BatchNorm1d(d_model)

        self.feat_dim = feat_dim

        self.temporal_unit = temporal_unit

        self.w1 = torch.nn.parameter.Parameter(data=torch.ones(1), requires_grad=True)
        self.w2 = torch.nn.parameter.Parameter(data=torch.ones(1), requires_grad=True)
        
        
    def forward(self, X):
        """
        Reconstruct the input and create the projected output of X
        
        Args:
            X: (batch_size, seq_length, feat_dim) torch tensor of original input

        Returns:
            output: (batch_size, seq_length, feat_dim)
            s: (batch_size, d_model, 1)
        """

        
        _x = X
         
        # Create masked views of the input X
        for i in range(self.temporal_unit):
            mask = geom_noise_mask_single(X.shape[0] * X.shape[1] * X.shape[2], self.mask_length, self.mask_rate)
            mask = mask.reshape(X.shape[0], X.shape[1], X.shape[2])
            mask = torch.from_numpy(mask).to(X.device)
            x_masked = mask * X
            _x = torch.cat([_x, x_masked], axis=-1) # [batch_size, seq_length, feat_dim * temporal_unit]
    
        
        _x = _x.reshape(X.shape[0] * (self.temporal_unit + 1), X.shape[1], X.shape[2])
  

        inp = _x.permute(1, 0, 2)
        inp = self.project_inp(inp) * np.sqrt(self.d_model)  # [seq_length, batch_size, d_model] project input vectors to d_model dimensional space
        inp = self.pos_enc1(inp)  # add positional encoding

        
        output = self.transformer_encoder(inp)  # (seq_length, batch_size, d_model)
        output = self.act(output)  # the output transformer encoder/decoder embeddings don't include non-linearity
        output = output.permute(1, 0, 2)  # (batch_size, seq_length, d_model)
        output = self.dropout1(output)

        z_hat, _s = self.project(output, self.tau)
        # Most probably defining a Linear(d_model,feat_dim) vectorizes the operation over (seq_length, batch_size).
        output = self.output_layer(z_hat)  # (batch_size, seq_length, feat_dim)

        return output, _s

    
    def project(self, z, tau):
        """
        Output a weighted average of z
        
        Args:
            X: (batch_size, seq_length, feat_dim) torch tensor of original input

        Returns:
            z_hat: (batch_size, seq_length, d_model)
            s: (batch_size, d_model, 1)
        """
        _z = z.transpose(1, 2) # [batch_size, d_model, seq_length]
        _s = s = self.projector_layer(_z) # [batch_size, d_model, 1]
        
        if self.training:
            mask = torch.ones(1, self.d_model, 1).to(z.device)
            mask = self.dropout3(mask)
            s = s * mask 
            s = s + torch.randn(s.shape).to(z.device) * 1e-2
        
        
        s = s.squeeze(-1) 
        B = s.shape[0]
        v = s.reshape(B, -1)

        norm_v = torch.norm(v, p=2, dim=-1).unsqueeze(-1)
        v = v/norm_v
        u = torch.transpose(v, 0, 1)
        
        R = torch.matmul(v,u)
     
  
        R = torch.exp(R/tau) # (batch + mask size) x (batch + mask size)
        R = R * (torch.ones_like(R) - torch.eye(R.shape[0], device=R.device)) # zero out the weight of no masked component
        R = R/R.sum(-1).unsqueeze(-1)
        M = self.temporal_unit + 1
        R = R[::M] # extract every no mask unit # (batch size) x (batch + mask size)

        z_hat = (R.unsqueeze(-1).unsqueeze(-1) * z.unsqueeze(0)).sum(1) 
        return z_hat, _s


    def predict(self, X):
        """
        Predict an output given X
        
        Args:
            z: (batch_size, seq_length, d_model) torch tensor of representations of input
            tau: temperture of similarity matrix

        Returns:
            output: (batch_size, out_seq_len, out_dim)
        """
        
        # permute because pytorch convention for transformers is [seq_length, batch_size, feat_dim]. padding_masks [batch_size, feat_dim]
        inp = X.permute(1, 0, 2)
        inp = self.project_inp(inp) * np.sqrt(self.d_model)  # [seq_length, batch_size, d_model] project input vectors to d_model dimensional space
        inp = self.pos_enc1(inp)  # add positional encoding
        # NOTE: logic for padding masks is reversed to comply with definition in MultiHeadAttention, TransformerEncoderLayer

        output = self.transformer_encoder(inp)
        output = output.permute(1, 0, 2)  # (batch_size, seq_length, d_model)
        # output = self.dropout1(output)
       
        output = output.transpose(1, 2) # (batch_size, d_model, seq_length)
        output = self.predict_layer1(output)
        # output = self.act(output)
        
        output = output.transpose(1, 2) # (batch_size, seq_length, d_model)
        output = output.permute(1, 0, 2)
        output = self.pos_enc2(output)
        output = output.permute(1, 0, 2)
        output = self.dropout2(output)
        output = self.predict_layer2(output) 
        return output



# Data Loading and Preparation

In this demo, we use a benchmask time series dataset called AppliancesEnergy.

This dataset is part of the Monash, UEA & UCR time series regression repository. http://tseregression.org/

The goal of this dataset is to predict total energy usage in kWh of a house. This dataset contains 138 time series obtained from the Appliances Energy Prediction dataset from the UCI repository. The time series has 24 dimensions. This includes temperature and humidity measurements of 9 rooms in a house, monitored with a ZigBee wireless sensor network. It also includes weather and climate data such as temperature, pressure, humidity, wind speed, visibility and dewpoint measured from Chievres airport. The data set is averaged for 10 minutes period and spanning 4.5 months.

Please refer to https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction  for more details


In [7]:
args = Options().parse()  
args.data_dir = '../datasets/AppliancesEnergy'
args.task = 'regression'
args.output_dir = '../experiments'
config = setup(args)
from datasets.data import CSVRegressionArchive
data = TSRegressionArchive(config['data_dir'], pattern='TRAIN', config=config)
test_data = TSRegressionArchive(config['data_dir'], pattern='TEST', config=config)
_data = data

# Standard Normalization
normalizer = Normalizer(config['normalization'])
data.feature_df = normalizer.normalize(data.feature_df)
test_data.feature_df = normalizer.normalize(test_data.feature_df)


                

2023-08-24 07:59:40,122 | INFO : Stored configuration file in '../experiments/_2023-08-24_07-59-39_pBT'
119it [00:04, 29.64it/s]
66it [00:01, 38.10it/s] 


In [8]:
data.feature_df

Unnamed: 0,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5,dim_6,dim_7,dim_8,dim_9,...,dim_14,dim_15,dim_16,dim_17,dim_18,dim_19,dim_20,dim_21,dim_22,dim_23
0,-0.269570,-2.153832,-0.259319,-2.121628,-0.999057,-1.514857,-0.514935,-1.837894,-1.126190,-0.656912,...,0.267595,-1.654712,-0.454857,-0.464965,-0.418045,0.330620,-1.948513,1.202557,0.106478,-1.981361
0,-0.269570,-2.089203,-0.288860,-2.089363,-0.999057,-1.497081,-0.514935,-1.823090,-1.126190,-0.672042,...,0.234318,-1.654712,-0.454857,-0.456310,-0.436693,0.342184,-1.915849,1.134967,0.106478,-1.973379
0,-0.269570,-2.056059,-0.303630,-2.027255,-0.999057,-1.464492,-0.528758,-1.786821,-1.088955,-0.668638,...,0.199456,-1.640794,-0.454857,-0.456310,-0.455342,0.353748,-1.883185,1.067377,0.106478,-1.965397
0,-0.287474,-2.006345,-0.330216,-1.962726,-0.999057,-1.464492,-0.528758,-1.758695,-1.126190,-0.675447,...,0.151917,-1.616753,-0.472082,-0.464965,-0.473990,0.365312,-1.850521,0.999787,0.106478,-1.957415
0,-0.323280,-1.988116,-0.343509,-1.907877,-0.999057,-1.464492,-0.544116,-1.726867,-1.126190,-0.686795,...,0.123394,-1.628141,-0.472082,-0.464965,-0.492638,0.376875,-1.817857,0.932197,0.106478,-1.949433
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
94,-0.104462,-1.923486,-0.166268,-1.946594,-0.907768,-1.286734,-0.408963,-1.883045,-1.126190,-0.545512,...,0.437149,-1.565509,-0.454857,-0.503523,-0.371424,0.307492,-2.111834,1.540506,0.106478,-2.101089
94,-0.144247,-2.054402,-0.181038,-2.034514,-0.872534,-1.464492,-0.456574,-1.853438,-1.126190,-0.577475,...,0.405457,-1.589549,-0.454857,-0.489359,-0.380749,0.312118,-2.079170,1.472917,0.106478,-2.077144
94,-0.166129,-2.130632,-0.210578,-2.113562,-0.939800,-1.545471,-0.487290,-1.899329,-1.088955,-0.592606,...,0.375349,-1.602835,-0.454857,-0.482277,-0.390073,0.316744,-2.046506,1.405327,0.106478,-2.053198
94,-0.209893,-2.213490,-0.259319,-2.146632,-0.970229,-1.556334,-0.487290,-1.862320,-1.107573,-0.618706,...,0.346826,-1.616753,-0.454857,-0.482277,-0.399397,0.321369,-2.013842,1.337737,0.106478,-2.029252


In [9]:
max_len = data.feature_df.loc[0].shape[0]
out_size = 1
out_dim = 1
# config['data_window_len'] = max_len
# config['task'] = 'simmtm'
# config['normalization_layer'] = 'BatchNorm'
# config['out_len'] = 24
# config['out_dim'] = 7
# config['d_model'] = 16
# config['dim_feedforward'] = 128
# config['num_heads'] = 4
# config['num_layers'] = 1
# from models.ts_transformer import model_factory
# model = model_factory(config, data)
model = DemoSimMTMTransformerEncoder(max_len=max_len, feat_dim=data.feature_df.shape[1], out_len=out_size, out_dim=out_dim, 
                                     d_model=8, n_heads=4, num_layers=2, dim_feedforward=16)

device = "cuda"
model.to(device)
model.tau = 0.05
model.mask_length = max_len//2
model.mask_ratio = 0.5
model.temporal_unit = 3

In [10]:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

In [11]:
from torch.utils.data import DataLoader
batch_size = 64

train_indices, val_indices, _ = split_dataset(data_indices=data.all_IDs,
                                                         validation_method='ShuffleSplit',
                                                         n_splits=1,
                                                         validation_ratio=0.2,
                                                         test_set_ratio=0,  # used only if test_indices not explicitly specified
                                                         test_indices=None,
                                                         random_seed=1337,
                                                         labels=None)
train_indices = train_indices[0]
val_indices = val_indices[0]
test_indices = np.array(test_data.all_IDs)

train_dataloader = DataLoader(train_indices, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_indices, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_indices, batch_size=batch_size, shuffle=True)



# Self-Supervised Learning Training Loop

In [12]:
i = 0
max_epoch = 10
best_loss = 1e10
best_epoch = 0
device = "cuda"
loss_fn = nn.MSELoss()
best_model = copy.deepcopy(model)

while i < max_epoch:
    train_loss = { "loss": [], "loss_mse": [], "loss_con": []}
    progress_bar = tqdm(train_dataloader)
    
    for IDs in progress_bar:
        model.train()
        X = torch.tensor(data.feature_df.loc[IDs].to_numpy()).to(device)
        X = X.float()
        X = X.reshape(-1, max_len, X.shape[-1])
        # X = X[:, :, -1:]
        
        pred, s = model(X)  # (batch_size, padded_length, feat_dim)
        
        loss_mse = loss_fn(pred, X) 

        loss_con = demo_contrastive_loss(s, X.shape[0])

        loss = 1/(model.w1.pow(2)) * loss_mse + 1/(model.w2.pow(2)) * loss_con + torch.log(model.w1) + torch.log(model.w2)
  


        optimizer.zero_grad()
        loss.backward()

        nn.utils.clip_grad_norm_(model.parameters(), max_norm=4.0)
        optimizer.step()
        # import ipdb; ipdb.set_trace()
        progress_bar.set_description("Epoch {0} - Training loss: {1:.2f} - MSE loss: {2:.2f} - Contrastive loss: {3:.2f}".format(i, 
                loss.cpu().detach().numpy().item(), loss_mse.cpu().detach().numpy().item(), loss_con.cpu().detach().numpy().item())) 
        train_loss["loss"].append(loss)
        train_loss["loss_mse"].append(loss_mse)
        train_loss["loss_con"].append(loss_con)
    
            
    with torch.no_grad():
        val_loss = { "loss": [], "loss_mse": [], "loss_con": []}
        for IDs in val_dataloader:
            model.eval()
            X = torch.tensor(data.feature_df.loc[IDs].to_numpy()).to(device)
            X = X.float()
            X = X.reshape(-1, max_len, X.shape[-1])
            # X = X[:, :, -1:]


            pred, s = model(X)  # (batch_size, padded_length, feat_dim)
        
            loss_mse = loss_fn(pred, X) 

            loss_con = demo_contrastive_loss(s, X.shape[0])

            loss = 1/(model.w1.pow(2)) * loss_mse + 1/(model.w2.pow(2)) * loss_con + torch.log(model.w1) + torch.log(model.w2)

            val_loss["loss"].append(loss)
            val_loss["loss_mse"].append(loss_mse)
            val_loss["loss_con"].append(loss_con)

        train_loss["loss"] = torch.tensor(train_loss["loss"]).mean()
        train_loss["loss_mse"] = torch.tensor(train_loss["loss_mse"]).mean()
        train_loss["loss_con"] = torch.tensor(train_loss["loss_con"]).mean()
        val_loss["loss"] = torch.tensor(val_loss["loss"]).mean()
        val_loss["loss_mse"] = torch.tensor(val_loss["loss_mse"]).mean()
        val_loss["loss_con"] = torch.tensor(val_loss["loss_con"]).mean()

        if val_loss["loss"] < best_loss:
            best_loss = val_loss["loss"]
            best_model = copy.deepcopy(model)
            best_epoch = i
    
        progress_bar.write("Epoch {0} - Training loss: {1:.2f} {2:.2f} {3:.2f} - Validation loss: {4:.2f} {5:.2f} {6:.2f}".format(i, 
            train_loss["loss"].cpu().detach().numpy().item(), train_loss["loss_mse"].cpu().detach().numpy().item(), train_loss["loss_con"].cpu().detach().numpy().item(),
            val_loss["loss"].cpu().detach().numpy().item(), val_loss["loss_mse"].cpu().detach().numpy().item(), val_loss["loss_con"].cpu().detach().numpy().item()))
    i += 1
    
    
tqdm.write("Best Epoch {} - Best Validation loss: {}".format(best_epoch, best_loss))

Epoch 0 - Training loss: 19.16 - MSE loss: 1.42 - Contrastive loss: 17.78: 100%|████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.22it/s]
  0%|                                                                                                                                                           | 0/2 [00:00<?, ?it/s]

Epoch 0 - Training loss: 19.60 1.24 18.38 - Validation loss: 16.28 1.16 15.18


Epoch 1 - Training loss: 18.82 - MSE loss: 0.81 - Contrastive loss: 18.11: 100%|████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.30it/s]
  0%|                                                                                                                                                           | 0/2 [00:00<?, ?it/s]

Epoch 1 - Training loss: 18.75 1.01 17.83 - Validation loss: 11.87 1.12 10.83


Epoch 2 - Training loss: 13.22 - MSE loss: 1.26 - Contrastive loss: 12.08: 100%|████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.32it/s]
  0%|                                                                                                                                                           | 0/2 [00:00<?, ?it/s]

Epoch 2 - Training loss: 15.33 1.18 14.27 - Validation loss: 9.63 1.11 8.61


Epoch 3 - Training loss: 12.89 - MSE loss: 1.31 - Contrastive loss: 11.73: 100%|████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.35it/s]
  0%|                                                                                                                                                           | 0/2 [00:00<?, ?it/s]

Epoch 3 - Training loss: 13.13 1.20 12.08 - Validation loss: 8.95 1.09 7.98


Epoch 4 - Training loss: 11.42 - MSE loss: 1.05 - Contrastive loss: 10.53: 100%|████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.26it/s]
  0%|                                                                                                                                                           | 0/2 [00:00<?, ?it/s]

Epoch 4 - Training loss: 11.49 1.09 10.56 - Validation loss: 8.61 1.08 7.66


Epoch 5 - Training loss: 9.18 - MSE loss: 1.15 - Contrastive loss: 8.20: 100%|██████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.34it/s]
  0%|                                                                                                                                                           | 0/2 [00:00<?, ?it/s]

Epoch 5 - Training loss: 9.50 1.13 8.53 - Validation loss: 7.95 1.06 7.03


Epoch 6 - Training loss: 8.49 - MSE loss: 1.03 - Contrastive loss: 7.63: 100%|██████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.29it/s]
  0%|                                                                                                                                                           | 0/2 [00:00<?, ?it/s]

Epoch 6 - Training loss: 8.72 1.08 7.81 - Validation loss: 7.43 1.06 6.53


Epoch 7 - Training loss: 6.09 - MSE loss: 1.27 - Contrastive loss: 4.96: 100%|██████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.36it/s]
  0%|                                                                                                                                                           | 0/2 [00:00<?, ?it/s]

Epoch 7 - Training loss: 7.09 1.17 6.08 - Validation loss: 7.16 1.03 6.31


Epoch 8 - Training loss: 6.15 - MSE loss: 1.00 - Contrastive loss: 5.31: 100%|██████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.34it/s]
  0%|                                                                                                                                                           | 0/2 [00:00<?, ?it/s]

Epoch 8 - Training loss: 6.81 1.06 5.92 - Validation loss: 7.13 1.03 6.29


Epoch 9 - Training loss: 7.13 - MSE loss: 0.94 - Contrastive loss: 6.39: 100%|██████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.36it/s]


Epoch 9 - Training loss: 7.08 1.04 6.25 - Validation loss: 6.32 1.02 5.49
Best Epoch 9 - Best Validation loss: 6.319032669067383


# Finetune Training Loop


In [13]:
finetune_model = copy.deepcopy(best_model)
optimizer = torch.optim.AdamW(finetune_model.parameters(), lr=1e-3)

In [14]:
i = 0
max_epoch = 10
best_loss = 1e10
best_finetune_model = copy.deepcopy(best_model)
best_epoch = 0
device = "cuda"
finetune_model.to(device)
while i < max_epoch:
    train_loss = []
    progress_bar = tqdm(train_dataloader)
    
    for IDs in progress_bar:
        finetune_model.train()
        
        X = torch.tensor(data.feature_df.loc[IDs].to_numpy()).to(device)
        X = X.reshape(-1, max_len, X.shape[-1])
        targets = torch.tensor(data.labels_df.loc[IDs].to_numpy()).to(device)
        targets = targets.reshape(targets.shape[0], out_size, -1)
        
        pred = finetune_model.predict(X.float())
        pred = pred.reshape(X.shape[0], out_size, -1)
        loss = loss_fn(pred, targets)


        optimizer.zero_grad()
        loss.backward()

        nn.utils.clip_grad_norm_(finetune_model.parameters(), max_norm=4.0)
        optimizer.step()

        progress_bar.set_description("Epoch {} - Training loss: {:.2f}".format(i, loss)) 
        train_loss.append(loss)
    
    with torch.no_grad():
        val_loss = []
        for IDs in val_dataloader:
            finetune_model.eval()
            X = torch.tensor(data.feature_df.loc[IDs].to_numpy()).to(device)
            X = X.reshape(-1, max_len, X.shape[-1])
            targets = torch.tensor(data.labels_df.loc[IDs].to_numpy()).to(device)
            targets = targets.reshape(targets.shape[0], out_size, -1)

            pred = finetune_model.predict(X.float())
            pred = pred.reshape(X.shape[0], out_size, -1)
            
            loss = loss_fn(pred, targets)
            val_loss.append(loss)

        train_loss = torch.tensor(train_loss).mean()
        val_loss = torch.tensor(val_loss).mean()

        if val_loss < best_loss:
            best_loss = val_loss
            best_finetune_model = copy.deepcopy(finetune_model)
            best_epoch = i
    
    progress_bar.write("Epoch {} - Training loss: {:.2f} - Validation loss: {:.2f}".format(i, train_loss, val_loss))
    i += 1
    
    
tqdm.write("Best Epoch {} - Best Validation loss: {}".format(best_epoch, best_loss))

Epoch 0 - Training loss: 190.31: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 44.17it/s]
Epoch 1 - Training loss: 179.38: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 46.66it/s]
Epoch 2 - Training loss: 199.37:   0%|                                                                                                                          | 0/2 [00:00<?, ?it/s]

Epoch 0 - Training loss: 197.49 - Validation loss: 266.50
Epoch 1 - Training loss: 193.02 - Validation loss: 264.45


Epoch 2 - Training loss: 194.42: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 43.68it/s]
Epoch 3 - Training loss: 252.29: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 45.58it/s]
Epoch 4 - Training loss: 195.14:   0%|                                                                                                                          | 0/2 [00:00<?, ?it/s]

Epoch 2 - Training loss: 196.89 - Validation loss: 262.29
Epoch 3 - Training loss: 220.50 - Validation loss: 259.99


Epoch 4 - Training loss: 202.30: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 45.81it/s]
Epoch 5 - Training loss: 177.72: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 46.39it/s]
Epoch 6 - Training loss: 202.85:   0%|                                                                                                                          | 0/2 [00:00<?, ?it/s]

Epoch 4 - Training loss: 198.72 - Validation loss: 257.59
Epoch 5 - Training loss: 188.57 - Validation loss: 254.73


Epoch 6 - Training loss: 142.39: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 43.97it/s]
Epoch 7 - Training loss: 278.66: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 44.32it/s]
Epoch 8 - Training loss: 187.06:   0%|                                                                                                                          | 0/2 [00:00<?, ?it/s]

Epoch 6 - Training loss: 172.62 - Validation loss: 251.80
Epoch 7 - Training loss: 227.18 - Validation loss: 248.23


Epoch 8 - Training loss: 191.96: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 43.77it/s]
Epoch 9 - Training loss: 203.59: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 44.58it/s]


Epoch 8 - Training loss: 189.51 - Validation loss: 244.24
Epoch 9 - Training loss: 192.86 - Validation loss: 240.24
Best Epoch 9 - Best Validation loss: 240.23963928222656


In [16]:
test_loss = []
with torch.no_grad():
    for IDs in test_dataloader:
        best_finetune_model.eval()
        X = torch.tensor(data.feature_df.loc[IDs].to_numpy()).to(device)
        X = X.reshape(-1, max_len, X.shape[-1])
        targets = torch.tensor(data.labels_df.loc[IDs].to_numpy()).to(device)
        targets = targets.reshape(targets.shape[0], out_size, -1)
        

        pred = best_finetune_model.predict(X.float())
        pred = pred.reshape(X.shape[0], out_size, -1)
        loss = loss_fn(pred, targets)


        test_loss.append(loss)


test_loss = torch.tensor(test_loss).mean()
print("Test MSE loss: {}".format(test_loss))
print("Test RMSE loss: {}".format(np.sqrt(test_loss)))

Test MSE loss: 192.38543701171875
Test RMSE loss: 13.870307922363281


<center><img src="../img/img_8.PNG"/><center/>


Reference:
1. https://arxiv.org/abs/2302.00861
2. https://github.com/gzerveas/mvts_transformer

No Pretrain

Test loss: 17.563016891479492

Test loss: 24.951457977294922

Test loss: 13.356301307678223

Test loss: 15.543647766113281

Test loss: 10.70577621459961

Test loss: 11.653989791870117

Test loss: 28.348876953125

Test loss: 9.66756534576416

Test loss: 22.540130615234375

Test loss: 20.3616943359375

Pretrain 

Test loss: 11.152569770812988

Test loss: 8.513360023498535

Test loss: 20.0480899810791

Test loss: 12.694931983947754

Test loss: 15.458633422851562

Test loss: 16.706832885742188

Test loss: 11.432455062866211

Test loss: 7.263433456420898

Test loss: 15.908415794372559

Test loss: 17.736656188964844