<center><img src="../img/img_0.PNG"  width="1000" height="240"/></center>


This notebook demonstrate an apllication of SimMTM, a simple self-supervised learning framework for time series modeling. Self-supervised learning is a learning paradigm that allows model to learn a good representation from the input data itself. The learned representation will be beneficial to some downstream tasks such as forecasting, classification and outlier detection. 

Self-supervised learning has a lof of success and achieves state-of-the-art performance in some domains, especially in the image domain. In this demo, we will show a self-supervised learning method, SimMTM, in the time-series domain. SimMTM adopts both masked modeling and contrastive modeling to learn a good representation of the input data. By using the learned representation and finetuning it, we achieve a significant improvement compared to the model without self-supervised learning. 

In [1]:
import os
try:
    os.chdir('src')
except:
    pass
print(os.getcwd())

/home/shamvinc/ssl_time_series/mvts_transformer/src


In [2]:
%load_ext autoreload
%autoreload 2
%reload_ext autoreload


In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import numpy as np
from tqdm import tqdm
import copy

import math

from datasets.datasplit import split_dataset
from datasets.data import data_factory, Normalizer, TSRegressionArchive, CSVRegressionArchive
from datasets.datasplit import split_dataset
from datasets.dataset import collate_superv
from models.ts_transformer import model_factory
from models.loss import get_loss_module, contrastive_loss
from optimizers import get_optimizer

from options import Options
from running import setup


# Masked Modeling

Self-supervision via a ‘pretext task’ on input data combined with finetuning on labeled data is widely used for improving model performance in language and computer
vision. One of the popular self-supervision tasks on language data is masked modeling. Masking modeling is to mask some of the input entries randomly and predict those masked entries by using unmasked entries. By masked modeling, the model can learn the relationship through different features and different timesteps. 

<img src="../img/img_1.PNG"  width="900" height="240"/>

<img src="../img/img_2.PNG"  width="900" height="240"/>

# Masking Choice
### Random Masking

Random Masking is not a good choice to learn a good representation because the model can simply learn to take the average from the neighbour values. 



<img src="../img/img_3.PNG" width="600"/>

### Geometric Masking

Instead, we choose to use the geometric masking method, which is to mask a sequence of the input data randomly. The length of the sequence is followed by a geometric distribution. In this case, the model requires to recover a masked sequence from other unmasked input data. We suggest the expected length of a masked sequence is a half of the whole time series sequence.

In [4]:
def geom_noise_mask_single(L, lm, masking_ratio):
    """
    Randomly create a boolean mask of length `L`, consisting of subsequences of average length lm, masking with 0s a `masking_ratio`
    proportion of the sequence L. The length of masking subsequences and intervals follow a geometric distribution.
    Args:
        L: length of mask and sequence to be masked
        lm: average length of masking subsequences (streaks of 0s)
        masking_ratio: proportion of L to be masked

    Returns:
        (L,) boolean numpy array intended to mask ('drop') with 0s a sequence of length L
    """
    keep_mask = np.ones(L, dtype=bool)
    p_m = 1 / lm  # probability of each masking sequence stopping. parameter of geometric distribution.
    p_u = p_m * masking_ratio / (1 - masking_ratio)  # probability of each unmasked sequence stopping. parameter of geometric distribution.
    p = [p_m, p_u]

    # Start in state 0 with masking_ratio probability
    state = int(np.random.rand() > masking_ratio)  # state 0 means masking, 1 means not masking
    for i in range(L):
        keep_mask[i] = state  # here it happens that state and masking value corresponding to state are identical
        if np.random.rand() < p[state]:
            state = 1 - state

    return keep_mask

# SimMTM ultilizes both contrastive learning and mask modeling to learn the data representation.
## 1 - Contrastive Learning

when we mask the input time series data, we create many masked views of the input data. We expect that the distance between two views of the same time series sequence is minimized while maximizing the distance between two different sequences.

<img src="../img/img_5.png"/>

## The contrastive loss is the following: (Eq. 8 in the paper)

<center><img src="../img/img_6.PNG"/><center/>

In [5]:
def demo_contrastive_loss(s, batch_size, tau=0.05):
    s = s.squeeze(-1) 

    B = s.shape[0]
    v = s.reshape(B, -1)

    norm_v = torch.norm(v, p=2, dim=-1).unsqueeze(-1)
    v = v/norm_v
    u = torch.transpose(v, 0, 1)

    R = torch.matmul(v,u)

 
    R = torch.exp(R/tau) # (batch + mask size) x (batch + mask size)
    
    # number of masks
    M = B//batch_size
    mask = torch.eye(batch_size, device=R.device).repeat_interleave(M,dim=0).repeat_interleave(M,dim=1)

    denom = R * (torch.ones_like(R) - torch.eye(R.shape[0], device=R.device))

    denom = R.sum(-1).unsqueeze(-1)

    loss = torch.log(R/denom)
    

    loss = (loss * (mask - torch.eye(R.shape[0], device=R.device))).sum(1)/(M-1) # except no masked unit
    loss = loss.mean(0)
    
    return -loss


## 2 - Masked Modeling

SimMTM proposes to recover a time serie by the weighted sum of multiple masked points, which eases the reconstruction task by assembling ruined but complementary temporal variations.

<img src="../img/img_4.png"/>

In [6]:
from models.ts_transformer import LearnablePositionalEncoding, TransformerBatchNormEncoderLayer

class DemoSimMTMTransformerEncoder(nn.Module):
    
    def __init__(self, max_len, feat_dim, out_len, out_dim, d_model=16, n_heads=4, num_layers=2, dim_feedforward=32, dropout=0.2, temporal_unit=3):
        super(DemoSimMTMTransformerEncoder, self).__init__()

        self.max_len = max_len
        self.d_model = d_model
        self.n_heads = n_heads
        
        self.tau = 0.05
        self.mask_length = max_len//2
        self.mask_rate = 0.5

        self.project_inp = nn.Linear(feat_dim, d_model)
        self.projector_layer = nn.Linear(max_len, 1)
        self.pos_enc1 = LearnablePositionalEncoding(d_model, dropout=dropout, max_len=max_len)
        self.pos_enc2 = LearnablePositionalEncoding(d_model, dropout=dropout, max_len=out_len)
        
        self.act = F.gelu 

        # encoder_layer = nn.TransformerEncoderLayer(d_model, self.n_heads, dim_feedforward, dropout, activation='gelu')
        encoder_layer = TransformerBatchNormEncoderLayer(d_model, self.n_heads, dim_feedforward, dropout, activation='gelu')

        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers)

        self.output_layer = nn.Linear(d_model, feat_dim)

        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout2d(dropout)

        # self.predict_layer1 = nn.Conv1d(d_model, 512, 5, stride=1)
        self.predict_layer1 = nn.Linear(max_len, out_len)
        self.predict_layer2 = nn.Linear(d_model, out_dim)
        # self.bn = nn.BatchNorm1d(d_model)

        self.feat_dim = feat_dim

        self.temporal_unit = temporal_unit

        self.w1 = torch.nn.parameter.Parameter(data=torch.ones(1), requires_grad=True)
        self.w2 = torch.nn.parameter.Parameter(data=torch.ones(1), requires_grad=True)
        
        
    def forward(self, X):
        """
        Reconstruct the input and create the projected output of X
        
        Args:
            X: (batch_size, seq_length, feat_dim) torch tensor of original input

        Returns:
            output: (batch_size, seq_length, feat_dim)
            s: (batch_size, d_model, 1)
        """

        
        _x = X
         
        # Create masked views of the input X
        for i in range(self.temporal_unit):
            mask = geom_noise_mask_single(X.shape[0] * X.shape[1] * X.shape[2], self.mask_length, self.mask_rate)
            mask = mask.reshape(X.shape[0], X.shape[1], X.shape[2])
            mask = torch.from_numpy(mask).to(X.device)
            x_masked = mask * X
            _x = torch.cat([_x, x_masked], axis=-1) # [batch_size, seq_length, feat_dim * temporal_unit]
    
        
        _x = _x.reshape(X.shape[0] * (self.temporal_unit + 1), X.shape[1], X.shape[2])
  

        inp = _x.permute(1, 0, 2)
        inp = self.project_inp(inp) * np.sqrt(self.d_model)  # [seq_length, batch_size, d_model] project input vectors to d_model dimensional space
        inp = self.pos_enc1(inp)  # add positional encoding

        
        output = self.transformer_encoder(inp)  # (seq_length, batch_size, d_model)
        output = self.act(output)  # the output transformer encoder/decoder embeddings don't include non-linearity
        output = output.permute(1, 0, 2)  # (batch_size, seq_length, d_model)
        output = self.dropout1(output)

        z_hat, _s = self.project(output, self.tau)
        # Most probably defining a Linear(d_model,feat_dim) vectorizes the operation over (seq_length, batch_size).
        output = self.output_layer(z_hat)  # (batch_size, seq_length, feat_dim)

        return output, _s

    
    def project(self, z, tau):
        """
        Output a weighted average of z
        
        Args:
            X: (batch_size, seq_length, feat_dim) torch tensor of original input

        Returns:
            z_hat: (batch_size, seq_length, d_model)
            s: (batch_size, d_model, 1)
        """
        _z = z.transpose(1, 2) # [batch_size, d_model, seq_length]
        _s = s = self.projector_layer(_z) # [batch_size, d_model, 1]
        
        if self.training:
            mask = torch.ones(1, self.d_model, 1).to(z.device)
            mask = self.dropout3(mask)
            s = s * mask 
            s = s + torch.randn(s.shape).to(z.device) * 1e-2
        
        
        s = s.squeeze(-1) 
        B = s.shape[0]
        v = s.reshape(B, -1)

        norm_v = torch.norm(v, p=2, dim=-1).unsqueeze(-1)
        v = v/norm_v
        u = torch.transpose(v, 0, 1)
        
        R = torch.matmul(v,u)
     
  
        R = torch.exp(R/tau) # (batch + mask size) x (batch + mask size)
        R = R * (torch.ones_like(R) - torch.eye(R.shape[0], device=R.device)) # zero out the weight of no masked component
        R = R/R.sum(-1).unsqueeze(-1)
        M = self.temporal_unit + 1
        R = R[::M] # extract every no mask unit # (batch size) x (batch + mask size)

        z_hat = (R.unsqueeze(-1).unsqueeze(-1) * z.unsqueeze(0)).sum(1) 
        return z_hat, _s


    def predict(self, X):
        """
        Predict an output given X
        
        Args:
            z: (batch_size, seq_length, d_model) torch tensor of representations of input
            tau: temperture of similarity matrix

        Returns:
            output: (batch_size, out_seq_len, out_dim)
        """
        
        # permute because pytorch convention for transformers is [seq_length, batch_size, feat_dim]. padding_masks [batch_size, feat_dim]
        inp = X.permute(1, 0, 2)
        inp = self.project_inp(inp) * np.sqrt(self.d_model)  # [seq_length, batch_size, d_model] project input vectors to d_model dimensional space
        inp = self.pos_enc1(inp)  # add positional encoding
        # NOTE: logic for padding masks is reversed to comply with definition in MultiHeadAttention, TransformerEncoderLayer

        output = self.transformer_encoder(inp)
        output = output.permute(1, 0, 2)  # (batch_size, seq_length, d_model)
        # output = self.dropout1(output)
       
        output = output.transpose(1, 2) # (batch_size, d_model, seq_length)
        output = self.predict_layer1(output)
        # output = self.act(output)
        
        output = output.transpose(1, 2) # (batch_size, seq_length, d_model)
        output = output.permute(1, 0, 2)
        output = self.pos_enc2(output)
        output = output.permute(1, 0, 2)
        output = self.dropout2(output)
        output = self.predict_layer2(output) 
        return output



# Data Loading and Preparation

In this demo, we use a benchmask time series dataset called BeijingPM25Quality.
This dataset is part of the Monash, UEA & UCR time series regression repository. http://tseregression.org/

The goal of this dataset is to predict PM2.5 air quality in the city of Beijing. This dataset contains 17532 time series with 9 dimensions.  This includes hourly air pollutants measurments (SO2, NO2, CO and O3), temperature, pressure, dew point, rainfall and windspeed measurments from 12 nationally controlled air quality monitoring sites. The air-quality data are from the Beijing Municipal Environmental Monitoring Center. The meteorological data in each air-quality site are matched with the nearest weather station from the China Meteorological Administration. The time period is from March 1st, 2013 to February 28th, 2017. 

In [7]:
args = Options().parse()  
args.data_dir = '../datasets/BeijingPM25Quality'
args.task = 'regression'
args.output_dir = '../experiments'
config = setup(args)
from datasets.data import CSVRegressionArchive
data = TSRegressionArchive(config['data_dir'], pattern='TRAIN', config=config)
test_data = TSRegressionArchive(config['data_dir'], pattern='TEST', config=config)
_data = data

# Standard Normalization
normalizer = Normalizer(config['normalization'])
data.feature_df = normalizer.normalize(data.feature_df)
test_data.feature_df = normalizer.normalize(test_data.feature_df)


                

2023-08-24 07:55:47,195 | INFO : Stored configuration file in '../experiments/_2023-08-24_07-55-46_ETs'
11942it [00:47, 251.82it/s]
5072it [00:19, 257.36it/s]


In [8]:
data.feature_df

Unnamed: 0,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5,dim_6,dim_7,dim_8
0,-0.578722,-1.262337,-0.827452,0.331285,-1.354744,1.282595,-1.671894,-0.077886,2.197098
0,-0.578722,-1.262337,-0.827452,0.331285,-1.390910,1.302248,-1.626941,-0.077886,2.439030
0,-0.536461,-1.177414,-0.827452,0.261866,-1.390910,1.331729,-1.626941,-0.077886,3.164828
0,-0.282896,-1.149107,-0.827452,0.244511,-1.418035,1.429997,-1.716847,-0.077886,1.148723
0,-0.240635,-1.120799,-0.827452,0.244511,-1.472284,1.498785,-1.724339,-0.077886,0.261637
...,...,...,...,...,...,...,...,...,...
11917,0.393278,1.257029,1.802581,-0.848836,-1.418035,1.606881,-0.907690,-0.077886,-0.544805
11917,0.689104,1.341951,2.153252,-0.848836,-1.517492,1.597054,-0.892705,-0.077886,-0.302872
11917,0.562321,1.228721,1.627246,-0.848836,-1.535576,1.557746,-0.862736,-0.077886,-0.625449
11917,0.942669,1.341951,2.591591,-0.779417,-1.607908,1.528266,-0.832768,-0.077886,-1.028670


In [9]:
max_len = 24
out_size = 1
out_dim = 1
# config['data_window_len'] = max_len
# config['task'] = 'simmtm'
# config['normalization_layer'] = 'BatchNorm'
# config['out_len'] = 24
# config['out_dim'] = 7
# config['d_model'] = 16
# config['dim_feedforward'] = 128
# config['num_heads'] = 4
# config['num_layers'] = 1
# from models.ts_transformer import model_factory
# model = model_factory(config, data)
model = DemoSimMTMTransformerEncoder(max_len=max_len, feat_dim=data.feature_df.shape[1], out_len=out_size, out_dim=out_dim, 
                                     d_model=4, n_heads=4, num_layers=2, dim_feedforward=8)

device = "cuda"
model.to(device)
model.tau = 0.05
model.mask_length = max_len//2
model.mask_ratio = 0.5
model.temporal_unit = 3

In [10]:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

In [11]:
from torch.utils.data import DataLoader
batch_size = 64

train_indices, val_indices, _ = split_dataset(data_indices=data.all_IDs,
                                                         validation_method='ShuffleSplit',
                                                         n_splits=1,
                                                         validation_ratio=0.2,
                                                         test_set_ratio=0,  # used only if test_indices not explicitly specified
                                                         test_indices=None,
                                                         random_seed=1337,
                                                         labels=None)
train_indices = train_indices[0]
val_indices = val_indices[0]
test_indices = np.array(test_data.all_IDs)

train_dataloader = DataLoader(train_indices, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_indices, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_indices, batch_size=batch_size, shuffle=True)



# Self-Supervised Learning Training Loop

In [12]:
i = 0
max_epoch = 10
best_loss = 1e10
best_epoch = 0
device = "cuda"
loss_fn = nn.MSELoss()
best_model = copy.deepcopy(model)

while i < max_epoch:
    train_loss = { "loss": [], "loss_mse": [], "loss_con": []}
    progress_bar = tqdm(train_dataloader)
    
    for IDs in progress_bar:
        model.train()
        X = torch.tensor(data.feature_df.loc[IDs].to_numpy()).to(device)
        X = X.float()
        X = X.reshape(-1, max_len, X.shape[-1])
        # X = X[:, :, -1:]
        
        pred, s = model(X)  # (batch_size, padded_length, feat_dim)
        
        loss_mse = loss_fn(pred, X) 

        loss_con = demo_contrastive_loss(s, X.shape[0])

        loss = 1/(model.w1.pow(2)) * loss_mse + 1/(model.w2.pow(2)) * loss_con + torch.log(model.w1) + torch.log(model.w2)
  


        optimizer.zero_grad()
        loss.backward()

        nn.utils.clip_grad_norm_(model.parameters(), max_norm=4.0)
        optimizer.step()
        # import ipdb; ipdb.set_trace()
        progress_bar.set_description("Epoch {0} - Training loss: {1:.2f} - MSE loss: {2:.2f} - Contrastive loss: {3:.2f}".format(i, 
                loss.cpu().detach().numpy().item(), loss_mse.cpu().detach().numpy().item(), loss_con.cpu().detach().numpy().item())) 
        train_loss["loss"].append(loss)
        train_loss["loss_mse"].append(loss_mse)
        train_loss["loss_con"].append(loss_con)
    
            
    with torch.no_grad():
        val_loss = { "loss": [], "loss_mse": [], "loss_con": []}
        for IDs in val_dataloader:
            model.eval()
            X = torch.tensor(data.feature_df.loc[IDs].to_numpy()).to(device)
            X = X.float()
            X = X.reshape(-1, max_len, X.shape[-1])
            # X = X[:, :, -1:]


            pred, s = model(X)  # (batch_size, padded_length, feat_dim)
        
            loss_mse = loss_fn(pred, X) 

            loss_con = demo_contrastive_loss(s, X.shape[0])

            loss = 1/(model.w1.pow(2)) * loss_mse + 1/(model.w2.pow(2)) * loss_con + torch.log(model.w1) + torch.log(model.w2)

            val_loss["loss"].append(loss)
            val_loss["loss_mse"].append(loss_mse)
            val_loss["loss_con"].append(loss_con)

        train_loss["loss"] = torch.tensor(train_loss["loss"]).mean()
        train_loss["loss_mse"] = torch.tensor(train_loss["loss_mse"]).mean()
        train_loss["loss_con"] = torch.tensor(train_loss["loss_con"]).mean()
        val_loss["loss"] = torch.tensor(val_loss["loss"]).mean()
        val_loss["loss_mse"] = torch.tensor(val_loss["loss_mse"]).mean()
        val_loss["loss_con"] = torch.tensor(val_loss["loss_con"]).mean()

        if val_loss["loss"] < best_loss:
            best_loss = val_loss["loss"]
            best_model = copy.deepcopy(model)
            best_epoch = i
    
        progress_bar.write("Epoch {0} - Training loss: {1:.2f} {2:.2f} {3:.2f} - Validation loss: {4:.2f} {5:.2f} {6:.2f}".format(i, 
            train_loss["loss"].cpu().detach().numpy().item(), train_loss["loss_mse"].cpu().detach().numpy().item(), train_loss["loss_con"].cpu().detach().numpy().item(),
            val_loss["loss"].cpu().detach().numpy().item(), val_loss["loss_mse"].cpu().detach().numpy().item(), val_loss["loss_con"].cpu().detach().numpy().item()))
    i += 1
    
    
tqdm.write("Best Epoch {} - Best Validation loss: {}".format(best_epoch, best_loss))

Epoch 0 - Training loss: 4.23 - MSE loss: 0.74 - Contrastive loss: 4.67: 100%|██████████████████████████████████████████████████████████████████████| 149/149 [00:10<00:00, 13.99it/s]
Epoch 1 - Training loss: 4.11 - MSE loss: 0.71 - Contrastive loss: 4.54:   1%|▉                                                                       | 2/149 [00:00<00:10, 14.02it/s]

Epoch 0 - Training loss: 6.78 0.98 6.53 - Validation loss: 4.10 0.82 4.41


Epoch 1 - Training loss: 3.20 - MSE loss: 0.62 - Contrastive loss: 4.07: 100%|██████████████████████████████████████████████████████████████████████| 149/149 [00:10<00:00, 13.98it/s]
Epoch 2 - Training loss: 3.24 - MSE loss: 0.66 - Contrastive loss: 4.11:   1%|▉                                                                       | 2/149 [00:00<00:10, 14.37it/s]

Epoch 1 - Training loss: 3.73 0.81 4.34 - Validation loss: 3.17 0.74 3.88


Epoch 2 - Training loss: 2.93 - MSE loss: 0.65 - Contrastive loss: 3.98: 100%|██████████████████████████████████████████████████████████████████████| 149/149 [00:10<00:00, 14.11it/s]
Epoch 3 - Training loss: 3.14 - MSE loss: 1.00 - Contrastive loss: 3.94:   1%|▉                                                                       | 2/149 [00:00<00:10, 14.33it/s]

Epoch 2 - Training loss: 3.13 0.77 4.00 - Validation loss: 2.85 0.71 3.75


Epoch 3 - Training loss: 2.64 - MSE loss: 0.57 - Contrastive loss: 3.82: 100%|██████████████████████████████████████████████████████████████████████| 149/149 [00:10<00:00, 14.01it/s]
Epoch 4 - Training loss: 2.76 - MSE loss: 0.70 - Contrastive loss: 3.90:   1%|▉                                                                       | 2/149 [00:00<00:10, 14.36it/s]

Epoch 3 - Training loss: 2.86 0.74 3.90 - Validation loss: 2.66 0.69 3.68


Epoch 4 - Training loss: 2.53 - MSE loss: 0.61 - Contrastive loss: 3.73: 100%|██████████████████████████████████████████████████████████████████████| 149/149 [00:10<00:00, 14.11it/s]
Epoch 5 - Training loss: 2.65 - MSE loss: 0.76 - Contrastive loss: 3.78:   1%|▉                                                                       | 2/149 [00:00<00:10, 14.32it/s]

Epoch 4 - Training loss: 2.70 0.73 3.84 - Validation loss: 2.55 0.68 3.66


Epoch 5 - Training loss: 2.55 - MSE loss: 0.79 - Contrastive loss: 3.65: 100%|██████████████████████████████████████████████████████████████████████| 149/149 [00:10<00:00, 14.19it/s]
Epoch 6 - Training loss: 2.49 - MSE loss: 0.64 - Contrastive loss: 3.77:   1%|▉                                                                       | 2/149 [00:00<00:10, 13.99it/s]

Epoch 5 - Training loss: 2.59 0.72 3.80 - Validation loss: 2.45 0.67 3.63


Epoch 6 - Training loss: 2.41 - MSE loss: 0.66 - Contrastive loss: 3.68: 100%|██████████████████████████████████████████████████████████████████████| 149/149 [00:10<00:00, 14.00it/s]
Epoch 7 - Training loss: 2.44 - MSE loss: 0.65 - Contrastive loss: 3.80:   1%|▉                                                                       | 2/149 [00:00<00:10, 14.39it/s]

Epoch 6 - Training loss: 2.50 0.71 3.78 - Validation loss: 2.39 0.66 3.62


Epoch 7 - Training loss: 2.63 - MSE loss: 1.08 - Contrastive loss: 3.60: 100%|██████████████████████████████████████████████████████████████████████| 149/149 [00:10<00:00, 13.99it/s]
Epoch 8 - Training loss: 2.33 - MSE loss: 0.57 - Contrastive loss: 3.78:   1%|▉                                                                       | 2/149 [00:00<00:10, 14.12it/s]

Epoch 7 - Training loss: 2.44 0.70 3.76 - Validation loss: 2.33 0.65 3.59


Epoch 8 - Training loss: 2.40 - MSE loss: 0.79 - Contrastive loss: 3.64: 100%|██████████████████████████████████████████████████████████████████████| 149/149 [00:10<00:00, 13.91it/s]
Epoch 9 - Training loss: 2.32 - MSE loss: 0.65 - Contrastive loss: 3.69:   1%|▉                                                                       | 2/149 [00:00<00:10, 14.39it/s]

Epoch 8 - Training loss: 2.39 0.70 3.75 - Validation loss: 2.29 0.65 3.59


Epoch 9 - Training loss: 2.22 - MSE loss: 0.55 - Contrastive loss: 3.70: 100%|██████████████████████████████████████████████████████████████████████| 149/149 [00:10<00:00, 13.88it/s]


Epoch 9 - Training loss: 2.35 0.70 3.73 - Validation loss: 2.27 0.65 3.60
Best Epoch 9 - Best Validation loss: 2.2673282623291016


# Finetune Training Loop



In [13]:
finetune_model = copy.deepcopy(best_model)
optimizer = torch.optim.AdamW(finetune_model.parameters(), lr=1e-3)

In [14]:
i = 0
max_epoch = 10
best_loss = 1e10
best_finetune_model = copy.deepcopy(best_model)
best_epoch = 0
device = "cuda"
finetune_model.to(device)
while i < max_epoch:
    train_loss = []
    progress_bar = tqdm(train_dataloader)
    
    for IDs in progress_bar:
        finetune_model.train()
        
        X = torch.tensor(data.feature_df.loc[IDs].to_numpy()).to(device)
        X = X.reshape(-1, max_len, X.shape[-1])
        targets = torch.tensor(data.labels_df.loc[IDs].to_numpy()).to(device)
        targets = targets.reshape(targets.shape[0], out_size, -1)
        
        pred = finetune_model.predict(X.float())
        pred = pred.reshape(X.shape[0], out_size, -1)
        loss = loss_fn(pred, targets)


        optimizer.zero_grad()
        loss.backward()

        nn.utils.clip_grad_norm_(finetune_model.parameters(), max_norm=4.0)
        optimizer.step()

        progress_bar.set_description("Epoch {} - Training loss: {:.2f}".format(i, loss)) 
        train_loss.append(loss)
    
    with torch.no_grad():
        val_loss = []
        for IDs in val_dataloader:
            finetune_model.eval()
            X = torch.tensor(data.feature_df.loc[IDs].to_numpy()).to(device)
            X = X.reshape(-1, max_len, X.shape[-1])
            targets = torch.tensor(data.labels_df.loc[IDs].to_numpy()).to(device)
            targets = targets.reshape(targets.shape[0], out_size, -1)

            pred = finetune_model.predict(X.float())
            pred = pred.reshape(X.shape[0], out_size, -1)
            
            loss = loss_fn(pred, targets)
            val_loss.append(loss)

        train_loss = torch.tensor(train_loss).mean()
        val_loss = torch.tensor(val_loss).mean()

        if val_loss < best_loss:
            best_loss = val_loss
            best_finetune_model = copy.deepcopy(finetune_model)
            best_epoch = i
    
    progress_bar.write("Epoch {} - Training loss: {:.2f} - Validation loss: {:.2f}".format(i, train_loss, val_loss))
    i += 1
    
    
tqdm.write("Best Epoch {} - Best Validation loss: {}".format(best_epoch, best_loss))

Epoch 0 - Training loss: 34837.46: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 149/149 [00:07<00:00, 20.94it/s]
Epoch 1 - Training loss: 27544.54:   1%|█▍                                                                                                            | 2/149 [00:00<00:07, 19.43it/s]

Epoch 0 - Training loss: 31422.14 - Validation loss: 29367.37


Epoch 1 - Training loss: 22497.54: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 149/149 [00:07<00:00, 20.65it/s]
Epoch 2 - Training loss: 21008.80:   1%|█▍                                                                                                            | 2/149 [00:00<00:07, 19.49it/s]

Epoch 1 - Training loss: 28681.35 - Validation loss: 25832.41


Epoch 2 - Training loss: 16309.61: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 149/149 [00:07<00:00, 20.16it/s]
Epoch 3 - Training loss: 15380.78:   1%|█▍                                                                                                            | 2/149 [00:00<00:08, 17.73it/s]

Epoch 2 - Training loss: 23916.93 - Validation loss: 20252.27


Epoch 3 - Training loss: 14280.37: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 149/149 [00:07<00:00, 20.03it/s]
Epoch 4 - Training loss: 16303.24:   2%|██▏                                                                                                           | 3/149 [00:00<00:06, 20.97it/s]

Epoch 3 - Training loss: 18323.24 - Validation loss: 15602.92


Epoch 4 - Training loss: 16054.93: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 149/149 [00:07<00:00, 20.53it/s]
Epoch 5 - Training loss: 10605.63:   2%|██▏                                                                                                           | 3/149 [00:00<00:07, 20.55it/s]

Epoch 4 - Training loss: 13232.80 - Validation loss: 9963.98


Epoch 5 - Training loss: 7481.14: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 149/149 [00:07<00:00, 21.17it/s]
Epoch 6 - Training loss: 5631.07:   2%|██▏                                                                                                            | 3/149 [00:00<00:06, 21.24it/s]

Epoch 5 - Training loss: 9234.78 - Validation loss: 6904.74


Epoch 6 - Training loss: 15172.48: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 149/149 [00:07<00:00, 21.23it/s]
Epoch 7 - Training loss: 12741.97:   2%|██▏                                                                                                           | 3/149 [00:00<00:06, 21.29it/s]

Epoch 6 - Training loss: 7900.12 - Validation loss: 5592.28


Epoch 7 - Training loss: 4540.79: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 149/149 [00:07<00:00, 21.25it/s]
Epoch 8 - Training loss: 7992.80:   2%|██▏                                                                                                            | 3/149 [00:00<00:06, 21.28it/s]

Epoch 7 - Training loss: 7426.88 - Validation loss: 4517.94


Epoch 8 - Training loss: 5199.29: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 149/149 [00:06<00:00, 21.31it/s]
Epoch 9 - Training loss: 6122.55:   2%|██▏                                                                                                            | 3/149 [00:00<00:06, 21.24it/s]

Epoch 8 - Training loss: 7415.72 - Validation loss: 4677.93


Epoch 9 - Training loss: 7548.82: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 149/149 [00:07<00:00, 21.26it/s]


Epoch 9 - Training loss: 7343.61 - Validation loss: 4388.68
Best Epoch 9 - Best Validation loss: 4388.6767578125


In [15]:
test_loss = []
with torch.no_grad():
    for IDs in test_dataloader:
        best_finetune_model.eval()
        X = torch.tensor(data.feature_df.loc[IDs].to_numpy()).to(device)
        X = X.reshape(-1, max_len, X.shape[-1])
        targets = torch.tensor(data.labels_df.loc[IDs].to_numpy()).to(device)
        targets = targets.reshape(targets.shape[0], out_size, -1)
        

        pred = best_finetune_model.predict(X.float())
        pred = pred.reshape(X.shape[0], out_size, -1)
        loss = loss_fn(pred, targets)


        test_loss.append(loss)


test_loss = torch.tensor(test_loss).mean()
print("Test MSE loss: {}".format(test_loss))
print("Test RMSE loss: {}".format(np.sqrt(test_loss)))

Test loss: 4371.2177734375


<center><img src="../img/img_8.PNG"/><center/>


Reference:
1. https://arxiv.org/abs/2302.00861
2. https://github.com/gzerveas/mvts_transformer

No Pretrain

Test loss: 4083.34765625

Test loss: 4096.765625

Test loss: 3880.41064453125

Test loss: 3818.454833984375

Test loss: 3640.204833984375

Test loss: 3793.255615234375

Test loss: 3386.54638671875

Test loss: 3532.69140625

Test loss: 3568.943115234375

Test loss: 3639.838623046875

Pretrain

Test loss: 3398.438720703125

Test loss: 3389.167724609375

Test loss: 3470.773681640625

Test loss: 3443.25830078125

Test loss: 3331.67919921875

Test loss: 3753.1474609375

Test loss: 3480.992431640625

Test loss: 3532.7275390625

Test loss: 3068.515869140625

Test loss: 3118.0390625