# LVM - language-vector model

## Abstract
LVMs are transformer-based models for predicting temporal sequences in which each step carries a token.
Unlike LLMs, which generate only tokens, an LVM can generate tokens aligned with a sequence of $R^n$ vectors,
or a mixture of aligned tokens and vectors.
The LVM is useful for predicting time-series data in a variety of applications.
For example, it can be applied to binary classification problems that involve an extended history.


## Brief Introduction
The transformer architecture in deep neural networks has been applied widely beyond its original use for predicting human-language sequences. In particular, a series of studies has applied transformers to forecasting time-series data (either regular or irregular). For example, a pure vector sequence might represent daily prices for 500 stocks over 300 days, with the goal of predicting the prices over the following T days.

This project introduces a novel generalization of transformers for time-series tabular data with aligned word sequences. In the original transformer, the input consists of a sequence of length w drawn from a dictionary D of tokens: $(x_1 \ldots x_w) \in D^w$. Here we suppose that at each of the w points we receive both a vector and a token, and thus $(x_1 \ldots x_w) \in (\mathbb{R}^n \times D)^w$. By contrast, classic time-series prediction has no token component, i.e. $(x_1 \ldots x_w) \in (\mathbb{R}^n)^w$.
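
As an illustration of this input space, the sketch below packs a batch of real vectors and aligned token ids into a single tensor whose last column holds the token, which is the layout the code later in this notebook assumes. The helper name `pack_sequence` and the sizes used are hypothetical and chosen only for the example.

```python
import torch

def pack_sequence(vectors: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Pack (N, w, n) real vectors and (N, w) integer token ids into one (N, w, n + 1) tensor.

    Hypothetical helper: the token id is stored as a float in the last column.
    """
    if vectors.shape[:2] != tokens.shape:
        raise ValueError("vectors and tokens must share batch and sequence dimensions")
    return torch.cat((vectors, tokens.unsqueeze(-1).to(vectors.dtype)), dim=-1)

torch.manual_seed(0)
N, w, n, vocab = 2, 5, 3, 11          # illustrative sizes only
vectors = torch.randn(N, w, n)        # the R^n part of each step
tokens = torch.randint(0, vocab, (N, w))  # the token part of each step
x = pack_sequence(vectors, tokens)
print(x.shape)
```

The token column can be recovered with `x[:, :, -1].long()` and the vector part with `x[:, :, :-1]`.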


There are numerous interesting applications of this setup:
- human-produced words where, at each token, we *also* measure characteristics such as volume or (suitably quantified) emotion
- prediction of the next label from sequence data: say, a physical system (a plant, a natural phenomenon) whose state on each day is described by a state vector plus a categorical label (binary or N-ary), where the goal is to predict the label at $t+1$
- general situations normally solved with hidden Markov models, where tokens are produced by states in $R^n$

Novelty:
- whereas there are several successful implementations of time-series transformer models, here we extend generation to include a token label
- most importantly, when the token is a binary label, the approach promises to improve the solution of binary classification problems in which the input vectors have an extended history whose length varies from sample to sample


# Extensions

**Irregular Time** - The vanilla architecture does not specify the time separation between the vectors. This is sufficient if the original data is regularly sampled or naturally clusters into weeks or months. It is easy to generalize to irregularly spaced data by adding a time variable as an extra vector dimension (see Luo, Ye et al. 2020).
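
A minimal sketch of the "time variable as an extra vector dimension" idea follows. It appends the elapsed time since the previous step as one more feature. The helper `add_time_deltas` is hypothetical, and this simple concatenation is an assumption for illustration, not necessarily the mechanism used in the cited paper.

```python
import torch

def add_time_deltas(vectors: torch.Tensor, timestamps: torch.Tensor) -> torch.Tensor:
    """Append the time elapsed since the previous step as an extra feature.

    vectors:    (N, w, n) real-valued features
    timestamps: (N, w) observation times, non-decreasing along dim 1
    returns:    (N, w, n + 1); the first step of each sequence gets a delta of 0
    """
    deltas = torch.zeros_like(timestamps)
    deltas[:, 1:] = timestamps[:, 1:] - timestamps[:, :-1]
    return torch.cat((vectors, deltas.unsqueeze(-1).to(vectors.dtype)), dim=-1)

vectors = torch.ones(1, 4, 2)
timestamps = torch.tensor([[0.0, 1.0, 4.0, 4.5]])  # irregular spacing
x = add_time_deltas(vectors, timestamps)
print(x[0, :, -1])
```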

## Related studies
- Liu et al. 2023 - [iTransformer: Inverted Transformers Are Effective for Time Series Forecasting](https://arxiv.org/abs/2310.06625)
- Ma et al. 2023 - [BTAD: A binary transformer deep neural network model for anomaly detection in multivariate time series data](https://www.sciencedirect.com/science/article/abs/pii/S1474034623000770?via%3Dihub)
- Kotelnikov, Baranchuk et al. 2023 - TabDDPM (ICML 2023)
- Luo, Ye et al. 2020 - HiTANet: Hierarchical Time-Aware Attention Networks for Risk Prediction (SIGKDD 2020)


# Results
In a previous unpublished study, the authors developed a simple generalization of transformers for binary state prediction in healthcare management, i.e. the goal was to predict the label at $t+1$, where $D=\{0,1\}$.


# Code
Code sources
- [Aladdin Persson's transformer from scratch](https://github.com/aladdinpersson/Machine-Learning-Collection/blob/558557c7989f0b10fee6e8d8f953d7269ae43d4f/ML/Pytorch/more_advanced/transformer_from_scratch/transformer_from_scratch.py).  Under the [MIT License](https://github.com/aladdinpersson/Machine-Learning-Collection/blob/558557c7989f0b10fee6e8d8f953d7269ae43d4f/LICENSE.txt)
- standard dataloader and Lightning examples
- GitHub Copilot


In [1]:
import pandas as pd
import pdb
import pickle
import numpy as np
import os
import sys, math, copy
import time
import warnings
from typing import Tuple

#debugging
#os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

import torch
import torch.autograd as autograd
from torch import nn, Tensor
from torch.utils.data import dataset, DataLoader

import lightning as L
import torch.nn.functional as F

In [2]:
#pretraining_data = r'data\synthetic1seed=0_n=1000.pkl' #for debugging only
pretraining_data = r'data\synthetic1seed=0_n=100000.pkl'

# Model architecture

This is a decoder-only transformer model:
1. Input sentence containing vectors + words
2. Positional encoder
3. Masked multi-headed attention

In [32]:
class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert (
            self.head_dim * heads == embed_size
        ), "Embedding size needs to be divisible by heads"

        self.values = nn.Linear(embed_size, embed_size)
        self.keys = nn.Linear(embed_size, embed_size)
        self.queries = nn.Linear(embed_size, embed_size)
        self.fc_out = nn.Linear(embed_size, embed_size)

    def forward(self, values, keys, query, mask):
        N = query.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]
        values = self.values(values)  #(N, value_len, embed_size)
        keys = self.keys(keys) #(N, key_len, embed_size)
        queries = self.queries(query) #(N, query_len, embed_size)

        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = queries.reshape(N, query_len, self.heads, self.head_dim)

        #einstein summation for tensor multiplication
        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))

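        # Note: scores are scaled by sqrt(embed_size) here; the original
        # Transformer paper scales by sqrt(d_k), i.e. sqrt(head_dim).
        attention = torch.softmax(energy / (self.embed_size ** (1/2)), dim=3)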

        out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(
            N, query_len, self.heads*self.head_dim
        )

        out = self.fc_out(out)

        return out        

In [33]:
class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super(TransformerBlock, self).__init__()
        self.attention = SelfAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)

        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion*embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask):
        attention = self.attention(value, key, query, mask)

        x = self.dropout(self.norm1(attention + query))
        forward = self.feed_forward(x)
        out = self.dropout(self.norm2(forward + x))

        return out


In [34]:
#code from https://pytorch.org/tutorials/beginner/transformer_tutorial.html
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(1, max_len, d_model)
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x: Tensor) -> Tensor:
        """
        Arguments:
            x: Tensor, shape ``[batch_size, seq_len, embedding_dim]``
        """
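        # pe has shape [1, max_len, d_model]; slice the sequence dimension (dim 1)
        # so that batch-first inputs shorter than max_len broadcast correctly
        x = x + self.pe[:, :x.size(1)]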
        return self.dropout(x)
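
The sketch below rebuilds the same sinusoidal table as a standalone function (the name `sinusoidal_table` is hypothetical) to show why the slice has to be taken along the sequence dimension when the buffer has shape `[1, max_len, d_model]` and the input is batch-first. The sizes are illustrative.

```python
import math
import torch

def sinusoidal_table(max_len: int, d_model: int) -> torch.Tensor:
    """Return a [1, max_len, d_model] table of sinusoidal position encodings (d_model even)."""
    position = torch.arange(max_len).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(1, max_len, d_model)
    pe[0, :, 0::2] = torch.sin(position * div_term)
    pe[0, :, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_table(max_len=100, d_model=32)
x = torch.zeros(4, 10, 32)            # batch of 4, seq_len 10 < max_len

wrong_slice = pe[:x.size(1)]          # slices dim 0 (size 1): still [1, 100, 32]
right_slice = pe[:, :x.size(1)]       # slices dim 1: [1, 10, 32]
out = x + right_slice                 # broadcasts over the batch dimension
print(wrong_slice.shape, right_slice.shape, out.shape)
```

Slicing dim 0 only happens to work when `seq_len == max_len`, which is the case in this notebook because `max_length` is set to the dataset's sequence length.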

In [35]:
class Decoder_Embedder(nn.Module):
    """
    Decoder with an embedded embedder for the tokens
    """
    def __init__(
                self,
                word_embed_size,
                vector_embed_size,
                num_layers,
                heads,
                vocab_size,
                forward_expansion,
                dropout,
                device,
                max_length):
            super(Decoder_Embedder, self).__init__()
            self.device = device
            self.embed_size = vector_embed_size + word_embed_size
            self.vector_embed_size = vector_embed_size

            #it's recommended to use more intelligent embedding - currently it just randomizes. e.g. PositionalEncoding1d
            self.word_embedding = nn.Embedding(vocab_size, word_embed_size)   

            self.pos_encoder    = PositionalEncoding(d_model=self.embed_size, dropout=dropout, max_len=max_length)

            self.layers = nn.ModuleList(
                [
                    TransformerBlock(embed_size=self.embed_size, heads=heads,
                                    dropout=dropout, forward_expansion=forward_expansion)
                                    for _ in range(num_layers)
                ]
            )
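            # maps the token part of the last hidden state to a single score
            self.retokenizer = nn.Linear(word_embed_size, 1)
            self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask):
        # x: (N, seq_len, vector_embed_size + 1); the last column holds the token id
        embedded_token = self.word_embedding(x[:, :, -1].long())  # (N, seq_len, word_embed_size)
        # concatenate vectors and token embeddings along the feature dimension
        embedded_all = torch.cat((x[:, :, :-1], embedded_token), dim=-1)
        out = self.pos_encoder(embedded_all)

        for layer in self.layers:
            out = layer(out, out, out, mask) 

        # last position only: predicted vector followed by a scalar token score
        out2 = torch.cat(
            (out[:, -1, :self.vector_embed_size],
             self.retokenizer(out[:, -1, self.vector_embed_size:])),
            dim=-1)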
        
        return out2


In [36]:
class Decoder_Simple(nn.Module):
    """
    Decoder without the embedding of token
    """ 
    def __init__(
                self,
                word_embed_size,
                vector_embed_size,
                num_layers,
                heads,
                vocab_size,
                forward_expansion,
                dropout,
                device,
                max_length):
            super(Decoder_Simple, self).__init__()
            self.device = device
            self.embed_size = vector_embed_size + word_embed_size
            
            self.pos_encoder    = PositionalEncoding(d_model=self.embed_size, dropout=dropout, max_len=max_length)

            self.layers = nn.ModuleList(
                [
                    TransformerBlock(embed_size=self.embed_size, heads=heads,
                                    dropout=dropout, forward_expansion=forward_expansion)
                                    for _ in range(num_layers)
                ]
            )
            self.dropout = nn.Dropout(dropout)
            #for binary classification 
            # self.fc_out = nn.Linear(self.embed_size, 1)
    
    def forward(self, x, mask):
        out = self.pos_encoder(x)

        for layer in self.layers:
            out = layer(out, out, out, mask) 

        #for binary classification
        #out = self.fc_out(out) 
        
        return out


In [37]:
"""
the main class for the LVM

it assumes that the last feature column (x[:, :, -1] for a batched input) is the word, and the rest is real-valued vector data

core parameters
    word_embed_size 
    vector_embed_size
    num_layers


"""
class LVM(nn.Module):
    def __init__(self, decoder='simple', word_embed_size=512, vector_embed_size=10, num_layers=6, forward_expansion=4, heads=8, dropout=0, device='cpu', vocab_size=1000, max_length=100):
        super(LVM, self).__init__()

        if decoder == 'simple':
            decoder = Decoder_Simple
        else:
            decoder = Decoder_Embedder
        self.decoder = decoder(word_embed_size=word_embed_size, vector_embed_size=vector_embed_size,
                            num_layers=num_layers, heads=heads, forward_expansion=forward_expansion,
                                dropout=dropout, device=device, vocab_size=vocab_size, max_length=max_length)
        self.device = device
        #self._init_weights()

    def make_mask(self, N, x_len):
        """
        make the lower-triangular (causal) mask: 1 marks positions that may be attended to,
        0 marks the next words in the sequence, which are masked out.
        we can't cache it because in general N can vary from input to input
        """
        mask = torch.tril(torch.ones((x_len, x_len))).expand(
            N, 1, x_len, x_len
        )
        return mask.to(self.device)

    def forward(self, x):
        mask = self.make_mask(N=x.shape[0], x_len=x.shape[1]) 
        out = self.decoder(x, mask)

        return out
        #return torch.sigmoid(out)  #this is a binary classification. typically pre-train for self-supervision
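
The causal mask built by `make_mask` can be checked in isolation. The sketch below applies the same `tril` mask and `masked_fill` step to a dummy score tensor of zeros (an assumption made so the result is easy to read) and confirms that no weight lands on future positions.

```python
import torch

x_len = 4
# same construction as make_mask, for a single sample and a single head
mask = torch.tril(torch.ones((x_len, x_len))).expand(1, 1, x_len, x_len)

# dummy attention scores (all equal), shaped (N, heads, query_len, key_len)
energy = torch.zeros(1, 1, x_len, x_len)
energy = energy.masked_fill(mask == 0, float("-1e20"))
attention = torch.softmax(energy, dim=3)

# with equal scores, row i spreads its weight uniformly over positions 0..i
print(attention[0, 0])
```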


# Training

In [9]:
#caution: the model should be pre-trained to just before loss falls to zero

In [10]:
dataset_synth_fpath = pretraining_data
dataset_synth_vocab_size = 11 #10 regular and +1 is for padding value
dataset_synth_index_token_val = 22
vector_embed_size = dataset_synth_index_token_val-1 #all but the last column

heads = 8
word_embed_size1 = min(dataset_synth_vocab_size, 200)
# pad the word embedding so that the total embedding size is divisible by the number of heads
word_embed_size2 = word_embed_size1 + (heads - ((vector_embed_size + word_embed_size1) % heads)) % heads

# load the data with pickle:
with open(dataset_synth_fpath, 'rb') as f:
    dataset_synth = pickle.load(f)
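
The padding arithmetic for the word embedding size can be worked through on its own. The function name `pad_word_embed_size` is hypothetical; the formula is the one used in the cell above, with the final modulus taken over the number of heads.

```python
def pad_word_embed_size(vector_size: int, word_size: int, heads: int) -> int:
    """Smallest word embedding size >= word_size such that
    (vector_size + result) is divisible by heads."""
    remainder = (vector_size + word_size) % heads
    return word_size + (heads - remainder) % heads

# values used in this notebook: 21 vector columns, 11 token ids, 8 heads
padded = pad_word_embed_size(vector_size=21, word_size=11, heads=8)
print(padded, 21 + padded)

# a case that does need padding: 21 + 10 = 31, so one extra dimension is added
padded_small = pad_word_embed_size(vector_size=21, word_size=10, heads=8)
print(padded_small, 21 + padded_small)
```

With the notebook's values the total is already 32, so no padding is added.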


LVM design options
1. Send the mixed data through the transformer, including the categorical token + real-valued vectors
    a. the token part of the data needs to go through an embedding at the input and output layers
    b. the loss function mixes types, e.g. MSE and CrossEntropy
2. Embed the tokens before sending the data as real-valued vectors
    a. the dataset needs pre-processing before the transformer
    b. the fine-tuning step will involve an additional adapter
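
For option 1b, a mixed loss could look like the sketch below: MSE on the real-valued part plus cross-entropy on token logits. The function `mixed_loss`, the `token_weight` parameter, and the assumption that the model emits per-step token logits are all hypothetical; the notebook itself trains with plain MSE.

```python
import torch
import torch.nn.functional as F

def mixed_loss(pred_vectors, target_vectors, token_logits, target_tokens, token_weight=1.0):
    """MSE on the vector part plus weighted cross-entropy on the token part.

    pred_vectors, target_vectors: (N, w, n)
    token_logits:                 (N, w, vocab)
    target_tokens:                (N, w) integer class ids
    """
    mse = F.mse_loss(pred_vectors, target_vectors)
    ce = F.cross_entropy(
        token_logits.reshape(-1, token_logits.size(-1)),  # (N * w, vocab)
        target_tokens.reshape(-1),                        # (N * w,)
    )
    return mse + token_weight * ce

torch.manual_seed(0)
N, w, n, vocab = 2, 5, 21, 11         # illustrative sizes only
pred_vectors = torch.randn(N, w, n, requires_grad=True)
target_vectors = torch.randn(N, w, n)
token_logits = torch.randn(N, w, vocab, requires_grad=True)
target_tokens = torch.randint(0, vocab, (N, w))

loss = mixed_loss(pred_vectors, target_vectors, token_logits, target_tokens)
loss.backward()                       # gradients flow to both parts
print(loss.item())
```

The relative weight of the two terms is a tuning choice; the two losses are on different scales, so `token_weight=1.0` is only a placeholder.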

In [11]:
# Dataset wrapper around a 3-D tensor; dim 0 is the sample ID
class TensorDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data

    def __getitem__(self, index):
        return self.data[index, :, :]

    def __len__(self):
        return self.data.shape[0]

In [12]:
class TensorDatasetEmbedder(torch.utils.data.Dataset):
    def __init__(self, data, embedder=None, word_embed_size=None, vocab_size=None):
        self.data = data
        self.device=data.device
        if embedder is not None:
            self.embedder = embedder
        else:
            self.embedder = nn.Embedding(vocab_size, word_embed_size, device=self.device)

    def __getitem__(self, index):
        x = self.data[index, :, :]
        embedded_token = self.embedder(x[:,-1].long())
        embedded_all   = torch.cat((x[:,:-1], embedded_token), dim=1) 
        return embedded_all

    def __len__(self):
        return self.data.shape[0]

In [13]:
dataset_synth2 = TensorDatasetEmbedder(dataset_synth, 
                                       word_embed_size=word_embed_size2, 
                                       vocab_size=dataset_synth_vocab_size)
dataset_synth2_train_loader = DataLoader(dataset_synth2)

In [23]:
class LitAutoEncoder(L.LightningModule):
    def __init__(self, lvm):
        super().__init__()
        self.lvm = lvm

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        x = batch
        x2 = x.view(x.size(0), -1)
        x_hat = self.lvm(x)
        x_hat2 = x_hat.view(x.size(0), -1)
        loss = F.mse_loss(x_hat2, x2)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer
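
The training step above compares the model output with the input at the same position. If the goal is instead to predict step $t+1$, one possible variant shifts the target by one position. The sketch below shows only that shift, with a hypothetical helper `next_step_mse`; it is an assumption about how such an objective could be written, not the notebook's training objective.

```python
import torch
import torch.nn.functional as F

def next_step_mse(model_output: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Compare the output at position t with the input at position t + 1.

    model_output, x: (N, seq_len, features); the last output has no target and is dropped.
    """
    return F.mse_loss(model_output[:, :-1], x[:, 1:])

x = torch.arange(12, dtype=torch.float32).reshape(1, 4, 3)

# an output that is exactly the input shifted left by one step scores zero
perfect = torch.roll(x, shifts=-1, dims=1)
loss_perfect = next_step_mse(perfect, x)

# copying the input unchanged is penalized under the shifted objective
loss_copy = next_step_mse(x, x)
print(loss_perfect.item(), loss_copy.item())
```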

In [38]:
lvm1 = LVM(decoder='simple',
          vocab_size=dataset_synth_vocab_size, 
            vector_embed_size=vector_embed_size, 
            word_embed_size=word_embed_size2,
            heads=heads,
            max_length=dataset_synth.shape[1],
            device='cuda')

In [39]:
#print(embedded_all.shape)

#lvm1(dataset_synth2[0].to(lvm1.device))

In [40]:
# model
autoencoder = LitAutoEncoder(lvm=lvm1)

# train model
trainer = L.Trainer()
trainer.fit(model=autoencoder, train_dataloaders=dataset_synth2_train_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name | Type | Params
------------------------------
0 | lvm  | LVM  | 76.2 K
------------------------------
76.2 K    Trainable params
0         Non-trainable params
76.2 K    Total params
0.305     Total estimated model params size (MB)


Epoch 0:   0%|          | 0/1000 [00:00<?, ?it/s] 

Epoch 8:  31%|███▏      | 313/1000 [00:07<00:17, 40.17it/s, v_num=4] 

c:\Users\agutf\miniforge3\envs\lvm\Lib\site-packages\lightning\pytorch\trainer\call.py:54: Detected KeyboardInterrupt, attempting graceful shutdown...
