# LVM - large vector model

## Abstract
A large vector model (LVM) is a generalization of the LLM class of models used in language generation.
LVMs are transformer-based models designed to predict temporal sequences of tokens.
However, unlike LLMs, which generate human-language tokens, an LVM can generate a sequence of $\mathbb{R}^n$ vectors
or a mixture of aligned tokens and vectors.
LVMs promise to help predict time-series data in a variety of significant applications.
In particular, they provide an important generalization of binary classification problems involving extended history.


## Brief Introduction
The transformer architecture in deep neural networks has been applied widely beyond its original use for predicting human-language sequences. In particular, a number of studies have applied transformers to forecasting time-series data (either regular or irregular). For example, a pure vector sequence might represent daily prices of 500 stocks over 300 days, with the goal of predicting the prices over the following $T$ days.

This project introduces a novel generalization of transformers for time-series tabular data with aligned word sequences. In the original transformer, the input is a sequence of length $w$ drawn from a dictionary $D$ of tokens: $(x_1, \dots, x_w) \in D^w$. Here we suppose that at each of the $w$ positions we receive both a vector and a token, so that $(x_1, \dots, x_w) \in (\mathbb{R}^n \times D)^w$. By contrast, classic time-series prediction has no token component, i.e. $(x_1, \dots, x_w) \in (\mathbb{R}^n)^w$ (informally, $D=\emptyset$).
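As a minimal sketch of this input space, assume (as the `LVM` class later in this notebook does) that each time step is stored as one row whose first entry is the token id and whose remaining $n$ entries are the real vector. The helper name `make_mixed_sequence` and all sizes are illustrative assumptions, not part of the project code.

```python
import torch


def make_mixed_sequence(tokens: torch.Tensor, vectors: torch.Tensor) -> torch.Tensor:
    """Pack token ids (N, w) and vectors (N, w, n) into one (N, w, 1 + n) tensor."""
    if tokens.shape != vectors.shape[:2]:
        raise ValueError("tokens and vectors must agree on batch size and length")
    # Token ids are stored as floats in column 0 so that one tensor holds both parts.
    return torch.cat([tokens.unsqueeze(-1).to(vectors.dtype), vectors], dim=-1)


N, w, n, vocab = 2, 5, 3, 10  # illustrative sizes
tokens = torch.randint(0, vocab, (N, w))
vectors = torch.randn(N, w, n)
x = make_mixed_sequence(tokens, vectors)
```

Splitting the tensor again is a matter of slicing: `x[..., 0].long()` recovers the token ids and `x[..., 1:]` recovers the vectors.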


There are numerous interesting applications of this setting:
- human-produced words where, at each token, we *also* measure characteristics such as volume or (suitably quantified) emotion
- next-label prediction from sequence data: say, a physical system (a plant, a natural phenomenon) whose state on each day is described by a state vector plus a categorical label (binary or N-ary), where the goal is to predict the label at $t+1$
- general situations normally solved by hidden Markov models, where tokens are produced by states in $\mathbb{R}^n$

Novelty:
- whereas there are several successful implementations of time-series transformer models, here we extend generation to include a token label
- most importantly, when the token is a binary label, the approach promises to improve the solution of binary classification problems in which the input vectors have an extended history whose length varies from sample to sample
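To make "history that varies from sample to sample" concrete, the following sketch batches histories of different lengths by zero-padding them and tracking which positions hold real data. This is one common way to handle variable lengths, shown under assumed toy sizes; the project's own data loader is still work in progress.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three samples with histories of different lengths; each step is a vector in R^2.
histories = [torch.randn(4, 2), torch.randn(2, 2), torch.randn(3, 2)]
lengths = torch.tensor([h.shape[0] for h in histories])

# Zero-pad to the longest history: shape (3, 4, 2).
padded = pad_sequence(histories, batch_first=True)

# Boolean mask: True where a position holds real data, False where it is padding.
valid = torch.arange(padded.shape[1]).unsqueeze(0) < lengths.unsqueeze(1)
```

A mask like `valid` can then be used to exclude padded positions from attention and from the loss.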

## Related studies
- Liu et al. 2023 - [iTransformer: Inverted Transformers Are Effective for Time Series Forecasting](https://arxiv.org/abs/2310.06625)
- Ma et al. 2023 - [BTAD: A binary transformer deep neural network model for anomaly detection in multivariate time series data](https://www.sciencedirect.com/science/article/abs/pii/S1474034623000770?via%3Dihub)


# Results
In a previous unpublished study, the authors developed a simple generalization of transformers to binary state prediction in healthcare management; i.e. the goal was to predict the label at $t+1$, where $D=\{0,1\}$.
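The next-label objective can be illustrated by shifting each sequence by one step. The sketch below is not the code of the unpublished study; it assumes the layout used elsewhere in this notebook (label in column 0, vector in the remaining columns), and `next_label_pairs` is a hypothetical helper name.

```python
import torch


def next_label_pairs(x: torch.Tensor):
    """Split (N, w, 1 + n) sequences into model inputs and next-step labels.

    inputs  covers steps 0 .. w-2, shape (N, w-1, 1 + n)
    targets holds the labels of steps 1 .. w-1, shape (N, w-1)
    """
    inputs = x[:, :-1, :]
    targets = x[:, 1:, 0]
    return inputs, targets


labels = torch.tensor([[0.0, 1.0, 1.0, 0.0]])  # one toy sequence with D = {0, 1}
vectors = torch.zeros(1, 4, 2)                  # placeholder state vectors
x = torch.cat([labels.unsqueeze(-1), vectors], dim=-1)
inputs, targets = next_label_pairs(x)
```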


# Code
Code sources
- [Aladdin Persson's transformer from scratch](https://github.com/aladdinpersson/Machine-Learning-Collection/blob/558557c7989f0b10fee6e8d8f953d7269ae43d4f/ML/Pytorch/more_advanced/transformer_from_scratch/transformer_from_scratch.py).  Under the [MIT License](https://github.com/aladdinpersson/Machine-Learning-Collection/blob/558557c7989f0b10fee6e8d8f953d7269ae43d4f/LICENSE.txt)
- standard dataloader examples 
- GitHub Copilot

WIP
- data loader, e.g. using PyTorch Lightning
- training code: pretraining, using PyTorch Lightning

In [4]:
# Requirements
# torch - select your hardware, CUDA version and OS 

#pip install positional-encodings==6.0.1

In [1]:
import pandas as pd
import pdb
import pickle
import numpy as np
import os
import sys, math, copy
import time
import warnings
from typing import Tuple
from tempfile import TemporaryFile

import torch
import torch.autograd as autograd
from torch import nn, Tensor
from torch.utils.data import dataset, DataLoader


from positional_encodings.torch_encodings import PositionalEncoding1D, Summer


# Model architecture

This is a decoder-only transformer model, consisting of:
1. an input sentence containing vectors + words
2. a positional encoder
3. masked multi-headed attention
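The `Decoder` in this notebook uses a learned position embedding (`nn.Embedding`). For comparison, the sketch below shows the fixed sinusoidal encoding that is a common alternative for the positional-encoder step; it is written by hand here for illustration, and `sinusoidal_encoding` is a hypothetical helper, not something the notebook calls.

```python
import math

import torch


def sinusoidal_encoding(seq_len: int, dim: int) -> torch.Tensor:
    """Return a (seq_len, dim) table of fixed sine/cosine position encodings."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    # Frequencies decay geometrically from 1 down to roughly 1/10000.
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(position * div_term)  # even channels
    pe[:, 1::2] = torch.cos(position * div_term)  # odd channels
    return pe


pe = sinusoidal_encoding(seq_len=6, dim=8)
x = torch.zeros(2, 6, 8)   # (batch, seq_len, embed_size) placeholder embeddings
x_with_pos = x + pe        # broadcasts over the batch dimension
```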

In [32]:
class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

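        assert (
            self.head_dim * heads == embed_size
        ), "Embedding size needs to be divisible by heads"

        # Linear projections for values, keys and queries, plus the output layer
        self.values = nn.Linear(embed_size, embed_size)
        self.keys = nn.Linear(embed_size, embed_size)
        self.queries = nn.Linear(embed_size, embed_size)
        self.fc_out = nn.Linear(embed_size, embed_size)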

    def forward(self, values, keys, query, mask):
        N = query.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]
        values = self.values(values)  #(N, value_len, embed_size)
        keys = self.keys(keys) #(N, key_len, embed_size)
        queries = self.queries(query) #(N, query_len, embed_size)

        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = queries.reshape(N, query_len, self.heads, self.head_dim)

        #einstein summation for tensor multiplication
        energy = torch.einsum("nqhd, nkhd->nhqk", [queries, keys])
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))

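        # Note: this scales by sqrt(embed_size); the original Transformer paper
        # scales by sqrt(d_k), which corresponds to sqrt(head_dim) here.
        attention = torch.softmax(energy / (self.embed_size ** (1/2)), dim=3)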

        out = torch.einsum("nhql,nhld->nqhd", [attention, values]).reshape(
            N, query_len, self.heads*self.head_dim
        )

        out = self.fc_out(out)

        return out        

In [37]:
class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super(TransformerBlock, self).__init__()
        self.attention = SelfAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)

        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion*embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask):
        attention = self.attention(value, key, query, mask)

        x = self.dropout(self.norm1(attention + query))
        forward = self.feed_forward(x)
        out = self.dropout(self.norm2(forward + x))

        return out


In [43]:
class Decoder(nn.Module):
    def __init__(
                self,
                word_embed_size,
                vector_embed_size,
                num_layers,
                heads,
                trg_vocab_size,
                forward_expansion,
                dropout,
                device,
                max_length):
            super(Decoder, self).__init__()
            self.device = device
            self.embed_size = vector_embed_size + word_embed_size
            
            #TODO: implement more intelligent embedding - currently it just randomizes. e.g. PositionalEncoding1d
            self.word_embedding = nn.Embedding(trg_vocab_size, word_embed_size)
            self.position_embedding = nn.Embedding(max_length, self.embed_size)

            self.layers = nn.ModuleList(
                [
                    TransformerBlock(embed_size=self.embed_size, heads=heads,
                                    dropout=dropout, forward_expansion=forward_expansion)
                                    for _ in range(num_layers)
                ]
            )
            self.dropout = nn.Dropout(dropout)
            self.fc_out = nn.Linear(self.embed_size, 1)

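    def embed(self, x):
        # TODO: replace with a more considered embedding.
        # Placeholder sketch (assumption): x has shape (N, seq_len, 1 + vector_embed_size),
        # where x[..., 0] is the token id and x[..., 1:] is the real-valued vector.
        N, seq_len, _ = x.shape
        tokens = x[..., 0].long()
        vectors = x[..., 1:].float()
        positions = torch.arange(seq_len, device=x.device).expand(N, seq_len)
        # Concatenate the token embedding with the raw vector, then add the position embedding.
        features = torch.cat([self.word_embedding(tokens), vectors], dim=-1)
        return features + self.position_embedding(positions)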
    
    def forward(self, x_input, trg_mask):
        out = self.dropout(self.embed(x_input))
        for layer in self.layers:
            out = layer(out, out, out, trg_mask) 

        out = self.fc_out(out)  # logits for binary classification (sigmoid is applied in LVM.forward); in general, train with self-supervision

        return out


In [2]:
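class LVM(nn.Module):
    """
    The main class for the LVM.

    It assumes that the first feature of each time step, x[..., 0], is the word
    (token id), and that the rest is real vectorized data.

    Parameters
        word_embed_size    size of the token embedding
        vector_embed_size  dimension of the real-valued vector at each time step
        num_layers         number of transformer blocks
        heads              number of attention heads; must divide
                           word_embed_size + vector_embed_size (612 with the defaults)
    """
    def __init__(self, word_embed_size=512, vector_embed_size=100, num_layers=6, forward_expansion=4, heads=6, dropout=0, device='cpu', trg_vocab_size=1000, max_length=100):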
        super(LVM, self).__init__()

        self.decoder = Decoder(word_embed_size=word_embed_size, vector_embed_size=vector_embed_size,
                               num_layers=num_layers, heads=heads, forward_expansion=forward_expansion,
                                dropout=dropout, device=device, trg_vocab_size=trg_vocab_size, max_length=max_length)
        self.device = device
        #self._init_weights()

    def make_trg_mask(self, trg):
        N, trg_len, _ = trg.shape
        trg_mask = torch.tril(torch.ones((trg_len, trg_len))).expand(
            N, 1, trg_len, trg_len
        )
        return trg_mask.to(self.device)

    def forward(self, x):
        trg_mask = self.make_trg_mask(x)
        out = self.decoder(x, trg_mask)

        return torch.sigmoid(out)  # binary classification output; TODO: typically pre-train with a self-supervised objective
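The effect of the lower-triangular mask built in `make_trg_mask` can be seen on a tiny example. The sketch below uses uniform attention scores (an assumption made purely for readability) and applies the same `masked_fill` step as `SelfAttention.forward`: each position ends up attending only to itself and to earlier positions.

```python
import torch

seq_len = 4
# 1 = may attend, 0 = blocked (future position)
mask = torch.tril(torch.ones(seq_len, seq_len))

energy = torch.zeros(seq_len, seq_len)  # uniform scores, for illustration only
masked = energy.masked_fill(mask == 0, float("-1e20"))
attention = torch.softmax(masked, dim=-1)

# Row t spreads its weight evenly over positions 0..t and gives zero weight to the future.
print(attention)
```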


# Training

In [5]:
# criterion = nn.BCELoss()
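Since the training code is still work in progress, the following is only a sketch of what a training loop with `nn.BCELoss` could look like. It uses a small hypothetical stand-in network in place of the `LVM`, synthetic data, and assumed hyperparameters (Adam, learning rate 0.01, 50 steps); none of these choices come from the project.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for the LVM: maps (N, w, features) to per-step
# probabilities of shape (N, w, 1), ending in a sigmoid as LVM.forward does.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
criterion = nn.BCELoss()  # expects probabilities in [0, 1], not logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

x = torch.randn(16, 5, 4)        # synthetic inputs
y = (x[..., :1] > 0).float()     # synthetic binary labels, shape (16, 5, 1)

losses = []
for _ in range(50):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

If the model returned raw logits instead of probabilities, `nn.BCEWithLogitsLoss` would be the matching criterion.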