# <h1 style="color:rgb(228, 12, 33); text-align: center;">From Theory to Practice: LSTM and Transformers in PyTorch</h1>

---
![image.png](https://discuss.pytorch.org/uploads/default/6415da0424dd66f2f5b134709b92baa59e604c55)

<div style="background-color: rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 10px; color: #333; font-family: Arial, sans-serif;">
    <p>Welcome to this Kaggle notebook, where we'll dive deep into understanding and implementing Long Short-Term Memory (LSTM) networks using PyTorch, a powerful deep learning framework. But before we delve into the intricacies of LSTM, let's take a moment to understand the basic concepts of time series data, Recurrent Neural Networks (RNNs), and LSTM.</p>
    <h2 style="color:rgb(31, 103, 211);">Time Series Data</h2>
    <p>Time series data is a sequence of numerical data points taken at successive equally spaced points in time. These data points are ordered and depend on the previous data points, making time series data a prime candidate for predictions. Examples of time series data include stock prices, weather forecasts, and sales data, among many others.</p>
    <h2 style="color:rgb(31, 103, 211);">Recurrent Neural Networks (RNNs)</h2>
    <p>Traditional neural networks struggle with time series data due to their inability to remember previous inputs in their current state. Recurrent Neural Networks (RNNs), however, are designed to address this problem. RNNs are a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows them to use their internal state (memory) to process sequences of inputs, making them ideal for time-dependent data.</p>
    <p>However, RNNs suffer from certain limitations. They struggle to handle long-term dependencies because of the 'vanishing gradient' problem: as the error signal is propagated back through time, it shrinks geometrically with each step, making it difficult for the RNN to learn from earlier time steps.</p>
    <h2 style="color:rgb(31, 103, 211);">Long Short-Term Memory (LSTM)</h2>
    <p>Long Short-Term Memory networks, or LSTMs, are a special kind of RNN capable of learning long-term dependencies. Introduced by Hochreiter and Schmidhuber in 1997, LSTMs have a unique design that helps combat the vanishing gradient problem. They contain a cell state and three gates (input, forget, and output) to control the flow of information inside the network, allowing them to remember or forget information over long periods of time.</p>
    <p>In this notebook, we will explore how to correctly implement LSTM in PyTorch and use it for time series prediction tasks. We will cover everything from the basics of LSTM to its implementation, aiming to provide a comprehensive understanding of this powerful neural network architecture. Let's get started!</p>
</div>
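To make the three gates concrete, the sketch below computes a single LSTM step by hand and compares it with `nn.LSTMCell`. It relies on PyTorch's documented parameter layout, in which the input, forget, cell-candidate, and output blocks are stacked in that order inside `weight_ih` and `weight_hh`. The sizes and variable names are illustrative and are not used elsewhere in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=3, hidden_size=4)

x = torch.randn(2, 3)   # (batch, input_size)
h = torch.zeros(2, 4)   # previous hidden state
c = torch.zeros(2, 4)   # previous cell state

# One affine map produces all four gate pre-activations at once
gates = x @ cell.weight_ih.T + cell.bias_ih + h @ cell.weight_hh.T + cell.bias_hh
i, f, g, o = gates.chunk(4, dim=1)

i = torch.sigmoid(i)    # input gate: how much new information to write
f = torch.sigmoid(f)    # forget gate: how much of the old cell state to keep
g = torch.tanh(g)       # candidate values for the cell state
o = torch.sigmoid(o)    # output gate: how much of the cell state to expose

c_next = f * c + i * g
h_next = o * torch.tanh(c_next)

# Reference result from the built-in cell
h_ref, c_ref = cell(x, (h, c))
matches = torch.allclose(h_next, h_ref, atol=1e-5) and torch.allclose(c_next, c_ref, atol=1e-5)
print(matches)
```

The additive update `c_next = f * c + i * g` is the part that lets information, and gradients, survive over many time steps.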


<div style="background-color: rgba(100, 108, 116, 0.1); padding: 20px; border-radius: 10px; color: #333; font-family: Arial, sans-serif;">
    <h2 style="color:rgb(31, 103, 211);">Understanding Input and Output in torch.nn.RNN</h2>
    <p>In this section, we're going to delve into the specifics of the input and output parameters of the torch.nn.RNN module, a built-in recurrent neural network (RNN) implementation in the PyTorch library. It's crucial to understand these parameters to fully leverage PyTorch's RNN capabilities in our LSTM implementation.</p>
    <h3 style="color:rgb(172, 28, 44);">Input to torch.nn.RNN</h3>
    <p>The torch.nn.RNN module takes in two primary inputs:</p>
    <ul>
        <li><b>input</b>: This represents the sequence that is fed into the network. The expected size is (seq_len, batch, input_size). However, if batch_first=True is specified, then the input size should be rearranged to (batch, seq_len, input_size).</li>
        <li><b>h_0</b>: This stands for the initial hidden state of the network at time step t=0. If we don't pass it, PyTorch initializes it with zeros. The size of h_0 should be (num_layers * num_directions, batch, hidden_size), where num_layers represents the number of stacked RNNs and num_directions equals 2 for bidirectional RNNs and 1 otherwise.</li>
    </ul>
    <h3 style="color:rgb(172, 28, 44);">Output from torch.nn.RNN</h3>
    <p>The torch.nn.RNN module provides two outputs:</p>
    <ul>
        <li><b>out</b>: This represents the output from the last RNN layer for all time steps. The size is (seq_len, batch, num_directions * hidden_size). However, if batch_first=True is specified, the output size becomes (batch, seq_len, num_directions * hidden_size).</li>
        <li><b>h_n</b>: This is the hidden state value from the last time step across all RNN layers. The size is (num_layers * num_directions, batch, hidden_size). Unlike <b>input</b> and <b>out</b>, h_n is unaffected by batch_first=True.</li>
    </ul>
    <p>To better visualize these inputs and outputs, refer to the following diagram. In this case, we assume a batch size of 1. While the diagram illustrates an LSTM, which has two hidden parameters (h, c), please note that RNN and GRU only have h.</p>
    <p>By understanding these parameters, we can harness the power of the torch.nn.RNN module and build effective models for our time series data using LSTM. Let's continue our exploration of LSTM with PyTorch in the following sections.</p>
</div>


![image.png](https://miro.medium.com/max/576/1*tUxl5-C-t3Qumt0cyVhm2g.png)
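The shapes described above can be checked with a quick experiment. This is a minimal sketch with arbitrary sizes (batch of 4, sequence length 5, 3 input features, 8 hidden units, 2 layers), chosen only for illustration.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=3, hidden_size=8, num_layers=2, batch_first=True)

x = torch.randn(4, 5, 3)    # (batch, seq_len, input_size) because batch_first=True
h0 = torch.zeros(2, 4, 8)   # (num_layers * num_directions, batch, hidden_size)

out, hn = rnn(x, h0)
out_shape = tuple(out.shape)   # (batch, seq_len, num_directions * hidden_size)
hn_shape = tuple(hn.shape)     # (num_layers * num_directions, batch, hidden_size)
print(out_shape, hn_shape)

# For a unidirectional RNN, the last time step of `out` equals the
# final hidden state of the top layer.
last_step_matches = torch.allclose(out[:, -1, :], hn[-1])
print(last_step_matches)
```

Note that `h0` and `hn` keep the layer dimension first even though `batch_first=True` is set.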

<a id="ToC"></a>
# Table of Contents
- [1. Imports](#1)
- [2. LSTM](#2)
    - [Many-to-One](#2.1)
    - [Many-to-Many](#2.2)
    - [Many-to-Many generating sequence](#2.3)
- [3. Transformers](#3)
    - [Masking Input](#3.1)
    - [SOS & EOS tokens](#3.2)

<a id="1"></a>
# **<div style="padding:10px;color:white;display:fill;border-radius:5px;background-color:rgb(31, 103, 211);font-size:120%;font-family:Verdana;"><center><span> Imports </span></center></div>**

In [None]:
import torch 
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

from tqdm import tqdm 
import numpy as np 
import pandas as pd 
import random
import matplotlib.pyplot as plt 
import seaborn as sns
sns.set_style('white')

<div style="background-color: #f2f2f2; padding: 20px; border-radius: 10px; color: #333; font-family: Arial, sans-serif;">
    <h2 style="color:rgb(31, 103, 211);">About the Dataset</h2>
    <p>In this notebook, we use a simple time series to test and understand LSTM and Transformer models. The dataset is deliberately straightforward: the integers from 0 to 999, in order. This simplicity lets us focus on how the LSTM and Transformer models work and on how well they can pick up a simple numerical sequence. Through this, we aim to build a clear understanding of these deep learning techniques.</p>
</div>


<a id="2"></a>
# **<div style="padding:10px;color:white;display:fill;border-radius:5px;background-color:rgb(31, 103, 211);font-size:120%;font-family:Verdana;"><center><span> LSTM </span></center></div>**

<a id="2.1"></a>
<div style="background-color: #f2f2f2; padding: 20px; border-radius: 10px; color: #333; font-family: Arial, sans-serif;">
    <h2 style="color:rgb(31, 103, 211);">Understanding Many-to-One Architecture in LSTM</h2>
    <p>Long Short-Term Memory (LSTM) networks, like all Recurrent Neural Networks (RNNs), are renowned for their ability to process sequential data. One of the key aspects that make them flexible and powerful is the various types of input-output architectures they can adopt, one of which is the Many-to-One architecture.</p>
    <p>In a Many-to-One LSTM architecture, the model accepts a sequence of inputs over multiple time steps and produces a single output. In each time step, the LSTM cell takes in an input and the previous cell's hidden state, processes them, and passes on its own hidden state to the next cell.</p>
    <p>Despite receiving input at each time step, the Many-to-One LSTM only produces its final output at the last time step. This characteristic makes Many-to-One LSTM networks particularly useful for tasks like sentiment analysis, where a model reads a sequence of words (input) and outputs a single sentiment score, or text classification, where a document is read sequentially and a single class label is output.</p>
    <p>Through the power of LSTM and the flexibility of architectures like Many-to-One, we can effectively tackle a wide range of sequence-based problems in the world of machine learning and artificial intelligence.</p>
</div>


### Create Custom Data Loader [multi-core]

In [None]:
class CustomDataset(Dataset):
    def __init__(self, seq_len=5, max_len=1000):
        super().__init__()
        self.datalist = np.arange(0,max_len)
        self.data, self.targets = self.timeseries(self.datalist, seq_len)
        
    def __len__(self):
        return len(self.data)
    
    def timeseries(self, data, window):
        temp = []
        targ = data[window:]
        for i in range(len(data)-window):
            temp.append(data[i:i+window])

        return np.array(temp), targ
    
    def __getitem__(self, index):
        x = torch.tensor(self.data[index]).type(torch.Tensor)
        y = torch.tensor(self.targets[index]).type(torch.Tensor)
        return x,y
    
dataset = CustomDataset(seq_len=5, max_len=1000)

In [None]:
for x,y in dataset:
    print(x,y)
    break

In [None]:
dataloader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=4)
#collate_fn=custom_collector

In [None]:
for x,y in dataloader:
    print(x,y)
    break

<div style="background-color: #f2f2f2; padding: 20px; border-radius: 10px; color: #333; font-family: Arial, sans-serif;">
<p>Let's take a closer look at our specific use case for the many-to-one LSTM architecture. In our scenario, we feed the LSTM a window of 5 consecutive numbers, and we expect the model to predict the 6th number in the sequence. While we've chosen a straightforward series of incrementing numbers for this example, the potential applications of this concept extend much further.</p>

<p style="color:rgb(172, 28, 44);">Imagine this sequence being a time-series data of stock prices, weather conditions, or even a series of steps in a logical reasoning question. The ability to predict the next event based on a series of preceding events is a critical aspect in many fields, including finance, meteorology, and artificial intelligence. By training our LSTM model to understand and predict these sequences, we can leverage the many-to-one LSTM architecture to solve complex problems in these areas and beyond.</p>

</div>
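The windowing idea can be seen on a tiny series. The sketch below uses a hypothetical 10-element series rather than the full 0 to 999 range, and builds the same kind of (window, next value) pairs that the dataset class above produces.

```python
import numpy as np

series = np.arange(0, 10)   # toy series: 0, 1, ..., 9
window = 5

# Each input is a window of 5 consecutive values ...
inputs = np.stack([series[i:i + window] for i in range(len(series) - window)])
# ... and its target is the value that immediately follows the window.
targets = series[window:]

for x, y in zip(inputs, targets):
    print(x, "->", y)
```

Every input row lines up with exactly one target, so `inputs` and `targets` have the same length.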

In [None]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        self.lstm = nn.LSTM(input_size,hidden_size,num_layers,batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # no initial states are passed, so h0 and c0 default to zeros
        out, (hn, cn) = self.lstm(x)
        
        # many-to-one: keep only the output of the last time step
        # out: (batch, seq_len, hidden_size) -> out[:, -1, :]: (batch, hidden_size)
        out = out[:, -1, :]
        out = self.fc(out)
        
        return out

In [None]:
model = RNN(input_size=1, hidden_size=256, num_layers=2)

In [None]:
t = torch.tensor([11,12,13,14,15]).type(torch.Tensor).view(1,-1,1)
t.shape

In [None]:
model(t)

### Training 

In [None]:
loss_function = nn.MSELoss()
learning_rate = 1e-3
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

In [None]:
for e in tqdm(range(50)):
    i = 0
    for x,y in dataloader:
        optimizer.zero_grad()

        x = torch.unsqueeze(x, 0).permute(1,2,0)
        # forward
        predictions = model(x)

        loss = loss_function(predictions.view(-1), y)
        
        # backward
        loss.backward()

        # optimization
        optimizer.step()

        i+=1
    if e%5==0:
        print(loss.detach().numpy())

In [None]:
input_tensor = torch.tensor([10,11,12,13,14]).type(torch.Tensor).view(1,-1,1)
model(input_tensor)

<a id="2.2"></a>
<div style="background-color: #f2f2f2; padding: 20px; border-radius: 10px; color: #333; font-family: Arial, sans-serif;">
    <h2 style="color:rgb(31, 103, 211);">Understanding Many-to-Many Architecture in LSTM</h2>
   <p>Another crucial architecture in the world of Long Short-Term Memory (LSTM) networks, a type of Recurrent Neural Network (RNN), is the Many-to-Many architecture. This architecture offers a versatile way of handling a diverse set of problems involving sequential data.</p>
    <p>In a Many-to-Many LSTM architecture, the model processes a sequence of inputs over multiple time steps and generates a sequence of outputs. In this setting, each LSTM cell takes in an input and the previous cell's hidden state at each time step, then produces an output along with its own hidden state that it passes on to the next cell.</p>
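    <p>Unlike the Many-to-One LSTM, the Many-to-Many LSTM doesn't wait till the last time step to produce an output. Instead, it can generate an output at each time step. This makes Many-to-Many LSTM networks well suited to tasks such as part-of-speech tagging, where every input word receives a label. Machine translation, where a sequence of words in one language (input) is translated into a sequence of words in another language (output), is also a Many-to-Many problem, although it is usually handled with an encoder-decoder variant because the input and output lengths differ.</p>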
    <p>The Many-to-Many architecture of LSTM opens up a broad array of possibilities, making it a powerful tool in the realms of machine learning and artificial intelligence.</p>
</div>
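The difference between the two architectures comes down to which part of the LSTM output is kept. The following sketch, with arbitrary sizes chosen for illustration, contrasts the two slices.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=1, hidden_size=16, batch_first=True)

x = torch.randn(8, 50, 1)      # (batch, seq_len, input_size)
out, (hn, cn) = lstm(x)        # out: (batch, seq_len, hidden_size)

many_to_one = out[:, -1, :]    # only the last time step: (batch, hidden_size)
many_to_many = out             # every time step: (batch, seq_len, hidden_size)

print(tuple(many_to_one.shape), tuple(many_to_many.shape))
```

A linear layer applied to `many_to_one` yields one prediction per sequence, while a linear layer applied to `many_to_many` yields one prediction per time step.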

In [None]:
class CustomDataset(Dataset):
    def __init__(self, seq_len=50, future=5,  max_len=1000):
        super().__init__()
        self.datalist = np.arange(0,max_len)
        self.data, self.targets = self.timeseries(self.datalist, seq_len, future)
        
    def __len__(self):
        #this len will decide the index range in getitem
        return len(self.targets)
    
    def timeseries(self, data, window, future):
        temp = []
        targ = []
        
        for i in range(len(data)-window):
            temp.append(data[i:i+window])
            
        for i in range(len(data)-window -future):
            targ.append(data[i+window:i+window+future])

        return np.array(temp), targ
    
    def __getitem__(self, index):
        x = torch.tensor(self.data[index]).type(torch.Tensor)
        y = torch.tensor(self.targets[index]).type(torch.Tensor)
        return x,y
    
dataset = CustomDataset(seq_len=50, future=5, max_len=1000)

In [None]:
for x,y in dataset:
    print(x.shape, y.shape)
    break

In [None]:
dataloader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)
#collate_fn=custom_collector

In [None]:
for x,y in dataloader:
    print(x.shape, y.shape)
    break

In [None]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, future=5):
        super().__init__()
        self.future = future
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        self.lstm = nn.LSTM(input_size,hidden_size,num_layers,batch_first=True)
        self.fc = nn.Linear(hidden_size, future)

    def forward(self, x):
        # no initial states are passed, so h0 and c0 default to zeros
        out, (hn, cn) = self.lstm(x)
        
        # use the output of the last time step, which has seen the whole window;
        # the linear layer then maps it to `future` values at once
        # out: (batch, seq_len, hidden_size) -> (batch, hidden_size)
        out = out[:, -1, :]
        out = self.fc(out)
        
        return out

In [None]:
model = RNN(input_size=1, hidden_size=256, num_layers=2, future=5)

In [None]:
d = 45
t = torch.tensor(np.arange(d,d+50)).type(torch.Tensor).view(1,-1,1)
t.shape

In [None]:
model(t)

In [None]:
loss_function = nn.MSELoss()
learning_rate = 1e-3
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

In [None]:
for e in tqdm(range(50)):
    i = 0
    avg_loss = []
    for x,y in dataloader:
        optimizer.zero_grad()

        x = torch.unsqueeze(x, 0).permute(1,2,0)
        # forward
        predictions = model(x)
        
        # loss
        loss = loss_function(predictions, y)
        
        # backward
        loss.backward()

        # optimization
        optimizer.step()
        avg_loss.append(loss.detach().numpy())

        i+=1
    if e%2==0:
        avg_loss = np.array(avg_loss)
        print(avg_loss.mean())

<div style="background-color: #f2f2f2; padding: 20px; border-radius: 10px; color: #333; font-family: Arial, sans-serif;">
<p>After feeding the initial 50 terms of our sequence into the model, we begin to observe some promising results. It appears that the model is successfully learning to recognize the underlying patterns in the sequence.</p>
<p>The output generated by the model seems to adhere to the logic of the sequence, suggesting that the LSTM architecture is effectively capturing and understanding the sequential dependencies. This ability to discern patterns and extrapolate them is a powerful aspect of LSTM networks, and it's rewarding to see it at work in our model.</p>
<p>These early results are encouraging, indicating that our model is on the right track. As we continue to refine and train our LSTM, we can expect it to become even more adept at understanding and predicting the sequence.</p>
</div>

In [None]:
d = random.randint(0,1000)
t = torch.tensor(np.arange(d,d+50)).type(torch.Tensor).view(1,-1,1)
r = model(t).view(-1)

In [None]:
fig = plt.figure(figsize=(16,4))
plt_x = np.arange(0,t.shape[1]+len(r))
plt_y = np.arange(d,d+50+len(r))

plt_xp = np.arange(t.shape[1], t.shape[1]+len(r))
plt_yp = r.detach().numpy()

# real values (input window plus the true continuation) vs. predicted continuation
plt.scatter(plt_x, plt_y, label="real")
plt.scatter(plt_xp, plt_yp, label="predicted")

plt.legend()
plt.show()
    

<a id="2.3"></a>
<div style="background-color: #f2f2f2; padding: 20px; border-radius: 10px; color: #333; font-family: Arial, sans-serif;">
    <h2 style="color:rgb(31, 103, 211);">Understanding Many-to-Many Sequence Generation with LSTM</h2>
    <p>When working with Long Short-Term Memory (LSTM) networks, it's essential to understand how sequence generation is handled, particularly in a Many-to-Many setting. In such an architecture, the output from each LSTM cell can be used as an input to a subsequent feed-forward network to generate a sequence of outputs.</p>
    <p>Let's consider the following block of code as an example:</p>
    <pre style="background-color: #e0e0e0; padding: 10px; border-radius: 10px;">
        out, (hn, cn) = self.lstm(x)
        res = torch.zeros((out.shape[0], out.shape[1]))
        for b in range(out.shape[0]):
            feed = out[b, :, :]
            _out = self.fc(feed).view(-1)
            res[b] = _out
    </pre>
    <p>In this code, <code>self.lstm(x)</code> applies the LSTM layer to the input <code>x</code>, generating an output <code>out</code> of shape (batch, seq_len, hidden_size) and the final hidden and cell states <code>hn</code> and <code>cn</code>. We then initialize a zeros tensor <code>res</code> of shape (batch, seq_len), one value per time step, to store our results.</p>
    <p>Then, for each sequence in the output <code>out</code>, we feed the sequence through a fully connected layer <code>self.fc(feed)</code> and reshape the output to match our expected dimensions using <code>.view(-1)</code>. The result is stored in the corresponding position in <code>res</code>.</p>
    <p>This process exemplifies how a Many-to-Many LSTM network can be used to generate a sequence of outputs, with the LSTM layer and a subsequent feed-forward layer working in tandem to transform a sequence of inputs into a corresponding sequence of outputs.</p>
</div>
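Because `nn.Linear` acts on the last dimension of its input, the per-sequence loop can also be written as a single call. The sketch below uses a random tensor as a stand-in for the LSTM output (an assumption made only to keep the example short) and checks that both versions agree.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fc = nn.Linear(16, 1)
out = torch.randn(8, 50, 16)   # stand-in for the LSTM output: (batch, seq_len, hidden_size)

# Loop version: one sequence of the batch at a time
res_loop = torch.zeros((out.shape[0], out.shape[1]))
for b in range(out.shape[0]):
    feed = out[b, :, :]                 # (seq_len, hidden_size)
    res_loop[b] = fc(feed).view(-1)     # (seq_len,)

# Vectorized version: the linear layer is applied to every time step at once
res_vec = fc(out).squeeze(-1)           # (batch, seq_len)

same = torch.allclose(res_loop, res_vec, atol=1e-5)
print(tuple(res_vec.shape), same)
```

The loop is easier to follow step by step; the vectorized form avoids a Python-level loop over the batch.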


In [None]:
class CustomDataset(Dataset):
    def __init__(self, seq_len=50, future=50,  max_len=1000):
        super().__init__()
        self.datalist = np.arange(0,max_len)
        self.data, self.targets = self.timeseries(self.datalist, seq_len, future)
        
    def __len__(self):
        #this len will decide the index range in getitem
        return len(self.targets)
    
    def timeseries(self, data, window, future):
        temp = []
        targ = []
        
        for i in range(len(data)-window):
            temp.append(data[i:i+window])
            
        for i in range(len(data)-window -future):
            targ.append(data[i+future:i+window+future])

        return np.array(temp), targ
    
    def __getitem__(self, index):
        x = torch.tensor(self.data[index]).type(torch.Tensor)
        y = torch.tensor(self.targets[index]).type(torch.Tensor)
        return x,y
    
dataset = CustomDataset(seq_len=50, future=5, max_len=1000)

In [None]:
for x, y in dataset:
    print(x.shape, y.shape)
    break

In [None]:
x

In [None]:
y

In [None]:
dataloader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)
#collate_fn=custom_collector

In [None]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, future=5):
        super().__init__()
        self.future = future
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        self.lstm = nn.LSTM(input_size,hidden_size,num_layers,batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # no initial states are passed, so h0 and c0 default to zeros
        out, (hn, cn) = self.lstm(x)
        
        # many-to-many: apply the linear layer to every time step
        # out: (batch, seq_len, hidden_size) -> res: (batch, seq_len)
        res = torch.zeros((out.shape[0], out.shape[1]))
        for b in range(out.shape[0]):
            feed = out[b, :, :]
            _out = self.fc(feed).view(-1)
            res[b] = _out
        
        return res

In [None]:
model = RNN(input_size=1, hidden_size=256, num_layers=2, future=5)

In [None]:
t = torch.tensor(np.arange(d,d+50)).type(torch.Tensor).view(1,-1,1)
r = model(t).view(-1)

In [None]:
r

In [None]:
loss_function = nn.MSELoss()
learning_rate = 1e-3
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

In [None]:
for e in tqdm(range(100)):
    i = 0
    avg_loss = []
    for x,y in dataloader:
        optimizer.zero_grad()

        x = torch.unsqueeze(x, 0).permute(1,2,0)
        # forward
        predictions = model(x)
        
        # loss
        loss = loss_function(predictions, y)
        
        # backward
        loss.backward()

        # optimization
        optimizer.step()
        avg_loss.append(loss.detach().numpy())

        i+=1
        
    if e%5==0:
        avg_loss = np.array(avg_loss)
        print(avg_loss.mean())

In [None]:
d = random.randint(0,1000)
t = torch.tensor(np.arange(d,d+50)).type(torch.Tensor).view(1,-1,1)
r = model(t).view(-1)

In [None]:
fig = plt.figure(figsize=(16,4))
plt_x = np.arange(0,t.shape[1])
plt_y = np.arange(d,d+50)

plt_xp = np.arange(5, t.shape[1]+5)
plt_yp = r.detach().numpy()
plt.scatter(plt_x, plt_y, label="real")
plt.scatter(plt_xp, plt_yp, label="predicted")

plt.legend()
plt.show()

<a id="3"></a>
# **<div style="padding:10px;color:white;display:fill;border-radius:5px;background-color:rgb(31, 103, 211);font-size:120%;font-family:Verdana;"><center><span> Transformers </span></center></div>**

<div style="background-color: #f2f2f2; padding: 20px; border-radius: 10px; color: #333; font-family: Arial, sans-serif;">
    <p style="color:rgb(172, 28, 44);">Transformers, a breakthrough in the field of natural language processing, also adopt various types of input-output architectures, including the Many-to-Many setup. In this context, Transformers bring a unique approach to the table, contrasting with the methods used in traditional Recurrent Neural Networks (RNNs) such as LSTM.</p>
    <p>In a Many-to-Many Transformer architecture, the model accepts a sequence of inputs and returns a sequence of outputs. However, unlike RNNs, which process sequences in a time-stepped manner, Transformers process all inputs simultaneously. This is made possible by the attention mechanism, which allows the model to focus on different parts of the input sequence for each output, essentially creating a 'shortcut' between each input and output.</p>
    <p>This architecture is especially useful in tasks like machine translation, where the model needs to understand the context of the whole sentence to accurately translate it. Similarly, it can be used in tasks like text summarization or question answering, where understanding the entire context at once can lead to better results.</p>
    <p>The Many-to-Many architecture in Transformers, combined with their attention mechanism, offers an innovative approach to tackling sequential tasks, making Transformers a powerful tool in the field of machine learning and artificial intelligence.</p>
</div>
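The attention mechanism mentioned above can be sketched in a few lines. This is a minimal single-head scaled dot-product attention on random tensors; the sizes and names are illustrative, and `nn.Transformer` adds learned projections and multiple heads on top of this core idea.

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_k = 6, 8
q = torch.randn(seq_len, d_k)   # queries, one row per position
k = torch.randn(seq_len, d_k)   # keys
v = torch.randn(seq_len, d_k)   # values

# Every position is compared with every other position in one matrix product
scores = q @ k.T / math.sqrt(d_k)          # (seq_len, seq_len)
weights = torch.softmax(scores, dim=-1)    # each row sums to 1
context = weights @ v                      # weighted mix of the values

print(tuple(weights.shape), tuple(context.shape))
```

Since the whole `scores` matrix is computed at once, no position has to wait for the previous one, which is the source of the parallelism compared with an RNN.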



![image.png](https://images.deepai.org/converted-papers/2001.08317/x1.png)

In [None]:
class CustomDataset(Dataset):
    def __init__(self, seq_len=50, future=50,  max_len=1000):
        super().__init__()
        
        self.vocab = {'SOS':1001, 'EOS':1002}
        self.datalist = np.arange(0,max_len)
        self.data, self.targets = self.timeseries(self.datalist, seq_len, future)
        
    def __len__(self):
        #this len will decide the index range in getitem
        return len(self.targets)
    
    def timeseries(self, data, window, future):
        temp = []
        targ = []
        
        for i in range(len(data)-window):
            temp.append(data[i:i+window])
            
        for i in range(len(data)-window -future):
            targ.append(data[i+future:i+window+future])

        return np.array(temp), targ
    
    def __getitem__(self, index):
        x = torch.tensor(self.data[index]).type(torch.Tensor)
        x = torch.cat((torch.tensor([self.vocab['SOS']]), x, torch.tensor([self.vocab['EOS']]))).type(torch.LongTensor)
        
        y = torch.tensor(self.targets[index]).type(torch.Tensor)
        y = torch.cat((torch.tensor([self.vocab['SOS']]), y, torch.tensor([self.vocab['EOS']]))).type(torch.LongTensor)
        
        return x,y
    
dataset = CustomDataset(seq_len=48, future=5, max_len=1000)

In [None]:
for x, y in dataset:
    print(x)
    print(y)
    break

In [None]:
dataloader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)
#collate_fn=custom_collector

In [None]:
for x, y in dataloader:
    print(x.shape)
    print(y.shape)
    break

<a id="3.1"></a>
<div style="background-color: #f2f2f2; padding: 20px; border-radius: 10px; color: #333; font-family: Arial, sans-serif;">
    <h3 style="color:rgb(172, 28, 44);">The Power of Masking and Efficiency in Transformers</h3>
    <p>One of the remarkable features of Transformers is their use of masking during the training process. Masking prevents the decoder from seeing future tokens in the target sequence during training, so that each prediction depends only on the tokens that come before it.</p>
    <p>In a task such as language translation, where the whole target sequence is fed into the decoder at once during training, it's crucial that the prediction for each word doesn't rely on words that come after it in the sequence. This is achieved by applying a mask to the decoder's self-attention (the <code>tgt_mask</code> argument of <code>nn.Transformer</code>) that hides future words from the model during the training phase.</p>
    <p>Not only does masking maintain the sequential integrity of the language, but it also allows Transformers to train more efficiently than their RNN counterparts, like LSTM. Unlike RNNs, which process sequences step-by-step and thus require longer training times for long sequences, Transformers can process all the tokens in the sequence simultaneously, thanks to their attention mechanism. This parallel processing significantly speeds up the training process and allows the model to handle longer sequences more effectively.</p>
    <p>Thus, through the use of masking and their unique architecture, Transformers manage to overcome some of the limitations of traditional RNNs, offering a more efficient and effective approach to sequence-based tasks in machine learning and artificial intelligence.</p>
</div>
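The effect of such a mask is easy to verify numerically. In the sketch below, an additive mask with `-inf` above the diagonal is added to random attention scores before the softmax; the sizes are arbitrary and the scores are random placeholders.

```python
import torch

torch.manual_seed(0)
size = 4
scores = torch.randn(size, size)   # placeholder attention scores

# -inf strictly above the diagonal, 0 elsewhere
mask = torch.triu(torch.full((size, size), float('-inf')), diagonal=1)

weights = torch.softmax(scores + mask, dim=-1)
print(weights)

# Position i puts zero weight on every position j > i
future_weight = torch.triu(weights, diagonal=1).sum().item()
print(future_weight)
```

Each row still sums to 1, but the probability mass is spread only over the current and earlier positions.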


In [None]:
class Transformer(nn.Module):
    def __init__(self, num_tokens, dim_model, num_heads, num_layers, input_seq):
        super().__init__()
        self.input_seq = input_seq
        self.num_layers = num_layers
        self.embedding = nn.Embedding(num_tokens, dim_model)
        self.transformer = nn.Transformer(d_model=dim_model, nhead=num_heads,  
                                          num_encoder_layers=3, num_decoder_layers=3, 
                                          dim_feedforward=256, batch_first=True)
        
        self.fc = nn.Linear(dim_model, num_tokens)

    def forward(self, src, tgt, tf=True):
        mask = self.get_mask(tgt.shape[1], teacher_force=tf)
        src = self.embedding(src) 
        tgt = self.embedding(tgt)
        
        out = self.transformer(src, tgt, tgt_mask=mask)
        # logits over the vocabulary: (batch, tgt_len, num_tokens)
        feed = self.fc(out)
        
        return feed
            
            
    def get_mask(self, size, teacher_force=True):
        if teacher_force:
            mask = torch.tril(torch.ones(size, size) == 1) # Lower triangular matrix
            mask = mask.float()
            mask = mask.masked_fill(mask == 0, float('-inf')) # Convert zeros to -inf
            mask = mask.masked_fill(mask == 1, float(0.0)) # Convert ones to 0

            # EX for size=5:
            # [[0., -inf, -inf, -inf, -inf],
            #  [0.,   0., -inf, -inf, -inf],
            #  [0.,   0.,   0., -inf, -inf],
            #  [0.,   0.,   0.,   0., -inf],
            #  [0.,   0.,   0.,   0.,   0.]]

            return mask
        else:
            # no masking: an additive mask of zeros hides nothing, so every
            # target position can attend to every other position
            mask = torch.zeros(size, size)

            return mask

In [None]:
model = Transformer(num_tokens=1000+3, dim_model=32, num_heads=2, num_layers=2, input_seq=50)

In [None]:
x.shape, y.shape

In [None]:
model(x, y).shape 

In [None]:
t = model(x,y)
t.shape

In [None]:
t.permute(0,2,1).shape

In [None]:
loss_function = nn.CrossEntropyLoss()
learning_rate = 1e-3
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

In [None]:
for e in tqdm(range(25)):
    i = 0
    avg_loss = []
    for x,y in dataloader:
        optimizer.zero_grad()
        
        # shift the target by one position, as in language modeling
        y_input = y[:, :-1]         # decoder input: SOS ... last value (EOS dropped)
        y_expected = y[:, 1:]       # expected output: first value ... EOS (SOS dropped)
        # at each position the decoder sees the tokens so far and must predict the next one
        
        # forward
        predictions = model(x, y_input)
        pred = predictions.permute(0, 2, 1)
        
        # loss
        loss = loss_function(pred, y_expected)
        
        # backward
        loss.backward()

        # optimization
        optimizer.step()
        avg_loss.append(loss.detach().numpy())

        i+=1
        
    if e%5==0:
        avg_loss = np.array(avg_loss)
        print(avg_loss.mean())

In [None]:
torch.squeeze(predictions.topk(1).indices, 2)

In [None]:
y_expected 

In [None]:
torch.argmax(pred, dim=1)

<a id="3.2"></a>
<div style="background-color: #f2f2f2; padding: 20px; border-radius: 10px; color: #333; font-family: Arial, sans-serif;">
    <h3 style="color:rgb(172, 28, 44);">The Role of SOS and EOS Tokens in Transformers</h3>
    <p>In the domain of natural language processing, particularly when working with Transformer models, special tokens like Start of Sentence (SOS) and End of Sentence (EOS) play a crucial role. These tokens provide valuable cues about the boundaries of sentences, facilitating the model's understanding of language structure.</p>
    <p>The SOS token is added at the beginning of each sentence, marking its start. Similarly, the EOS token is appended at the end of each sentence to indicate its conclusion. These tokens serve as consistent markers that help the model identify and process sentences as distinct units within larger bodies of text.</p>
    <p>Furthermore, in the context of sequence generation tasks, these tokens play an essential role in determining when to begin and end the generation process. For example, during text generation, an EOS token indicates to the model that it should stop generating further tokens.</p>
    <p>Therefore, SOS and EOS tokens are more than just markers; they're integral components in the design and functioning of Transformer models, contributing significantly to their ability to effectively understand and generate human language.</p>
</div>
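A small example shows how the two tokens are used during training. The token ids follow this notebook's vocabulary (SOS is 1001, EOS is 1002); the four values in between are an arbitrary toy sequence.

```python
import torch

SOS, EOS = 1001, 1002
y = torch.tensor([[SOS, 5, 6, 7, 8, EOS]])   # (batch=1, length=6)

# Decoder input: everything except the final EOS token
y_input = y[:, :-1]
# Expected output: the same sequence shifted left by one position
y_expected = y[:, 1:]

print(y_input.tolist())
print(y_expected.tolist())
```

The first decoder input is SOS and its expected output is the first real value; the last expected output is EOS, which is how the model learns when to stop.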


In [None]:
def predict(model, input_sequence, max_length=50, SOS_token=1000+1, EOS_token=1000+2):
    model.eval()
    
    # wrap the source sequence in SOS/EOS and add a batch dimension
    input_sequence = torch.as_tensor(input_sequence)
    input_sequence = torch.cat((torch.tensor([SOS_token]), input_sequence, torch.tensor([EOS_token]))).type(torch.LongTensor) 
    input_sequence = torch.unsqueeze(input_sequence,0)
    
    # the decoder starts from a single SOS token
    y_input = torch.tensor([[SOS_token]], dtype=torch.long)

    # gradients are not needed at inference time
    with torch.no_grad():
        for _ in range(max_length):
            # the causal mask is built inside model.forward
            predictions = model(input_sequence, y_input)
            
            # greedy decoding: take the most likely token at the last position
            next_item = predictions[:, -1, :].argmax(dim=-1, keepdim=True)
            y_input = torch.cat((y_input, next_item), dim=1)
            
            # stop as soon as the model emits EOS
            if next_item.item() == EOS_token:
                break

    return y_input.view(-1).tolist()

In [None]:
d = random.randint(0,900)
t = torch.tensor(np.arange(d,d+48)).type(torch.Tensor)
input_sequence = t
print(t)

In [None]:
r=predict(model, input_sequence)
print(r)

In [None]:
fig = plt.figure(figsize=(16,4))

plt_x = np.arange(0,t.shape[0])
plt_y = t

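plt_yp = r[1:-1]   # drop the leading SOS token and the final token (EOS when generation stopped normally)
plt_xp = np.arange(5, 5+len(plt_yp))   # predictions are shifted 5 steps ahead of the input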

plt.scatter(plt_x, plt_y, s=14, color='r', label="real")
plt.scatter(plt_xp, plt_yp, s=7, color='b', label="predicted")
    
plt.legend()
plt.show()