## Summarization task - trained on Amazon Fine Food Review dataset

Approach:
1. Load and pre-process dataset:

   Read the dataset using pandas, extract necessary columns, and handle missing values. Split the dataset into training, validation, and test sets.
2. Tokenize and build vocabulary:
    
    Use spaCy for tokenization and torchtext utilities to build vocabularies for the text and summary fields.
3. Create custom Dataset and Dataloader:
    
    Implement a custom dataset class to handle tokenization, vocabulary indexing, and padding. Use PyTorch's DataLoader to create iterable data loaders for training, validation, and testing.
4. Define Model Architecture:
    
    Define the Encoder, Attention, Decoder, and Seq2Seq modules, then the training loop, loss function, backpropagation, and optimization steps. Implement evaluation functions.
5. Evaluate the Model:
    
    Use metrics such as ROUGE.

## Prepare the dataset for training

We will write custom code to handle tokenization, vocabulary building, and data loading:

1. Load and preprocess the dataset
2. Tokenize and build vocabulary
3. Create a custom dataset and data loader

### Step 1: Load and Preprocess the Dataset

In [1]:
!pip install -q tqdm

In [2]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Load the dataset
df = pd.read_csv('/kaggle/input/amazon-fine-food-reviews/Reviews.csv')

# Display the first few rows
print(df.head())


   Id   ProductId          UserId                      ProfileName  \
0   1  B001E4KFG0  A3SGXH7AUHU8GW                       delmartian   
1   2  B00813GRG4  A1D87F6ZCVE5NK                           dll pa   
2   3  B000LQOCH0   ABXLMWJIXXAIN  Natalia Corres "Natalia Corres"   
3   4  B000UA0QIQ  A395BORC6FGVXV                             Karl   
4   5  B006K2ZZ7K  A1UQRSCLF8GW1T    Michael D. Bigham "M. Wassir"   

   HelpfulnessNumerator  HelpfulnessDenominator  Score        Time  \
0                     1                       1      5  1303862400   
1                     0                       0      1  1346976000   
2                     1                       1      4  1219017600   
3                     3                       3      2  1307923200   
4                     0                       0      5  1350777600   

                 Summary                                               Text  
0  Good Quality Dog Food  I have bought several of the Vitality canned d...  
1 

In [3]:
# Display the number of rows in the dataset
print(f"Total samples before dropping missing values: {len(df)}")

# Extract necessary columns and drop missing values
df = df[['Text', 'Summary']].dropna()

print(f"Total samples after dropping missing values: {len(df)}")

Total samples before dropping missing values: 568454
Total samples after dropping missing values: 568427


In [4]:
from sklearn.model_selection import train_test_split
# Split the dataset 80/10/10 in two stages
# Stage 1: 80% training, 20% held out
train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42)
# Stage 2: split the held-out 20% evenly into validation (10%) and test (10%)
valid_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(valid_df)}")
print(f"Test samples: {len(test_df)}")

Training samples: 454741
Validation samples: 56843
Test samples: 56843
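
As a quick sanity check on the counts above, the sketch below reproduces the split arithmetic in plain Python. It assumes the held-out part is rounded up and the remainder goes to the other part, which matches the printed sizes; `split_sizes` and its parameter names are hypothetical helpers, not part of the notebook or of scikit-learn.

```python
import math

def split_sizes(n_samples, holdout_frac=0.2, test_frac_of_holdout=0.5):
    """Return (train, valid, test) sizes for a two-stage split.

    Assumption: the held-out part is rounded up (ceil) and the remainder
    goes to the other part, which is how the counts printed above work out.
    """
    n_holdout = math.ceil(holdout_frac * n_samples)
    n_train = n_samples - n_holdout
    n_test = math.ceil(test_frac_of_holdout * n_holdout)
    n_valid = n_holdout - n_test
    return n_train, n_valid, n_test

print(split_sizes(568427))  # (454741, 56843, 56843)
```

The three sizes always add up to the number of input rows, so no review is dropped or duplicated by the split.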


### Step 2: Tokenize and Build Vocabulary

We are using ```spaCy``` for tokenizing the text and summary fields.

In [5]:
import spacy # used for tokenizing text
import torch # we will use neural network from torch

# torchtext: for processing text data and creating dataset and iterators
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

from tqdm import tqdm

# Tokenize using spacy
spacy_en = spacy.load('en_core_web_sm')
tokenizer = get_tokenizer('spacy', language='en_core_web_sm')

def yield_tokens(data_iter, text_field):
    for text in tqdm(data_iter[text_field], desc=f"Tokenizing {text_field}"):
        yield tokenizer(text)
        
# Build vocab for text and summary fields
print("Building vocabulary for text...")
text_vocab = build_vocab_from_iterator(yield_tokens(train_df, 'Text'), specials=["<unk>", "<pad>", "<bos>", "<eos>"])
print("Building vocabulary for summary...")
summary_vocab = build_vocab_from_iterator(yield_tokens(train_df, 'Summary'), specials=["<unk>", "<pad>", "<bos>", "<eos>"])


# Set default index for unknown tokens
text_vocab.set_default_index(text_vocab["<unk>"])
summary_vocab.set_default_index(summary_vocab["<unk>"])

print(f"Text vocab size: {len(text_vocab)}")
print(f"Summary vocab size: {len(summary_vocab)}")

Building vocabulary for text...


Tokenizing Text: 100%|██████████| 454741/454741 [03:25<00:00, 2217.76it/s]


Building vocabulary for summary...


Tokenizing Summary: 100%|██████████| 454741/454741 [00:17<00:00, 26340.46it/s]


Text vocab size: 215083
Summary vocab size: 48602
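
To make the role of the special tokens and the default index concrete, here is a minimal plain-Python stand-in for a vocabulary. It is a sketch, not torchtext's implementation: `build_toy_vocab`, `lookup`, and the tie-breaking order are assumptions made for illustration.

```python
from collections import Counter

SPECIALS = ["<unk>", "<pad>", "<bos>", "<eos>"]

def build_toy_vocab(token_lists, specials=SPECIALS):
    """Map tokens to integer ids: specials first, then by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by (-count, token) so the ordering is deterministic.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + [tok for tok in ordered if tok not in specials]
    return {tok: idx for idx, tok in enumerate(itos)}

def lookup(vocab, token):
    # Mirrors the effect of set_default_index(vocab["<unk>"]):
    # any token not seen while building the vocabulary maps to <unk>.
    return vocab.get(token, vocab["<unk>"])

toy_corpus = [["good", "dog", "food"], ["good", "coffee"]]
vocab = build_toy_vocab(toy_corpus)
print([lookup(vocab, t) for t in ["good", "coffee", "tea"]])  # [4, 5, 0]
```

Because the vocabularies are built from `train_df` only, tokens that appear only in the validation or test splits take the `<unk>` path shown for `"tea"`.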


### Step 3: Create custom Dataset and DataLoader

Create a custom dataset class ```TextSummaryDataset``` that tokenizes input text and summaries, converts tokens to indices using the vocabularies, and wraps each sequence in ```<bos>``` and ```<eos>``` markers. Padding is not done per sample; it is done per batch in ```collate_batch```.

```collate_batch``` is the function the DataLoader uses to merge a list of samples into a single batch. It pads every sequence to the length of the longest sequence in that batch, so the batch can be stacked into one tensor, which is required for efficient computation on GPUs. Because ```pad_sequence``` is called with its default ```batch_first=False```, the resulting tensors have shape ```[max_len, batch_size]```.

In [6]:
from torch.utils.data import Dataset, DataLoader

class TextSummaryDataset(Dataset):
    def __init__(self, df, text_vocab, summary_vocab, tokenizer):
        self.df = df
        self.text_vocab = text_vocab
        self.summary_vocab = summary_vocab
        self.tokenizer = tokenizer
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        text = self.df.iloc[idx]['Text']
        summary = self.df.iloc[idx]['Summary']
        text_tokens = [self.text_vocab["<bos>"]] + [self.text_vocab[token] for token in self.tokenizer(text)] + [self.text_vocab["<eos>"]]
        summary_tokens = [self.summary_vocab["<bos>"]] + [self.summary_vocab[token] for token in self.tokenizer(summary)] + [self.summary_vocab["<eos>"]]
        return torch.tensor(text_tokens), torch.tensor(summary_tokens)

train_dataset = TextSummaryDataset(train_df, text_vocab, summary_vocab, tokenizer)
valid_dataset = TextSummaryDataset(valid_df, text_vocab, summary_vocab, tokenizer)
test_dataset = TextSummaryDataset(test_df, text_vocab, summary_vocab, tokenizer)

BATCH_SIZE = 32

def collate_batch(batch):
    # Each sample is already a pair of int64 tensors (see __getitem__),
    # so we only group them here; re-wrapping them in torch.tensor() would
    # copy the data and raise a UserWarning.
    text_list, summary_list = [], []
    for (_text, _summary) in batch:
        text_list.append(_text)
        summary_list.append(_summary)
    # Pad to the longest sequence in the batch; output shape is [max_len, batch_size]
    text_batch = torch.nn.utils.rnn.pad_sequence(text_list, padding_value=text_vocab["<pad>"])
    summary_batch = torch.nn.utils.rnn.pad_sequence(summary_list, padding_value=summary_vocab["<pad>"])
    return text_batch, summary_batch

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
valid_loader = DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)
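
The layout that padding produces can be illustrated without PyTorch. The sketch below is a plain-Python stand-in (the hypothetical `pad_time_major` is not the notebook's code) that pads two token-id lists and arranges them time-major, i.e. one row per time step and one column per sample, which is the layout ```pad_sequence``` returns by default.

```python
def pad_time_major(sequences, pad_id):
    """Pad lists of token ids to a common length, returned as [max_len][batch]."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # Transpose from [batch][max_len] to [max_len][batch] (time-major layout).
    return [list(step) for step in zip(*padded)]

PAD, BOS, EOS = 1, 2, 3  # ids of the special tokens, in the order used above
batch = [[BOS, 7, 8, 9, EOS], [BOS, 5, EOS]]
for row in pad_time_major(batch, PAD):
    print(row)
# [2, 2]
# [7, 5]
# [8, 3]
# [9, 1]
# [3, 1]
```

The shorter sample is filled with the `<pad>` id after its `<eos>` token, so downstream code needs either the true lengths or a padding mask to ignore those positions.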

### Step 4: Define the Model Architecture


| Component | Role | Functionality |
|-----------|------|---------------|
| **Encoder** | Captures the context of the input sequence. | - Embeds the input tokens into dense vectors. <br> - Processes embeddings through a bidirectional GRU to capture forward and backward dependencies. <br> - Combines hidden states from both directions and projects them into the decoder's hidden state space. <br> - Returns the sequence of outputs from the GRU and the final hidden state. |
| **Attention** | Helps the Decoder focus on different parts of the input sequence when generating each word of the summary. | - Calculates alignment scores between the current decoder hidden state and each encoder output. <br> - Normalizes alignment scores to obtain attention weights. <br> - Computes a weighted sum of the encoder outputs based on attention weights, producing a context vector that emphasizes relevant parts of the input. |
| **Decoder** | Generates the output summary one token at a time, using the context provided by the attention mechanism. | - Embeds the previous output token (or the start token for the first step). <br> - Uses the context vector from the attention mechanism along with the embedded token to update its hidden state via a GRU. <br> - Projects the hidden state to the output vocabulary space to predict the next token. <br> - Combines the context vector, hidden state, and embedded token to produce the final output token probabilities. |
| **Seq2Seq** | Orchestrates the overall encoding and decoding process. | - Initializes the Encoder and Decoder. <br> - Passes the input sequence through the Encoder to obtain encoder outputs and the final hidden state. <br> - Iteratively uses the Decoder to generate each token of the summary, applying the attention mechanism at each step. <br> - Handles teacher forcing during training and autoregressive generation during inference. |

```
Input Sequence --> [Encoder] --> Encoder Outputs + Final Hidden State
                                      |
                                      v
                               [Attention]
                                      |
                                      v
Previous Token + Context Vector --> [Decoder] --> Next Token
                                      |
                                      v
                            (Repeat for next token)
```
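
The three attention steps in the table (score, normalize, weighted sum) can be sketched for a single decoding step in plain Python. This is an illustration under a loud assumption: alignment scores are plain dot products here, whereas the notebook's Attention module may use a different, learned scoring function. The function name `attention` and the toy vectors are hypothetical.

```python
import math

def attention(decoder_hidden, encoder_outputs):
    """Return (weights, context) for one decoding step.

    Assumption: alignment scores are plain dot products; the notebook's
    Attention module may use a different (e.g. learned) scoring function.
    """
    scores = [sum(h * e for h, e in zip(decoder_hidden, enc)) for enc in encoder_outputs]
    # Softmax (shifted by the max score for numerical stability).
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # Context vector: attention-weighted sum of the encoder outputs.
    dim = len(encoder_outputs[0])
    context = [sum(w * enc[d] for w, enc in zip(weights, encoder_outputs)) for d in range(dim)]
    return weights, context

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three source positions
decoder_hidden = [2.0, 0.0]
weights, context = attention(decoder_hidden, encoder_outputs)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

Source positions whose encoder output aligns with the decoder state (the first and third here) receive most of the weight, and the weights always sum to 1.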



### Embedding Layer
The embedding layer (```nn.Embedding```) is a lookup table that maps each token index to a dense vector (embedding). This layer is trained to produce meaningful representations of tokens in a continuous vector space.
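
The "lookup table" view can be shown with a plain-Python stand-in: a list with one randomly initialised row per token id, where embedding a sequence is just row indexing. `make_embedding_table` and `embed` are hypothetical names used only for this sketch; in the real layer the rows are parameters updated during training.

```python
import random

def make_embedding_table(vocab_size, emb_dim, seed=0):
    """One row of emb_dim floats per token id, randomly initialised."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(emb_dim)] for _ in range(vocab_size)]

def embed(table, token_ids):
    # The "forward pass" of an embedding layer is just row indexing.
    return [table[idx] for idx in token_ids]

table = make_embedding_table(vocab_size=10, emb_dim=4)
vectors = embed(table, [2, 7, 7, 3])
print(len(vectors), len(vectors[0]))  # 4 4
```

Repeated token ids (the two 7s) map to the same row, so every occurrence of a token shares one learned vector.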


In [7]:
import torch.nn as nn

"""
    Encoder class is responsible for processing the input sequence 
    and capturing its contextual information through a sequence of operations
"""

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        """
            input_dim   : The size of the input vocab
            emb_dim     : The dimensionality of the embedding vectors
            enc_hid_dim : The dimensionality of the encoder hidden states
            dec_hid_dim : The dimensionality of the decoder hidden states
            dropout     : The dropout rate to be applied to the embeddings
        """
        super().__init__()
        
        # Embedding layer converts the input tokens into dense vectors of dim "emb_dim"
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        # (bi-directional) GRU layer that processes the embedded input sequence
        """
            emb_dim     : The dim of the input embeddings to the GRU
            enc_hid_dim : The dim of the GRU hidden states
        """
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True)
        
        """
            Fully connected layer - a linear layer that maps the concatenated hidden states 
            from the bidirectional GRU to the dim required by the decoder

            enc_hid_dim * 2 : the concatenated size of the forward and backward hidden states
                              from the bidirectional GRU.
            dec_hid_dim     : the size required by the decoder hidden state.
        """
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        
        # Dropout layer applies dropout to the embeddings to prevent overfitting during training
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src, src_len):
        # src     : [src_len, batch_size] padded token indices
        # src_len : [batch_size] true (unpadded) length of each sequence

        # Embedding and Dropout:
        # The input sequence src is first passed through the embedding layer to get dense vector representations.
        # Dropout is then applied to these embeddings.
        embedded = self.dropout(self.embedding(src))

        # Packing the Sequence:
        # The embedded sequences are packed into a PackedSequence object, which lets the GRU handle
        # variable-length sequences efficiently. The lengths are moved to the CPU, as the packing function requires.
        # enforce_sorted=False is needed because the batches are not sorted by length.
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, src_len.to('cpu'), enforce_sorted=False)

        # GRU Processing:
        # The packed sequences are passed through the bidirectional GRU,
        # giving the hidden states for each time step in the packed sequence
        # and the final hidden state for each direction of the GRU.
        packed_outputs, hidden = self.rnn(packed_embedded)

        # Unpacking the Sequence:
        # The packed output is unpacked back into a padded tensor.
        # outputs: the hidden states at each time step, padded to the longest sequence.
        outputs, _ = nn.utils.rnn.pad_packed_sequence(packed_outputs)

        # Concatenating hidden states:
        # The final hidden states from the forward (hidden[-2]) and backward (hidden[-1]) GRU
        # are concatenated and passed through a fully connected layer and a tanh function
        # to create the initial state of the decoder.
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)))
        return outputs, hidden
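
To see why packing helps, the plain-Python sketch below counts how many sequences are still "active" at each time step of a padded batch. This mirrors the idea behind a packed sequence's per-step batch sizes; it is an illustration with a hypothetical helper, not PyTorch's implementation.

```python
def packed_batch_sizes(lengths):
    """For each time step, count how many sequences are still active."""
    return [sum(1 for n in lengths if n > t) for t in range(max(lengths))]

lengths = [5, 3, 2]           # true lengths of three padded sequences
sizes = packed_batch_sizes(lengths)
print(sizes)                  # [3, 3, 2, 1, 1]
print(sum(sizes), "steps instead of", max(lengths) * len(lengths))  # 10 steps instead of 15
```

With packing, the GRU only processes the 10 real tokens rather than all 15 padded positions, and the final hidden state of each sequence corresponds to its last real token rather than to padding.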