# Project Overview

This project focuses on **text summarization** using two approaches: a traditional **Seq2Seq model** with LSTM/GRU and a **Transformer-based model**. The goal is to see how each model performs and understand the difference between step-by-step sequence processing and attention-based processing.

### Steps in the Project
1. **Dataset Preparation**  
   - Load the XSum dataset with articles and summaries.  
   - Tokenize and pad sequences so they can be fed into the models.

2. **Seq2Seq Model (LSTM/GRU)**  
   - Build an encoder-decoder model.  
   - Train it to generate summaries from the input articles.  
   - Use attention to help the model focus on relevant parts of the input.

3. **Transformer Model**  
   - Build a Transformer-based encoder-decoder model.  
   - Use self-attention to capture relationships between all tokens.  
   - Train on the same dataset to generate summaries.

4. **Comparison**  
   - Compare the two models using metrics like ROUGE.  
   - Look at differences in summary quality, speed, and how well they handle long sequences.
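
To make the ROUGE comparison concrete, here is a minimal sketch of ROUGE-N computed from clipped n-gram overlap. It assumes plain whitespace tokenization and lowercasing; the function name `rouge_n` is hypothetical, and an established ROUGE package would normally be used for reported results.

```python
from collections import Counter

def rouge_n(reference: str, candidate: str, n: int = 1) -> dict:
    """Return ROUGE-N precision, recall and F1 for whitespace-tokenized text."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    # Counter intersection keeps the minimum count per n-gram (clipped matches).
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge_n("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

Recall measures how much of the reference summary is covered, while precision penalizes extra words in the generated summary.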

# Seq2Seq and Encoder-Decoder

## What is a Seq2Seq Model
A sequence-to-sequence (Seq2Seq) model is designed to take an input sequence and produce an output sequence. It’s widely used in tasks like machine translation, text summarization, and chatbots.

**Example:**  
Input: "Hello, how are you?"  
Output: "Ciao, come stai?"

## Encoder-Decoder Architecture
A typical Seq2Seq model has two main parts:

### Encoder
The encoder processes the input sequence and compresses it into a single context vector or hidden state. This vector is meant to summarize the important information from the input.  

### Decoder
The decoder takes the context vector from the encoder and generates the output sequence one step at a time.  
During training, it often uses teacher forcing, meaning it receives the correct previous token rather than its own prediction.  

In classic Seq2Seq models, the encoder and decoder are recurrent networks: vanilla RNNs, LSTMs, or GRUs.

## How It Works
1. The encoder reads the input sequence and outputs the final hidden state.  
2. The decoder starts from this hidden state and generates the output sequence token by token.  
3. During training, the decoder outputs a probability distribution over the vocabulary at each step, and the loss (e.g., cross-entropy) measures how much probability it assigns to the true token.
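
The three steps above can be sketched in plain NumPy. This is an illustration only: it assumes a vanilla RNN cell instead of an LSTM/GRU, randomly initialized (untrained) weights, and made-up token ids where 1 marks the start and 2 the end of the target sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 12, 8, 16

# Randomly initialized parameters (untrained; for illustration only).
embedding = rng.normal(0, 0.1, (vocab_size, embed_dim))
W_enc = rng.normal(0, 0.1, (embed_dim + hidden_dim, hidden_dim))
W_dec = rng.normal(0, 0.1, (embed_dim + hidden_dim, hidden_dim))
W_out = rng.normal(0, 0.1, (hidden_dim, vocab_size))

def rnn_step(token_id, hidden, weights):
    """One vanilla RNN step: combine the token embedding with the previous state."""
    x = np.concatenate([embedding[token_id], hidden])
    return np.tanh(x @ weights)

def encode(source_ids):
    """Step 1: read the input and return the final hidden state."""
    hidden = np.zeros(hidden_dim)
    for token_id in source_ids:
        hidden = rnn_step(token_id, hidden, W_enc)
    return hidden

def decode_with_teacher_forcing(hidden, decoder_inputs):
    """Step 2: produce one vocabulary distribution per target position."""
    probs = []
    for token_id in decoder_inputs:  # ground-truth previous tokens
        hidden = rnn_step(token_id, hidden, W_dec)
        logits = hidden @ W_out
        exp = np.exp(logits - logits.max())  # numerically stable softmax
        probs.append(exp / exp.sum())
    return np.stack(probs)

def cross_entropy(probs, targets):
    """Step 3: mean negative log-probability assigned to the true tokens."""
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.log(picked).mean())

source = [4, 7, 9, 5]
target = [1, 6, 8, 2]  # assumed ids: 1 = start token, 2 = end token
probs = decode_with_teacher_forcing(encode(source), target[:-1])
loss = cross_entropy(probs, np.array(target[1:]))
print(probs.shape, loss)
```

With untrained weights the predicted distributions are close to uniform, so the loss should land near `log(vocab_size)`.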

## Challenges and Improvements
- A common issue is the fixed-size hidden state: with long input sequences, it acts as a bottleneck and information from early tokens tends to be lost.  
- Attention mechanisms help by letting the decoder look at all encoder outputs instead of just the final hidden state, improving performance on tasks like translation and summarization.
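
As a rough sketch of the idea, the snippet below computes attention for a single decoder step using plain dot-product scores. The scoring function is an assumption (other variants use a small learned network), and the toy vectors are invented for illustration.

```python
import numpy as np

def attend(decoder_state, encoder_outputs):
    """Return (context, weights) for one decoder step using dot-product scores."""
    scores = encoder_outputs @ decoder_state   # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over source positions
    context = weights @ encoder_outputs        # weighted sum of encoder outputs
    return context, weights

# Three source positions with 2-dimensional encoder outputs (toy values).
encoder_outputs = np.array([[1.0, 0.0],
                            [0.0, 1.0],
                            [1.0, 1.0]])
decoder_state = np.array([2.0, 0.0])
context, weights = attend(decoder_state, encoder_outputs)
print(weights, context)
```

The context vector is recomputed at every decoder step, so the decoder is no longer limited to the encoder's final hidden state.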

# Transformers
Transformers can be seen as an evolution of Seq2Seq models: they replace step-by-step LSTM/GRU processing with attention computed over all positions in parallel, which makes long-range dependencies easier to model. They rely entirely on **attention mechanisms**, with no recurrence, to relate all tokens in the input at once.

### Key Components
- **Self-Attention:** Allows the model to weigh the importance of each token in the sequence relative to the others. This helps capture long-range dependencies better than RNNs.
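- **Encoder-Decoder Structure:** Like Seq2Seq models, Transformers have an encoder that processes the input and a decoder that generates the output. Both stack layers of self-attention and feed-forward networks; the decoder additionally masks its self-attention so a position cannot see later tokens, and it attends to the encoder outputs through cross-attention.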
- **Positional Encoding:** Since Transformers don’t process tokens sequentially, they add positional information so the model knows the order of tokens.
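
The following sketch combines two of these components: sinusoidal positional encoding and single-head scaled dot-product self-attention. It assumes an even `d_model`, random projection matrices in place of learned ones, and no masking or multi-head splitting.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    positions = np.arange(seq_len)[:, None]
    dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / np.power(10000.0, dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])        # every token scores every token
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
token_embeddings = rng.normal(size=(seq_len, d_model))
x = token_embeddings + positional_encoding(seq_len, d_model)
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)
```

Each row of `weights` shows how strongly one token attends to every token in the sequence, and all rows are computed in a single matrix product rather than step by step.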

### Advantages over LSTM/GRU Seq2Seq
- Can process sequences **in parallel**, speeding up training.
- Handle **long-range dependencies** more directly: any two tokens are linked by a single attention step instead of many recurrent steps.
- Easier to scale to large datasets and very deep models.

### Use Cases
Transformers are the backbone of many state-of-the-art models for tasks such as:
- Machine translation (e.g., T5, MarianMT)
- Text summarization (e.g., BART, Pegasus)
- Question answering and chatbots (e.g., GPT, BERT-based models)

In [1]:
%%capture
!pip install -q datasets

In [2]:
from datasets import load_dataset
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [3]:
# load_dataset("xsum") downloads and loads the XSum dataset using the Hugging Face datasets library.
# Each split is a Hugging Face `Dataset` object, similar to a DataFrame, with columns like "document" and "summary".

dataset = load_dataset("xsum", trust_remote_code=True)
train_data = dataset['train']
val_data = dataset['validation']
test_data = dataset['test']

Generating train split: 204045 examples
Generating validation split: 11332 examples
Generating test split: 11334 examples

In [4]:
train_data

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 204045
})

In [5]:
# The Tokenizer converts raw text into sequences of integers that can be processed by a neural network.
# By default it lowercases the text, strips punctuation, and splits on whitespace.
# Each unique word is assigned a unique integer index, starting at 1 and ordered by frequency (index 0 is left free for padding).
# When we call texts_to_sequences(), each word in a sentence is replaced by its corresponding index.
# Words not seen during fitting are dropped, because no oov_token is set here.
# This allows the model to work with numbers instead of raw text, which is required for embeddings and LSTM layers.
# Padding is applied afterwards so that all sequences have the same length and can be processed in batches.

doc_tokenizer = Tokenizer()
doc_tokenizer.fit_on_texts([d['document'] for d in train_data])

summary_tokenizer = Tokenizer()
summary_tokenizer.fit_on_texts([d['summary'] for d in train_data])
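
To show what the word-to-index mapping looks like, here is a small pure-Python sketch that mimics the behaviour described in the comments. The helpers `fit_word_index` and `to_sequences` are hypothetical stand-ins, not the Keras implementation, and they only lowercase and split on whitespace.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, most frequent first; 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace each known word with its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

corpus = ["the cat sat", "the dog sat down"]
word_index = fit_word_index(corpus)
sequences = to_sequences(["the dog ran"], word_index)
print(word_index)
print(sequences)
```

The word "ran" never appeared in the fitting corpus, so it silently disappears from the output sequence.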

In [6]:
# pad_sequences gives every sequence the same length by truncating longer
# sequences and padding shorter ones with a fill value (0 by default).
# Uniform length is needed so the sequences can be stacked into batch tensors.
# Keras truncates from the front by default (truncating='pre'), which would drop
# the start of each article and summary, so truncating='post' is set explicitly
# to keep the first maxlen tokens instead.

max_doc_len = 400
max_summary_len = 50

X_train = pad_sequences(doc_tokenizer.texts_to_sequences([d['document'] for d in train_data]), maxlen=max_doc_len, padding='post', truncating='post')
y_train = pad_sequences(summary_tokenizer.texts_to_sequences([d['summary'] for d in train_data]), maxlen=max_summary_len, padding='post', truncating='post')
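
The effect of the `padding` and `truncating` options can be seen on a toy sequence. The helper below is a hypothetical pure-Python re-implementation for a single list; its defaults mirror the Keras defaults (`'pre'` for both).

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Pad or cut one sequence to exactly `maxlen` items."""
    if len(seq) > maxlen:
        # 'post' keeps the first maxlen items, 'pre' keeps the last maxlen items.
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    pad = [value] * (maxlen - len(seq))
    return seq + pad if padding == "post" else pad + seq

short_seq, long_seq = [5, 6], [1, 2, 3, 4, 5, 6]
print(pad_or_truncate(short_seq, 4))                     # zeros in front
print(pad_or_truncate(short_seq, 4, padding="post"))     # zeros at the end
print(pad_or_truncate(long_seq, 4))                      # keeps the last 4 tokens
print(pad_or_truncate(long_seq, 4, truncating="post"))   # keeps the first 4 tokens
```

For news articles, where the opening sentences usually carry the key facts, which end gets truncated can matter as much as the chosen `maxlen`.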


In [7]:
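# In seq2seq models, the decoder predicts the next token in the target sequence given the previous tokens.
#
# y_train_input = y_train[:, :-1] -> all tokens of the target sequence except the last one.
#    These are fed to the decoder during training (teacher forcing).
#
# y_train_output = y_train[:, 1:] -> all tokens of the target sequence except the first one.
#    These are the tokens the decoder is trained to predict, one position ahead of its input.
#
# Note: this shift assumes each summary begins with a start-of-sequence token and ends with an
# end-of-sequence token. No such markers are added in the cells above, so without them the
# decoder has no defined first input at inference time and no signal for when to stop.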

y_train_input = y_train[:, :-1]
y_train_output = y_train[:, 1:]
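
A minimal sketch of how the shifted pairs line up once start and end markers are present is shown below. The ids `START = 1` and `END = 2` and the helper `make_decoder_pairs` are assumptions for illustration; in practice the markers would be added to the summary text before fitting the tokenizer.

```python
import numpy as np

START, END, PAD = 1, 2, 0  # hypothetical token ids

def make_decoder_pairs(summary_ids, max_len):
    """Wrap a summary in START/END, pad to max_len, and return (input, output)."""
    wrapped = [START] + list(summary_ids)[: max_len - 2] + [END]
    padded = wrapped + [PAD] * (max_len - len(wrapped))
    target = np.array([padded])
    return target[:, :-1], target[:, 1:]

decoder_input, decoder_output = make_decoder_pairs([7, 8, 9], max_len=6)
print(decoder_input)   # starts with START
print(decoder_output)  # same tokens shifted one position left
```

At every position the decoder sees the tokens up to that point and is trained to emit the next one, ending with the END marker.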