# Project Overview

This project focuses on **text summarization** using two approaches: a traditional **Seq2Seq model** with LSTM/GRU and a **Transformer-based model**. The goal is to see how each model performs and understand the difference between step-by-step sequence processing and attention-based processing.

### Steps in the Project
1. **Dataset Preparation**  
   - Load the XSum dataset with articles and summaries.  
   - Tokenize and pad sequences so they can be fed into the models.

2. **Seq2Seq Model (LSTM/GRU)**  
   - Build an encoder-decoder model.  
   - Train it to generate summaries from the input articles.  
   - Use attention to help the model focus on relevant parts of the input.

3. **Transformer Model**  
   - Build a Transformer-based encoder-decoder model.  
   - Use self-attention to capture relationships between all tokens.  
   - Train on the same dataset to generate summaries.

4. **Comparison**  
   - Compare the two models using metrics like ROUGE.  
   - Look at differences in summary quality, speed, and how well they handle long sequences.
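
To make the ROUGE comparison concrete, here is a minimal sketch of a ROUGE-1 F1 score in plain Python. It is a simplified illustration, not the official metric: the function name `rouge1_f1` is hypothetical, tokenization is plain lowercased whitespace splitting, and no stemming is applied. A real evaluation would normally use an established ROUGE implementation.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))  # 0.833 (5 of 6 unigrams match in both directions)
```

The same function can be applied to each (reference summary, generated summary) pair and averaged per model.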

# Seq2Seq and Encoder-Decoder

## What is a Seq2Seq Model
A sequence-to-sequence (Seq2Seq) model is designed to take an input sequence and produce an output sequence. It’s widely used in tasks like machine translation, text summarization, and chatbots.

**Example:**  
Input: "Hello, how are you?"  
Output: "Ciao, come stai?"

## Encoder-Decoder Architecture
A typical Seq2Seq model has two main parts:

### Encoder
The encoder processes the input sequence and compresses it into a single context vector or hidden state. This vector is meant to summarize the important information from the input.  

### Decoder
The decoder takes the context vector from the encoder and generates the output sequence one step at a time.  
During training, it often uses teacher forcing, meaning it receives the correct previous token rather than its own prediction.  

In classic Seq2Seq models, the encoder and decoder are recurrent networks: vanilla RNNs, LSTMs, or GRUs.
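
Teacher forcing only applies during training. At inference time the decoder has to feed its own predictions back in. The sketch below shows one simple way to do this (greedy decoding). The names `greedy_decode` and `toy_step`, and the lookup table standing in for a trained decoder, are assumptions made for illustration.

```python
import numpy as np

def greedy_decode(step_fn, state, start_id, end_id, max_len):
    """Generate token ids one at a time, feeding each prediction back in."""
    token, output = start_id, []
    for _ in range(max_len):
        probs, state = step_fn(token, state)
        token = int(np.argmax(probs))  # pick the most likely next token
        if token == end_id:
            break
        output.append(token)
    return output

# Toy stand-in for a trained decoder (hypothetical): row i holds the
# next-token probabilities after seeing token i. Ids: 0 = start, 1 = end.
NEXT = np.array([
    [0.00, 0.10, 0.80, 0.10],  # after start (0): token 2 is most likely
    [0.00, 0.90, 0.00, 0.10],  # after end (1): never reached
    [0.00, 0.20, 0.10, 0.70],  # after token 2: token 3 is most likely
    [0.00, 0.90, 0.05, 0.05],  # after token 3: end is most likely
])

def toy_step(token, state):
    # A real decoder would also update its hidden state here.
    return NEXT[token], state

decoded = greedy_decode(toy_step, state=None, start_id=0, end_id=1, max_len=10)
print(decoded)  # [2, 3]
```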

## How It Works
1. The encoder reads the input sequence and outputs the final hidden state.  
2. The decoder starts from this hidden state and generates the output sequence token by token.  
3. During training, the model compares the probability distribution it predicts at each step with the true token and computes a loss (e.g., cross-entropy).
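
The loss in step 3 can be sketched in NumPy as the average negative log-probability assigned to the true tokens. The function name `sequence_cross_entropy` and the choice to skip padded positions (id 0) are assumptions of this sketch, not a description of what Keras does internally.

```python
import numpy as np

def sequence_cross_entropy(probs, targets, pad_id=0):
    """Mean negative log-likelihood of the true tokens, skipping padding.

    probs:   (steps, vocab) predicted distribution at each decoder step
    targets: (steps,) id of the true token at each step
    """
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets)
    mask = targets != pad_id
    # Probability the model gave to the correct token at every step.
    true_token_probs = probs[np.arange(len(targets)), targets]
    return float(-np.log(true_token_probs[mask]).mean())

probs = np.array([
    [0.1, 0.7, 0.2],  # step 0: true token is 1
    [0.1, 0.2, 0.7],  # step 1: true token is 2
    [0.8, 0.1, 0.1],  # step 2: padding, ignored
])
targets = np.array([1, 2, 0])
loss = sequence_cross_entropy(probs, targets)
print(round(loss, 4))  # 0.3567, i.e. -log(0.7)
```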


## Attention
In a basic Seq2Seq model, the encoder squashes the whole input document into a single **context vector**. That works okay for short texts, but with longer documents the decoder can easily “forget” important details.

This is where an **attention mechanism** comes in. Instead of just using the encoder’s last hidden state, the decoder can look at **all the hidden states** of the encoder. At each step of generating the summary, it calculates **weights** for each input token to figure out which parts of the text are most relevant.

Using attention usually improves the quality of the summaries, since the model can focus on the right parts of the input at the right time.  
This idea is also the core of **Transformers**, which take it further with **self-attention**: every token attends to every other token directly, so distant parts of a sequence are connected in a single step instead of through a long recurrent chain.

### How Attention Works 
At each decoding step, the decoder looks at **all the hidden states** of the encoder and decides how much to focus on each one when predicting the next token:


1. **Score Calculation**  
   - For each decoder step, the model calculates a **score** for every encoder hidden state.  
   - This score measures how relevant each input token is to the current token being generated.

2. **Softmax to Get Weights**  
   - The scores are passed through a **softmax function**, turning them into weights that sum to 1.  
   - These weights tell the decoder how much attention to pay to each input token.

3. **Context Vector**  
   - The encoder hidden states are combined using these attention weights to form a **dynamic context vector**.  
   - Unlike the single context vector in vanilla Seq2Seq, this vector changes at every decoder step depending on what the model is currently generating.

4. **Decoder Output**  
   - The context vector is then fed into the decoder (along with the previous token) to predict the next token.  
   - This allows the decoder to “focus” on the most relevant parts of the input for each step.

**Intuition:**  
Think of it like reading a paragraph and highlighting the important words as you write a summary. The decoder “looks back” at all the input words and decides which ones matter most at each step.
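
The four steps above can be sketched for a single decoder step in NumPy. The scoring function here is a plain dot product, which is only one common choice (additive scoring is another); the text does not say which one the project uses, so treat `attend` and the toy vectors as illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """One attention step: scores -> weights -> dynamic context vector."""
    scores = encoder_states @ decoder_state  # 1. one score per input token
    weights = softmax(scores)                # 2. weights that sum to 1
    context = weights @ encoder_states       # 3. weighted sum of encoder states
    return weights, context                  # 4. context goes to the decoder

# Three input tokens with 2-dimensional hidden states (toy values).
encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])
decoder_state = np.array([2.0, 0.0])

weights, context = attend(decoder_state, encoder_states)
print(weights.round(3), context.round(3))
```

With these values the first and third input tokens receive equal, larger weights than the second one, because their hidden states align with the decoder state.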



# Transformers
Transformers can be seen as an evolution of Seq2Seq models: they replace step-by-step LSTM/GRU processing with attention applied to all positions in parallel. Instead of recurrence, they rely on **attention mechanisms** (combined with position-wise feed-forward layers) to model relationships between all tokens in the input at once, which helps with long sequences.

### Key Components
- **Self-Attention:** Allows the model to weigh the importance of each token in the sequence relative to the others. This helps capture long-range dependencies better than RNNs.
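- **Encoder-Decoder Structure:** Like Seq2Seq models, Transformers have an encoder that processes the input and a decoder that generates the output. Both stack layers of self-attention and feed-forward networks; the decoder additionally attends to the encoder’s outputs (cross-attention) and masks future positions in its own self-attention.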
- **Positional Encoding:** Since Transformers don’t process tokens sequentially, they add positional information so the model knows the order of tokens.
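
As a rough illustration of the first and third components, the sketch below adds positional information to token vectors and runs a single self-attention head over them in NumPy. It assumes sinusoidal positional encodings and scaled dot-product attention, which are common choices but are not specified in the text above. The weights are random, so only the shapes and the row-wise normalization are meaningful.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encoding (d_model is assumed to be even)."""
    positions = np.arange(seq_len)[:, None]
    dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / np.power(10000.0, dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions
    return encoding

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens at once."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])       # (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
# Token embeddings (random stand-ins) plus position information.
x = rng.normal(size=(seq_len, d_model)) + sinusoidal_positions(seq_len, d_model)
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)  # (5, 8) (5, 5)
```

Each row of `weights` says how much one token attends to every token in the sequence, including itself.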

### Advantages over LSTM/GRU Seq2Seq
- Can process all positions of a sequence **in parallel** during training, which speeds it up (generation at inference time is still token by token).
- Handle **long sequences** more effectively with attention.
- Easier to scale to large datasets and very deep models.

### Use Cases
Transformers are the backbone of many state-of-the-art models for tasks such as:
- Machine translation (e.g., T5, MarianMT)
- Text summarization (e.g., BART, Pegasus)
- Question answering and chatbots (e.g., GPT, BERT-based models)

In [1]:
%%capture
!pip install -q datasets

In [2]:
from datasets import load_dataset
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


In [3]:
# load_dataset("xsum") downloads and loads the XSum dataset using the Hugging Face datasets library.
# Each split is a Hugging Face `Dataset` object, similar to a DataFrame, with columns like "document" and "summary".

dataset = load_dataset("xsum", trust_remote_code=True)
train_data = dataset['train']
val_data = dataset['validation']
test_data = dataset['test']

Generating train split:   0%|          | 0/204045 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11332 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11334 [00:00<?, ? examples/s]

In [4]:
train_data

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 204045
})

In [5]:
train_data = train_data.select(range(1000))

In [6]:
# The Tokenizer converts raw text into sequences of integers that can be processed by a neural network.
# Each unique word in the dataset is assigned a unique integer index.
# When we call texts_to_sequences(), each word in a sentence is replaced by its corresponding index.
# This allows the model to work with numbers instead of raw text, which is required for embeddings and LSTM layers.
# Padding is applied to ensure all sequences have the same length, so they can be processed in batches.

doc_tokenizer = Tokenizer()
doc_tokenizer.fit_on_texts([d['document'] for d in train_data])

summary_tokenizer = Tokenizer()
summary_tokenizer.fit_on_texts([d['summary'] for d in train_data])

In [7]:
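# pad_sequences gives every sequence the same length by truncating longer sequences
# and padding shorter ones with a fill value (0 by default).
# Uniform lengths are needed so that sequences can be stacked into fixed-shape batches.
# Note: padding='post' only controls where the zeros go. Truncation still uses the
# default truncating='pre', which drops tokens from the START of over-length sequences.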

max_doc_len = 200 
max_summary_len = 50

X_train = pad_sequences(doc_tokenizer.texts_to_sequences([d['document'] for d in train_data]), maxlen=max_doc_len, padding='post')
y_train = pad_sequences(summary_tokenizer.texts_to_sequences([d['summary'] for d in train_data]), maxlen=max_summary_len, padding='post')
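
The difference between padding and truncation sides is easy to miss, so here is a plain-Python illustration for a single sequence. The helper `pad_or_truncate` is hypothetical and only mirrors the `padding` and `truncating` options of `pad_sequences` for demonstration; its `truncating` default is set to `"pre"` to match the Keras default.

```python
def pad_or_truncate(seq, maxlen, padding="post", truncating="pre", value=0):
    """Illustrative re-implementation of padding/truncating one list of ids."""
    if len(seq) > maxlen:
        # 'pre' keeps the END of the sequence, 'post' keeps the BEGINNING.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    return seq + pad if padding == "post" else pad + seq

tokens = [11, 12, 13, 14, 15]
print(pad_or_truncate(tokens, 3))                     # [13, 14, 15]
print(pad_or_truncate(tokens, 3, truncating="post"))  # [11, 12, 13]
print(pad_or_truncate([11, 12], 4))                   # [11, 12, 0, 0]
```

For news articles, keeping the beginning (`truncating='post'`) may be the more natural choice, since the opening sentences often carry the main point. That is a design consideration rather than a requirement.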


In [8]:
# In seq2seq models, the decoder predicts the next token in the target sequence given the previous tokens.
#
# y_train_input = y_train[:, :-1] -> all tokens of the target sequence except the last one.
#    These are fed to the decoder as the "previous tokens" (teacher forcing).
#
# y_train_output = y_train[:, 1:] -> all tokens of the target sequence except the first one.
#    These are the targets: position t of the output is the token that follows position t of the input.
#
# Note: the summaries here carry no explicit start/end markers, so the first summary word
# is only ever an input (never a target) and the model gets no end-of-summary signal.

y_train_input = y_train[:, :-1]
y_train_output = y_train[:, 1:]
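
A common way to make this shift well-defined is to wrap every summary in start and end markers before shifting. The sketch below shows the idea on raw token ids. The ids `START`, `END`, `PAD` and the helper `make_decoder_pair` are hypothetical and not part of the notebook. With the Keras `Tokenizer`, the markers would need to be plain words (for example `sostok` / `eostok`), because the default filters strip characters such as `<` and `>`.

```python
START, END, PAD = 1, 2, 0  # hypothetical ids reserved for the markers and padding

def make_decoder_pair(summary_ids, max_len):
    """Build (decoder input, decoder target) for one summary via a one-step shift."""
    full = [START] + summary_ids + [END]
    # Truncate or pad to max_len (truncation can drop END for very long summaries).
    full = full[:max_len] + [PAD] * max(0, max_len - len(full))
    return full[:-1], full[1:]

decoder_input, decoder_target = make_decoder_pair([7, 8, 9], max_len=7)
print(decoder_input)   # [1, 7, 8, 9, 2, 0]
print(decoder_target)  # [7, 8, 9, 2, 0, 0]
```

With the markers in place, the first real word (7) becomes a target given `START`, and the model is trained to emit `END` after the last word.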

In [9]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense


vocab_doc = len(doc_tokenizer.word_index) + 1  # +1 for padding
vocab_summary = len(summary_tokenizer.word_index) + 1
embedding_dim = 256 
latent_dim = 512   


# Encoder 
encoder_inputs = Input(shape=(max_doc_len,))  
encoder_embedding = Embedding(input_dim=vocab_doc, 
                              output_dim=embedding_dim, 
                              mask_zero=True)(encoder_inputs)

# Encoder LSTM
encoder_lstm = LSTM(latent_dim, return_state=True)
_, state_h, state_c = encoder_lstm(encoder_embedding)
encoder_states = [state_h, state_c] 

# Decoder
decoder_inputs = Input(shape=(max_summary_len-1,)) 
decoder_embedding = Embedding(input_dim=vocab_summary, 
                              output_dim=embedding_dim, 
                              mask_zero=True)(decoder_inputs)


decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)

decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = Dense(vocab_summary, activation='softmax')  
decoder_outputs = decoder_dense(decoder_outputs)


# Model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()



In [10]:
history = model.fit(
    [X_train, y_train_input],
    y_train_output[..., None], 
    batch_size=64,
    epochs=10,
    validation_split=0.1
)

model.save_weights("/kaggle/working/seq2seq.weights.h5")

Epoch 1/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m75s[0m 5s/step - loss: 8.3185 - val_loss: 7.3260
Epoch 2/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m69s[0m 5s/step - loss: 6.9916 - val_loss: 7.3455
Epoch 3/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m82s[0m 5s/step - loss: 6.8091 - val_loss: 7.2934
Epoch 4/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m69s[0m 5s/step - loss: 6.6644 - val_loss: 7.2518
Epoch 5/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m68s[0m 5s/step - loss: 6.5583 - val_loss: 7.2260
Epoch 6/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m67s[0m 4s/step - loss: 6.4590 - val_loss: 7.1923
Epoch 7/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m67s[0m 4s/step - loss: 6.3093 - val_loss: 7.1429
Epoch 8/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m65s[0m 4s/step - loss: 6.1743 - val_loss: 7.1302
Epoch 9/10
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[

In [11]:
model.load_weights("/kaggle/working/seq2seq.weights.h5")