
In [6]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
pariza_bbc_news_summary_path = kagglehub.dataset_download('pariza/bbc-news-summary')

print('Data source import complete.')


Downloading from https://www.kaggle.com/api/v1/datasets/download/pariza/bbc-news-summary?dataset_version_number=2...


Extracting files...





Data source import complete.


# 📰 Abstractive Text Summarization - BBC News Articles
## End-to-End Implementation using PEGASUS Transformer

**Author:** Expert NLP Engineer  
**Date:** December 2025  
**Framework:** PyTorch + Hugging Face Transformers  

---

## 📋 Project Overview

This notebook implements a complete abstractive text summarization system for news articles using state-of-the-art transformer models.

### Key Specifications:
- **Dataset:** BBC News Summary (Kaggle)
- **Task:** Abstractive Summarization
- **Model:** PEGASUS (google/pegasus-cnn_dailymail) or BART
- **Evaluation:** ROUGE-1, ROUGE-2, ROUGE-L, BLEU
- **Interface:** Gradio Web UI

---

## 🔧 Step 1: Environment Setup

### What we're installing:
- **transformers**: Hugging Face library for pre-trained models (PEGASUS/BART)
- **datasets**: Data handling utilities
- **evaluate**: Metrics computation (ROUGE, BLEU)
- **rouge_score**: ROUGE metric implementation
- **nltk**: Natural language processing toolkit (sentence tokenization)
- **torch**: PyTorch deep learning framework
- **pandas**: Data manipulation and analysis
- **gradio**: Web UI creation
- **accelerate**: Training optimization

### Why these libraries?
Modern NLP requires specialized tools for transformer models. Hugging Face provides the de-facto standard ecosystem for working with pre-trained language models.

In [7]:
# Install required packages
!pip install -q transformers datasets evaluate rouge_score nltk torch pandas gradio accelerate

# Verify installation
import transformers
print(f"Transformers version: {transformers.__version__}")
print("✅ All packages installed successfully!")

Transformers version: 4.57.3
✅ All packages installed successfully!


---

## 📂 Step 2: Data Loading & Preprocessing

### Understanding the BBC News Dataset Structure

The BBC News Summary dataset from Kaggle has the following structure:
```
BBC News Summary/
├── News Articles/
│   ├── business/
│   │   ├── 001.txt
│   │   ├── 002.txt
│   │   └── ...
│   ├── entertainment/
│   ├── politics/
│   ├── sport/
│   └── tech/
└── Summaries/
    ├── business/
    │   ├── 001.txt
    │   ├── 002.txt
    │   └── ...
    ├── entertainment/
    ├── politics/
    ├── sport/
    └── tech/
```

### Our Approach:
1. **Traverse** both directories (News Articles and Summaries)
2. **Match** articles with summaries using identical filenames
3. **Load** content into a structured DataFrame
4. **Clean** text while preserving important features

### Why minimal cleaning?
Modern transformers like PEGASUS are pre-trained on real-world text with:
- Punctuation (helps with sentence boundaries)
- Capitalization (indicates proper nouns, sentence starts)
- Special characters (quotes, dashes, etc.)

Over-cleaning can remove contextual information the model needs!
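
As a rough illustration of the point above, the sketch below contrasts whitespace-only normalization with a more aggressive cleaner. The `over_clean` function is a hypothetical example written for comparison, not something this notebook uses.

```python
import re

def normalize_whitespace(text):
    # Collapse runs of whitespace and trim the ends; nothing else changes.
    return re.sub(r"\s+", " ", text).strip()

def over_clean(text):
    # Hypothetical aggressive cleaner: lowercase and drop all punctuation.
    return re.sub(r"[^a-z0-9 ]", "", normalize_whitespace(text).lower())

raw = 'Tony Blair said:\n  "No TV debate."  '
print(normalize_whitespace(raw))
print(over_clean(raw))
```

The first result keeps the quotation marks, the colon, and the capitalized name; the second loses all three, which is the kind of context a pre-trained model can otherwise use.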

In [8]:
import os
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.model_selection import train_test_split
import re

def load_bbc_news_data(base_path='BBC News Summary'):
    """
    Load BBC News articles and their corresponding summaries.

    Parameters:
    -----------
    base_path : str
        Root directory containing 'News Articles' and 'Summaries' folders

    Returns:
    --------
    pandas.DataFrame
        DataFrame with columns: ['category', 'article', 'summary']
    """
    articles_path = os.path.join(base_path, 'News Articles')
    summaries_path = os.path.join(base_path, 'Summaries')

    data_records = []

    # Traverse all category folders
    for category in os.listdir(articles_path):
        category_articles_path = os.path.join(articles_path, category)
        category_summaries_path = os.path.join(summaries_path, category)

        if not os.path.isdir(category_articles_path):
            continue

        # Match article files with summary files
        for article_file in os.listdir(category_articles_path):
            if not article_file.endswith('.txt'):
                continue

            article_filepath = os.path.join(category_articles_path, article_file)
            summary_filepath = os.path.join(category_summaries_path, article_file)

            try:
                with open(article_filepath, 'r', encoding='utf-8', errors='ignore') as f:
                    article_text = f.read()
                with open(summary_filepath, 'r', encoding='utf-8', errors='ignore') as f:
                    summary_text = f.read()

                data_records.append({
                    'category': category,
                    'article': article_text,
                    'summary': summary_text
                })
            except FileNotFoundError:
                print(f"⚠️  Warning: Summary not found for {article_file}")
                continue

    return pd.DataFrame(data_records)


def clean_text(text):
    """
    Basic text cleaning while preserving linguistic features.
    """
    # Remove excessive whitespace and newlines
    text = re.sub(r'\s+', ' ', text)
    # Remove leading/trailing whitespace
    text = text.strip()
    return text


def preprocess_data(df):
    """Apply cleaning and filtering to the dataset."""
    # Clean both articles and summaries
    df['article'] = df['article'].apply(clean_text)
    df['summary'] = df['summary'].apply(clean_text)

    # Remove entries that are too short (likely errors)
    df = df[(df['article'].str.len() > 50) & (df['summary'].str.len() > 10)]

    # Reset index
    df = df.reset_index(drop=True)

    return df

# Execute data loading
print("📥 Loading BBC News Dataset...")
df = load_bbc_news_data(os.path.join(pariza_bbc_news_summary_path, 'BBC News Summary'))
df = preprocess_data(df)

print(f"✅ Loaded {len(df)} article-summary pairs")
print(f"\n📊 Dataset Statistics:")
print(f"   - Average article length: {df['article'].str.len().mean():.0f} characters")
print(f"   - Average summary length: {df['summary'].str.len().mean():.0f} characters")
print(f"\n📁 Categories: {df['category'].unique().tolist()}")
print(f"\n🔍 Sample Data:")
print(df.head(2))

📥 Loading BBC News Dataset...
✅ Loaded 2225 article-summary pairs

📊 Dataset Statistics:
   - Average article length: 2259 characters
   - Average summary length: 1001 characters

📁 Categories: ['politics', 'business', 'tech', 'sport', 'entertainment']

🔍 Sample Data:
   category                                            article  \
0  politics  UK 'needs true immigration data' A former Home...   
1  politics  No election TV debate, says Blair Tony Blair h...   

                                             summary  
0  She said this would counter "so-called indepen...  
1  Tony Blair has said he will not take part in a...  


### 📊 Data Splitting Strategy

We'll split our dataset into three sets:
- **Training Set (80%)**: Used to fine-tune the model
- **Validation Set (10%)**: Monitor performance and prevent overfitting
- **Test Set (10%)**: Final evaluation on completely unseen data

### Why this ratio?
- Pre-trained models (PEGASUS) already know language patterns
- Fine-tuning requires less data than training from scratch
- 80% provides sufficient examples for domain adaptation
- 10% validation is enough to track generalization
- 10% test gives reliable final evaluation
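
The 80/10/10 split is reached in two stages: 20% is held out first, then that portion is halved. The sketch below only reproduces the arithmetic for the 2,225 pairs loaded above; it assumes each held-out share is rounded up, and `two_stage_split` is a hypothetical helper, not part of scikit-learn.

```python
import math
from fractions import Fraction

def two_stage_split(n_samples, holdout=Fraction(1, 5), test_share=Fraction(1, 2)):
    # Assumption: each held-out share is rounded up; the rest goes to the other part.
    n_holdout = math.ceil(n_samples * holdout)
    n_train = n_samples - n_holdout
    n_test = math.ceil(n_holdout * test_share)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test

print(two_stage_split(2225))  # (1780, 222, 223)
```

Because 445 held-out samples cannot be halved evenly, one of the two 10% sets ends up one sample larger than the other.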

In [9]:
# Split into Train/Validation/Test (80/10/10)
train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['category'])
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42, stratify=temp_df['category'])

print("📊 Dataset Splits:")
print("="*50)
print(f"Training Set:   {len(train_df):4d} samples ({len(train_df)/len(df)*100:.1f}%)")
print(f"Validation Set: {len(val_df):4d} samples ({len(val_df)/len(df)*100:.1f}%)")
print(f"Test Set:       {len(test_df):4d} samples ({len(test_df)/len(df)*100:.1f}%)")
print("="*50)

📊 Dataset Splits:
Training Set:   1780 samples (80.0%)
Validation Set:  222 samples (10.0%)
Test Set:        223 samples (10.0%)


---

## 🔤 Step 3: Text Representation & Tokenization

### Understanding Tokenization in Transformers

**What is tokenization?**
Tokenization converts text into numerical representations that neural networks can process.

### The Pipeline:
```
Raw Text → Tokens → Token IDs → Embeddings (inside model)
```

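**Example (illustrative only; the IDs follow a BERT-style vocabulary in which 101 and 102 mark the start and end of the sequence, so PEGASUS produces different tokens and IDs):**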
```
Text: "The cat sat on the mat."
Tokens: ["The", "cat", "sat", "on", "the", "mat", "."]
Token IDs: [101, 2368, 2938, 2006, 1996, 13523, 1012, 102]
```

### Key Tokenizer Operations:

1. **Vocabulary Mapping**: Each token → unique integer ID
2. **Special Tokens**: Add markers such as `</s>` (end of sequence) and `<pad>`; PEGASUS does not use a `<s>` start token
3. **Padding**: Extend short sequences to fixed length
4. **Truncation**: Shorten long sequences to max length
5. **Attention Masks**: Mark real tokens (1) vs padding (0)
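
Padding, truncation, and attention masks can be sketched in a few lines of plain Python. This is a simplified illustration, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, `pad_id=0` is an assumed padding ID, and a real tokenizer also keeps the end-of-sequence token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    # Truncation: keep at most max_length IDs.
    ids = list(token_ids[:max_length])
    # Attention mask: 1 for real tokens, 0 for padding.
    mask = [1] * len(ids)
    # Padding: extend both lists to exactly max_length.
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([139, 6442, 2380, 1], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10, 20)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Both results have length 6, so they can be stacked into one batch; the mask tells the model which positions to ignore.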

### Why fixed lengths?
- GPUs process batches efficiently with uniform shapes
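- MAX_INPUT_LENGTH = 512 tokens ≈ 350-400 words; longer articles are truncated
- MAX_TARGET_LENGTH = 128 tokens ≈ 90-100 words; reference summaries in this dataset average about 1,001 characters, so the longer ones are truncated as well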

### Embeddings: Where do they come from?
**Important:** Embeddings are NOT created during tokenization!
- Token IDs are just integers
- The model's **embedding layer** converts IDs → dense vectors
- This happens automatically during forward pass
- Embedding layer is pre-trained and gets fine-tuned
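
Conceptually, the embedding layer is a table lookup. The toy sketch below uses a made-up 4-row table with 3-dimensional vectors; the values and the `embed` helper are purely illustrative, and a real model stores the table as a learned weight matrix.

```python
# Toy embedding table: one row of floats per token ID (values are made up).
embedding_table = [
    [0.0, 0.0, 0.0],   # ID 0
    [0.1, -0.2, 0.3],  # ID 1
    [0.5, 0.4, -0.1],  # ID 2
    [-0.3, 0.2, 0.7],  # ID 3
]

def embed(token_ids, table):
    # An embedding layer is a row lookup: each integer ID selects one vector.
    return [table[i] for i in token_ids]

vectors = embed([2, 3, 2, 0], embedding_table)
print(vectors)
```

The same ID always maps to the same vector, which is why the two occurrences of ID 2 produce identical rows before the encoder adds context.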

In [10]:
from transformers import AutoTokenizer
from torch.utils.data import Dataset
import torch

# Model Selection
# PEGASUS: Best quality for summarization (pre-trained with gap sentence generation)
# BART: Good alternative if memory/compute is limited
MODEL_NAME = "google/pegasus-cnn_dailymail"
# Alternative: "facebook/bart-base"

print(f"🤖 Initializing tokenizer for: {MODEL_NAME}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Tokenization parameters
MAX_INPUT_LENGTH = 512   # Maximum article length in tokens
MAX_TARGET_LENGTH = 128  # Maximum summary length in tokens

print(f"\n✅ Tokenizer loaded successfully!")
print(f"   - Vocabulary size: {len(tokenizer)}")
print(f"   - Max input length: {MAX_INPUT_LENGTH} tokens")
print(f"   - Max target length: {MAX_TARGET_LENGTH} tokens")

# Demonstrate tokenization
sample_text = "The BBC News dataset contains articles from multiple categories."
tokens = tokenizer.tokenize(sample_text)
token_ids = tokenizer.encode(sample_text)

print(f"\n🔍 Tokenization Example:")
print(f"   Original: {sample_text}")
print(f"   Tokens: {tokens[:10]}...")
print(f"   Token IDs: {token_ids[:10]}...")

🤖 Initializing tokenizer for: google/pegasus-cnn_dailymail


✅ Tokenizer loaded successfully!
   - Vocabulary size: 96103
   - Max input length: 512 tokens
   - Max target length: 128 tokens

🔍 Tokenization Example:
   Original: The BBC News dataset contains articles from multiple categories.
   Tokens: ['▁The', '▁BBC', '▁News', '▁dataset', '▁contains', '▁articles', '▁from', '▁multiple', '▁categories', '.']...
   Token IDs: [139, 6442, 2380, 20886, 1733, 2391, 135, 1079, 3510, 107]...


### 🏗️ Building a Custom PyTorch Dataset

We create a custom `Dataset` class to:
1. Wrap our DataFrame for PyTorch compatibility
2. Handle tokenization dynamically
3. Prepare inputs and labels for training
4. Enable efficient batch loading

**Why custom instead of Hugging Face datasets?**
- More control over preprocessing
- Easier to modify for specific needs
- Direct integration with our DataFrame
- Clear understanding of data flow

In [11]:
class SummarizationDataset(Dataset):
    """
    Custom PyTorch Dataset for abstractive text summarization.

    This class handles:
    - Dynamic tokenization of articles and summaries
    - Padding and truncation to fixed lengths
    - Creation of attention masks
    - Proper label formatting for Seq2Seq training
    """

    def __init__(self, dataframe, tokenizer, max_input_len, max_target_len):
        """
        Parameters:
        -----------
        dataframe : pd.DataFrame
            DataFrame with 'article' and 'summary' columns
        tokenizer : PreTrainedTokenizer
            Tokenizer from transformers library
        max_input_len : int
            Maximum length for input articles
        max_target_len : int
            Maximum length for target summaries
        """
        self.data = dataframe.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.max_input_len = max_input_len
        self.max_target_len = max_target_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        """
        Get a single tokenized example.

        Returns:
        --------
        dict with keys:
            - input_ids: Token IDs for the article
            - attention_mask: Mask for padding (1=real token, 0=padding)
            - labels: Token IDs for the summary (training target)
        """
        article = str(self.data.loc[index, 'article'])
        summary = str(self.data.loc[index, 'summary'])

        # Tokenize input article
        inputs = self.tokenizer(
            article,
            max_length=self.max_input_len,
            padding='max_length',      # Pad to max_length
            truncation=True,            # Truncate if longer than max_length
            return_tensors='pt'         # Return PyTorch tensors
        )

        # Tokenize target summary
        targets = self.tokenizer(
            summary,
            max_length=self.max_target_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

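        # Replace padding IDs in the labels with -100 so the loss ignores them
        labels = targets['input_ids'].squeeze(0)
        labels[labels == self.tokenizer.pad_token_id] = -100

        return {
            'input_ids': inputs['input_ids'].squeeze(0),
            'attention_mask': inputs['attention_mask'].squeeze(0),
            'labels': labels
        }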


# Create dataset objects
print("🏗️  Creating PyTorch datasets...")
train_dataset = SummarizationDataset(train_df, tokenizer, MAX_INPUT_LENGTH, MAX_TARGET_LENGTH)
val_dataset = SummarizationDataset(val_df, tokenizer, MAX_INPUT_LENGTH, MAX_TARGET_LENGTH)
test_dataset = SummarizationDataset(test_df, tokenizer, MAX_INPUT_LENGTH, MAX_TARGET_LENGTH)

print(f"✅ Datasets created successfully!")
print(f"\n📦 Sample Batch Shape:")
sample = train_dataset[0]
print(f"   - input_ids: {sample['input_ids'].shape}")
print(f"   - attention_mask: {sample['attention_mask'].shape}")
print(f"   - labels: {sample['labels'].shape}")

🏗️  Creating PyTorch datasets...
✅ Datasets created successfully!

📦 Sample Batch Shape:
   - input_ids: torch.Size([512])
   - attention_mask: torch.Size([512])
   - labels: torch.Size([128])


---

## 🧠 Step 4: Model Definition & Architecture

### PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models

**Why PEGASUS for summarization?**
1. **Purpose-built**: Pre-trained specifically for summarization tasks
2. **Gap Sentence Generation (GSG)**: Pre-training objective similar to actual summarization
3. **Strong benchmark results**: Reported state-of-the-art ROUGE scores on multiple summarization benchmarks when it was published
4. **Transfer Learning**: Already knows how to compress information

### Model Architecture:
```
┌─────────────────────────────────────────┐
│          PEGASUS Architecture           │
├─────────────────────────────────────────┤
│  Input: Article (512 tokens)            │
│           ↓                             │
│  Encoder (16 layers)                    │
│   - Self-attention                      │
│   - Feed-forward                        │
│   - Layer normalization                 │
│           ↓                             │
│  Encoder Output (contextualized vectors)│
│           ↓                             │
│  Decoder (16 layers)                    │
│   - Self-attention                      │
│   - Cross-attention (to encoder)        │
│   - Feed-forward                        │
│           ↓                             │
│  Output: Summary (128 tokens)           │
└─────────────────────────────────────────┘
```

### Generation Parameters:

**Beam Search (num_beams=4):**
- Explores multiple sequence possibilities simultaneously
- Keeps top-K candidates at each step
- Higher beams = better quality but slower
- 4 beams is a good balance

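**Length Penalty:**
- Exponent applied to the sequence length when beam scores are normalized
- Larger values favor longer summaries, smaller values favor shorter ones; the 1.0 used here normalizes each score by its length

**Early Stopping:**
- Stops beam search as soon as `num_beams` finished candidates exist
- Saves computation time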
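
To make the length penalty concrete, here is a minimal sketch. It assumes the common formulation in which a finished beam's cumulative log-probability is divided by `length ** length_penalty`; the two hypotheses and the helper names are made up for illustration.

```python
def beam_score(sum_log_probs, length, length_penalty):
    # Length-normalized score: cumulative log-probability / length ** penalty.
    return sum_log_probs / (length ** length_penalty)

def preferred(length_penalty):
    # Two made-up finished hypotheses: a short one and a longer, less likely one.
    short = beam_score(-4.0, 4, length_penalty)
    long = beam_score(-6.0, 8, length_penalty)
    return "short" if short > long else "long"

for penalty in (0.0, 1.0):
    print(penalty, preferred(penalty))
```

With a penalty of 0.0 the raw log-probabilities are compared and the short hypothesis wins; with 1.0 the scores become per-token averages and the longer hypothesis wins.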

In [12]:
from transformers import AutoModelForSeq2SeqLM

# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"🖥️  Device: {device}")
if torch.cuda.is_available():
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Load pre-trained model
print(f"\n📥 Loading pre-trained model: {MODEL_NAME}")
print("   This may take a few minutes...")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Move model to GPU if available
model = model.to(device)

# Model statistics
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"\n✅ Model loaded successfully!")
print(f"   - Total parameters: {total_params:,}")
print(f"   - Trainable parameters: {trainable_params:,}")
print(f"   - Model size: ~{total_params * 4 / 1e9:.2f} GB (fp32)")

# Configure generation parameters
generation_config = {
    'max_length': MAX_TARGET_LENGTH,
    'num_beams': 4,              # Beam search with 4 beams
    'length_penalty': 1.0,       # Neutral length preference
    'early_stopping': True,      # Stop when all beams finish
    'no_repeat_ngram_size': 3    # Avoid 3-gram repetition
}

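# Store generation settings on the model's GenerationConfig rather than the model config
model.generation_config.update(**generation_config)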
print(f"\n⚙️  Generation config: {generation_config}")

🖥️  Device: cpu

📥 Loading pre-trained model: google/pegasus-cnn_dailymail
   This may take a few minutes...


Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



✅ Model loaded successfully!
   - Total parameters: 570,797,056
   - Trainable parameters: 568,699,904
   - Model size: ~2.28 GB (fp32)

⚙️  Generation config: {'max_length': 128, 'num_beams': 4, 'length_penalty': 1.0, 'early_stopping': True, 'no_repeat_ngram_size': 3}


---

## 🎯 Step 5: Model Training (Fine-tuning)

### Understanding Fine-tuning vs Training from Scratch

**Training from Scratch:**
- Random initialization
- Requires millions of examples
- Takes weeks on multiple GPUs
- Learns language from zero

**Fine-tuning (our approach):**
- Start with pre-trained weights
- Requires thousands of examples
- Takes hours on single GPU
- Adapts existing knowledge to our domain

### Training Configuration Explained:

**Batch Size & Gradient Accumulation:**
```
per_device_train_batch_size = 1   # Samples per GPU per step
gradient_accumulation_steps = 16  # Accumulate over 16 steps
Effective batch size = 1 × 16 = 16
```
*Why?* Larger per-step batches may not fit in Colab's 15GB GPU memory for a model of this size
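
The idea behind gradient accumulation can be checked on a toy problem. The sketch below is an illustration under simplifying assumptions (a one-parameter model `y = w * x`, a loss averaged per sample, and equal-sized micro-batches); the function names and data are made up.

```python
import math

def grad(w, xs, ys):
    # Gradient of mean squared error for the model y = w * x.
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    # Sum per-micro-batch gradients, each scaled by 1 / accumulation steps.
    steps = len(xs) // micro_batch_size
    total = 0.0
    for i in range(steps):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        total += grad(w, xs[lo:hi], ys[lo:hi]) / steps
    return total

xs = [float(i) for i in range(1, 17)]   # 16 toy inputs
ys = [3.0 * x + 1.0 for x in xs]        # toy targets
full = grad(0.5, xs, ys)

print(math.isclose(accumulated_grad(0.5, xs, ys, 4), full))  # 4 samples x 4 steps
print(math.isclose(accumulated_grad(0.5, xs, ys, 1), full))  # 1 sample x 16 steps
print(1780 // 16)  # optimizer steps per epoch for 1,780 training samples
```

Either way of reaching 16 samples gives the same gradient as one batch of 16 in this toy setting, while holding fewer samples in memory at a time.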

**Learning Rate (5e-5):**
- Standard for fine-tuning transformers
- Too high → unstable training, forgets pre-training
- Too low → very slow convergence

**Weight Decay (0.01):**
- Decoupled weight decay (AdamW), which acts like L2 regularization to limit overfitting
- Penalizes large weights

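**Warmup Steps (500):**
- Gradually increase learning rate from 0 → 5e-5
- Prevents early training instability
- With roughly 111 optimizer steps per epoch over 3 epochs (about 333 in total), this run ends before the 500-step warmup completes, so the learning rate never reaches 5e-5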
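
The warmup phase can be sketched as a simple function of the step count. This is a minimal illustration of linear warmup only: `warmup_lr` is a hypothetical helper, and it holds the rate constant after warmup, whereas the Trainer's default schedule also decays the rate afterwards.

```python
def warmup_lr(step, peak_lr=5e-5, warmup_steps=500):
    # Linear warmup: scale the peak rate by the fraction of warmup completed.
    return peak_lr * min(1.0, step / warmup_steps)

for step in (0, 250, 333, 500):
    print(step, warmup_lr(step))
```

At step 333 the sketch gives about 3.3e-5, which shows how the warmup length interacts with the total number of optimizer steps.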

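**FP16 (Mixed Precision):**
- Runs most operations in 16-bit floats instead of 32-bit
- Typically faster on GPUs with FP16 hardware support; the speedup depends on the GPU and the model
- Lower activation memory, since each stored value takes 2 bytes instead of 4
- Usually minimal accuracy loss
- Needs a supported accelerator such as a CUDA GPU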
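
The trade-off between size and precision can be seen with Python's standard `struct` module, which supports IEEE 754 half precision through the `e` format. This is only an illustration of the number format; `to_fp16` is a hypothetical helper and says nothing about how the Trainer implements mixed precision.

```python
import struct

def to_fp16(value):
    # Round-trip a Python float through IEEE 754 half precision.
    return struct.unpack("<e", struct.pack("<e", value))[0]

print(struct.calcsize("<e"), struct.calcsize("<f"))  # bytes per value: 2 vs 4
print(to_fp16(1.0001))  # 1.0
print(to_fp16(0.1))
```

Half precision stores each value in half the bytes but cannot represent small differences such as 1.0001 versus 1.0, which is why mixed-precision training keeps some quantities in 32-bit.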

### Evaluation Metrics:

**ROUGE (Recall-Oriented Understudy for Gisting Evaluation):**
- ROUGE-1: Unigram overlap (content coverage)
- ROUGE-2: Bigram overlap (fluency)
- ROUGE-L: Longest common subsequence (structure)
- Recall-focused by design: How much reference content is captured? (The `evaluate` implementation used here reports the F-measure by default.)

**BLEU (Bilingual Evaluation Understudy):**
- N-gram precision metric
- Precision-focused: How much generated content is correct?
- Originally for machine translation

Together, ROUGE + BLEU give comprehensive quality assessment.
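
To show what unigram overlap means in practice, here is a simplified ROUGE-1 computation. It is a sketch under stated assumptions (lowercasing and whitespace splitting, no stemming), so its numbers will not match the `rouge_score` package exactly; `rouge1` is a hypothetical helper.

```python
from collections import Counter

def rouge1(prediction, reference):
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = rouge1("the cat sat", "the cat sat on the mat")
print(p, r, round(f, 4))  # 1.0 0.5 0.6667
```

The short prediction is fully correct (precision 1.0) but covers only half of the reference (recall 0.5), which is why precision-oriented and recall-oriented metrics are reported together.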

In [13]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
import evaluate
import nltk
import numpy as np # Import numpy

# Download NLTK data for sentence tokenization
nltk.download('punkt', quiet=True)

# Load evaluation metrics
print("📊 Loading evaluation metrics...")
rouge = evaluate.load('rouge')
bleu = evaluate.load('bleu')
print("✅ Metrics loaded: ROUGE, BLEU")

def compute_metrics(eval_pred):
    """
    Compute ROUGE and BLEU scores for evaluation.

    This function is called by the Trainer during evaluation.

    Parameters:
    -----------
    eval_pred : tuple
        (predictions, labels) from model generation

    Returns:
    --------
    dict
        Dictionary with rouge1, rouge2, rougeL, and bleu scores
    """
    predictions, labels = eval_pred

    # Replace -100 in labels (used for padding in loss calculation)
    # -100 is ignored by CrossEntropyLoss
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # Decode token IDs back to text
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # ROUGE expects newline-separated sentences
    rouge_decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    rouge_decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    # Compute ROUGE scores
    rouge_result = rouge.compute(
        predictions=rouge_decoded_preds,
        references=rouge_decoded_labels,
        use_stemmer=True  # Use Porter stemmer for better matching
    )

    # For BLEU, `evaluate` library expects predictions as List[str] and references as List[List[str]].
    # Our `rouge_decoded_preds` and `rouge_decoded_labels` are already List[str].
    # We need to wrap each reference string in its own list to match List[List[str]] format.
    bleu_result = bleu.compute(
        predictions=rouge_decoded_preds,
        references=[[label] for label in rouge_decoded_labels] # Ensure references is List[List[str]]
    )

    return {
        'rouge1': rouge_result['rouge1'],
        'rouge2': rouge_result['rouge2'],
        'rougeL': rouge_result['rougeL'],
        'bleu': bleu_result['bleu']
    }

# Data collator: pads each batch to the longest sequence it contains.
# Note: SummarizationDataset already pads every example to max_length,
# so the collator adds no further padding in this notebook.
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True
)

print("\n✅ Data collator created (dynamic padding enabled)")

📊 Loading evaluation metrics...


✅ Metrics loaded: ROUGE, BLEU

✅ Data collator created (dynamic padding enabled)


### 🎛️ Training Arguments Configuration

Below we set up all hyperparameters for training using `Seq2SeqTrainingArguments`.

**Why Seq2SeqTrainingArguments?**
- Specialized for sequence-to-sequence models (encoder-decoder)
- Supports generation during evaluation (`predict_with_generate=True`)
- Handles beam search parameters
- Better than standard TrainingArguments for summarization

In [14]:
training_args = Seq2SeqTrainingArguments(
    # Output directory for checkpoints and logs
    output_dir='./results',

    # Number of training epochs
    num_train_epochs=3,

    # Batch sizes
    per_device_train_batch_size=1,       # Training batch per GPU
    per_device_eval_batch_size=1,        # Evaluation batch per GPU
    gradient_accumulation_steps=16,      # Accumulate gradients over 16 steps to
                                         # compensate for the small batch size
                                         # (effective batch size = 1 × 16 = 16)

    # Optimization
    learning_rate=5e-5,                  # Fine-tuning learning rate
    weight_decay=0.01,                   # L2 regularization
    warmup_steps=500,                    # Learning rate warmup (longer than the ~333 optimizer steps of this run)
    max_grad_norm=1.0,                   # Gradient clipping
    optim='adamw_torch',                 # Explicitly set optimizer to avoid fused mode issues

    # Evaluation and checkpointing
    eval_strategy='epoch',               # Evaluate after each epoch
    save_strategy='epoch',               # Save checkpoint after each epoch
    save_total_limit=2,                  # Limit the number of checkpoints kept on disk
    load_best_model_at_end=True,         # Load best model for final evaluation
    metric_for_best_model='rouge1',      # Use ROUGE-1 for model selection

    # Logging
    logging_dir='./logs',
    logging_steps=100,                   # Log every 100 steps

    # Generation parameters for evaluation
    predict_with_generate=True,          # Generate summaries during eval
    generation_max_length=MAX_TARGET_LENGTH,
    generation_num_beams=4,

    # Performance optimization
    fp16=True,                           # Mixed precision training (needs a GPU)
    gradient_checkpointing=True,         # Enable gradient checkpointing
    dataloader_num_workers=2,            # Parallel data loading

    # Disable external logging
    report_to='none',                    # Don't use wandb/tensorboard
)


In [10]:
import torch
import gc

# Force clean the GPU memory
gc.collect()
torch.cuda.empty_cache()

### 🚀 Initialize Trainer and Start Training

The `Seq2SeqTrainer` orchestrates the entire training process:
- Forward pass through model
- Loss calculation
- Backpropagation
- Optimizer step
- Evaluation with generation
- Metric computation
- Checkpoint saving

**Training will take approximately:**
- ~45-60 minutes on Google Colab T4 GPU
- ~30-40 minutes on A100 GPU
- Several hours on CPU (not recommended)

In [15]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [12]:
# Initialize Seq2SeqTrainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    processing_class=tokenizer,          # Replaces the deprecated `tokenizer` argument
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

print("✅ Trainer initialized")
print("\n" + "="*70)
print("🚀 STARTING TRAINING".center(70))
print("="*70)
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
print(f"Steps per epoch: {len(train_dataset) // (training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps)}")
print("="*70)

# Start training
# This will take approximately 45-60 minutes on Colab T4 GPU
train_result = trainer.train()

print("\n" + "="*70)
print("✅ TRAINING COMPLETED!".center(70))
print("="*70)

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.


✅ Trainer initialized

                         🚀 STARTING TRAINING                          
Training samples: 1780
Validation samples: 222
Steps per epoch: 111




| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Bleu |
|---|---|---|---|---|---|---|
| 1 | 1.7833 | 1.190429 | 0.494113 | 0.341856 | 0.343785 | 0.30851 |
| 2 | 1.2074 | 0.930786 | 0.540949 | 0.386583 | 0.36768 | 0.381511 |
| 3 | 0.9645 | 0.846705 | 0.569692 | 0.445681 | 0.411964 | 0.393704 |


There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight', 'lm_head.weight'].



                        ✅ TRAINING COMPLETED!                         


### 💾 Save the Fine-tuned Model

After training, we save:
1. Model weights (fine-tuned parameters)
2. Model configuration
3. Tokenizer files
4. Generation config

This allows us to reload and use the model later without retraining.

In [16]:
# Save the fine-tuned model
print("\n💾 Saving fine-tuned model...")
trainer.save_model('./fine_tuned_summarizer')
tokenizer.save_pretrained('./fine_tuned_summarizer')

print("✅ Model saved to: ./fine_tuned_summarizer")
print("\nSaved files:")
print("   - model.safetensors (model weights)")
print("   - config.json (model configuration)")
print("   - tokenizer files")
print("   - generation_config.json")


💾 Saving fine-tuned model...


NameError: name 'trainer' is not defined

### 📈 Evaluate on Test Set

Now we evaluate the fine-tuned model on the held-out test set to get final performance metrics.

In [17]:
# Evaluate on test set
print("\n📊 Evaluating on test set...")
print("="*70)
test_results = trainer.evaluate(test_dataset)

print("\n🎯 Test Set Results:")
print("="*70)
for key, value in test_results.items():
    if 'rouge' in key or 'bleu' in key:
        print(f"{key.upper():>15}: {value:.4f}")
print("="*70)

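# Interpretation guide (rough rules of thumb, not hard thresholds)
print("\n📖 Interpreting the Scores:")
print("   - ROUGE-1 (0.4+): Good unigram overlap (content coverage)")
print("   - ROUGE-2 (0.15+): Good bigram overlap (a rough proxy for fluency)")
print("   - ROUGE-L (0.3+): Good longest-common-subsequence overlap (structure)")
print("   - BLEU (0.2+): Good n-gram precision")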


📊 Evaluating on test set...


NameError: name 'trainer' is not defined

---

## 🔮 Step 6: Inference Pipeline

### Creating a Standalone Prediction Function

Now that we have a fine-tuned model, we need a simple function to generate summaries for new articles.

### Inference Pipeline:
```
Raw Text → Tokenize → Encode → Model → Decode → Summary
```

### Generation Process (Beam Search):

At each decoding step:
1. Model predicts probability distribution over vocabulary
2. Keep top-K candidates (beams)
3. Extend each beam with all possible next tokens
4. Select top-K from all extensions
5. Repeat until `</s>` token or max_length
6. Return highest-scoring complete sequence
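
The steps above can be sketched with a toy example. `TOY_MODEL` is a made-up stand-in for the real network: it assigns next-token probabilities based only on the last token. The sketch omits the length penalty and other details that `model.generate()` applies, so it illustrates the idea rather than the library's implementation.

```python
import math

EOS = "</s>"

# Hypothetical toy "model": next-token probabilities depend only on the last token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.4, EOS: 0.1},
    "a": {"cat": 0.9, EOS: 0.1},
    "cat": {EOS: 1.0},
    "dog": {EOS: 1.0},
}


def beam_search(num_beams=2, max_length=5):
    """Return (tokens, log_prob) of the best finished sequence under TOY_MODEL."""
    beams = [(["<s>"], 0.0)]  # live hypotheses: (tokens, cumulative log-prob)
    finished = []
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            # Extend every live beam with every possible next token.
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the top-K extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            (finished if tokens[-1] == EOS else beams).append((tokens, score))
        if not beams:
            break
    # Fall back to unfinished beams if nothing reached EOS within max_length.
    return max(finished or beams, key=lambda c: c[1])


for k in (1, 2):
    tokens, score = beam_search(num_beams=k)
    print(f"num_beams={k}:", " ".join(tokens), f"(p={math.exp(score):.2f})")
```

With `num_beams=1` (greedy search) the toy model commits to "the" because it is the most likely first token, and ends with probability 0.30. With `num_beams=2` the less likely first token "a" survives long enough to produce the better overall sequence, with probability 0.36.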

**Why `model.eval()` and `torch.no_grad()`?**
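- `model.eval()`: Disables dropout (and switches any batch-norm layers to their running statistics; PEGASUS and BART use layer normalization, which behaves the same in both modes)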
- `torch.no_grad()`: Disables gradient computation (faster, less memory)
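
Both effects can be observed on a tiny example. This is a sketch on a standalone `nn.Dropout` layer and a dummy tensor, not on the summarization model; the variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()  # training mode: random elements are zeroed, the rest scaled by 1 / (1 - p)
train_out = dropout(x)

dropout.eval()   # evaluation mode: dropout becomes the identity
eval_out = dropout(x)

print("zeros in train mode:", int((train_out == 0).sum()))
print("eval output unchanged:", torch.equal(eval_out, x))

w = torch.ones(3, requires_grad=True)
tracked = (w * 2).sum()        # autograd records this operation
with torch.no_grad():
    untracked = (w * 2).sum()  # no graph is built inside the context

print("requires_grad:", tracked.requires_grad, untracked.requires_grad)
```

The two switches are independent: `eval()` changes layer behaviour, while `no_grad()` only stops autograd from recording operations, so inference code normally uses both.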

In [18]:
def predict_summary(article_text, model, tokenizer, device, max_length=128, num_beams=4):
    """
    Generate an abstractive summary for a given article.

    Parameters:
    -----------
    article_text : str
        Input news article (raw text)
    model : PreTrainedModel
        Fine-tuned summarization model
    tokenizer : PreTrainedTokenizer
        Corresponding tokenizer
    device : torch.device
        CPU or CUDA device
    max_length : int
        Maximum summary length in tokens
    num_beams : int
        Number of beams for beam search

    Returns:
    --------
    str
        Generated summary
    """
    # Set model to evaluation mode
    model.eval()

    # Tokenize input article
    inputs = tokenizer(
        article_text,
        max_length=MAX_INPUT_LENGTH,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )

    # Move tensors to device (CPU/GPU)
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)

    # Generate summary (no gradient computation needed)
    with torch.no_grad():
        summary_ids = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_length=max_length,
            num_beams=num_beams,
            length_penalty=1.0,
            early_stopping=True,
            no_repeat_ngram_size=3  # Avoid repetitive 3-grams
        )

    # Decode token IDs back to text
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary

print("✅ Inference function created: predict_summary()")

✅ Inference function created: predict_summary()


### 🧪 Test the Inference Function

Let's test our inference pipeline on a sample from the test set.

In [19]:
# Test the inference function
print("\n🧪 Testing inference on sample article...")
print("="*70)

# Get a sample from test set
sample_idx = 0
sample_article = test_df.iloc[sample_idx]['article']
sample_reference = test_df.iloc[sample_idx]['summary']
sample_category = test_df.iloc[sample_idx]['category']

# Generate summary
generated_summary = predict_summary(
    sample_article,
    model,
    tokenizer,
    device,
    max_length=MAX_TARGET_LENGTH,
    num_beams=4
)

# Display results
print(f"\n📁 Category: {sample_category.upper()}")
print(f"\n📰 Article (first 400 characters):")
print("-" * 70)
print(sample_article[:400] + "...")
print("-" * 70)

print(f"\n✅ Reference Summary:")
print("-" * 70)
print(sample_reference)
print("-" * 70)

print(f"\n🤖 Generated Summary:")
print("-" * 70)
print(generated_summary)
print("-" * 70)

print("\n💡 Notice how the generated summary:")
print("   - Captures key information")
print("   - Uses different wording (abstractive)")
print("   - Maintains coherence")
print("   - Stays within length limits")


🧪 Testing inference on sample article...





📁 Category: POLITICS

📰 Article (first 400 characters):
----------------------------------------------------------------------
Labour's Cunningham to stand down Veteran Labour MP and former Cabinet minister Jack Cunningham has said he will stand down at the next election. One of the few Blair-era ministers to serve under Jim Callaghan, he was given the agriculture portfolio when Labour regained power in 1997. Mr Cunningham went on to become Tony Blair's "cabinet enforcer". He has represented the constituency now known as ...
----------------------------------------------------------------------

✅ Reference Summary:
----------------------------------------------------------------------
Veteran Labour MP and former Cabinet minister Jack Cunningham has said he will stand down at the next election.Mr Blair said he was a "huge figure" in Labour and a "valued, personal friend".One of the few Blair-era ministers to serve under Jim Callaghan, he was given the agriculture portfolio when Labou

---

## 🎨 Step 7: Gradio Web Interface

### Building an Interactive Demo

Gradio allows us to create a web interface for our model with just a few lines of code.

### Features:
- **Input**: Large text box for pasting news articles
- **Output**: Generated summary display
- **Examples**: Pre-loaded test samples for quick testing
- **Share Link**: Public URL to share your demo

### Why Gradio?
- No frontend coding required
- Instant shareable links
- Beautiful, responsive UI
- Easy deployment to Hugging Face Spaces

**The interface will:**
1. Accept any news article as input
2. Tokenize and process the text
3. Generate summary using beam search
4. Display the result instantly

In [20]:
import gradio as gr

def summarize_article(article_text):
    """
    Wrapper function for Gradio interface.

    Parameters:
    -----------
    article_text : str
        Input article from the user

    Returns:
    --------
    str
        Generated summary or error message
    """
    # Validation
    if not article_text or not article_text.strip():
        return "⚠️ Please enter an article to summarize."

    if len(article_text.strip()) < 50:
        return "⚠️ Article is too short. Please provide at least 50 characters."

    # Generate summary
    try:
        summary = predict_summary(
            article_text,
            model,
            tokenizer,
            device,
            max_length=MAX_TARGET_LENGTH,
            num_beams=4
        )
        return summary
    except Exception as e:
        return f"❌ Error generating summary: {str(e)}"

# Prepare examples from test set
examples = [
    [test_df.iloc[i]['article']] for i in range(min(5, len(test_df)))
]

# Create Gradio interface
demo = gr.Interface(
    fn=summarize_article,

    # Input component
    inputs=gr.Textbox(
        lines=15,
        placeholder="Paste your news article here...\n\nExample: The government announced today that...",
        label="📰 News Article",
        info="Enter a news article (minimum 50 characters)"
    ),

    # Output component
    outputs=gr.Textbox(
        lines=8,
        label="📝 Generated Summary"
    ),

    # Interface metadata
    title="📰 Abstractive News Summarization",
    description=(
        f"Generate concise, abstractive summaries of news articles using fine-tuned **{MODEL_NAME.split('/')[-1].upper()}**.\n\n"
        "This model was trained on BBC News articles covering business, entertainment, politics, sport, and technology."
    ),

    # Pre-loaded examples
    examples=examples,
    examples_per_page=5,

    # Styling
    theme=gr.themes.Soft(),

    # Additional settings
    article="Built with 🤗 Transformers and Gradio"
)

print("\n✅ Gradio interface created!")
print("\n" + "="*70)
print("🎨 Ready to launch web interface!".center(70))
print("="*70)


✅ Gradio interface created!

                   🎨 Ready to launch web interface!                   


### 🚀 Launch the Interface

Run the cell below to launch the Gradio interface. It will:
1. Start a local server
2. Generate a public share link (valid for 1 week)
3. Display the interface in an iframe

**You can:**
- Test with example articles
- Paste your own news articles
- Share the public link with others

In [21]:
# Launch the interface
print("\n🚀 Launching Gradio interface...")
print("   Please wait for the interface to load...")
print("\n💡 Tips:")
print("   - Click on examples to test quickly")
print("   - Copy the share link to show others")
print("   - Interface will run until you stop the cell")
print("="*70)

# Launch with share link
demo.launch(
    share=True,        # Create public share link
    debug=False,       # Disable debug mode
    show_error=True    # Show errors in interface
)


🚀 Launching Gradio interface...
   Please wait for the interface to load...

💡 Tips:
   - Click on examples to test quickly
   - Copy the share link to show others
   - Interface will run until you stop the cell
Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://985207e5f9617ce25d.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




---

## 🎉 Project Complete!

### What We Built:

1. ✅ **Data Pipeline**: Loaded and preprocessed BBC News dataset
2. ✅ **Tokenization**: Created custom PyTorch dataset with proper tokenization
3. ✅ **Model**: Fine-tuned PEGASUS transformer for abstractive summarization
4. ✅ **Evaluation**: Computed ROUGE and BLEU metrics
5. ✅ **Inference**: Built prediction pipeline for new articles
6. ✅ **Interface**: Created interactive Gradio web UI

### 📊 Expected Performance:

On the BBC News dataset, a fine-tuned model should score roughly:
- **ROUGE-1**: 0.40 - 0.45 (unigram overlap; content coverage)
- **ROUGE-2**: 0.15 - 0.20 (bigram overlap; a rough proxy for fluency)
- **ROUGE-L**: 0.30 - 0.35 (longest common subsequence; structure)
- **BLEU**: 0.20 - 0.30 (n-gram precision)

### 🚀 Next Steps:

**Improve Performance:**
- Train for more epochs (5-10)
- Increase batch size (if more GPU memory)
- Try different learning rates
- Experiment with length penalty
- Use BART or T5 models

**Deploy to Production:**
- Save model to Hugging Face Hub
- Deploy on Hugging Face Spaces
- Create REST API with FastAPI
- Optimize with ONNX or TorchScript

**Extend Functionality:**
- Multi-document summarization
- Query-focused summarization
- Support multiple languages
- Add content filtering

### 📚 Resources:

- **PEGASUS Paper**: https://arxiv.org/abs/1912.08777
- **Hugging Face Docs**: https://huggingface.co/docs/transformers
- **ROUGE Metrics**: https://aclanthology.org/W04-1013/
- **Gradio Docs**: https://gradio.app/docs/

---

**Thank you for using this notebook! 🙏**

*If you found this helpful, please ⭐ star the repository!*