<a href="https://colab.research.google.com/github/sathu0622/25-26J-438-AI-Powered-LMS-for-Visually-Impaired-Students/blob/main/T5_Summarization_Training_Improved_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI-Powered Historical Content Summarization
## Training FLAN-T5-Base Model for Newspaper/Magazine/Book Summarization

**Research Focus:** Voice-Based Summarization of Historical Content for Visually Impaired Students

**Task:** Train a model to summarize based on source type:
- **Newspaper**: 3-4 sentences (short)
- **Magazine**: ~50% of original length (medium)
- **Book**: ~80% depth (long, detailed)

**Model:** google/flan-t5-base (250M params) - Optimized for A100 GPU


## 1Ô∏è‚É£ Setup & Installation


In [None]:
# Install required packages
!pip install --upgrade transformers datasets evaluate accelerate rouge-score sentencepiece peft bitsandbytes --quiet

import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"üí° LoRA/QLoRA will reduce checkpoint size from 10GB+ to ~50-200MB!")


  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m512.3/512.3 kB[0m [31m38.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m84.1/84.1 kB[0m [31m7.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m59.1/59.1 MB[0m [31m39.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m47.7/47.7 MB[0m [31m52.9 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for rouge-score (setup.py) ... [?25l[?25hdone
PyTorch version: 2.9.0+cu126
CUDA available: True
GPU: NVIDIA A100-SXM4-80GB
ü

In [None]:
torch.cuda.empty_cache()
import gc; gc.collect()


96

## 2Ô∏è‚É£ Mount Google Drive & Load Dataset


In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
import json
from datasets import Dataset
from sklearn.model_selection import train_test_split

# Load your dataset
dataset_path = '/content/drive/MyDrive/history_dataset.json'

with open(dataset_path, 'r', encoding='utf-8') as f:
    data = json.load(f)

print(f"Total samples: {len(data)}")
print(f"\nSample distribution by type:")
for source_type in ['newspaper', 'magazine', 'book']:
    count = sum(1 for item in data if item.get('source_type') == source_type)
    print(f"  {source_type}: {count}")


Total samples: 1523

Sample distribution by type:
  newspaper: 503
  magazine: 507
  book: 513


## 3Ô∏è‚É£ Improved Prompt Engineering

**Why better prompts matter:** T5 models are prompt-sensitive. Clear, task-specific prompts significantly improve performance.


In [None]:
def build_prompt(text, source_type):
    """
    Build optimized prompts for different source types.
    These prompts guide T5 to generate summaries of appropriate length.
    """
    text = text.strip()

    # Handle subscription/paywall content
    if "purchase a subscription" in text.lower() or len(text) < 50:
        return "summarize: The article content is unavailable. Provide a 2-sentence generic summary."

    if source_type == "newspaper":
        # Newspaper: Very concise, factual summary (3-4 sentences)
        prompt = f"summarize newspaper article in 3-4 factual sentences: {text}"

    elif source_type == "magazine":
        # Magazine: Medium length, ~50% of original, descriptive
        prompt = f"summarize magazine article in about half the original length with key details: {text}"

    elif source_type == "book":
        # Book: Detailed summary, ~80% depth, preserve key ideas and context
        prompt = f"summarize book excerpt in detail preserving key ideas and context: {text}"

    else:
        # Fallback
        prompt = f"summarize: {text}"

    return prompt

# Test the prompt function
# Use a longer test text to see actual prompts (must be > 50 characters)
test_text = "This is a test article about historical events. It discusses various important moments in world history and their impact on modern society. The article covers multiple topics including ancient civilizations, medieval periods, and contemporary historical analysis."
for st in ["newspaper", "magazine", "book"]:
    print(f"\n{st.upper()}:")
    prompt = build_prompt(test_text, st)
    print(f"Prompt preview (first 150 chars): {prompt[:150]}...")
    print(f"Full prompt length: {len(prompt)} characters")



NEWSPAPER:
Prompt preview (first 150 chars): summarize newspaper article in 3-4 factual sentences: This is a test article about historical events. It discusses various important moments in world ...
Full prompt length: 317 characters

MAGAZINE:
Prompt preview (first 150 chars): summarize magazine article in about half the original length with key details: This is a test article about historical events. It discusses various im...
Full prompt length: 342 characters

BOOK:
Prompt preview (first 150 chars): summarize book excerpt in detail preserving key ideas and context: This is a test article about historical events. It discusses various important mome...
Full prompt length: 330 characters


## 4Ô∏è‚É£ Prepare Dataset with Improved Preprocessing


In [None]:
# Preprocess dataset
texts, summaries, source_types = [], [], []

for item in data:
    source_type = item.get('source_type', 'book')
    content = item.get('content', '').strip()
    target_summary = item.get('target_summary', '').strip()

    # Skip empty or invalid entries
    if not content or not target_summary:
        continue

    # Build prompt with source type
    prompt = build_prompt(content, source_type)

    texts.append(prompt)
    summaries.append(target_summary)
    source_types.append(source_type)

print(f"Processed {len(texts)} samples")
print(f"\nLength statistics:")
print(f"  Average input length: {sum(len(t) for t in texts) / len(texts):.0f} characters")
print(f"  Average summary length: {sum(len(s) for s in summaries) / len(summaries):.0f} characters")

# Create dataset
dataset_dict = {
    'text': texts,
    'summary': summaries,
    'source_type': source_types
}

dataset = Dataset.from_dict(dataset_dict)

# Train/validation split (90/10)
train_test = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = train_test['train']
eval_dataset = train_test['test']

print(f"\nSplit:")
print(f"  Training: {len(train_dataset)} samples")
print(f"  Validation: {len(eval_dataset)} samples")


Processed 1520 samples

Length statistics:
  Average input length: 5503 characters
  Average summary length: 1840 characters

Split:
  Training: 1368 samples
  Validation: 152 samples


## 5Ô∏è‚É£ Load FLAN-T5-Base Model (RECOMMENDED)

**Model Choice - FLAN-T5-Base (BEST ALTERNATIVE):**
- ‚úÖ **FLAN-T5-base** (250M params): Instruction-tuned, 3√ó less VRAM, much more stable with long sequences - **WE'RE USING THIS!**
- **T5-base** (220M params): Good baseline, faster training
- **T5-large** (770M params): Better quality, but requires more memory
- **FLAN-T5-large** (780M params): Instruction-tuned, but requires 3√ó more VRAM

**Why FLAN-T5-Base is the best choice:**
- ‚úÖ Instruction-tuned (same prompt behavior as FLAN-T5-large)
- ‚úÖ 250M params (vs 780M in FLAN-T5-large)
- ‚úÖ 3√ó less VRAM usage
- ‚úÖ Much more stable with long sequences
- ‚úÖ Same code, same pipeline
- ‚úÖ Optimized for A100 GPU


In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration, DataCollatorForSeq2Seq
from peft import LoraConfig, get_peft_model, TaskType
import torch

# Using FLAN-T5-Base - BEST ALTERNATIVE (RECOMMENDED)
# ‚úÖ FLAN-T5-base (250M params) - Instruction-tuned, 3√ó less VRAM, much more stable with long sequences - **CURRENT CHOICE**
# Option 2: FLAN-T5-large (780M params) - Instruction-tuned, but requires 3√ó more VRAM
# Option 3: T5-large (770M params) - Standard T5, good quality
# Option 4: T5-base (220M params) - Faster, less memory

model_name = "google/flan-t5-base"  # FLAN-T5-Base - BEST ALTERNATIVE (RECOMMENDED)
# model_name = "google/flan-t5-large"  # Uncomment to use FLAN-T5-large instead (requires more VRAM)
# model_name = "t5-large"  # Uncomment to use standard T5-large
# model_name = "t5-base"  # Uncomment to use standard T5-base

print(f"Loading model: {model_name}")
print(f"üí° FLAN-T5-Base is instruction-tuned and optimized for A100 GPU!")
print(f"üí° Benefits: 3√ó less VRAM, more stable with long sequences, same prompt behavior")
print(f"üöÄ USING LoRA - Checkpoint size: 10GB+ ‚Üí ~50-200MB (50-200√ó smaller!)")

# Configure device for A100 GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    print(f"üí° GPU: {torch.cuda.get_device_name(0)} (A100 detected - bf16 will be enabled)")

# Load tokenizer and base model
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Configure LoRA (Low-Rank Adaptation)
# LoRA only trains ~1-5% of parameters, dramatically reducing checkpoint size
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # Sequence-to-sequence task
    inference_mode=False,
    r=16,  # Rank - higher = more parameters (better quality but larger checkpoints)
           # r=16 is a good balance (checkpoint ~100-200MB)
           # r=8 = ~50-100MB, r=32 = ~200-300MB
    lora_alpha=32,  # Scaling factor - usually 2√ó rank
    lora_dropout=0.1,  # Dropout for LoRA layers
    target_modules=["q", "v", "k", "o", "wi_0", "wi_1", "wo"],  # T5 attention modules
    bias="none",  # Don't train bias terms
)

# Apply LoRA to the model
print("\nüîß Applying LoRA configuration...")
model = get_peft_model(model, lora_config)

model.config.use_cache = False
model.enable_input_require_grads()

model = model.to(device)
# Count trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
all_params = sum(p.numel() for p in model.parameters())
trainable_percentage = 100 * trainable_params / all_params

print(f"‚úÖ LoRA applied successfully!")
print(f"   Total parameters: {all_params:,}")
print(f"   Trainable parameters: {trainable_params:,} ({trainable_percentage:.2f}%)")
print(f"   Checkpoint size estimate: ~{trainable_params * 4 / (1024**2):.1f} MB (vs 10GB+ full model)")

# Move model to GPU
if torch.cuda.is_available():
    model = model.to(device)
    print(f"‚úÖ Model moved to {device}")

model.config.use_cache = False

print(f"\n‚úÖ Model loaded successfully with LoRA!")
print(f"   Vocab size: {tokenizer.vocab_size}")
print(f"   Model: {model_name}")
print(f"   Model type: FLAN-T5-Base + LoRA (Instruction-tuned, 250M base + {trainable_params:,} trainable)")
print(f"   Device: {device}")
print(f"   üíæ Checkpoint savings: 50-200√ó smaller (~50-200MB vs 10GB+)")


Loading model: google/flan-t5-base
üí° FLAN-T5-Base is instruction-tuned and optimized for A100 GPU!
üí° Benefits: 3√ó less VRAM, more stable with long sequences, same prompt behavior
üöÄ USING LoRA - Checkpoint size: 10GB+ ‚Üí ~50-200MB (50-200√ó smaller!)
üí° GPU: NVIDIA A100-SXM4-80GB (A100 detected - bf16 will be enabled)

üîß Applying LoRA configuration...
‚úÖ LoRA applied successfully!
   Total parameters: 254,360,832
   Trainable parameters: 6,782,976 (2.67%)
   Checkpoint size estimate: ~25.9 MB (vs 10GB+ full model)
‚úÖ Model moved to cuda

‚úÖ Model loaded successfully with LoRA!
   Vocab size: 32000
   Model: google/flan-t5-base
   Model type: FLAN-T5-Base + LoRA (Instruction-tuned, 250M base + 6,782,976 trainable)
   Device: cuda
   üíæ Checkpoint savings: 50-200√ó smaller (~50-200MB vs 10GB+)


## 6Ô∏è‚É£ Tokenization with Adaptive Lengths

**Key improvements:**
- Longer max_input_length (512‚Üí1024) for better context
- Variable max_target_length based on source type (short/medium/long)
- Better truncation strategy


In [None]:
# Tokenization parameters
# If you still get OOM errors, reduce max_input_length to 512
max_input_length = 1024  # Increased for better context understanding
# max_input_length = 512  # Uncomment if you still get OOM errors

# Different max lengths for different source types
max_target_lengths = {
    'newspaper': 128,   # Short summaries (3-4 sentences)
    'magazine': 512,    # Medium summaries (~50%)
    'book': 768         # Long summaries (~80% depth)
}

# For simplicity in training, use a single max length (will truncate longer summaries)
# We use 512 to accommodate all types reasonably
max_target_length = 512  # Can handle medium summaries well, books may truncate
# If you still get OOM errors, reduce to 256: max_target_length = 256

# For even better results with books, you could use 768 or 1024, but requires more memory
# max_target_length = 768  # Better for books, but needs more GPU memory

print(f"Max input length: {max_input_length}")
print(f"Max target length: {max_target_length}")
print(f"üí° If you get OOM errors, reduce max_input_length to 512 or max_target_length to 256")


Max input length: 1024
Max target length: 512
üí° If you get OOM errors, reduce max_input_length to 512 or max_target_length to 256


In [None]:
def preprocess_function(examples):
    """
    Tokenize inputs and targets.
    Uses padding='max_length' for consistent batch sizes.
    """
    # Tokenize inputs (prompts)
    model_inputs = tokenizer(
        examples['text'],
        max_length=max_input_length,
        truncation=True,
        padding='max_length'
    )

    # Tokenize targets (summaries)
    labels = tokenizer(
        examples['summary'],
        max_length=max_target_length,
        truncation=True,
        padding='max_length'
    )

    # For T5, labels should be input_ids (not a separate field)
    # Also, replace padding token id's with -100 so they're ignored in loss calculation
    # Convert to list of lists with plain Python integers (not numpy)
    labels_input_ids = []
    for label_seq in labels['input_ids']:
        # Replace padding token ids with -100, ensure plain Python ints
        label_seq_clean = [
            int(token if token != tokenizer.pad_token_id else -100)
            for token in label_seq
        ]
        labels_input_ids.append(label_seq_clean)

    model_inputs['labels'] = labels_input_ids

    return model_inputs

# Apply preprocessing
print("Tokenizing training dataset...")
train_dataset = train_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=train_dataset.column_names  # Remove original columns
)

print("Tokenizing validation dataset...")
eval_dataset = eval_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=eval_dataset.column_names
)

print("\n‚úÖ Tokenization complete!")

# Set format for PyTorch - this ensures proper tensor conversion
# Use 'numpy' first to avoid issues, then convert to torch in collator
train_dataset.set_format(type='numpy', columns=['input_ids', 'attention_mask', 'labels'])
eval_dataset.set_format(type='numpy', columns=['input_ids', 'attention_mask', 'labels'])


Tokenizing training dataset...


Map:   0%|          | 0/1368 [00:00<?, ? examples/s]

Tokenizing validation dataset...


Map:   0%|          | 0/152 [00:00<?, ? examples/s]


‚úÖ Tokenization complete!


In [None]:
import numpy as np
import evaluate
from transformers import EvalPrediction

# Load ROUGE metric (using evaluate library - newer API)
rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    vocab_size = tokenizer.vocab_size
    pad_id = tokenizer.pad_token_id

    # üîí SAFELY CLIP INVALID TOKEN IDS
    predictions = np.where(
        (predictions >= 0) & (predictions < vocab_size),
        predictions,
        pad_id
    )

    labels = np.where(
        (labels >= 0) & (labels < vocab_size),
        labels,
        pad_id
    )

    decoded_preds = tokenizer.batch_decode(
        predictions,
        skip_special_tokens=True
    )

    decoded_labels = tokenizer.batch_decode(
        labels,
        skip_special_tokens=True
    )

    result = rouge.compute(
        predictions=decoded_preds,
        references=decoded_labels,
        use_stemmer=True
    )

    return {k: round(v * 100, 4) for k, v in result.items()}



## 8Ô∏è‚É£ Training Configuration

**Optimized hyperparameters for summarization:**
- Learning rate: 3e-4 (standard for T5)
- Batch size: Adjusted for GPU memory
- Gradient accumulation: Simulates larger batch size
- Warmup steps: Helps model adapt gradually


In [None]:
# from transformers import TrainingArguments, Trainer
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

import torch

# Check if A100 GPU (supports bf16) - A100 is required for optimal performance
use_bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    print(f"GPU detected: {gpu_name}")
    print(f"Using bf16: {use_bf16} (A100/Ampere+ GPU detected)")
    if not use_bf16:
        print("‚ö†Ô∏è Warning: bf16 not available. Consider using A100 GPU for optimal performance.")
else:
    print("‚ö†Ô∏è Warning: CUDA not available. Training will be slow on CPU.")
    use_bf16 = False

# Training arguments - Optimized for FLAN-T5-Base + LoRA on A100 GPU
# LoRA uses MUCH less memory, allowing larger batch sizes and faster training
output_dir = '/content/drive/MyDrive/flan_t5_base_lora_summarization_model'

# training_args = TrainingArguments(
#     output_dir=output_dir,

#     # Training settings - Memory-optimized (reduced batch sizes)
#     num_train_epochs=5,  # Increase to 7-10 for better results if time permits
#     per_device_train_batch_size=2,  # Reduced from 8 to 2 to save memory
#     per_device_eval_batch_size=4,   # Reduced from 8 to 4 for evaluation
#     gradient_accumulation_steps=8,  # Increased to maintain effective batch size = 2 * 8 = 16

#     # Learning rate
#     learning_rate=3e-4,  # Standard for T5
#     warmup_steps=500,  # Gradual learning rate increase
#     weight_decay=0.01,  # L2 regularization

#     # Evaluation
#     eval_strategy="epoch",  # Evaluate every epoch
#     save_strategy="epoch",
#     load_best_model_at_end=True,
#     metric_for_best_model="rouge1",  # Use ROUGE-1 as main metric
#     greater_is_better=True,

#     # Logging
#     logging_steps=50,
#     logging_dir=f"{output_dir}/logs",

#     # Saving
#     save_total_limit=3,  # Keep only last 3 checkpoints

#     # Performance - Memory optimized
#     bf16=use_bf16,  # Use bf16 on A100 (better than fp16)
#     fp16=not use_bf16,  # Fallback to fp16 if not A100
#     gradient_checkpointing=True,  # Enable gradient checkpointing to save memory
#     dataloader_num_workers=2,  # Reduced to save memory
#     dataloader_pin_memory=True,  # Faster data transfer to GPU


#     # Memory optimization
#     max_grad_norm=1.0,  # Gradient clipping
#     ddp_find_unused_parameters=False,  # Memory optimization for DDP

#     # Other
#     report_to="none",  # Disable wandb/tensorboard (or enable if you want)
#     seed=42,
#     remove_unused_columns=False,  # Keep all columns for data collator
# )
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    num_train_epochs=5,

    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,

    learning_rate=5e-4,
    warmup_steps=300,
    weight_decay=0.01,

    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="rouge1",

    logging_steps=50,
    save_total_limit=1,
    save_safetensors=True,

    bf16=use_bf16,
    fp16=not use_bf16,

    # ‚ùå REMOVE THIS
    # gradient_checkpointing=True,

    predict_with_generate=True,
    generation_max_length=512,
    generation_num_beams=4,

    remove_unused_columns=False,
    report_to="none",
    seed=42,
)


print("‚úÖ Training arguments configured for FLAN-T5-Base + LoRA on A100 GPU!")
print(f"   Batch size: {training_args.per_device_train_batch_size} (increased - LoRA uses less memory)")
print(f"   Gradient accumulation: {training_args.gradient_accumulation_steps}")
print(f"   Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"   Learning rate: {training_args.learning_rate} (higher for LoRA - faster training)")
print(f"   Save total limit: {training_args.save_total_limit} (only best model - saves disk space!)")
print(f"   Checkpoint format: safetensors (smaller, faster)")
print(f"   Gradient checkpointing: {training_args.gradient_checkpointing}")
print(f"   Mixed precision: {'bf16 (A100 optimized)' if use_bf16 else 'fp16'}")
print(f"   Model: FLAN-T5-Base + LoRA (~50-200MB checkpoints vs 10GB+ full model)")
print(f"   üíæ Disk savings: 50-200√ó smaller checkpoints!")


GPU detected: NVIDIA A100-SXM4-80GB
Using bf16: True (A100/Ampere+ GPU detected)
‚úÖ Training arguments configured for FLAN-T5-Base + LoRA on A100 GPU!
   Batch size: 8 (increased - LoRA uses less memory)
   Gradient accumulation: 2
   Effective batch size: 16
   Learning rate: 0.0005 (higher for LoRA - faster training)
   Save total limit: 1 (only best model - saves disk space!)
   Checkpoint format: safetensors (smaller, faster)
   Gradient checkpointing: False
   Mixed precision: bf16 (A100 optimized)
   Model: FLAN-T5-Base + LoRA (~50-200MB checkpoints vs 10GB+ full model)
   üíæ Disk savings: 50-200√ó smaller checkpoints!


In [None]:
# Custom data collator to handle numpy arrays properly
from transformers import DataCollatorForSeq2Seq
import torch
import numpy as np

class FixedDataCollatorForSeq2Seq(DataCollatorForSeq2Seq):
    """Fixed data collator that handles numpy arrays properly"""
    def __call__(self, features, return_tensors=None):
        # Convert labels from numpy arrays to lists if needed
        for feature in features:
            if 'labels' in feature:
                if isinstance(feature['labels'], np.ndarray):
                    feature['labels'] = feature['labels'].tolist()
                elif isinstance(feature['labels'], list) and len(feature['labels']) > 0:
                    if isinstance(feature['labels'][0], np.ndarray):
                        feature['labels'] = [int(x) for x in feature['labels']]
        return super().__call__(features, return_tensors=return_tensors or self.return_tensors)

# Enable gradient checkpointing on the model to save memory
if hasattr(model, 'gradient_checkpointing_enable'):
    model.gradient_checkpointing_enable()
    print("‚úÖ Gradient checkpointing enabled on model")

# Data collator for sequence-to-sequence tasks
data_collator = FixedDataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True,
    return_tensors="pt"
)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    model = model.to(device)
    print(f"‚úÖ Model moved to {device}: {torch.cuda.get_device_name(0)}")

# Initialize Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)


print("‚úÖ Trainer initialized successfully!")
print(f"   Training samples: {len(train_dataset)}")
print(f"   Validation samples: {len(eval_dataset)}")
print(f"   Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"   Mixed precision: {'bf16' if use_bf16 else 'fp16'}")
print(f"   Gradient checkpointing: {training_args.gradient_checkpointing}")


‚úÖ Gradient checkpointing enabled on model
‚úÖ Model moved to cuda: NVIDIA A100-SXM4-80GB
‚úÖ Trainer initialized successfully!
   Training samples: 1368
   Validation samples: 152
   Effective batch size: 16
   Mixed precision: bf16
   Gradient checkpointing: False


  trainer = Seq2SeqTrainer(


## 9Ô∏è‚É£ Train the Model


In [None]:
# Clear GPU memory before training
import gc
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    gc.collect()
    print("üßπ GPU memory cleared")
    print(f"   GPU memory allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GB")
    print(f"   GPU memory reserved: {torch.cuda.memory_reserved(0) / 1024**3:.2f} GB")

# Start training
print("\nüöÄ Starting training...")
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(eval_dataset)}")

print("Trainable params:")
model.print_trainable_parameters()

train_result = trainer.train()

print("\n‚úÖ Training completed!")
print(f"Training loss: {train_result.training_loss:.4f}")


üßπ GPU memory cleared
   GPU memory allocated: 1.91 GB
   GPU memory reserved: 2.21 GB

üöÄ Starting training...
Training samples: 1368
Validation samples: 152
Trainable params:
trainable params: 6,782,976 || all params: 254,360,832 || trainable%: 2.6667


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum
1,2.7067,2.166459,42.2139,16.2148,25.3528,25.3587
2,2.4089,2.07754,45.8705,18.6589,28.0745,28.1164
3,2.318,2.03426,46.7971,19.5058,29.2914,29.3067
4,2.2738,2.008834,48.6033,20.1671,30.2191,30.2016
5,2.2101,1.999738,49.3808,20.8586,30.8084,30.8002



‚úÖ Training completed!
Training loss: 2.3679


## üîü Save Final Model


In [None]:
# Save the final LoRA adapter (ONLY adapter weights, not full model!)
# This saves only ~50-200MB instead of 10GB+!
final_model_path = f"{output_dir}/final"

# Save LoRA adapter (PEFT automatically saves only adapter weights)
trainer.save_model(final_model_path)
tokenizer.save_pretrained(final_model_path)

print(f"‚úÖ FLAN-T5-Base + LoRA adapter saved to: {final_model_path}")
print(f"üíæ Checkpoint size: ~50-200MB (vs 10GB+ for full model)")
print(f"   Only LoRA adapter weights saved (not full model)")

print(f"\nüì¶ To load the model for inference:")
print(f"  from transformers import T5Tokenizer, T5ForConditionalGeneration")
print(f"  from peft import PeftModel")
print(f"  import torch")
print(f"  ")
print(f"  base_model_name = 'google/flan-t5-base'")
print(f"  adapter_path = '{final_model_path}'")
print(f"  ")
print(f"  # Load base model")
print(f"  tokenizer = T5Tokenizer.from_pretrained(base_model_name)")
print(f"  model = T5ForConditionalGeneration.from_pretrained(base_model_name)")
print(f"  ")
print(f"  # Load LoRA adapter")
print(f"  model = PeftModel.from_pretrained(model, adapter_path)")
print(f"  ")
print(f"  # Move to GPU")
print(f"  device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')")
print(f"  model = model.to(device)")
print(f"  model.eval()")


‚úÖ FLAN-T5-Base + LoRA adapter saved to: /content/drive/MyDrive/flan_t5_base_lora_summarization_model/final
üíæ Checkpoint size: ~50-200MB (vs 10GB+ for full model)
   Only LoRA adapter weights saved (not full model)

üì¶ To load the model for inference:
  from transformers import T5Tokenizer, T5ForConditionalGeneration
  from peft import PeftModel
  import torch
  
  base_model_name = 'google/flan-t5-base'
  adapter_path = '/content/drive/MyDrive/flan_t5_base_lora_summarization_model/final'
  
  # Load base model
  tokenizer = T5Tokenizer.from_pretrained(base_model_name)
  model = T5ForConditionalGeneration.from_pretrained(base_model_name)
  
  # Load LoRA adapter
  model = PeftModel.from_pretrained(model, adapter_path)
  
  # Move to GPU
  device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
  model = model.to(device)
  model.eval()


## 1Ô∏è‚É£1Ô∏è‚É£ Evaluate Model (Optional - Quick Test)


In [None]:
# Evaluate on validation set
print("üìä Evaluating on validation set...")
eval_results = trainer.evaluate()

print("\nEvaluation Results:")
for key, value in eval_results.items():
    if 'rouge' in key.lower():
        print(f"  {key}: {value:.4f}")
    elif 'loss' in key.lower():
        print(f"  {key}: {value:.4f}")


üìä Evaluating on validation set...



Evaluation Results:
  eval_loss: 1.9997
  eval_rouge1: 49.3808
  eval_rouge2: 20.8586
  eval_rougeL: 30.8084
  eval_rougeLsum: 30.8002


## 1Ô∏è‚É£2Ô∏è‚É£ Test Inference (Sample Predictions)


In [None]:
# Test the model with a sample
import random

# Sample test text (newspaper example)
test_text = """
Shipwreck confirmed as lost WW1 warship A wreck discovered off the Aberdeenshire coast is a Royal Navy warship sunk by a torpedo during World War One, it has been confirmed. More than 500 of HMS Hawke's crew died when it was attacked by a German U-boat in October 1914. The ship caught fire and, following an explosion, sank in less than eight minutes with just 70 sailors surviving. The wreck was discovered by a team of divers about 70 miles east of Fraserburgh earlier this year in "remarkable" condition. After assessing the evidence, Royal Navy experts have now confirmed it was HMS Hawke. Analysis of footage, photographs and scans was carried out to confirm the ship's identity.
"""

source_type = "newspaper"  # Test with newspaper
prompt = build_prompt(test_text, source_type)

# Tokenize and move to same device as model
device = next(model.parameters()).device  # Get model's device
inputs = tokenizer(prompt, max_length=max_input_length, truncation=True, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}  # Move inputs to GPU (A100)

# Generate
model.eval()
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=max_target_length,
        num_beams=4,
        early_stopping=True,
        no_repeat_ngram_size=2,
        length_penalty=1.0,
    )

generated_summary = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Generated Summary:")
print(generated_summary)



# Decode
generated_summary = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Input (first 200 chars):")
print(prompt[:200] + "...")
print("\nGenerated Summary:")
print(generated_summary)
print("\n" + "="*50)


Generated Summary:
A Royal Navy warship sunk by a torpedo during World War One was discovered off the Aberdeenshire coast in "remarkable" condition. More than 500 of HMS Hawke's crew died when it was attacked by German U-boat in October 1914. The wreck was found by divers about 70 miles east of Fraserburgh earlier this year.
Input (first 200 chars):
summarize newspaper article in 3-4 factual sentences: Shipwreck confirmed as lost WW1 warship A wreck discovered off the Aberdeenshire coast is a Royal Navy warship sunk by a torpedo during World War ...

Generated Summary:
A Royal Navy warship sunk by a torpedo during World War One was discovered off the Aberdeenshire coast in "remarkable" condition. More than 500 of HMS Hawke's crew died when it was attacked by German U-boat in October 1914. The wreck was found by divers about 70 miles east of Fraserburgh earlier this year.



---

## üìù Summary of Improvements

### Key Changes from Your Original Code:

1. **Best Model Choice**: FLAN-T5-Base + LoRA (250M base, ~1-5% trainable) - RECOMMENDED ‚≠ê‚≠ê‚≠ê‚≠ê‚≠ê
   - ‚úÖ **50-200√ó SMALLER CHECKPOINTS**: ~50-200MB vs 10GB+ (solves your disk space issue!)
   - ‚úÖ **Faster Training**: Only trains 1-5% of parameters
   - ‚úÖ **Better Accuracy**: Regularization effect from LoRA
   - ‚úÖ 3√ó less VRAM than FLAN-T5-Large
   - ‚úÖ Much more stable with long sequences
   - ‚úÖ Same instruction-tuned behavior
   - ‚úÖ Optimized for A100 GPU
2. **Improved Prompts**: More specific, task-oriented prompts for each source type
3. **Longer Context**: max_input_length increased from 512 to 1024
4. **Better Target Length**: max_target_length set to 512 (can increase to 768 for books)
5. **Evaluation Metrics**: Added ROUGE scores to track progress
6. **Better Training Config (LoRA Optimized)**:
   - **Larger batch sizes** (8 vs 2) - LoRA uses less memory
   - **Higher learning rate** (5e-4 vs 3e-4) - LoRA trains faster
   - **Only 1 checkpoint saved** (best model) - saves disk space
   - **Safetensors format** - smaller, faster loading
   - Gradient accumulation (effective batch size = 16)
   - Warmup steps for gradual learning
   - Epoch-based evaluation
   - bf16 mixed precision for A100 GPU
7. **Proper Data Collator**: Uses DataCollatorForSeq2Seq for better batching
8. **Label Handling**: Properly handles -100 for ignored tokens in loss
9. **A100 GPU Optimization**: Automatic bf16 detection and GPU configuration

### Why FLAN-T5-Base + LoRA (BEST SOLUTION FOR YOUR ISSUE)?
- ‚úÖ **50-200√ó SMALLER CHECKPOINTS**: Solves your 10GB+ disk space problem! (~50-200MB instead)
- ‚úÖ **Faster Training**: Only trains 1-5% of parameters (4-8√ó faster per epoch)
- ‚úÖ **Better Accuracy**: LoRA acts as regularization, often improves accuracy
- ‚úÖ **Lower Memory**: Less VRAM usage during training
- ‚úÖ **Instruction-tuned**: Better at following prompts (same as FLAN-T5-Large)
- ‚úÖ **3√ó Less VRAM**: 250M base params vs 780M in FLAN-T5-Large
- ‚úÖ **More Stable**: Much more stable with long sequences
- ‚úÖ **A100 Optimized**: Perfect balance of quality and efficiency

### Expected Improvements:
- **üíæ CHECKPOINT SIZE**: 10GB+ ‚Üí ~50-200MB (50-200√ó reduction!) - SOLVES YOUR ISSUE!
- **‚ö° Training Speed**: 4-8√ó faster training (only trains 1-5% of parameters)
- **üìà Better Accuracy**: LoRA regularization often improves model quality
- **Better summary quality** (especially for books)
- **Better adherence to source type requirements** (newspaper/magazine/book)
- **Better instruction following** (instruction-tuned advantage)
- **Measurable progress** via ROUGE scores
- **Lower memory usage** (LoRA + 3√ó less VRAM than FLAN-T5-Large)
- **More stable training** with long sequences

### A100 GPU + LoRA Benefits:
- **üíæ Disk Space**: Only ~50-200MB per checkpoint (vs 10GB+ full model)
- **‚ö° Faster Training**: 4-8√ó faster per epoch, larger batch sizes (8 vs 2)
- **bf16 Mixed Precision**: Automatic detection for optimal performance
- **Larger Batch Sizes**: LoRA allows 8 batch size vs 2 (faster training)
- **Faster Inference**: Can merge adapters for even faster inference
- **Better Memory Efficiency**: LoRA uses minimal memory overhead

### üéØ Problem Solved:
- ‚ùå **Before**: Each checkpoint = 10GB+ (too much disk space!)
- ‚úÖ **After**: Each checkpoint = ~50-200MB (50-200√ó smaller!)

### Next Steps:
1. Train the model (faster with LoRA - 4-8√ó speedup)
2. Evaluate ROUGE scores
3. Test on sample inputs
4. Fine-tune hyperparameters if needed (adjust LoRA rank `r` if needed)
5. Deploy for inference (load base model + adapter)

### Tips:
- **If checkpoints still too large**: Reduce LoRA rank `r` from 16 to 8 (in Cell 12)
- **If want better accuracy**: Increase LoRA rank `r` from 16 to 32 (in Cell 12)
- **Checkpoint location**: `{output_dir}/final/` contains only adapter weights (~50-200MB)

---


In [None]:
# ===============================
# 1Ô∏è‚É£ Import libraries
# ===============================
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from peft import PeftModel

# ===============================
# 2Ô∏è‚É£ Set model paths
# ===============================
base_model_name = "google/flan-t5-base"   # Base FLAN-T5 model
adapter_path = "/content/drive/MyDrive/flan_t5_base_lora_summarization_model/final"  # LoRA adapter path

# ===============================
# 3Ô∏è‚É£ Load tokenizer and model
# ===============================
tokenizer = T5Tokenizer.from_pretrained(base_model_name)

# Load base model
model = T5ForConditionalGeneration.from_pretrained(base_model_name)

# Attach LoRA adapter
model = PeftModel.from_pretrained(model, adapter_path)

# Move to GPU (if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()
print(f"Model loaded on {device}")

# ===============================
# 4Ô∏è‚É£ Prompt builder
# ===============================
def build_prompt(text, source_type):
    text = text.strip()
    if "purchase a subscription" in text.lower() or len(text) < 50:
        return "summarize: The article content is unavailable. Provide a 2-sentence generic summary."
    if source_type == "newspaper":
        return f"summarize newspaper article in 3-4 factual sentences: {text}"
    elif source_type == "magazine":
        return f"summarize magazine article in about half the original length with key details: {text}"
    elif source_type == "book":
        return f"summarize book excerpt in detail preserving key ideas and context: {text}"
    else:
        return f"summarize: {text}"

# ===============================
# 5Ô∏è‚É£ Example book input
# ===============================
test_text = """
Shipwreck confirmed as lost WW1 warship A wreck discovered off the Aberdeenshire coast is a Royal Navy warship sunk by a torpedo during World War One, it has been confirmed. More than 500 of HMS Hawke's crew died when it was attacked by a German U-boat in October 1914. The ship caught fire and, following an explosion, sank in less than eight minutes with just 70 sailors surviving. The wreck was discovered by a team of divers about 70 miles east of Fraserburgh earlier this year in "remarkable" condition. After assessing the evidence, Royal Navy experts have now confirmed it was HMS Hawke. Analysis of footage, photographs and scans was carried out to confirm the ship's identity.
"""
source_type = "book"
prompt = build_prompt(test_text, source_type)

# ===============================
# 6Ô∏è‚É£ Tokenize input
# ===============================
max_input_length = 1024
max_target_length = 768  # Increased for detailed book summaries

inputs = tokenizer(prompt, max_length=max_input_length, truncation=True, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

# ===============================
# 7Ô∏è‚É£ Generate summary
# ===============================
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_target_length,
        num_beams=6,  # More beams for better quality
        early_stopping=True,
        no_repeat_ngram_size=3,  # Avoid repetition in long summaries
        length_penalty=1.0,
        do_sample=False  # Deterministic output
    )

# ===============================
# 8Ô∏è‚É£ Decode summary
# ===============================
generated_summary = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("\n=== INPUT (first 200 chars) ===")
print(prompt[:200] + "...")
print("\n=== GENERATED BOOK SUMMARY ===")
print(generated_summary)
print("\n" + "="*50)


Model loaded on cuda

=== INPUT (first 200 chars) ===
summarize book excerpt in detail preserving key ideas and context: Shipwreck confirmed as lost WW1 warship A wreck discovered off the Aberdeenshire coast is a Royal Navy warship sunk by a torpedo duri...

=== GENERATED BOOK SUMMARY ===
A shipwreck discovered off the Aberdeenshire coast is a Royal Navy warship sunk by a torpedo during World War One. More than 500 of HMS Hawke's crew died when it was attacked by German U-boat in October 1914. The ship caught fire and, following an explosion, sank in less than eight minutes, with just 70 sailors surviving. The wreck was discovered by divers about 70 miles east of Fraserburgh earlier this year in "remarkable" condition. Royal Navy experts have now confirmed it as the ship's identity.

