✅ Workflow Overview:
Install dependencies

Load model and tokenizer

Load and preprocess Samsum dataset

Fine-tune Pegasus

Evaluate with ROUGE

Save and load model

Use model for prediction







In [1]:
%pip install transformers datasets evaluate rouge_score accelerate





In [3]:
%pip install transformers[SentencePiece]

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# Define the model name
model_name = "google/pegasus-xsum"

# Load the pre-trained tokenizer and model weights for model_name
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

# Think of this as:
# tokenizer  = the translator between human text and token IDs.
# model      = the network that reads token IDs and writes summaries.
# model_name = which pre-trained checkpoint to start from (here, one
#              fine-tuned on XSum: BBC news articles with one-sentence summaries).



Note: you may need to restart the kernel to use updated packages.



Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [4]:
%pip install py7zr

from datasets import load_dataset

dataset = load_dataset("samsum", trust_remote_code=True)

# Copy the dialogue/summary columns to generic input/target names
dataset = dataset.map(lambda x: {"input_text": x["dialogue"], "target_text": x["summary"]})

# Tokenize
max_input_len = 512
max_target_len = 128

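def preprocess(examples):
    # Tokenize dialogues. Padding is left to the data collator, so each batch
    # is padded only to its own longest sequence.
    inputs = tokenizer(
        examples["input_text"], truncation=True, max_length=max_input_len
    )
    # text_target tokenizes the summaries as decoder labels.
    targets = tokenizer(
        text_target=examples["target_text"], truncation=True, max_length=max_target_len
    )
    inputs["labels"] = targets["input_ids"]
    return inputs

tokenized_dataset = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

# Convert the raw text to token IDs the model can understand.
# DataCollatorForSeq2Seq later pads labels with -100, so padded positions
# are ignored by the loss instead of being learned as targets.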




Generating train split:   0%|          | 0/14732 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/819 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/818 [00:00<?, ? examples/s]

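To make the preprocessing step concrete, here is a minimal pure-Python sketch of what truncation, padding, and label masking do to a single example. The token IDs, `PAD_ID = 0`, and the helper names `pad_or_truncate` and `mask_label_padding` are illustrative assumptions, not part of the `transformers` API. The value `-100` is the index that PyTorch's cross-entropy loss ignores by default.

```python
PAD_ID = 0            # assumed padding ID for this toy example
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss


def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    """Cut ids to max_length, right-pad the rest, and build an attention mask.

    Simplification: a real tokenizer also keeps the end-of-sequence token
    when it truncates; this toy version just slices.
    """
    ids = ids[:max_length]
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask


def mask_label_padding(labels, pad_id=PAD_ID):
    """Replace padding in the labels so the loss skips those positions."""
    return [IGNORE_INDEX if token == pad_id else token for token in labels]


# Made-up token IDs standing in for a tokenized dialogue and summary.
dialogue_ids = [312, 87, 4051, 9, 1]
summary_ids = [77, 1204, 1]

input_ids, attention_mask = pad_or_truncate(dialogue_ids, max_length=8)
labels, _ = pad_or_truncate(summary_ids, max_length=5)
labels = mask_label_padding(labels)

print(input_ids)       # [312, 87, 4051, 9, 1, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
print(labels)          # [77, 1204, 1, -100, -100]
```

Without the masking step, the model would also be trained to predict padding tokens, which tends to distort the reported loss.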

In [None]:
from transformers import TrainingArguments, Trainer, DataCollatorForSeq2Seq

trainer_args = TrainingArguments(
    output_dir='pegasus-samsum', num_train_epochs=1, warmup_steps=500,
    per_device_train_batch_size=1, per_device_eval_batch_size=1,
    weight_decay=0.01, logging_steps=10,
    # Note: recent transformers releases rename evaluation_strategy to eval_strategy.
    evaluation_strategy='steps', eval_steps=500, save_steps=1e6,
    gradient_accumulation_steps=16
)

# Pads inputs and labels batch by batch
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Trainer(
    model=model,
    args=trainer_args,
    # Use the train split; training on the test split would leak evaluation data.
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator
)

trainer.train()

# We'll use Hugging Face's Trainer API, which handles training, evaluation, and saving.


Explanation + Pros & Cons of Each Argument

| Setting | Explanation | ✅ Pros | ❌ Cons |
| --- | --- | --- | --- |
| `num_train_epochs=1` | Train for only 1 full pass through the data | Fast, good for testing | May underfit: the model might not learn enough |
| `warmup_steps=500` | Slowly ramp up the learning rate for the first 500 steps | Can stabilize training | Only useful if training runs for more than 500 steps |
| `per_device_train_batch_size=1` | Process 1 example per GPU per step | Allows training on low-memory systems | Training will be slower |
| `gradient_accumulation_steps=16` | Accumulate gradients over 16 steps to simulate a batch size of 16 | Lets you train "as if" batch size = 16 | Makes training steps slower; can delay updates |
| `evaluation_strategy='steps'` | Evaluate every `eval_steps` steps | Early feedback during training | Less consistent than evaluating at the end of each epoch |
| `eval_steps=500` | Run evaluation every 500 steps | Useful for watching progress | Might miss trends if evaluation is too sparse |
| `save_steps=1e6` | Save the model every 1 million steps (effectively never during training) | Saves disk space during development | Risky: no saved checkpoints if training crashes |
| `logging_steps=10` | Print logs every 10 steps | Helps track training live | Can flood logs if too frequent |
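The settings above interact, so it helps to estimate how many weight updates a run performs and what the learning rate looks like along the way. The sketch below is a rough estimate under stated assumptions: a single device, floor division for the number of updates per epoch (the exact rounding depends on the `transformers` version), and the `Trainer` defaults of a linear schedule with a base learning rate of 5e-5. The helper names `optimizer_steps` and `linear_warmup_lr` are hypothetical. The example counts 14732 and 819 are the train and test split sizes reported when the dataset was loaded.

```python
def optimizer_steps(num_examples, batch_size, accumulation_steps, epochs=1):
    """Approximate number of weight updates on a single device."""
    batches_per_epoch = -(-num_examples // batch_size)  # ceiling division
    return max(batches_per_epoch // accumulation_steps, 1) * epochs


def linear_warmup_lr(step, total_steps, warmup_steps, base_lr=5e-5):
    """Linear ramp from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


train_steps = optimizer_steps(14732, batch_size=1, accumulation_steps=16)
test_steps = optimizer_steps(819, batch_size=1, accumulation_steps=16)
print(train_steps)  # 920
print(test_steps)   # 51

# With about 920 updates, warmup_steps=500 finishes and the peak rate is reached.
peak_lr = linear_warmup_lr(500, train_steps, warmup_steps=500)

# With only about 51 updates, training ends while the rate is still ramping up.
final_lr_small_run = linear_warmup_lr(51, test_steps, warmup_steps=500)
print(final_lr_small_run / 5e-5)  # roughly 0.1 of the base learning rate
```

Under these assumptions, `eval_steps=500` triggers once on the full training split and never on a run of about 51 updates, which is worth checking before relying on mid-training evaluation.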
✅ When to Use This Version
Use this setup if:

You're running on limited GPU/CPU memory

You want faster prototyping

You’re doing debugging or early experiments

You can't fit large batches into memory

❌ When to Avoid or Modify It
If you're doing serious training for production, 1 epoch and small batch sizes might underfit the model.

save_steps=1e6 means no checkpoint is written during training. The model is only saved when you call trainer.save_model() afterwards, so an interrupted run loses all progress.

Low per_device_train_batch_size is OK with gradient accumulation, but training will be slower per epoch.
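The claim that a batch size of 1 with gradient accumulation behaves like a larger batch can be checked on a toy problem. The sketch below uses a one-parameter linear model with a mean-squared-error loss in pure Python; the data, the starting weight, and the helper name `grad_mse` are made up for illustration. It assumes the loss is averaged per example. For sequence models, where the loss is averaged over tokens of varying length, the equivalence is only approximate.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y_hat = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy data generated by y = 3x; the model starts from a wrong weight.
xs = [float(i) for i in range(1, 17)]
ys = [3.0 * x for x in xs]
w = 0.5
accumulation_steps = 16

# One step with a true batch of 16 examples.
full_batch_grad = grad_mse(w, xs, ys)

# Sixteen micro-batches of size 1: each gradient is scaled by
# 1 / accumulation_steps and summed before the single weight update.
accumulated_grad = 0.0
for x, y in zip(xs, ys):
    accumulated_grad += grad_mse(w, [x], [y]) / accumulation_steps

print(abs(full_batch_grad - accumulated_grad) < 1e-9)  # True
```

The update is the same, but it takes sixteen forward and backward passes to produce it, which is why each optimizer step is slower.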



In [None]:
%pip install evaluate
%pip install rouge_score

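import numpy as np
from evaluate import load

rouge = load("rouge")

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    # The plain Trainer returns logits (possibly inside a tuple), not token IDs.
    if isinstance(preds, tuple):
        preds = preds[0]
    if preds.ndim == 3:
        # Teacher-forced argmax; scores will differ from model.generate() output.
        preds = preds.argmax(axis=-1)
    # Put pad tokens back where labels were masked with -100 so they can be decoded.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels)
    return result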

# Update trainer with compute_metrics
trainer.compute_metrics = compute_metrics

# Evaluate
trainer.evaluate()
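To show what the ROUGE scores measure, here is a simplified pure-Python sketch of ROUGE-1, the unigram-overlap variant. It is not the implementation used by `evaluate`: the `rouge_score` package applies its own tokenization and also reports `rouge2`, `rougeL`, and `rougeLsum`. The function name `rouge1` and the example sentences are made up for illustration.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Unigram precision, recall, and F1 on lowercased, whitespace-split text."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection: each word counts as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1(
    "John and Alice will meet at 6 PM",
    "John and Alice are meeting at 6 PM",
)
print(scores["f1"])  # 0.75
```

Six of the eight words match in each direction, so precision, recall, and F1 are all 0.75. Note that "meet" and "meeting" do not match here because this sketch does no stemming.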


In [None]:
# Save the fine-tuned model and tokenizer

trainer.save_model("./pegasus_samsum_final")
tokenizer.save_pretrained("./pegasus_samsum_final")


**Load the saved model and predict summaries from new dialogue.**




In [None]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# Load the saved model
model = PegasusForConditionalGeneration.from_pretrained("./pegasus_samsum_final")
tokenizer = PegasusTokenizer.from_pretrained("./pegasus_samsum_final")

# Your input chat/dialogue
chat = "John: Hey, are we meeting later?\nAlice: Yes, at 6 PM.\nJohn: Perfect, see you!"

# Tokenize and generate summary
inputs = tokenizer(chat, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(**inputs, max_length=60, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("Summary:", summary)
