## SWE 422 – Introduction to Large Language Models
## Homework, Spring 2025

### Student: Şeyma Altıparmak | B201202051

## Installing Essential Libraries
Before diving into the model and data processing steps, we install the Python packages the notebook depends on: Hugging Face's `transformers`, `datasets`, and `evaluate`, plus `rouge-score`, which the ROUGE metric in `evaluate` requires. These cover loading pre-trained transformer models, loading benchmark datasets, and scoring model output. Adding the `-q` flag to the `pip install` command would suppress the verbose output shown below for a cleaner notebook.

In [None]:
!pip install transformers datasets evaluate rouge-score

Collecting datasets
  Downloading datasets-3.5.1-py3-none-any.whl.metadata (19 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.5.1-py3-none-any.whl (491 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)

In [3]:
from datasets import load_dataset
from transformers import pipeline, BartForConditionalGeneration, BartTokenizer
import evaluate
import pandas as pd

## Loading the Dataset

In this step, we fetch a subset of the well-known CNN / DailyMail dataset (version `3.0.0`) using the Hugging Face Datasets library. Specifically, we load the first 1% of the training split, roughly 2,870 of its 287,113 examples, which keeps experimentation quick while still providing plenty of articles to work with. Each record contains a news article under the key `article` and its human-written summary under the key `highlights`. This structure makes it well suited to developing and evaluating text summarization models.
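As a rough illustration of what a percent slice such as `train[:1%]` selects, the row count can be estimated from the split size. The helper name `percent_slice_rows` is hypothetical, and round-to-nearest boundary handling is an assumption about the library's default rather than a documented guarantee.

```python
def percent_slice_rows(num_examples: int, percent: float) -> int:
    """Estimate how many rows a `split[:N%]` slice selects.

    Assumes the slice boundary is rounded to the nearest whole row.
    """
    return int(round(num_examples * percent / 100.0))


# Size of the full CNN / DailyMail 3.0.0 train split.
TRAIN_ROWS = 287_113

print(percent_slice_rows(TRAIN_ROWS, 1))
```

Comparing this estimate with `len(dataset)` after loading is a quick sanity check that the intended subset was fetched.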

In [None]:
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train[:1%]")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

In [5]:
df = pd.DataFrame(dataset[:5])
df[["article", "highlights"]]

|   | article | highlights |
|---|---------|------------|
| 0 | LONDON, England (Reuters) -- Harry Potter star... | Harry Potter star Daniel Radcliffe gets £20M f... |
| 1 | Editor's note: In our Behind the Scenes series... | Mentally ill inmates in Miami are housed on th... |
| 2 | MINNEAPOLIS, Minnesota (CNN) -- Drivers who we... | NEW: "I thought I was going to die," driver sa... |
| 3 | WASHINGTON (CNN) -- Doctors removed five small... | Five small polyps found during procedure; "non... |
| 4 | (CNN) -- The National Football League has ind... | NEW: NFL chief, Atlanta Falcons owner critical... |


## Loading the Pretrained BART Summarization Model

Here we load the `facebook/bart-large-cnn` model, a transformer-based sequence-to-sequence model fine-tuned for summarization on the CNN / DailyMail dataset. We also load the corresponding tokenizer, which converts raw text into the input tokens the model processes. BART (Bidirectional and Auto-Regressive Transformers) pairs a bidirectional encoder with an autoregressive decoder, an architecture well suited to generating coherent, fluent summaries.

## Summarizing News with BART – Greedy, Beam Search, and Sampling Methods
In this step, we use the pretrained BART model to generate summaries for a single news article using three different decoding strategies:
- **Greedy Decoding:** This method selects the most probable next token at each step. It is fast but can result in less diverse or repetitive outputs.
- **Beam Search:** A more sophisticated method that considers multiple candidate sequences (beams) and keeps the most likely ones. In our case, we use 4 beams to improve coherence and quality.
- **Sampling (Stochastic Decoding):** This method introduces randomness into the generation process using parameters like temperature, top-k, and top-p (nucleus sampling). It often leads to more creative and varied outputs.
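The three strategies above can be sketched on a toy example. This is a minimal illustration, not how `transformers` implements generation: `TOY_LM` is a hypothetical two-step "language model" with invented probabilities, chosen so that the locally best first token does not lead to the most probable sequence.

```python
import math
import random

# Hypothetical next-token table: maps a prefix to next-token probabilities.
TOY_LM = {
    (): {"A": 0.5, "B": 0.4, "C": 0.1},
    ("A",): {"x": 0.4, "y": 0.3, "z": 0.3},
    ("B",): {"x": 0.9, "y": 0.1},
    ("C",): {"x": 1.0},
}
STEPS = 2


def greedy(lm, steps):
    """Pick the single most probable token at every step."""
    seq, logp = (), 0.0
    for _ in range(steps):
        token, p = max(lm[seq].items(), key=lambda kv: kv[1])
        seq, logp = seq + (token,), logp + math.log(p)
    return seq, math.exp(logp)


def beam_search(lm, steps, num_beams):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + (tok,), logp + math.log(p))
            for seq, logp in beams
            for tok, p in lm[seq].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    best_seq, best_logp = beams[0]
    return best_seq, math.exp(best_logp)


def sample(lm, steps, temperature, rng):
    """Draw each token at random; lower temperature sharpens the distribution."""
    seq = ()
    for _ in range(steps):
        tokens, probs = zip(*lm[seq].items())
        # p ** (1 / T) is proportional to softmax(log p / T);
        # rng.choices normalizes the weights for us.
        weights = [p ** (1.0 / temperature) for p in probs]
        seq += (rng.choices(tokens, weights=weights, k=1)[0],)
    return seq


print("greedy:", greedy(TOY_LM, STEPS))
print("beam  :", beam_search(TOY_LM, STEPS, num_beams=2))
print("sample:", sample(TOY_LM, STEPS, temperature=0.7, rng=random.Random(0)))
```

Greedy decoding commits to `A` (probability 0.5) and ends with a sequence of probability about 0.20, while beam search with two beams keeps `B` alive and finds the sequence `B x` with probability about 0.36. Sampling can return any path, depending on the random seed.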

In [6]:
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

In [7]:
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)


Device set to use cuda:0


In [None]:
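for i in range(3):
    # Show the first 500 characters of the article, then its BART summary.
    print(f"\n[ORIGINAL]: {dataset[i]['article'][:500]}...\n")
    # truncation=True guards against articles longer than the model's 1024-token input limit.
    summary = summarizer(dataset[i]['article'], max_length=130, min_length=30, do_sample=False, truncation=True)
    print(f"[BART SUMMARY]: {summary[0]['summary_text']}\n")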


[ORIGINAL]: LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as s...

[BART SUMMARY]: Harry Potter star Daniel Radcliffe turns 18 on Monday. He gains access to a reported £20 million ($41.1 million) fortune. Radcliffe's earnings from the first five Potter films have been held in a trust fund.


[ORIGINAL]: Editor's note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events. Here, Soledad O'Brien takes users inside a jail where many of the inmates are mentally ill. An i

BART

In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration

In [11]:
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

In [12]:
text = dataset[0]["article"]
inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)

In [13]:
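# Note: facebook/bart-large-cnn ships its own generation defaults (num_beams=4 among them),
# so this call inherits beam search; pass num_beams=1 to force strictly greedy decoding.
greedy_ids = model.generate(inputs["input_ids"], max_length=130)
greedy_summary = tokenizer.decode(greedy_ids[0], skip_special_tokens=True)
print("Greedy:\n", greedy_summary, "\n")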

Greedy:
 Harry Potter star Daniel Radcliffe turns 18 on Monday. He gains access to a reported £20 million ($41.1 million) fortune. Radcliffe's earnings from the first five Potter films have been held in a trust fund. Details of how he'll mark his landmark birthday are under wraps. 



In [14]:
beam_ids = model.generate(inputs["input_ids"], max_length=130, num_beams=4, early_stopping=True)
beam_summary = tokenizer.decode(beam_ids[0], skip_special_tokens=True)
print("Beam Search:\n", beam_summary, "\n")

Beam Search:
 Harry Potter star Daniel Radcliffe turns 18 on Monday. He gains access to a reported £20 million ($41.1 million) fortune. Radcliffe's earnings from the first five Potter films have been held in a trust fund. Details of how he'll mark his landmark birthday are under wraps. 



In [15]:
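# Note: with the checkpoint's default num_beams > 1 still in effect, do_sample=True samples
# within beam search; add num_beams=1 for plain top-k / nucleus sampling.
sample_ids = model.generate(inputs["input_ids"], max_length=130, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)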
sample_summary = tokenizer.decode(sample_ids[0], skip_special_tokens=True)
print("Sampling (temp=0.7):\n", sample_summary, "\n")

Sampling (temp=0.7):
 Harry Potter star Daniel Radcliffe turns 18 on Monday. He gains access to a reported £20 million ($41.1 million) fortune. Radcliffe's earnings from the first five Potter films have been held in a trust fund. Details of how he'll mark his landmark birthday are under wraps. 
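To make the `top_k` and `top_p` parameters used above more concrete, here is a minimal sketch of how such filtering can narrow a next-token distribution before sampling. The function `filter_top_k_top_p` and the example probabilities are hypothetical, and the exact order of operations inside `transformers` may differ from this simplified version.

```python
def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of them whose
    renormalized cumulative probability reaches top_p; return renormalized probs."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)

    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / total
        if cumulative >= top_p:
            break

    kept_total = sum(p for _, p in kept)
    return {token: p / kept_total for token, p in kept}


# Invented next-token distribution, for illustration only.
next_token_probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}

# top_k=3 drops "zebra"; top_p=0.8 then drops "an".
# Result is roughly {'the': 0.625, 'a': 0.375}.
print(filter_top_k_top_p(next_token_probs, top_k=3, top_p=0.8))
```

With permissive settings such as `top_k=50` and `top_p=0.95`, few tokens are removed when the model is already confident, which is one reason a sampled summary can end up matching the deterministic ones.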



PEGASUS

## Generating Summaries Using the Pegasus Model

In this section, we introduce the `google/pegasus-cnn_dailymail` model, a Pegasus checkpoint fine-tuned on the CNN / DailyMail dataset for abstractive summarization. Pegasus (Pre-training with Extracted Gap-sentences for Abstractive Summarization) is pretrained with a gap-sentence objective: entire sentences are masked out of a document and the model learns to generate them from the remaining text, a task that closely resembles summarization.

We will use the **greedy decoding** strategy to produce summaries with this model, which involves choosing the most probable token at each generation step. This method is computationally efficient and provides a good baseline for evaluating summarization performance.


In [None]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

In [17]:
pegasus_model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-cnn_dailymail")
pegasus_tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-cnn_dailymail")

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [None]:
pegasus_inputs = pegasus_tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)

In [19]:
pegasus_ids = pegasus_model.generate(pegasus_inputs["input_ids"], max_length=130)
pegasus_summary = pegasus_tokenizer.decode(pegasus_ids[0], skip_special_tokens=True)
print("Pegasus (Greedy):\n", pegasus_summary)

Pegasus (Greedy):
 Harry Potter star Daniel Radcliffe gains access to a reported £20 million fortune .<n>Young actor says he has no plans to fritter his cash away .<n>Radcliffe's earnings from the first five Potter films have been held in a trust fund .
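The Pegasus output above contains literal `<n>` markers where the reference highlights use line breaks. A small post-processing step, sketched here with a hypothetical helper `restore_newlines`, turns them back into newlines; this matters mostly for `rougeLsum`, which treats newline-separated lines as separate sentences.

```python
def restore_newlines(summary: str) -> str:
    """Turn Pegasus-style "<n>" markers into real line breaks, one sentence per line."""
    return "\n".join(part.strip() for part in summary.split("<n>"))


# First two sentences of the Pegasus output shown above.
raw = (
    "Harry Potter star Daniel Radcliffe gains access to a reported £20 million fortune ."
    "<n>Young actor says he has no plans to fritter his cash away ."
)
print(restore_newlines(raw))
```

Applying this to `pegasus_summary` before scoring would be one way to align its format with the reference; whether it changes the scores noticeably has not been checked here.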


ROUGE

## Evaluating Summaries Using ROUGE Metrics

To quantitatively assess the quality of the generated summaries, we employ **ROUGE** (Recall-Oriented Understudy for Gisting Evaluation), a set of metrics widely used in the field of natural language processing for evaluating automatic summarization systems. ROUGE compares the generated summaries against reference (human-written) summaries by calculating overlapping units such as n-grams, word sequences, and word pairs.

In this experiment, we compare the summaries generated using different methods and models:

- BART with Greedy Decoding
- BART with Beam Search
- BART with Sampling
- Pegasus with Greedy Decoding

We specifically compute the following metrics:
- **ROUGE-1:** Overlap of unigrams (single words)
- **ROUGE-2:** Overlap of bigrams (two-word sequences)
- **ROUGE-L:** Longest Common Subsequence (LCS), which rewards words that appear in the same order as in the reference without requiring them to be adjacent

These metrics help determine how closely the generated summaries align with the ground-truth highlights in terms of content retention and linguistic structure.
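The three metrics can be sketched in a few lines of plain Python. This is a simplified illustration that assumes lowercased whitespace tokenization and reports F1 only; the `rouge_score` package used by `evaluate` applies its own tokenization, so its numbers will generally differ. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, pred_total, ref_total):
    """Harmonic mean of precision (overlap / pred) and recall (overlap / ref)."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / pred_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n):
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    return f1(overlap, sum(pred.values()), sum(ref.values()))


def rouge_l(prediction, reference):
    a, b = prediction.lower().split(), reference.lower().split()
    # Dynamic-programming table for the longest common subsequence length.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return f1(table[-1][-1], len(a), len(b))


ref = "the cat sat on the mat"
pred = "the cat lay on the mat"
print(f"ROUGE-1: {rouge_n(pred, ref, 1):.4f}")
print(f"ROUGE-2: {rouge_n(pred, ref, 2):.4f}")
print(f"ROUGE-L: {rouge_l(pred, ref):.4f}")
```

Changing one word out of six leaves five matching unigrams but only three of five matching bigrams, which is why ROUGE-2 drops faster than ROUGE-1 when wording diverges.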


In [20]:
!pip install evaluate -q

In [None]:
import evaluate

In [None]:
rouge = evaluate.load("rouge")

In [None]:
reference = dataset[0]["highlights"]
print("Reference summary:\n", reference)

Reference summary:
 Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .


In [24]:
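# All four summaries are scored against the same reference. With the default settings,
# rouge.compute returns one aggregated score per metric, not one score per method.
results = rouge.compute(
    predictions=[greedy_summary, beam_summary, sample_summary, pegasus_summary],
    references=[reference] * 4,
)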

In [None]:
print("\n ROUGE Scores:")
for key, value in results.items():
    print(f"{key}: {value:.4f}")


 ROUGE Scores:
rouge1: 0.6553
rouge2: 0.4416
rougeL: 0.6042
rougeLsum: 0.6213
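The scores above are aggregated over all four summaries, so they do not show how each method fares on its own. The sketch below scores the BART and Pegasus outputs separately with a simplified unigram F1; `unigram_f1` is a hypothetical stand-in for ROUGE-1, and its regex tokenization is an assumption, so the values will not match the library's exactly. With `evaluate`, passing `use_aggregator=False` to `rouge.compute` should return per-prediction scores instead of the aggregate.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs only (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_f1(prediction, reference):
    pred, ref = Counter(tokenize(prediction)), Counter(tokenize(reference))
    overlap = sum((pred & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(pred.values()), overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Texts copied from the notebook outputs above.
reference = (
    "Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .\n"
    "Young actor says he has no plans to fritter his cash away .\n"
    "Radcliffe's earnings from first five Potter films have been held in trust fund ."
)
summaries = {
    "BART": (
        "Harry Potter star Daniel Radcliffe turns 18 on Monday. He gains access to a "
        "reported £20 million ($41.1 million) fortune. Radcliffe's earnings from the "
        "first five Potter films have been held in a trust fund. Details of how he'll "
        "mark his landmark birthday are under wraps."
    ),
    "Pegasus": (
        "Harry Potter star Daniel Radcliffe gains access to a reported £20 million "
        "fortune .<n>Young actor says he has no plans to fritter his cash away .<n>"
        "Radcliffe's earnings from the first five Potter films have been held in a "
        "trust fund ."
    ),
}

scores = {name: unigram_f1(text, reference) for name, text in summaries.items()}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

Only one BART entry is listed because the greedy, beam search, and sampling runs produced identical text for this article.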


## Results and Observations on Summarization Models

The ROUGE scores above (ROUGE-1 0.6553, ROUGE-2 0.4416, ROUGE-L 0.6042) are aggregated over all four summaries of a single article, so they describe the set as a whole rather than ranking the methods; scoring each summary separately, over more articles, is needed to confirm which method is best. On this example, Pegasus produced the summary closest to the reference: two of its three sentences match the reference highlights almost word for word, which is consistent with its pretraining objective being designed for summarization.

The three BART runs returned identical text. A likely cause is that `facebook/bart-large-cnn` ships generation defaults that already enable beam search, so the call labelled greedy does not use a single beam unless `num_beams=1` is passed explicitly. Beam search is generally expected to outperform greedy decoding, and sampling with top-k and nucleus filtering to give more varied but less consistent output that sometimes strays off-topic; this run does not show either effect.

Regarding human-like quality, the Pegasus output reads as short, highlight-style sentences that mirror the reference, whereas the BART summary is longer and adds a detail (the birthday plans) that is absent from the highlights.

In conclusion, Pegasus appears to be the most dependable choice on this example, while the comparison between greedy decoding, beam search, and sampling remains inconclusive until each configuration is forced explicitly and scored on its own.