# Assignment 3: Summarization with LLMs

**Description:** This assignment covers summarization, the task of generating an abridged version of an input. With the ascendance of LLMs, we have a new way of generating summaries: rather than fine-tuning a model to generate summaries, we can simply provide explicit instructions for the summary we want the model to generate. By finishing this assignment you should also be able to develop an intuition for:


* How well summarization systems work
* The effects of hyperparameters on outcomes
* The effects of prompts on the output of an LLM
* Evaluation of output using ROUGE



This notebook must be run on Google Colab, but it does not require a GPU. By default, when you open the notebook in Colab it will NOT configure a GPU. Summarization commands can take up to five minutes to run, depending on the hyperparameters you use. This notebook will NOT run on your GCP instance because the summarization models are larger than the available memory.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2025-summer-main/blob/master/assignment/a3/SummarizationLLM_test.ipynb)

The overall assignment structure is as follows:

 Setup

1. Gemma 2 for abstractive summarization




**INSTRUCTIONS:**

* Questions are always indicated as **QUESTION:**, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook, as well as to answer the questions in the **answers** file as you did in a1 and a2.

* **### YOUR CODE HERE** indicates that you are supposed to write code.

* In order to complete the assignment with the Gemma model, you will need an account on [Hugging Face](https://huggingface.co). It is free. Once you have the account, you will need to create an Access Token. Go to Access Token under your profile and generate a token with write permissions for Colab. Copy that token and add it to the secrets in your Colab account, with the name `HF_TOKEN` and your access token string as the value.

* In addition, you will need to visit the [Model Card for the Gemma 2 model](https://huggingface.co/google/gemma-2-9b-it). At the top of the page you will see a notice saying you need to request permission to use the model. While logged in to your Hugging Face account, click the button to request permission. It can sometimes take 10 to 15 minutes to get approved. Once you are approved, the message on the Model Card will change to indicate that you have been granted access to the model.
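
Once the secret is stored, it can be read from inside the notebook. The following is a minimal sketch, assuming the `google.colab.userdata` module that Colab runtimes provide. The helper name `get_hf_token` and the fallback to an `HF_TOKEN` environment variable are illustrative assumptions, not part of the assignment.

```python
import os


def get_hf_token():
    """Return the Hugging Face token, or None if it cannot be found.

    Hypothetical helper: tries the Colab secrets store first and falls
    back to an environment variable when not running in Colab.
    """
    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
        return userdata.get("HF_TOKEN")
    except Exception:
        # Not in Colab, or the secret is missing or not shared with this notebook.
        return os.environ.get("HF_TOKEN")


token = get_hf_token()
print("Token found" if token else "No HF_TOKEN available")
```

If this prints that no token is available inside Colab, check that the secret is named exactly `HF_TOKEN` and that notebook access is enabled for it.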


## Setup

In [1]:
!pip install -q -U transformers
!pip install -q -U accelerate
!pip install -q -U bitsandbytes
!pip install -q -U flash_attn
!pip install -q -U datasets

In [2]:
#help track which versions of libraries we're using
!pip list | grep transformers
!pip list | grep accelerate
!pip list | grep bitsandbytes
!pip list | grep datasets

sentence-transformers                 4.1.0
transformers                          4.53.0
accelerate                            1.8.1
bitsandbytes                          0.46.0
datasets                              3.6.0
tensorflow-datasets                   4.9.9
vega-datasets                         0.9.0


In [5]:
!pip uninstall -y flash-attn

Found existing installation: flash_attn 2.8.0.post2
Uninstalling flash_attn-2.8.0.post2:
  Successfully uninstalled flash_attn-2.8.0.post2


In [6]:
import datasets
from transformers import pipeline, BitsAndBytesConfig
import bitsandbytes as bnb
import torch
import random
import pandas as pd
from tqdm import tqdm


In [7]:
!pip install -q evaluate
import evaluate


In [8]:
!pip install -q rouge_score

  Preparing metadata (setup.py) ... done
  Building wheel for rouge_score (setup.py) ... done


In [9]:
#let's make longer output readable without horizontal scrolling
from pprint import pprint

Now let's get the data we're going to use.

In [10]:
import datasets
import random
import torch

def load_dataset(num_samples=11):
    """
    Load and sample records from the X-Sum dataset
    """
    dataset = datasets.load_dataset("xsum", split="train", cache_dir=None, trust_remote_code=True)
    selected_indices = random.sample(range(len(dataset)), num_samples)
    selected_samples = dataset.select(selected_indices)
    return selected_samples

In [11]:
# Set random seed for reproducibility
random.seed(42)
torch.manual_seed(42)

# Load dataset
print("Loading dataset...")
dataset = load_dataset()

Loading dataset...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

xsum.py: 0.00B [00:00, ?B/s]

(…)SUM-EMNLP18-Summary-Data-Original.tar.gz:   0%|          | 0.00/255M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.72M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/204045 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11332 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11334 [00:00<?, ? examples/s]

In [12]:
dataset

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 11
})
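
The sample is reproducible because the random seed is fixed before the indices are drawn. The sketch below illustrates the idea with a local `random.Random` instance instead of the global seed used in the notebook. The function name `sample_indices` is an assumption for illustration, and the train split size of 204045 is taken from the download log above.

```python
import random


def sample_indices(n_total, k, seed):
    """Draw k distinct indices from range(n_total) using a seeded generator."""
    rng = random.Random(seed)  # local generator, leaves the global state untouched
    return rng.sample(range(n_total), k)


first = sample_indices(204045, 11, seed=42)
second = sample_indices(204045, 11, seed=42)
other = sample_indices(204045, 11, seed=7)

print(first == second)   # same seed, same indices
print(first == other)    # a different seed almost certainly gives different indices
```

Note that re-running only the sampling cell without re-running the seeding cell would draw a different sample, because the global generator's state advances with every call.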

What do our input documents look like?  Let's see the first of them.

In [13]:
dataset[0]['document']

'Private Harry Vasey, who was part of the 1st Airborne Battalion, The Border Regiment, was killed during Operation Market Garden in Oosterbeek in 1944.\nNow his identity has been confirmed, the Ministry of Defence (MoD) want to trace his family so his grave can be rededicated in the Netherlands.\nThe MoD said plans were also in place to change his headstone.\nBorn in Durham in May 1916 to Harry Vasey and Annie Young, he enlisted in April 1940 when he lived in Bowburn, County Durham.\nAn MoD spokesman said: "Unfortunately that is about all we know about Private Vasey and his family and that\'s where the trail goes cold.\n"We are hoping that there are some of his family still living in that area."\nSince WW2, a section of the Royal Netherlands Army has been working to identify the graves of unknown soldiers killed in battle.\nThe exhumation reports were scrutinised for clues to the identities of these men and the research was presented to the MoD.\nMr Vasey is one of six Border Regiment 

And what does the corresponding summary look like?  This is our target.

In [14]:
dataset[0]['summary']

'The family of a soldier killed during World War Two is being sought after his final resting place was confirmed.'

We'll also take advantage of a Hugging Face abstraction called a pipeline.  It is an easy way of experimenting with a model in inference mode.  We'll use that here to experiment with prompts (and possibly some hyperparameters) to improve the quality of our results.

It takes a while to load this model -- on the order of ten minutes -- but once it is loaded you can keep reusing the loaded model and improve your prompt.



In [32]:
"""
Initialize the pipeline with bitsandbytes quantization
"""

# Configure bitsandbytes for 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Initialize pipeline
model_id = "google/gemma-2-9b-it"

summarizer = pipeline(
   "text-generation",
   model=model_id,
   model_kwargs={"torch_dtype": torch.bfloat16, "quantization_config": quantization_config},
   device_map="auto",
   trust_remote_code=True,
)

config.json:   0%|          | 0.00/857 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/39.1k [00:00<?, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/3.67G [00:00<?, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.90G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.96G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/173 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/47.0k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

Device set to use cuda:0
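
To see why the 4-bit configuration matters, here is a rough back-of-the-envelope estimate of weight memory. It assumes the four downloaded shards (sizes taken from the log above) store weights at 2 bytes per parameter, and it ignores quantization constants, any layers left unquantized, activations, and the KV cache, so the real footprint will be somewhat higher.

```python
# Shard sizes in GB, as reported in the download log above.
shard_sizes_gb = [4.90, 4.95, 4.96, 3.67]

total_checkpoint_gb = sum(shard_sizes_gb)

# Assumption: the checkpoint stores 2 bytes per parameter (16-bit weights).
bytes_per_param_16bit = 2.0
params_billion = total_checkpoint_gb / bytes_per_param_16bit

# 4-bit quantization stores roughly half a byte per parameter.
bytes_per_param_4bit = 0.5
approx_4bit_gb = params_billion * bytes_per_param_4bit

print(f"Checkpoint size:         {total_checkpoint_gb:.2f} GB")
print(f"Estimated parameters:    {params_billion:.2f} billion")
print(f"Approx. 4-bit weights:   {approx_4bit_gb:.2f} GB")
```

Under these assumptions, the quantized weights take about a quarter of the space of the 16-bit checkpoint.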


As a reminder, here's the record we're dealing with.

In [33]:
dataset[0]

{'document': 'Private Harry Vasey, who was part of the 1st Airborne Battalion, The Border Regiment, was killed during Operation Market Garden in Oosterbeek in 1944.\nNow his identity has been confirmed, the Ministry of Defence (MoD) want to trace his family so his grave can be rededicated in the Netherlands.\nThe MoD said plans were also in place to change his headstone.\nBorn in Durham in May 1916 to Harry Vasey and Annie Young, he enlisted in April 1940 when he lived in Bowburn, County Durham.\nAn MoD spokesman said: "Unfortunately that is about all we know about Private Vasey and his family and that\'s where the trail goes cold.\n"We are hoping that there are some of his family still living in that area."\nSince WW2, a section of the Royal Netherlands Army has been working to identify the graves of unknown soldiers killed in battle.\nThe exhumation reports were scrutinised for clues to the identities of these men and the research was presented to the MoD.\nMr Vasey is one of six Bor

Let's just generate one summary so we can see what it looks like.

In [34]:
prompt = [
            {"role": "user", "content": "Generate a summary of this text: " + dataset[0]['document']}
        ]



outputs = summarizer(
  prompt,
  max_new_tokens=256,
  do_sample = True,
  temperature = 0.3,
  top_p = 0.95
)

summary = outputs[0]["generated_text"][-1]

Let's see what the generated summary looks like.

In [35]:
summary

{'role': 'assistant',
 'content': "Private Harry Vasey, a soldier from the 1st Airborne Battalion, The Border Regiment, was killed during Operation Market Garden in 1944. His identity has recently been confirmed by the Ministry of Defence (MoD) through the work of the Royal Netherlands Army in identifying unknown soldiers' graves.  \n\nThe MoD is now seeking Vasey's family to rededicate his grave in Oosterbeek, Netherlands, and change his headstone.  Little is known about Vasey's family, who may still live in the Bowburn, County Durham area.  \n\nOperation Market Garden, aimed at capturing strategic bridges near Arnhem, was a failed Allied operation resulting in over 1,400 Allied deaths and 6,000 captured.  The MoD hopes Vasey's family can attend a service in his honour at Oosterbeek Cemetery on September 14th. \n\n\n"}
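
Note that `summary` is a chat message dictionary, not a plain string, and that the text ends with trailing whitespace. The sketch below shows one way to pull out clean text. It runs on a mock object that mirrors the shape of the output shown above, and the helper name `extract_summary_text` is an illustrative assumption.

```python
def extract_summary_text(outputs):
    """Return the assistant's reply as a stripped string.

    Assumes the chat-style pipeline output shape seen above: a list with
    one dict whose "generated_text" is a list of role/content messages.
    """
    last_message = outputs[0]["generated_text"][-1]
    return last_message["content"].strip()


# Mock output with the same structure as the pipeline result (content is made up).
mock_outputs = [{
    "generated_text": [
        {"role": "user", "content": "Generate a summary of this text: ..."},
        {"role": "assistant", "content": "A short summary.  \n\n\n"},
    ]
}]

print(repr(extract_summary_text(mock_outputs)))
```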

How does it compare with the reference? Let's compare your candidate and the reference using the ROUGE metric.

In [36]:
rouge = evaluate.load('rouge')


# Process each sample
print("Generating summaries and calculating ROUGE scores...")



# Calculate ROUGE scores
predictions = [summary['content']]
references = [[dataset[0]['summary']]]
rouge_scores = rouge.compute(predictions=predictions, references=references)
rouge_scores

Downloading builder script: 0.00B [00:00, ?B/s]

Generating summaries and calculating ROUGE scores...


{'rouge1': np.float64(0.14965986394557823),
 'rouge2': np.float64(0.027586206896551724),
 'rougeL': np.float64(0.09523809523809523),
 'rougeLsum': np.float64(0.12244897959183673)}
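
To build intuition for these numbers, ROUGE-N can be sketched in a few lines as the F1 score over overlapping n-grams. This is a simplified illustration, not the `rouge_score` implementation: the tokenizer here (lowercase, alphanumeric runs) and the function names are assumptions, and the library's tokenization and aggregation may differ in detail.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(prediction, reference, n=1):
    """F1 of clipped n-gram overlap between a prediction and one reference."""
    pred = ngram_counts(tokenize(prediction), n)
    ref = ngram_counts(tokenize(reference), n)
    overlap = sum((pred & ref).values())  # clipped matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


toy_prediction = "the cat sat on the mat"
toy_reference = "the cat lay on the mat"
print(f"ROUGE-1: {rouge_n_f1(toy_prediction, toy_reference, n=1):.4f}")
print(f"ROUGE-2: {rouge_n_f1(toy_prediction, toy_reference, n=2):.4f}")
```

Because precision divides by the number of n-grams in the prediction, a long multi-paragraph summary scored against a one-sentence reference gets a low score even when it covers the reference's content.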

Now it's your turn.  Improve the prompt below so that the average ROUGE scores over the entire 11-record data sample exceed these thresholds:
* ROUGE-1 > 0.2
* ROUGE-2 > 0.03
* ROUGE-L > 0.15

You may use sampling with Top K or Top P and temperature if you like, but the prompt is what will have the greatest effect on your output.  Your prompt should give instructions that are as specific as possible.  These LLMs are trained to follow instructions, so be very specific in your request.  Individual words can make a large difference, so take a little time to experiment with synonyms and alternate ways of phrasing things.
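
For intuition about what temperature and Top P do, here is a toy sketch on a made-up four-token distribution. It is a simplified illustration of the general technique, not the Hugging Face implementation; the logits and function names are assumptions.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens
for temperature in (1.0, 0.3):
    probs = softmax_with_temperature(toy_logits, temperature)
    kept = top_p_filter(probs, top_p=0.95)
    print(f"T={temperature}: probs={[round(p, 3) for p in probs]}, kept tokens={sorted(kept)}")
```

With the lower temperature, most of the probability mass moves to the top token, so fewer candidates survive the Top P cutoff and the output becomes more deterministic.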

In [37]:
# Store results for aggregate scoring
results = []

Enter your prompt in the space below and then run the code.  

In [38]:
dataset[6]

{'document': 'Each year this part of south-west London becomes a very neat campsite as lines of tents emerge in carefully arranged sections, put up by people happy to sit and wait in the hope of catching a glimpse of their favourite tennis stars.\nYet the Wimbledon queue is a very multicultural place, with thousands of people travelling from across the world and waiting together to enter the championship grounds.\nSo in this melting point of international opinion, what are the thoughts of the people in the queue about Britain\'s decision to leave the EU?\n\'Deeply depressed\'\nWearing a hat lined with the EU\'s gold stars, it\'s no surprise what Dave Treanor thinks about Brexit.\n"I think it\'s an utter disaster. I\'ve been deeply depressed ever since", the south-west Londoner said.\nHe believes the Leave campaign\'s arguments were "a con" but accuses the Remain campaign of being "very poorly managed".\n"All they were saying is we could control immigration. They\'re not going to contro

In [39]:
for idx, sample in enumerate(tqdm(dataset)):
    try:
      prompt = [
      {"role": "user", "content": (
          "You are a helpful assistant. Please summarize the following news article "
          "in one concise sentence, highlighting only the main event or outcome. "
          "Avoid unnecessary details, quotes, or background unless essential. "
          "Here is the article:\n\n" + sample['document']
      )}
              ]


      # Generate summary via the pipeline
      outputs = summarizer(
                          prompt,
                          max_new_tokens=512,
      )

      summary = outputs[0]["generated_text"][-1]

      # Calculate ROUGE scores
      predictions = [summary['content']]
      references = [[sample['summary']]]
      rouge_scores = rouge.compute(predictions=predictions, references=references)


      # Store results
      results.append({
          'id': idx,
          'original_text': sample['document'][:500],  # Store truncated text for readability
          'reference_summary': sample['summary'],
          'generated_summary': summary,
           **rouge_scores
      })

      # Print progress update every 10 samples
      if (idx + 1) % 10 == 0:
          print(f"\nProcessed {idx + 1} samples")
          print(f"Latest ROUGE-1: {rouge_scores['rouge1']:.4f}")

    except Exception as e:
      print(f"Error processing sample {idx}: {str(e)}")
      continue

 82%|████████▏ | 9/11 [01:21<00:21, 10.50s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
 91%|█████████ | 10/11 [01:27<00:09,  9.03s/it]


Processed 10 samples
Latest ROUGE-1: 0.3077


100%|██████████| 11/11 [01:33<00:00,  8.51s/it]
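
Question 1.1 asks for the number of words in the prompt. One simple way to count them is to split the instruction text on whitespace, as sketched below. Whether to count only the instruction or also the appended article is a judgment call; this sketch assumes only the instruction part counts, and the helper name `count_words` is illustrative.

```python
def count_words(text):
    """Count whitespace-separated words in a string."""
    return len(text.split())


# The instruction portion of the prompt used in the loop above (article text excluded).
instruction = (
    "You are a helpful assistant. Please summarize the following news article "
    "in one concise sentence, highlighting only the main event or outcome. "
    "Avoid unnecessary details, quotes, or background unless essential. "
    "Here is the article:\n\n"
)

print(f"Prompt length: {count_words(instruction)} words")
```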


Calculate and print the average scores.

In [40]:
# Convert results to DataFrame
results_df = pd.DataFrame(results)

# Calculate and print average ROUGE scores
avg_scores = results_df[['rouge1', 'rouge2', 'rougeL']].mean()
print("\nAverage ROUGE Scores:")
for metric, score in avg_scores.items():
   print(f"{metric}: {score:.4f}")

# Print some example summaries
print("\nExample Summaries:")
for i in range(min(5, len(results_df))):
   print(f"\nExample {i+1}:")
   print(f"Reference: {results_df.iloc[i]['reference_summary']}")
   print(f"Generated: {results_df.iloc[i]['generated_summary']}")


Average ROUGE Scores:
rouge1: 0.2399
rouge2: 0.0831
rougeL: 0.1991

Example Summaries:

Example 1:
Reference: The family of a soldier killed during World War Two is being sought after his final resting place was confirmed.
Generated: {'role': 'assistant', 'content': 'The identity of Private Harry Vasey, a soldier killed in Operation Market Garden, has been confirmed, and the Ministry of Defence is seeking his family to attend a rededication ceremony at his grave in the Netherlands.  \n'}

Example 2:
Reference: Gloucester lock Jeremy Thrush will make his first appearance of the season against Stade Rochelais.
Generated: {'role': 'assistant', 'content': 'David Humphreys made 10 changes to the Gloucester team that defeated Stade Rochelais last month.  \n'}

Example 3:
Reference: Staff at the University of Aberdeen have backed plans for industrial action in a dispute over planned job losses.
Generated: {'role': 'assistant', 'content': 'Staff at the University of Aberdeen voted to strike o
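
The averages can be checked against the assignment thresholds programmatically. The following is a small sketch that uses the average scores printed above and the thresholds stated earlier; the function name `meets_thresholds` is an assumption for illustration.

```python
def meets_thresholds(average_scores, thresholds):
    """Return a dict mapping each metric to True if its average strictly exceeds the threshold."""
    return {
        metric: average_scores[metric] > minimum
        for metric, minimum in thresholds.items()
    }


# Thresholds from the assignment text.
thresholds = {"rouge1": 0.2, "rouge2": 0.03, "rougeL": 0.15}

# Average scores copied from the output above.
average_scores = {"rouge1": 0.2399, "rouge2": 0.0831, "rougeL": 0.1991}

check = meets_thresholds(average_scores, thresholds)
for metric, passed in check.items():
    print(f"{metric}: {'pass' if passed else 'fail'}")
print("All thresholds met:", all(check.values()))
```

In the notebook itself, `avg_scores.to_dict()` could be passed in place of the hard-coded dictionary.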

**QUESTION:**

1.1 What is the number of words in your prompt once you've met the scoring criteria?

1.2 What is the average ROUGE-1 score you get once you've met the scoring criteria?

1.3 What is the average ROUGE-2 score you get once you've met the scoring criteria?

1.4 What is the average ROUGE-L score you get once you've met the scoring criteria?

1.5 How helpful do you find ROUGE to be in creating better summaries?  How do you think it could be improved? Please write a five sentence response in the text cell below.

*** YOUR ANSWER TO QUESTION 1.5 HERE ***

ROUGE is a useful starting point for evaluating summarization models because it provides quantitative feedback based on word overlap with the reference summary. It encourages succinctness and can help detect when a summary drifts too far from the original content. However, ROUGE doesn’t account for semantic similarity or paraphrasing, which can penalize valid but reworded outputs. It also struggles with very short or high-abstraction summaries common in datasets like XSum. An improvement could involve combining ROUGE with embedding-based metrics like BERTScore or using human evaluation for factual correctness and readability.

*** END YOUR ANSWER ***