# Unit 2

## Scoring and Comparing Models with ROUGE

# Welcome to the World of ROUGE: A Guide to Evaluating Summaries

Welcome back! In the previous lesson, you learned how to use Large Language Models (LLMs) for text summarization by crafting effective prompts. Now we'll go a step further and evaluate the quality of the generated summaries. Evaluation is how you find out how well a text generation model performs and where its outputs need improvement. One of the most popular metrics for this purpose is **ROUGE**, which stands for **Recall-Oriented Understudy for Gisting Evaluation**.

ROUGE is widely used to assess the quality of text summaries by comparing them to reference summaries. It measures the overlap of n-grams, word sequences, and word pairs between the generated and reference summaries. In this lesson, you will learn how to use ROUGE to score and compare different models, specifically **GPT-3.5** and **GPT-4**, on their summarization capabilities.

-----

### What is ROUGE and Why Does It Matter?

ROUGE is a set of metrics that compares a machine-generated summary to one or more human-written reference summaries. It measures the overlap between the model’s summary and a reference summary—the more overlap, the better the summary is generally considered. Higher ROUGE scores indicate a stronger similarity between the generated summary and the reference, meaning the model captured more key information.

ROUGE is widely used in natural language processing for tasks like text summarization and even machine translation. It became the go-to metric for summarization evaluation because it correlates reasonably well with human judgments of summary quality, while being automatic and fast.

-----

### Unigrams, Bigrams, and the ROUGE Variants (ROUGE-1, ROUGE-2, ROUGE-L)

ROUGE comes in several flavors, each measuring overlap in a slightly different way. The most common variants are **ROUGE-N** (for different values of N) and **ROUGE-L**.

  * **ROUGE-1** counts overlapping **unigrams**—that is, individual words. It checks how many words in the model’s summary appear in the reference summary (and vice versa). A high ROUGE-1 means the summary has a lot of the same words as the reference.
  * **ROUGE-2** counts overlapping **bigrams**, which are pairs of consecutive words. This is a stricter measure: two words in a row in the model’s summary have to match two words in a row in the reference. ROUGE-2 gives a sense of whether the model is not just capturing individual words, but also some short phrases or word combinations from the reference.
  * **ROUGE-L** is based on the **Longest Common Subsequence** (LCS), which is what the "L" refers to. A subsequence in this context is a sequence of words that appears in both summaries in the same order, but not necessarily contiguously. ROUGE-L finds the longest such sequence and uses its length to score the summary. This metric is useful for summarization because it rewards the model for following the reference's wording over longer stretches, even if there are extra words in between.
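
To make the three variants concrete, here is a minimal from-scratch sketch that counts unigram overlap, bigram overlap, and the longest common subsequence for a pair of short sentences. It uses plain whitespace tokenization, and the helper names (`ngrams`, `ngram_overlap`, `lcs_length`) are illustrative assumptions. Real ROUGE implementations also normalize text and may apply stemming, so treat this as an illustration of the idea rather than a reference implementation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(reference, candidate, n):
    """Count n-grams shared by both texts, clipping repeats to the smaller count."""
    ref_counts = Counter(ngrams(reference.split(), n))
    cand_counts = Counter(ngrams(candidate.split(), n))
    return sum((ref_counts & cand_counts).values())


def lcs_length(reference, candidate):
    """Length of the longest common subsequence of words (dynamic programming)."""
    ref, cand = reference.split(), candidate.split()
    table = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, ref_word in enumerate(ref, start=1):
        for j, cand_word in enumerate(cand, start=1):
            if ref_word == cand_word:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(ref)][len(cand)]


reference = "the cat was under the bed"
candidate = "the cat was found under the bed"

print("unigram overlap:", ngram_overlap(reference, candidate, 1))  # 6
print("bigram overlap:", ngram_overlap(reference, candidate, 2))   # 4
print("LCS length:", lcs_length(reference, candidate))             # 6
```

The extra word "found" breaks the bigram "was under", so only 4 of the reference's 5 bigrams match, while the LCS still covers all 6 reference words because a subsequence may skip over the inserted word.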

-----

### Precision, Recall, and F1 in the Context of ROUGE

ROUGE is usually reported in terms of **recall**, **precision**, and **F1** (also called F-measure). These are standard evaluation metrics in information retrieval and summarization, adapted to count overlaps:

  * **Recall** measures how much of the reference summary’s content the model’s summary covered. A high recall means the model didn’t miss much from the reference.
  * **Precision** measures how much of the model’s summary was relevant to the reference. This tells us if the model’s summary added a lot of extra information or wording that wasn’t in the reference.
  * **F1 Score (F-Measure)** is the harmonic mean of precision and recall. The F1 gives a single combined score that balances recall and precision. In summarization, a balance is often desired: we want the summary to get most of the important stuff (high recall) and not stray too far or add too much extra (decent precision).
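
As a worked example of these definitions, the sketch below turns raw overlap counts into precision, recall, and F1. The function name `prf_from_counts` is an assumption made for illustration. The counts are for the reference "the cat was under the bed" (6 words, 5 bigrams) and the candidate "the cat was found under the bed" (7 words, 6 bigrams), which share 6 unigrams and 4 bigrams.

```python
def prf_from_counts(overlap, reference_total, candidate_total):
    """Compute (precision, recall, F1) from an overlap count and the two totals."""
    precision = overlap / candidate_total if candidate_total else 0.0
    recall = overlap / reference_total if reference_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Unigrams: 6 shared, 6 in the reference, 7 in the candidate
p1, r1, f1_1 = prf_from_counts(overlap=6, reference_total=6, candidate_total=7)
print(f"ROUGE-1: precision={p1:.3f}, recall={r1:.3f}, F1={f1_1:.3f}")
# ROUGE-1: precision=0.857, recall=1.000, F1=0.923

# Bigrams: 4 shared, 5 in the reference, 6 in the candidate
p2, r2, f1_2 = prf_from_counts(overlap=4, reference_total=5, candidate_total=6)
print(f"ROUGE-2: precision={p2:.3f}, recall={r2:.3f}, F1={f1_2:.3f}")
# ROUGE-2: precision=0.667, recall=0.800, F1=0.727
```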

-----

### Using ROUGE for Summarization Evaluation in Python

Let’s see how we can compute ROUGE scores in practice, and then interpret what those scores mean. For this, we can use the `rouge_score` library in Python. Here’s a simple example with a reference summary and a candidate (model-generated) summary:

```python
from rouge_score import rouge_scorer

# Initialize a ROUGE scorer for ROUGE-1, ROUGE-2, and ROUGE-L.
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
reference = "the cat was under the bed"
candidate = "the cat was found under the bed"

scores = scorer.score(reference, candidate)
for metric, score in scores.items():
    print(f"{metric}: precision={score.precision:.3f}, recall={score.recall:.3f}, F1={score.fmeasure:.3f}")
```

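This code compares a single candidate summary against a single reference using ROUGE-1, ROUGE-2, and ROUGE-L, and prints the precision, recall, and F1 score for each metric. Note the argument order: `scorer.score` takes the reference first and the candidate second. Running the same comparison on each model's summaries is what lets you compare their summarization capabilities.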

-----

### Interpreting ROUGE Scores

Interpreting ROUGE scores is essential for understanding model performance. The **ROUGE-L F1 score** reflects the balance between precision and recall, indicating how well the generated summary matches the reference summary. A higher score suggests a better match, meaning the model has captured more of the essential information. Conversely, a lower score may indicate that the model's summary is missing key details or includes irrelevant information. By comparing the scores of different models, you can determine which one performs better in generating accurate and concise summaries.
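
To illustrate how missing details and extra wording show up in the scores, the sketch below computes unigram precision, recall, and F1 for a candidate that is too short and one that is too long. `unigram_prf` is a hypothetical helper based on plain whitespace tokenization, not the `rouge_score` implementation, and the example sentences are made up for illustration.

```python
from collections import Counter


def unigram_prf(reference, candidate):
    """Unigram precision, recall, and F1 using whitespace tokens and clipped counts."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


reference = "the cat was under the bed"
candidates = {
    "too short": "the cat",
    "too long": "the cat was found hiding under the big bed",
}

for label, candidate in candidates.items():
    p, r, f1 = unigram_prf(reference, candidate)
    print(f"{label}: precision={p:.3f}, recall={r:.3f}, F1={f1:.3f}")
# too short: precision=1.000, recall=0.333, F1=0.500
# too long: precision=0.667, recall=1.000, F1=0.800
```

The short candidate contains nothing wrong but misses most of the reference, so recall drops. The long candidate covers the whole reference but pads it with extra words, so precision drops. F1 penalizes both.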

-----

### Summary and Next Steps

In this lesson, you learned how to evaluate text generation models using the ROUGE metric. We covered what ROUGE measures, how to compute it with the `rouge_score` library, and how to interpret the resulting scores. This knowledge will be useful as you move on to the practice exercises, where you'll apply these concepts to score and compare models on your own. Remember, evaluating models is a critical step in improving their performance and ensuring they meet your summarization needs. Good luck with the exercises, and continue to explore the fascinating world of text generation!

## Loading Data for ROUGE Evaluation

Now that you understand how ROUGE works for evaluating text summaries, let's start with the first practical step: loading our dataset. In this exercise, you'll work with a CSV file that contains news articles, their human-written summaries, and summaries generated by GPT-3.5. The columns in the CSV file are named `article`, `summary`, and `model_response_gpt3_5`.

Your task is to:

  * Open the provided CSV file `cnn_dailymail_with_gpt3_5.csv`.
  * Read and convert the data into a list of dictionaries.
  * Print a sample of the first three rows to verify successful loading.
  * Keep the complete dataset in `data` for later use.

This data loading step is crucial because, before we can evaluate model performance with ROUGE, we need clean, properly formatted data to work with. Mastering this foundation will prepare you for future tasks involving the ROUGE metrics. For now, focus on getting familiar with the dataset structure and content.

```python
import csv

file_path = "cnn_dailymail_with_gpt3_5.csv"
data = []

# TODO: Open the CSV file using a context manager (with statement)

# TODO: Read the CSV file and convert it to a list of dictionaries

# TODO: Print the first 3 rows to verify data loading
# Format should show row number and the first 100 characters of:
# - article
# - summary (human summary)
# - model_response_gpt3_5 (GPT-3.5 generated summary)

print(f"\nTotal rows loaded: {len(data)}")
```

```python
import csv

file_path = "cnn_dailymail_with_gpt3_5.csv"
data = []

# Open the CSV file using a context manager (with statement)
with open(file_path, 'r', encoding='utf-8') as csvfile:
    # Read the CSV file and convert it to a list of dictionaries
    csv_reader = csv.DictReader(csvfile)
    for row in csv_reader:
        data.append(row)

# Print the first 3 rows to verify data loading
print("Sample of first 3 rows:")
for i, row in enumerate(data[:3]):
    print(f"\n--- Row {i+1} ---")
    print(f"Article: {row['article'][:100]}...")
    print(f"Human Summary: {row['summary'][:100]}...")
    print(f"GPT-3.5 Summary: {row['model_response_gpt3_5'][:100]}...")

print(f"\nTotal rows loaded: {len(data)}")
```

## Setting Up ROUGE for Summary Evaluation

Now that you've learned about loading data for ROUGE evaluation, let's focus on the ROUGE metric itself. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a key tool for measuring how well a generated summary matches a reference summary.

In this exercise, you'll set up a ROUGE scorer and run a simple test to understand how it works. You'll:

  * Import the `rouge_scorer` module.
  * Create a ROUGE scorer with three common metrics (ROUGE-1, ROUGE-2, and ROUGE-L).
  * Compare the reference summary and the GPT-3.5 summary from the first row of the `cnn_dailymail_with_gpt3_5.csv` dataset.
  * Examine the precision, recall, and F1 scores for each metric.

This hands-on practice will help you understand what ROUGE actually measures before we apply it to evaluate model-generated summaries at scale. By working with a single example first, you'll gain a clearer understanding of how these metrics reflect summary quality.


```python
# TODO: Import the rouge_scorer from the rouge_score library
# TODO: Import the csv module to read the dataset

# TODO: Load the dataset and take the first row using csv.DictReader
# Use list(csv.DictReader(f)) to read the file and extract the first row

# TODO: Print the first row summaries
print("Reference Summary:")
# Print the reference summary here
print("\nModel's Summary:")
# Print the model's summary here

# TODO: Initialize the ROUGE scorer with metrics 'rouge1', 'rouge2', and 'rougeL'
# Make sure to enable the stemmer by setting use_stemmer=True

# TODO: Calculate ROUGE scores by comparing the reference and candidate summaries

# TODO: Print the ROUGE-1 scores (precision, recall, and F1)
print("\nROUGE-1 Scores:")
# Print precision, recall, and F1 here

# TODO: Print the ROUGE-2 scores (precision, recall, and F1)
print("\nROUGE-2 Scores:")
# Print precision, recall, and F1 here

# TODO: Print the ROUGE-L scores (precision, recall, and F1)
print("\nROUGE-L Scores:")
# Print precision, recall, and F1 here
```

```python
from rouge_score import rouge_scorer
import csv

# Load the dataset and take the first row using csv.DictReader
with open("cnn_dailymail_with_gpt3_5.csv", 'r', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))
    first_row = rows[0]
    
reference_summary = first_row['summary']
candidate_summary = first_row['model_response_gpt3_5']

# Print the first row summaries
print("Reference Summary:")
print(reference_summary)
print("\nModel's Summary:")
print(candidate_summary)

# Initialize the ROUGE scorer with metrics 'rouge1', 'rouge2', and 'rougeL'
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Calculate ROUGE scores by comparing the reference and candidate summaries
scores = scorer.score(reference_summary, candidate_summary)

# Print the ROUGE-1 scores (precision, recall, and F1)
print("\nROUGE-1 Scores:")
print(f"Precision: {scores['rouge1'].precision:.3f}")
print(f"Recall: {scores['rouge1'].recall:.3f}")
print(f"F1: {scores['rouge1'].fmeasure:.3f}")

# Print the ROUGE-2 scores (precision, recall, and F1)
print("\nROUGE-2 Scores:")
print(f"Precision: {scores['rouge2'].precision:.3f}")
print(f"Recall: {scores['rouge2'].recall:.3f}")
print(f"F1: {scores['rouge2'].fmeasure:.3f}")

# Print the ROUGE-L scores (precision, recall, and F1)
print("\nROUGE-L Scores:")
print(f"Precision: {scores['rougeL'].precision:.3f}")
print(f"Recall: {scores['rougeL'].recall:.3f}")
print(f"F1: {scores['rougeL'].fmeasure:.3f}")
```

## Evaluating Summaries with ROUGE Metrics

Now that you've successfully loaded the dataset, let's dive into evaluating text summaries using ROUGE. In this exercise, you'll calculate ROUGE scores to measure how well model-generated summaries match their reference counterparts.

You'll work with the CNN/Daily Mail dataset, which contains news articles, human-written reference summaries, and summaries generated by GPT-3.5. Your task is to:

  * Loop through each article in the dataset.
  * Calculate the ROUGE-L F1 score for each GPT-3.5 summary compared to its reference summary.
  * Store these scores in a list.
  * Print each article's individual score.
  * Calculate and display the average ROUGE-L F1 score across all articles.

This systematic evaluation will help you understand how well the model captures the key information from the reference summaries. By calculating scores for multiple examples and finding the average, you'll get a more reliable measure of the model's overall performance than looking at just one example. This approach is standard practice in NLP evaluation workflows.

```python
# TODO: Import the rouge_scorer from the rouge_score library
# TODO: Import the csv module to read the dataset

# TODO: Initialize the ROUGE scorer with metrics 'rouge1', 'rouge2', and 'rougeL'
# Make sure to enable the stemmer by setting use_stemmer=True

# TODO: Load the dataset from the CSV file
with open("cnn_dailymail_with_gpt3_5.csv") as f:
    # TODO: Convert the CSV reader to a list of dictionaries
    pass

# TODO: Create a list to store the ROUGE-L F1 scores

# TODO: Loop through each row in the dataset
for i, row in enumerate([]):  # Replace the empty list with your rows
    # TODO: Extract the reference summary and model summary from the current row
    
    # TODO: Calculate ROUGE scores using the scorer
    
    # TODO: Extract the ROUGE-L F1 score from the scores dictionary
    
    # TODO: Store the score in your list
    
    # TODO: Print the score for this article with its number
    pass  # Placeholder so the starter code parses; remove once the loop body is written

# TODO: Calculate the average ROUGE-L F1 score (sum of scores divided by number of articles)
# TODO: Print the average score with appropriate formatting
```

A common stumbling block in this exercise is `KeyError: 'reference_summary'`. The error message says exactly what is wrong: the code tries to access a dictionary key named `'reference_summary'`, but that key does not exist in the row dictionary.

When you use `csv.DictReader`, the first line of the CSV file supplies the column headers, and those headers become the keys of each row's dictionary. Running `head -n 5 cnn_dailymail_with_gpt3_5.csv` shows the header line:

`article,summary,model_response_gpt3_5`

So the human-written reference summary is stored under `'summary'`, not `'reference_summary'`.

**Here is the corrected code:**

```python
import csv
from rouge_score import rouge_scorer

# Initialize the ROUGE scorer with metrics 'rouge1', 'rouge2', and 'rougeL'
# Make sure to enable the stemmer by setting use_stemmer=True
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Load the dataset from the CSV file
with open("cnn_dailymail_with_gpt3_5.csv", mode='r', encoding='utf-8') as f:
    # Convert the CSV reader to a list of dictionaries
    reader = csv.DictReader(f)
    rows = list(reader)

# Create a list to store the ROUGE-L F1 scores
rouge_l_f1_scores = []

# Loop through each row in the dataset
for i, row in enumerate(rows):
    # Extract the reference summary and model summary from the current row
    # The reference summary is in the 'summary' column
    reference_summary = row['summary']
    # The model summary is in the 'model_response_gpt3_5' column
    model_summary = row['model_response_gpt3_5']
    
    # Calculate ROUGE scores using the scorer
    scores = scorer.score(reference_summary, model_summary)
    
    # Extract the ROUGE-L F1 score from the scores dictionary
    rouge_l_f1 = scores['rougeL'].fmeasure
    
    # Store the score in your list
    rouge_l_f1_scores.append(rouge_l_f1)
    
    # Print the score for this article with its number
    print(f"Article {i+1}: ROUGE-L F1 Score = {rouge_l_f1:.4f}")

# Calculate the average ROUGE-L F1 score (sum of scores divided by number of articles)
average_rouge_l_f1 = sum(rouge_l_f1_scores) / len(rouge_l_f1_scores)

# Print the average score with appropriate formatting
print(f"\nAverage ROUGE-L F1 Score across all articles: {average_rouge_l_f1:.4f}")
```

**Key change and explanation:**

The line

`reference_summary = row['reference_summary']`

becomes

`reference_summary = row['summary']`

This makes the code match the actual column names in the CSV file and resolves the `KeyError`. The `model_summary` variable likewise reads from the correct column, `'model_response_gpt3_5'`.
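
One way to catch this kind of mismatch early is to check the header before reading any rows. The sketch below does that with `csv.DictReader.fieldnames`. The helper name `load_rows` and the in-memory sample data are assumptions made so that the example runs without the course's CSV file.

```python
import csv
import io


def load_rows(file_obj, required_columns):
    """Read CSV rows as dictionaries, failing early if a required column is missing."""
    reader = csv.DictReader(file_obj)
    header = reader.fieldnames or []
    missing = [name for name in required_columns if name not in header]
    if missing:
        raise ValueError(f"Missing columns {missing}; the file has {header}")
    return list(reader)


# Tiny in-memory stand-in for the real file (contents are placeholders)
sample_csv = io.StringIO(
    "article,summary,model_response_gpt3_5\n"
    "Some article text,A human summary,A model summary\n"
)

rows = load_rows(sample_csv, ["summary", "model_response_gpt3_5"])
print(rows[0]["summary"])  # A human summary
```

With the real file, you would pass the open file handle instead of `sample_csv`. Asking for `"reference_summary"` would then raise a `ValueError` that lists the actual header, which is easier to act on than a `KeyError` raised deep inside the scoring loop.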

## Evaluating GPT-4 Summarization with ROUGE-L

In this task, you will evaluate the performance of GPT-4 on text summarization using the ROUGE metric, just as you did previously for GPT-3.5. We have prepared a file named `cnn_dailymail_with_gpt4.csv`, which contains reference summaries and the corresponding summaries generated by GPT-4 in the columns `article`, `summary`, and `model_response_gpt4`. Your objective is to calculate the ROUGE-L F1 score for each summary pair and the average ROUGE-L F1 score across all articles. Then compare the GPT-4 results with those from the previous GPT-3.5 evaluation to see how the two models differ in summarization quality. Follow the steps outlined in the starter code: import the necessary libraries, initialize the ROUGE scorer, load the dataset, calculate the ROUGE scores, and print both the individual and average ROUGE-L F1 scores.

```python
# TODO: Import the rouge_scorer from the rouge_score library
# TODO: Import the csv module to read the dataset

# TODO: Initialize the ROUGE scorer with metrics 'rouge1', 'rouge2', and 'rougeL'
# Make sure to enable the stemmer by setting use_stemmer=True

# TODO: Load the dataset from the CSV file
with open("cnn_dailymail_with_gpt4.csv") as f:
    # TODO: Convert the CSV reader to a list of dictionaries
    pass

# TODO: Create a list to store the ROUGE-L F1 scores

# TODO: Loop through each row in the dataset
for i, row in enumerate([]):  # Replace the empty list with your rows
    # TODO: Extract the reference summary and model summary from the current row
    
    # TODO: Calculate ROUGE scores using the scorer
    
    # TODO: Extract the ROUGE-L F1 score from the scores dictionary
    
    # TODO: Store the score in your list
    
    # TODO: Print the score for this article with its number
    pass  # Placeholder so the starter code parses; remove once the loop body is written

# TODO: Calculate the average ROUGE-L F1 score (sum of scores divided by number of articles)
# TODO: Print the average score with appropriate formatting
```

```python
import csv
from rouge_score import rouge_scorer

# Initialize the ROUGE scorer with metrics 'rouge1', 'rouge2', and 'rougeL'
# Make sure to enable the stemmer by setting use_stemmer=True
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Load the dataset from the CSV file
with open("cnn_dailymail_with_gpt4.csv", mode='r', encoding='utf-8') as f:
    # Convert the CSV reader to a list of dictionaries
    reader = csv.DictReader(f)
    rows = list(reader)

# Create a list to store the ROUGE-L F1 scores
rouge_l_f1_scores = []

# Loop through each row in the dataset
for i, row in enumerate(rows):
    # Extract the reference summary and model summary from the current row
    # NOTE: these column names are a guess; they raise KeyError if the CSV header differs
    reference_summary = row['reference_summary']
    model_summary = row['model_summary_gpt4']
    
    # Calculate ROUGE scores using the scorer
    scores = scorer.score(reference_summary, model_summary)
    
    # Extract the ROUGE-L F1 score from the scores dictionary
    rouge_l_f1 = scores['rougeL'].fmeasure
    
    # Store the score in your list
    rouge_l_f1_scores.append(rouge_l_f1)
    
    # Print the score for this article with its number
    print(f"Article {i+1}: ROUGE-L F1 Score = {rouge_l_f1:.4f}")

# Calculate the average ROUGE-L F1 score (sum of scores divided by number of articles)
average_rouge_l_f1 = sum(rouge_l_f1_scores) / len(rouge_l_f1_scores)

# Print the average score with appropriate formatting
print(f"\nAverage ROUGE-L F1 Score across all articles: {average_rouge_l_f1:.4f}")
```

This code is similar to the one used for GPT-3.5, adapted to the new file name. The column names `reference_summary` and `model_summary_gpt4` are a guess based on the file's purpose. Because `csv.DictReader` uses the first row of the CSV file as the dictionary keys, the names in the code must match the header exactly. The following video shows how to use Python's `csv.DictReader` to process data from a CSV file.

[Reading CSV Files in Python with csv.DictReader](https://www.youtube.com/watch?v=5CEsJkKhS78)

Running the code above raises `KeyError: 'reference_summary'`. As before, the traceback together with the file's header line (visible with `head -n 2 cnn_dailymail_with_gpt4.csv`) gives the exact information needed to fix the code.

The `csv.DictReader` class reads the first line of the CSV file as the column headers and uses them as keys for each row's dictionary. This file's header is:

`article,summary,model_response_gpt4`

This means the keys for each row's dictionary are `'article'`, `'summary'`, and `'model_response_gpt4'`.

The code tries to access a key named `'reference_summary'`, which does not exist. The correct key for the human-written summary is `'summary'`, and the GPT-4 summary is stored under `'model_response_gpt4'`.

**Here is the corrected code:**

```python
import csv
from rouge_score import rouge_scorer

# Initialize the ROUGE scorer with metrics 'rouge1', 'rouge2', and 'rougeL'
# Make sure to enable the stemmer by setting use_stemmer=True
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Load the dataset from the CSV file
with open("cnn_dailymail_with_gpt4.csv", mode='r', encoding='utf-8') as f:
    # Convert the CSV reader to a list of dictionaries
    reader = csv.DictReader(f)
    rows = list(reader)

# Create a list to store the ROUGE-L F1 scores
rouge_l_f1_scores = []

# Loop through each row in the dataset
for i, row in enumerate(rows):
    # Extract the reference summary and model summary from the current row
    # Use 'summary' as the key for the reference summary
    reference_summary = row['summary']
    # Use 'model_response_gpt4' for the GPT-4 summary
    model_summary = row['model_response_gpt4']
    
    # Calculate ROUGE scores using the scorer
    scores = scorer.score(reference_summary, model_summary)
    
    # Extract the ROUGE-L F1 score from the scores dictionary
    rouge_l_f1 = scores['rougeL'].fmeasure
    
    # Store the score in your list
    rouge_l_f1_scores.append(rouge_l_f1)
    
    # Print the score for this article with its number
    print(f"Article {i+1}: ROUGE-L F1 Score = {rouge_l_f1:.4f}")

# Calculate the average ROUGE-L F1 score (sum of scores divided by number of articles)
average_rouge_l_f1 = sum(rouge_l_f1_scores) / len(rouge_l_f1_scores)

# Print the average score with appropriate formatting
print(f"\nAverage ROUGE-L F1 Score across all articles: {average_rouge_l_f1:.4f}")
```
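
Once both evaluation loops have run, the two score lists can be compared directly. The sketch below is one possible way to do that. `compare_models` is a hypothetical helper, and the numbers in the usage example are made-up placeholders, not results from either model. It also assumes both files contain the same articles in the same order.

```python
from statistics import mean


def compare_models(scores_a, scores_b, name_a="GPT-3.5", name_b="GPT-4"):
    """Summarize two equal-length lists of per-article ROUGE-L F1 scores."""
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("Both models need scores for the same, non-empty set of articles")
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    return {
        f"{name_a} average": mean(scores_a),
        f"{name_b} average": mean(scores_b),
        f"{name_a} wins": wins_a,
        f"{name_b} wins": wins_b,
        "ties": len(scores_a) - wins_a - wins_b,
    }


# Placeholder scores for illustration only (NOT real evaluation results)
placeholder_gpt35 = [0.25, 0.5, 0.75]
placeholder_gpt4 = [0.5, 0.5, 0.25]

for key, value in compare_models(placeholder_gpt35, placeholder_gpt4).items():
    print(f"{key}: {value}")
```

Per-article wins are reported alongside the averages because an average can be dominated by a handful of articles. In practice, you would pass in the `rouge_l_f1_scores` lists produced by the two evaluation scripts.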
