# **Task - Summarization**

## Introduction

Summarization is a core task in Natural Language Processing (NLP), where the goal is to condense a lengthy input text into a shorter one while retaining the core information of the original.

This tutorial provides a hands-on introduction to evaluating summarization with the GPT-3.5 Turbo language model.

The tutorial uses Extreme Summarization (XSUM), an established benchmark for evaluating summarization tasks. It includes code to load and explore the XSUM dataset, run inference with GPT-3.5 Turbo via the OpenAI API, and evaluate the model's performance by comparing the LLM-generated summaries with the human-written summaries (ground truth).


The XSUM benchmark has a large collection of news articles and summaries.  

For additional information about the XSUM benchmark, see https://arxiv.org/pdf/1808.08745

## **Step 1 - Install Pre-requisites**

In Step 1, we install and import the following prerequisites:

- `openai`: For interacting with the OpenAI API and querying the LLM
- `python-dotenv`: To manage API keys securely using environment variables
- `datasets`: Provides easy access to a wide variety of datasets commonly used for natural language processing tasks
- `tqdm`: Adds progress bars to loops, making it easier to monitor progress
- `rouge_score`: Computes ROUGE, an n-gram overlap metric for the quality of a generated summary
- `nltk`: An NLP library for tokenization and text analysis
- `bert_score`: Computes BERTScore, an embedding-similarity metric for the quality of a generated summary
- `evaluate`: A library of metrics useful for evaluating NLP tasks

In [None]:
%pip install openai==0.28 python-dotenv datasets tqdm rouge_score nltk bert_score evaluate

In [None]:
from dotenv import load_dotenv
from IPython.display import display, Markdown
from rouge_score import rouge_scorer
from bert_score import score as bert_score
from bert_score import BERTScorer
from datasets import load_dataset
import evaluate
import os
import openai

## **Step 2 - Load LLM**

Next, to access the GPT-3.5 Turbo model through the OpenAI API, we load an API key and set it on the `openai` client.

In [None]:
# Load API key from environment file
load_dotenv(dotenv_path="../apikey.env.txt")  # replace the "file path" with the location of your API key file

APIKEY = os.getenv("APIKEY")

openai.api_key = APIKEY
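
Because `os.getenv` returns `None` when a variable is missing, a failed key load would otherwise only surface at the first API call. The following is a minimal sketch of a fail-fast check, assuming the variable name `APIKEY` used above. `require_env` is a hypothetical helper introduced here for illustration; it is not part of `python-dotenv` or `openai`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or blank."""
    value = os.getenv(name)
    if value is None or not value.strip():
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "check the path passed to load_dotenv()."
        )
    return value.strip()


# Hypothetical use in this notebook (assumes load_dotenv() has already run):
# openai.api_key = require_env("APIKEY")
```

Raising early keeps a configuration problem separate from API or network errors later in the notebook.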

## **Step 3 - Load test dataset**

The XSUM dataset consists of news articles and their summaries. Next, we will download the XSUM dataset using the `datasets` library from Hugging Face.

In [None]:
# Download the XSUM dataset
xsum_dataset = load_dataset("xsum")

**Exploring the XSUM dataset**

The XSUM test set contains 11,334 test instances.

Each test instance has a source text, which will be passed as input to the LLM, and a summary that will serve as the ground truth.

The code below displays the test set information and a sample instance from the test set.


In [None]:
# test set
test_set = xsum_dataset['test']
num_test_instances = len(test_set)

# Print basic information about the test set and display a sample test instance
print("=" * 50)
print("XSUM Test Set Summary")
print("=" * 50)
print(f"The XSUM test set contains {len(test_set)} instances.")
print("\nExample test instance: Test instance # 1")
print("=" * 50)

print("\nSource text:\n", test_set[0]['document'])
print("\nGround truth summary:\n", test_set[0]['summary'])

print("=" * 50)

Running summarization evaluation on the entire test set can be time-consuming and costly.

For demonstration purposes, enter the number of test instances to use for evaluating the LLM. The evaluation loop in Step 4 takes that many instances from the start of the test set.

If no input or an invalid input is provided, the notebook evaluates 10 test instances by default.

In [None]:
try:
    num_instances_to_evaluate = int(input("Enter the number of test instances to evaluate (or press Enter to use the default 10): ") or 10)
except ValueError:
    print("Invalid input. Using the default of 10 test instances.")
    num_instances_to_evaluate = 10


# Ensure the user input is within a valid range
num_instances_to_evaluate = min(max(num_instances_to_evaluate, 1), num_test_instances)

print(f"\033[34mYou have chosen {num_instances_to_evaluate} test instances for evaluation\033[0m")
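
The evaluation loop later in this notebook takes its instances from the start of the test set via `test_set.select(range(...))`. If a random sample is preferred, one option is to draw the indices with a fixed seed so that reruns evaluate the same instances. This is a sketch under that assumption; `sample_indices` and the seed value are hypothetical and not part of the original notebook.

```python
import random


def sample_indices(num_total: int, num_samples: int, seed: int = 42) -> list:
    """Pick `num_samples` distinct indices from range(num_total), reproducibly."""
    # Clamp the request to the valid range, mirroring the check above.
    num_samples = min(max(num_samples, 1), num_total)
    rng = random.Random(seed)  # local generator; leaves the global random state untouched
    return sorted(rng.sample(range(num_total), num_samples))


# Hypothetical use with the notebook's variables:
# subset = test_set.select(sample_indices(num_test_instances, num_instances_to_evaluate))
indices = sample_indices(num_total=100, num_samples=5)
print(indices)
```

If a subset like this is used in the evaluation loop, the display cell at the end of the notebook would also need to read its source texts from the same subset rather than from `test_set['document']`.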

Next, we set up the metrics used to evaluate the performance of Large Language Models (LLMs) on summarization tasks.

In this step, the scorers for these evaluations are initialized. The tutorial covers both token-overlap metrics (ROUGE and METEOR) and an embedding-similarity metric (BERTScore). Together, these metrics measure how closely the LLM's summaries match the ground truth.

In [None]:
# Initialize the ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'])

# Initialize the METEOR scorer
meteor_score = evaluate.load("meteor")

# Initialize the BERTScore scorer
bertScore = BERTScorer(lang="en")
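
To make "token overlap" concrete, the sketch below computes a simplified ROUGE-1 by hand: precision, recall, and F1 over shared unigrams. It assumes lowercased whitespace tokenization, which is simpler than the tokenization in the `rouge_score` package, so its numbers will not always match the library's. `rouge1_sketch` is a hypothetical name used only for this illustration.

```python
from collections import Counter


def rouge1_sketch(reference: str, candidate: str) -> dict:
    """Unigram-overlap precision, recall, and F1 on lowercased whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1_sketch("the cat sat on the mat", "the cat lay on a mat")
print(scores)
```

Here four of the six candidate tokens also appear in the reference ("the", "cat", "on", "mat"), so precision and recall are both 4/6. BERTScore differs in that it compares contextual embeddings of tokens rather than exact string matches, so paraphrases can still score well.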

## **Step 4 - Prompt Construction**



Next, a prompt is constructed to interact with the GPT-3.5 Turbo model and produce a summary of the input.

In this tutorial, the model's output is capped at 500 tokens via `max_tokens`; the system prompt additionally asks for a summary about as long as the ground-truth summary.

In [None]:
def GetModelResponse(system_content, user_content):
    system = {'role': 'system', 'content': system_content}
    user = {'role': 'user', 'content': user_content}

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=1.0,
        messages=[system, user],
        max_tokens=500
    )

    content = response.choices[0].message.content
    return content
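
API calls can fail transiently, for example because of rate limits or timeouts, and a single failure would otherwise stop the evaluation loop partway through. The following is a minimal retry sketch with exponential backoff. `call_with_retries` and its parameters are hypothetical and not part of the `openai` library; catching every exception is a simplification, and in practice the retry would likely be restricted to the client's rate-limit and timeout errors.

```python
import time


def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical use with the function defined above:
# summary = call_with_retries(GetModelResponse, system_content, user_content)
```

The `sleep` parameter exists only so the wait can be replaced in tests; the default is `time.sleep`.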

In [None]:
# Lists to store the results
model_generated_summaries = []
ground_truth_summaries = []
rouge_scores = []
meteor_scores = []
bert_scores = []

# Iterate through the test set and generate summaries
for instance in test_set.select(range(num_instances_to_evaluate)):
    input_text = instance['document']
    ground_truth_summary = instance['summary']

    # Ask for a summary roughly as long as the reference (measured in words)
    target_length = len(ground_truth_summary.split())
    system_content = f"You are a helpful assistant that generates concise summaries of the given text. The summary should be about {target_length} words long."
    user_content = input_text
    model_generated_summary = GetModelResponse(system_content, user_content)

    # Evaluation - ROUGE
    rouge_result = scorer.score(ground_truth_summary, model_generated_summary)

    # Evaluation - BERTScore
    # Step 1 - Wrap each summary in a list, since the scorers expect batches
    model_generated_summary_list = [model_generated_summary]
    ground_truth_summary_list = [ground_truth_summary]
    # Step 2 - Calculate BERTScore with the scorer initialized earlier,
    # so the underlying model is not reloaded on every iteration.
    # Returns a (precision, recall, F1) tuple of tensors.
    bert_result = bertScore.score(model_generated_summary_list, ground_truth_summary_list)

    # Evaluation - Meteor Score
    meteor_result = meteor_score.compute(predictions=model_generated_summary_list, references=ground_truth_summary_list)

    # Store the results
    model_generated_summaries.append(model_generated_summary)
    ground_truth_summaries.append(ground_truth_summary)
    rouge_scores.append(rouge_result)
    meteor_scores.append(meteor_result)
    bert_scores.append(bert_result)

In [None]:
# To display results
for i, (source_text, ground_truth_summary, model_generated_summary, rouge_result, meteor_result, bert_score_result) in enumerate(zip(test_set['document'], ground_truth_summaries, model_generated_summaries, rouge_scores, meteor_scores, bert_scores)):
    display(Markdown(f"**Source Text:**\n{source_text}"))
    display(Markdown(f"**Ground Truth Summary:**\n{ground_truth_summary}"))
    display(Markdown(f"**Model Generated Summary:**\n{model_generated_summary}"))

    display(Markdown(f"**ROUGE Scores:**  \nROUGE-1: {rouge_result['rouge1'].fmeasure:.4f}  \nROUGE-2: {rouge_result['rouge2'].fmeasure:.4f}  \nROUGE-L: {rouge_result['rougeL'].fmeasure:.4f}"))
    display(Markdown(f"**METEOR Score:** {meteor_result.get('meteor', 'Not available'):.4f}" if 'meteor' in meteor_result else "**METEOR Score:** Not available"))

    display(Markdown(f"**BERTScore Precision:** {bert_score_result[0].item():.4f}\n"))
    display(Markdown(f"**BERTScore Recall:** {bert_score_result[1].item():.4f}\n"))
    display(Markdown(f"**BERTScore F1:** {bert_score_result[2].item():.4f}\n"))
    display(Markdown("---"))
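
The cell above reports scores one instance at a time. To compare runs, it is often convenient to average each metric over all evaluated instances. The sketch below does this for a list of per-instance dictionaries; `average_scores` is a hypothetical helper, and the numbers are illustrative only, not results from the model.

```python
from statistics import mean


def average_scores(per_instance):
    """Average each metric across a list of {metric_name: float} dicts."""
    if not per_instance:
        return {}
    metric_names = per_instance[0].keys()
    return {name: mean(row[name] for row in per_instance) for name in metric_names}


# Illustrative numbers only, not results from the model.
toy_scores = [
    {"rouge1": 0.25, "meteor": 0.5, "bert_f1": 0.75},
    {"rouge1": 0.75, "meteor": 0.5, "bert_f1": 1.0},
]
averages = average_scores(toy_scores)
print(averages)
```

With the notebook's lists, each row could be built as `{"rouge1": r["rouge1"].fmeasure, "meteor": m["meteor"], "bert_f1": b[2].item()}` for `r, m, b` in `zip(rouge_scores, meteor_scores, bert_scores)`.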