# Get data set for semantic analysis

Working with the dataset from this paper (https://arxiv.org/abs/1808.08745), available on Hugging Face at https://huggingface.co/datasets/EdinburghNLP/xsum

"We introduce extreme summarization, a new single-document summarization task which does not favor extractive strategies and calls for an abstractive modeling approach. The idea is to create a short, one-sentence news summary answering the question "What is the article about?". We collect a real-world, large-scale dataset for this task by harvesting online articles from the British Broadcasting Corporation (BBC)."

In [1]:
from datasets import load_dataset

ds = load_dataset("EdinburghNLP/xsum")


In [2]:
ds["test"][0]

{'document': 'Prison Link Cymru had 1,099 referrals in 2015-16 and said some ex-offenders were living rough for up to a year before finding suitable accommodation.\nWorkers at the charity claim investment in housing would be cheaper than jailing homeless repeat offenders.\nThe Welsh Government said more people than ever were getting help to address housing problems.\nChanges to the Housing Act in Wales, introduced in 2015, removed the right for prison leavers to be given priority for accommodation.\nPrison Link Cymru, which helps people find accommodation after their release, said things were generally good for women because issues such as children or domestic violence were now considered.\nHowever, the same could not be said for men, the charity said, because issues which often affect them, such as post traumatic stress disorder or drug dependency, were often viewed as less of a priority.\nAndrew Stevens, who works in Welsh prisons trying to secure housing for prison leavers, said the

# Calculate Similarity Metrics

As per this paper: https://arxiv.org/html/2402.17008v1#S4

"ROUGE: ROUGE Lin (2004) is a family of metrics that score the lexical overlap between the generated text and the reference text. We used 3 variations, R-1, R-2, and R-L, which are widely adopted for evaluating text summarizing tasks. However, despite its popularity, works like Akter et al. (2022) and Bansal et al. (2022b) show that ROUGE is an unsuitable metric for comparing semantics. For this reason we also evaluate using metrics that have been designed with semantic awareness in mind.

BERTscore: While ROUGE can only convey information about lexical overlap, BERTscore is a metric that utilizes contextual embeddings from transformer models like BERT to evaluate the semantic similarity between the generated text and reference text. For this study, we compute BERTscore with the hashcode roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.36.2)-rescaled.

SEM-F1: While ROUGE and BERTscore are useful and powerful metrics, SEM-F1 was specifically designed for the SOS task. SEM-F1 leverages rigorously fine-tuned sentence encoders to evaluate the SOS task using sentence-level similarity. It differs from BERTscore as BERTscore computes token-level similarity. For this study, we compute SEM-F1 with underlying models: USE Cer et al. (2018), RoBERTa Zhuang et al. (2021), and DistilRoBERTa Sanh et al. (2019)."
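To make the lexical-overlap idea behind ROUGE concrete, here is a minimal sketch of a ROUGE-1 style unigram overlap in plain Python. It is an illustration only: the function name `rouge1_sketch` is hypothetical, and the whitespace tokenization is a simplifying assumption (the `rouge_score` package used later applies its own tokenizer and optional stemming, so its numbers can differ).

```python
from collections import Counter

def rouge1_sketch(reference, candidate):
    """Unigram-overlap precision, recall and F1 (simplified ROUGE-1)."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Five of the six words match, so precision and recall are both 5/6.
p, r, f1 = rouge1_sketch("the cat sat on the mat", "the cat lay on the mat")

# Synonyms get no credit: only "a" overlaps here.
p_syn, r_syn, f1_syn = rouge1_sketch("a car", "a vehicle")
```

The second call shows the limitation the quoted passage describes: "car" and "vehicle" share no surface form, so a purely lexical metric treats them as unrelated.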

BERTScore is a metric for evaluating the similarity between two pieces of text (like a generated summary and a reference summary) using deep contextual embeddings from models like BERT.

How does it work?

1. Token embeddings: Both the candidate and reference texts are tokenized and passed through a pre-trained BERT model (or a similar transformer). Each token gets a contextual embedding (a vector).
2. Similarity matrix: For each token in the candidate, BERTScore computes the cosine similarity with every token in the reference, resulting in a similarity matrix.
3. Precision, recall, F1:
   - Precision: For each candidate token, find the most similar reference token, then average these scores.
   - Recall: For each reference token, find the most similar candidate token, then average these scores.
   - F1: Harmonic mean of precision and recall.
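
The matching steps above can be sketched without any model, using made-up 2-d vectors in place of real contextual embeddings. This is a simplified illustration, not the `bert_score` implementation: the function names and the toy vectors are hypothetical, and optional features of the real metric such as idf weighting and baseline rescaling are left out.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match_scores(cand_emb, ref_emb):
    """Greedy-matching precision, recall and F1 over token embeddings."""
    # Similarity matrix: one row per candidate token, one column per reference token.
    sim = [[cosine(c, r) for r in ref_emb] for c in cand_emb]
    # Precision: best reference match for each candidate token, averaged.
    precision = sum(max(row) for row in sim) / len(cand_emb)
    # Recall: best candidate match for each reference token, averaged.
    recall = sum(max(col) for col in zip(*sim)) / len(ref_emb)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy "embeddings" (assumed values): 2 candidate tokens, 3 reference tokens.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
p, r, f1 = greedy_match_scores(candidate, reference)
```

Every candidate token has an exact counterpart in the reference, so precision is 1.0, while the middle reference token is only partially covered and pulls recall below 1.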


Why is it better than ROUGE/BLEU?

- Semantic matching: BERTScore can match words with similar meanings, even if the exact words are different (e.g., "car" and "vehicle").
- Context awareness: It uses the context of each word, not just the word itself.

In [5]:
from bert_score import score
from rouge_score import rouge_scorer 
import json


In [None]:
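scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
# RougeScorer.score expects (target, prediction): here the article is the target
# and its reference summary is the text being scored.
scores = scorer.score(ds["test"][0]["document"], ds["test"][0]["summary"])
print(scores['rouge1'])  # Score tuple with precision, recall and fmeasure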

In [31]:
ds["test"][0]["summary"]

'There is a "chronic" need for more housing for prison leavers in Wales, according to a charity.'

In [34]:
P, R, F1 = score(["Some test string"], [ds["test"][0]["summary"]], lang="en")
print(F1)



# Prompt Generation 
Generate the prompt following the TELeR prompt format:

Task: Summarize the following newsletter article in exactly one sentence that captures its core message.

Explanation: You are summarizing for industry professionals who need a fast, high-level understanding of the article. Your summary should include the key topic, any notable findings or updates, and the article’s main implication or takeaway.

Limitations:

Do not exceed one sentence.
Do not use bullet points or lists.
Do not add commentary, opinion, or context not present in the original article.
Use clear, informative language appropriate for a professional audience.

Input Article:
[Insert any of the test set articles]

In [7]:
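import os

test_set = ds["test"]
to_test = 40  # number of test articles to export
os.makedirs("data", exist_ok=True)  # the export loop writes into data/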

In [9]:
for i in range(0, min(to_test, len(test_set))):
    article_text = test_set[i]["document"]
    ground_truth = test_set[i]["summary"]
    prompt_template = (
        "Task: Summarize the following newsletter article in exactly one sentence that captures its core message.\n\n"
        "Explanation: You are summarizing for industry professionals who need a fast, high-level understanding of the article. "
        "Your summary should include the key topic, any notable findings or updates, and the article’s main implication or takeaway.\n\n"
        "Limitations:\n\n"
        "Do not exceed one sentence.\n"
        "Do not use bullet points or lists.\n"
        "Do not add commentary, opinion, or context not present in the original article.\n"
        "Use clear, informative language appropriate for a professional audience.\n\n"
        "Input Article:\n"
        f"{article_text}"
    )
    data = {"text": prompt_template, "truth": ground_truth}
    # Write one compact JSON object per line so the file is valid JSON Lines.
    with open(f"data/prompt_{i}.jsonl", "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False)
        f.write("\n")
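
For a later evaluation step, the exported files can be read back one at a time. The sketch below is a hedged illustration: `load_prompt` is a hypothetical helper, and a temporary directory with a made-up record stands in for the real `data/` folder so the example runs on its own. It assumes each file holds exactly one JSON object with the `text` and `truth` keys written by the loop above.

```python
import json
import tempfile
from pathlib import Path

def load_prompt(path):
    """Read one exported prompt file and return its dict with 'text' and 'truth'."""
    with open(path, encoding="utf-8") as f:
        # Each file holds a single JSON object, so json.load reads it whole.
        return json.load(f)

# Stand-in for data/prompt_0.jsonl, using a made-up record.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "prompt_0.jsonl"
    sample = {
        "text": "Task: Summarize the article.\n\nInput Article:\nSome article text",
        "truth": "A reference summary.",
    }
    sample_path.write_text(json.dumps(sample, ensure_ascii=False) + "\n", encoding="utf-8")
    loaded = load_prompt(sample_path)
```

The `truth` field can then be passed as the reference when scoring a model-generated summary for the matching `text` prompt.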