### References:

https://cookbook.openai.com/examples/evaluation/how_to_eval_abstractive_summarization

https://arxiv.org/pdf/2303.16634

In [1]:
import config
from openai import OpenAI
import pandas as pd
import json

# Load the ticket queries and their generated summaries; the context manager closes the file.
with open('results.json') as f:
    tickets = json.load(f)

summaries = {
    'No': [],
    'Query': [],
    'Summary': [],
}

for k,v in tickets.items():
    summaries['No'].append(k)
    summaries['Query'].append(v['Query'])
    summaries['Summary'].append(v['Summary'])

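# The OpenAI client reads the API key from the OPENAI_API_KEY environment variable by default.

client = OpenAI()

# Sample query and summary for a quick manual check; the evaluation loop below does not use them.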
excerpt = "To whome it may concern, I have brought up this issue previously. Nothing positively changed or happened. I am so queries, why everytime Causewaylink staff must skip few buses early morning and wasting people's time those start queuing early morning . I understand , your first bus service start at 7.30 a.m . Why it's so difficult to changed it to 7 a.m and do some consideration on people family time . Let's say my work finish at 5.30 morning. Am I suppose to queue up and wait until 7.30 a.m ? Every minutes is valuable for us after work . Just like today , I reached at Tuas Link Causewaylink point at 6 morning . I saw almost 12 of them waited for the bus . And this staff reached 6.45 there and sit on the chair like doll . So many buses came from Malaysia with full of passenger. Almost 5 buses (CW7) were skipped by him. Why can't we board the bus earlier ??? I understand some buses will u-turn back and some will enter Malaysia again . Why can't allow us to board those bus which enter back malaysia ? One of the lady was pregnant . She was looks so tired due to this first bus board time. Kindly consider about the people's family time since we all depending on public bus services. Every single of us rushing to home after work . Please revise ur first bus board time at every causelink . From , Volunteer From Hard Pain Customers."
eval_summary_1 = "Customer is complaning about the bus company"

# Evaluation prompt template based on G-Eval
EVALUATION_PROMPT_TEMPLATE = """
You will be given one summary written for an article. Your task is to rate the summary on one metric.
Please make sure you read and understand these instructions very carefully. 
Please keep this document open while reviewing, and refer to it as needed.
Provide a single numerical score (integer) only for the metric.
Directly assign a score based on the Evaluation Criteria provided.
Do not include anything else, and do not ask clarifying questions.
Be neutral and precise.

Evaluation Criteria:

{criteria}

Evaluation Steps:

{steps}

Example:

Source Text:

{document}

Summary:

{summary}

Evaluation Form (scores ONLY):

- {metric_name} 
"""

# Metric 1: Relevance

RELEVANCY_SCORE_CRITERIA = """
Relevance(1-5) - selection of important content from the source. \
The summary should include only important information from the source document. \
Annotators were instructed to penalize summaries which contained redundancies and excess information.
"""

RELEVANCY_SCORE_STEPS = """
1. Read the summary and the source document carefully.
2. Compare the summary to the source document and identify the main points of the article.
3. Assess how well the summary covers the main points of the article, and how much irrelevant or redundant information it contains.
4. Assign a relevance score from 1 to 5.
"""

# Metric 2: Coherence

COHERENCE_SCORE_CRITERIA = """
Coherence(1-5) - the collective quality of all sentences. \
We align this dimension with the DUC quality question of structure and coherence \
whereby "the summary should be well-structured and well-organized. \
The summary should not just be a heap of related information, but should build from sentence to sentence to a \
coherent body of information about a topic."
"""

COHERENCE_SCORE_STEPS = """
1. Read the article carefully and identify the main topic and key points.
2. Read the summary and compare it to the article. Check if the summary covers the main topic and key points of the article,
and if it presents them in a clear and logical order.
3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.
"""

# Metric 3: Consistency

CONSISTENCY_SCORE_CRITERIA = """
Consistency(1-5) - the factual alignment between the summary and the summarized source. \
A factually consistent summary contains only statements that are entailed by the source document. \
Annotators were also asked to penalize summaries that contained hallucinated facts.
"""

CONSISTENCY_SCORE_STEPS = """
1. Read the article carefully and identify the main facts and details it presents.
2. Read the summary and compare it to the article. Check if the summary contains any factual errors that are not supported by the article.
3. Assign a score for consistency based on the Evaluation Criteria.
"""

# Metric 4: Fluency

FLUENCY_SCORE_CRITERIA = """
Fluency(1-3): the quality of the summary in terms of grammar, spelling, punctuation, word choice, and sentence structure.
1: Poor. The summary has many errors that make it hard to understand or sound unnatural.
2: Fair. The summary has some errors that affect the clarity or smoothness of the text, but the main points are still comprehensible.
3: Good. The summary has few or no errors and is easy to read and follow.
"""

FLUENCY_SCORE_STEPS = """
Read the summary and evaluate its fluency based on the given criteria. Assign a fluency score from 1 to 3.
"""


def get_geval_score(
    criteria: str, steps: str, document: str, summary: str, metric_name: str
) -> str:
    """Ask the model to grade one summary on one metric and return its raw reply."""
    prompt = EVALUATION_PROMPT_TEMPLATE.format(
        criteria=criteria,
        steps=steps,
        metric_name=metric_name,
        document=document,
        summary=summary,
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": prompt},
        ],
        temperature=0,  # keep grading as repeatable as possible
        max_tokens=1,   # scores are single digits, so the reply is cut to one token
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )

    return response.choices[0].message.content

def highlight_max(s):
    """Style the highest value(s) in a Series light green and the rest white."""
    is_max = s == s.max()
    return [
        "background-color: lightgreen; color: black" if v else "background-color: white; color: black"
        for v in is_max
    ]

evaluation_metrics = {
    "Relevance": (RELEVANCY_SCORE_CRITERIA, RELEVANCY_SCORE_STEPS),
    "Coherence": (COHERENCE_SCORE_CRITERIA, COHERENCE_SCORE_STEPS),
    "Consistency": (CONSISTENCY_SCORE_CRITERIA, CONSISTENCY_SCORE_STEPS),
    "Fluency": (FLUENCY_SCORE_CRITERIA, FLUENCY_SCORE_STEPS),
}

data = {"Evaluation Type": [], "Summary Type": [], "Score": []}

for eval_type, (criteria, steps) in evaluation_metrics.items():
    for no, query, summary in zip(summaries['No'], summaries['Query'], summaries['Summary']):
        result = get_geval_score(criteria, steps, query, summary, eval_type)

        data["Evaluation Type"].append(eval_type)
        data["Summary Type"].append(f'Summary {no}')
        # The prompt asks for a bare integer; int() raises ValueError on anything else.
        data["Score"].append(int(result.strip()))

pivot_df = pd.DataFrame(data, index=None).pivot(
    index="Evaluation Type",
    columns="Summary Type",
    values="Score"
)

sorted_columns = sorted(pivot_df.columns, key=lambda x: int(x.split()[-1]))
pivot_df = pivot_df[sorted_columns]

styled_pivot_df = pivot_df.style.apply(highlight_max, axis=1)
display(styled_pivot_df)

Summary Type,Summary 0,Summary 1,Summary 2,Summary 3,Summary 4,Summary 5,Summary 6,Summary 13,Summary 14,Summary 16,Summary 17,Summary 23,Summary 26,Summary 27,Summary 28,Summary 29,Summary 30,Summary 32,Summary 33,Summary 69,Summary 70,Summary 75,Summary 78,Summary 87,Summary 89,Summary 100,Summary 106,Summary 141,Summary 144,Summary 158,Summary 161,Summary 175,Summary 190,Summary 230,Summary 242,Summary 246,Summary 252,Summary 257,Summary 314,Summary 321,Summary 324,Summary 360,Summary 370,Summary 372,Summary 384,Summary 414,Summary 439,Summary 445,Summary 648,Summary 659,Summary 732,Summary 797,Summary 880,Summary 892,Summary 896,Summary 902,Summary 918,Summary 938,Summary 939,Summary 956,Summary 966,Summary 973,Summary 1149,Summary 1171,Summary 1173,Summary 1238,Summary 1239,Summary 1314,Summary 1359,Summary 1373,Summary 1388,Summary 1395,Summary 1402,Summary 1744,Summary 1797
Coherence,5,5,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,3,5,5,5,5,5,4,5,5,5,4,5,5,5,5,5,5,5,5,5,5,5
Consistency,5,5,1,5,5,5,5,5,5,5,5,5,5,1,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,3,5,5,5,1,5,5,5,5,5,5,5,5,5,5,5
Fluency,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
Relevance,5,5,3,4,5,3,4,5,5,5,3,5,5,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,4,5,5,5,5,4,5,5,5,5,5,5,5,5,5,5,5,5,5,1,5,5,5,5,4,5,5,5,5,4,5,5,5,2,5,5,5,5,4,5,5,4,5,5,5
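
The loop above converts each reply with `int(result.strip())`, which raises `ValueError` whenever the model returns anything other than a bare integer. A minimal sketch of a more defensive parser follows. `parse_score` and its `low`/`high` parameters are hypothetical names that are not part of the notebook; the default range mirrors the 1-5 scales in the criteria, and `high=3` would cover Fluency.

```python
import re
from typing import Optional


def parse_score(reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the first integer found in `reply` if it lies in [low, high], else None."""
    match = re.search(r"-?\d+", reply or "")
    if match is None:
        return None  # no digits at all, e.g. "N/A"
    score = int(match.group())
    return score if low <= score <= high else None


print(parse_score("5"))             # 5
print(parse_score(" 3\n", high=3))  # 3
print(parse_score("N/A"))           # None
print(parse_score("7"))             # None (outside the 1-5 scale)
```

Returning `None` rather than raising lets the caller decide whether to retry the request or record a missing value for that cell.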


In [5]:
pivot_df.to_csv('results.csv')
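
To check what was saved, the file can be read back without pandas. This is a sketch using only the standard library. It assumes the CSV has one header row of summary labels and one row per metric, with the metric name in the first cell. `load_scores` is a hypothetical helper, and the inline sample stands in for `results.csv`.

```python
import csv
import io


def load_scores(handle) -> dict:
    """Read a metric-by-summary score table into {metric: {summary: score}}."""
    reader = csv.reader(handle)
    header = next(reader)
    summary_labels = header[1:]  # the first header cell labels the metric column
    scores = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        metric, values = row[0], row[1:]
        scores[metric] = {label: int(v) for label, v in zip(summary_labels, values)}
    return scores


# Illustrative two-column sample, not the full results file.
sample = io.StringIO(
    "Evaluation Type,Summary 0,Summary 2\n"
    "Coherence,5,4\n"
    "Fluency,3,3\n"
)
scores = load_scores(sample)
print(scores["Coherence"]["Summary 2"])  # 4
```

For the real file, the same helper could be called as `with open('results.csv', newline='') as f: scores = load_scores(f)`.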

In [16]:
# Maximum possible score for each metric

max_scores = {
    "Coherence": 5,
    "Consistency": 5,
    "Fluency": 3,
    "Relevance": 5
}

# Flag each summary (column) that received the maximum score on every metric.
full_score_summaries = pivot_df.apply(
    lambda col: all(col[eval_type] == max_scores[eval_type] for eval_type in max_scores),
    axis=0,
)

full_score_count = full_score_summaries.sum()

print(f"Number of summaries with full scores in all categories: {full_score_count}")

Number of summaries with full scores in all categories: 60
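
The count alone does not show which summaries fell short, or on which metric. A small sketch of that breakdown follows, using plain dictionaries rather than `pivot_df`. `find_shortfalls` is a hypothetical helper; `max_scores` matches the cell above, and the sample holds three columns copied from the score table.

```python
max_scores = {"Coherence": 5, "Consistency": 5, "Fluency": 3, "Relevance": 5}

# Three columns copied from the score table above.
scores = {
    "Summary 0": {"Coherence": 5, "Consistency": 5, "Fluency": 3, "Relevance": 5},
    "Summary 2": {"Coherence": 4, "Consistency": 1, "Fluency": 3, "Relevance": 3},
    "Summary 3": {"Coherence": 4, "Consistency": 5, "Fluency": 3, "Relevance": 4},
}


def find_shortfalls(scores: dict, max_scores: dict) -> dict:
    """Map each summary to the metrics on which it scored below the maximum."""
    shortfalls = {}
    for summary, by_metric in scores.items():
        missed = {m: s for m, s in by_metric.items() if s < max_scores[m]}
        if missed:
            shortfalls[summary] = missed
    return shortfalls


for summary, missed in find_shortfalls(scores, max_scores).items():
    print(summary, missed)
```

With the DataFrame at hand, a comparable input could be built from `pivot_df.to_dict()`, which returns one dictionary of metric scores per summary column.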
