## Detailed Article Explanation

The detailed code explanation for this article is available at the following link:

https://www.daniweb.com/programming/computer-science/tutorials/542208/gpt-4o-mini-vs-gpt-4o-vs-gpt-3-5-turbo-for-text-summarization

My other Daniweb.com articles are available at this link:

https://www.daniweb.com/members/1235222/usmanmalik57

## Importing and Installing Required Libraries

In [1]:
!pip install openai rouge-score pandas
!pip install --upgrade openpyxl



In [31]:
import os
import time
import pandas as pd
from rouge_score import rouge_scorer
from openai import OpenAI
from collections import defaultdict

## Importing the Dataset

In [3]:

# Dataset download link (hosted on GitHub)
# https://github.com/reddzzz/DataScience_FP/blob/main/dataset.xlsx


dataset = pd.read_excel(r"D:\Datasets\dataset.xlsx")
dataset = dataset.sample(frac=1)
print(dataset.shape)
dataset.head()

(1000, 10)


| | Unnamed: 0 | id | human_summary | publication | author | date | year | month | theme | content |
|---|---|---|---|---|---|---|---|---|---|---|
| 199 | 0 | 17523 | Mr. Trump, who has embraced discredited links ... | New York Times | Michael D. Shear, Nicholas Fandos and Jennifer... | 2017-01-11 | 2017.0 | 1.0 | politics | ■ Robert F. Kennedy Jr. one of the nation’s mo... |
| 362 | 259 | 17719 | ” Biden clearly loathes the new president he s... | New York Times | Jonathan Alter | 2017-01-19 | 2017.0 | 1.0 | politics | Joe Biden’s personal compartment on the modifi... |
| 933 | 259 | 18388 | but those criticisms were based on constitutio... | New York Times | Julie Hirschfeld Davis | 2017-02-09 | 2017.0 | 2.0 | business | WASHINGTON — Judge Neil M. Gorsuch, Preside... |
| 865 | 259 | 18314 | He said in a phone interview that his reaction... | New York Times | Jonah Engel Bromwich | 2017-02-06 | 2017.0 | 2.0 | business | Returning home on Saturday night after a dinne... |
| 596 | 259 | 17986 | It was published on a site affiliated with The... | New York Times | Liz Spayd | 2017-01-30 | 2017.0 | 1.0 | entertainment | The conservative radio host Glenn Beck called ... |


In [4]:
dataset['summary_length'] = dataset['human_summary'].apply(len)
average_length = dataset['summary_length'].mean()
print(f"Average length of summaries: {average_length:.2f} characters")

Average length of summaries: 1168.78 characters


## Text Summarization with GPT-4o mini, GPT-4o, and GPT-3.5 Turbo

In [5]:
client = OpenAI(
    # This is the default and can be omitted
    api_key = os.environ.get('OPENAI_API_KEY'),
)

# Function to generate summary using OpenAI API
def generate_summary(model, article):
    prompt = f"Summarize the following article in 1150 characters. The summary should look like human created:\n\n{article}\n\nSummary:"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1150,
        temperature=0.7
    )
    return response.choices[0].message.content

# Function to calculate ROUGE scores
def calculate_rouge(reference, candidate):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    return {key: value.fmeasure for key, value in scores.items()}
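
To make the F-measure returned by `calculate_rouge` more concrete, here is a minimal sketch of how a ROUGE-1 score can be computed by hand. It is a simplification and not the `rouge_score` implementation: it assumes lowercase whitespace tokenization and skips stemming, and the name `rouge1_f` is hypothetical.

```python
from collections import Counter

def rouge1_f(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F-measure: lowercase whitespace tokens, no stemming."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts only as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())  # share of candidate tokens matched
    recall = overlap / sum(ref_counts.values())      # share of reference tokens matched
    return 2 * precision * recall / (precision + recall)

# Five of the six unigrams overlap, so precision and recall are both 5/6
print(round(rouge1_f("the cat sat on the mat", "the cat lay on the mat"), 3))
```

ROUGE-2 follows the same idea with bigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.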


In [6]:
models = ["gpt-4o-mini",
          "gpt-4o",
          "gpt-3.5-turbo"]

results = []

i = 0
for _, row in dataset[:20].iterrows():
    article = row['content']
    human_summary = row['human_summary']
    
    i = i + 1
    
    for model in models:
        
        print(f"Summarizing article {i} with model {model}")
              
        generated_summary = generate_summary(model, article)
        rouge_scores = calculate_rouge(human_summary, generated_summary)
        
        results.append({
            'model': model,
            'article_id': row['id'],
            'generated_summary': generated_summary,
            'rouge1': rouge_scores['rouge1'],
            'rouge2': rouge_scores['rouge2'],
            'rougeL': rouge_scores['rougeL']
        })

# Create a DataFrame with results
results_df = pd.DataFrame(results)

Summarizing article 1 with model gpt-4o-mini
Summarizing article 1 with model gpt-4o
Summarizing article 1 with model gpt-3.5-turbo
Summarizing article 2 with model gpt-4o-mini
Summarizing article 2 with model gpt-4o
Summarizing article 2 with model gpt-3.5-turbo
Summarizing article 3 with model gpt-4o-mini
Summarizing article 3 with model gpt-4o
Summarizing article 3 with model gpt-3.5-turbo
Summarizing article 4 with model gpt-4o-mini
Summarizing article 4 with model gpt-4o
Summarizing article 4 with model gpt-3.5-turbo
Summarizing article 5 with model gpt-4o-mini
Summarizing article 5 with model gpt-4o
Summarizing article 5 with model gpt-3.5-turbo
Summarizing article 6 with model gpt-4o-mini
Summarizing article 6 with model gpt-4o
Summarizing article 6 with model gpt-3.5-turbo
Summarizing article 7 with model gpt-4o-mini
Summarizing article 7 with model gpt-4o
Summarizing article 7 with model gpt-3.5-turbo
Summarizing article 8 with model gpt-4o-mini
Summarizing article 8 with mode... [output truncated]
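
The loop above makes 60 sequential API calls (20 articles, 3 models), so a single transient failure would abort the whole run. One possible safeguard is a small retry wrapper with exponential backoff. The sketch below is an assumption and not part of the original notebook: `call_with_retries` is a hypothetical helper, and it catches every exception for brevity, whereas real code would likely retry only on the client's rate-limit or timeout errors.

```python
import time

def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying failed calls with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))

# Stand-in for generate_summary that fails twice before succeeding
attempts = {"count": 0}

def flaky_summary(model, article):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated transient API error")
    return f"summary from {model}"

delays = []  # record the requested delays instead of actually sleeping
result = call_with_retries(flaky_summary, "gpt-4o-mini", "some article text", sleep=delays.append)
print(result, delays)
```

In the notebook loop, the call would then read `call_with_retries(generate_summary, model, article)`.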

In [20]:
results_df.head(20)

| | model | article_id | generated_summary | rouge1 | rouge2 | rougeL |
|---|---|---|---|---|---|---|
| 0 | gpt-4o-mini | 17523 | Robert F. Kennedy Jr., a notable vaccine skept... | 0.354167 | 0.068063 | 0.166667 |
| 1 | gpt-4o | 17523 | Donald J. Trump has asked Robert F. Kennedy Jr... | 0.315789 | 0.089136 | 0.182825 |
| 2 | gpt-3.5-turbo | 17523 | Robert F. Kennedy Jr., a skeptic of childhood ... | 0.283912 | 0.08254 | 0.170347 |
| 3 | gpt-4o-mini | 17719 | During a flight on Air Force Two from Cartagen... | 0.299728 | 0.032877 | 0.147139 |
| 4 | gpt-4o | 17719 | Joe Biden's private space on Air Force Two fel... | 0.38477 | 0.06841 | 0.172345 |
| 5 | gpt-3.5-turbo | 17719 | Joe Biden, while reflecting on his decision no... | 0.25641 | 0.058065 | 0.153846 |
| 6 | gpt-4o-mini | 18388 | Judge Neil M. Gorsuch, nominated by President ... | 0.282927 | 0.034314 | 0.136585 |
| 7 | gpt-4o | 18388 | Judge Neil M. Gorsuch, President Trump's Supre... | 0.247619 | 0.028708 | 0.119048 |
| 8 | gpt-3.5-turbo | 18388 | President Trump's Supreme Court nominee, Judge... | 0.21902 | 0.023188 | 0.138329 |
| 9 | gpt-4o-mini | 18314 | On Saturday night, Gregory Locke encountered a... | 0.348348 | 0.048338 | 0.138138 |


In [11]:
average_scores = results_df.groupby('model')[['rouge1', 'rouge2', 'rougeL']].mean()
average_scores_sorted = average_scores.sort_values(by='rouge1', ascending=False)
print("Average ROUGE scores by model:")
average_scores_sorted.head()

Average ROUGE scores by model:


| model | rouge1 | rouge2 | rougeL |
|---|---|---|---|
| gpt-4o | 0.381659 | 0.115391 | 0.203795 |
| gpt-4o-mini | 0.375407 | 0.099464 | 0.191419 |
| gpt-3.5-turbo | 0.363258 | 0.111231 | 0.200267 |
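
The three averages are close, and they come from only 20 articles per model, so it can help to look at the spread alongside the mean. The sketch below shows one way to do that with the standard library. It uses only the ROUGE-1 values from the first nine rows of `results_df` shown earlier, so its numbers illustrate the calculation and are not a substitute for the full 20-article averages.

```python
from collections import defaultdict
from statistics import mean, stdev

# ROUGE-1 scores copied from the first nine rows of results_df shown earlier
records = [
    ("gpt-4o-mini", 0.354167), ("gpt-4o", 0.315789), ("gpt-3.5-turbo", 0.283912),
    ("gpt-4o-mini", 0.299728), ("gpt-4o", 0.384770), ("gpt-3.5-turbo", 0.256410),
    ("gpt-4o-mini", 0.282927), ("gpt-4o", 0.247619), ("gpt-3.5-turbo", 0.219020),
]

# Group the scores by model name
by_model = defaultdict(list)
for model, rouge1 in records:
    by_model[model].append(rouge1)

# Report mean, sample standard deviation, and sample size per model
for model, scores in by_model.items():
    print(f"{model}: mean={mean(scores):.3f}, stdev={stdev(scores):.3f}, n={len(scores)}")
```

If the standard deviation within a model is larger than the gap between model means, the ranking in the table above should be read with caution.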


## Evaluating LLM-Generated Summary using an LLM

In [25]:
def llm_evaluate_summary(article, summary):
    prompt = f"""Evaluate the following summary for the given article. Rate it on a scale of 1-10 for:
    1. Completeness: Does it capture all key points?
    2. Conciseness: Is it brief and to the point?
    3. Coherence: Is it well-structured and easy to understand?

    Article: {article}

    Summary: {summary}

    Provide the ratings as a comma-separated list (completeness,conciseness,coherence).
    """
    response = client.chat.completions.create(
        model= "gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
        temperature=0.7
    )
    return [float(score) for score in response.choices[0].message.content.strip().split(',')]
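
The final line of `llm_evaluate_summary` assumes the model replies with exactly three comma-separated numbers; any extra words in the reply would make `float()` raise an error. A more tolerant parser is sketched below. It is an assumption and not part of the original notebook: `parse_ratings` is a hypothetical helper that searches the reply for the first `a,b,c` triple and fails with a clear message when none is found.

```python
import re

_NUM = r"(\d+(?:\.\d+)?)"  # an integer or decimal number
_TRIPLE = re.compile(rf"{_NUM}\s*,\s*{_NUM}\s*,\s*{_NUM}")

def parse_ratings(reply: str) -> list[float]:
    """Return [completeness, conciseness, coherence] from a model reply."""
    match = _TRIPLE.search(reply)
    if match is None:
        raise ValueError(f"no 'a,b,c' rating triple found in reply: {reply!r}")
    return [float(group) for group in match.groups()]

print(parse_ratings("8,9,9"))
print(parse_ratings("Ratings: 8, 9.5, 9."))
```

With this helper, the return statement could become `return parse_ratings(response.choices[0].message.content)`.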


In [35]:
# Initialize a dictionary to store scores for each model
scores_dict = defaultdict(lambda: {'completeness': [], 'conciseness': [], 'coherence': []})


for _, row in results_df.iterrows():
    # Look up the original article text by its id
    article = dataset.loc[dataset['id'] == row['article_id'], 'content'].iloc[0]
    scores = llm_evaluate_summary(article, row['generated_summary'])
    print(f"Model: {row['model']}, Scores: {scores}")
    
    # Store the scores for the model
    model = row['model']
    scores_dict[model]['completeness'].append(scores[0])
    scores_dict[model]['conciseness'].append(scores[1])
    scores_dict[model]['coherence'].append(scores[2])

Model: gpt-4o-mini, Scores: [8.0, 9.0, 9.0]
Model: gpt-4o, Scores: [8.0, 9.0, 9.0]
Model: gpt-3.5-turbo, Scores: [8.0, 9.0, 8.0]
Model: gpt-4o-mini, Scores: [8.0, 9.0, 9.0]
Model: gpt-4o, Scores: [8.0, 9.0, 9.0]
Model: gpt-3.5-turbo, Scores: [8.0, 9.0, 8.0]
Model: gpt-4o-mini, Scores: [8.0, 9.0, 9.0]
Model: gpt-4o, Scores: [8.0, 9.0, 9.0]
Model: gpt-3.5-turbo, Scores: [8.0, 9.0, 9.0]
Model: gpt-4o-mini, Scores: [8.0, 9.0, 9.0]
Model: gpt-4o, Scores: [8.0, 9.0, 8.0]
Model: gpt-3.5-turbo, Scores: [8.0, 9.0, 9.0]
Model: gpt-4o-mini, Scores: [8.0, 9.0, 8.0]
Model: gpt-4o, Scores: [8.0, 9.0, 8.0]
Model: gpt-3.5-turbo, Scores: [8.0, 9.0, 8.0]
Model: gpt-4o-mini, Scores: [8.0, 9.0, 9.0]
Model: gpt-4o, Scores: [8.0, 9.0, 9.0]
Model: gpt-3.5-turbo, Scores: [8.0, 9.0, 8.0]
Model: gpt-4o-mini, Scores: [8.0, 9.0, 9.0]
Model: gpt-4o, Scores: [8.0, 9.0, 8.0]
Model: gpt-3.5-turbo, Scores: [8.0, 9.0, 8.0]
Model: gpt-4o-mini, Scores: [8.0, 9.0, 8.0]
Model: gpt-4o, Scores: [8.0, 9.0, 9.0]
Model: gpt-3.5... [output truncated]

In [37]:
# Calculate the average scores for each model
average_scores = {}
for model, scores in scores_dict.items():
    average_scores[model] = {
        'completeness': sum(scores['completeness']) / len(scores['completeness']),
        'conciseness': sum(scores['conciseness']) / len(scores['conciseness']),
        'coherence': sum(scores['coherence']) / len(scores['coherence']),
    }

# Convert to DataFrame for better visualization (optional)
average_scores_df = pd.DataFrame.from_dict(average_scores, orient='index')
average_scores_df.columns = ['Completeness', 'Conciseness', 'Coherence']

average_scores_df.head()


| | Completeness | Conciseness | Coherence |
|---|---|---|---|
| gpt-4o-mini | 8.0 | 8.9 | 8.65 |
| gpt-4o | 8.0 | 9.0 | 8.8 |
| gpt-3.5-turbo | 8.0 | 9.0 | 8.4 |
