# Evaluation using ROUGE

In [17]:
pip install rouge

Note: you may need to restart the kernel to use updated packages.


In [18]:
from rouge import Rouge
import pandas as pd

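After removing some outliers, I proceed to evaluate the quality of the text summaries using ROUGE scores, with the highlights as hypotheses and the articles as references. Precision is high (ROUGE-1 p ≈ 0.82) while recall is very low (ROUGE-1 r ≈ 0.04): most words in a summary also appear in its article, but the summaries are highly condensed and cover only a small fraction of the original content.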
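To make the precision/recall asymmetry concrete, here is a minimal sketch of unigram overlap scoring in plain Python. The helper `rouge1` and the toy texts are hypothetical and only approximate what ROUGE-1 measures; the `rouge` package's own tokenization and counting may differ, so the numbers are illustrative only.

```python
from collections import Counter

def rouge1(summary: str, article: str) -> dict:
    """Simplified ROUGE-1: unigram overlap between a summary and its article."""
    hyp = Counter(summary.split())   # hypothesis tokens (the summary)
    ref = Counter(article.split())   # reference tokens (the article)
    overlap = sum((hyp & ref).values())        # clipped count of shared tokens
    p = overlap / max(sum(hyp.values()), 1)    # share of summary tokens found in the article
    r = overlap / max(sum(ref.values()), 1)    # share of article tokens covered by the summary
    f = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of precision and recall
    return {"p": p, "r": r, "f": f}

# Toy example (made up): a short summary copied from the start of a longer text
article = "the cat sat on the mat while the dog slept by the door all afternoon"
summary = "the cat sat on the mat"
scores = rouge1(summary, article)
print(scores)
```

Every summary token occurs in the article, so precision is perfect, while recall stays low because the summary covers only 6 of the 15 article tokens. The longer the article is relative to the summary, the lower the recall.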

In [19]:
df = pd.read_csv('articles_with_removal.csv')

rouge = Rouge()
scores = rouge.get_scores(df['preprocessed_highlights'].tolist(), df['preprocessed_articles'].tolist(), avg=True)
print("ROUGE Scores:", scores)

ROUGE Scores: {'rouge-1': {'r': 0.03888613217339696, 'p': 0.822556244510026, 'f': 0.07374629530986938}, 'rouge-2': {'r': 0.012355982046341616, 'p': 0.4013977029069911, 'f': 0.023836888587387285}, 'rouge-l': {'r': 0.03199708585413308, 'p': 0.6821242826494922, 'f': 0.06070796872298221}}
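
One detail worth noting when reading these numbers: the reported `f` is not the harmonic mean of the reported `p` and `r`. As far as I can tell, with `avg=True` the library averages each statistic over all pairs separately, so the mean F1 and the F1 of the mean precision and recall need not agree. A quick check using the ROUGE-1 values copied from the output above:

```python
# ROUGE-1 averages copied from the printed output
p = 0.822556244510026
r = 0.03888613217339696
f_reported = 0.07374629530986938

# F1 computed from the averaged precision and recall
f_from_means = 2 * p * r / (p + r)

print("F1 from mean p and r:", round(f_from_means, 4))
print("Reported mean F1:    ", round(f_reported, 4))
```

The two values are close but not identical, which is what one would expect if F1 is computed per pair and then averaged.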


To maximize the ROUGE score, I select the 200 pairs with the highest ROUGE-L F1 score, keeping each article at most once. The score for each pair is computed by the function calculate_rouge_score.
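
For intuition about what is being ranked, here is a minimal sketch of ROUGE-L F1 based on the longest common subsequence (LCS) of tokens. The helpers `lcs_length` and `rouge_l_f1` are hypothetical names, and this is a simplified sentence-level version; the `rouge` package computes ROUGE-L differently in its details, so its scores will not match these exactly.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match on equal tokens, otherwise carry the best so far
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(summary: str, article: str) -> float:
    """Simplified ROUGE-L F1 for one summary (hypothesis) and article (reference)."""
    hyp, ref = summary.split(), article.split()
    if not hyp or not ref:
        return 0.0
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(hyp), lcs / len(ref)
    return 2 * p * r / (p + r)

# Toy pairs (made up): rank them by the simplified score, highest first
pairs = [
    ("the cat sat", "the cat sat on the mat"),
    ("a dog barked", "the cat sat on the mat"),
]
ranked = sorted(pairs, key=lambda pair: rouge_l_f1(*pair), reverse=True)
print([round(rouge_l_f1(s, a), 3) for s, a in ranked])
```

Because F1 rewards recall as well as precision, ranking by it tends to favour pairs whose summary covers a larger share of the article, which usually means shorter articles or longer summaries.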

In [20]:
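# Function to calculate the ROUGE-L F1 score for a single article-summary pair
def calculate_rouge_score(article, summary):
    # Rouge.get_scores expects the hypothesis (summary) first and the reference (article) second,
    # matching the argument order used for the corpus-level scores above
    scores = rouge.get_scores(summary, article, avg=True)
    return scores['rouge-l']['f']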

# Calculate the ROUGE score for each existing pair
df['rouge_score'] = df.apply(lambda row: calculate_rouge_score(row['preprocessed_articles'], row['preprocessed_highlights']), axis=1)

# Sort the pairs by the ROUGE score in descending order
df = df.sort_values(by='rouge_score', ascending=False)

# Select the top 200 unique articles
unique_articles = set()
top_200_pairs = []
for index, row in df.iterrows():
    if row['preprocessed_articles'] not in unique_articles:
        unique_articles.add(row['preprocessed_articles'])
        top_200_pairs.append(row)
    if len(top_200_pairs) == 200:
        break

# Convert the list of top pairs to a DataFrame
top_200_pairs_df = pd.DataFrame(top_200_pairs)

# Generate new IDs starting from 1 (1 to 200 when 200 unique articles are found)
top_200_pairs_df['id'] = range(1, len(top_200_pairs_df) + 1)

columns_to_save = ['id', 'articles', 'highlights']
top_200_pairs_to_save = top_200_pairs_df[columns_to_save]

# Save the top 200 pairs with their original information to a new CSV file
top_200_pairs_to_save.to_csv('top_200_article_summary_pairs.csv', index=False)
print("Top 200 article-summary pairs saved to 'top_200_article_summary_pairs.csv'")

Top 200 article-summary pairs saved to 'top_200_article_summary_pairs.csv'
