# ROUGE-Based Evaluation for Indonesian Review Summarization

This notebook demonstrates how to evaluate summarization quality using ROUGE metrics.

## What is ROUGE?

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a family of metrics that score an automatic summary by its word overlap with a human-written reference summary:

- **ROUGE-1**: Overlap of unigrams (single words)
- **ROUGE-2**: Overlap of bigrams (two consecutive words)
- **ROUGE-L**: Longest common subsequence (words that appear in the same order, not necessarily adjacent)
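
To make the three definitions concrete, here is a minimal from-scratch sketch in plain Python. It is not the implementation behind `calculate_rouge`. The helper names (`tokenize`, `ngrams`, `prf`, `rouge_n`, `lcs_length`, `rouge_l`) and the simple punctuation-stripping tokenizer are assumptions made for illustration, so the numbers may differ from what the package reports.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace after replacing basic punctuation."""
    for ch in ",.!?;:":
        text = text.replace(ch, " ")
    return text.lower().split()


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(overlap, pred_total, ref_total):
    """Turn an overlap count into precision, recall, and F-measure."""
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    fmeasure = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


def rouge_n(prediction, reference, n):
    """ROUGE-N: clipped n-gram overlap between prediction and reference."""
    pred = ngrams(tokenize(prediction), n)
    ref = ngrams(tokenize(reference), n)
    overlap = sum((pred & ref).values())  # Counter intersection clips counts
    return prf(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """ROUGE-L: LCS length relative to each side's token count."""
    pred, ref = tokenize(prediction), tokenize(reference)
    return prf(lcs_length(pred, ref), len(pred), len(ref))


prediction = "Produk bagus dengan pengiriman cepat."
reference = "Barang bagus, pengiriman sangat cepat."
print("ROUGE-1:", rouge_n(prediction, reference, 1))
print("ROUGE-2:", rouge_n(prediction, reference, 2))
print("ROUGE-L:", rouge_l(prediction, reference))
```

In this pair, three of the five words on each side match (`bagus`, `pengiriman`, `cepat`), so ROUGE-1 and ROUGE-L both come out at 0.6, while no two consecutive words match and ROUGE-2 is 0.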

## Setup

In [None]:
import sys
sys.path.append('../src')

from indo_ecommerce_review_summarization.evaluation import (
    calculate_rouge,
    evaluate_predictions,
    format_rouge_scores
)

import pandas as pd
import numpy as np

## Example 1: Single Prediction Evaluation

In [None]:
# Example prediction and reference
prediction = "Produk berkualitas baik dengan pengiriman cepat dan harga terjangkau."
reference = "Barang bagus, pengiriman cepat, harga murah."

# Calculate ROUGE scores
scores = calculate_rouge(prediction, reference)

print("Single Example Evaluation:")
print("="*80)
print(f"Prediction: {prediction}")
print(f"Reference:  {reference}")
print("\nROUGE Scores:")
print(format_rouge_scores(scores))

## Example 2: Batch Evaluation

In [None]:
# Multiple predictions and references
predictions = [
    "Produk bagus dengan pengiriman cepat.",
    "Kualitas oke tapi pengiriman lambat.",
    "Harga murah dan kualitas baik.",
    "Seller responsif dan barang sesuai deskripsi.",
    "Packing rapi, produk berkualitas tinggi."
]

references = [
    "Barang bagus, pengiriman sangat cepat.",
    "Produk berkualitas namun pengiriman terlambat.",
    "Harga terjangkau dengan kualitas yang bagus.",
    "Seller sangat responsif, barang sesuai ekspektasi.",
    "Kemasan rapi dan produk sangat berkualitas."
]

# Evaluate all predictions
scores = evaluate_predictions(predictions, references)

print("Batch Evaluation Results:")
print("="*80)
print(format_rouge_scores(scores))

## Example 3: Per-Example Scores

In [None]:
# Get scores for each example
per_example_scores = evaluate_predictions(
    predictions,
    references,
    aggregate=False
)

# Display results
print("Per-Example ROUGE Scores:")
print("="*80)

# Use a distinct loop variable so the batch-level `scores` from Example 2 is not overwritten
for i, (pred, ref, ex_scores) in enumerate(zip(predictions, references, per_example_scores), 1):
    print(f"\nExample {i}:")
    print(f"Prediction: {pred}")
    print(f"Reference:  {ref}")
    print(f"ROUGE-1 F1: {ex_scores['rouge1']['fmeasure']:.4f}")
    print(f"ROUGE-2 F1: {ex_scores['rouge2']['fmeasure']:.4f}")
    print(f"ROUGE-L F1: {ex_scores['rougeL']['fmeasure']:.4f}")
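
The per-example dictionaries can be collapsed back into one batch-level score. The sketch below averages each field across examples. Whether `evaluate_predictions` aggregates with a plain mean is an assumption, not something this notebook confirms. The helper `average_scores` is hypothetical and the input values are placeholders, not real results.

```python
from statistics import mean


def average_scores(per_example):
    """Average a list of nested {metric: {field: value}} dicts field by field."""
    if not per_example:
        return {}
    first = per_example[0]
    return {
        metric: {
            field: mean(ex[metric][field] for ex in per_example)
            for field in first[metric]
        }
        for metric in first
    }


# Placeholder values for illustration only (not actual ROUGE results)
toy_scores = [
    {"rouge1": {"fmeasure": 0.50}, "rougeL": {"fmeasure": 0.50}},
    {"rouge1": {"fmeasure": 0.75}, "rougeL": {"fmeasure": 0.25}},
]
print(average_scores(toy_scores))
```

If the averaged values differ from the aggregated output of `evaluate_predictions`, the package is probably aggregating in another way, and the per-example scores are the safer basis for analysis.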

## Example 4: Analyzing Results

In [None]:
# Create a dataframe for analysis
data = []
# Use a distinct loop variable so the batch-level `scores` is not overwritten
for i, (pred, ref, ex_scores) in enumerate(zip(predictions, references, per_example_scores), 1):
    data.append({
        "Example": i,
        "Prediction": pred,
        "Reference": ref,
        "ROUGE-1": ex_scores['rouge1']['fmeasure'],
        "ROUGE-2": ex_scores['rouge2']['fmeasure'],
        "ROUGE-L": ex_scores['rougeL']['fmeasure'],
    })

df = pd.DataFrame(data)

# Display summary statistics
print("Summary Statistics:")
print("="*80)
print(df[['ROUGE-1', 'ROUGE-2', 'ROUGE-L']].describe())

# Find best and worst examples
print("\nBest Example (by ROUGE-L):")
best_idx = df['ROUGE-L'].idxmax()
print(f"Prediction: {df.loc[best_idx, 'Prediction']}")
print(f"Reference:  {df.loc[best_idx, 'Reference']}")
print(f"ROUGE-L:    {df.loc[best_idx, 'ROUGE-L']:.4f}")

print("\nWorst Example (by ROUGE-L):")
worst_idx = df['ROUGE-L'].idxmin()
print(f"Prediction: {df.loc[worst_idx, 'Prediction']}")
print(f"Reference:  {df.loc[worst_idx, 'Reference']}")
print(f"ROUGE-L:    {df.loc[worst_idx, 'ROUGE-L']:.4f}")

## Example 5: Comparing Different Summarization Approaches

In [None]:
# Simulate different models/approaches
approach1_predictions = [
    "Produk berkualitas dengan pengiriman cepat.",
    "Kualitas baik namun pengiriman terlambat.",
    "Harga murah kualitas oke.",
]

approach2_predictions = [
    "Barang bagus, kirim cepat banget.",
    "Produk oke, sayang pengirimannya lama.",
    "Murah dan berkualitas.",
]

test_references = [
    "Barang bagus, pengiriman sangat cepat.",
    "Produk berkualitas namun pengiriman terlambat.",
    "Harga terjangkau dengan kualitas yang bagus.",
]

# Evaluate both approaches
scores1 = evaluate_predictions(approach1_predictions, test_references)
scores2 = evaluate_predictions(approach2_predictions, test_references)

# Compare
print("Approach Comparison:")
print("="*80)

comparison_df = pd.DataFrame({
    "Metric": ["ROUGE-1", "ROUGE-2", "ROUGE-L"],
    "Approach 1": [
        scores1['rouge1']['fmeasure'],
        scores1['rouge2']['fmeasure'],
        scores1['rougeL']['fmeasure']
    ],
    "Approach 2": [
        scores2['rouge1']['fmeasure'],
        scores2['rouge2']['fmeasure'],
        scores2['rougeL']['fmeasure']
    ]
})

comparison_df['Difference'] = comparison_df['Approach 1'] - comparison_df['Approach 2']
print(comparison_df.to_string(index=False))

## Interpreting ROUGE Scores

### General Guidelines:

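- **ROUGE-1**: Measures content overlap at the word level
  - Higher scores indicate better content coverage
  - Rough rule of thumb: 0.3-0.6 for good summaries

- **ROUGE-2**: Measures phrase-level overlap
  - Stricter than ROUGE-1, since two consecutive words must match
  - Rough rule of thumb: 0.15-0.4 for good summaries

- **ROUGE-L**: Measures the longest common subsequence
  - Rewards words that appear in the same order as in the reference
  - Only a rough proxy for fluency, since it still relies on exact word matches
  - Rough rule of thumb: 0.3-0.5 for good summaries

These ranges are indicative only. Scores depend on the dataset, summary length, and tokenization, so compare models on the same test set rather than against fixed thresholds.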
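
The difference between the three metrics is easiest to see when a prediction reuses every reference word in a different order. The sketch below is a simplified, self-contained illustration with whitespace tokenization. The helper names are hypothetical and this is not the code behind `calculate_rouge`.

```python
from collections import Counter


def words(text):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()


def f1(overlap, pred_len, ref_len):
    """Harmonic mean of precision (overlap/pred_len) and recall (overlap/ref_len)."""
    if overlap == 0:
        return 0.0
    p, r = overlap / pred_len, overlap / ref_len
    return 2 * p * r / (p + r)


def rouge_n_f1(pred, ref, n):
    """F1 over clipped n-gram overlap."""
    def grams(tokens):
        return Counter(zip(*(tokens[i:] for i in range(n))))
    p, r = grams(words(pred)), grams(words(ref))
    return f1(sum((p & r).values()), sum(p.values()), sum(r.values()))


def rouge_l_f1(pred, ref):
    """F1 over the longest common subsequence of tokens."""
    a, b = words(pred), words(ref)
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return f1(prev[-1], len(a), len(b))


reference = "pengiriman cepat dan harga murah"
reordered = "harga murah dan pengiriman cepat"  # same words, different order

print(f"ROUGE-1 F1: {rouge_n_f1(reordered, reference, 1):.2f}")
print(f"ROUGE-2 F1: {rouge_n_f1(reordered, reference, 2):.2f}")
print(f"ROUGE-L F1: {rouge_l_f1(reordered, reference):.2f}")
```

ROUGE-1 stays at 1.00 because word order is ignored, ROUGE-2 drops to 0.50 because only two of the four bigrams survive the reordering, and ROUGE-L drops to 0.40 because the longest in-order match is two words out of five.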

### Important Notes:

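1. ROUGE only measures surface word overlap, so a high score does not guarantee a better summary, and a valid paraphrase can score low
2. Human evaluation is still important
3. Consider multiple metrics together rather than relying on a single one
4. Indonesian differs from English in ways that affect exact matching: affixes and clitics (e.g. `pengiriman` vs. `pengirimannya`) and the informal spellings common in reviews are counted as different words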
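
The last point can be illustrated with one of the sentence pairs used earlier in this notebook. The sketch below lists the words two sentences share under exact matching, then again after a toy rule that strips the clitic `-nya`. The function `strip_nya` is a deliberately crude, hypothetical normalizer and not a real Indonesian stemmer. A dedicated stemmer would be needed in practice, and it is not confirmed whether `calculate_rouge` applies any normalization of this kind.

```python
def tokenize(text):
    """Lowercase and split on whitespace after replacing basic punctuation."""
    for ch in ",.!?;:":
        text = text.replace(ch, " ")
    return text.lower().split()


def strip_nya(token):
    """Toy rule: drop a trailing '-nya' from longer tokens (illustration only)."""
    # The length guard avoids damaging short base words such as 'hanya' or 'tanya'.
    if token.endswith("nya") and len(token) > 5:
        return token[:-3]
    return token


def shared_words(prediction, reference, normalize=None):
    """Return the sorted set of words both sentences share after normalization."""
    norm = normalize or (lambda t: t)
    pred = {norm(t) for t in tokenize(prediction)}
    ref = {norm(t) for t in tokenize(reference)}
    return sorted(pred & ref)


prediction = "Produk oke, sayang pengirimannya lama."
reference = "Produk berkualitas namun pengiriman terlambat."

print("Exact match:   ", shared_words(prediction, reference))
print("With toy rule: ", shared_words(prediction, reference, strip_nya))
```

Under exact matching only `produk` is shared. Once `pengirimannya` is reduced to `pengiriman`, the pair shares two words, which would raise ROUGE-1 without any change in the summary's meaning.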

## Conclusion

This notebook demonstrated:
1. Single and batch ROUGE evaluation
2. Per-example analysis
3. Comparing different approaches
4. Interpreting ROUGE scores

## Next Steps

- Evaluate your model on a test set
- Compare with baseline models
- Perform error analysis on low-scoring examples
- Consider additional metrics (BLEU, METEOR, BERTScore)