# Translation Quality Evaluation Notebook

This notebook performs the following tasks:
- **Import and Setup:** Loads the required libraries: Pandas, JSON, `evaluate` (for BLEU, ROUGE, and CHRF), `editdistance`, and `tqdm`.
- **Metric Loading:** Loads the BLEU, ROUGE, and CHRF metric modules using the `evaluate` library.
- **Paragraph-Level Evaluation:** Reads a JSON file listing the languages, iterates over each language’s CSV file of paragraph-level translations, computes BLEU, ROUGE, and CHRF for that language, and stores the results in a DataFrame. The translation type information in `types.csv` is also loaded.
- **Sentence-Level Evaluation:** Repeats the same procedure for each language’s CSV file of sentence-level translations and stores the results in a second DataFrame.

In [1]:
import pandas as pd
import json
import evaluate
import editdistance
from tqdm import tqdm

## Load Evaluation Metrics

The following code loads the BLEU, ROUGE, and CHRF evaluation metrics using the `evaluate` library.

In [None]:
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
chrf = evaluate.load("chrf")
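
The evaluation cells below pass each metric a flat list of predictions and a list of reference lists, wrapping every corrected text in a one-element list so that a prediction could in principle have several references. The following sketch shows only that data shape, using the standard `csv` module and two made-up rows; the sample texts are hypothetical and are not taken from the dataset.

```python
import csv
import io

# Hypothetical two-row sample with the same column names as the real CSV files.
sample_csv = io.StringIO(
    "Generated,Corrected\n"
    "Hello wrld,Hello world\n"
    "Good morning,Good morning\n"
)
rows = list(csv.DictReader(sample_csv))

# One prediction string per row.
predictions = [row["Generated"] for row in rows]

# One list of references per row; here each list holds a single reference.
references = [[row["Corrected"]] for row in rows]

print(predictions)
print(references)
```

With real data, `predictions` and `references` built this way correspond to the arguments given to `bleu.compute`, `rouge.compute`, and `chrf.compute` in the cells below.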

## Paragraph-Level Evaluation

This section processes paragraph-level translations. For each language listed in `paragraph.json`, the notebook reads the corresponding CSV file, which holds the generated paragraphs (`Generated`) and their post-edited corrections (`Corrected`). It then computes the following metrics over all paragraphs of that language:
- **BLEU Score**
- **ROUGE Scores (rouge1, rouge2, rougeL)**
- **CHRF Score** (normalized by dividing by 100)

Each language’s metrics, together with their mean, are stored as one row of the DataFrame `paragraph`, which is sorted by mean score in descending order.
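
The mean used for ranking can be written as a small standalone function. This is a minimal sketch of the formula from the cell below; the function name `mean_score` and the example values are illustrative only and are not results from the data.

```python
def mean_score(bleu, rouge1, rouge2, rougeL, chrf_raw):
    """Average five scores on a 0-1 scale.

    chrf_raw is expected on a 0-100 scale and is divided by 100 first,
    so that it is comparable with BLEU and ROUGE.
    """
    return (chrf_raw / 100 + bleu + rouge1 + rouge2 + rougeL) / 5


# Illustrative values only (not taken from the results tables).
example = mean_score(bleu=0.5, rouge1=0.8, rouge2=0.6, rougeL=0.7, chrf_raw=90.0)
print(f"{example:.2f}")  # → 0.70
```

Note that ROUGE enters the average three times (rouge1, rouge2, rougeL), so it accounts for three fifths of the mean, while BLEU and CHRF account for one fifth each.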

In [None]:
with open("../data/post-edited/paragraph.json", "r") as json_file:
    par_langs = json.load(json_file)

# Translation type metadata: loaded once here, but not used by the scoring below
types = pd.read_csv("../data/post-edited/types.csv")

entries = []
for lang in tqdm(par_langs):
    csv_file = f"../data/post-edited/paragraph/{lang}.csv"
    data = pd.read_csv(csv_file)
    try:
        # Compute BLEU score
        bleu_score = bleu.compute(
            predictions=list(data["Generated"]),
            references=[[i] for i in data["Corrected"]],
        )
        # Compute ROUGE scores (rouge1, rouge2, and rougeL)
        rouge_score = rouge.compute(
            predictions=list(data["Generated"]),
            references=[[i] for i in data["Corrected"]],
        )
        # Compute CHRF score
        chrf_score = chrf.compute(
            predictions=list(data["Generated"]),
            references=[[i] for i in data["Corrected"]],
        )

        # Compute mean score as an average of normalized CHRF, BLEU, and ROUGE scores
        mean_score = (
            (chrf_score["score"] / 100)
            + bleu_score["bleu"]
            + rouge_score["rouge1"]
            + rouge_score["rouge2"]
            + rouge_score["rougeL"]
        ) / 5

        # Build a dictionary entry for the current language.
        # Scores are stored as rounded floats so that sorting by "mean" is numeric.
        entry = {
            "language": lang.replace("_", " ").split()[0].lower(),
            "bleu": round(bleu_score["bleu"], 2),
            "rouge1": round(rouge_score["rouge1"], 2),
            "rouge2": round(rouge_score["rouge2"], 2),
            "rougeL": round(rouge_score["rougeL"], 2),
            "chrf": round(chrf_score["score"] / 100, 2),
            "mean": round(mean_score, 2),
        }
        entries.append(entry)
    except Exception as e:
        print(f"Error processing {lang} from {csv_file}: {e}")
        continue

paragraph = (
    pd.DataFrame(entries)
    .sort_values(by=["mean"], ascending=False)
    .reset_index(drop=True)
)
paragraph
# Optionally, save the paragraph-level evaluation results:
# paragraph.to_json("../results/paragraph.json", orient="records", indent=4)

 91%|█████████ | 20/22 [00:01<00:00, 13.83it/s]

Error processing tamil from ../data/post-edited/paragraph/tamil.csv: list index out of range


100%|██████████| 22/22 [00:01<00:00, 12.67it/s]


| | language | bleu | rouge1 | rouge2 | rougeL | chrf | mean |
|---|---|---|---|---|---|---|---|
| 0 | indonesian | 0.86 | 0.93 | 0.85 | 0.91 | 0.93 | 0.89 |
| 1 | turkish | 0.79 | 0.92 | 0.85 | 0.91 | 0.9 | 0.87 |
| 2 | tagalog | 0.75 | 0.86 | 0.73 | 0.84 | 0.86 | 0.81 |
| 3 | portuguese | 0.73 | 0.85 | 0.74 | 0.83 | 0.85 | 0.8 |
| 4 | spanish | 0.64 | 0.85 | 0.73 | 0.84 | 0.82 | 0.78 |
| 5 | vietnamese | 0.66 | 0.85 | 0.75 | 0.81 | 0.76 | 0.77 |
| 6 | swahili | 0.59 | 0.81 | 0.78 | 0.81 | 0.71 | 0.74 |
| 7 | cantonese | 0.61 | 0.67 | 0.67 | 0.67 | 0.78 | 0.68 |
| 8 | hindi | 0.72 | 0.6 | 0.48 | 0.54 | 0.83 | 0.64 |
| 9 | mandarin | 0.36 | 0.67 | 0.67 | 0.67 | 0.79 | 0.63 |


## Sentence-Level Evaluation

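This section reads a JSON file (`sentence.json`) which contains a list of language names. For each language, it reads the CSV file of sentence-level translations and computes the same BLEU, ROUGE, and CHRF metrics as in the paragraph-level evaluation. A mean score is also computed across all these metrics. The results are stored in a DataFrame called `sentence`.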
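
Because this cell repeats the paragraph-level loop almost line for line, the shared scoring step could be factored into one helper. The following is a sketch under stated assumptions rather than the notebook's own code: `score_language` and `exact_match` are hypothetical names, and the metrics are passed in as plain functions so that the example runs without the `evaluate` library.

```python
def score_language(generated, corrected, scorers):
    """Score one language with every scorer and add the unweighted mean.

    scorers maps a metric name to a function taking (predictions, references)
    and returning a float on a 0-1 scale.
    """
    predictions = list(generated)
    # Wrap each reference in a list, matching the shape used in the notebook.
    references = [[text] for text in corrected]
    scores = {name: fn(predictions, references) for name, fn in scorers.items()}
    scores["mean"] = sum(scores.values()) / len(scores)
    return scores


def exact_match(predictions, references):
    """Stand-in metric: share of predictions identical to one of their references."""
    hits = sum(pred in refs for pred, refs in zip(predictions, references))
    return hits / len(predictions)


# In the notebook, a scorer could instead wrap a loaded metric, for example:
#   lambda p, r: bleu.compute(predictions=p, references=r)["bleu"]
result = score_language(["a", "b"], ["a", "c"], {"exact": exact_match})
print(result)  # → {'exact': 0.5, 'mean': 0.5}
```

With such a helper, the paragraph-level and sentence-level cells would differ only in the directory they read from.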

In [None]:
with open("../data/post-edited/sentence.json", "r") as json_file:
    sent_langs = json.load(json_file)

entries = []
for lang in tqdm(sent_langs):
    csv_file = f"../data/post-edited/sentence/{lang}.csv"
    data = pd.read_csv(csv_file)

    try:
        # Compute BLEU score
        bleu_score = bleu.compute(
            predictions=list(data["Generated"]),
            references=[[i] for i in data["Corrected"]],
        )
        # Compute ROUGE scores (rouge1, rouge2, and rougeL)
        rouge_score = rouge.compute(
            predictions=list(data["Generated"]),
            references=[[i] for i in data["Corrected"]],
        )
        # Compute CHRF score
        chrf_score = chrf.compute(
            predictions=list(data["Generated"]),
            references=[[i] for i in data["Corrected"]],
        )

        # Compute mean score as an average of normalized CHRF, BLEU, and ROUGE scores
        mean_score = (
            (chrf_score["score"] / 100)
            + bleu_score["bleu"]
            + rouge_score["rouge1"]
            + rouge_score["rouge2"]
            + rouge_score["rougeL"]
        ) / 5

        # Build a dictionary entry for the current language.
        # Scores are stored as rounded floats so that sorting by "mean" is numeric.
        entry = {
            "language": lang.replace("_", " ").split()[0].lower(),
            "bleu": round(bleu_score["bleu"], 2),
            "rouge1": round(rouge_score["rouge1"], 2),
            "rouge2": round(rouge_score["rouge2"], 2),
            "rougeL": round(rouge_score["rougeL"], 2),
            "chrf": round(chrf_score["score"] / 100, 2),
            "mean": round(mean_score, 2),
        }
        entries.append(entry)
    except Exception as e:
        print(f"Error processing {lang} from {csv_file}: {e}")
        continue

sentence = (
    pd.DataFrame(entries)
    .sort_values(by=["mean"], ascending=False)
    .reset_index(drop=True)
)
sentence  # Display the sentence-level evaluation DataFrame
# Optionally, save the results:
# sentence.to_json("../results/sentence.json", orient="records", indent=4)

 50%|█████     | 11/22 [00:00<00:00, 18.21it/s]

Error processing korean from ../data/post-edited/sentence/korean.csv: list index out of range


100%|██████████| 22/22 [00:01<00:00, 15.10it/s]


| | language | bleu | rouge1 | rouge2 | rougeL | chrf | mean |
|---|---|---|---|---|---|---|---|
| 0 | swahili | 0.89 | 0.96 | 0.83 | 0.96 | 0.93 | 0.91 |
| 1 | indonesian | 0.83 | 0.95 | 0.86 | 0.94 | 0.92 | 0.9 |
| 2 | turkish | 0.82 | 0.93 | 0.83 | 0.92 | 0.9 | 0.88 |
| 3 | vietnamese | 0.73 | 0.88 | 0.82 | 0.87 | 0.77 | 0.81 |
| 4 | tagalog | 0.75 | 0.83 | 0.7 | 0.82 | 0.85 | 0.79 |
| 5 | portuguese | 0.73 | 0.84 | 0.71 | 0.83 | 0.84 | 0.79 |
| 6 | spanish | 0.64 | 0.84 | 0.67 | 0.84 | 0.82 | 0.76 |
| 7 | cantonese | 0.78 | 0.4 | 0.31 | 0.4 | 0.77 | 0.53 |
| 8 | hindi | 0.7 | 0.4 | 0.16 | 0.39 | 0.82 | 0.5 |
| 9 | mandarin | 0.71 | 0.36 | 0.16 | 0.36 | 0.7 | 0.46 |
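
In the table above, Cantonese, Hindi, and Mandarin have BLEU and CHRF scores of 0.70 or more but ROUGE scores of 0.40 or less. One possible explanation, offered as an assumption rather than a confirmed diagnosis, is tokenization. The default tokenizer of the `rouge_score` package, which backs this metric, is believed to lowercase the text and keep only the characters a-z and 0-9; this should be verified against the installed version. The sketch below imitates that behavior with the standard `re` module to show what it would do to non-Latin text. The function name `ascii_alnum_tokens` is hypothetical.

```python
import re


def ascii_alnum_tokens(text):
    """Imitate a tokenizer that lowercases and keeps only runs of a-z and 0-9."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


print(ascii_alnum_tokens("Hello, World 42"))  # → ['hello', 'world', '42']
print(ascii_alnum_tokens("नमस्ते दुनिया"))        # → []
print(ascii_alnum_tokens("你好 world"))        # → ['world']
```

If ROUGE does tokenize this way, only Latin letters and digits embedded in the text would count toward the overlap, so the ROUGE columns for these languages would say little about translation quality and would pull the mean down.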
