# Translation Quality Evaluation Notebook

This notebook performs the following tasks:
- **Import and Setup:** Loads the required libraries: pandas, `json`, `evaluate` (for BLEU, ROUGE, and CHRF), `editdistance`, and `tqdm`.
- **Metric Loading:** Loads the evaluation metric modules using the `evaluate` library.
- **Sentence-Level Evaluation:** Reads a JSON file listing the languages, iterates over each language’s CSV file containing sentence-level translations, computes evaluation metrics (BLEU, ROUGE, CHRF) and edit distance, and stores the results in a DataFrame.
- **Paragraph-Level Evaluation:** Mirrors the sentence-level evaluation for paragraph-level translations. It computes the same metrics and labels each result with its translation type, so that results can later be grouped by type.

In [1]:
import pandas as pd
import json
import evaluate
import editdistance
from tqdm import tqdm

## Load Evaluation Metrics

The following code loads the BLEU, ROUGE, and CHRF evaluation metrics using the `evaluate` library.

In [2]:
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
chrf = evaluate.load("chrf")

## Sentence-Level Evaluation

This section reads a JSON file (`sentence.json`) that contains a list of language identifiers. For each language, the notebook reads the corresponding CSV file, which holds the generated and corrected sentences in the `Generated` and `Corrected` columns. It then computes the following metrics for each sentence:
- **BLEU Score**
- **ROUGE Scores (rouge1, rouge2, rougeL)**
- **CHRF Score** (normalized by dividing by 100)
- **Edit Distance** between the generated and corrected sentences

A mean score is also computed over the five similarity metrics (CHRF, BLEU, and the three ROUGE scores); edit distance is not part of the mean. The results are stored in a DataFrame called `sentence`.
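
As a quick check of how the mean is formed, the sketch below recomputes it for two rows copied from the sentence-level output shown further down. The helper name `mean_score` is hypothetical (it is not part of the notebook); it averages only the five similarity metrics and leaves `edit` out, mirroring the expression used in the evaluation loop.

```python
# Hypothetical helper (not in the notebook): average the five similarity
# metrics; the raw edit distance is deliberately excluded.
SIMILARITY_KEYS = ("chrf", "bleu", "rouge1", "rouge2", "rougeL")


def mean_score(entry):
    return sum(entry[key] for key in SIMILARITY_KEYS) / len(SIMILARITY_KEYS)


# Values copied from rows 0 and 1754 of the sentence-level output.
arabic_row = {"chrf": 0.667464, "bleu": 0.427287,
              "rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0, "edit": 17}
vietnamese_row = {"chrf": 1.0, "bleu": 0.0,
                  "rouge1": 1.0, "rouge2": 1.0, "rougeL": 1.0, "edit": 1}

print(round(mean_score(arabic_row), 6))      # 0.21895
print(round(mean_score(vietnamese_row), 6))  # 0.8
```

Both values agree with the `mean` column of those rows (0.218950 and 0.800000). The first row also shows how three zero ROUGE scores pull the mean well below the CHRF and BLEU values.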

In [3]:
with open("../data/post-edited/sentence.json", "r") as json_file:
    sent_langs = json.load(json_file)

entries = []
for lang in tqdm(sent_langs):
    csv_file = f"../data/post-edited/sentence/{lang}.csv"
    data = pd.read_csv(csv_file)

    try:
        for i in range(len(data)):
            entry = dict()
            # Generated translation and its corrected reference for this row.
            generated = data.iloc[i]["Generated"]
            corrected = data.iloc[i]["Corrected"]

            # Compute evaluation scores for the given sentence.
            bleu_score = bleu.compute(predictions=[generated], references=[corrected])
            rouge_score = rouge.compute(predictions=[generated], references=[corrected])
            chrf_score = chrf.compute(predictions=[generated], references=[corrected])

            # Process language name and store computed scores.
            entry["language"] = lang.replace('_', ' ').split()[0].lower()
            entry["chrf"] = chrf_score['score'] / 100  # rescale from 0-100 to 0-1
            entry["bleu"] = bleu_score['bleu']
            entry["rouge1"] = rouge_score['rouge1']
            entry["rouge2"] = rouge_score['rouge2']
            entry["rougeL"] = rouge_score['rougeL']
            entry["edit"] = editdistance.eval(generated, corrected)
            # Mean of the five similarity metrics (edit distance is excluded).
            entry["mean"] = (chrf_score["score"]/100 + bleu_score["bleu"] +
                             rouge_score["rouge1"] + rouge_score["rouge2"] + rouge_score["rougeL"]) / 5
            entries.append(entry)
    except Exception as e:
        # Report the error and continue with the next language; the
        # remaining rows of the failing file are skipped.
        print(f"Error processing {lang}: {e}")
    
sentence = pd.DataFrame(entries)
sentence  # Display the sentence-level evaluation DataFrame
# Optionally, save the results:
# sentence.to_json("../results/sentence.json", orient="records", indent=4)

100%|██████████| 22/22 [01:06<00:00,  3.04s/it]


| index | language | chrf | bleu | rouge1 | rouge2 | rougeL | edit | mean |
|---|---|---|---|---|---|---|---|---|
| 0 | arabic | 0.667464 | 0.427287 | 0.000000 | 0.000000 | 0.000000 | 17 | 0.218950 |
| 1 | arabic | 0.643322 | 0.542066 | 0.000000 | 0.000000 | 0.000000 | 20 | 0.237078 |
| 2 | arabic | 0.804230 | 0.702850 | 0.000000 | 0.000000 | 0.000000 | 18 | 0.301416 |
| 3 | arabic | 0.648618 | 0.547669 | 0.000000 | 0.000000 | 0.000000 | 21 | 0.239257 |
| 4 | arabic | 0.880614 | 0.744373 | 0.000000 | 0.000000 | 0.000000 | 9 | 0.324998 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1752 | vietnamese | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2 | 1.000000 |
| 1753 | vietnamese | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0 | 1.000000 |
| 1754 | vietnamese | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1 | 0.800000 |
| 1755 | vietnamese | 0.706095 | 0.675092 | 0.779221 | 0.693333 | 0.753247 | 28 | 0.721398 |
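
The Arabic rows above score 0 on every ROUGE variant even though their CHRF and BLEU values are well above zero. One plausible cause (an assumption, not something the notebook confirms) is tokenization: the default tokenizer of the `rouge_score` package, which the `evaluate` ROUGE metric builds on, lowercases the text and keeps only ASCII letters and digits, so text in a non-Latin script can be reduced to no tokens at all. The sketch below imitates that behavior with a hypothetical `ascii_alnum_tokens` function and contrasts it with plain whitespace splitting; neither function is taken from the notebook.

```python
import re


def ascii_alnum_tokens(text):
    """Imitate a tokenizer that keeps only lowercase ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def whitespace_tokens(text):
    """Split on whitespace only, keeping characters from any script."""
    return text.split()


print(ascii_alnum_tokens("Hello, world 42"))  # ['hello', 'world', '42']
print(ascii_alnum_tokens("مرحبا بالعالم"))     # []
print(len(whitespace_tokens("مرحبا بالعالم")))  # 2
```

If this is the cause here, the zero ROUGE values reflect the tokenizer rather than translation quality, and they also lower the `mean` column for the affected languages. Recent versions of the `evaluate` ROUGE metric appear to accept a `tokenizer` callable in `compute`; whether that holds for the version used in this notebook has not been verified.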


## Paragraph-Level Evaluation

This section applies the same evaluation to paragraph-level translations. In addition to computing the same metrics, it reads a `types.csv` file and assigns each paragraph a type by row position (e.g., circular, conversation, email, lessons, misc).

Each paragraph’s metrics are computed and stored in the DataFrame `paragraph`.
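
Pairing each paragraph with its type relies on row position: row `i` of a language's CSV is assumed to correspond to row `i` of `types.csv`. A minimal sketch of a guard for that assumption follows; `attach_types` and the sample rows are hypothetical and only illustrate the length check.

```python
def attach_types(rows, types):
    """Pair each row with the type at the same position.

    Raises ValueError when the two sequences differ in length, since a
    positional match would then mislabel or drop rows.
    """
    if len(rows) != len(types):
        raise ValueError(
            f"{len(rows)} rows but {len(types)} types; cannot align by position"
        )
    return [{**row, "type": kind} for row, kind in zip(rows, types)]


# Hypothetical sample data, not taken from the notebook's files.
rows = [{"bleu": 0.5}, {"bleu": 0.7}]
labeled = attach_types(rows, ["email", "conversation"])
print(labeled)
```

In the loop as written, a `types.csv` that is longer than the language file goes unnoticed, while a shorter one raises an `IndexError` that the `except` block reports.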

In [None]:
with open("../data/post-edited/paragraph.json", "r") as json_file:
    par_langs = json.load(json_file)

entries = []
# The same type labels are used for every language, so read them once.
types = pd.read_csv("../data/post-edited/types.csv")
for lang in tqdm(par_langs):
    csv_file = f"../data/post-edited/paragraph/{lang}.csv"
    data = pd.read_csv(csv_file)
    try:
        for i in range(len(data)):
            entry = dict()
            # Generated translation and its corrected reference for this row.
            generated = data.iloc[i]["Generated"]
            corrected = data.iloc[i]["Corrected"]

            bleu_score = bleu.compute(predictions=[generated], references=[corrected])
            rouge_score = rouge.compute(predictions=[generated], references=[corrected])
            chrf_score = chrf.compute(predictions=[generated], references=[corrected])

            entry["language"] = lang.replace('_', ' ').split()[0].lower()
            entry["chrf"] = chrf_score['score'] / 100  # rescale from 0-100 to 0-1
            entry["bleu"] = bleu_score['bleu']
            entry["rouge1"] = rouge_score['rouge1']
            entry["rouge2"] = rouge_score['rouge2']
            entry["rougeL"] = rouge_score['rougeL']
            entry["edit"] = editdistance.eval(generated, corrected)
            # Mean of the five similarity metrics (edit distance is excluded).
            entry["mean"] = (chrf_score["score"]/100 + bleu_score["bleu"] +
                             rouge_score["rouge1"] + rouge_score["rouge2"] + rouge_score["rougeL"]) / 5
            # Associate the translation type (matched by row position).
            entry["type"] = types["type"].iloc[i]
            entries.append(entry)
    except Exception as e:
        # Report the error and continue with the next language.
        print(f"Error processing {lang}: {e}")
    
paragraph = pd.DataFrame(entries)
# Optionally, save the paragraph-level evaluation results:
# paragraph.to_json("../results/paragraph.json", orient="records", indent=4)

100%|██████████| 22/22 [00:05<00:00,  3.78it/s]
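
Once every entry carries a `type`, per-type averages can be derived from the results. The sketch below does this with the standard library on a list of dictionaries shaped like `entries`; the function name `mean_by_type` and the sample values are hypothetical, and with the `paragraph` DataFrame a pandas `groupby("type")` would be the more usual route.

```python
from collections import defaultdict
from statistics import mean


def mean_by_type(entries, metric):
    """Average one metric over all entries that share the same type."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["type"]].append(entry[metric])
    return {kind: mean(values) for kind, values in grouped.items()}


# Hypothetical entries; real ones come from the evaluation loop.
sample = [
    {"type": "email", "chrf": 0.50},
    {"type": "email", "chrf": 0.75},
    {"type": "misc", "chrf": 0.25},
]
print(mean_by_type(sample, "chrf"))  # {'email': 0.625, 'misc': 0.25}
```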
