# BLEU and COMET Exploration and Basic Testing

In [8]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction
from comet import download_model, load_from_checkpoint

The purpose of this notebook is to explore the CWMT datasets I found on Kaggle [here](https://www.kaggle.com/datasets/warmth/cwmt-data). 

The dataset contains a large quantity of information from several years of CWMT conferences (2008, 2009, 2011) as well as a number of other sources, though for the purposes of this exploration I will be limiting my observations to specificially Chinese -> English datasets.

In [9]:
#importing the datasets properly for manipulation
df2008 = pd.read_csv("mt-dataset/cwmt2008_ce_news.tsv", delimiter="\t")
df2009 = pd.read_csv("mt-dataset/cwmt2009_ce_news.tsv", delimiter="\t")

# drops the 4 rows between the two datasets missing a third reference
df2008.dropna()
df2009.dropna()

df2008.head()

Unnamed: 0,datasource,domain,setid,srclang,trglang,src,ref1,ref2,ref3
0,cwmt2008,ce-news,zh_en_news_trans,zh,en,狭小的防震棚已经成为北川擂鼓镇农民张秀华（58岁）临时的家，而就在这个“家”的中央，悬挂了一...,A small narrow anti-earthquake tent became the...,The shockproof shed has become a temporary hom...,The narrow quakeproof shelter has become the t...
1,cwmt2008,ce-news,zh_en_news_trans,zh,en,画像中，中共中央总书记胡锦涛和国务院总理温家宝两人在绵阳机场紧紧握手，画像下有一行题字：“伟...,"In this portrait, Hu Jintao, the General Secre...",The picture showed General Secretary of the Co...,"Hu Jintao, the general secretary of the CPC Ce..."
2,cwmt2008,ce-news,zh_en_news_trans,zh,en,5月16日，四川汶川大地震发生后的第四天，胡锦涛从北京飞抵四川绵竹机场，亲自指挥抗震救灾。,"On May 16th, four days after the Wenchuan eart...","On May 16, the fourth day after the Wenchuan E...","On May 16, the 4th day following Sichuan Wench..."
3,cwmt2008,ce-news,zh_en_news_trans,zh,en,地震后当天就飞到灾区指挥的温家宝到机场迎接，两人一见面，就在飞机前握手致意。,"Wen Jiabao, who flew to the disaster area same...","Wen Jiabao, who has arrived at the disaster ar...","Wen Jiabao, who flew to the quake-hit areas an..."
4,cwmt2008,ce-news,zh_en_news_trans,zh,en,张秀华家挂的胡、温画像是经过电脑处理，原来画面的其他人员已经被掩盖，只有两个人握手的画面。,The portrait of Hu and Wen hung in Zhang Xiuhu...,The portrait of Hu and Wen hung in Zhang Xiuhu...,The figure of President Hu and Premier Wen hun...


The two datasets I will be using are now properly imported, note that each source Chinese code has not one but three "correct" English reference translations.

My goal with this project is to learn about the application of the BLEU and COMET MT evaluation metrics, and to do this I will be evaluating the two main LLMs I've been using for my [Classical Chinese machine translation interface](https://github.com/softly-undefined/classical-chinese-tool-v2) off of two baseline scores:

1. A translation completed by Google Translate, a commonly accepted machine translation tool used widely
2. An approved reference translation, calculating BLEU and COMET comparing ref1 as the MT-generated output to ref2 and ref3 as references (this may be up for change, I'm not the hugest fan of using a different amount of reference translations for this section compared to earlier ones)
3. Potentially Apple Translate, to compare it's efficacy as well, although that is also potentially up for change.

Using this [tutorial](https://machinelearningmastery.com/calculate-bleu-score-for-text-python/) to learn about calculating BLEU scores.


COMET- calculates sentence-by-sentence, when looking corpus-wide it is simply an average of the sentence level scores.

In [10]:
#example code to reference
references = [[['this', 'is', 'a', 'test'], ['this', 'is', 'test']]]
candidates = [['this', 'is', 'a', 'test']]
score = corpus_bleu(references, candidates)
print(score)

1.0


# Test Before Generating Translations

Later on in this project I will be generating a large number of translations using the different models I have been testing, but I want to ensure that I properly understand how to use the evaluation metrics before spending the time/money creating the translations.

## BLEU Score Testing
First I will use BLEU to generate scores for the ref1 columns of the 2008 data compared to the ref2 and ref3 data

In [11]:
#extract the relevant data
df_test = df2008

# format references from ref2 and ref3 columns
df_test[['ref2', 'ref3']] = df_test[['ref2', 'ref3']].astype(str)
references_test = df_test[['ref2', 'ref3']].values.tolist()
references_test = [[sentence.split() for sentence in ref_group] for ref_group in references_test]

# format candidates from ref1 column
df_test['ref1'] = df_test['ref1'].astype(str)
candidates_test = df_test['ref1'].values.tolist()
candidates_test = [sentence.split() for sentence in candidates_test]

bleu_score = corpus_bleu(references_test, candidates_test)
print(bleu_score)

0.2520605875656664


I repeated the process using ref2 and ref3 as the candidates resulting in values of 0.2377663462841575 and 0.2208567008054941 respectively. These scores seem to make sense comparing them to numbers from the [original BLEU paper](https://aclanthology.org/P02-1040.pdf)



With the knowledge from the [original BLEU paper](https://aclanthology.org/P02-1040.pdf) it is clear that I should not compare BLEU scores based on different numbers of reference translations, so going forward I won't be using these translations to compare (though I may try testing my MT output on only 2 references to make it viable for comparison).

## COMET Score Testing

Next I will test the COMET Metric using the same idea of using the 2008 data and doing ref1 compared to ref2 and ref3

Rather than having to split up the sentences of reference and candidate word by word (which BLEU requires) COMET requires a specific format working with the sentences themselves.

In [48]:
#formatting the data correctly using ref1 as candidate and ref2 and ref4 as references
formatted_data = []

for _, row in df2008.iterrows():
    entry = {
        "src": row['src'],
        "mt": row['ref1'],
        "ref": [row['ref2'], row['ref3']]
    }
    formatted_data.append(entry)

In [49]:
model_path = download_model("Unbabel/wmt22-comet-da")

model = load_from_checkpoint(model_path)

model_output = model.predict(formatted_data, batch_size=8, gpus=0)

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

Lightning automatically upgraded your loaded checkpoint from v1.8.3.post1 to v2.3.3. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../../.cache/huggingface/hub/models--Unbabel--wmt22-comet-da/snapshots/371e9839ca4e213dde891b066cf3080f75ec7e72/checkpoints/model.ckpt`
Encoder model frozen.
/Users/ericbennett/miniforge3/lib/python3.10/site-packages/pytorch_lightning/core/saving.py:195: Found keys that are not in the model state dict but in the checkpoint: ['encoder.model.embeddings.position_ids']
GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/ericbennett/miniforge3/lib/python3.10/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
Predicting DataLoader 0: 100%|████████████████| 126/126 [02:39<00:00,  1.27s/it]


In [50]:
comet_score = sum(model_output.scores) / len(model_output.scores)
print(comet_score)

0.7343478642448517


Using ref2 as the candidate resulted in a comet_score of 0.7299891123122296, and ref3 a comet_score of 0.7139009555753609. It's interesting that both BLEU and COMET have a decreasing score order when using different references, probably just chance that it is decreasing but could show the correlation between BLEU and COMET scores.

# Translation Generation

I won't be generating the sentences themselves in this notebook file, but I have in separate files within this project, storing the resulting data in two .csv files.

In [None]:
translations2008 = pd.read_csv("mtranslations/translations2008.csv")
translations2009 = pd.read_csv("mtranslations/translations2009.csv")

# next here add each of the .csv's to the df2008 and df2009 dataframes respectively


# df2008





# df2009

In [None]:
# the model names will be stored in this array for reference
# when calculating BLEU and COMET scores 
translation_models = ['ref1']

# BLEU Score Calculation

Now that we have the translations we are looking to evaluated in our dataframe, we can calculate the BLEU scores for each of our different translation methods 

## 2008 Dataset

In [52]:
#creating a references array for the 2008 dataset in a format acceptable to corpus_bleu

df2008[['ref1', 'ref2', 'ref3']] = df2008[['ref1', 'ref2', 'ref3']].astype(str)
references2008 = df2008[['ref1', 'ref2', 'ref3']].values.tolist()
references2008 = [[sentence.split() for sentence in ref_group] for ref_group in references2008]


for model in translation_models:
    df2008[model] = df2008[model].astype(str)
    candidates2008 = df2008[model].values.tolist()
    candidates2008 = [sentence.split() for sentence in candidates2008]
    
    bleu_score = corpus_bleu(references2008, candidates2008)
    print(f"{model} BLEU score (2008 Dataset): {bleu_score}")
    
    

ref1 BLEU score 2008 Dataset: 0.9995760206964518


## 2009 Dataset

In [53]:
#creating a references array for the 2009 dataset in a format acceptable to corpus_bleu

df2009[['ref1','ref2','ref3']] = df2009[['ref1','ref2','ref3']].astype(str)
references2009 = df2009[['ref1','ref2','ref3']].values.tolist()
references2009 = [[sentence.split() for sentence in ref_group] for ref_group in references2009]


for model in translation_models:
    df2009[model] = df2009[model].astype(str)
    candidates2009 = df2009[model].values.tolist()
    candidates2009 = [sentence.split() for sentence in candidates2009]
    
    bleu_score = corpus_bleu(references2008, candidates2008)
    print(f"{model} BLEU score (2008 Dataset): {bleu_score}")
    

ref1 BLEU score (2008 Dataset): 0.9995760206964518


# COMET Score Calculation

Finally, we will calculate the COMET evaluation metrics.

## 2008 Dataset

In [58]:
comet_scores2008 = []


for model_name in translation_models:    
    formatted_data2008 = []
    
    for _, row in df2008.iterrows():
        entry = {
            "src": row['src'],
            "mt": row[model_name],
            "ref": [row['ref1'],row['ref2'], row['ref3']]
        }
        formatted_data2008.append(entry)

    model_path = download_model("Unbabel/wmt22-comet-da")
    model = load_from_checkpoint(model_path)
    model_output = model.predict(formatted_data, batch_size=8, gpus=0)
    
    comet_score = sum(model_output.scores) / len(model_output.scores)
    
    comet_scores2008.append((model_name, comet_score))
    

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

Lightning automatically upgraded your loaded checkpoint from v1.8.3.post1 to v2.3.3. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../../.cache/huggingface/hub/models--Unbabel--wmt22-comet-da/snapshots/371e9839ca4e213dde891b066cf3080f75ec7e72/checkpoints/model.ckpt`
Encoder model frozen.
/Users/ericbennett/miniforge3/lib/python3.10/site-packages/pytorch_lightning/core/saving.py:195: Found keys that are not in the model state dict but in the checkpoint: ['encoder.model.embeddings.position_ids']
GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/ericbennett/miniforge3/lib/python3.10/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
Predicting DataLoader 0: 100%|████████████████| 126/126 [02:33<00:00,  1.22s/it]


In [61]:
for model, comet_score in comet_scores2008:
    print(f"{model} COMET score (2008 Dataset): {comet_score}")

ref1 COMET score (2009 Dataset): 0.7343478642448517


## 2009 Dataset

In [59]:
comet_scores2009 = []

for model_name in translation_models:
    formatted_data2009 = []
    
    for _, row in df2009.iterrows():
        entry = {
            "src": row['src'],
            "mt": row[model_name],
            "ref": [row['ref1'],row['ref2'], row['ref3']]
        }
        formatted_data2009.append(entry) 
    
    model_path = download_model("Unbabel/wmt22-comet-da")
    model = load_from_checkpoint(model_path)
    model_output = model.predict(formatted_data, batch_size=8, gpus=0)
    
    
    comet_score = sum(model_output.scores) / len(model_output.scores)
    
    comet_scores2009.append((model_name, comet_score))
    

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

Lightning automatically upgraded your loaded checkpoint from v1.8.3.post1 to v2.3.3. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../../.cache/huggingface/hub/models--Unbabel--wmt22-comet-da/snapshots/371e9839ca4e213dde891b066cf3080f75ec7e72/checkpoints/model.ckpt`
Encoder model frozen.
/Users/ericbennett/miniforge3/lib/python3.10/site-packages/pytorch_lightning/core/saving.py:195: Found keys that are not in the model state dict but in the checkpoint: ['encoder.model.embeddings.position_ids']
GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/ericbennett/miniforge3/lib/python3.10/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
Predicting DataLoader 0: 100%|████████████████| 126/126 [02:27<00:00,  1.17s/it]


In [62]:
for model_name, comet_score in comet_scores2009:
    print(f"{model_name} COMET score (2009 Dataset): {comet_score}")

ref1 COMET score (2009 Dataset): 0.7343478642448517
