In [31]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction

This notebook explores the CWMT datasets I found on [Kaggle](https://www.kaggle.com/datasets/warmth/cwmt-data).

The dataset contains a large amount of data from several years of CWMT conferences (2008, 2009, 2011) as well as a number of other sources. For the purposes of this exploration, I will limit my observations to the Chinese -> English datasets.

In [32]:
#importing the datasets properly for manipulation
df2008 = pd.read_csv("mt-dataset/cwmt2008_ce_news.tsv", delimiter="\t")
df2009 = pd.read_csv("mt-dataset/cwmt2009_ce_news.tsv", delimiter="\t")

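# drop the 4 rows (across the two datasets) that are missing a third reference
# dropna() returns a new DataFrame, so the result has to be assigned back
df2008 = df2008.dropna()
df2009 = df2009.dropna()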

df2008.head()

Unnamed: 0,datasource,domain,setid,srclang,trglang,src,ref1,ref2,ref3
0,cwmt2008,ce-news,zh_en_news_trans,zh,en,狭小的防震棚已经成为北川擂鼓镇农民张秀华（58岁）临时的家，而就在这个“家”的中央，悬挂了一...,A small narrow anti-earthquake tent became the...,The shockproof shed has become a temporary hom...,The narrow quakeproof shelter has become the t...
1,cwmt2008,ce-news,zh_en_news_trans,zh,en,画像中，中共中央总书记胡锦涛和国务院总理温家宝两人在绵阳机场紧紧握手，画像下有一行题字：“伟...,"In this portrait, Hu Jintao, the General Secre...",The picture showed General Secretary of the Co...,"Hu Jintao, the general secretary of the CPC Ce..."
2,cwmt2008,ce-news,zh_en_news_trans,zh,en,5月16日，四川汶川大地震发生后的第四天，胡锦涛从北京飞抵四川绵竹机场，亲自指挥抗震救灾。,"On May 16th, four days after the Wenchuan eart...","On May 16, the fourth day after the Wenchuan E...","On May 16, the 4th day following Sichuan Wench..."
3,cwmt2008,ce-news,zh_en_news_trans,zh,en,地震后当天就飞到灾区指挥的温家宝到机场迎接，两人一见面，就在飞机前握手致意。,"Wen Jiabao, who flew to the disaster area same...","Wen Jiabao, who has arrived at the disaster ar...","Wen Jiabao, who flew to the quake-hit areas an..."
4,cwmt2008,ce-news,zh_en_news_trans,zh,en,张秀华家挂的胡、温画像是经过电脑处理，原来画面的其他人员已经被掩盖，只有两个人握手的画面。,The portrait of Hu and Wen hung in Zhang Xiuhu...,The portrait of Hu and Wen hung in Zhang Xiuhu...,The figure of President Hu and Premier Wen hun...


The two datasets I will be using are now imported. Note that each Chinese source sentence has not one but three "correct" English reference translations.

My goal with this project is to learn about applying the BLEU and COMET MT evaluation metrics. To do this, I will evaluate the two main LLMs I've been using for my [Classical Chinese machine translation interface](https://github.com/softly-undefined/classical-chinese-tool-v2) against two baselines, plus possibly a third:

1. A translation produced by Google Translate, a widely used machine translation tool.
2. An approved reference translation: BLEU and COMET are calculated with ref1 treated as the MT-generated output and ref2 and ref3 as the references. (This may change; I'm not a huge fan of using a different number of reference translations here than in the other sections.)
3. Potentially Apple Translate, to compare its efficacy as well, although that may also change.
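
The second baseline amounts to holding out ref1 as if it were system output. A minimal sketch of that split, assuming the reference groups are already tokenized and ordered as [ref1, ref2, ref3]. The function name `hold_out_first_reference` and the toy sentences are hypothetical and not part of the dataset:

```python
def hold_out_first_reference(ref_groups):
    """Treat the first reference in each group as the candidate translation.

    Returns (candidates, references), where references[i] holds the
    remaining references for candidates[i].
    """
    candidates = [group[0] for group in ref_groups]
    references = [group[1:] for group in ref_groups]
    return candidates, references


# toy stand-in for tokenized [ref1, ref2, ref3] groups (made-up sentences)
toy_groups = [
    [["the", "cat", "sat"], ["a", "cat", "sat"], ["the", "cat", "was", "sitting"]],
    [["it", "rained"], ["it", "was", "raining"], ["rain", "fell"]],
]

candidates, references = hold_out_first_reference(toy_groups)
print(candidates[0])       # the held-out ref1 for the first sentence
print(len(references[0]))  # number of references left for that sentence
```

The two returned lists line up with the `corpus_bleu(references, candidates)` argument order used in this notebook.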

I am using this [tutorial](https://machinelearningmastery.com/calculate-bleu-score-for-text-python/) to learn about calculating BLEU scores.


COMET: scores are calculated sentence by sentence; the corpus-wide score is simply the average of the sentence-level scores.
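
As a worked example of that corpus-level averaging, here is a small sketch with made-up sentence scores. The numbers are placeholders chosen for illustration, not real COMET outputs, and the variable names are my own:

```python
from statistics import mean

# hypothetical sentence-level scores for a three-sentence corpus
segment_scores = [0.82, 0.75, 0.91]

# corpus-level score, assuming it is the plain arithmetic mean
system_score = mean(segment_scores)
print(round(system_score, 4))
```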

In [37]:
#example code to reference: one candidate sentence with two tokenized references
references = [[['this', 'is', 'a', 'test'], ['this', 'is', 'test']]]
candidates = [['this', 'is', 'a', 'test']]
score = corpus_bleu(references, candidates)
print(score)

1.0
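
The score of 1.0 follows from how BLEU handles multiple references: each candidate n-gram is credited at most as many times as it appears in any single reference. A minimal pure-Python sketch of that clipping step, for illustration only. `ngrams` and `clipped_precision` are hypothetical helpers written here, not NLTK's implementation, and the full BLEU score additionally combines several n-gram orders and applies a brevity penalty:

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(references, candidate, n):
    """Fraction of candidate n-grams matched in the references, with clipping."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # for each n-gram, the largest count seen in any single reference
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


refs = [["this", "is", "a", "test"], ["this", "is", "test"]]
print(clipped_precision(refs, ["this", "is", "a", "test"], 1))       # 1.0
print(clipped_precision(refs, ["this", "this", "this", "test"], 1))  # 0.5
```

In the second call, "this" appears three times in the candidate but only once in either reference, so only one occurrence is credited.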


In [46]:
#creating a references array for the 2008 dataset in a format acceptable to corpus_bleu

df2008[['ref1', 'ref2', 'ref3']] = df2008[['ref1', 'ref2', 'ref3']].astype(str)
references2008 = df2008[['ref1', 'ref2', 'ref3']].values.tolist()
references2008 = [[sentence.split() for sentence in ref_group] for ref_group in references2008]

In [45]:
#creating a references array for the 2009 dataset in a format acceptable to corpus_bleu

df2009[['ref1','ref2','ref3']] = df2009[['ref1','ref2','ref3']].astype(str)
references2009 = df2009[['ref1','ref2','ref3']].values.tolist()
references2009 = [[sentence.split() for sentence in ref_group] for ref_group in references2009]
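
Because `astype(str)` turns a missing value into the literal string "nan", any row with a missing reference that survives to this point would end up as the single token `['nan']`. A small sanity check along these lines could catch that before scoring. This is a sketch under that assumption; `find_bad_groups` and the toy data are hypothetical, and the same call could be applied to `references2008` and `references2009`:

```python
def find_bad_groups(ref_groups, expected_refs=3):
    """Return indices of groups with the wrong size or a missing reference."""
    bad = []
    for i, group in enumerate(ref_groups):
        wrong_size = len(group) != expected_refs
        # an empty token list or a lone 'nan' token marks a missing reference
        has_missing = any(len(ref) == 0 or ref == ["nan"] for ref in group)
        if wrong_size or has_missing:
            bad.append(i)
    return bad


# toy stand-in for a tokenized references list (made-up tokens)
toy_references = [
    [["a", "b"], ["a", "c"], ["a", "d"]],  # fine
    [["a"], ["nan"], ["b"]],               # second reference was missing
    [["a"], ["b"]],                        # only two references
]

print(find_bad_groups(toy_references))  # [1, 2]
```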