# Evaluation on CoVoST2 dataset

<a target="_blank" href="https://colab.research.google.com/github/shreyjasuja/re_s2st/blob/main/covost2_eval.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook reproduces the evaluation results of three models on the CoVoST2 dataset:

*   [Whisper](https://arxiv.org/pdf/2212.04356.pdf) (Radford et al., 2022)

*   [XLS-R](https://arxiv.org/pdf/2111.09296.pdf) (Babu et al., 2021)

*   [SeamlessM4T](https://arxiv.org/pdf/2308.11596.pdf) (Barrault et al., 2023)


CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages. The dataset is created using Mozilla's open-source Common Voice database of crowdsourced voice recordings. There are 2,900 hours of speech represented in the corpus.

Although most of these models are multi-task models, we focus here on their multilingual translation capabilities.




In [2]:
import pandas as pd
from tqdm import tqdm
import sacrebleu
from datasets import load_dataset
import json
import torch
import collections



### Extract the dataset

Recall that in the earlier notebook we downloaded the audio data and saved the compressed files. We now download a script from our repository that extracts these files.

In [1]:
!wget https://raw.githubusercontent.com/shreyjasuja/re_s2st/main/scripts/extract_and_cleanup.sh -O data/extract_and_cleanup.sh

--2024-04-15 02:53:58--  https://raw.githubusercontent.com/shreyjasuja/re_s2st/main/scripts/extract_and_cleanup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 934 [text/plain]
Saving to: ‘data/extract_and_cleanup.sh’


2024-04-15 02:53:58 (30.4 MB/s) - ‘data/extract_and_cleanup.sh’ saved [934/934]



Let us now run that script to extract all the audio files into the required directory structure. This takes approximately 8-10 minutes.

In [2]:
%time !(cd data && chmod +x extract_and_cleanup.sh && ./extract_and_cleanup.sh) &> /dev/null

CPU times: user 9.22 s, sys: 1.48 s, total: 10.7 s
Wall time: 7min 35s


So far we have only extracted the audio files. We also need the transcriptions and/or translations as ground truth for our evaluation. This reference text is provided through the Hugging Face 🤗 Datasets library [here](https://huggingface.co/datasets/covost2).

Let's load one language, say Catalan, and see what the data looks like. The language code for Catalan is `ca`.

In [4]:
data=load_dataset("covost2","ca_en",data_dir="data/ca",split="test",trust_remote_code=True)

Downloading builder script: 100%|██████████| 6.96k/6.96k [00:00<00:00, 2.54MB/s]
Downloading readme: 100%|██████████| 24.4k/24.4k [00:00<00:00, 9.37MB/s]
Downloading data: 100%|██████████| 5.02M/5.02M [00:00<00:00, 13.7MB/s]
Generating train split: 100%|██████████| 95854/95854 [00:10<00:00, 9333.54 examples/s] 
Generating validation split: 100%|██████████| 12730/12730 [00:02<00:00, 6074.45 examples/s]
Generating test split: 100%|██████████| 12730/12730 [00:02<00:00, 5999.39 examples/s]


#### Let's have a look at the data

Each data point has the `path` to the audio file we downloaded earlier, an audio `array` already sampled at 16,000 Hz, the source-language transcription as `sentence`, and the English translation as `translation`.

In [5]:
data[0]

{'client_id': '03de40b6ecf87f9e1f42719a857b2fbf3b93179bf443e707870f2dda3e53b621248065d52be4dfa6ec462fe118b76b345c19e14063b840813a369c54aab6e1c6',
 'file': '/home/cc/data/ca/clips/common_voice_ca_19034690.mp3',
 'audio': {'path': '/home/cc/data/ca/clips/common_voice_ca_19034690.mp3',
  'array': array([ 2.32830644e-10, -1.74622983e-10, -3.25962901e-09, ...,
          9.91155393e-04, -7.40018208e-04, -5.23986295e-04]),
  'sampling_rate': 16000},
 'sentence': '"Supervisa l\'emissió de les resolucions de concessió de l\'habitació."',
 'translation': 'Supervises issuance of room concession decisions.',
 'id': 'common_voice_ca_19034690'}
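
As a quick sanity check on the structure described above, the sketch below computes a clip's duration from the length of `array` and the `sampling_rate`. The helper name `clip_duration_seconds` and the synthetic `example` record are illustrative assumptions, not part of the dataset API; with the real data you would pass `data[0]` instead.

```python
def clip_duration_seconds(item):
    """Return the clip length in seconds for a CoVoST2-style record."""
    audio = item["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic stand-in for data[0]: 2 seconds of silence at 16 kHz.
example = {
    "audio": {"path": "clip.mp3", "array": [0.0] * 32000, "sampling_rate": 16000},
    "sentence": "source-language transcript",
    "translation": "English reference translation",
}

duration = clip_duration_seconds(example)
print(f"{duration:.2f} s")
```
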

We store the evaluation results under `results/covost2`, with separate subdirectories for scores and generations.

In [3]:
import os
results_directory='results/covost2'
# exist_ok=True creates each subdirectory even if the parent directory already exists
os.makedirs(os.path.join(results_directory,'scores'),exist_ok=True)
os.makedirs(os.path.join(results_directory,'generations'),exist_ok=True)


## Divide languages into resource categories

When evaluating translation performance, we divide the languages into high-, mid- and low-resource categories according to the amount of data available for each language. This split is provided by Babu et al., 2021 in their XLS-R [paper](https://arxiv.org/pdf/2111.09296.pdf).

In [4]:
res_levels=["low_res","mid_res","high_res"]

In [5]:
high_res=['ca','de','fr','es']
mid_res=['zh-CN','fa','it','ru','pt']
low_res=['mn','ta','lv','et','cy','sl','ja','tr','ar','nl','sv-SE','id']
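
Since the aggregation further down averages over all 21 languages, it can help to confirm that the three lists really partition those 21 codes. The following is a minimal sketch of such a check; the `groups` dictionary simply repeats the lists above so the snippet runs on its own, and the variable names are illustrative.

```python
from collections import Counter

# Same groups as in the cell above, repeated so the check is self-contained.
groups = {
    "high_res": ['ca', 'de', 'fr', 'es'],
    "mid_res": ['zh-CN', 'fa', 'it', 'ru', 'pt'],
    "low_res": ['mn', 'ta', 'lv', 'et', 'cy', 'sl', 'ja', 'tr', 'ar', 'nl', 'sv-SE', 'id'],
}

# Flatten the groups and look for any code that appears more than once.
all_langs = [lang for langs in groups.values() for lang in langs]
duplicates = [lang for lang, n in Counter(all_langs).items() if n > 1]

print({level: len(langs) for level, langs in groups.items()})
print("total:", len(all_langs), "duplicates:", duplicates)
```
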

In [99]:
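def resource_level_results(scores,model_name):
  # map each level name to its language list instead of calling eval() on the name
  levels={"low_res":low_res,"mid_res":mid_res,"high_res":high_res}
  res_scores=collections.defaultdict(float)
  n_langs=0
  for level,langs in levels.items():
    for lang in langs:
      res_scores[level]+=scores[lang]
    res_scores['all']+=res_scores[level]
    res_scores[level]/=len(langs)
    n_langs+=len(langs)
  # average over all languages (21 for the lists defined above)
  res_scores['all']/=n_langs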
  return {
      "Model":model_name,
      "High" : round(res_scores["high_res"],1),
      "Mid" : round(res_scores["mid_res"],1),
      "Low" : round(res_scores["low_res"],1),
      "All" : round(res_scores['all'],1)
  }
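
To make the averaging explicit, here is a small self-contained sketch of the same idea on toy numbers (not real results). The function `level_averages` and the toy groups are hypothetical names used only for illustration. Note that the overall figure is the mean over all languages, so larger groups weigh more; it is not the mean of the per-group averages.

```python
def level_averages(scores, groups):
    """Average per-language scores within each group and over all languages."""
    result = {}
    total, count = 0.0, 0
    for level, langs in groups.items():
        level_sum = sum(scores[lang] for lang in langs)
        result[level] = round(level_sum / len(langs), 1)
        total += level_sum
        count += len(langs)
    # Overall average is taken per language, not per group.
    result["all"] = round(total / count, 1)
    return result


# Toy scores, not real results: two groups of unequal size.
toy_groups = {"high": ["a", "b"], "low": ["c", "d", "e", "f"]}
toy_scores = {"a": 40.0, "b": 30.0, "c": 10.0, "d": 10.0, "e": 10.0, "f": 10.0}

averages = level_averages(toy_scores, toy_groups)
print(averages)
```

With these toy numbers the per-language average is 18.3, whereas averaging the two group means (35.0 and 10.0) would give 22.5.
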


In [130]:
final_results=[]

In [128]:
lang_codes= low_res + mid_res +high_res

## Evaluation metrics

We use the BLEU score as our evaluation metric, taking the implementation from the sacrebleu library, which is consistent with the methodology cited in the research papers. SeamlessM4T also reports its scores with this library, using *sacrebleu version 2.3.1*.

In [6]:
def evaluate_sacre_bleu(translations,gt_translations):
  #calculate BLEU score
  bleu = sacrebleu.corpus_bleu(translations, [gt_translations])
  return round(bleu.score, 3)
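
For intuition about what the metric measures, the sketch below implements a simplified corpus-level BLEU in plain Python: clipped n-gram precisions up to 4-grams, combined by a geometric mean and multiplied by a brevity penalty. It assumes one reference per segment, whitespace tokenization and no smoothing, so its numbers will generally differ from sacrebleu's; `simple_corpus_bleu` is a hypothetical helper for illustration and is not used for the reported results.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(hypotheses, references, max_n=4):
    """Corpus BLEU with one reference per segment, whitespace tokens, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


reference = ["the cat sat on the mat"]
print(simple_corpus_bleu(["the cat sat on the mat"], reference))
print(simple_corpus_bleu(["the cat sat on a mat"], reference))
```

An exact match scores 100, a hypothesis sharing no 4-gram with the reference scores 0, and partial overlaps fall in between.
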

Alternatively, we could have used NLTK's BLEU implementation, for which the scoring function would look like this.

In [7]:
import nltk
from nltk.translate.bleu_score import corpus_bleu
from nltk.tokenize import word_tokenize
def evaluate_nltk_bleu(translations,gt_translations):
  references = [[word_tokenize(ref)] for ref in gt_translations]
  candidates = [word_tokenize(cand) for cand in translations]
  bleu_score=corpus_bleu(list_of_references=references,hypotheses=candidates)
  return round(bleu_score * 100, 3)


## Evaluate Whisper model

Whisper is available in several model sizes. Of these, `large-v2`, being the largest, tends to perform best, so we reproduce the results for the Whisper large-v2 model for comparative analysis.

### Load the model

In [33]:
import whisper
model = whisper.load_model("large-v2")

In [34]:
import numpy as np
print(
    f"Model is {'multilingual' if model.is_multilingual else 'English-only'} "
    f"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters."
)

Model is multilingual and has 1,541,384,960 parameters.


Below is the function that runs inference for a given source language on the X-eng translation task.

The parameters defined under `options` are consistent with the example [notebook](https://github.com/openai/whisper/blob/main/notebooks/Multilingual_ASR.ipynb) for multilingual translation shared in Whisper's GitHub repository.

### Model Inference

In [35]:
def whisper_inference(src_lang):
  x_en=load_dataset("covost2",src_lang+"_en",data_dir="data/"+src_lang,split="test",trust_remote_code=True)

  options = dict(language=src_lang, beam_size=5, best_of=5)
  # transcribe_options = dict(task="transcribe",**options)
  translate_options = dict(task="translate",**options)

  translations = []
  gt_translations = []

  # transcriptions = []
  # gt_transcripts=[]


  for item in tqdm(x_en):
      audio = item['file']

      translation = model.transcribe(audio, **translate_options)["text"]
      translations.append(translation)
      gt_translations.append(item['translation'])

      # transcription = model.transcribe(audio, **transcribe_options)["text"]
      # transcriptions.append(transcription)
      # gt_transcripts.append(item['sentence'])
  return translations, gt_translations





In [37]:
whisper_bleu_score = collections.defaultdict(float)
whisper_translations = collections.defaultdict(list)  # language code -> list of translated strings

In [None]:
for src in lang_codes:
  translations, ground_truth = whisper_inference(src)
  whisper_translations[src] = translations
  whisper_bleu_score[src] = evaluate_sacre_bleu(translations=translations,gt_translations=ground_truth)
  with open(os.path.join(results_directory,'scores','whisper_eval.json'), 'w') as f:
    json.dump(whisper_bleu_score, f, indent=4)
  with open(os.path.join(results_directory,'generations','whisper_translations.json'), 'w') as f:
    json.dump(whisper_translations, f, indent=4)



 83%|████████▎ | 1464/1759 [1:41:57<13:19,  2.71s/it]
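
Because the loop above rewrites both JSON files after every language, an interrupted run can in principle be resumed from the saved scores instead of starting over. The sketch below shows one way this could look; `load_checkpoint` and `pending_languages` are hypothetical helpers that are not part of the original notebook, and the demo writes to a temporary directory rather than to `results/covost2`.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return previously saved per-language results, or an empty dict if none exist."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}


def pending_languages(lang_codes, done):
    """Keep only the languages that have no saved result yet, preserving order."""
    return [lang for lang in lang_codes if lang not in done]


# Demo with a temporary file standing in for the scores JSON of a partial run.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "whisper_eval.json")
    with open(path, "w") as f:
        json.dump({"mn": 0.136, "ta": 3.748}, f)
    done = load_checkpoint(path)
    todo = pending_languages(["mn", "ta", "lv", "et"], done)

print(todo)
```

In the notebook, one would initialise the score dictionary from `load_checkpoint(...)` and iterate over `pending_languages(lang_codes, ...)`. The translations file would need the same treatment so that the two outputs stay in sync.
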

In [81]:
whisper_bleu_score

{'fr': 35.453,
 'de': 34.886,
 'es': 39.56,
 'ca': 30.76,
 'it': 36.066,
 'ru': 42.257,
 'zh-CN': 15.964,
 'pt': 50.954,
 'fa': 17.683,
 'et': 12.717,
 'mn': 0.136,
 'nl': 40.06,
 'tr': 27.228,
 'ar': 37.944,
 'sv-SE': 41.77,
 'lv': 12.57,
 'sl': 19.459,
 'ta': 3.748,
 'ja': 24.571,
 'id': 46.6,
 'cy': 19.088}

### Resource-level results

In [135]:
print(resource_level_results(whisper_bleu_score, "Whisper large-v2"))

{'Model': 'Whisper large-v2', 'High': 35.2, 'Mid': 32.6, 'Low': 23.8, 'All': 28.1}


In [131]:
final_results.append(resource_level_results(whisper_bleu_score, "Whisper large-v2"))

In [47]:
#clear GPU memory
del model
torch.cuda.empty_cache()

## Evaluate XLS-R (2B) model

We use the Hugging Face 🤗 Transformers implementation of the XLS-R (2B) model.

We use the `wav2vec2-xls-r-2b-21-to-en` checkpoint, an encoder-decoder model that has been fine-tuned to support the languages in the CoVoST2 X-eng translation directions. Details can be found [here](https://huggingface.co/facebook/wav2vec2-xls-r-2b-21-to-en).

❗ **Note**: Please be aware that the reference inference code given on Hugging Face doesn't work; please use the implementation below.

### Load the model

In [None]:
import torch
from transformers import SpeechEncoderDecoderModel,MBart50Tokenizer
from datasets import load_dataset
#loading the MBart50Tokenizer as decoder is MBart50 transformer model
tokenizer = MBart50Tokenizer.from_pretrained("facebook/mbart-large-50")

In [None]:
from transformers import Wav2Vec2FeatureExtractor
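# from_pretrained loads the checkpoint's preprocessing config; the bare constructor would treat the string as feature_size
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-2b-21-to-en")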

In [None]:
import warnings

# Suppress UserWarnings
warnings.filterwarnings("ignore", category=UserWarning)

We use the `pipeline` function to put together the tokenizer, the feature extractor and the model itself.

In [None]:
from transformers import pipeline
asr=pipeline(model="facebook/wav2vec2-xls-r-2b-21-to-en",tokenizer=tokenizer,feature_extractor=feature_extractor,device=0)

### Model Inference

In [None]:
def xlsr_inference(src_lang):
  x_en=load_dataset("covost2",src_lang+"_en",data_dir="data/"+src_lang,split="test",trust_remote_code=True)

  translations = []
  gt_translations = []

  for item in tqdm(x_en):
      audio = item['file']

      translation = asr(audio)["text"]
      translations.append(translation)
      gt_translations.append(item['translation'])

  return translations, gt_translations

In [None]:
xlsr_bleu_score=collections.defaultdict(float)
xlsr_translations = collections.defaultdict(list)  # language code -> list of translated strings

In [None]:
for src in lang_codes:
    translations, ground_truth=xlsr_inference(src)
    xlsr_bleu_score[src]=evaluate_sacre_bleu(translations=translations,gt_translations=ground_truth)
    xlsr_translations[src]=translations
    with open(os.path.join(results_directory,'scores','xlsr_eval.json'), 'w') as f:
      json.dump(xlsr_bleu_score, f, indent=4)
    with open(os.path.join(results_directory,'generations','xlsr_translations.json'), 'w') as f:
      json.dump(xlsr_translations, f, indent=4)


In [104]:
xlsr_bleu_score

{'mn': 1.877,
 'ta': 0.613,
 'lv': 20.774,
 'et': 11.186,
 'cy': 14.671,
 'sl': 19.117,
 'ja': 4.102,
 'tr': 16.774,
 'ar': 16.991,
 'nl': 31.883,
 'sv-SE': 30.987,
 'id': 16.255,
 'zh-CN': 9.475,
 'fa': 13.073,
 'it': 35.034,
 'ru': 39.44,
 'pt': 42.012,
 'ca': 33.813,
 'de': 33.486,
 'fr': 37.614,
 'es': 39.166}

### Resource-level results

In [136]:
print(resource_level_results(xlsr_bleu_score, "XLS-R (2B)"))

{'Model': 'XLS-R (2B)', 'High': 36.0, 'Mid': 27.8, 'Low': 15.4, 'All': 22.3}


In [137]:
final_results.append(resource_level_results(xlsr_bleu_score, "XLS-R (2B)"))

## Evaluate Seamless models

The claims in our study are evaluated on both the Seamless medium and large models. The two models differ only in the number of parameters, so the inference procedure is the same for both.

❗ **Note**: *To evaluate the Seamless models on CoVoST2 data, set `model_type` to medium or large and run the code in this section.*

In [144]:
# model_type = "medium"
model_type = "large"

We use the Seamless models added to Hugging Face 🤗 by Facebook; more information is available in the [model card](https://huggingface.co/facebook/seamless-m4t-medium). The code in this section is adapted from the documentation available [here](https://huggingface.co/docs/transformers/v4.38.0/en/model_doc/seamless_m4t#overview).

### Load the model

In [None]:
from transformers import AutoProcessor, SeamlessM4TModel

model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-"+model_type)
model.cuda()
processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-"+model_type)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Model Inference

In [None]:
def seamless_inference(src_lang):
  x_en=load_dataset("covost2",src_lang+"_en",data_dir="data/"+src_lang,split="test",trust_remote_code=True)

  translations = []
  gt_translations = []


  for item in tqdm(x_en):
      audio_sample = item['audio']
      audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt",sampling_rate=16000)
      audio_inputs = {k: v.to('cuda') for k, v in audio_inputs.items()}
      output_tokens = model.generate(**audio_inputs, tgt_lang="eng",generate_speech=False)
      translation=processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
      translations.append(translation)
      gt_translations.append(item['translation'])

  return translations, gt_translations

In [64]:
#dictionaries to store BLEU score and translations
seamless_bleu_score=collections.defaultdict(float)
seamless_translations=collections.defaultdict(list)

In [None]:
for src in lang_codes:
    translations, ground_truth=seamless_inference(src)
    seamless_bleu_score[src]=evaluate_sacre_bleu(translations=translations,gt_translations=ground_truth)
    seamless_translations[src]=translations

    with open(os.path.join(results_directory,'scores','seamless_'+model_type+'_eval.json'), 'w') as f:
      json.dump(seamless_bleu_score, f, indent=4)
    with open(os.path.join(results_directory,'generations','seamless_'+model_type+'_translations.json'), 'w') as f:
      json.dump(seamless_translations, f, indent=4)

100%|██████████| 1759/1759 [12:11<00:00,  2.41it/s]
100%|██████████| 786/786 [03:29<00:00,  3.75it/s]
100%|██████████| 1629/1629 [06:33<00:00,  4.14it/s]
100%|██████████| 1571/1571 [13:28<00:00,  1.94it/s]
100%|██████████| 690/690 [03:42<00:00,  3.10it/s]
100%|██████████| 360/360 [01:46<00:00,  3.38it/s]
100%|██████████| 684/684 [03:21<00:00,  3.40it/s]
100%|██████████| 1629/1629 [08:26<00:00,  3.21it/s]
100%|██████████| 1695/1695 [07:31<00:00,  3.76it/s]
100%|██████████| 1699/1699 [08:25<00:00,  3.36it/s]
100%|██████████| 1595/1595 [06:42<00:00,  3.97it/s]
100%|██████████| 844/844 [03:24<00:00,  4.12it/s]
100%|██████████| 4898/4898 [32:36<00:00,  2.50it/s]
100%|██████████| 3445/3445 [17:31<00:00,  3.28it/s]
100%|██████████| 8951/8951 [56:18<00:00,  2.65it/s]
100%|██████████| 6300/6300 [41:09<00:00,  2.55it/s]
100%|██████████| 4023/4023 [20:34<00:00,  3.26it/s]
100%|██████████| 12730/12730 [1:18:42<00:00,  2.70it/s]
 50%|█████     | 6766/13511 [40:45<37:00,  3.04it/s]

### Resource-level results

In [142]:
print(resource_level_results(seamless_bleu_score, "Seamless "+model_type))

{'Model': 'Seamless medium', 'High': 37.3, 'Mid': 33.6, 'Low': 28.3, 'All': 31.3}


In [143]:
final_results.append(resource_level_results(seamless_bleu_score, "Seamless "+model_type))

In [147]:
print(resource_level_results(seamless_bleu_score, "Seamless "+model_type))

{'Model': 'Seamless large', 'High': 39.3, 'Mid': 36.2, 'Low': 31.9, 'All': 34.3}


In [148]:
final_results.append(resource_level_results(seamless_bleu_score, "Seamless "+model_type))

## Aggregate results

In [152]:
import pandas as pd

df= pd.DataFrame(final_results)

In [153]:
df

|   | Model | High | Mid | Low | All |
|---|---|---|---|---|---|
| 0 | Whisper large-v2 | 35.2 | 32.6 | 23.8 | 28.1 |
| 1 | XLS-R (2B) | 36.0 | 27.8 | 15.4 | 22.3 |
| 2 | Seamless medium | 37.3 | 33.6 | 28.3 | 31.3 |
| 3 | Seamless large | 39.3 | 36.2 | 31.9 | 34.3 |
