# Evaluation on CoVoST2 dataset

<a target="_blank" href="https://colab.research.google.com/github/shreyjasuja/re_s2st/blob/main/covost2_eval.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook reproduces the evaluation results of three models on the CoVoST 2 dataset:

*   [Whisper](https://arxiv.org/pdf/2212.04356.pdf) (Radford et al., 2022)
*   [SeamlessM4T](https://arxiv.org/pdf/2308.11596.pdf) (Barrault et al., 2023)
*   [XLS-R](https://arxiv.org/pdf/2111.09296.pdf) (Babu et al., 2021)


CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages. The dataset is created using Mozilla's open-source Common Voice database of crowdsourced voice recordings. There are 2,900 hours of speech represented in the corpus.

Although most of these models are multi-task models, we focus here on their multilingual translation capabilities.




## Load data from the container to disk

Assuming the audio has already been downloaded from the Common Voice dataset and persisted in a Chameleon container, we can now copy it from the container to local disk.

Let's first list the containers available on our account.

In [11]:
import os
from getpass import getpass
import subprocess

command = ['bash', '-c', 'source CHI-231138-openrc.sh && openstack container list']
password = getpass("Please enter your password: ")  # getpass hides the password while it is typed

proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
stdout, stderr = proc.communicate(input=password)

Please enter your password: ··········


In [13]:
print(stdout)

(sj4020@nyu.edu) Please enter your Chameleon CLI password: 
+-----------------------+
| Name                  |
+-----------------------+
| CoVoST2_data          |
| CoVoST2_data_segments |
+-----------------------+
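
The `subprocess` pattern above can be wrapped in a small helper so that a failed command raises an error instead of silently returning empty output. This is a minimal sketch: the helper names `build_openstack_command` and `run_openstack` are hypothetical, and it assumes, as the cell above does, that the password is read from standard input once the RC script is sourced.

```python
import shlex
import subprocess

def build_openstack_command(rc_file, *args):
    """Build a bash command that sources the RC file, then runs `openstack <args>`."""
    openstack_args = " ".join(shlex.quote(arg) for arg in args)
    return ["bash", "-c", f"source {shlex.quote(rc_file)} && openstack {openstack_args}"]

def run_openstack(rc_file, password, *args):
    """Run the command, feeding the password on stdin; raise if it exits with an error."""
    proc = subprocess.run(
        build_openstack_command(rc_file, *args),
        input=password,
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        raise RuntimeError(proc.stderr.strip())
    return proc.stdout

# Show the command that would be run for listing containers (nothing is executed here)
print(build_openstack_command("CHI-231138-openrc.sh", "container", "list"))
```

With this helper, listing containers would amount to `run_openstack("CHI-231138-openrc.sh", password, "container", "list")`.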



Choose the container where your data persists

In [28]:
container_name="CoVoST2_data"

Let's now download all the data from the container. This should take around 5 minutes.

In [29]:

command = ['bash', '-c', 'source CHI-231138-openrc.sh && openstack container save '+container_name]
password = getpass("Please enter your password: ")  # getpass hides the password while it is typed

proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
stdout, stderr = proc.communicate(input=password)

Please enter your password: ··········


Recall that in the notebook where we downloaded the audio data, we saved compressed files along with a script to extract them later. We now use that script to extract all audio files into the required directory structure. This takes approximately 8-10 minutes.

In [35]:
%time !(cd data && chmod +x extract_and_cleanup.sh && ./extract_and_cleanup.sh) &> /dev/null

CPU times: user 9.56 s, sys: 1.55 s, total: 11.1 s
Wall time: 7min 38s


So far we have only loaded the audio files. We also need the transcriptions and/or translations as ground truth for our evaluation. This reference text is provided through the Hugging Face 🤗 Datasets library [here](https://huggingface.co/datasets/covost2).

Let's load one language, say Catalan, and see what the data looks like. The language code for Catalan is `ca`.

In [42]:
import pandas as pd
from tqdm import tqdm
import sacrebleu
from datasets import load_dataset

In [24]:
data=load_dataset("covost2","ca_en",data_dir="data/ca",split="test",trust_remote_code=True)

Each data point contains the `path` of the audio file we downloaded earlier, an audio `array` provided at a sampling rate of 16,000 Hz, the transcription in the source language as `sentence`, and the English translation as `translation`.

In [36]:
data[0]

{'client_id': '03de40b6ecf87f9e1f42719a857b2fbf3b93179bf443e707870f2dda3e53b621248065d52be4dfa6ec462fe118b76b345c19e14063b840813a369c54aab6e1c6',
 'file': '/home/cc/data/ca/clips/common_voice_ca_19034690.mp3',
 'audio': {'path': '/home/cc/data/ca/clips/common_voice_ca_19034690.mp3',
  'array': array([ 2.32830644e-10, -1.74622983e-10, -3.25962901e-09, ...,
          9.91155393e-04, -7.40018208e-04, -5.23986295e-04]),
  'sampling_rate': 16000},
 'sentence': '"Supervisa l\'emissió de les resolucions de concessió de l\'habitació."',
 'translation': 'Supervises issuance of room concession decisions.',
 'id': 'common_voice_ca_19034690'}

In [38]:
speeches=[]
transcripts=[]

for i in data:
  speeches.append(i['audio']['path'])
  transcripts.append(i['sentence'])

#### Browse data in VizSeq (see also the [VizSeq documentation](https://facebookresearch.github.io/vizseq/))


In [74]:
!pip install vizseq


Installing collected packages: vizseq
Successfully installed vizseq-0.1.15


In [None]:
!pip install sentencepiece

In [None]:
!pip install sacrebleu==2.3.1

## Divide languages into resource categories

When evaluating translation performance, we group the source languages into high-, mid-, and low-resource categories according to how much data is available for each language. This grouping is taken from the XLS-R [paper](https://arxiv.org/pdf/2111.09296.pdf) (Babu et al., 2021).

In [30]:
res_levels=["low_res","mid_res","high_res"]

In [31]:
high_res=['ca','de','fr','es']
mid_res=['zh-CN','fa','it','ru','pt']
low_res=['mn','ta','lv','et','cy','sl','ja','tr','ar','nl','sv-SE','id']
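
The evaluation loops later in this notebook look these lists up by name with `eval`. One alternative, sketched here under the assumption that nothing else depends on the three separate variables, is to keep the grouping in a single dictionary. The name `langs_by_level` is hypothetical.

```python
# Hypothetical alternative to three separate lists: one mapping from level to languages
langs_by_level = {
    "low_res": ['mn', 'ta', 'lv', 'et', 'cy', 'sl', 'ja', 'tr', 'ar', 'nl', 'sv-SE', 'id'],
    "mid_res": ['zh-CN', 'fa', 'it', 'ru', 'pt'],
    "high_res": ['ca', 'de', 'fr', 'es'],
}

# Iterate without eval(): each level name maps directly to its language list
for level, langs in langs_by_level.items():
    print(level, len(langs), langs)

total_languages = sum(len(langs) for langs in langs_by_level.values())
print("total:", total_languages)
```

Looping over `langs_by_level.items()` visits the same languages in the same order as the `eval`-based loops, without evaluating strings as code.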

## Evaluation metrics

We use the BLEU score as our evaluation metric, computed with the sacrebleu library, which is consistent with the methodology cited in the research papers. SeamlessM4T also reports its scores using this library, with *sacrebleu version 2.3.1*.

In [None]:
def evaluate_sacre_bleu(translations,gt_translations):
  #calculate BLEU score
  bleu = sacrebleu.corpus_bleu(translations, [gt_translations])
  return round(bleu.score, 3)
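
To make concrete what the score measures, here is a simplified corpus-level BLEU in plain Python: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is an illustrative sketch only. The function name `simple_corpus_bleu` is hypothetical, and because it tokenizes on whitespace and applies no smoothing, its values will generally differ from sacrebleu's. All scores reported in this notebook come from sacrebleu.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU (0-100) with one reference per hypothesis."""
    matches = [0] * max_n   # clipped n-gram matches, summed over the corpus
    totals = [0] * max_n    # hypothesis n-gram counts, summed over the corpus
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference
            matches[n - 1] += sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
            totals[n - 1] += sum(hyp_ngrams.values())
    if min(matches) == 0:
        return 0.0  # no smoothing: any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)

score = simple_corpus_bleu(["the cat sat on the mat"], ["the cat sat on the red mat"])
print(round(score, 3))
```

The brevity penalty is what keeps a system from scoring well by emitting only a few safe words: a hypothesis shorter than its reference is scaled down even when every n-gram it contains is correct.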

Alternatively, we could have used NLTK's BLEU implementation, in which case the scoring function would look like this.

In [None]:
import nltk
from nltk.translate.bleu_score import corpus_bleu
from nltk.tokenize import word_tokenize
def evaluate_nltk_bleu(translations,gt_translations):
  references = [[word_tokenize(ref)] for ref in gt_translations]
  candidates = [word_tokenize(cand) for cand in translations]
  bleu_score=corpus_bleu(list_of_references=references,hypotheses=candidates)
  return round(bleu_score * 100, 3)






## Load Whisper and do inference on CoVoST2 dataset

Whisper comes in multiple model sizes. Of these, `large-v2`, the largest, tends to perform best, so for the comparative analysis we reproduce the results for the Whisper large-v2 model only.

In [80]:
import whisper
model = whisper.load_model("large-v2")

100%|█████████████████████████████████████| 2.87G/2.87G [01:20<00:00, 38.2MiB/s]


In [None]:
model.cuda()

In [None]:
import numpy as np
print(
    f"Model is {'multilingual' if model.is_multilingual else 'English-only'} "
    f"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters."
)

Model is multilingual and has 1,541,384,960 parameters.


Below is the function that runs inference for a given source language on the X-eng translation task.

The parameters defined under `options` are consistent with the example [notebook](https://github.com/openai/whisper/blob/main/notebooks/Multilingual_ASR.ipynb) shared in Whisper's GitHub repository for multilingual translation.

In [85]:
def whisper_inference(src_lang):
  x_en=load_dataset("covost2",src_lang+"_en",data_dir="data/"+src_lang,split="test",trust_remote_code=True)

  options = dict(language=src_lang.split('-')[0], beam_size=5, best_of=5)
  # transcribe_options = dict(task="transcribe",**options)
  translate_options = dict(task="translate",**options)

  translations = []
  gt_translations = []

  # transcriptions = []
  # gt_transcripts=[]


  for item in tqdm(x_en):
      audio = item['file']

      translation = model.transcribe(audio, **translate_options)["text"]
      translations.append(translation)
      gt_translations.append(item['translation'])

      # transcription = model.transcribe(audio, **transcribe_options)["text"]
      # transcriptions.append(transcription)
      # gt_transcripts.append(item['sentence'])
  return translations, gt_translations
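
CoVoST 2 language codes such as `zh-CN` and `sv-SE` carry a region suffix, whereas the `language` option above is given only the part before the hyphen. The following sketch isolates that step. The helper name `to_whisper_language` is hypothetical.

```python
def to_whisper_language(covost_code):
    """Drop the region suffix, mirroring src_lang.split('-')[0] in whisper_inference."""
    return covost_code.split('-')[0]

for code in ['ca', 'zh-CN', 'sv-SE']:
    print(code, '->', to_whisper_language(code))
```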





In [99]:
import collections
whisper_bleu_score=collections.defaultdict(dict)

In [None]:
import json

file_name = 'whisper_largev2_covost2_score.json'
# Note: the [:1] slices restrict this run to the first language of the first
# resource level; remove them to evaluate every language.
for i in res_levels[:1]:
  for src in eval(i)[:1]:
    translations, ground_truth=whisper_inference(src)
    whisper_bleu_score[i][src]=evaluate_sacre_bleu(translations=translations,gt_translations=ground_truth)
  with open(file_name, 'w') as f:
    json.dump(whisper_bleu_score, f, indent=4)

In [97]:
#clear GPU memory
import torch
del model
torch.cuda.empty_cache()

## Load XLS-R (2B) model and do inference

We use the Hugging Face 🤗 Transformers implementation of the XLS-R (2B) model.

We use the `wav2vec2-xls-r-2b-21-to-en` model, an encoder-decoder model that has been fine-tuned to support the languages in the CoVoST2 X-eng translation task. Details can be found [here](https://huggingface.co/facebook/wav2vec2-xls-r-2b-21-to-en).

**Note**: Please be aware that the reference inference code given on Hugging Face doesn't work; please use the implementation below.

In [93]:
import torch
from transformers import SpeechEncoderDecoderModel,MBart50Tokenizer
from datasets import load_dataset
#loading the MBart50Tokenizer as decoder is MBart50 transformer model
tokenizer = MBart50Tokenizer.from_pretrained("facebook/mbart-large-50")

In [94]:
from transformers import Wav2Vec2FeatureExtractor
# Load the feature extractor settings that ship with the checkpoint
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-2b-21-to-en")

In [95]:
import warnings

# Suppress UserWarnings
warnings.filterwarnings("ignore", category=UserWarning)

Use the `pipeline` function to put together the tokenizer, the feature extractor, and the actual model.

In [None]:
from transformers import pipeline
asr=pipeline(model="facebook/wav2vec2-xls-r-2b-21-to-en",tokenizer=tokenizer,feature_extractor=feature_extractor,device=0)

In [100]:
def xlsr_inference(src_lang):
  x_en=load_dataset("covost2",src_lang+"_en",data_dir="data/"+src_lang,split="test",trust_remote_code=True)

  translations = []
  gt_translations = []

  # transcriptions = []
  # gt_transcripts=[]


  for item in tqdm(x_en):
      audio = item['file']

      translation = asr(audio)["text"]
      translations.append(translation)
      gt_translations.append(item['translation'])

      # transcription = model.transcribe(audio, **transcribe_options)["text"]
      # transcriptions.append(transcription)
      # gt_transcripts.append(item['sentence'])
  return translations, gt_translations

In [102]:
xlsr_bleu_score=collections.defaultdict(dict)

In [None]:
import json
from tqdm import tqdm
file_name = 'xls_r_covost2_score.json'
for i in res_levels:
  for src in eval(i):
    translations, ground_truth=xlsr_inference(src)
    xlsr_bleu_score[i][src]=evaluate_sacre_bleu(translations=translations,gt_translations=ground_truth)
    with open(file_name, 'w') as f:
      json.dump(xlsr_bleu_score, f, indent=4)
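
Because the scores are written to disk after every language, an interrupted run can be resumed rather than restarted. A minimal sketch, assuming the JSON layout written above (resource level, then language, then score). The helper names `load_scores` and `pending_languages` are hypothetical.

```python
import collections
import json
import os

def load_scores(path):
    """Load previously saved scores, or start empty if the file does not exist."""
    scores = collections.defaultdict(dict)
    if os.path.exists(path):
        with open(path) as f:
            scores.update(json.load(f))
    return scores

def pending_languages(scores, langs_by_level):
    """List (level, language) pairs that do not have a saved score yet."""
    return [
        (level, lang)
        for level, langs in langs_by_level.items()
        for lang in langs
        if lang not in scores.get(level, {})
    ]
```

An evaluation loop could then iterate over `pending_languages(load_scores(file_name), ...)` instead of over every language, so that completed languages are skipped.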

In [None]:
xlsr_bleu_score

defaultdict(dict,
            {'low_res': {'mn': 1.877,
              'ta': 0.613,
              'lv': 20.774,
              'et': 11.186,
              'cy': 14.671,
              'sl': 19.117,
              'ja': 4.102,
              'tr': 16.774,
              'ar': 16.991,
              'nl': 31.883,
              'sv-SE': 30.987,
              'id': 16.255},
             'mid_res': {'zh-CN': 9.475,
              'fa': 13.073,
              'it': 35.034,
              'ru': 39.44,
              'pt': 42.012},
             'high_res': {'ca': 33.813,
              'de': 33.486,
              'fr': 37.614,
              'es': 39.166}})

## Load SeamlessM4T (large) model and do inference on CoVoST2 dataset

In [None]:
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-large")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-large")



In [None]:
model.cuda()

In [None]:
def seamless_inference(src_lang):
  x_en=load_dataset("covost2",src_lang+"_en",data_dir="data/"+src_lang,split="test",trust_remote_code=True)

  translations = []
  gt_translations = []

  # transcriptions = []
  # gt_transcripts=[]


  for item in tqdm(x_en):
      audio_sample = item['audio']
      audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt",sampling_rate=16000)
      audio_inputs = {k: v.to('cuda') for k, v in audio_inputs.items()}

      output_tokens = model.generate(**audio_inputs, tgt_lang="eng",generate_speech=False)
      translation=processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
      translations.append(translation)
      gt_translations.append(item['translation'])

      # transcription = model.transcribe(audio, **transcribe_options)["text"]
      # transcriptions.append(transcription)
      # gt_transcripts.append(item['sentence'])
  return translations, gt_translations

In [None]:
import sacrebleu
import collections
seamless_bleu_score=collections.defaultdict(dict)

In [None]:
import json
from tqdm import tqdm
file_name = 'seamless_large_sacre_bleu.json'
for i in res_levels:
  for src in eval(i):
    translations, ground_truth=seamless_inference(src)
    seamless_bleu_score[i][src]=evaluate_sacre_bleu(translations=translations,gt_translations=ground_truth)
    with open(file_name, 'w') as f:
      json.dump(seamless_bleu_score, f, indent=4)

100%|██████████| 1759/1759 [12:11<00:00,  2.41it/s]
100%|██████████| 786/786 [03:29<00:00,  3.75it/s]
100%|██████████| 1629/1629 [06:33<00:00,  4.14it/s]
100%|██████████| 1571/1571 [13:28<00:00,  1.94it/s]
100%|██████████| 690/690 [03:42<00:00,  3.10it/s]
100%|██████████| 360/360 [01:46<00:00,  3.38it/s]
100%|██████████| 684/684 [03:21<00:00,  3.40it/s]
100%|██████████| 1629/1629 [08:26<00:00,  3.21it/s]
100%|██████████| 1695/1695 [07:31<00:00,  3.76it/s]
100%|██████████| 1699/1699 [08:25<00:00,  3.36it/s]
100%|██████████| 1595/1595 [06:42<00:00,  3.97it/s]
100%|██████████| 844/844 [03:24<00:00,  4.12it/s]
100%|██████████| 4898/4898 [32:36<00:00,  2.50it/s]
100%|██████████| 3445/3445 [17:31<00:00,  3.28it/s]
100%|██████████| 8951/8951 [56:18<00:00,  2.65it/s]
100%|██████████| 6300/6300 [41:09<00:00,  2.55it/s]
100%|██████████| 4023/4023 [20:34<00:00,  3.26it/s]
100%|██████████| 12730/12730 [1:18:42<00:00,  2.70it/s]
 50%|█████     | 6766/13511 [40:45<37:00,  3.04it/s]

In [None]:
seamless_bleu_score

defaultdict(dict,
            {'low_res': {'mn': 7.378,
              'ta': 3.931,
              'lv': 26.902,
              'et': 26.298,
              'cy': 55.276,
              'sl': 38.32,
              'ja': 19.69,
              'tr': 30.647,
              'ar': 45.732,
              'nl': 40.112,
              'sv-SE': 37.972,
              'id': 50.455},
             'mid_res': {'zh-CN': 19.911,
              'fa': 25.213,
              'it': 38.805,
              'ru': 47.881,
              'pt': 49.055},
             'high_res': {'ca': 37.969,
              'de': 38.009,
              'fr': 40.724,
              'es': 40.639}})

In [None]:
s=0
for i in res_levels:
  for k in eval(i):
    s+=seamless_bleu_score[i][k]

In [None]:
s/21

31.29719047619048

In [None]:
s/21

34.32947619047619
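
The average can also be computed without hard-coding the number of languages, and broken down per resource level. A short sketch: the helper name `mean_bleu` is hypothetical, and the example values are the high-resource SeamlessM4T scores listed above.

```python
def mean_bleu(scores_by_level):
    """Return (overall mean, per-level means) for a {level: {language: BLEU}} mapping."""
    per_level = {
        level: sum(scores.values()) / len(scores)
        for level, scores in scores_by_level.items()
    }
    all_scores = [s for scores in scores_by_level.values() for s in scores.values()]
    return sum(all_scores) / len(all_scores), per_level

example = {'high_res': {'ca': 37.969, 'de': 38.009, 'fr': 40.724, 'es': 40.639}}
overall, per_level = mean_bleu(example)
print(round(overall, 3), per_level)
```

Passing the full `seamless_bleu_score` dictionary would give the average over all 21 languages together with one average per resource level.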

## Evaluate SeamlessM4T on the FLEURS dataset

In [None]:
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-large")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-large")

model.cuda()

In [None]:
from pycountry import languages

def bcp47_to_iso639_3(bcp47_code):
    # Keep only the language part of the config name (the text before the first underscore)
    lang_code = bcp47_code.split('_')[0]
    try:
        # A failed lookup returns None, so accessing .alpha_3 raises AttributeError
        return languages.get(alpha_2=lang_code).alpha_3.lower()
    except (AttributeError, KeyError) as e:
        # If the mapping fails, return the original code
        print(e, lang_code)
        return lang_code.lower()

In [None]:
from datasets import get_dataset_config_names
bcp_47_codes=get_dataset_config_names("google/fleurs",trust_remote_code=True)

In [None]:
lang_dict={}
for i in bcp_47_codes:
  lang_dict[bcp47_to_iso639_3(i)]=i

'NoneType' object has no attribute 'alpha_3' ast
'NoneType' object has no attribute 'alpha_3' ceb
'NoneType' object has no attribute 'alpha_3' ckb
'NoneType' object has no attribute 'alpha_3' cmn
'NoneType' object has no attribute 'alpha_3' fil
'NoneType' object has no attribute 'alpha_3' kam
'NoneType' object has no attribute 'alpha_3' kea
'NoneType' object has no attribute 'alpha_3' luo
'NoneType' object has no attribute 'alpha_3' nso
'NoneType' object has no attribute 'alpha_3' umb
'NoneType' object has no attribute 'alpha_3' yue
'NoneType' object has no attribute 'alpha_3' all
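
The messages above come from config names whose first part is already a three-letter code (for example `ceb`) or is not a language at all (`all`), so the two-letter lookup finds nothing and the function falls back to the original code. The sketch below reproduces that fallback with a tiny hand-written table standing in for pycountry. The table `ALPHA2_TO_ALPHA3`, the function name `to_iso639_3`, and the config names used in the loop are illustrative assumptions.

```python
# Tiny stand-in for pycountry's alpha_2 -> alpha_3 lookup (illustrative subset only)
ALPHA2_TO_ALPHA3 = {"uk": "ukr", "ca": "cat", "de": "deu"}

def to_iso639_3(config_name):
    """Map the language part of a config name to ISO 639-3, falling back to the part itself."""
    lang_code = config_name.split('_')[0].lower()
    return ALPHA2_TO_ALPHA3.get(lang_code, lang_code)

# Assumed example config names: one two-letter code, one three-letter code, one non-language
for name in ["uk_ua", "ceb_ph", "all"]:
    print(name, '->', to_iso639_3(name))
```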


In [None]:
!wget https://dl.fbaipublicfiles.com/seamless/metrics/evaluation_data_ids.zip

--2024-03-28 06:01:33--  https://dl.fbaipublicfiles.com/seamless/metrics/evaluation_data_ids.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.162.163.19, 3.162.163.11, 3.162.163.51, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.162.163.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6096377 (5.8M) [application/zip]
Saving to: ‘evaluation_data_ids.zip’


2024-03-28 06:01:33 (79.2 MB/s) - ‘evaluation_data_ids.zip’ saved [6096377/6096377]



In [None]:
!sudo apt-get install unzip
!unzip evaluation_data_ids.zip && rm evaluation_data_ids.zip

In [None]:
old_codes=['msa','fil','uzb','fas','nep','lav','ara','aze','pus','ori','mon','swa','orm']
new_codes=['zlm','tgl','uzn','pes','npi','lvs','arb','azj','pbt','ory','khk','swh','gaz']
for i in range(len(new_codes)):
  lang_dict[new_codes[i]]=lang_dict[old_codes[i]]
  del lang_dict[old_codes[i]]
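
The same renaming can be written with an explicit old-to-new mapping, which keeps each pair visibly together and leaves the input dictionary untouched. A sketch on a toy dictionary: `rename_keys` is a hypothetical helper and the toy values are placeholders, not actual FLEURS config names.

```python
def rename_keys(mapping, renames):
    """Return a copy of `mapping` with keys renamed according to `renames` (old -> new)."""
    return {renames.get(key, key): value for key, value in mapping.items()}

old_codes = ['lav', 'mon']
new_codes = ['lvs', 'khk']
toy = {'lav': 'config_a', 'mon': 'config_b', 'ukr': 'config_c'}  # placeholder values
print(rename_keys(toy, dict(zip(old_codes, new_codes))))
```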

In [None]:
import os
base_path="evaluation_data_ids/s2tt_fleurs_ids/"
x_eng_files = [file for file in os.listdir("evaluation_data_ids/s2tt_fleurs_ids/") if file.endswith('-eng.ids')]
print(len(x_eng_files))

101


In [None]:
import collections
generated_translations=collections.defaultdict(dict)

In [None]:
from datasets import load_dataset
src_lang_data=load_dataset("google/fleurs",name=lang_dict['ukr'],split="test",streaming=True)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [None]:
print(next(iter(src_lang_data)))

{'id': 1982, 'num_samples': 118080, 'path': None, 'audio': {'path': 'test/10021730821550109934.wav', 'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
       -1.80006027e-05, -2.40206718e-05, -9.59634781e-05]), 'sampling_rate': 16000}, 'transcription': 'жінки усім подорожнім жінкам радять казати що вони заміжні незалежно від справжнього сімейного стану', 'raw_transcription': 'Жінки: усім подорожнім жінкам радять казати, що вони заміжні, незалежно від справжнього сімейного стану.', 'gender': 0, 'lang_id': 92, 'language': 'Ukrainian', 'lang_group_id': 1}


In [None]:
from tqdm import tqdm

In [None]:
lang_issue=[]
lang_missing_ids=[]

In [None]:
import torch
import gc

for file_name in  x_eng_files:
  lang_code = file_name.split("-")[0].split("_")[-1]
  with open(base_path+file_name) as f:
    ids=f.read().split()

  if (lang_code in generated_translations.keys()) and (len(ids)==len(generated_translations[lang_code])) :
    print("Done")
    continue

  try:
    src_lang_data=load_dataset("google/fleurs",name=lang_dict[lang_code],split="test",streaming=True,trust_remote_code=True)
  except:
    lang_issue.append(lang_code)
    print("\n Missing language ",lang_code)
    continue



  for item in tqdm(src_lang_data,total=len(ids)):
    audio_sample = item['audio']
    id=audio_sample['path'].split("/")[-1].split(".")[0]

    if str(item['id'])+'_'+str(id) in generated_translations[lang_code]:
      continue

    if id not in ids:
      continue


    translation = None  # stays None if inference fails, so no stale value is stored

    try:
        # Initially, try to process the audio on the GPU
        audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt", sampling_rate=16000)
        audio_inputs = {k: v.to('cuda') for k, v in audio_inputs.items()}
        with torch.no_grad():
            output_tokens = model.generate(**audio_inputs, tgt_lang="eng", generate_speech=False)
        translation = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
    except RuntimeError as e:
        if "CUDA out of memory" in str(e):
            print("\nCUDA out of memory. Shifting inference to CPU for ID:", id)
            torch.cuda.empty_cache()  # Clear any unreleased memory

            # Move the model to CPU for this inference
            model.to('cpu')

            try:
                # Make sure audio_inputs are on the CPU as well
                audio_inputs = {k: v.to('cpu') for k, v in audio_inputs.items()}
                with torch.no_grad():
                    output_tokens = model.generate(**audio_inputs, tgt_lang="eng", generate_speech=False)
                translation = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
            except Exception as cpu_e:
                print("\nFailed processing on CPU for ID:", id, "with error:", cpu_e)
                lang_missing_ids.append((lang_code, id))
            finally:
                # Regardless of the outcome, put the model back on the GPU for subsequent operations
                model.to('cuda')
        else:
            print("\nAn error occurred for ID:", id, "Error:", e)
            lang_missing_ids.append((lang_code, id))
    except Exception as e:
        print("\nAn unexpected error occurred for ID:", id, "Error:", e)
        lang_missing_ids.append((lang_code, id))

    # Drop references to this item's tensors; either name may be unbound if inference failed early
    audio_inputs = output_tokens = None

    # Store the translation only if inference succeeded
    if translation is not None:
        generated_translations[lang_code][str(item['id'])+'_'+str(id)]=translation
  torch.cuda.empty_cache()

  with open ('generated_large.json','w')as f:
    json.dump(generated_translations,f,indent=2)






100%|██████████| 750/750 [09:11<00:00,  1.36it/s]
100%|██████████| 908/908 [11:45<00:00,  1.29it/s]
100%|██████████| 926/926 [12:11<00:00,  1.27it/s]
100%|██████████| 728/728 [08:41<00:00,  1.40it/s]
100%|██████████| 749/749 [08:40<00:00,  1.44it/s]
100%|██████████| 1015/1015 [12:48<00:00,  1.32it/s]
264it [03:08,  1.40it/s]
100%|██████████| 1041/1041 [15:38<00:00,  1.11it/s]
100%|██████████| 918/918 [11:43<00:00,  1.31it/s]
100%|██████████| 357/357 [04:26<00:00,  1.34it/s]
100%|██████████| 1021/1021 [13:00<00:00,  1.31it/s]
100%|██████████| 880/880 [11:24<00:00,  1.29it/s]
100%|██████████| 946/946 [11:01<00:00,  1.43it/s]
100%|██████████| 977/977 [11:28<00:00,  1.42it/s]
100%|██████████| 687/687 [08:17<00:00,  1.38it/s]
100%|██████████| 857/857 [10:27<00:00,  1.37it/s]
100%|██████████| 660/660 [11:18<00:00,  1.03s/it]
100%|██████████| 621/621 [09:49<00:00,  1.05it/s]
100%|██████████| 792/792 [09:20<00:00,  1.41it/s]
100%|██████████| 925/925 [15:11<00:00,  1.01it/s]
100%|██████████| 10


CUDA out of memory. Shifting inference to CPU for ID: 3189556219205510204


 82%|████████▏ | 312/379 [05:56<00:58,  1.14it/s]


CUDA out of memory. Shifting inference to CPU for ID: 6634898757415929965


100%|██████████| 379/379 [08:08<00:00,  1.29s/it]
100%|██████████| 905/905 [11:21<00:00,  1.33it/s]
100%|██████████| 964/964 [13:15<00:00,  1.21it/s]
100%|██████████| 919/919 [11:39<00:00,  1.31it/s]
100%|██████████| 920/920 [11:09<00:00,  1.38it/s]
100%|██████████| 831/831 [10:39<00:00,  1.30it/s]
100%|██████████| 771/771 [10:06<00:00,  1.27it/s]
100%|██████████| 364/364 [04:10<00:00,  1.45it/s]
100%|██████████| 979/979 [12:03<00:00,  1.35it/s]
100%|██████████| 854/854 [12:44<00:00,  1.12it/s]
100%|██████████| 980/980 [12:53<00:00,  1.27it/s]
100%|██████████| 1019/1019 [13:19<00:00,  1.27it/s]
100%|██████████| 371/371 [05:31<00:00,  1.12it/s]
100%|██████████| 723/723 [09:26<00:00,  1.28it/s]
100%|██████████| 1021/1021 [13:15<00:00,  1.28it/s]
100%|██████████| 862/862 [10:54<00:00,  1.32it/s]
100%|██████████| 299/299 [03:40<00:00,  1.36it/s]
100%|██████████| 871/871 [11:20<00:00,  1.28it/s]
100%|██████████| 743/743 [09:15<00:00,  1.34it/s]
100%|██████████| 1000/1000 [11:28<00:00,  1.45


CUDA out of memory. Shifting inference to CPU for ID: 12560373056138365189


100%|██████████| 405/405 [07:16<00:00,  1.08s/it]
100%|██████████| 984/984 [11:58<00:00,  1.37it/s]
100%|██████████| 761/761 [10:01<00:00,  1.26it/s]
100%|██████████| 759/759 [09:22<00:00,  1.35it/s]
100%|██████████| 512/512 [06:43<00:00,  1.27it/s]
100%|██████████| 998/998 [14:30<00:00,  1.15it/s]
100%|██████████| 478/478 [08:00<00:00,  1.00s/it]
100%|██████████| 865/865 [11:54<00:00,  1.21it/s]
100%|██████████| 973/973 [12:13<00:00,  1.33it/s]
100%|██████████| 382/382 [04:34<00:00,  1.39it/s]
100%|██████████| 723/723 [08:57<00:00,  1.35it/s]
100%|██████████| 932/932 [11:28<00:00,  1.35it/s]
100%|██████████| 927/927 [11:11<00:00,  1.38it/s]
100%|██████████| 676/676 [08:19<00:00,  1.35it/s]
100%|██████████| 883/883 [10:34<00:00,  1.39it/s]
100%|██████████| 827/827 [13:04<00:00,  1.05it/s]
100%|██████████| 949/949 [11:16<00:00,  1.40it/s]
100%|██████████| 46/46 [00:39<00:00,  1.16it/s]
100%|██████████| 541/541 [07:16<00:00,  1.24it/s]
100%|██████████| 914/914 [11:13<00:00,  1.36it/s]
10

In [None]:
len(generated_translations)

101

In [None]:
import json
from collections import defaultdict

with open('generated_large.json') as f:
    generated_translations = json.load(f)

# Convert to defaultdict with empty dictionaries as default values
generated_translations = defaultdict(dict, generated_translations)

In [None]:
eng_data=load_dataset('google/fleurs',name='en_us')

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [None]:
eng_translation={}

In [None]:
for split in eng_data:
  for item in tqdm(eng_data[split]):
      audio_sample = item['audio']
      # id=audio_sample['path'].split("/")[-1].split(".")[0]
      eng_translation[item['id']]=item['raw_transcription']

100%|██████████| 2602/2602 [00:03<00:00, 765.20it/s]
100%|██████████| 394/394 [00:00<00:00, 819.57it/s]
100%|██████████| 647/647 [00:00<00:00, 800.07it/s]


In [None]:
seamless_fleurs_bleu={}

In [None]:
lang_missing_ids

[]

In [None]:
import sacrebleu
for lang_code in list(generated_translations.keys()):

  translations=[]
  gt_translations=[]

  for i in generated_translations[lang_code]:
      key=int(i.split('_')[0])
      gt_translations.append(eng_translation[key])
      translations.append(generated_translations[lang_code][i])

  bleu = sacrebleu.corpus_bleu(translations, [gt_translations])
  seamless_fleurs_bleu[lang_code]=round(bleu.score, 3)
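
The keys of `generated_translations` have the form `<item id>_<audio file stem>`, and the scoring loop above recovers the integer item id from each key to look up the English reference. A small sketch of that round trip, using the ids from the Ukrainian sample printed earlier. The helper names `make_key` and `item_id_from_key` are hypothetical.

```python
def make_key(item_id, audio_path):
    """Build the '<item id>_<file stem>' key used for generated_translations."""
    stem = audio_path.split("/")[-1].split(".")[0]
    return f"{item_id}_{stem}"

def item_id_from_key(key):
    """Recover the integer item id, which indexes the English references."""
    return int(key.split('_')[0])

key = make_key(1982, 'test/10021730821550109934.wav')
print(key, item_id_from_key(key))
```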

In [None]:
dict(sorted(seamless_fleurs_bleu.items()))

{'afr': 39.69,
 'amh': 17.034,
 'arb': 31.725,
 'asm': 17.47,
 'ast': 25.894,
 'azj': 16.425,
 'bel': 16.056,
 'ben': 22.778,
 'bos': 32.891,
 'bul': 31.33,
 'cat': 37.574,
 'ceb': 7.723,
 'ces': 31.016,
 'ckb': 20.487,
 'cmn': 18.98,
 'cym': 30.22,
 'dan': 33.553,
 'deu': 35.469,
 'ell': 24.804,
 'est': 28.534,
 'fin': 25.782,
 'fra': 32.641,
 'ful': 0.788,
 'gaz': 0.317,
 'gle': 10.654,
 'glg': 32.033,
 'guj': 27.164,
 'hau': 0.544,
 'heb': 28.226,
 'hin': 25.194,
 'hrv': 29.8,
 'hun': 24.166,
 'hye': 27.81,
 'ibo': 1.27,
 'ind': 28.81,
 'isl': 22.854,
 'ita': 25.307,
 'jav': 19.459,
 'jpn': 15.886,
 'kam': 1.803,
 'kan': 21.799,
 'kat': 18.741,
 'kaz': 21.338,
 'kea': 27.313,
 'khk': 16.258,
 'khm': 18.62,
 'kir': 16.771,
 'kor': 18.402,
 'lao': 19.088,
 'lin': 0.917,
 'lit': 20.675,
 'ltz': 14.429,
 'lug': 16.179,
 'luo': 0.789,
 'lvs': 27.666,
 'mal': 20.99,
 'mar': 21.372,
 'mkd': 33.972,
 'mlt': 38.23,
 'mri': 0.99,
 'mya': 14.676,
 'nld': 26.502,
 'nob': 33.007,
 'npi': 23.518,

In [None]:
with open('seamless_large_fleurs_bleu.json','w')as f:
  json.dump(seamless_fleurs_bleu,f)

In [None]:
import datasets
print(datasets.__version__)

2.17.1
