## Evaluation on FLEURS dataset

<a href="https://colab.research.google.com/github/shreyjasuja/re_s2st/blob/main/fleurs_eval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Before running this notebook, make sure you have run notebook 1 (`initiate_server.ipynb`), so that a GPU server is available for inference over the FLEURS dataset.**

In this notebook, we evaluate the performance of various multilingual multitask models on the FLEURS dataset. FLEURS is the speech version of the FLoRes machine translation benchmark. It uses 2009 n-way parallel sentences from the publicly available FLoRes dev and devtest sets, in 102 languages. The dataset is available at [Hugging Face Datasets](https://huggingface.co/datasets/google/fleurs).

We evaluate the performance of the following models on the FLEURS dataset:

*   [Whisper](https://arxiv.org/pdf/2212.04356.pdf) (Radford et al., 2022)

*   [SeamlessM4T](https://arxiv.org/pdf/2308.11596.pdf) (Barrault et al., 2023)

*   Cascaded pipeline of Whisper(ASR) and [NLLB-1.3B (MT)](https://arxiv.org/ftp/arxiv/papers/2207/2207.04672.pdf) (MR Costa-jussà et al., 2022)


An important caveat is that not every language in the FLEURS dataset is supported by every model. Whisper supports only 82 of the languages (including English), while Seamless and NLLB support all the languages in the dataset. When evaluating Whisper on FLEURS we therefore consider only the 81 non-English languages it supports, although during inference each model was run on every language it supports.

The notebook also refers to AudioPaLM ([Rubenstein et al., 2023](https://arxiv.org/pdf/2306.12925.pdf)), another model that supports all the languages in the FLEURS dataset. We do not evaluate AudioPaLM here, because the model is proprietary and its weights are not available for download.

In [3]:
from datasets import load_dataset
from tqdm import tqdm
import sacrebleu
import json

## Downloading the dataset

#### Language code mapping

We use the FLEURS dataset from Hugging Face 🤗 Datasets. Its per-language configurations are named with BCP-47-style codes, whereas the Seamless paper uses ISO 639-3 language codes as its standard, so we had to build a mapping between the two. We used the `pycountry` library to resolve the mapping.

In [4]:
from datasets import get_dataset_config_names
bcp_47_codes=get_dataset_config_names("google/fleurs",trust_remote_code=True)

Downloading builder script:   0%|          | 0.00/12.6k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/13.3k [00:00<?, ?B/s]

In [24]:
from pycountry import languages
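def bcp47_to_iso639_3(bcp47_code):
    """Map a FLEURS config name such as 'uk_ua' to an ISO 639-3 code."""
    # The language subtag is the part before the first underscore
    lang_code = bcp47_code.split('_')[0]
    try:
        # Look up the two-letter subtag and return its three-letter equivalent
        lang = languages.get(alpha_2=lang_code).alpha_3
        return lang.lower()
    except (AttributeError, KeyError):
        # If the mapping fails, return the original code
        print("Failed to map", lang_code, " from bcp47 to iso639-3")
        return lang_code.lower()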

In [25]:
lang_dict={}
for i in bcp_47_codes:
  lang_dict[bcp47_to_iso639_3(i)]=i

Failed to map ast  from bcp47 to iso639-3
Failed to map ceb  from bcp47 to iso639-3
Failed to map ckb  from bcp47 to iso639-3
Failed to map cmn  from bcp47 to iso639-3
Failed to map fil  from bcp47 to iso639-3
Failed to map kam  from bcp47 to iso639-3
Failed to map kea  from bcp47 to iso639-3
Failed to map luo  from bcp47 to iso639-3
Failed to map nso  from bcp47 to iso639-3
Failed to map umb  from bcp47 to iso639-3
Failed to map yue  from bcp47 to iso639-3
Failed to map all  from bcp47 to iso639-3
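Most of the failures above are harmless: for those configurations the first subtag is already a three-letter code (for example `ast`, `ceb`, `cmn`), so returning it unchanged still gives a usable ISO 639-3 key. The exception is `all`, which appears to be a dataset-wide configuration rather than a language. Below is a minimal sketch of a mapper that makes both cases explicit. It assumes a tiny hand-written two-letter table in place of `pycountry`; the table, the `NON_LANGUAGE_CONFIGS` set, and the function name `config_to_iso639_3` are illustrative and not part of the notebook.

```python
# Illustrative subset of ISO 639-1 -> ISO 639-3; the notebook uses pycountry instead.
ALPHA2_TO_ALPHA3 = {"en": "eng", "uk": "ukr", "hi": "hin"}

# Config names that do not denote a single language (assumption based on the log above).
NON_LANGUAGE_CONFIGS = {"all"}


def config_to_iso639_3(config_name):
    """Return an ISO 639-3 code for a FLEURS config name, or None if unmapped."""
    if config_name in NON_LANGUAGE_CONFIGS:
        return None
    subtag = config_name.split("_")[0].lower()
    if len(subtag) == 3:
        # Already a three-letter code such as "ast" or "cmn": keep it as is.
        return subtag
    return ALPHA2_TO_ALPHA3.get(subtag)


lang_map = {}
for name in ["en_us", "uk_ua", "ast_es", "cmn_hans_cn", "all"]:
    code = config_to_iso639_3(name)
    if code is not None:
        lang_map[code] = name

print(lang_map)
```

With this variant, `all` never ends up as a key in the mapping, and three-letter subtags pass through without a warning.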


The Seamless team provides the exact ids of the input speech utterances that they used for evaluation. We use these ids to evaluate their own model, Whisper, and NLLB on the FLEURS dataset. The link is on their 🤗 [model card](https://huggingface.co/facebook/seamless-m4t-large) under the metrics section.

In [19]:
!wget https://dl.fbaipublicfiles.com/seamless/metrics/evaluation_data_ids.zip -O evaluation_data_ids.zip

--2024-05-07 03:54:38--  https://dl.fbaipublicfiles.com/seamless/metrics/evaluation_data_ids.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 18.161.170.54, 18.161.170.51, 18.161.170.13, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|18.161.170.54|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6096377 (5.8M) [application/zip]
Saving to: ‘evaluation_data_ids.zip’


2024-05-07 03:54:39 (24.6 MB/s) - ‘evaluation_data_ids.zip’ saved [6096377/6096377]



In [22]:
!(unzip evaluation_data_ids.zip && rm evaluation_data_ids.zip )> /dev/null #unzip the evaluation data

**⛔️ Caution:** The ISO 639-3 code returned by the `pycountry` library is not always the one used in the Seamless paper, so we had to map some of the languages to the correct ISO 639-3 code manually. The mapping is provided in the code below.

We noticed one more oddity in the Seamless paper: although Table 5 lists ISO 639-3 codes for the languages supported by the model, the evaluation ids sometimes use a different code. We had to map these codes manually as well. An example is Norwegian, which is `nob` and `nno` in the paper but `nor` in the evaluation ids.

In [26]:
old_codes=['msa','fil','uzb','fas','nep','lav','ara','aze','pus','ori','mon','swa','orm']
new_codes=['zlm','tgl','uzn','pes','npi','lvs','arb','azj','pbt','ory','khk','swh','gaz']
for i in range(len(new_codes)):
  lang_dict[new_codes[i]]=lang_dict[old_codes[i]]
  del lang_dict[old_codes[i]]

Restricting only to X→En directions

In [27]:
import os
base_path="evaluation_data_ids/s2tt_fleurs_ids/"
x_eng_files = [file for file in os.listdir(base_path) if file.endswith('-eng.ids')]
print(len(x_eng_files))

101
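Because the id files and the FLEURS configurations use slightly different codes, it can help to check up front which source languages have no entry in the mapping. The sketch below parses the source code from each file name in the same way as the inference loops later in this notebook and reports the codes that are missing. The file names and the toy mapping are hypothetical and only illustrate the shape of the data.

```python
def source_lang(file_name):
    """Extract the source language code from a '<prefix>_<src>-eng.ids' style name."""
    return file_name.split("-")[0].split("_")[-1]


def unmapped_languages(file_names, lang_dict):
    """Return source codes that have no entry in the ISO 639-3 -> FLEURS config mapping."""
    return sorted({source_lang(f) for f in file_names} - set(lang_dict))


# Hypothetical file names and a toy mapping, for illustration only.
example_files = ["test_ukr-eng.ids", "test_nor-eng.ids", "test_arb-eng.ids"]
example_lang_dict = {"ukr": "uk_ua", "arb": "ar_eg", "nob": "nb_no"}

print(unmapped_languages(example_files, example_lang_dict))
```

Any code reported here would otherwise only surface later as a "Missing language" message during inference.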


#### A sneak peek into the dataset

**❗️ Note:** We use the FLEURS dataset in streaming mode, because we only evaluate the models on the test split. Since the dataset is huge, we also tried downloading just the test split for all languages at once, using the `split` parameter and also the `data_files` parameter as described in the [documentation](https://huggingface.co/docs/datasets/loading), but neither worked: the whole dataset was still downloaded.

So we ultimately resorted to streaming mode.

In [35]:
src_lang_data=load_dataset("google/fleurs",name=lang_dict['ukr'],split="test",streaming=True)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Below is a sample record from the dataset. The id from the evaluation benchmark corresponds to the file name in the `path` entry of the `audio` field (here `10021730821550109934`). Do not confuse it with the `id` field of the record.

```python
{
    'id': 1982,
    'num_samples': 118080,
    'path': None,
    'audio': {
        'path': 'test/10021730821550109934.wav',
        'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
       -1.80006027e-05, -2.40206718e-05, -9.59634781e-05]),
       'sampling_rate': 16000},
    'transcription': 'жінки усім подорожнім жінкам радять казати що вони заміжні незалежно від справжнього сімейного стану', 'raw_transcription': 'Жінки: усім подорожнім жінкам радять казати, що вони заміжні, незалежно від справжнього сімейного стану.',
    'gender': 0,
    'lang_id': 92,
    'language': 'Ukrainian',
    'lang_group_id': 1
    }

```
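Since the benchmark ids are file names rather than record ids, selecting the benchmark utterances from a stream comes down to extracting the file name from `audio.path` and testing membership. The sketch below shows one way to do this, with an optional early stop once every wanted id has been seen. It assumes file name ids are unique within a split; the helper names and the second toy record are made up for illustration.

```python
def filename_id(record):
    """Return the audio file name without directory or extension."""
    return record["audio"]["path"].split("/")[-1].split(".")[0]


def select_benchmark_records(stream, wanted_ids):
    """Yield records whose file name id is in wanted_ids, stopping once all are found."""
    remaining = set(wanted_ids)
    for record in stream:
        if not remaining:
            break
        fid = filename_id(record)
        if fid in remaining:
            remaining.discard(fid)
            yield record


# Toy stream shaped like the sample above; the second record is made up.
toy_stream = [
    {"id": 1982, "audio": {"path": "test/10021730821550109934.wav"}},
    {"id": 7, "audio": {"path": "test/123.wav"}},
]
selected = list(select_benchmark_records(toy_stream, ["10021730821550109934"]))
print([r["id"] for r in selected])
```

Using a set for the wanted ids keeps each membership test cheap, which matters when a split has many records.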


In [64]:
itr=iter(src_lang_data)
record=next(itr)
print(record)

{'id': 1982, 'num_samples': 118080, 'path': None, 'audio': {'path': 'test/10021730821550109934.wav', 'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
       -1.80006027e-05, -2.40206718e-05, -9.59634781e-05]), 'sampling_rate': 16000}, 'transcription': 'жінки усім подорожнім жінкам радять казати що вони заміжні незалежно від справжнього сімейного стану', 'raw_transcription': 'Жінки: усім подорожнім жінкам радять казати, що вони заміжні, незалежно від справжнього сімейного стану.', 'gender': 0, 'lang_id': 92, 'language': 'Ukrainian', 'lang_group_id': 1}


Let's listen to the audio.

In [43]:
from IPython.display import Audio

# Play the utterance at the sampling rate stored with the record
Audio(record['audio']['array'], rate=record['audio']['sampling_rate'])


Note that the dataset contains multiple utterances for the same `id`: the `id` identifies a sentence of the parallel corpus, and within a language the same sentence may be recorded by several speakers of different genders.

So, when we need the reference ground truth for evaluation, we match on the sentence `id` (not on the utterance) and take the `raw_transcription` field of the English record with that `id` as the reference. You will see more on this in the next section.


Let's try to find a different utterance of the sentence above. We need to search for another audio record with the same `id`.

In [None]:
for i in itr :
    if i['id']==1982:
        break
record=i
print(record)

In [66]:
Audio(record['audio']['array'], rate=16000)

## Evaluation metrics and code

We use the `BLEU` score as our standard metric to evaluate the performance of the models on the FLEURS dataset. As in Table 4 of the paper, we compute it with the `sacrebleu` library.

As mentioned earlier, the English portion of the FLEURS dataset serves as the reference ground truth, and we take the `raw_transcription` field as the reference text.

**Note:** Although we use only the test split for each source language, for English we need the entire dataset, because a given sentence id can fall into different splits in different languages. During inference we save the generated translations in a hash map keyed by `actualId_filenameId`. This keeps a separate translation for each unique utterance of a given actual id, and the actual id is then used to fetch the reference ground truth from the English data. For example:

```python
{
    'id': 1982,
    'num_samples': 118080,
    'path': None,
    'audio': {
        'path': 'test/10021730821550109934.wav',
        'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
       -1.80006027e-05, -2.40206718e-05, -9.59634781e-05]),
       'sampling_rate': 16000},
    'transcription': 'жінки усім подорожнім жінкам радять казати що вони заміжні незалежно від справжнього сімейного стану', 'raw_transcription': 'Жінки: усім подорожнім жінкам радять казати, що вони заміжні, незалежно від справжнього сімейного стану.',
    'gender': 0,
    'lang_id': 92,
    'language': 'Ukrainian',
    'lang_group_id': 1
    }

```

For this record the key is `1982_10021730821550109934` and the value is the generated translation. During evaluation, we use the actual id `1982` (the part before the underscore) to fetch the reference ground truth from the English data.
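The inference loops in this notebook build the key as `str(item['id'])+'_'+str(id)`, that is, the sentence id first and the audio file name id second, and `get_bleu_score` recovers the sentence id with `int(key.split('_')[0])`. A minimal sketch of that round trip follows; the helper names `make_key` and `sentence_id_from_key` are illustrative.

```python
def make_key(sentence_id, file_id):
    """Build the dictionary key: sentence id first, then the audio file name id."""
    return f"{sentence_id}_{file_id}"


def sentence_id_from_key(key):
    """Recover the integer sentence id used to look up the English reference."""
    return int(key.split("_")[0])


key = make_key(1982, "10021730821550109934")
print(key)
print(sentence_id_from_key(key))
```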

In [None]:
def get_bleu_score(generated_translations):
  """Return {lang_code: corpus BLEU} scored against the English FLEURS references."""

  # Load every split of the English data, because a sentence id may sit in any of them
  eng_data=load_dataset('google/fleurs',name='en_us',trust_remote_code=True)
  eng_translation={}
  for split in eng_data:
    for item in tqdm(eng_data[split]):
      eng_translation[item['id']]=item['raw_transcription']

  bleu_score={}
  for lang_code in generated_translations: # one corpus-level BLEU score per language

    translations=[]
    gt_translations=[]

    for key,translation in generated_translations[lang_code].items():
        sentence_id=int(key.split('_')[0]) # keys look like '<sentence id>_<file name id>'
        gt_translations.append(eng_translation[sentence_id])
        translations.append(translation)

    try:
      bleu = sacrebleu.corpus_bleu(translations, [gt_translations])
      bleu_score[lang_code]=round(bleu.score, 3)
    except Exception as e:
      # Report the language that could not be scored instead of failing silently
      print("Could not score", lang_code, ":", e)

  return bleu_score
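In `get_bleu_score`, a single sentence id missing from the English data raises a `KeyError` inside the inner loop. One way to make this visible without stopping the run is to separate alignment from scoring and collect the keys that have no reference. The sketch below illustrates that idea on toy data; the function name `align_for_bleu` and the example strings are invented, and the notebook itself does not do this.

```python
def align_for_bleu(generated, references):
    """Return (hypotheses, refs, missing_keys) with hypotheses and refs index-aligned."""
    hypotheses, refs, missing = [], [], []
    for key, hypothesis in generated.items():
        sentence_id = int(key.split("_")[0])
        if sentence_id not in references:
            # Keep track of utterances that cannot be scored.
            missing.append(key)
            continue
        hypotheses.append(hypothesis)
        refs.append(references[sentence_id])
    return hypotheses, refs, missing


# Toy data; the strings are invented for illustration.
toy_generated = {"1_111": "hello world", "2_222": "good morning", "9_999": "orphan"}
toy_references = {1: "Hello world.", 2: "Good morning."}

hyps, refs, missing = align_for_bleu(toy_generated, toy_references)
print(len(hyps), len(refs), missing)
# The aligned lists can then be passed as sacrebleu.corpus_bleu(hyps, [refs]).
```

Keeping the two lists index-aligned is what `sacrebleu.corpus_bleu` relies on: the i-th hypothesis is compared with the i-th reference.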


#### Setting up the results directory

In [None]:
import os
results_directory='results/fleurs'
# exist_ok=True makes this cell safe to re-run and creates any missing subdirectory
os.makedirs(os.path.join(results_directory,'scores'),exist_ok=True)
os.makedirs(os.path.join(results_directory,'generations'),exist_ok=True)

scores_path=os.path.join(results_directory,'scores')
generations_path=os.path.join(results_directory,'generations')


## Evaluate the Seamless models

The claims under study are evaluated on both the Seamless medium and large models. The two models differ only in their number of parameters, so the inference procedure is the same for both.

 ❗ **Note** : *To evaluate the Seamless models on the FLEURS data, set `model_type` to `medium` or `large` and run the code in this section.*

In [None]:
# model_type="medium"
model_type="large"

#### Load the model

We use the Seamless models that Facebook added to Hugging Face 🤗; you can find more information in the [model card](https://huggingface.co/facebook/seamless-m4t-medium). The code in this section is adapted from the documentation available [here](https://huggingface.co/docs/transformers/v4.38.0/en/model_doc/seamless_m4t#overview).

In [None]:
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-"+model_type)
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-"+model_type)

model.cuda()

#### Model Inference

Inference on this large dataset takes a long time, so be patient. For us, on a single RTX 6000 GPU, it took around 2 days per model to generate translations for the entire test split of the FLEURS dataset.

In [None]:
import collections
generated_translations=collections.defaultdict(dict) #hashmap to store translations for each language code

In [None]:
lang_issue=[]
lang_missing_ids=[]

Since inference takes a long time, the code writes the generated translations to a file after each language, so that after an interruption we can reload them and resume where we left off without repeating finished work.

We run inference only on the audio utterances listed in the evaluation benchmark provided by the Seamless team. The generated translations are saved with `actualId_filenameId` as the key and the generated translation as the value.
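The resume logic can be reduced to a small load, skip, and save pattern. The sketch below shows it with a placeholder in place of model inference. Writing to a temporary file and renaming it is an addition of this sketch (the notebook writes the JSON file directly); it is meant to keep an interruption during the write from leaving a truncated checkpoint. The helper names, the toy work list, and the checkpoint path are all illustrative.

```python
import json
import os
import tempfile
from collections import defaultdict


def load_checkpoint(path):
    """Load saved translations, or start empty if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return defaultdict(dict, json.load(f))
    return defaultdict(dict)


def save_checkpoint(path, data):
    """Write to a temporary file first, then rename it over the checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(data, f, indent=2)
    os.replace(tmp_path, path)


# Demonstration with a throwaway directory and a stand-in for model inference.
checkpoint = os.path.join(tempfile.mkdtemp(), "generations.json")
work = {"ukr": ["1_a", "2_b"], "hin": ["3_c"]}

for attempt in range(2):  # the second pass simulates a restart
    results = load_checkpoint(checkpoint)
    computed = 0
    for lang, keys in work.items():
        for key in keys:
            if key in results[lang]:
                continue  # already translated in an earlier run
            results[lang][key] = f"translation of {key}"  # placeholder for model output
            computed += 1
        save_checkpoint(checkpoint, results)  # one save per language
    print(f"run {attempt}: computed {computed} new translations")
```

On the second pass nothing is recomputed, which is the behaviour the per-language save in the loop below is meant to provide.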

In [None]:
import torch
import gc

for file_name in  x_eng_files:
  lang_code = file_name.split("-")[0].split("_")[-1]
  with open(base_path+file_name) as f:
    ids=f.read().split()

  if (lang_code in generated_translations.keys()) and (len(ids)==len(generated_translations[lang_code])) :
    print("Done")
    continue

  try:
    src_lang_data=load_dataset("google/fleurs",name=lang_dict[lang_code],split="test",streaming=True,trust_remote_code=True)
  except:
    lang_issue.append(lang_code)
    print("\n Missing language ",lang_code)
    continue



  for item in tqdm(src_lang_data,total=len(ids)):
    audio_sample = item['audio']
    id=audio_sample['path'].split("/")[-1].split(".")[0]

    if str(item['id'])+'_'+str(id) in generated_translations[lang_code]:
      continue

    if id not in ids:
      continue


    # Reset per-utterance state so a failed attempt cannot reuse the previous result
    translation = None
    audio_inputs = output_tokens = None

    try:
        # Initially, try to process the audio on the GPU
        audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt", sampling_rate=16000)
        audio_inputs = {k: v.to('cuda') for k, v in audio_inputs.items()}
        with torch.no_grad():
            output_tokens = model.generate(**audio_inputs, tgt_lang="eng", generate_speech=False)
        translation = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
    except RuntimeError as e:
        if "CUDA out of memory" in str(e):
            print("\nCUDA out of memory. Shifting inference to CPU for ID:", id)
            torch.cuda.empty_cache()  # Clear any unreleased memory

            # Move the model to CPU for this inference
            model.to('cpu')

            try:
                # Make sure audio_inputs are on the CPU as well
                audio_inputs = {k: v.to('cpu') for k, v in audio_inputs.items()}
                with torch.no_grad():
                    output_tokens = model.generate(**audio_inputs, tgt_lang="eng", generate_speech=False)
                translation = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
            except Exception as cpu_e:
                print("\nFailed processing on CPU for ID:", id, "with error:", cpu_e)
                lang_missing_ids.append((lang_code, id))
            finally:
                # Regardless of the outcome, put the model back on the GPU for subsequent operations
                model.to('cuda')
        else:
            print("\nAn error occurred for ID:", id, "Error:", e)
            lang_missing_ids.append((lang_code, id))
    except Exception as e:
        print("\nAn unexpected error occurred for ID:", id, "Error:", e)
        lang_missing_ids.append((lang_code, id))

    # Free the tensors before the next utterance
    del audio_inputs, output_tokens

    # Store the translation only if inference succeeded
    if translation is not None:
        generated_translations[lang_code][str(item['id'])+'_'+str(id)]=translation
  torch.cuda.empty_cache()

  with open (os.path.join(generations_path,'Seamless '+model_type+'.json'),'w')as f:
    json.dump(generated_translations,f,indent=2)






100%|██████████| 750/750 [09:11<00:00,  1.36it/s]
100%|██████████| 908/908 [11:45<00:00,  1.29it/s]
100%|██████████| 926/926 [12:11<00:00,  1.27it/s]
100%|██████████| 728/728 [08:41<00:00,  1.40it/s]
100%|██████████| 749/749 [08:40<00:00,  1.44it/s]
100%|██████████| 1015/1015 [12:48<00:00,  1.32it/s]
264it [03:08,  1.40it/s]
100%|██████████| 1041/1041 [15:38<00:00,  1.11it/s]
100%|██████████| 918/918 [11:43<00:00,  1.31it/s]
100%|██████████| 357/357 [04:26<00:00,  1.34it/s]
100%|██████████| 1021/1021 [13:00<00:00,  1.31it/s]
100%|██████████| 880/880 [11:24<00:00,  1.29it/s]
100%|██████████| 946/946 [11:01<00:00,  1.43it/s]
100%|██████████| 977/977 [11:28<00:00,  1.42it/s]
100%|██████████| 687/687 [08:17<00:00,  1.38it/s]
100%|██████████| 857/857 [10:27<00:00,  1.37it/s]
100%|██████████| 660/660 [11:18<00:00,  1.03s/it]
100%|██████████| 621/621 [09:49<00:00,  1.05it/s]
100%|██████████| 792/792 [09:20<00:00,  1.41it/s]
100%|██████████| 925/925 [15:11<00:00,  1.01it/s]
100%|██████████| 10


CUDA out of memory. Shifting inference to CPU for ID: 3189556219205510204


 82%|████████▏ | 312/379 [05:56<00:58,  1.14it/s]


CUDA out of memory. Shifting inference to CPU for ID: 6634898757415929965


100%|██████████| 379/379 [08:08<00:00,  1.29s/it]
100%|██████████| 905/905 [11:21<00:00,  1.33it/s]
100%|██████████| 964/964 [13:15<00:00,  1.21it/s]
100%|██████████| 919/919 [11:39<00:00,  1.31it/s]
100%|██████████| 920/920 [11:09<00:00,  1.38it/s]
100%|██████████| 831/831 [10:39<00:00,  1.30it/s]
100%|██████████| 771/771 [10:06<00:00,  1.27it/s]
100%|██████████| 364/364 [04:10<00:00,  1.45it/s]
100%|██████████| 979/979 [12:03<00:00,  1.35it/s]
100%|██████████| 854/854 [12:44<00:00,  1.12it/s]
100%|██████████| 980/980 [12:53<00:00,  1.27it/s]
100%|██████████| 1019/1019 [13:19<00:00,  1.27it/s]
100%|██████████| 371/371 [05:31<00:00,  1.12it/s]
100%|██████████| 723/723 [09:26<00:00,  1.28it/s]
100%|██████████| 1021/1021 [13:15<00:00,  1.28it/s]
100%|██████████| 862/862 [10:54<00:00,  1.32it/s]
100%|██████████| 299/299 [03:40<00:00,  1.36it/s]
100%|██████████| 871/871 [11:20<00:00,  1.28it/s]
100%|██████████| 743/743 [09:15<00:00,  1.34it/s]
100%|██████████| 1000/1000 [11:28<00:00,  1.45


CUDA out of memory. Shifting inference to CPU for ID: 12560373056138365189


100%|██████████| 405/405 [07:16<00:00,  1.08s/it]
100%|██████████| 984/984 [11:58<00:00,  1.37it/s]
100%|██████████| 761/761 [10:01<00:00,  1.26it/s]
100%|██████████| 759/759 [09:22<00:00,  1.35it/s]
100%|██████████| 512/512 [06:43<00:00,  1.27it/s]
100%|██████████| 998/998 [14:30<00:00,  1.15it/s]
100%|██████████| 478/478 [08:00<00:00,  1.00s/it]
100%|██████████| 865/865 [11:54<00:00,  1.21it/s]
100%|██████████| 973/973 [12:13<00:00,  1.33it/s]
100%|██████████| 382/382 [04:34<00:00,  1.39it/s]
100%|██████████| 723/723 [08:57<00:00,  1.35it/s]
100%|██████████| 932/932 [11:28<00:00,  1.35it/s]
100%|██████████| 927/927 [11:11<00:00,  1.38it/s]
100%|██████████| 676/676 [08:19<00:00,  1.35it/s]
100%|██████████| 883/883 [10:34<00:00,  1.39it/s]
100%|██████████| 827/827 [13:04<00:00,  1.05it/s]
100%|██████████| 949/949 [11:16<00:00,  1.40it/s]
100%|██████████| 46/46 [00:39<00:00,  1.16it/s]
100%|██████████| 541/541 [07:16<00:00,  1.24it/s]
100%|██████████| 914/914 [11:13<00:00,  1.36it/s]
10
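The log shows a few utterances falling back to the CPU after a CUDA out-of-memory error. The control flow behind that fallback can be isolated into a small helper, sketched below with stand-in functions instead of real GPU and CPU inference. The names `is_oom`, `run_with_fallback`, `fake_gpu`, and `fake_cpu` are illustrative, and the message check is a heuristic (a case-insensitive match on "out of memory"), not an official API.

```python
def is_oom(error):
    """Heuristic check for an out-of-memory error based on its message."""
    return "out of memory" in str(error).lower()


def run_with_fallback(primary, fallback, item):
    """Try primary(item); on an out-of-memory RuntimeError retry with fallback(item).

    Returns (result, None) on success or (None, error) when no result was produced.
    """
    try:
        return primary(item), None
    except RuntimeError as error:
        if not is_oom(error):
            return None, error
    try:
        return fallback(item), None
    except Exception as error:
        return None, error


# Stand-ins for GPU and CPU inference; a real version would call model.generate.
def fake_gpu(x):
    if len(x) > 3:
        raise RuntimeError("CUDA out of memory. Tried to allocate ...")
    return f"gpu:{x}"


def fake_cpu(x):
    return f"cpu:{x}"


print(run_with_fallback(fake_gpu, fake_cpu, "abc"))
print(run_with_fallback(fake_gpu, fake_cpu, "abcdef"))
```

Returning an explicit `(result, error)` pair lets the caller skip storing anything for an utterance that failed on both devices.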

In [None]:
len(generated_translations)

101

In [None]:
import json
from collections import defaultdict

with open(os.path.join(generations_path,'Seamless '+model_type+'.json')) as f:
    generated_translations = json.load(f)

# Convert to defaultdict with empty dictionaries as default values
generated_translations = defaultdict(dict, generated_translations)

#### Evaluation

Let's evaluate against the reference English transcriptions.

In [None]:
seamless_fleurs_bleu=get_bleu_score(generated_translations)

100%|██████████| 2602/2602 [00:02<00:00, 921.40it/s]
100%|██████████| 394/394 [00:00<00:00, 989.93it/s] 
100%|██████████| 647/647 [00:00<00:00, 949.56it/s]


In [None]:
dict(sorted(seamless_fleurs_bleu.items()))

{'afr': 39.69,
 'amh': 17.034,
 'arb': 31.725,
 'asm': 17.47,
 'ast': 25.894,
 'azj': 16.425,
 'bel': 16.056,
 'ben': 22.778,
 'bos': 32.891,
 'bul': 31.33,
 'cat': 37.574,
 'ceb': 7.723,
 'ces': 31.016,
 'ckb': 20.487,
 'cmn': 18.98,
 'cym': 30.22,
 'dan': 33.553,
 'deu': 35.469,
 'ell': 24.804,
 'est': 28.534,
 'fin': 25.782,
 'fra': 32.641,
 'ful': 0.788,
 'gaz': 0.317,
 'gle': 10.654,
 'glg': 32.033,
 'guj': 27.164,
 'hau': 0.544,
 'heb': 28.226,
 'hin': 25.194,
 'hrv': 29.8,
 'hun': 24.166,
 'hye': 27.81,
 'ibo': 1.27,
 'ind': 28.81,
 'isl': 22.854,
 'ita': 25.307,
 'jav': 19.459,
 'jpn': 15.886,
 'kam': 1.803,
 'kan': 21.799,
 'kat': 18.741,
 'kaz': 21.338,
 'kea': 27.313,
 'khk': 16.258,
 'khm': 18.62,
 'kir': 16.771,
 'kor': 18.402,
 'lao': 19.088,
 'lin': 0.917,
 'lit': 20.675,
 'ltz': 14.429,
 'lug': 16.179,
 'luo': 0.789,
 'lvs': 27.666,
 'mal': 20.99,
 'mar': 21.372,
 'mkd': 33.972,
 'mlt': 38.23,
 'mri': 0.99,
 'mya': 14.676,
 'nld': 26.502,
 'nob': 33.007,
 'npi': 23.518,

In [None]:
with open(os.path.join(scores_path,'Seamless '+model_type+'.json'),'w')as f:
  json.dump(seamless_fleurs_bleu,f)

## Evaluate the Whisper model

#### Load the model

In [None]:
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
model.to('cuda')
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


#### Filter and map language codes for Whisper model

We found that the Whisper model does not accept ISO 639-3 language codes. It takes either a two-letter language code or the full language name. The language can also be omitted, but that degrades performance: the model then has to perform language identification as well, and any identification error adds to the overall error. So we again need to map the ISO 639-3 language codes to the language codes supported by Whisper.

We used the `pycountry` library to get the two-letter code for each ISO 639-3 code. As mentioned at the beginning, Whisper does not support all the languages in the FLEURS dataset, so we evaluate it only on the languages it supports.

In [None]:
allowed_whisper_lang=[
            'en', 'zh', 'de', 'es', 'ru', 'ko', 'fr', 'ja', 'pt', 'tr', 'pl', 'ca', 'nl', 'ar', 'sv', 'it', 'id', 'hi', 'fi', 'vi',
            'he', 'uk', 'el', 'ms', 'cs', 'ro', 'da', 'hu', 'ta', 'no', 'th', 'ur', 'hr', 'bg', 'lt', 'la', 'mi', 'ml', 'cy', 'sk',
            'te', 'fa', 'lv', 'bn', 'sr', 'az', 'sl', 'kn', 'et', 'mk', 'br', 'eu', 'is', 'hy', 'ne', 'mn', 'bs', 'kk', 'sq', 'sw',
            'gl', 'mr', 'pa', 'si', 'km', 'sn', 'yo', 'so', 'af', 'oc', 'ka', 'be', 'tg', 'sd', 'gu', 'am', 'yi', 'lo', 'uz', 'fo',
            'ht', 'ps', 'tk', 'nn', 'mt', 'sa', 'lb', 'my', 'bo', 'tl', 'mg', 'as', 'tt', 'haw', 'ln', 'ha', 'ba', 'jw', 'su',
            'yue', 'my', 'ca', 'nl', 'ht', 'lb', 'ps', 'pa', 'ro', 'ro', 'si', 'es', 'zh']

In [None]:
import pycountry

def iso639_3_to_iso639_1(code_639_3):
    # remove older seamless specific mappings
    cross_mapping = dict(zip(new_codes, old_codes))

    # Additional special cases for direct conversion from ISO 639-3 to ISO 639-1
    special_cases = {
        'cmn': 'zh',  # Mandarin Chinese
        'nob': 'no',  # Norwegian Bokmål
        'jav': 'jw'   # Javanese
        # Add any other special cases if needed
    }

    # Check special cases first
    if code_639_3 in special_cases:
        return special_cases[code_639_3]

    # Use cross-mapping to find the corresponding old code if available
    if code_639_3 in cross_mapping:
        code_639_3 = cross_mapping[code_639_3]

    # Use pycountry to attempt to convert any code to ISO 639-1 two-letter code
    try:
        language = pycountry.languages.get(alpha_3=code_639_3)
        return language.alpha_2 if hasattr(language, 'alpha_2') else code_639_3
    except AttributeError:
        # Return the original code if no ISO 639-1 code is found
        return code_639_3

In [None]:
whisper_codes={}
for code in lang_dict:
  x=iso639_3_to_iso639_1(code)
  if x in allowed_whisper_lang:
    whisper_codes[code]=x



In [None]:
len(whisper_codes)

82
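To see which languages the filter drops, the candidate codes can be split into a supported and an unsupported part. The sketch below does this on toy inputs; converting the allowed list to a set also absorbs its duplicate entries. The function name `split_by_support` and the toy dictionaries are illustrative and only mimic the shape of `lang_dict` and `allowed_whisper_lang`.

```python
def split_by_support(candidate_codes, supported):
    """Split {iso639_3: whisper_code} into a kept dict and a sorted list of dropped keys."""
    supported = set(supported)  # set membership also absorbs duplicate entries
    kept = {k: v for k, v in candidate_codes.items() if v in supported}
    dropped = sorted(k for k in candidate_codes if k not in kept)
    return kept, dropped


# Toy inputs: a few entries in the spirit of the cells above.
toy_candidates = {"ukr": "uk", "cmn": "zh", "ceb": "ceb", "kea": "kea"}
toy_supported = ["uk", "zh", "en", "zh"]

kept, dropped = split_by_support(toy_candidates, toy_supported)
print(kept)
print(dropped)
```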

#### Model Inference

In [None]:
with open(os.path.join(generations_path,'Whisper large-v2.json')) as f:
  generated_translations=json.load(f)

with open(os.path.join(generations_path,'Whisper large-v2_asr.json')) as f:
  generated_transcriptions=json.load(f)

In [None]:
import collections
generated_translations=collections.defaultdict(dict, generated_translations)
generated_transcriptions=collections.defaultdict(dict,generated_transcriptions)

In [None]:
lang_issue=[]
lang_missing_ids=[]


The inference code is similar to that of the Seamless model. The only difference is that we pass the two-letter language code to the Whisper decoder.

**❗️ Note:** Although Whisper is given the two-letter language code, the generated translations are still saved in a nested dictionary under the ISO 639-3 language code, with `actualId_filenameId` as the key and the generated translation as the value. This makes comparison easier during the final analysis.
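Because every model's results are keyed by the same ISO 639-3 codes, comparing two models reduces to intersecting the keys of their score dictionaries. A minimal sketch follows, using a few BLEU values copied from the Seamless and Whisper score listings in this notebook; the function name `compare_on_shared` is illustrative.

```python
def compare_on_shared(scores_a, scores_b):
    """Return {lang: (a, b)} for languages present in both score dictionaries."""
    shared = sorted(set(scores_a) & set(scores_b))
    return {lang: (scores_a[lang], scores_b[lang]) for lang in shared}


# A few values copied from the score listings in this notebook.
seamless_scores = {"afr": 39.69, "amh": 17.034, "ast": 25.894}
whisper_scores = {"afr": 34.099, "amh": 0.872}

shared = compare_on_shared(seamless_scores, whisper_scores)
print(shared)
```

Languages that only one model supports, such as `ast` here, simply drop out of the comparison.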



During inference we add one extra decoding pass that uses the model as an ASR system to generate transcriptions. These come in handy for the cascaded pipeline, where Whisper acts as the ASR component.

In [None]:
import torch
import gc

for file_name in  x_eng_files:
  lang_code = file_name.split("-")[0].split("_")[-1]
  with open(base_path+file_name) as f:
    ids=f.read().split()

  if (lang_code in generated_translations.keys()) and (len(ids)==len(generated_translations[lang_code])) :
    print("Done")
    continue



  try:

    forced_decoder_ids={
    'translate' : processor.get_decoder_prompt_ids(language=whisper_codes[lang_code], task="translate"),
    'transcribe': processor.get_decoder_prompt_ids(language=whisper_codes[lang_code], task="transcribe")
    }
    src_lang_data=load_dataset("google/fleurs",name=lang_dict[lang_code],split="test",streaming=True,trust_remote_code=True)

  except:
    lang_issue.append(lang_code)
    print("\n Missing language in whisper",lang_code)
    continue




  for item in tqdm(src_lang_data,total=len(ids)):
    audio_sample = item['audio']
    id=audio_sample['path'].split("/")[-1].split(".")[0]

    if str(item['id'])+'_'+str(id) in generated_translations[lang_code]:
      continue

    if id not in ids:
      continue

    # Reset per-utterance state so a failed attempt cannot reuse the previous result
    translation = transcription = None
    input_features = translate_output = transcript_output = None

    try:
        # Initially, try to process the audio on the GPU
        input_features = processor(audio_sample["array"], sampling_rate=16000, return_tensors="pt").input_features
        input_features= input_features.to('cuda')
        with torch.no_grad():
          translate_output = model.generate(input_features,
                                            forced_decoder_ids=forced_decoder_ids['translate'])
          transcript_output = model.generate(input_features,
                                             forced_decoder_ids=forced_decoder_ids['transcribe'])

        translation = processor.batch_decode(translate_output, skip_special_tokens=True)
        transcription = processor.batch_decode(transcript_output, skip_special_tokens=True)
    except RuntimeError as e:
        if "CUDA out of memory" in str(e):
            print("\nCUDA out of memory. Shifting inference to CPU for ID:", id)
            torch.cuda.empty_cache()  # Clear any unreleased memory

            # Move the model to CPU for this inference
            model.to('cpu')

            try:
                input_features= input_features.to('cpu')
                with torch.no_grad():
                  translate_output = model.generate(input_features,
                                                    forced_decoder_ids=forced_decoder_ids['translate'])
                  transcript_output = model.generate(input_features,
                                                     forced_decoder_ids=forced_decoder_ids['transcribe'])

                translation = processor.batch_decode(translate_output, skip_special_tokens=True)
                transcription = processor.batch_decode(transcript_output, skip_special_tokens=True)
            except Exception as cpu_e:
                print("\nFailed processing on CPU for ID:", id, "with error:", cpu_e)
                lang_missing_ids.append((lang_code, id))
            finally:
                torch.cuda.empty_cache()
                # Regardless of the outcome, put the model back on the GPU for subsequent operations
                model.to('cuda')
        else:
            print("\nAn error occurred for ID:", id, "Error:", e)
            lang_missing_ids.append((lang_code, id))
    except Exception as e:
        print("\nAn unexpected error occurred for ID:", id, "Error:", e)
        lang_missing_ids.append((lang_code, id))

    # Free the tensors before the next utterance
    del input_features ,transcript_output, translate_output

    # Store the outputs only if both decoding passes succeeded
    if translation is not None and transcription is not None:
        generated_translations[lang_code][str(item['id'])+'_'+str(id)]=translation[0]
        generated_transcriptions[lang_code][str(item['id'])+'_'+str(id)]=transcription[0]
  torch.cuda.empty_cache()

  with open (os.path.join(generations_path,'Whisper large-v2.json'),'w')as f:
    json.dump(generated_translations,f,indent=2)

  with open (os.path.join(generations_path,'Whisper large-v2_asr.json'),'w')as f:
    json.dump(generated_transcriptions,f,indent=2)






Done
Missing language in whisper ckb
Missing language in whisper kir
Missing language in whisper wol
Missing language in whisper ful
Missing language in whisper umb
Missing language in whisper kam
Missing language in whisper nya
Missing language in whisper tgl
Missing language in whisper ceb
Missing language in whisper ast
Missing language in whisper nso
Missing language in whisper zul
Missing language in whisper kea
Missing language in whisper gle
Missing language in whisper luo
Missing language in whisper ory
Missing language in whisper lug
Missing language in whisper ibo
Missing language in whisper gaz

 59%|█████▉    | 305/516 [27:34<24:01,  6.83s/it]

#### Evaluation

In [None]:
len(generated_transcriptions)

81

In [None]:
whisper_bleu_score= get_bleu_score(generated_translations)

100%|██████████| 2602/2602 [00:03<00:00, 791.18it/s]
100%|██████████| 394/394 [00:00<00:00, 862.31it/s]
100%|██████████| 647/647 [00:00<00:00, 846.46it/s]


In [31]:
dict(sorted(whisper_bleu_score.items()))

{'afr': 34.099,
 'amh': 0.872,
 'arb': 23.293,
 'asm': 2.734,
 'azj': 13.053,
 'bel': 10.724,
 'ben': 10.248,
 'bos': 29.756,
 'bul': 28.144,
 'cat': 34.34,
 'ces': 27.707,
 'cmn': 17.414,
 'cym': 11.277,
 'dan': 32.915,
 'deu': 34.783,
 'ell': 23.71,
 'est': 18.025,
 'fin': 21.988,
 'fra': 32.398,
 'glg': 27.381,
 'guj': 14.435,
 'hau': 0.088,
 'heb': 20.133,
 'hin': 21.75,
 'hrv': 26.93,
 'hun': 20.836,
 'hye': 14.136,
 'ind': 29.236,
 'isl': 8.742,
 'ita': 23.384,
 'jav': 5.03,
 'jpn': 18.732,
 'kan': 9.238,
 'kat': 0.943,
 'kaz': 2.969,
 'khk': 0.418,
 'khm': 3.701,
 'kor': 21.323,
 'lao': 6.639,
 'lin': 0.242,
 'lit': 12.512,
 'ltz': 14.265,
 'lvs': 12.86,
 'mal': 13.515,
 'mar': 10.27,
 'mkd': 27.217,
 'mlt': 11.974,
 'mri': 6.448,
 'mya': 0.165,
 'nld': 23.956,
 'nob': 31.246,
 'npi': 11.645,
 'oci': 18.824,
 'pan': 14.582,
 'pbt': 1.496,
 'pes': 19.422,
 'pol': 21.776,
 'por': 37.99,
 'ron': 31.377,
 'rus': 27.849,
 'slk': 26.05,
 'slv': 16.705,
 'sna': 0.386,
 'snd': 3.732,
 '

In [None]:
with open(os.path.join(scores_path, 'Whisper large-v2.json'), 'w') as f:
    json.dump(whisper_bleu_score, f)

## Cascaded pipeline of Whisper and NLLB

#### Load the NLLB model

In [72]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-1.3B").to("cuda").eval()


#### Mapping language IDs for NLLB inference

According to the NLLB paper, source and target languages are specified with the language codes of the FLORES dataset, which pair an ISO 639-3 language code with an ISO 15924 script code (for example `eng_Latn`). The mapping below converts ISO 639-3 codes into codes of that form.

In [32]:
flores_mapping = {
    'afr': 'afr_Latn',
    'amh': 'amh_Ethi',
    'ara': 'arb_Arab',
    'asm': 'asm_Beng',
    'ast': 'ast_Latn',
    'azj': 'azj_Latn',
    'bel': 'bel_Cyrl',
    'ben': 'ben_Beng',
    'bos': 'bos_Latn',
    'bul': 'bul_Cyrl',
    'cat': 'cat_Latn',
    'ceb': 'ceb_Latn',
    'ces': 'ces_Latn',
    'ckb': 'ckb_Arab',
    'cym': 'cym_Latn',
    'dan': 'dan_Latn',
    'deu': 'deu_Latn',
    'ell': 'ell_Grek',
    'eng': 'eng_Latn',
    'est': 'est_Latn',
    'fin': 'fin_Latn',
    'fra': 'fra_Latn',
    'ful': 'fuv_Latn',
    'gle': 'gle_Latn',
    'glg': 'glg_Latn',
    'guj': 'guj_Gujr',
    'hau': 'hau_Latn',
    'heb': 'heb_Hebr',
    'hin': 'hin_Deva',
    'hrv': 'hrv_Latn',
    'hun': 'hun_Latn',
    'hye': 'hye_Armn',
    'ibo': 'ibo_Latn',
    'ind': 'ind_Latn',
    'isl': 'isl_Latn',
    'ita': 'ita_Latn',
    'jav': 'jav_Latn',
    'jpn': 'jpn_Jpan',
    'kam': 'kam_Latn',
    'kan': 'kan_Knda',
    'kat': 'kat_Geor',
    'kaz': 'kaz_Cyrl',
    'khm': 'khm_Khmr',
    'kir': 'kir_Cyrl',
    'kor': 'kor_Hang',
    'lao': 'lao_Laoo',
    'lij': 'lij_Latn',  # Ligurian
    'kea': 'kea_Latn',  # Kabuverdianu
    'lin': 'lin_Latn',
    'lit': 'lit_Latn',
    'ltz': 'ltz_Latn',
    'lug': 'lug_Latn',
    'luo': 'luo_Latn',
    'lav': 'lvs_Latn',
    'mal': 'mal_Mlym',
    'mar': 'mar_Deva',
    'mkd': 'mkd_Cyrl',
    'mlt': 'mlt_Latn',
    'mon': 'khk_Cyrl',
    'mri': 'mri_Latn',
    'mya': 'mya_Mymr',
    'nld': 'nld_Latn',
    'nob': 'nob_Latn',
    'npi': 'npi_Deva',
    'nso': 'nso_Latn',
    'nya': 'nya_Latn',
    'oci': 'oci_Latn',
    'orm': 'gaz_Latn',
    'ory': 'ory_Orya',
    'pan': 'pan_Guru',
    'fas': 'pes_Arab',
    'pol': 'pol_Latn',
    'por': 'por_Latn',
    'pus': 'pbt_Arab',
    'ron': 'ron_Latn',
    'rus': 'rus_Cyrl',
    'slk': 'slk_Latn',
    'sna': 'sna_Latn',
    'snd': 'snd_Arab',
    'som': 'som_Latn',
    'spa': 'spa_Latn',
    'srp': 'srp_Cyrl',
    'swe': 'swe_Latn',
    'swh': 'swh_Latn',
    'tam': 'tam_Taml',
    'tel': 'tel_Telu',
    'tgk': 'tgk_Cyrl',
    'tgl': 'tgl_Latn',
    'tha': 'tha_Thai',
    'tur': 'tur_Latn',
    'ukr': 'ukr_Cyrl',
    'umb': 'umb_Latn',
    'urd': 'urd_Arab',
    'uzb': 'uzn_Latn',
    'vie': 'vie_Latn',
    'wol': 'wol_Latn',
    'xho': 'xho_Latn',
    'yor': 'yor_Latn',
    'zho_simpl': 'zho_Hans',
    'zho_trad': 'zho_Hant',
    'msa': 'zsm_Latn',
    'zul': 'zul_Latn'
}


Some languages appear under codes that are not keys in `flores_mapping`, so we accommodated them with this additional mapping:

In [62]:
missing_entries={
    'lvs': 'lvs_Latn',
    'slv': 'slv_Latn',
    'cmn': 'zho_Hant',
    'uzn': 'uzn_Latn',
    'pes': 'pes_Arab',
    'zlm': 'zsm_Latn',
    'arb': 'arb_Arab',
    'pbt': 'pbt_Arab',
    'khk': 'khk_Cyrl',
    'yue': 'yue_Hant'
    }
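
The inference loop in this notebook consults the main mapping first and the additional mapping second. That lookup order can be sketched as a small helper. This is a minimal sketch rather than the notebook's own code: `resolve_nllb_code` is a hypothetical name, and the two sample dictionaries are abbreviated copies of `flores_mapping` and `missing_entries`.

```python
from typing import Optional

# Abbreviated copies of the two mappings above (illustration only).
flores_mapping_sample = {"afr": "afr_Latn", "ara": "arb_Arab", "lav": "lvs_Latn"}
missing_entries_sample = {"arb": "arb_Arab", "lvs": "lvs_Latn"}


def resolve_nllb_code(lang_code: str, primary: dict, fallback: dict) -> Optional[str]:
    """Return the NLLB code for lang_code, or None if neither mapping has it."""
    if lang_code in primary:
        return primary[lang_code]
    return fallback.get(lang_code)


# "xxx" is a made-up code used to show the unmapped case.
for code in ["afr", "lvs", "xxx"]:
    print(code, "->", resolve_nllb_code(code, flores_mapping_sample, missing_entries_sample))
```

Returning `None` for an unmapped code lets the caller decide whether to skip the language or record it as missing, instead of relying on an exception.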

#### Load the Whisper generated transcriptions

In [73]:
with open(os.path.join(generations_path, 'Whisper large-v2_asr.json')) as f:
    generated_transcriptions = json.load(f)

In [74]:
generated_translations=collections.defaultdict(dict)

#### Model inference

In [75]:
missing_langs=[]

In [None]:
for lang_code in generated_transcriptions:
  transcripts = generated_transcriptions[lang_code]

  # Look up the NLLB source code: main mapping first, then the additional one
  src_lang = flores_mapping.get(lang_code, missing_entries.get(lang_code))
  if src_lang is None:
    missing_langs.append(lang_code)
    print("Mapping not found for: ", lang_code)
    continue

  # Skip languages whose transcripts have all been translated already
  if len(transcripts) == len(generated_translations[lang_code]):
    print("Done! ", lang_code)
    continue

  tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-1.3B", src_lang=src_lang
  )

  for i in tqdm(transcripts):
    inputs = tokenizer(transcripts[i], return_tensors="pt").to('cuda')
    with torch.no_grad():
      translated_tokens = model.generate(
        **inputs,
        # Force English as the target language
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
        # Output is capped at 30 tokens, so longer translations are truncated
        max_length=30,
      )

    generated_translations[lang_code][i] = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

  # Save after every language so an interrupted run keeps its progress
  with open(os.path.join(generations_path, 'Whisper Large-v2 ASR + NLLB-1.3B.json'), 'w') as f:
    json.dump(generated_translations, f, indent=2)




100%|██████████| 925/925 [07:45<00:00,  1.99it/s]
100%|██████████| 958/958 [08:01<00:00,  1.99it/s]
100%|██████████| 932/932 [07:55<00:00,  1.96it/s]
 34%|███▍      | 303/883 [02:33<05:09,  1.87it/s]

#### Evaluation

In [82]:
cascaded_pipeline_bleu_score= get_bleu_score(generated_translations)

dict(sorted(cascaded_pipeline_bleu_score.items()))

100%|██████████| 2602/2602 [00:03<00:00, 824.66it/s]
100%|██████████| 394/394 [00:00<00:00, 878.26it/s]
100%|██████████| 647/647 [00:00<00:00, 858.65it/s]


{'afr': 28.256,
 'amh': 0.14,
 'arb': 28.011,
 'asm': 1.846,
 'azj': 15.961,
 'bel': 13.166,
 'ben': 2.808,
 'bos': 29.176,
 'bul': 27.868,
 'cat': 34.126,
 'ces': 27.992,
 'cmn': 18.831,
 'cym': 25.287,
 'dan': 30.571,
 'deu': 33.35,
 'ell': 25.258,
 'est': 25.915,
 'fin': 25.309,
 'fra': 33.011,
 'glg': 28.525,
 'guj': 7.589,
 'hau': 2.426,
 'heb': 24.441,
 'hin': 22.31,
 'hrv': 26.475,
 'hun': 23.832,
 'hye': 16.623,
 'ind': 31.659,
 'isl': 16.256,
 'ita': 24.682,
 'jav': 9.461,
 'jpn': 20.478,
 'kan': 14.554,
 'kat': 0.147,
 'kaz': 15.86,
 'khk': 0.218,
 'khm': 1.238,
 'kor': 22.084,
 'lao': 9.402,
 'lin': 5.373,
 'lit': 19.157,
 'ltz': 11.027,
 'lvs': 23.508,
 'mal': 3.664,
 'mar': 12.388,
 'mkd': 29.312,
 'mlt': 12.414,
 'mri': 10.136,
 'mya': 0.392,
 'nld': 25.099,
 'nob': 30.124,
 'npi': 12.741,
 'oci': 17.533,
 'pan': 8.678,
 'pbt': 2.648,
 'pes': 23.219,
 'pol': 22.975,
 'por': 35.599,
 'ron': 30.901,
 'rus': 27.476,
 'slk': 28.702,
 'slv': 21.162,
 'sna': 4.288,
 'snd': 4.38

In [71]:
with open(os.path.join(scores_path, 'Whisper Large-v2 ASR + NLLB-1.3B.json'), 'w') as f:
    json.dump(cascaded_pipeline_bleu_score, f)
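
To see where the cascaded pipeline helps or hurts relative to direct Whisper translation, the two score dictionaries can be differenced per language. The following is a minimal sketch using four scores copied from the outputs above. `bleu_deltas` is a hypothetical helper, not part of the notebook.

```python
# A few per-language BLEU scores copied from the two outputs above.
whisper_scores = {"afr": 34.099, "cym": 11.277, "kaz": 2.969, "por": 37.99}
cascaded_scores = {"afr": 28.256, "cym": 25.287, "kaz": 15.86, "por": 35.599}


def bleu_deltas(direct: dict, cascaded: dict) -> dict:
    """Return cascaded minus direct BLEU for languages present in both."""
    shared = sorted(set(direct) & set(cascaded))
    return {lang: round(cascaded[lang] - direct[lang], 3) for lang in shared}


deltas = bleu_deltas(whisper_scores, cascaded_scores)

# Print the largest gains for the cascaded pipeline first.
for lang, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}: {delta:+.3f}")
```

A positive value means the cascaded pipeline scored higher for that language. Restricting the comparison to shared keys avoids errors if one run is missing a language.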

## Challenges we overcame 💪

1. **Mapping of language codes**: Mapping ISO 639-3 language codes to the codes each model expects took considerable effort, and we had to map some languages to the correct ISO 639-3 code by hand. The language codes used in the evaluation ids also differed from the ones used in the paper, so we mapped those manually as well. This was not limited to Seamless: Whisper inference requires 2-letter codes, and not all of those mappings were available either.

2. **Limited literature around FLEURS data**: Very little literature is available on the FLEURS dataset, so we relied on the information the Seamless team provides in their model card and on the dataset card on Hugging Face. It took some time to realise that multiple utterances are available for the same actual id, and that the n-way mapping is defined on these actual ids, which can land in different splits for different languages.

3. **Inference time**: Inference was slow: generating translations for the entire test split of the FLEURS dataset took around 2 days per model. We therefore wrote the code to save the generated translations to a file as it goes, so that after any interruption we could resume inference from where we left off.

4. **Language support**: Not all languages in the FLEURS dataset are supported by every model. Whisper supports only 82 languages (including English), while Seamless and NLLB support all the languages in the dataset. We therefore evaluated Whisper only on the 81 languages it supports besides English, whereas during inference each model was run on every language it supports.
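
The save-and-resume approach described in point 3 can be sketched as follows. This is a minimal sketch under stated assumptions, not the notebook's exact code: `load_checkpoint` and `translate_with_resume` are hypothetical names, and `str.upper` stands in for the real model call.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> dict:
    """Load previously saved translations, or start empty if none exist."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}


def translate_with_resume(transcripts: dict, path: str, translate_fn) -> dict:
    """Translate only the missing items and save after each language."""
    done = load_checkpoint(path)
    for lang, items in transcripts.items():
        lang_done = done.setdefault(lang, {})
        for item_id, text in items.items():
            if item_id not in lang_done:  # skip work finished before an interruption
                lang_done[item_id] = translate_fn(text)
        # Write the checkpoint once per language.
        with open(path, "w") as f:
            json.dump(done, f, indent=2)
    return done


# Toy usage: str.upper stands in for the real model call.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "translations.json")
toy_transcripts = {"fra": {"1_0": "bonjour"}, "deu": {"2_0": "hallo"}}
result = translate_with_resume(toy_transcripts, checkpoint_path, str.upper)
print(result)
```

Calling `translate_with_resume` again with the same checkpoint path skips every item that is already saved, so an interrupted run only repeats the language it was working on when it stopped.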