# Results Analysis

This notebook helps you analyze the evaluation results from inference over the CoVoST2 and FLEURS datasets. If you are interested in the inference results for only one of the datasets, run only the cells that concern it.

**Make sure you have run notebooks 2 (`covost2_eval.ipynb`) and 3 (`fleurs_eval.ipynb`) so that the evaluation results are available to transfer.**

### Transfer raw results from the resource server to the Jupyter server

This section helps you fetch results from the bare-metal GPU instance where you reproduced the results. The scores and generated outputs will later be compared against the original claims in the papers.

Feel free to skip this section if you have already fetched the results from the remote server.

In [1]:
#put here the reserved fip of the instance where the expr
reserved_fip="192.5.86.195"

In [2]:
from chi import ssh
node = ssh.Remote(reserved_fip)

In [3]:
# Define the remote directory path and the local path to download to
remote_directory = 'results/'
archive_name='results.tar.gz'


The SSH implementation provided by `python-chi`, a wrapper over Fabric, only allows single-file transfers. To transfer a directory, we need to archive the entire remote directory and then transfer the archive file.

In [4]:
node.run(f'tar -czf {archive_name} -C {remote_directory} .') #archive the results



<Result cmd='tar -czf results.tar.gz -C results/ .' exited=0>

In [5]:
node.get(archive_name) #download the archive

<fabric.transfer.Result at 0x7ff21164b340>

In [6]:
import tarfile
with tarfile.open(archive_name) as tar:
    tar.extractall(path=remote_directory) #extract the archive
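
Because the archive comes from a remote machine, it can be worth checking member paths before extracting. The following is a minimal sketch and not part of the original workflow: `is_within` and `safe_extract` are hypothetical helper names, and only the standard-library `os` and `tarfile` modules are assumed.

```python
import os
import tarfile


def is_within(base_dir, target_path):
    """Return True if target_path resolves to a location inside base_dir."""
    base = os.path.realpath(base_dir)
    target = os.path.realpath(target_path)
    return os.path.commonpath([base, target]) == base


def safe_extract(archive_path, dest_dir):
    """Extract archive_path into dest_dir, refusing members that would escape it."""
    with tarfile.open(archive_path) as tar:
        for member in tar.getmembers():
            if not is_within(dest_dir, os.path.join(dest_dir, member.name)):
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(path=dest_dir)
```

With these helpers, `safe_extract(archive_name, remote_directory)` could stand in for the plain `extractall` call above.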

## Let's see how our generated results look

In [9]:
import os, json
import pandas as pd

### CoVoST 2 generated results examples

Let's select a language of choice

In [38]:
language='ca'

Select the first 5 generated results of the chosen language from each model

In [39]:
generated_results_directory_covost='results/covost2/generations/'
results_dict={}
for files in os.listdir(generated_results_directory_covost):
    if files.endswith('.json'):
        with open(generated_results_directory_covost+files) as f:
            generated_translations=json.load(f)
        model_name=files.split('.')[0]
        results_dict[model_name]=generated_translations[language][:5]
        
        
        

In [40]:
pd.set_option('display.max_colwidth', None)

In [41]:
pd.DataFrame(results_dict)

Unnamed: 0,XLS-R (2B),Whisper Large-v2,Seamless large,Seamless medium
0,It supervises the issuance of residence permits.,"Supervisor, let me see the large box, and I'm going to see what's in it.",He oversees the mission of the concession and admission resolutions.,It supervises the issuance of housing concession resolutions.
1,Inspections of slaughterhouses and livestock farms in the event of a declaration of diseases.,Inspection of elevators and splities into case of illness report,Inspect slaughterhouses and livestock farms in case of disease declaration.,Inspect herdsmen and livestock farms in case of a disease declaration.
2,"In March, he asked for a few days in February to mislead his mother.",In March he asked her a few days in February to deceive his mother.,"In March, he asked her for a few days in February to cheat on his mother.","In March, he asked her for a few days in February to deceive his mother."
3,Let's go to Gàtova.,Thank you.,Let's go to Gàtova.,Let's go to Gàtova.
4,"Self-confidence and the ability to deal with things is ripe and unassuming, it is flexible and secure.",Self-confidence and the ability to deal with things is mature and not anxious. It is flexible and safe.,Self-confidence and the ability to deal with things is mature and unconcerned; it is flexible and secure.,"Self-confidence and the ability to deal with things is mature and unwanted, it is flexible and safe."


### Fleurs generated results examples

For the `fleurs` dataset we mapped languages using ISO 639-3 codes, so the code for Armenian, `hy`, becomes `hye`.
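
As an illustration of this kind of code mapping, here is a small sketch. The dictionary `ISO_639_1_TO_3` and the helper `to_iso639_3` are hypothetical names and cover only a handful of languages; the evaluation notebooks may build their mapping differently.

```python
# Partial, illustrative mapping from ISO 639-1 (two-letter) to ISO 639-3 (three-letter) codes.
ISO_639_1_TO_3 = {
    "hy": "hye",  # Armenian
    "ca": "cat",  # Catalan
    "de": "deu",  # German
    "fr": "fra",  # French
    "es": "spa",  # Spanish
}


def to_iso639_3(code):
    """Return the ISO 639-3 code, or the input unchanged if it already has three letters."""
    if len(code) == 3:
        return code
    return ISO_639_1_TO_3[code]  # raises KeyError for codes not in the partial table


mapped_lang = to_iso639_3("hy")
print(mapped_lang)  # hye
```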

In [56]:
mapped_lang='hye'

In [57]:
generated_results_directory_fleurs='results/fleurs/generations/'

Select the first 5 generated results of the chosen language from each model

In [58]:
results_dict={}
for files in os.listdir(generated_results_directory_fleurs):
    if files.endswith('.json'):
        with open(generated_results_directory_fleurs+files) as f:
            generated_translations=json.load(f)
        model_name=files.split('.')[0]
        results_dict[model_name]=[ generated_translations[mapped_lang][key] for key in list(generated_translations[mapped_lang].keys())[:5]]
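
The comprehension above takes the first five values of a dictionary. An equivalent and arguably more readable sketch uses `itertools.islice`. Here `first_n_values` is a hypothetical helper name, the sample data is made up, and the example relies on dictionaries preserving insertion order (Python 3.7+).

```python
from itertools import islice


def first_n_values(mapping, n=5):
    """Return the first n values of a dict in insertion order."""
    return list(islice(mapping.values(), n))


sample = {"utt_0": "a", "utt_1": "b", "utt_2": "c"}
print(first_n_values(sample, 2))  # ['a', 'b']
```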

In [59]:
pd.DataFrame(results_dict)

Unnamed: 0,Whisper large-v2_asr,Whisper large-v2,Seamless large,Seamless medium,Whisper Large-v2 ASR + NLLB-1
0,"այն մեզ տվել է գնացքը, ավտոմեկինան և շատ այլ փոխադրման սարքեր։",It gave us access to a car and many other transportation devices.,"It gave us trains, cars, and many other transportation devices.","It gave us trains, cars, and many other transportation devices.","It gave us trains, cars, and many other means of transportation."
1,Մյուս կողմից արուցի և ջան պայմաները բնական այն շատ երկրնենում և ամբողջ տարին երդհեվեքությունը շարնակվում է հիմնականում անընդհատ կերպով։,"On the other hand, the weather and living conditions are generally good in the country, and the humidity continues throughout the year, mainly continuously.","On the other hand, ice and snow conditions are normal in many countries, and transportation continues throughout the year, mostly continuously.","On the other hand, snow and ice conditions are natural in many countries, and traffic continues throughout the year, mainly in a continuous manner.","On the other hand, the conditions of the rainforest and the jungle are natural in most of the country and throughout the year the"
2,Նախ բուլուր ձիավորները կրում են ձիավարման համար նախատեսված կրունքով և հարդ բավականին նեղ ներբանով կոշիքներ։,"In the beginning, all the warriors wore the armor they were ordered to wear, and the armor was made of fine silk.","First, the bulldozers wear shoes with a mounting braid and a rather thin, flat sole.","First, all the sailors wear shoes with a horse-drawn carriage and a flat, rather narrow sleeve.","First, the bull riders wear shoes with a tight, tight heel designed for riding."
3,"Նա նա այդ ծբաղվում է շատ երկրների համար, թը խտադրամների փորագրծման ուպները աշխատանքի վերջին, որինակներ են էր արյավ Վարջապետի դիմանըկարները կանադական նոր հինգ և հարյուր դոլար աշխողությամբ թը խտադրամների դիմեր էսին։","He also works for many countries. He is the last example of his work. For example, he donated $5,000 and $100,000 worth of photos of the President of the United States to the foundation of the foundation.","He is engaged in the engraving of stamps for many countries, including the last examples of his work, the portraits of the Prime Minister, the new Canadian five and a hundred dollar stamps of the Dimeress.","He also worked on the printing of coins for many countries, including the Prime Minister's portraits, the new five-hundred-dollar Canadian stamps, and the Demers.","He's been doing this for many countries, the engraving of the coins at the end of his career, the prints"
4,բլոգները կարող են նաև օգտակարծ լինել ուսանողների գրավորները բարել ավելու հարցում։ Մինչ ուսանողները հաճախ իրենց բլոգի փորդարություն է սկսում անպույտ կերականցամբ և ուղագրությամբ ընտերցողների ներկայացունը գխավրապես պոխում է դա։,"Blogs can also be useful for students to improve their writing skills. Before students start using their blogs, it changes the way they present themselves.","Blogs can also be useful in improving students' writing, while students often start their blog experience with abstract grammar and spelling, and the presence of readers fundamentally changes that.","Blogs can also be useful for improving student writing, while students often start their blog experience with ambiguity and the presence of readers with spelling mainly changes that.","Blogs can also be useful for improving student writing. While students often start their blog with a blank page, a voter representative"


## Reproduced results summaries

### CoVoST2 evaluation summary

Let's first summarize the results of the CoVoST2 evaluation. We have scores for four models on this dataset, covering X→eng translation for 21 languages. These models are:

1) Whisper 
2) XLS-R
3) Seamless Medium
4) Seamless Large


#### Divide language in different categories


When evaluating translation performance, we need to divide our languages into high, mid and low resource categories depending on the amount of data available in each language. This division is provided by Babu et al., 2021 in their XLS-R [paper](https://arxiv.org/pdf/2111.09296.pdf).

In [8]:
res_levels=["low_res","mid_res","high_res"]

high_res=['ca','de','fr','es']
mid_res=['zh-CN','fa','it','ru','pt']
low_res=['mn','ta','lv','et','cy','sl','ja','tr','ar','nl','sv-SE','id']


In [9]:
import collections
def resource_level_results(scores,model_name):
    res_scores=collections.defaultdict(float)
    for level in res_levels:
        for lang in eval(level):
            res_scores[level]+=scores[lang]
        res_scores['all']+=res_scores[level]
        res_scores[level]/=len(eval(level))
    res_scores['all']/=21.0
    return {
      "Model":model_name,
      "High" : round(res_scores["high_res"],1),
      "Mid" : round(res_scores["mid_res"],1),
      "Low" : round(res_scores["low_res"],1),
      "All" : round(res_scores['all'],1)
    }
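
The same aggregation can be written without `eval` by passing the language groups in explicitly. This is a sketch rather than the notebook's implementation; `average_by_level` and the toy scores are hypothetical.

```python
def average_by_level(scores, groups):
    """Average per-language scores within each group and over all languages."""
    summary = {}
    total, count = 0.0, 0
    for level, langs in groups.items():
        values = [scores[lang] for lang in langs]
        summary[level] = round(sum(values) / len(values), 1)
        total += sum(values)
        count += len(values)
    # The overall figure is a mean over languages, not a mean of the group means.
    summary["all"] = round(total / count, 1)
    return summary


toy_groups = {"high_res": ["ca", "de"], "low_res": ["mn", "ta", "lv"]}
toy_scores = {"ca": 30.0, "de": 40.0, "mn": 10.0, "ta": 20.0, "lv": 30.0}
print(average_by_level(toy_scores, toy_groups))
```

Deriving the overall denominator from the groups avoids hard-coding the number of languages.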

In [10]:

re_results=[]
score_directory_covost='results/covost2/scores/'
#read all the json files
for files in os.listdir(score_directory_covost):
    if files.endswith('.json'):
        with open(score_directory_covost+files) as f:
            scores=json.load(f)
        model_name=files.split('.')[0]
        re_results.append(resource_level_results(scores,model_name))

re_results_df=pd.DataFrame(re_results)




In [11]:
re_results_df

Unnamed: 0,Model,High,Mid,Low,All
0,XLS-R (2B),36.1,27.7,15.1,22.1
1,Whisper Large-v2,36.3,33.6,24.9,29.2
2,Seamless Medium,37.3,33.6,28.3,31.3
3,Seamless large,39.3,36.2,31.9,34.3
4,Seamless medium,37.3,33.6,28.3,31.3


In [12]:
re_results_df.to_json('claims/re_covost_summary.json',orient='split')

### Fleurs evaluation summary

For Fleurs, the division of languages is not as clear as for CoVoST2. We will divide the languages into high, mid and low resource categories based on the resource-level information available in Table 38 of [SeamlessM4T](https://arxiv.org/pdf/2308.11596.pdf) (Barrault et al., 2023).

Apart from the information in the Seamless paper, we had to check manually in the Whisper paper whether each language was supported. This is because, when reporting the results, the authors stated that they provided results only for languages supported by all three models: Whisper, AudioPaLM and Seamless.

In [13]:
whisper_fleurs= ['afr', 'amh', 'asm', 'bel', 'bul', 'ben', 'bos', 'cat', 'cmn', 'ces', 'cym', 'dan', 'deu', 'ell', 
                  'spa', 'est', 'fin', 'fra', 'glg', 'guj', 'hau', 'heb', 'hin', 'hrv', 'hun', 'hye', 'ind', 'isl', 'ita', 
                  'jpn', 'jav', 'kat', 'kaz', 'khm', 'kan', 'kor', 'ltz', 'lin', 'lao', 'lit', 'mri', 'mkd', 'mal', 'mar', 
                  'mlt', 'mya', 'nob', 'nld', 'oci', 'pan', 'pol', 'por', 'ron', 'rus', 'snd', 'slk', 'slv', 'sna', 'som', 
                  'srp', 'swe', 'tam', 'tel', 'tgk', 'tha', 'tur', 'ukr', 'urd', 'vie', 'yor', 'yue', 'zlm', 'uzn', 'pes', 
                  'npi', 'lvs', 'arb', 'azj', 'pbt', 'khk', 'swh']


high_res = [
    "arb", "bel", "ben", "cat","ces", "cmn", "deu","fin", "fra", "ita", 
    "jpn", "nld", "pol", "ron", "spa" #15
]

mid_res = [
    "cym", "dan", "ell", "est", "hin", "hrv", "hun", "ind", "jav", "kaz", "kor",
    "pbt", "por", "rus", "slk", "swh", "tam", "tel", "tgl", "tha", "tur", "ukr", "urd", "uzn", "vie" #25
]

low_res = [
    "afr", "amh", "asm", "azj", "bos", "bul", "glg", "guj", "heb", "hye", 
    "isl", "kan", "kat", "khk", "khm", "lao",
    "lit", "lvs", "mal", "mar", "mkd", "mlt", "mya", "nob", "npi", "pan", 
    "pes", "slv", "som", "srp", "swe", "tgk", "yor", "zlm"       #34
]


On top of the usual High, Mid and Low resource categories, the authors created one more category of low-resource languages that excludes languages that were zero-shot in AudioPaLM, to eliminate confounding factors such as zero-shot evaluation for other models. We had to check manually which languages were zero-shot in AudioPaLM, as listed in Table 17 of [Rubenstein et al., 2023](https://arxiv.org/pdf/2306.12925.pdf).

In [14]:
zeroshot_audiopalm = [
    "yor", "uzn", "tgk", "som", "sna", "snd", "pbt", "oci", "nob", "mya", "mri", "lao",
    "lin", "ltz", "kan", "khm", "jav", "heb", "tgl", "amh", "bos"
]# Languages marked as 'zero-shot' in audioPalm

#remove the zeroshot languages from the low_res
low_res2= list(set(low_res)-set(zeroshot_audiopalm))
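
A `set` difference does not preserve order, so `low_res2` may come out in a different order from run to run. That does not affect the averages, but if a stable order is wanted, a list comprehension is one option. The language lists below are a toy subset for illustration only.

```python
low_res_toy = ["afr", "amh", "asm", "bos", "hye"]
zeroshot_toy = {"amh", "bos", "yor"}  # a set gives fast membership checks

# Keep the original order while dropping the zero-shot languages.
low_res2_toy = [lang for lang in low_res_toy if lang not in zeroshot_toy]
print(low_res2_toy)  # ['afr', 'asm', 'hye']
```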

In [15]:
res_levels=["high_res","mid_res","low_res","low_res2"]

Aggregation function to compile results under required categories

In [16]:
def resource_level_results_fleurs(scores,model_name):
    res_scores=collections.defaultdict(float)
    for level in res_levels:
        for lang in eval(level):
            try:
                res_scores[level]+=scores[lang]
            except KeyError:
                #skip languages with no score; they still count in the denominator below
                continue
        res_scores[level]/=len(eval(level))

    for lang in whisper_fleurs:
        res_scores['whisper_fleurs']+=scores[lang]
    res_scores['whisper_fleurs']/=81.0 
    
    return {
      "Model":model_name,
      "High (n=15)" : round(res_scores["high_res"],1),
      "Medium (n=25)" : round(res_scores["mid_res"],1),
      "Low (n=34)" : round(res_scores["low_res"],1),
      "Low† (n=23)" : round(res_scores['low_res2'],1),
      "FLEURS X→eng (n=81)" : round(res_scores['whisper_fleurs'],1)
      

    }
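
Because the `try`/`except` above silently skips languages that are absent from `scores` while the denominator still counts them, a quick check of which languages are missing can help when interpreting the averages. A minimal sketch; `missing_languages` is a hypothetical helper and the toy data is illustrative only.

```python
def missing_languages(scores, langs):
    """Return the languages in langs that have no entry in scores."""
    return [lang for lang in langs if lang not in scores]


toy_scores = {"cym": 12.3, "dan": 30.1}
toy_mid_res = ["cym", "dan", "tgl"]
print(missing_languages(toy_scores, toy_mid_res))  # ['tgl']
```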

Read the json files we fetched from our experiment to aggregate the results

In [17]:
score_directory_fleurs='results/fleurs/scores/'
re_results=[]
#read all the json files
for files in os.listdir(score_directory_fleurs):
    if files.endswith('.json'):
        with open(score_directory_fleurs+files) as f:
            scores=json.load(f)
        model_name=".".join(files.split('.')[:-1])
        re_results.append(resource_level_results_fleurs(scores,model_name))

re_results_df=pd.DataFrame(re_results)


Save this summary so it is handy for comparison against the different claims in later sections

In [18]:
re_results_df.to_json('claims/re_fleurs_summary.json',orient='split')
re_results_df

Unnamed: 0,Model,High (n=15),Medium (n=25),Low (n=34),Low† (n=23),FLEURS X→eng (n=81)
0,Whisper large-v2,23.7,17.5,14.8,17.0,16.7
1,Seamless large,26.6,24.5,24.9,26.5,23.4
2,Seamless medium,23.7,21.1,21.7,22.8,20.3
3,Whisper Large-v2 ASR + NLLB-1.3B,24.3,20.5,15.4,17.6,18.2


## Helper Function to merge results against original claims

We will now merge the original claims and the reproduced results into a single view using some helper functions.

This function adds brackets to the values in the original claims table before merging

In [22]:
def apply_brackets(claims_df, avoid_cols=["Model"]):
    df=claims_df.copy()
    for col in df.columns:
        if col not in avoid_cols:
            df[col]=df[col].apply(lambda x: f"({x})")
    return df

The naming convention for the models differs from claim to claim, so we need a mapping between the model names in the original claims and those in the reproduced results. The function below does this with fuzzy string matching.

In [57]:
!pip install fuzzywuzzy==0.18.0
!pip install python-Levenshtein

Collecting python-Levenshtein
  Downloading python_Levenshtein-0.25.1-py3-none-any.whl (9.4 kB)
Collecting Levenshtein==0.25.1
  Downloading Levenshtein-0.25.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (177 kB)
Collecting rapidfuzz<4.0.0,>=3.8.0
  Downloading rapidfuzz-3.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.4 MB)
Installing collected packages: rapidfuzz, Levenshtein, python-Levenshtein
Successfully installed Levenshtein-0.25.1 python-Levenshtein-0.25.1 rapidfuzz-3.9.0


In [59]:
from fuzzywuzzy import process

In [60]:
def fuzzy_combine_results(claims_df, re_results_df, value_cols=['High', 'Mid', 'Low', 'All'],cut_off=55):
    combined_results=claims_df.copy()

    # Track the claim models that have already been matched
    matched_models = set()
    
    index_lookup={v:k for k,v in claims_df['Model'].items()}

    # Each claim model is matched to at most one model from re_results_df
    for index, model in re_results_df['Model'].items():

        lookup_list=[claim_model for claim_model in claims_df['Model'] if claim_model not in matched_models]

        if not lookup_list:
            break

        # Extract best match that hasn't been used already
        best_match, score = process.extractOne(
            model,
            lookup_list
            
        )

        # If a good match is found and not already used
        if best_match and score >= cut_off:
            # Mark this model as matched to prevent further matches
            matched_models.add(best_match)

            # For each score column, prepend the reproduced value to the bracketed claim value
            for col in value_cols:
                final_value = re_results_df.loc[re_results_df['Model'] == model, col].iloc[0]
                combined_results.at[index_lookup[best_match], col] = f"{final_value} {combined_results.at[index_lookup[best_match], col]}"

    return combined_results
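
To see how cutoff-based name matching behaves without installing extra packages, the sketch below uses the standard-library `difflib` as a stand-in for `fuzzywuzzy`. Note that `difflib` ratios are on a 0 to 1 scale and are not the scores that `process.extractOne` returns, so the cutoff here is not comparable to `cut_off=55`. `best_match` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def best_match(name, candidates, cutoff=0.55):
    """Return the candidate most similar to name (case-insensitive), or None below cutoff."""
    best, best_score = None, 0.0
    for candidate in candidates:
        score = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= cutoff else None


claims = ["SEAMLESSM4T-MEDIUM", "SEAMLESSM4T-LARGE"]
print(best_match("Seamless large", claims))
print(best_match("XLS-R (2B)", ["Maestro"]))
```

Whichever library is used, inspecting the matched pairs by eye is worthwhile, since a permissive cutoff can pair unrelated model names.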
    
        

## Whisper Claim 1

Below is the first claim from the Whisper paper, in which the authors compare the performance of different models across resource-level categories of languages, based on the amount of data available for each language. These results are evaluated on the CoVoST 2 dataset.

In [61]:
#read the claim data
import pandas as pd
claims_df= pd.read_json('claims/whisper_claim_1.json',orient='split')

In [62]:
claims_df

Unnamed: 0,Model,High,Mid,Low,All
0,XMEF-X,34.2,20.2,5.9,14.7
1,XLS-R (2B),36.1,27.7,15.1,22.1
2,mSLAM-CTC (2B),37.8,29.6,18.5,24.8
3,Maestro,38.2,31.3,18.4,25.2
4,Zero-Shot Whisper,36.2,32.6,25.2,29.1


Let's load the summary we obtained earlier

In [63]:
re_results_df= pd.read_json('claims/re_covost_summary.json',orient='split')
re_results_df

Unnamed: 0,Model,High,Mid,Low,All
0,XLS-R (2B),36.1,27.7,15.1,22.1
1,Whisper Large-v2,36.3,33.6,24.9,29.2
2,Seamless Medium,37.3,33.6,28.3,31.3
3,Seamless large,39.3,36.2,31.9,34.3
4,Seamless medium,37.3,33.6,28.3,31.3


Map these in a single view to validate against original claim.

In [64]:
merged_df=fuzzy_combine_results(apply_brackets(claims_df), re_results_df)
merged_df

Unnamed: 0,Model,High,Mid,Low,All
0,XMEF-X,(34.2),(20.2),(5.9),(14.7)
1,XLS-R (2B),36.1 (36.1),27.7 (27.7),15.1 (15.1),22.1 (22.1)
2,mSLAM-CTC (2B),(37.8),(29.6),(18.5),(24.8)
3,Maestro,(38.2),(31.3),(18.4),(25.2)
4,Zero-Shot Whisper,36.3 (36.2),33.6 (32.6),24.9 (25.2),29.2 (29.1)


The values in parentheses are the original values from the claim. The reproduced results in the table above are consistent with the originals: XLS-R (2B) matches exactly, and Whisper is within 1.0 point in every category.
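
To put a number on "consistent", the merged cells can be parsed back into a reproduced and a claimed value and compared. A minimal sketch; `parse_cell` and `abs_gap` are hypothetical helpers, and the cell format `reproduced (claimed)` is the one produced by the merge above.

```python
import re

# Optional reproduced number, followed by a claimed number in parentheses.
CELL_PATTERN = re.compile(r"^\s*(?:(-?\d+(?:\.\d+)?)\s+)?\((-?\d+(?:\.\d+)?)\)\s*$")


def parse_cell(cell):
    """Split 'reproduced (claimed)' into floats; reproduced is None if absent."""
    match = CELL_PATTERN.match(cell)
    if match is None:
        return None, None  # e.g. '()' or '(x)', where there is no numeric claim
    reproduced, claimed = match.groups()
    return (float(reproduced) if reproduced is not None else None, float(claimed))


def abs_gap(cell):
    """Absolute difference between reproduced and claimed values, or None."""
    reproduced, claimed = parse_cell(cell)
    if reproduced is None or claimed is None:
        return None
    return round(abs(reproduced - claimed), 1)


print(abs_gap("33.6 (32.6)"))  # 1.0
print(abs_gap("(34.2)"))       # None
```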

In [65]:
merged_df.to_json('claims/re_whisper_claim_1.json',orient='split')

## Seamless Claim 1

In this claim from the Seamless paper, the authors compare the performance of different models on the FLEURS and CoVoST 2 datasets. Although the original claim also accounts for eng→X results, we consider only X→eng translations in our analysis.

In [66]:
#read the claim data
claims_df= pd.read_json('claims/seamless_claim_1.json',orient='split')

In [67]:
claims_df

Unnamed: 0,Model,size,FLEURS X→eng (n=81),FLEURS eng→X (n=88),CoVoST 2 X→eng (n=21),CoVoST 2 eng→X (n=15)
0,XLS-R-2B-S2T,2.6B,,x,22.1,27.8
1,WHISPER-LARGE-v2,1.5B,17.9,x,29.1,x
2,AUDIOPaLM-2-8B-AST,8.0B,19.7,x,37.8,x
3,SEAMLESSM4T-MEDIUM,1.2B,20.9,19.2,29.8,26.6
4,SEAMLESSM4T-LARGE,2.3B,24.0,21.5,34.1,30.6


#### CoVoST2 X→eng related entries

Load the CoVoST2 summary we generated earlier

In [68]:
re_results_df=pd.read_json('claims/re_covost_summary.json',orient='split')
re_results_df

Unnamed: 0,Model,High,Mid,Low,All
0,XLS-R (2B),36.1,27.7,15.1,22.1
1,Whisper Large-v2,36.3,33.6,24.9,29.2
2,Seamless Medium,37.3,33.6,28.3,31.3
3,Seamless large,39.3,36.2,31.9,34.3
4,Seamless medium,37.3,33.6,28.3,31.3


Filter for All languages column over the summary

In [69]:
#keep only Model and All columns and rename All column to CoVoST 2 X→eng (n=21)

re_results_df=re_results_df[['Model','All']]
re_results_df=re_results_df.rename(columns = {'All':'CoVoST 2 X→eng (n=21)'})

re_results_df

Unnamed: 0,Model,CoVoST 2 X→eng (n=21)
0,XLS-R (2B),22.1
1,Whisper Large-v2,29.2
2,Seamless Medium,31.3
3,Seamless large,34.3
4,Seamless medium,31.3


Map these values in a single view to compare against original claim values indicated in parenthesis

In [70]:
merged_df=fuzzy_combine_results(apply_brackets(claims_df,["Model","size"]), re_results_df,value_cols=['CoVoST 2 X→eng (n=21)'])

In [71]:
merged_df

Unnamed: 0,Model,size,FLEURS X→eng (n=81),FLEURS eng→X (n=88),CoVoST 2 X→eng (n=21),CoVoST 2 eng→X (n=15)
0,XLS-R-2B-S2T,2.6B,(),(x),22.1 (22.1),(27.8)
1,WHISPER-LARGE-v2,1.5B,(17.9),(x),29.2 (29.1),(x)
2,AUDIOPaLM-2-8B-AST,8.0B,(19.7),(x),(37.8),(x)
3,SEAMLESSM4T-MEDIUM,1.2B,(20.9),(19.2),31.3 (29.8),(26.6)
4,SEAMLESSM4T-LARGE,2.3B,(24.0),(21.5),34.3 (34.1),(30.6)


We have now finished populating the entries for the CoVoST 2 dataset. Let's now fill in the FLEURS-related entries.

#### Fleurs related entries

Load the FLEURS summary we generated earlier and filter the dataframe 

In [72]:
re_results_df= pd.read_json('claims/re_fleurs_summary.json',orient='split')
re_results_df=re_results_df[['Model','FLEURS X→eng (n=81)']]

re_results_df

Unnamed: 0,Model,FLEURS X→eng (n=81)
0,Whisper large-v2,16.7
1,Seamless large,23.4
2,Seamless medium,20.3
3,Whisper Large-v2 ASR + NLLB-1.3B,18.2


Map these values in a single view to compare against original claim values indicated in parenthesis

In [73]:
merged_df2=fuzzy_combine_results(merged_df, re_results_df,value_cols=['FLEURS X→eng (n=81)'])
merged_df2

Unnamed: 0,Model,size,FLEURS X→eng (n=81),FLEURS eng→X (n=88),CoVoST 2 X→eng (n=21),CoVoST 2 eng→X (n=15)
0,XLS-R-2B-S2T,2.6B,(),(x),22.1 (22.1),(27.8)
1,WHISPER-LARGE-v2,1.5B,16.7 (17.9),(x),29.2 (29.1),(x)
2,AUDIOPaLM-2-8B-AST,8.0B,(19.7),(x),(37.8),(x)
3,SEAMLESSM4T-MEDIUM,1.2B,20.3 (20.9),(19.2),31.3 (29.8),(26.6)
4,SEAMLESSM4T-LARGE,2.3B,23.4 (24.0),(21.5),34.3 (34.1),(30.6)


The view above shows that most reproduced results are consistent with the original values in the claim. We did notice some differences for the Whisper Large-v2 model (16.7 versus 17.9 on FLEURS X→eng); these might be explained by a different decoding strategy that was used but not stated in the paper. The largest gap in the table is for SEAMLESSM4T-MEDIUM on CoVoST 2 (31.3 versus 29.8).

Let's save the reproduced claim view

In [74]:
#merged_df2 holds both the CoVoST 2 and the FLEURS entries
merged_df2.to_json('claims/re_seamless_claim_1.json',orient='split')

## Seamless Claim 2

Load claim 2 from the Seamless paper. This claim compares the resource-level performance of models over languages in the FLEURS dataset. It is similar to the earlier Whisper claim we evaluated against CoVoST 2 data.

In [75]:
claim_df= pd.read_json('claims/seamless_claim_2.json',orient='split')
claim_df

Unnamed: 0,Model,High (n=15),Medium (n=25),Low (n=34),Low† (n=23)
0,WHISPER-LARGE-v2,24.2,19.4,16.1,18.1
1,AUDIOPALM-2-8B-AST,27.9,20.9,18.0,22.0
2,SEAMLESSM4T-MEDIUM,23.9,21.8,22.2,23.5
3,SEAMLESSM4T-LARGE,26.9,25.2,25.4,27.0


Load the FLEURS summary we generated earlier

In [76]:
re_results_df=pd.read_json('claims/re_fleurs_summary.json',orient='split')

Map these values in a single view to compare against original claim values indicated in parenthesis

In [77]:
merged_df=fuzzy_combine_results(apply_brackets(claim_df), re_results_df,value_cols=['High (n=15)', 'Medium (n=25)', 'Low (n=34)', 'Low† (n=23)'])
merged_df

Unnamed: 0,Model,High (n=15),Medium (n=25),Low (n=34),Low† (n=23)
0,WHISPER-LARGE-v2,23.7 (24.2),17.5 (19.4),14.8 (16.1),17.0 (18.1)
1,AUDIOPALM-2-8B-AST,(27.9),(20.9),(18.0),(22.0)
2,SEAMLESSM4T-MEDIUM,23.7 (23.9),21.1 (21.8),21.7 (22.2),22.8 (23.5)
3,SEAMLESSM4T-LARGE,26.6 (26.9),24.5 (25.2),24.9 (25.4),26.5 (27.0)


The view above shows that most reproduced results are consistent with the original values in the claim: the Seamless results are within 0.7 points of the claimed values. We did notice larger differences for the Whisper Large-v2 model (up to 1.9 points, in the Medium category); these might be explained by a different decoding strategy that was used but not stated in the paper.

Save the reproduced claim

In [78]:
merged_df.to_json('claims/re_seamless_claim_2.json',orient='split')

## Seamless Claim 3


This is one of the secondary claims in the Seamless paper; it compares the performance of cascaded model pipelines, consisting of transcription and translation stages, against direct models.

In [79]:
claim_df=pd.read_json('claims/seamless_claim_3.json',orient='split')
claim_df

Unnamed: 0,Model,type,size,X-eng (n=81),eng-X (n=88)
0,WHISPER-MEDIUM (ASR) + NLLB-1.3B,cascaded,2B,19.7,20.5
1,WHISPER-MEDIUM (ASR) + NLLB-3.3B,cascaded,4B,20.4,21.8
2,WHISPER-LARGE-v2 (ASR) + NLLB-1.3B,cascaded,2.8B,22.0,21.0
3,WHISPER-LARGE-v2 (ASR) + NLLB-3.3B,cascaded,4.8B,22.7,22.2
4,WHISPER-LARGE-v2,direct,1.5B,17.9,-
5,AudioPaLM-2-8B-AST,direct,8B,19.7,-
6,SEAMLESSM4T-MEDIUM,direct,1B,20.9,19.2
7,SEAMLESSM4T-LARGE,direct,2B,24.0,21.5


Let's again load the summary of results for the FLEURS dataset and keep the appropriate column

In [80]:
re_results_df=pd.read_json('claims/re_fleurs_summary.json',orient='split')
re_results_df=re_results_df[['Model','FLEURS X→eng (n=81)']]
#assign the result instead of renaming in place on a column subset
re_results_df=re_results_df.rename(columns = {'FLEURS X→eng (n=81)': 'X-eng (n=81)'})

re_results_df

Unnamed: 0,Model,X-eng (n=81)
0,Whisper large-v2,16.7
1,Seamless large,23.4
2,Seamless medium,20.3
3,Whisper Large-v2 ASR + NLLB-1.3B,18.2


Combine these values in a single view to compare against original claim values indicated in parenthesis

In [81]:
merged_df=fuzzy_combine_results(apply_brackets(claim_df,avoid_cols=["Model","type","size"]), re_results_df,value_cols=['X-eng (n=81)'])
merged_df

Unnamed: 0,Model,type,size,X-eng (n=81),eng-X (n=88)
0,WHISPER-MEDIUM (ASR) + NLLB-1.3B,cascaded,2B,(19.7),(20.5)
1,WHISPER-MEDIUM (ASR) + NLLB-3.3B,cascaded,4B,(20.4),(21.8)
2,WHISPER-LARGE-v2 (ASR) + NLLB-1.3B,cascaded,2.8B,18.2 (22.0),(21.0)
3,WHISPER-LARGE-v2 (ASR) + NLLB-3.3B,cascaded,4.8B,(22.7),(22.2)
4,WHISPER-LARGE-v2,direct,1.5B,16.7 (17.9),(-)
5,AudioPaLM-2-8B-AST,direct,8B,(19.7),(-)
6,SEAMLESSM4T-MEDIUM,direct,1B,20.3 (20.9),(19.2)
7,SEAMLESSM4T-LARGE,direct,2B,23.4 (24.0),(21.5)


Note: we limited our scope to the WHISPER-LARGE-v2 (ASR) + NLLB-1.3B cascaded pipeline to provide a comparative view. The other pipelines are not evaluated here, but they can be evaluated using the same inference code we provided in `fleurs_eval.ipynb`.

Let's save this merged result

In [82]:
merged_df.to_json('claims/re_seamless_claim_3.json',orient='split')

## Challenges we overcame 💪

1. The naming convention for the models differs from claim to claim, so we needed a mapping between the model names in the original claims and those in the reproduced results. We used fuzzy matching via the `fuzzywuzzy` library for this.

2. For one of the claims in the Seamless paper, we had to divide languages into high, medium and low resource-level categories based on Table 38 in the Seamless paper. On top of that, we had to restrict ourselves to languages that were supported by all models. This took some time to figure out.

3. The same claim mentions an additional category of low resource-level languages that excludes languages that were zero-shot in AudioPaLM. We had to check manually which languages were zero-shot in AudioPaLM, as listed in Table 17 of Rubenstein et al., 2023, by inspecting whether the training data hours were zero. This was a bit time-consuming, as the language names in Table 5 of the Seamless paper weren't consistent with the ones in the AudioPaLM paper.