# Results Analysis

## Transfer the results (or skip)

In [None]:
why server?

Let's set up the basic Chameleon configuration.


In [1]:
import chi, os, time
from chi import lease
from chi import server

PROJECT_NAME = os.getenv('OS_PROJECT_NAME') # change this if you need to
chi.use_site("CHI@UC")
chi.set("project_name", PROJECT_NAME)
username = os.getenv('USER') # all exp resources will have this prefix

Now using CHI@UC:
URL: https://chi.uc.chameleoncloud.org
Location: Argonne National Laboratory, Lemont, Illinois, USA
Support contact: help@chameleoncloud.org


Set `NODE_TYPE` to the node type of the resource server where you ran your experiment.

In [2]:
NODE_TYPE="gpu_rtx_6000"

Let's access the resources where our evaluation was conducted.

In [3]:
l = lease.get_lease(f"colab-{username}-{NODE_TYPE}-v2")
reservation_id = lease.get_node_reservation(l["id"])
server_id = server.get_server_id(f"colab-{username}-{NODE_TYPE}-v2")
server.wait_for_active(server_id)
reserved_fip = [d['addr'] for d in chi.server.show_server(server_id).addresses['sharednet1'] if d['OS-EXT-IPS:type']=='floating'][0]

### Transfer raw results from the resource server to the Jupyter server

In [4]:
from chi import ssh
node = ssh.Remote(reserved_fip)

In [5]:
# Define the remote directory path and the local path to download to
remote_directory = 'results/'
archive_name='results.tar.gz'


The SSH implementation provided by `python-chi`, a wrapper over Fabric, only supports transferring a single file at a time. To transfer a directory, we first archive the entire remote directory and then transfer the archive file.
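
The archive-then-extract round trip can be tried locally before running it against the remote node. The following is a minimal sketch using only the standard library; the helper names `archive_directory` and `extract_archive` and the toy file layout are assumptions made for illustration and are not part of `python-chi`.

```python
import os
import tarfile
import tempfile

def archive_directory(src_dir, archive_path):
    """Pack the contents of src_dir into a single gzip-compressed tar file."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in sorted(os.listdir(src_dir)):
            # arcname keeps paths inside the archive relative to src_dir
            tar.add(os.path.join(src_dir, name), arcname=name)

def extract_archive(archive_path, dest_dir):
    """Unpack the archive into dest_dir, creating the directory if needed."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        tar.extractall(path=dest_dir)

with tempfile.TemporaryDirectory() as workdir:
    # Build a toy results directory (hypothetical layout)
    src = os.path.join(workdir, "results")
    os.makedirs(os.path.join(src, "covost2"))
    with open(os.path.join(src, "covost2", "scores.json"), "w") as f:
        f.write("{}")

    archive = os.path.join(workdir, "results.tar.gz")
    archive_directory(src, archive)

    dest = os.path.join(workdir, "restored")
    extract_archive(archive, dest)
    restored_ok = os.path.exists(os.path.join(dest, "covost2", "scores.json"))
    print(restored_ok)
```

Because only one file crosses the connection, the same pattern works with any transfer tool that handles single files.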

In [6]:
node.run(f'tar -czf {archive_name} -C {remote_directory} .') #archive the results



<Result cmd='tar -czf results.tar.gz -C results/ .' exited=0>

In [7]:
node.get(archive_name) #download the archive

<fabric.transfer.Result at 0x7f88c95b72e0>

In [8]:
import tarfile
with tarfile.open(archive_name) as tar:
    tar.extractall(path=remote_directory) #extract the archive

## Reproduced results summaries

### CoVoST2 evaluation summary

Let's first summarize the results of the CoVoST2 evaluation. We have scores for four models on X→eng translation across 21 languages. These models are:

1) Whisper 
2) XLS-R
3) Seamless Medium
4) Seamless Large


#### Divide languages into different categories


To evaluate translation performance, we divide the languages into high-, mid- and low-resource categories according to how much data is available for each language. This division is given by Babu et al., 2021 in their XLS-R [paper](https://arxiv.org/pdf/2111.09296.pdf).

In [9]:
import os, json
import pandas as pd

In [10]:
res_levels=["low_res","mid_res","high_res"]

high_res=['ca','de','fr','es']
mid_res=['zh-CN','fa','it','ru','pt']
low_res=['mn','ta','lv','et','cy','sl','ja','tr','ar','nl','sv-SE','id']


In [11]:
import collections
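def resource_level_results(scores,model_name):
    """Average per-language scores within each resource level and over all 21 languages."""
    res_scores=collections.defaultdict(float)
    for level in res_levels:
        # eval(level) looks up the global list named by `level` (e.g. high_res)
        for lang in eval(level):
            res_scores[level]+=scores[lang]
        res_scores['all']+=res_scores[level]  # add the level's sum before it is averaged
        res_scores[level]/=len(eval(level))
    res_scores['all']/=21.0  # 4 high + 5 mid + 12 low languages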
    return {
      "Model":model_name,
      "High" : round(res_scores["high_res"],1),
      "Mid" : round(res_scores["mid_res"],1),
      "Low" : round(res_scores["low_res"],1),
      "All" : round(res_scores['all'],1)
    }
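
The function above finds each language list by calling `eval` on its variable name. As a hedged alternative, the sketch below does the same averaging with an explicit dictionary, so the grouping is passed in rather than read from globals. The name `summarize_by_level` and the toy scores are assumptions for illustration; the language lists are the ones defined above.

```python
RESOURCE_LEVELS = {
    "High": ["ca", "de", "fr", "es"],
    "Mid": ["zh-CN", "fa", "it", "ru", "pt"],
    "Low": ["mn", "ta", "lv", "et", "cy", "sl", "ja", "tr", "ar", "nl", "sv-SE", "id"],
}

def summarize_by_level(scores, model_name, levels=RESOURCE_LEVELS):
    """Mean score per resource level, plus the mean over every language."""
    summary = {"Model": model_name}
    all_langs = []
    for level, langs in levels.items():
        summary[level] = round(sum(scores[lang] for lang in langs) / len(langs), 1)
        all_langs.extend(langs)
    # "All" averages over languages, not over the three level means
    summary["All"] = round(sum(scores[lang] for lang in all_langs) / len(all_langs), 1)
    return summary

# Toy scores (not real results): every language scores 10.0 except Catalan
toy_scores = {lang: 10.0 for langs in RESOURCE_LEVELS.values() for lang in langs}
toy_scores["ca"] = 30.0

summary = summarize_by_level(toy_scores, "toy-model")
print(summary)
```

With these toy values the High mean is (30 + 10 + 10 + 10) / 4 = 15.0 and the overall mean is 230 / 21, which rounds to 11.0.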

In [12]:

re_results=[]
score_directory_covost='results/covost2/scores/'
#read all the json files
for files in os.listdir(score_directory_covost):
    if files.endswith('.json'):
        with open(score_directory_covost+files) as f:
            scores=json.load(f)
        model_name=files.split('.')[0]
        re_results.append(resource_level_results(scores,model_name))

re_results_df=pd.DataFrame(re_results)




In [13]:
re_results_df

Unnamed: 0,Model,High,Mid,Low,All
0,XLS-R (2B),36.0,27.8,15.4,22.3
1,Seamless Large,39.3,36.2,31.9,34.3
2,Whisper Large-v2,35.2,32.6,23.8,28.1
3,Seamless Medium,37.3,33.6,28.3,31.3


In [14]:
re_results_df.to_json('claims/re_covost_summary.json',orient='split')

### Fleurs evaluation summary

For Fleurs, the division of languages is not as clear as for CoVoST2. We divide the languages into high, mid and low resource categories based on the resource-level information available in Table 38 of [SeamlessM4T](https://arxiv.org/pdf/2308.11596.pdf) (Barrault et al., 2023).

Beyond the information in the Seamless paper, we had to check manually in the Whisper paper whether each language was supported. This is because, when citing the results, the authors stated that they report results only for languages supported by all three models: Whisper, AudioPaLM and Seamless.
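
Restricting the evaluation to languages supported by every model amounts to a set intersection. The sketch below shows the idea with placeholder language codes; the three lists are hypothetical and do not reflect the actual support lists of any model.

```python
def languages_supported_by_all(*supported):
    """Return the sorted languages that appear in every support list."""
    common = set(supported[0])
    for langs in supported[1:]:
        common &= set(langs)
    return sorted(common)

# Placeholder codes, for illustration only
whisper_langs = ["lang_a", "lang_b", "lang_c"]
audiopalm_langs = ["lang_b", "lang_c", "lang_d"]
seamless_langs = ["lang_a", "lang_b", "lang_c", "lang_d"]

common = languages_supported_by_all(whisper_langs, audiopalm_langs, seamless_langs)
print(common)
```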

In [15]:
whisper_fleurs= ['afr', 'amh', 'asm', 'bel', 'bul', 'ben', 'bos', 'cat', 'cmn', 'ces', 'cym', 'dan', 'deu', 'ell', 
                  'spa', 'est', 'fin', 'fra', 'glg', 'guj', 'hau', 'heb', 'hin', 'hrv', 'hun', 'hye', 'ind', 'isl', 'ita', 
                  'jpn', 'jav', 'kat', 'kaz', 'khm', 'kan', 'kor', 'ltz', 'lin', 'lao', 'lit', 'mri', 'mkd', 'mal', 'mar', 
                  'mlt', 'mya', 'nob', 'nld', 'oci', 'pan', 'pol', 'por', 'ron', 'rus', 'snd', 'slk', 'slv', 'sna', 'som', 
                  'srp', 'swe', 'tam', 'tel', 'tgk', 'tha', 'tur', 'ukr', 'urd', 'vie', 'yor', 'yue', 'zlm', 'uzn', 'pes', 
                  'npi', 'lvs', 'arb', 'azj', 'pbt', 'khk', 'swh']


high_res = [
    "arb", "bel", "ben", "cat","ces", "cmn", "deu","fin", "fra", "ita", 
    "jpn", "nld", "pol", "ron", "spa" #15
]

mid_res = [
    "cym", "dan", "ell", "est", "hin", "hrv", "hun", "ind", "jav", "kaz", "kor",
    "pbt", "por", "rus", "slk", "swh", "tam", "tel", "tgl", "tha", "tur", "ukr", "urd", "uzn", "vie" #25
]

low_res = [
    "afr", "amh", "asm", "azj", "bos", "bul", "glg", "guj", "heb", "hye", 
    "isl", "kan", "kat", "khk", "khm", "lao",
    "lit", "lvs", "mal", "mar", "mkd", "mlt", "mya", "nob", "npi", "pan", 
    "pes", "slv", "som", "srp", "swe", "tgk", "yor", "zlm"       #34
]


On top of the usual high, mid and low resource categories, the authors created one more low-resource category that excludes languages evaluated zero-shot in AudioPaLM, to eliminate confounding factors such as zero-shot evaluation for other models. We had to manually identify the languages that were zero-shot in AudioPaLM, as listed in Table 17 of [Rubenstein et al., 2023](https://arxiv.org/pdf/2306.12925.pdf).

In [16]:
zeroshot_audiopalm = [
    "yor", "uzn", "tgk", "som", "sna", "snd", "pbt", "oci", "nob", "mya", "mri", "lao",
    "lin", "ltz", "kan", "khm", "jav", "heb", "tgl", "amh", "bos"
]  # languages marked as zero-shot in AudioPaLM

# remove the zero-shot languages from low_res
low_res2= list(set(low_res)-set(zeroshot_audiopalm))
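
The set difference above returns the languages in arbitrary order, which does not affect the averages but makes the list harder to inspect. A possible variant, sketched below with the same two lists, keeps the original order and checks that the group size agrees with the `n=23` used in the column label. The name `low_res_no_zeroshot` is introduced here for illustration.

```python
low_res = [
    "afr", "amh", "asm", "azj", "bos", "bul", "glg", "guj", "heb", "hye",
    "isl", "kan", "kat", "khk", "khm", "lao",
    "lit", "lvs", "mal", "mar", "mkd", "mlt", "mya", "nob", "npi", "pan",
    "pes", "slv", "som", "srp", "swe", "tgk", "yor", "zlm",
]
zeroshot_audiopalm = [
    "yor", "uzn", "tgk", "som", "sna", "snd", "pbt", "oci", "nob", "mya", "mri", "lao",
    "lin", "ltz", "kan", "khm", "jav", "heb", "tgl", "amh", "bos",
]

# A set makes the membership test fast; the comprehension keeps low_res order
zeroshot = set(zeroshot_audiopalm)
low_res_no_zeroshot = [lang for lang in low_res if lang not in zeroshot]

print(len(low_res), len(low_res_no_zeroshot))
```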

In [17]:
res_levels=["high_res","mid_res","low_res","low_res2"]

In [21]:
def resource_level_results_fleurs(scores,model_name):
    res_scores=collections.defaultdict(float)
    for level in res_levels:
        for lang in eval(level):
            try:
                res_scores[level]+=scores[lang]
            except KeyError:
                # no score for this language; skip it (the divisor below still uses the full list length)
                continue
        res_scores[level]/=len(eval(level))

    for lang in whisper_fleurs:
        res_scores['whisper_fleurs']+=scores[lang]
    res_scores['whisper_fleurs']/=len(whisper_fleurs)  # 81 languages
    
    return {
      "Model":model_name,
      "High (n=15)" : round(res_scores["high_res"],1),
      "Medium (n=25)" : round(res_scores["mid_res"],1),
      "Low (n=34)" : round(res_scores["low_res"],1),
      "Low† (n=23)" : round(res_scores['low_res2'],1),
      "FLEURS X→eng (n=81)" : round(res_scores['whisper_fleurs'],1)
      

    }
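
The function above skips languages that have no score but still divides by the full length of the list, so a missing language pulls the average down without any notice. The sketch below is one way to make that visible: it averages over the available languages only and reports the ones that were missing. The helper name `mean_over_available` and the toy scores are assumptions, and which divisor matches the paper's computation is not settled here.

```python
def mean_over_available(scores, langs):
    """Return (mean over languages that have a score, list of missing languages)."""
    present = [lang for lang in langs if lang in scores]
    missing = [lang for lang in langs if lang not in scores]
    mean = sum(scores[lang] for lang in present) / len(present) if present else float("nan")
    return mean, missing

# Toy data: the third language has no score
toy_scores = {"lang_a": 20.0, "lang_b": 30.0}
toy_group = ["lang_a", "lang_b", "lang_c"]

mean, missing = mean_over_available(toy_scores, toy_group)
print(mean, missing)
```

For this toy group, dividing by the full list length would give 50 / 3 ≈ 16.7, while averaging over the two available languages gives 25.0.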

In [22]:
score_directory_fleurs='results/fleurs/scores/'
re_results=[]
#read all the json files
for files in os.listdir(score_directory_fleurs):
    if files.endswith('.json'):
        with open(score_directory_fleurs+files) as f:
            scores=json.load(f)
        model_name=files.split('.')[0]
        re_results.append(resource_level_results_fleurs(scores,model_name))

re_results_df=pd.DataFrame(re_results)


In [23]:
re_results_df.to_json('claims/re_fleurs_summary.json',orient='split')
re_results_df

Unnamed: 0,Model,High (n=15),Medium (n=25),Low (n=34),Low† (n=23),FLEURS X→eng (n=81)
0,Whisper large-v2,23.6,17.3,14.9,17.0,16.7
1,Seamless large,26.6,24.5,24.9,26.5,23.4
2,Seamless medium,23.7,21.1,21.7,22.8,20.3


## Whisper Claim 1

First, we load the scores reported for this claim from `claims/whisper_claim_1.json`.

In [24]:
#read the claim data
import pandas as pd
claims_df= pd.read_json('claims/whisper_claim_1.json',orient='split')

In [25]:
claims_df

Unnamed: 0,Model,High,Mid,Low,All
0,XMEF-X,34.2,20.2,5.9,14.7
1,XLS-R (2B),36.1,27.7,15.1,22.1
2,mSLAM-CTC (2B),37.8,29.6,18.5,24.8
3,Maestro,38.2,31.3,18.4,25.2
4,Zero-Shot Whisper,36.2,32.6,25.2,29.1


Next, we load the summary of our reproduced CoVoST2 results that we saved earlier.

In [26]:
re_results_df= pd.read_json('claims/re_covost_summary.json',orient='split')
re_results_df

Unnamed: 0,Model,High,Mid,Low,All
0,XLS-R (2B),36.0,27.8,15.4,22.3
1,Seamless Large,39.3,36.2,31.9,34.3
2,Whisper Large-v2,35.2,32.6,23.8,28.1
3,Seamless Medium,37.3,33.6,28.3,31.3


We will now merge the actual claims and the reproduced results into a single view using some helper functions.

This function wraps the values of the claims table in parentheses, so that claimed values can be told apart from reproduced ones after merging.

In [27]:
def apply_brackets(claims_df, avoid_cols=["Model"]):
    df=claims_df.copy()
    for col in df.columns:
        if col not in avoid_cols:
            df[col]=df[col].apply(lambda x: f"({x})")
    return df

The model names differ between the claims and our reproduced results (for example, `Zero-Shot Whisper` in the claim versus `Whisper Large-v2` in our results), so we need a mapping between the two sets of names. The function below builds that mapping by fuzzy matching the model names.
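
As a dependency-free illustration of the idea, the sketch below matches names with the standard-library `difflib` instead of `fuzzywuzzy`. Names are normalized first (lowercased, punctuation collapsed to spaces) so that formatting differences do not dominate the score. The helpers `normalize` and `best_match` and the cutoff of 0.8 are assumptions for this example; `difflib` scores lie between 0 and 1, unlike the 0 to 100 scale of `fuzzywuzzy`, so the cutoffs are not interchangeable.

```python
import difflib
import re

def normalize(name):
    """Lowercase and replace runs of non-alphanumeric characters with a space."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

def best_match(claim_name, candidates, cutoff=0.8):
    """Return the candidate most similar to claim_name, or None below the cutoff."""
    lookup = {normalize(candidate): candidate for candidate in candidates}
    hits = difflib.get_close_matches(normalize(claim_name), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None

reproduced = ["XLS-R (2B)", "Seamless Large", "Whisper Large-v2", "Seamless Medium"]

print(best_match("SEAMLESSM4T-LARGE", reproduced))
print(best_match("Maestro", reproduced))
```

A claim with no sufficiently similar counterpart, such as `Maestro` here, returns `None` and is left unmerged.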

In [28]:
!pip install fuzzywuzzy==0.18.0

Collecting fuzzywuzzy==0.18.0
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0


In [29]:
from fuzzywuzzy import process



In [30]:
def fuzzy_combine_results(claims_df, re_results_df, value_cols=['High', 'Mid', 'Low', 'All'],cut_off=55):
    combined_results=claims_df.copy()

    # Keep track of models from re_results_df that have already been matched
    matched_models = set()

    # Ensure that each model from re_results_df is used at most once
    for index, claim_model in claims_df['Model'].items():

        lookup_list=[model for model in re_results_df['Model'] if model not in matched_models]

        if not lookup_list:
            break

        # Extract best match that hasn't been used already
        best_match, score = process.extractOne(
            claim_model,
            lookup_list
            
        )

        # If a good match is found and not already used
        if best_match and score >= cut_off:
            # Mark this model as matched to prevent further matches
            matched_models.add(best_match)

            # For each score column, prepend the reproduced value to the claimed value
            for col in value_cols:
                final_value = re_results_df.loc[re_results_df['Model'] == best_match, col].iloc[0]
                combined_results.at[index, col] = f"{final_value} {combined_results.at[index, col]}"

    return combined_results
    
        

In [31]:
merged_df=fuzzy_combine_results(apply_brackets(claims_df), re_results_df)
merged_df

Unnamed: 0,Model,High,Mid,Low,All
0,XMEF-X,(34.2),(20.2),(5.9),(14.7)
1,XLS-R (2B),36.0 (36.1),27.8 (27.7),15.4 (15.1),22.3 (22.1)
2,mSLAM-CTC (2B),(37.8),(29.6),(18.5),(24.8)
3,Maestro,(38.2),(31.3),(18.4),(25.2)
4,Zero-Shot Whisper,35.2 (36.2),32.6 (32.6),23.8 (25.2),28.1 (29.1)


In [32]:
merged_df.to_json('claims/re_whisper_claim_1.json',orient='split')

## Seamless Claim 1

In [33]:
#read the claim data
claims_df= pd.read_json('claims/seamless_claim_1.json',orient='split')

In [34]:
claims_df

Unnamed: 0,Model,size,FLEURS X→eng (n=81),FLEURS eng→X (n=88),CoVoST 2 X→eng (n=21),CoVoST 2 eng→X (n=15)
0,XLS-R-2B-S2T,2.6B,,x,22.1,27.8
1,WHISPER-LARGE-v2,1.5B,17.9,x,29.1,x
2,AUDIOPaLM-2-8B-AST,8.0B,19.7,x,37.8,x
3,SEAMLESSM4T-MEDIUM,1.2B,20.9,19.2,29.8,26.6
4,SEAMLESSM4T-LARGE,2.3B,24.0,21.5,34.1,30.6


#### CoVoST2 X→eng related entries

In [35]:
re_results_df=pd.read_json('claims/re_covost_summary.json',orient='split')
re_results_df

Unnamed: 0,Model,High,Mid,Low,All
0,XLS-R (2B),36.0,27.8,15.4,22.3
1,Seamless Large,39.3,36.2,31.9,34.3
2,Whisper Large-v2,35.2,32.6,23.8,28.1
3,Seamless Medium,37.3,33.6,28.3,31.3


In [36]:
# Keep only the Model and All columns, and rename All to 'CoVoST 2 X→eng (n=21)'.
# Calling rename() on the selection returns a new DataFrame, which avoids
# pandas' SettingWithCopyWarning.
re_results_df=re_results_df[['Model','All']].rename(columns={'All':'CoVoST 2 X→eng (n=21)'})

re_results_df


Unnamed: 0,Model,CoVoST 2 X→eng (n=21)
0,XLS-R (2B),22.3
1,Seamless Large,34.3
2,Whisper Large-v2,28.1
3,Seamless Medium,31.3


In [37]:
merged_df=fuzzy_combine_results(apply_brackets(claims_df,["Model","size"]), re_results_df,value_cols=['CoVoST 2 X→eng (n=21)'])

In [38]:
merged_df

Unnamed: 0,Model,size,FLEURS X→eng (n=81),FLEURS eng→X (n=88),CoVoST 2 X→eng (n=21),CoVoST 2 eng→X (n=15)
0,XLS-R-2B-S2T,2.6B,(),(x),22.3 (22.1),(27.8)
1,WHISPER-LARGE-v2,1.5B,(17.9),(x),28.1 (29.1),(x)
2,AUDIOPaLM-2-8B-AST,8.0B,(19.7),(x),(37.8),(x)
3,SEAMLESSM4T-MEDIUM,1.2B,(20.9),(19.2),31.3 (29.8),(26.6)
4,SEAMLESSM4T-LARGE,2.3B,(24.0),(21.5),34.3 (34.1),(30.6)


#### Fleurs related entries

In [39]:
re_results_df= pd.read_json('claims/re_fleurs_summary.json',orient='split')
re_results_df=re_results_df[['Model','FLEURS X→eng (n=81)']]

re_results_df

Unnamed: 0,Model,FLEURS X→eng (n=81)
0,Whisper large-v2,16.7
1,Seamless large,23.4
2,Seamless medium,20.3


In [40]:
merged_df2=fuzzy_combine_results(merged_df, re_results_df,value_cols=['FLEURS X→eng (n=81)'])
merged_df2

Unnamed: 0,Model,size,FLEURS X→eng (n=81),FLEURS eng→X (n=88),CoVoST 2 X→eng (n=21),CoVoST 2 eng→X (n=15)
0,XLS-R-2B-S2T,2.6B,(),(x),22.3 (22.1),(27.8)
1,WHISPER-LARGE-v2,1.5B,16.7 (17.9),(x),28.1 (29.1),(x)
2,AUDIOPaLM-2-8B-AST,8.0B,(19.7),(x),(37.8),(x)
3,SEAMLESSM4T-MEDIUM,1.2B,20.3 (20.9),(19.2),31.3 (29.8),(26.6)
4,SEAMLESSM4T-LARGE,2.3B,23.4 (24.0),(21.5),34.3 (34.1),(30.6)


## Seamless Claim 2

In [41]:
claim_df= pd.read_json('claims/seamless_claim_2.json',orient='split')
claim_df

Unnamed: 0,Model,High (n=15),Medium (n=25),Low (n=34),Low† (n=23)
0,WHISPER-LARGE-v2,24.2,19.4,16.1,18.1
1,AUDIOPALM-2-8B-AST,27.9,20.9,18.0,22.0
2,SEAMLESSM4T-MEDIUM,23.9,21.8,22.2,23.5
3,SEAMLESSM4T-LARGE,26.9,25.2,25.4,27.0


In [42]:
re_results_df=pd.read_json('claims/re_fleurs_summary.json',orient='split')

In [43]:
merged_df=fuzzy_combine_results(apply_brackets(claim_df), re_results_df,value_cols=['High (n=15)', 'Medium (n=25)', 'Low (n=34)', 'Low† (n=23)'])
merged_df

Unnamed: 0,Model,High (n=15),Medium (n=25),Low (n=34),Low† (n=23)
0,WHISPER-LARGE-v2,23.6 (24.2),17.3 (19.4),14.9 (16.1),17.0 (18.1)
1,AUDIOPALM-2-8B-AST,(27.9),(20.9),(18.0),(22.0)
2,SEAMLESSM4T-MEDIUM,23.7 (23.9),21.1 (21.8),21.7 (22.2),22.8 (23.5)
3,SEAMLESSM4T-LARGE,26.6 (26.9),24.5 (25.2),24.9 (25.4),26.5 (27.0)


## Seamless Claim 3


In [44]:
claim_df=pd.read_json('claims/seamless_claim_3.json',orient='split')
claim_df

Unnamed: 0,Model,type,size,X-eng (n=81),eng-X (n=88)
0,WHISPER-MEDIUM (ASR) + NLLB-1.3B,cascaded,2B,19.7,20.5
1,WHISPER-MEDIUM (ASR) + NLLB-3.3B,cascaded,4B,20.4,21.8
2,WHISPER-LARGE-v2 (ASR) + NLLB-1.3B,cascaded,2.8B,22.0,21.0
3,WHISPER-LARGE-v2 (ASR) + NLLB-3.3B,cascaded,4.8B,22.7,22.2
4,WHISPER-LARGE-v2,direct,1.5B,17.9,-
5,AudioPaLM-2-8B-AST,direct,8B,19.7,-
6,SEAMLESSM4T-MEDIUM,direct,1B,20.9,19.2
7,SEAMLESSM4T-LARGE,direct,2B,24.0,21.5


In [45]:
re_results_df=pd.read_json('claims/re_fleurs_summary.json',orient='split')
# rename() on the selection returns a new DataFrame instead of modifying a slice in place
re_results_df=re_results_df[['Model','FLEURS X→eng (n=81)']].rename(columns={'FLEURS X→eng (n=81)': 'X-eng (n=81)'})

re_results_df

Unnamed: 0,Model,X-eng (n=81)
0,Whisper large-v2,16.7
1,Seamless large,23.4
2,Seamless medium,20.3


In [52]:
merged_df=fuzzy_combine_results(apply_brackets(claim_df,avoid_cols=["Model","type","size"]), re_results_df,value_cols=['X-eng (n=81)'],cut_off=90)
merged_df

Unnamed: 0,Model,type,size,X-eng (n=81),eng-X (n=88)
0,WHISPER-MEDIUM (ASR) + NLLB-1.3B,cascaded,2B,(19.7),(20.5)
1,WHISPER-MEDIUM (ASR) + NLLB-3.3B,cascaded,4B,(20.4),(21.8)
2,WHISPER-LARGE-v2 (ASR) + NLLB-1.3B,cascaded,2.8B,16.7 (22.0),(21.0)
3,WHISPER-LARGE-v2 (ASR) + NLLB-3.3B,cascaded,4.8B,(22.7),(22.2)
4,WHISPER-LARGE-v2,direct,1.5B,(17.9),(-)
5,AudioPaLM-2-8B-AST,direct,8B,(19.7),(-)
6,SEAMLESSM4T-MEDIUM,direct,1B,20.3 (20.9),(19.2)
7,SEAMLESSM4T-LARGE,direct,2B,23.4 (24.0),(21.5)
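
In the merged table above, the reproduced Whisper large-v2 score (16.7) lands on the cascaded `WHISPER-LARGE-v2 (ASR) + NLLB-1.3B` row rather than on the direct `WHISPER-LARGE-v2` row. This appears to follow from matching claims in row order, where the first claim that clears the cutoff consumes the reproduced model. One possible remedy, sketched below under stated assumptions, is to score every claim and reproduced pair first and then assign pairs from the highest score down. It uses the standard-library `difflib` rather than `fuzzywuzzy`; the helper names and the 0.8 cutoff are hypothetical.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase and replace runs of non-alphanumeric characters with a space."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

def one_to_one_matches(claim_names, reproduced_names, cutoff=0.8):
    """Pair names by descending similarity; each name is used at most once."""
    scored = []
    for claim in claim_names:
        for repro in reproduced_names:
            ratio = SequenceMatcher(None, normalize(claim), normalize(repro)).ratio()
            scored.append((ratio, claim, repro))
    scored.sort(key=lambda item: item[0], reverse=True)

    matches, used_claims, used_repro = {}, set(), set()
    for ratio, claim, repro in scored:
        if ratio < cutoff:
            break  # remaining pairs are even less similar
        if claim not in used_claims and repro not in used_repro:
            matches[claim] = repro
            used_claims.add(claim)
            used_repro.add(repro)
    return matches

claims = [
    "WHISPER-MEDIUM (ASR) + NLLB-1.3B", "WHISPER-MEDIUM (ASR) + NLLB-3.3B",
    "WHISPER-LARGE-v2 (ASR) + NLLB-1.3B", "WHISPER-LARGE-v2 (ASR) + NLLB-3.3B",
    "WHISPER-LARGE-v2", "AudioPaLM-2-8B-AST", "SEAMLESSM4T-MEDIUM", "SEAMLESSM4T-LARGE",
]
reproduced = ["Whisper large-v2", "Seamless large", "Seamless medium"]

matches = one_to_one_matches(claims, reproduced)
for claim, repro in matches.items():
    print(f"{claim} -> {repro}")
```

Because the exact match between `WHISPER-LARGE-v2` and `Whisper large-v2` is assigned before any partial match is considered, the cascaded rows stay unmatched in this sketch.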


## Challenges we overcame 💪

1. The naming convention for the models differs from claim to claim, so we needed a mapping between the model names in the actual claims and the model names in the reproduced results. We built this mapping by fuzzy matching the two sets of names with the `fuzzywuzzy` library.

2. For one of the claims in the Seamless paper, we had to divide languages into high, medium and low resource-level categories based on Table 38 in the Seamless paper. On top of that, we had to restrict ourselves to the languages supported by all models. This took some time to figure out.

3. The same claim mentions an additional category for low resource-level languages that excludes languages that were zero-shot in AudioPaLM. We had to manually identify those languages from Table 17 in Rubenstein et al., 2023 by checking whether their training data hours were zero. This was somewhat time consuming, as the language names in Table 5 of the Seamless paper were not consistent with the ones in the AudioPaLM paper.