<a href="https://colab.research.google.com/github/samochristian2020/data_mitx/blob/main/data354_model_ASR_wolof.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Building an Automatic Speech Recognition (__ASR__) model for __Wolof__ transcription for the company __data354.com__.

### Work Plan
> **1**.  Choose an ASR model pre-trained on multiple languages and fine-tune it for Wolof ASR using "transfer learning" and "cross-lingual training".
>  
> **2**.  Preprocess the audio clips so that they are compatible with the chosen model.
> 
> **3**.  Load the dataset into our environment, then listen to and inspect a few samples.
>
> **4**.  Carry out a few more data-cleaning steps.
>
> **5**.  Build the vocabulary for the CTC "head" that the model uses to decode its output into a text sequence.
>
> **6**.  Create a "DataCollator" object that pads the inputs of each batch sent to the model to a uniform length.
>
> **7**.  Build an evaluation method based on the word error rate (WER) metric. It is used only on the train split of our dataset, because the test split contains no transcriptions.
>
> **8**. Build the training pipeline.
>
> **9**. Train (fine-tune) the model.
>
> **10**. Evaluate the model.

# Step 1: 
Choose an ASR model pre-trained on multiple languages and fine-tune it for Wolof ASR using "transfer learning" and "cross-lingual training".


Given the task at hand, namely building an ASR model for Wolof transcription, we had two options: 
 - Train a model from scratch using one of the available Natural Language Processing (NLP) methods ("__include a few NLP methods for ASR__"). This is a fairly complex task: we would have to choose the model best able to capture the specifics of the Wolof language, gather enough Wolof data in both quantity and diversity, and have enough time (counted in weeks) and processing power (latest-generation GPUs) to train the model, which implies a substantial financial investment.

 - The more accessible alternative is to take one of the latest-generation models suited to the task and fine-tune it on our small dataset for Wolof ASR. This relies on the transfer-learning and cross-lingual learning techniques that the chosen model supports.

- We chose the Wav2Vec2.0-XLSR model, designed and built by Facebook's artificial intelligence research lab in 2020. For more details, see the official Facebook AI blog [here](https://colab.research.google.com/drive/1HszEMVVkMcw_auiGuW4Ijap4hXhiyZEZ#scrollTo=1cDQPxo0rG26&line=10&uniqifier=1).

- Here is a short overview of the chosen model, as described by its designers:
 
_Wav2Vec2 is a pretrained model for Automatic Speech Recognition (ASR) and was released in [September 2020](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) by Alexei Baevski, Michael Auli, and Alex Conneau.  Soon after the superior performance of Wav2Vec2 was demonstrated on the English ASR dataset LibriSpeech, *Facebook AI* presented XLSR-Wav2Vec2 (click [here](https://arxiv.org/abs/2006.13979)). XLSR stands for *cross-lingual  speech representations* and refers to XLSR-Wav2Vec2's ability to learn speech representations that are useful across multiple languages._

*Similar to Wav2Vec2, XLSR-Wav2Vec2 learns powerful speech representations from hundreds of thousands of hours of speech in more than 50 languages of unlabeled speech.* 

![wav2vec2_structure](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/xlsr_wav2vec2.png)

*The authors show for the first time that massively pretraining an ASR model on cross-lingual unlabeled speech data, followed by language-specific fine-tuning on very little labeled data achieves state-of-the-art results. See Table 1-5 of the official [paper](https://arxiv.org/pdf/2006.13979.pdf).*

[Description of the chosen model](https://huggingface.co/facebook/wav2vec2-large-xlsr-53)

# Step 2: 
Preprocess the audio clips so that they are compatible with the chosen model.

The model only accepts raw waveform input (here stored as `.wav` files) with a sample rate of 16 kHz, as stated on the model card.

Our dataset, however, contains only `.mp3` files (as stated in the challenge description).  
- The cell below displays an arbitrary audio file from our dataset (note the extension); the clips are sampled at 48,000 Hz.

In [None]:
import os
import random
import soundfile as sf

audio_files = os.listdir("/content/drive/MyDrive/data354/clips/clips")

# pick one clip so that the sample rate and the file name refer to the same file
random_clip = random.choice(audio_files)
print("sample rate of a random audio signal before conversion is : ",sf.read("/content/drive/MyDrive/data354/clips/clips/"+random_clip)[1])
print("file type of a random clip is : ", "\n", random_clip)


In [None]:
print("total number of audio clips is : ",len(audio_files))

In the following cells, we will:
 - Convert the `.mp3` files to `.wav` files with a sample rate of 16 kHz 


In [None]:
%%capture
# library to process audio files (the %%capture magic must be the first line of the cell)
!pip install pydub

In [None]:
# other utility libraries 
from tqdm.notebook import tqdm
from pydub import AudioSegment 
from joblib import Parallel, delayed


In [None]:
#ROOT_PATH is the path to my (google drive) folder containing original data
ROOT_PATH = "/content/drive/MyDrive/data354/clips/clips"

#creating a temporary folder to store our wav_data
!mkdir -p "/tmp/data354_WOLOF_ASR_dataset/audio_wav_16000"

OUTPUT_DIR = "/tmp/data354_WOLOF_ASR_dataset/audio_wav_16000"
            

In [None]:
#conversion function
# def save_fn(filename):
    
#     path = f"{ROOT_PATH}/{filename}"
#     save_path = f"{OUTPUT_DIR}"
#     if not os.path.exists(save_path):
#         os.makedirs(save_path, exist_ok=True)
    
#     if os.path.exists(path):
#         try:
#             sound = AudioSegment.from_mp3(path)
#             sound = sound.set_frame_rate(16000)
#             sound.export(f"{save_path}/{filename[:-4]}.wav", format="wav")
#         except:
#             pass


In [None]:
#parallelizing the task
#this takes approximately 20-35min
# %%capture
# Parallel(n_jobs=8, backend="multiprocessing")(delayed(save_fn)(filename) for filename in tqdm(audio_files))

In [None]:

#verifying that conversion succeeded  (format and sample rate)
audio_files_wav = os.listdir(OUTPUT_DIR)

# pick one clip so that the sample rate and the file name refer to the same file
random_wav = random.choice(audio_files_wav)
print("the sample rate of a random audio signal after the conversion to wave file format is : ",sf.read(OUTPUT_DIR+"/"+random_wav)[1])
print("file type of a random clip is :", "\n",random_wav)
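
The sample rate can also be read straight from the WAV header with Python's standard `wave` module, without decoding the audio. The sketch below is only an illustration: it writes one second of silence to a temporary file (the name `demo_16k.wav` is a made-up placeholder, not a file from the dataset) and reads its properties back. The helper `wav_properties` is hypothetical and not part of the notebook.

```python
import os
import tempfile
import wave


def wav_properties(path):
    """Return (sample_rate, n_channels, duration_seconds) read from a WAV header."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        n_channels = wav_file.getnchannels()
        duration = wav_file.getnframes() / sample_rate
    return sample_rate, n_channels, duration


# Hypothetical demo clip: one second of 16-bit mono silence at 16 kHz.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_16k.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    wav_file.setframerate(16000)
    wav_file.writeframes(b"\x00\x00" * 16000)

print(wav_properties(demo_path))
```

Applied to every file in `OUTPUT_DIR`, a helper like this would flag any clip whose sample rate is not 16000 before the data reaches the model.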

# Step 3
Load the dataset into our environment, then listen to and inspect a few samples.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

#train dataset
df_train = pd.read_csv("/content/drive/MyDrive/data354/train.csv")

#test dataset
df_test = pd.read_csv("/content/drive/MyDrive/data354/test.csv")

df_train, df_eval = train_test_split(df_train, test_size=0.01, random_state=42)



In [None]:
display(df_train)


In [None]:

display(df_eval)
print("eval dataframe shape is :", df_eval.shape)



In [None]:
# first let's listen to some random files 
# and also pull out their written transcription when they belong to one of the splits

## !! some files are in none of the splits, so there will be no written transcription for those !!

import IPython.display as ipd


rand_file = random.choice(audio_files_wav)

speech_array, sampling_rate = sf.read(OUTPUT_DIR+"/"+rand_file)

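# str.strip(".wav") removes a set of characters from both ends, not a suffix
rand_file_stripped = os.path.splitext(rand_file)[0]

# `value in series` tests the index, so compare against the column values instead
if rand_file_stripped in df_train["ID"].values:
  print(df_train[df_train["ID"]==rand_file_stripped]["transcription"],"\n")
elif rand_file_stripped in df_eval["ID"].values:
  print(df_eval[df_eval["ID"]==rand_file_stripped]["transcription"],"\n")
elif rand_file_stripped in df_test["ID"].values:
  # the test split contains no transcription (see the work plan)
  print("this clip belongs to the test split: no transcription available","\n")
else:
  print("this clip is in none of the splits: no transcription available","\n")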

ipd.Audio(data=speech_array, autoplay=True, rate=16000)
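
A note on deriving a clip ID from a file name: `str.strip` removes a set of characters from both ends of the string, whereas `os.path.splitext` removes only the extension. The sketch below contrasts the two; the file name is a hypothetical one chosen to expose the difference, and `clip_id_from_filename` is an illustrative helper, not part of the notebook.

```python
import os


def clip_id_from_filename(filename):
    """Drop the extension from a clip file name without touching the stem."""
    return os.path.splitext(filename)[0]


name = "wave_01.wav"  # hypothetical file name

# strip() treats ".wav" as the character set {'.', 'w', 'a', 'v'}
print(name.strip(".wav"))
# splitext() only separates the final extension
print(clip_id_from_filename(name))
```

With this name, `strip` also eats the leading `w`, `a`, `v` of the stem, so the resulting ID would never match the `ID` column of the CSV files.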

In [None]:

print(df_train.shape, df_eval.shape, df_test.shape)

In [None]:

#creating a list with the first two clips of the (shuffled) train split
#the index labels are hard-coded and depend on random_state=42 used in train_test_split
speech_deb = []
for i in [3829,3688]:
  speech_deb.append(sf.read(OUTPUT_DIR+"/"+df_train["ID"][i]+".wav")[0])


#creating a list with the last two clips of the (shuffled) train split
speech_fin = []
for i in [5226,860]:
  speech_fin.append(sf.read(OUTPUT_DIR+"/"+df_train["ID"][i]+".wav")[0])

In [None]:
# first two elements of the train set
display(df_train.head(2))

In [None]:
# audio clip corresponding to the 1st element in the train set

print(df_train.loc[3829]["transcription"],"\n")
display(ipd.Audio(data=speech_deb[0], autoplay=True, rate=16000))

In [None]:
# audio clip corresponding to the 2nd element in the train set
print(df_train.loc[3688]["transcription"],"\n")
display(ipd.Audio(data=speech_deb[1], autoplay=True, rate=16000))

In [None]:
# last two elements of the train set
display(df_train.tail(2))

In [None]:
# audio clip corresponding to the 2nd last element in the train set
print(df_train.loc[5226]["transcription"],"\n")
display(ipd.Audio(data=speech_fin[0], autoplay=True, rate=16000))

In [None]:
# audio clip corresponding to the last element in the train set
print(df_train.loc[860]["transcription"],"\n")
display(ipd.Audio(data=speech_fin[1], autoplay=True, rate=16000))

In [None]:
#adding the path to each clip in the train set and the test set 
df_train["clip_path"] = OUTPUT_DIR+"/"+df_train["ID"]+".wav"
df_test["clip_path"] = OUTPUT_DIR+"/"+df_test["ID"]+".wav"


In [None]:
df_eval["clip_path"] = OUTPUT_DIR+"/"+df_eval["ID"]+".wav"


In [None]:
display(df_train)


In [None]:
display(df_test)

# Step 4
 Carry out a few more data-preparation steps

In [None]:
%%capture
!pip install datasets==1.13.3
!pip install transformers==4.11.3
!pip install torchaudio
!pip install librosa
!pip install jiwer

In [None]:
from datasets import Dataset 

In [None]:
data_train = Dataset.from_pandas(df_train)
data_eval = Dataset.from_pandas(df_eval)

data_test = Dataset.from_pandas(df_test)


In [None]:
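# functions to add the numpy array of each audio clip to the train and test datasets 
import numpy as np
import torchaudio

TARGET_SAMPLING_RATE = 16000

def fct_speech_file_to_array_train(batch):
  speech_array, sampling_rate = torchaudio.load(batch["clip_path"])
  # the clips were already converted to 16 kHz in step 2, so resample only when a
  # file is not at the target rate (an unconditional 48000 -> 16000 resampling
  # would distort audio that is already at 16 kHz)
  if sampling_rate != TARGET_SAMPLING_RATE:
    speech_array = torchaudio.transforms.Resample(sampling_rate, TARGET_SAMPLING_RATE)(speech_array)
    sampling_rate = TARGET_SAMPLING_RATE
  batch["audio_array"] = speech_array.squeeze().numpy()
  batch["sampling_rate"] = sampling_rate
  batch["target_text"] = batch["transcription"]

  return batch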
    

def fct_speech_file_to_array_test(batch):
  speech_array, sampling_rate = torchaudio.load(batch["clip_path"])
  batch["audio_array"] = speech_array.squeeze().numpy()
  batch["sampling_rate"] = sampling_rate
  
  return batch

In [None]:

data_train = data_train.remove_columns(["ID" ])

data_eval = data_eval.remove_columns(["ID" ])


data_test = data_test.remove_columns(["ID" ])


In [None]:
print(data_train.column_names)
print(data_test.column_names)

In [None]:

data_train = data_train.map(fct_speech_file_to_array_train, remove_columns = data_train.column_names, num_proc=4)
data_eval = data_eval.map(fct_speech_file_to_array_train, remove_columns = data_eval.column_names, num_proc=4)



data_test = data_test.map(fct_speech_file_to_array_test, remove_columns = data_test.column_names, num_proc=4)


In [None]:
data_train

In [None]:
data_test

In [None]:
import random

rand_int = random.randint(0, len(data_train)-1)

print("Transcription:", data_train[rand_int]["target_text"])
print("audio array :", data_train[rand_int]["audio_array"])
#print("audio array type is :", type(data_train[rand_int]["audio_array"]))
print("Sampling rate:", data_train[rand_int]["sampling_rate"])

# Step 5
Build the vocabulary for the CTC "head" that the model uses to decode its output into a text sequence.
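
To make the role of the vocabulary concrete, here is a minimal sketch of greedy CTC decoding: the model emits one token id per audio frame, repeated ids are collapsed, blank tokens are dropped, and the remaining ids are mapped back to characters. The toy vocabulary, its ids, and the function `ctc_greedy_decode` are assumptions made for illustration; the sketch also assumes that `[PAD]` plays the role of the CTC blank and that `|` marks word boundaries, which is how the tokenizer is configured later in this step.

```python
def ctc_greedy_decode(ids, id_to_token, blank_id, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and map the rest to characters."""
    chars = []
    previous = None
    for token_id in ids:
        # keep a token only when it differs from the previous frame and is not blank
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_token[token_id])
        previous = token_id
    return "".join(chars).replace(word_delimiter, " ").strip()


# Toy vocabulary with hypothetical ids; "[PAD]" doubles as the CTC blank token.
toy_vocab = {"a": 0, "n": 1, "g": 2, "|": 3, "[UNK]": 4, "[PAD]": 5}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}

# Made-up frame-level argmax output of a model
frames = [1, 1, 5, 0, 0, 3, 3, 1, 5, 2, 2, 0, 5]
print(ctc_greedy_decode(frames, id_to_token, blank_id=toy_vocab["[PAD]"]))
```

Every character that appears in the transcriptions needs an id in the vocabulary; anything missing would be mapped to `[UNK]`.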

In [None]:
import re
# raw string, so that the backslashes are passed to the regex engine unchanged
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�]'

def remove_special_characters(batch):
    batch["target_text"] = re.sub(chars_to_ignore_regex, '', batch["target_text"]).lower() + " "
    return batch

In [None]:

data_train = data_train.map(remove_special_characters)

data_eval = data_eval.map(remove_special_characters)





In [None]:
def extract_all_chars(batch):
  all_text = " ".join(batch["target_text"])
  vocab = list(set(all_text))
  return {"vocab": [vocab], "all_text": [all_text]}

In [None]:
data_train_sharded_1 = data_train.shard(num_shards=5, index=0) 

data_train_sharded_2 = data_train.shard(num_shards=5, index=1) 

data_train_sharded_3 = data_train.shard(num_shards=5, index=2) 

data_train_sharded_4 = data_train.shard(num_shards=5, index=3) 

data_train_sharded_5 = data_train.shard(num_shards=5, index=4) 




In [None]:
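# contiguous=True keeps the original row order when the shards are concatenated again,
# so that the predictions stay aligned with the rows of the test CSV
data_test_sharded_1 = data_test.shard(num_shards=2, index=0, contiguous=True) 

data_test_sharded_2 = data_test.shard(num_shards=2, index=1, contiguous=True) 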

In [None]:
vocab_train_1 = data_train_sharded_1.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=data_train_sharded_1.column_names)

vocab_train_2 = data_train_sharded_2.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=data_train_sharded_2.column_names)

vocab_train_3 = data_train_sharded_3.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=data_train_sharded_3.column_names)

vocab_train_4 = data_train_sharded_4.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=data_train_sharded_4.column_names)

vocab_train_5 = data_train_sharded_5.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=data_train_sharded_5.column_names)







In [None]:
vocab_eval = data_eval.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=data_eval.column_names)



In [None]:
vocab_list = list(set(vocab_eval["vocab"][0]) | set(vocab_train_1["vocab"][0]) |set(vocab_train_2["vocab"][0]) | set(vocab_train_3["vocab"][0]) |set(vocab_train_4["vocab"][0]) | set(vocab_train_5["vocab"][0]) )

In [None]:
vocab_dict = {v: k for k, v in enumerate(vocab_list)}
vocab_dict

In [None]:
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]

vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)
len(vocab_dict)

In [None]:
import json
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

In [None]:
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=True)

from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

In [None]:
def prepare_dataset(batch):
    # check that all files have the correct sampling rate
    assert (len(set(batch["sampling_rate"])) == 1), f"Make sure all inputs have the same sampling rate of {processor.feature_extractor.sampling_rate}."

    batch["input_values"] = processor(batch["audio_array"], sampling_rate=batch["sampling_rate"][0]).input_values
    
    with processor.as_target_processor():
        batch["labels"] = processor(batch["target_text"]).input_ids
    return batch

In [None]:
def prepare_dataset_test(batch):
    # check that all files have the correct sampling rate
    assert (
        len(set(batch["sampling_rate"])) == 1
    ), f"Make sure all inputs have the same sampling rate of {processor.feature_extractor.sampling_rate}."

    batch["input_values"] = processor(batch["audio_array"], padding=True,sampling_rate=batch["sampling_rate"][0]).input_values
    
    return batch

In [None]:


data_train_sharded_1 = data_train_sharded_1.map(prepare_dataset, remove_columns=data_train_sharded_1.column_names, num_proc=4, batched = True, batch_size = -1)

data_train_sharded_2 = data_train_sharded_2.map(prepare_dataset, remove_columns=data_train_sharded_2.column_names, num_proc=4, batched = True, batch_size = -1)

data_train_sharded_3 = data_train_sharded_3.map(prepare_dataset, remove_columns=data_train_sharded_3.column_names, num_proc=4, batched = True, batch_size = -1)

data_train_sharded_4 = data_train_sharded_4.map(prepare_dataset, remove_columns=data_train_sharded_4.column_names, num_proc=4, batched = True, batch_size = -1)

data_train_sharded_5 = data_train_sharded_5.map(prepare_dataset, remove_columns=data_train_sharded_5.column_names, num_proc=4, batched = True, batch_size = -1)



In [None]:
data_eval = data_eval.map(prepare_dataset, remove_columns=data_eval.column_names, batch_size=8, num_proc=4, batched=True)


In [None]:

data_test_sharded_1 = data_test_sharded_1.map(prepare_dataset_test, remove_columns=data_test_sharded_1.column_names, batch_size=-1, num_proc=4, batched=True)

data_test_sharded_2 = data_test_sharded_2.map(prepare_dataset_test, remove_columns=data_test_sharded_2.column_names, batch_size=-1, num_proc=4, batched=True)


In [None]:
from datasets import concatenate_datasets

data_train_ready = concatenate_datasets([data_train_sharded_1, data_train_sharded_2, data_train_sharded_3, data_train_sharded_4, data_train_sharded_5])

In [None]:
data_train_ready

In [None]:
data_eval_ready = data_eval

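In [None]:
# the test set is the concatenation of the two *test* shards (not the train shards)
data_test_ready = concatenate_datasets([data_test_sharded_1, data_test_sharded_2])

In [None]:
data_test_ready

In [None]:
# the test split has no labels, so peek at the first input values instead
data_test_ready[0]["input_values"][:10]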

# Step 6
 Create a "DataCollator" object that pads the inputs of each batch sent to the model to a uniform length.
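
As a rough illustration of what such a collator does, the pure-Python sketch below pads the inputs and the labels of a batch separately, each to the longest item in the batch. Padded label positions receive -100, the value the collator in this step uses so that those positions are ignored by the loss. The function `pad_batch` and the toy values are hypothetical; the actual collator delegates the padding to the processor and returns PyTorch tensors.

```python
def pad_batch(features, input_pad_value=0.0, label_pad_value=-100):
    """Pad inputs and labels to the longest item in the batch (illustrative only)."""
    max_input = max(len(f["input_values"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_values": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_values"]), len(f["labels"])
        batch["input_values"].append(f["input_values"] + [input_pad_value] * (max_input - n_in))
        # 1 marks real samples, 0 marks padding
        batch["attention_mask"].append([1] * n_in + [0] * (max_input - n_in))
        batch["labels"].append(f["labels"] + [label_pad_value] * (max_label - n_lab))
    return batch


# Two toy examples with different input and label lengths
toy = [
    {"input_values": [0.1, -0.2, 0.3], "labels": [7, 2]},
    {"input_values": [0.5], "labels": [4, 4, 9]},
]
padded = pad_batch(toy)
print(padded["input_values"])
print(padded["labels"])
```

Inputs and labels are padded independently because an audio clip and its transcription have unrelated lengths.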

In [None]:
import torch

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:
    
    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                max_length=self.max_length_labels,
                pad_to_multiple_of=self.pad_to_multiple_of_labels,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch



In [None]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

# Step 7
 Build an evaluation method based on the word error rate (WER) metric. It is used only on the train split of our dataset, because the test split contains no transcriptions.
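
For reference, the word error rate of one sentence pair is $(S + D + I) / N$: the number of word substitutions, deletions and insertions needed to turn the reference into the hypothesis, divided by the number of reference words. The sketch below computes it with a word-level edit distance. It is a hand-written illustration with made-up sentences, not the implementation behind the `wer` metric loaded in this step, and `word_error_rate` is a hypothetical name.

```python
def word_error_rate(reference, hypothesis):
    """WER of a single sentence pair via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("the reference must contain at least one word")
    # previous[j] = edit distance between the processed reference words and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Made-up example: one substitution and one deletion over four reference words
print(word_error_rate("a b c d", "a x c"))
```

A WER of 0 means a perfect transcription; the value can exceed 1 when the hypothesis contains many inserted words.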

In [None]:
from datasets import load_metric


wer_metric = load_metric("wer")


In [None]:
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

# Step 8: Build the training pipeline

In [None]:
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53", 
    attention_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.0,
    mask_time_prob=0.05,
    layerdrop=0.1,
    gradient_checkpointing=True,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer)
)

In [None]:
model.freeze_feature_extractor()

In [None]:
model.gradient_checkpointing_enable()

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
  # same Drive mount point (/content/drive) as the paths used to read the data
  output_dir="/content/drive/MyDrive/wav2vec2-large-xlsr-WOLOF",
  #output_dir="./wav2vec2-large-xlsr-WOLOF",
  group_by_length=True,
  per_device_train_batch_size=16,
  gradient_accumulation_steps=2,
  evaluation_strategy="steps",
  num_train_epochs=30,
  fp16=True,
  save_steps=100,
  eval_steps=100,
  logging_steps=10,
  learning_rate=3e-4,
  warmup_steps=500,
  save_total_limit=2,
)
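
A quick way to read these settings: with `per_device_train_batch_size=16` and `gradient_accumulation_steps=2`, each optimizer step sees 32 examples on a single GPU. The sketch below also reproduces the shape of a linear warmup followed by a linear decay, assuming the `Trainer` keeps its default linear scheduler. The total of 3000 optimizer steps is a made-up figure, since the real value depends on the dataset size, and `linear_warmup_lr` is a hypothetical helper.

```python
def linear_warmup_lr(step, base_lr=3e-4, warmup_steps=500, total_steps=3000):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining


# Examples consumed per optimizer step on one device (single-GPU assumption)
effective_batch_size = 16 * 2

for step in (0, 250, 500, 1750, 3000):
    print(step, linear_warmup_lr(step))
```

The learning rate reaches its peak of `3e-4` only after the 500 warmup steps, so a very short run spends most of its time below the nominal value.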

In [None]:
from transformers import Trainer


trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=data_train_ready,
    eval_dataset=data_eval_ready,
    tokenizer=processor.feature_extractor,
)



# Step 9: Train (fine-tune) the model


In [None]:
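// JavaScript, to be pasted into the browser's developer console (it is not Python):
// it clicks Colab's connect button every 60 seconds to keep the session alive.
function ConnectButton(){
    console.log("Connect pushed"); 
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click() 
}
setInterval(ConnectButton,60000);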



In [None]:
trainer.train()

In [None]:
model.save_pretrained("wav2vec2-large-xlsr-WOLOF")
processor.save_pretrained("wav2vec2-large-xlsr-WOLOF")

# Step 10: Evaluate the model 

In [None]:
val =pd.read_csv("../input/wolof-asr/Test.csv")
# the column is named "clip_path" because fct_speech_file_to_array_test reads batch["clip_path"]
val["clip_path"] = "../input/wolof-asr/Noise Removed/tmp/WOLOF_ASR_dataset/noise_remove/"+val["ID"]+".wav"
val.rename(columns = {'transcription':'sentence'}, inplace = True)
common_voice_val = Dataset.from_pandas(val)

In [None]:
common_voice_val = common_voice_val.remove_columns([ "ID","age",  "down_votes", "gender",  "up_votes"])

In [None]:
# use the loader defined in step 4 (speech_file_to_array_fn_test is not defined in this notebook)
common_voice_val = common_voice_val.map(fct_speech_file_to_array_test, remove_columns=common_voice_val.column_names)



In [None]:
common_voice_val = common_voice_val.map(prepare_dataset_test, remove_columns=common_voice_val.column_names, batch_size=8, num_proc=4, batched=True)

In [None]:
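final_pred = []
model.eval()  # disable dropout and layerdrop for inference
for i in tqdm(range(data_test_ready.shape[0])):    
    input_dict = processor(data_test_ready[i]["input_values"], sampling_rate=16000, return_tensors="pt", padding=True)

    # no gradients are needed at inference time, which saves GPU memory
    with torch.no_grad():
        logits = model(input_dict.input_values.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)[0]
    prediction = processor.decode(pred_ids)
    final_pred.append(prediction)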

In [None]:
val["transcription"] = final_pred
val["transcription"] = val["transcription"].str.capitalize()
val.iloc[1390,6] = "ah"

In [None]:
val[["ID","transcription"]].to_csv("submission_file.csv", index=False)

In [None]:
val["transcription"] 