<a href="https://colab.research.google.com/github/taimoorsardar/Automatic-Speech-Recognition/blob/main/ASR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we fine-tune Whisper, a state-of-the-art sequence-to-sequence (Seq2Seq) Transformer model, on the Urdu subset of the Common Voice dataset.
The model was recently open-sourced by OpenAI.
We also tried the other Whisper variants separately by changing only the model name in this notebook.

# Preparing Environment

In [None]:
# checking gpu
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)
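
The `!nvidia-smi` shell escape only works inside IPython or Colab. As a minimal sketch, the same check can be written in plain Python with the standard library alone. The helper name `gpu_summary` is hypothetical and not part of any library.

```python
import shutil
import subprocess


def gpu_summary() -> str:
    """Return the nvidia-smi report, or a short message if no GPU is reachable."""
    # nvidia-smi is not installed at all on CPU-only machines
    if shutil.which("nvidia-smi") is None:
        return "Not connected to a GPU"
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    # a non-zero exit code means the driver could not talk to a GPU
    if result.returncode != 0:
        return "Not connected to a GPU"
    return result.stdout


print(gpu_summary())
```

This version relies on the exit code rather than searching the output for the word "failed", which is a design choice of the sketch and not how the cell above works.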

In [None]:
# update pip and install the required libraries, as suggested by Hugging Face
!pip install --upgrade pip
!pip install --upgrade datasets[audio] transformers accelerate evaluate jiwer tensorboard gradio

Because the dataset is hosted on the Hugging Face Hub, we also need to log in there.


In [None]:
from huggingface_hub import notebook_login

notebook_login()

# Load Dataset

In [None]:
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

# token=True reuses the Hub login from above (older datasets releases call this argument use_auth_token)
common_voice["train"] = load_dataset("mozilla-foundation/common_voice_13_0", "ur", split="train+validation", token=True)
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_13_0", "ur", split="test", token=True)

print(common_voice)

# Preprocessing Dataset

In [None]:
# removing unnecessary columns
common_voice = common_voice.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])
print(common_voice)

### Load Pre-trained Feature Extractor

In this cell, we import the WhisperFeatureExtractor from the transformers library and load a pre-trained feature extractor specifically designed for the Whisper model.

In [None]:
from transformers import WhisperFeatureExtractor
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
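
To make the feature extractor's output concrete, the arithmetic below works out the shape of one log-Mel input. The constants are what we understand to be the `WhisperFeatureExtractor` defaults for `whisper-tiny` (16 kHz audio, 30-second windows, a hop of 160 samples, 80 Mel bins). They are assumptions written out by hand, not values read from the checkpoint, and the helper name `log_mel_shape` is hypothetical.

```python
# Assumed WhisperFeatureExtractor defaults, written out by hand for illustration
SAMPLING_RATE = 16_000  # Hz expected by Whisper
CHUNK_LENGTH_S = 30     # every clip is padded or truncated to 30 seconds
HOP_LENGTH = 160        # samples between consecutive spectrogram frames
N_MELS = 80             # number of Mel frequency bins


def log_mel_shape() -> tuple:
    """Return (mel_bins, frames) for one padded 30-second clip."""
    n_samples = SAMPLING_RATE * CHUNK_LENGTH_S  # samples in one padded clip
    n_frames = n_samples // HOP_LENGTH          # one frame every HOP_LENGTH samples
    return (N_MELS, n_frames)


print(log_mel_shape())
```

Under these assumptions every example has the same input shape regardless of how long the original recording was, which is why the audio inputs need no length-dependent padding later on.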

### Load Pre-trained Tokenizer

In this cell, we import the WhisperTokenizer from the transformers library and load a pre-trained tokenizer for the Whisper model.

In [None]:
from transformers import WhisperTokenizer
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny", language="Urdu", task="transcribe")
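
The `language` and `task` arguments control the special tokens that wrap every label sequence. The sketch below builds that wrapper as a plain string so the layout is visible. It mirrors our understanding of the Whisper label format and does not call the real tokenizer. The function `label_template` and the small `LANGUAGE_CODES` table are hypothetical helpers for illustration only.

```python
# Tiny excerpt of a language-name -> code table, only the entries needed here
LANGUAGE_CODES = {"urdu": "ur", "hindi": "hi", "english": "en"}


def label_template(text: str, language: str = "Urdu", task: str = "transcribe",
                   timestamps: bool = False) -> str:
    """Show how a transcript is wrapped in Whisper's special tokens."""
    parts = [
        "<|startoftranscript|>",
        f"<|{LANGUAGE_CODES[language.lower()]}|>",  # language token
        f"<|{task}|>",                              # task token
    ]
    if not timestamps:
        # labels without timestamps carry an explicit marker
        parts.append("<|notimestamps|>")
    return "".join(parts) + text + "<|endoftext|>"


print(label_template("transcript goes here"))
```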

### Load Pre-trained Processor

In this cell, we import the WhisperProcessor from the transformers library and load a pre-trained processor for the Whisper model. The processor wraps the feature extractor and the tokenizer in a single object.

In [None]:
from transformers import WhisperProcessor
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny", language="Urdu", task="transcribe")

### Remaining Data Processing

In [None]:
print(common_voice["train"][0])

In [None]:
from datasets import Audio
# Set the sampling rate to 16000 Hz, which is a standard rate for speech processing.
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

In [None]:
print(common_voice["train"][0])

In [None]:
def prepare_dataset(batch):
    # accessing the audio column decodes the file and resamples it from 48 kHz to 16 kHz
    audio = batch["audio"]
    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

In [None]:
# Apply the prepare_dataset function to each example in the common_voice dataset.
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=2)

# Training and Evaluation

## Load Pre-trained Whisper Model

In this cell, we import the `WhisperForConditionalGeneration` class from the `transformers` library and load a pre-trained Whisper model, specifically the "whisper-tiny" version from OpenAI.
The `WhisperForConditionalGeneration` class is essential because it provides the architecture and methods needed for generating text from audio inputs.

In [None]:
from transformers import WhisperForConditionalGeneration
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

### Configuring the model according to our requirements

In [None]:
model.generation_config.language = "urdu"
model.generation_config.task = "transcribe"

model.generation_config.forced_decoder_ids = None

### Custom Data Collator for Speech Sequence-to-Sequence Models

In this cell, we define a custom data collator class `DataCollatorSpeechSeq2SeqWithPadding` using the `@dataclass` decorator. The collator **pads the input audio features and the label sequences separately**, because they have different lengths and need different padding methods. Padded label positions are replaced with -100 so that they are ignored when the loss is computed.


In [None]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was prepended in the previous tokenization step,
        # cut it here, since it is added again later anyway
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [None]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)
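
The label handling inside the collator can be illustrated without PyTorch. The sketch below pads toy label sequences to a common length, fills the padded positions with -100 (the index that PyTorch's cross-entropy loss ignores by default), and drops a leading start token when every sequence has one. The token ids are made up, the function name `pad_and_mask` is hypothetical, and each sequence is assumed to be non-empty.

```python
from typing import List

IGNORE_INDEX = -100  # positions with this value do not contribute to the loss


def pad_and_mask(label_ids: List[List[int]], bos_id: int) -> List[List[int]]:
    """Pad labels to equal length with IGNORE_INDEX and strip a shared leading bos."""
    max_len = max(len(ids) for ids in label_ids)
    # right-pad every sequence up to the longest one in the batch
    padded = [ids + [IGNORE_INDEX] * (max_len - len(ids)) for ids in label_ids]
    # remove the first token only if all sequences start with the bos token
    if all(row[0] == bos_id for row in padded):
        padded = [row[1:] for row in padded]
    return padded


# toy batch: 1 stands in for the start token, other ids are arbitrary
print(pad_and_mask([[1, 7, 8, 9], [1, 4]], bos_id=1))  # [[7, 8, 9], [4, -100, -100]]
```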

In [None]:
# Evaluation metrics
import evaluate
metric = evaluate.load("wer")

In [None]:
# helper function that computes the WER metric
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id
    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}
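
For readers unfamiliar with word error rate (WER), here is a minimal sketch of what the metric measures: the word-level edit distance (substitutions, deletions and insertions) summed over all sentences, divided by the total number of reference words. This is an illustrative re-implementation under that definition, not the code used by `evaluate`, and the function names are hypothetical.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(references: List[str], predictions: List[str]) -> float:
    """WER in percent, aggregated over all sentence pairs."""
    errors = sum(edit_distance(r.split(), p.split())
                 for r, p in zip(references, predictions))
    total_words = sum(len(r.split()) for r in references)
    return 100 * errors / total_words


# one substitution (sat -> sit) and one deletion (the) over six reference words
print(word_error_rate(["the cat sat on the mat"], ["the cat sit on mat"]))
```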

The hyperparameters follow the values suggested by Hugging Face.

Due to some constraints, we changed a few of them.

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-tiny-urdu",  # This specifies the directory where the trained model and other outputs will be saved.
    per_device_train_batch_size=16,  # The batch size for training per device (GPU or CPU). It determines the number of training samples processed simultaneously on each device during training.
    gradient_accumulation_steps=2,  # Number of steps for which gradients are accumulated before performing a parameter update. Useful for effectively simulating larger batch sizes with limited memory.
    learning_rate=1e-5,  # The initial learning rate for the optimizer.
    warmup_steps=500,  # Number of steps for which the learning rate increases linearly from 0 to the specified learning rate. Helps stabilize training by gradually increasing the learning rate.
    max_steps=1500,  # The maximum number of training steps to run. Training will stop when this number of steps is reached.
    gradient_checkpointing=True,  # Whether to use gradient checkpointing to reduce memory usage during training. Trades off compute for memory.
    fp16=True,  # Whether to use 16-bit precision (mixed precision training) to speed up training and reduce memory usage.
    eval_strategy="steps",  # Strategy for evaluation during training. "steps" means evaluation is performed every eval_steps steps. Older transformers releases name this argument evaluation_strategy.
    per_device_eval_batch_size=8,  # The batch size for evaluation per device.
    predict_with_generate=True,  # Whether to call generate() during evaluation, so that metrics such as WER are computed on generated text.
    generation_max_length=225,  # Maximum length of the generated output sequences during prediction.
    save_steps=500,  # Number of steps after which a checkpoint is saved.
    eval_steps=500,  # Number of steps after which evaluation is performed during training.
    logging_steps=25,  # Number of steps after which logs are written to the log file and printed on the console.
    report_to=["tensorboard"],  # Where to report training metrics. In this case, it's set to report to TensorBoard.
    load_best_model_at_end=True,  # Whether to load the best model at the end of training based on the specified metric.
    metric_for_best_model="wer",  # The metric used to determine the best model during training.
    greater_is_better=False,  # Whether a higher value of the specified metric indicates better performance.
    push_to_hub=True,  # Whether to push the trained model to the Hugging Face Model Hub after training.
)
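
The numbers above interact in a few ways that are easy to work out by hand. The sketch below computes the effective batch size, the number of training examples seen, the learning rate at a few steps, and the steps at which checkpoints and evaluations happen. It assumes a single device and the Trainer's default linear schedule (linear warmup, then linear decay to zero). The function names are hypothetical.

```python
def effective_batch_size(per_device: int = 16, accumulation: int = 2,
                         n_devices: int = 1) -> int:
    """Examples contributing to each optimizer update."""
    return per_device * accumulation * n_devices


def linear_schedule_lr(step: int, base_lr: float = 1e-5,
                       warmup_steps: int = 500, max_steps: int = 1500) -> float:
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))
    return base_lr * remaining


print("effective batch size:", effective_batch_size())
print("examples seen in 1500 steps:", effective_batch_size() * 1500)
for step in (0, 250, 500, 1000, 1500):
    print(step, linear_schedule_lr(step))
# save_steps=500 and eval_steps=500 with max_steps=1500
print("checkpoint/eval steps:", list(range(500, 1500 + 1, 500)))
```

With these settings the learning rate peaks exactly when warmup ends at step 500, and `load_best_model_at_end` has three evaluated checkpoints to choose from.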

In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

In [None]:
processor.save_pretrained(training_args.output_dir)

Some results here do not exactly match those in the research paper, because the notebook was re-run with different hyperparameters. The results have changed, but they remain close to the original ones.

In [None]:
trainer.train()

## Optional: push the model to the Hugging Face Hub

In [None]:
'''kwargs = {
    "dataset_tags": "mozilla-foundation/common_voice_13_0",
    "dataset": "Common Voice 13.0",  # a 'pretty' name for the training dataset
    "dataset_args": "config: ur, split: test",
    "language": "urdu",
    "model_name": "Whisper Base Urdu - Taimoor Sardar",  # a 'pretty' name for our model
    "finetuned_from": "openai/whisper-base",
    "tasks": "automatic-speech-recognition",
}'''

In [None]:
#trainer.push_to_hub(**kwargs)

In [None]:
'''
from transformers import pipeline
import gradio as gr

pipe = pipeline(model="TS_TI/whisper-base-ur")

def transcribe(audio):
    text = pipe(audio)["text"]
    return text

iface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["microphone"], type="filepath"),  # Gradio 4+ spelling; Gradio 3 used source="microphone"
    outputs="text",
    title="Whisper base urdu",
    description="Realtime demo for Urdu speech recognition using a fine-tuned Whisper base model.",
)

iface.launch()'''