In this section, we'll walk step by step through fine-tuning Whisper for speech recognition on the Common Voice 13 dataset. We'll use the 'small' version of the model and a relatively lightweight dataset, enabling you to run fine-tuning fairly quickly **on any 16GB+ GPU with low disk space requirements, such as the 16GB T4 GPU provided in the Google Colab free tier**.

## Prepare Environment

The Hugging Face team strongly advises uploading model checkpoints directly to the [Hugging Face Hub](https://huggingface.co/) while training. The Hub provides:


*   Integrated version control: you can be sure that no model checkpoint is lost during training.
*   Tensorboard logs: track important metrics over the course of training
*   Model cards: document what a model does and its intended use cases
*   Community: an easy way to share and collaborate with the community!



In [None]:
!pip install huggingface_hub datasets transformers

In [38]:
from huggingface_hub import notebook_login  # requires an access token with the "write" role

notebook_login()


## Load Dataset

In [39]:
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

common_voice["train"] = load_dataset(
    "mozilla-foundation/common_voice_13_0", "dv", split="train+validation"
)
common_voice["test"] = load_dataset(
    "mozilla-foundation/common_voice_13_0", "dv", split="test"
)

print(common_voice)

DatasetDict({
    train: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        num_rows: 4904
    })
    test: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        num_rows: 2212
    })
})


## Feature Extractor, Tokenizer and Processor

The ASR pipeline can be decomposed into three stages:

1. The feature extractor which pre-processes the raw audio inputs to log-mel spectrograms
2. The model which performs the sequence-to-sequence mapping
3. The tokenizer which post-processes the predicted tokens to text

In [40]:
from transformers.models.whisper.tokenization_whisper import TO_LANGUAGE_CODE

TO_LANGUAGE_CODE

{'english': 'en',
 'chinese': 'zh',
 'german': 'de',
 'spanish': 'es',
 'russian': 'ru',
 'korean': 'ko',
 'french': 'fr',
 'japanese': 'ja',
 'portuguese': 'pt',
 'turkish': 'tr',
 'polish': 'pl',
 'catalan': 'ca',
 'dutch': 'nl',
 'arabic': 'ar',
 'swedish': 'sv',
 'italian': 'it',
 'indonesian': 'id',
 'hindi': 'hi',
 'finnish': 'fi',
 'vietnamese': 'vi',
 'hebrew': 'he',
 'ukrainian': 'uk',
 'greek': 'el',
 'malay': 'ms',
 'czech': 'cs',
 'romanian': 'ro',
 'danish': 'da',
 'hungarian': 'hu',
 'tamil': 'ta',
 'norwegian': 'no',
 'thai': 'th',
 'urdu': 'ur',
 'croatian': 'hr',
 'bulgarian': 'bg',
 'lithuanian': 'lt',
 'latin': 'la',
 'maori': 'mi',
 'malayalam': 'ml',
 'welsh': 'cy',
 'slovak': 'sk',
 'telugu': 'te',
 'persian': 'fa',
 'latvian': 'lv',
 'bengali': 'bn',
 'serbian': 'sr',
 'azerbaijani': 'az',
 'slovenian': 'sl',
 'kannada': 'kn',
 'estonian': 'et',
 'macedonian': 'mk',
 'breton': 'br',
 'basque': 'eu',
 'icelandic': 'is',
 'armenian': 'hy',
 'nepali': 'ne',
 'mongol

In [41]:
from transformers import WhisperProcessor

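# Dhivehi ("dv") is not among the languages Whisper was pre-trained on,
# so a related supported language, Sinhalese, is used as the language token here.
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="sinhalese", task="transcribe"
)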

## Pre-process the Data

In [42]:
common_voice["train"].features

{'client_id': Value(dtype='string', id=None),
 'path': Value(dtype='string', id=None),
 'audio': Audio(sampling_rate=48000, mono=True, decode=True, id=None),
 'sentence': Value(dtype='string', id=None),
 'up_votes': Value(dtype='int64', id=None),
 'down_votes': Value(dtype='int64', id=None),
 'age': Value(dtype='string', id=None),
 'gender': Value(dtype='string', id=None),
 'accent': Value(dtype='string', id=None),
 'locale': Value(dtype='string', id=None),
 'segment': Value(dtype='string', id=None),
 'variant': Value(dtype='string', id=None)}

Since our input audio is sampled at 48kHz, we need to *downsample* it to 16kHz prior to passing it to the Whisper feature extractor, 16kHz being the sampling rate expected by the Whisper model.
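
To make the effect of downsampling concrete, here is a small arithmetic sketch. The helper name `resampled_length` and the 5-second clip are hypothetical, chosen only for illustration. Real resampling also filters the signal, while this sketch only tracks how the sample count changes and that the duration stays the same.

```python
def resampled_length(num_samples: int, src_rate: int, dst_rate: int) -> int:
    """Approximate number of samples after resampling from src_rate to dst_rate."""
    return round(num_samples * dst_rate / src_rate)


src_rate, dst_rate = 48_000, 16_000
clip_seconds = 5  # hypothetical clip length, for illustration only

num_samples_48k = clip_seconds * src_rate
num_samples_16k = resampled_length(num_samples_48k, src_rate, dst_rate)

print(num_samples_48k)             # 240000
print(num_samples_16k)             # 80000
print(num_samples_16k / dst_rate)  # 5.0 (duration in seconds is unchanged)
```

Going from 48kHz to 16kHz keeps one third of the samples, which is why the duration computed later as `len(array) / sampling_rate` is unaffected by the cast.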

In [43]:
from datasets import Audio

sampling_rate = processor.feature_extractor.sampling_rate
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=sampling_rate))

Now we can write a function that prepares our data for the model:

1. We load and resample the audio data on a sample-by-sample basis by accessing `example["audio"]`. As explained above, `Datasets` performs any necessary resampling operations on the fly.
2. We use the feature extractor to compute the log-mel spectrogram input features from our 1-dimensional audio array.
3. We encode the transcriptions to label ids using the tokenizer.

In [44]:
def prepare_dataset(example):
  audio = example["audio"]

  example = processor(
      audio = audio["array"],
      sampling_rate = audio["sampling_rate"],
      text = example["sentence"]
  )

  # compute input length of audio sample in seconds
  example["input_length"] = len(audio["array"]) / audio["sampling_rate"]

  return example

We can apply the data preparation function to all of our training examples using the `Datasets.map` method. We'll remove the columns from the raw training data (the audio and text), leaving just the columns returned by the `prepare_dataset` function:

In [45]:
common_voice = common_voice.map(
    prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=1
)  # with a single process, this step can take a long time


Finally, we filter out any training data with audio samples longer than 30s. These samples would otherwise be truncated by the Whisper feature extractor, which could affect the stability of training. We define a function that returns `True` for samples that are shorter than 30s, and `False` for those that are longer:

In [46]:
max_input_length = 30.0

def is_audio_in_length_range(length):
  return length < max_input_length

We apply our filter function to all samples of our training dataset through the `Datasets.filter` method:

In [47]:
common_voice["train"] = common_voice["train"].filter(
    is_audio_in_length_range,
    input_columns=["input_length"]
)
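
As a quick illustration of what the filter keeps, the same predicate can be applied to a plain Python list. The durations below are hypothetical values, not taken from the dataset.

```python
max_input_length = 30.0


def is_audio_in_length_range(length):
    # strict comparison: a sample of exactly 30.0s is dropped
    return length < max_input_length


# hypothetical durations in seconds, for illustration only
input_lengths = [4.2, 12.7, 30.0, 31.5, 29.99]

kept = [length for length in input_lengths if is_audio_in_length_range(length)]
print(kept)  # [4.2, 12.7, 29.99]
```

Because the comparison is strict, a clip of exactly 30.0 seconds is removed along with the longer ones.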

In [48]:
common_voice["train"]

Dataset({
    features: ['input_features', 'labels', 'input_length'],
    num_rows: 4904
})

## Training and Evaluation

Now that we've prepared our data, we're ready to dive into the training pipeline. The [Trainer](https://huggingface.co/transformers/master/main_classes/trainer.html?highlight=trainer) will do much of the heavy lifting for us. All we have to do is:



* Define a data collator: the data collator takes our pre-processed data and prepares PyTorch tensors ready for the model
* Evaluation metrics: during evaluation, we want to evaluate the model using the word error rate (WER) metric. We need to define a `compute_metrics` function that handles this computation
* Load a pre-trained checkpoint: we need to load a pre-trained checkpoint and configure it correctly for training.
* Define the training arguments: these will be used by the Trainer in constructing the training schedule.

Once we've fine-tuned the model, we will evaluate it on the test data to verify that we have correctly trained it to transcribe speech in Dhivehi.



### Define a Data Collator

The data collator for a Seq2Seq model is unique in the sense that it treats the `input_features` and `labels` independently: the `input_features` must be handled by the feature extractor and the `labels` by the tokenizer.

The `input_features` are already padded to 30s and converted to a log-Mel spectrogram of fixed dimension, so all we have to do is convert them to batched PyTorch tensors. We do this using the feature extractor's `.pad` method with `return_tensors="pt"`. Note that no additional padding is applied here: since the inputs are of fixed dimension, the `input_features` are simply converted to PyTorch tensors.
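
The "fixed dimension" can be worked out with simple arithmetic. The constants below (80 mel bins, a hop length of 160 samples) are assumptions about the default feature extractor settings for `openai/whisper-small`. They can be checked against `processor.feature_extractor` rather than taken from this sketch.

```python
sampling_rate = 16_000  # Hz, the rate expected by Whisper
chunk_length_s = 30     # every input is padded or truncated to 30 seconds
hop_length = 160        # assumed: samples between consecutive spectrogram frames
num_mel_bins = 80       # assumed: number of mel filterbank channels

num_samples = sampling_rate * chunk_length_s
num_frames = num_samples // hop_length

print(num_samples)                 # 480000
print((num_mel_bins, num_frames))  # (80, 3000)
```

Under these assumptions every example has a spectrogram of shape `(80, 3000)` regardless of how long the original clip was, which is why the collator only needs to stack the inputs.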

On the other hand, the `labels` are un-padded. We first pad the sequences to the maximum length in the batch using the tokenizer's `.pad` method. The padding tokens are then replaced by -100 so that these tokens are **not** taken into account when computing the loss. We then cut the start-of-transcript token from the beginning of the label sequence, as it is appended again later during training.
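
The label handling can be illustrated without any library. The sketch below is a plain-Python stand-in for the tokenizer padding and masking, not the collator itself. The function name `pad_and_mask_labels` and the token ids are hypothetical.

```python
def pad_and_mask_labels(label_seqs, pad_id, bos_id, ignore_index=-100):
    """Pad label sequences to a common length, mask the padding, drop a shared BOS."""
    max_len = max(len(seq) for seq in label_seqs)
    batch = []
    for seq in label_seqs:
        num_pad = max_len - len(seq)
        padded = list(seq) + [pad_id] * num_pad
        attention_mask = [1] * len(seq) + [0] * num_pad
        # positions where the attention mask is 0 are ignored by the loss
        masked = [
            tok if keep else ignore_index
            for tok, keep in zip(padded, attention_mask)
        ]
        batch.append(masked)

    # if every sequence starts with the BOS token, cut it from all of them
    if all(row[0] == bos_id for row in batch):
        batch = [row[1:] for row in batch]
    return batch


BOS, PAD = 1, 0  # hypothetical token ids, for illustration only
labels = [[BOS, 11, 12, 13], [BOS, 21]]
print(pad_and_mask_labels(labels, pad_id=PAD, bos_id=BOS))
# [[11, 12, 13], [21, -100, -100]]
```

Masking is driven by the attention mask rather than by the value of the padding id, mirroring the `masked_fill` call in the collator.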

We can leverage the `WhisperProcessor` we defined earlier to perform both the feature extractor and the tokenizer operations:

In [49]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [
            {"input_features": feature["input_features"][0]} for feature in features
        ]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        # if the bos token was added in the previous tokenization step,
        # cut it here since it is appended again later during training
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [50]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

### Evaluation Metrics

In [None]:
!pip install evaluate jiwer

In [52]:
import evaluate

metric = evaluate.load("wer")
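
For intuition about what this metric measures, here is a minimal pure-Python sketch of the word error rate for a single reference/prediction pair: the word-level edit distance divided by the number of reference words. The function name `word_error_rate` is hypothetical, and this is not the implementation used by `evaluate`. It assumes a non-empty reference.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = prediction.split()

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j predicted words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "the cat sat on the mat"
prediction = "the cat sit on mat"
# one substitution (sat -> sit) and one deletion (the): 2 errors over 6 words
print(round(100 * word_error_rate(reference, prediction), 2))  # 33.33
```

The division by the reference length is also why references that become empty after normalization have to be excluded before computing the metric.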

In [53]:
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

normalizer = BasicTextNormalizer()

def compute_metrics(pred):
  pred_ids = pred.predictions
  label_ids = pred.label_ids

  # replace -100 with the pad_token_id
  label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

  # we do not want to group tokens when computing the metrics
  pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
  label_str = processor.batch_decode(label_ids, skip_special_tokens=True)

  # compute orthographic wer
  wer_ortho = 100*metric.compute(predictions=pred_str, references=label_str)

  # compute normalized WER
  pred_str_norm = [normalizer(pred) for pred in pred_str]
  label_str_norm = [normalizer(label) for label in label_str]

  # filtering step to only evaluate the samples that correspond to non-zero references:
  pred_str_norm = [
      pred_str_norm[i] for i in range(len(pred_str_norm)) if len(label_str_norm[i]) > 0
  ]
  label_str_norm = [
      label_str_norm[i] for i in range(len(label_str_norm)) if len(label_str_norm[i]) > 0
  ]

  wer = 100 * metric.compute(predictions=pred_str_norm, references=label_str_norm)

  return {"wer_ortho": wer_ortho, "wer":wer}
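
The difference between the orthographic and normalized WER, and the reason for the filtering step, can be seen on a toy example. The `toy_normalize` function below is a deliberately simplified stand-in (lowercasing and stripping ASCII punctuation). It is not the behaviour of `BasicTextNormalizer`, and the strings are made up for illustration.

```python
import string


def toy_normalize(text: str) -> str:
    """Simplified stand-in: lowercase, drop ASCII punctuation, collapse whitespace."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())


# hypothetical predictions and references
pred_str = ["Hello, world!", "good morning", "..."]
label_str = ["hello world", "Good morning.", "?!"]

pred_norm = [toy_normalize(p) for p in pred_str]
label_norm = [toy_normalize(l) for l in label_str]

# keep only the pairs whose normalized reference is non-empty
pairs = [(p, l) for p, l in zip(pred_norm, label_norm) if len(l) > 0]
print(pairs)
# [('hello world', 'hello world'), ('good morning', 'good morning')]
```

Before normalization the first two pairs differ in casing and punctuation, so they would count as errors in the orthographic WER. After normalization they match exactly, and the third pair is dropped because its reference is empty.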

### Load a Pre-Trained Checkpoint

In [54]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

We'll set `use_cache` to `False` for training since we're using gradient checkpointing and the two are incompatible. We'll also override two generation arguments to control the behaviour of the model during inference: we'll force the language and task tokens during generation by setting the `language` and `task` arguments, and also re-enable the cache for generation to speed up inference:

In [55]:
from functools import partial

# disable cache during training since it's incompatible with gradient checkpointing
model.config.use_cache = False

# set language and task for generation and re-enable cache
model.generate = partial(
    model.generate, language="sinhalese", task="transcribe", use_cache=True
)
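
If `functools.partial` is unfamiliar, the following sketch shows the mechanism on a toy function. The `generate` function here is a hypothetical stand-in that just echoes its arguments. It is not the model's `generate` method.

```python
from functools import partial


def generate(inputs, language=None, task=None, use_cache=False):
    # toy stand-in: return the arguments it was called with
    return {"inputs": inputs, "language": language, "task": task, "use_cache": use_cache}


# pre-bind keyword arguments so callers no longer have to pass them
generate_dv = partial(generate, language="sinhalese", task="transcribe", use_cache=True)

result = generate_dv("audio features")
print(result["language"], result["task"], result["use_cache"])
# sinhalese transcribe True

# a bound keyword argument can still be overridden at call time
override = generate_dv("audio features", task="translate")
print(override["task"])  # translate
```

The bound values act as defaults, so any later call to the wrapped function uses them unless the caller explicitly passes something else.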

### Define the Training Configuration

In the final step, we define all the parameters related to training. Here, we set the number of training steps to 500. This is enough steps to see a big WER improvement compared to the pre-trained Whisper model, while ensuring that fine-tuning can be run in approximately 45 minutes on a Google Colab free tier. For more detail on the training arguments, refer to the Seq2SeqTrainingArguments [docs](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments).

In [56]:
!pip install accelerate transformers[torch] -U



In [57]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-dv",  # also the repo name on the HF Hub; pushing fails without an access token that has the "write" role
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    lr_scheduler_type="constant_with_warmup",
    warmup_steps=50,
    max_steps=500,  # increase to 4000 if you have your own GPU or a Colab paid plan
    gradient_checkpointing=True,
    fp16=True,
    fp16_full_eval=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=16,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=500,
    eval_steps=500,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)
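
To see what `lr_scheduler_type="constant_with_warmup"` with `warmup_steps=50` means in practice, here is a sketch of the schedule shape as commonly described: a linear ramp from zero to the base learning rate, then a constant value. The function `constant_with_warmup_lr` is a hypothetical illustration, not the library's scheduler.

```python
def constant_with_warmup_lr(step, base_lr=1e-5, warmup_steps=50):
    """Linear warmup to base_lr over warmup_steps, then constant."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr


for step in (0, 25, 50, 500):
    print(step, constant_with_warmup_lr(step))
```

With these settings the learning rate is about half of `1e-5` at step 25 and stays at `1e-5` from step 50 through the final step.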

We can forward the training arguments to the Trainer along with our model, dataset, data collator and `compute_metrics` function:

In [59]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor,
)

### Training

In [60]:
trainer.train()



| Step | Training Loss | Validation Loss | Wer Ortho | Wer |
|---|---|---|---|---|
| 500 | 0.1228 | 0.16778 | 62.232746 | 13.022916 |


There were missing keys in the checkpoint model loaded: ['proj_out.weight'].


TrainOutput(global_step=500, training_loss=0.876505708694458, metrics={'train_runtime': 5459.6164, 'train_samples_per_second': 1.465, 'train_steps_per_second': 0.092, 'total_flos': 2.30637451935744e+18, 'train_loss': 0.876505708694458, 'epoch': 1.63})
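
Several numbers in this output can be cross-checked from the training arguments and the dataset size with a few lines of arithmetic. The sketch assumes a single GPU (`num_devices = 1`), which is not stated in the output itself.

```python
per_device_train_batch_size = 16
gradient_accumulation_steps = 1
num_devices = 1            # assumption: training ran on a single GPU
max_steps = 500
num_train_examples = 4904  # rows in common_voice["train"]
train_runtime_s = 5459.6164

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
examples_seen = effective_batch_size * max_steps

print(effective_batch_size)                         # 16
print(examples_seen)                                # 8000
print(round(examples_seen / num_train_examples, 2)) # 1.63 epochs
print(round(examples_seen / train_runtime_s, 3))    # 1.465 samples per second
print(round(max_steps / train_runtime_s, 3))        # 0.092 steps per second
```

The same relation explains the comment on `gradient_accumulation_steps`: halving the per-device batch size while doubling the accumulation steps leaves the effective batch size, and therefore the number of examples seen in 500 steps, unchanged.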

In [61]:
kwargs = {
    "dataset_tags": "mozilla-foundation/common_voice_13_0",
    "dataset": "Common Voice 13",  # a 'pretty' name for the training dataset
    "language": "dv",
    "model_name": "Whisper Small Dv - Sanchit Gandhi",  # a 'pretty' name for your model
    "finetuned_from": "openai/whisper-small",
    "tasks": "automatic-speech-recognition",
}

The training results can now be uploaded to the Hub. To do so, execute the `push_to_hub` command:

In [62]:
trainer.push_to_hub(**kwargs)

'https://huggingface.co/shoveling42/whisper-small-dv/tree/main/'

## Sharing Your Model

You can now share this model with anyone using the link on the Hub. They can load it with the identifier "your-username/the-name-you-picked" directly into the `pipeline()` object. For instance, to load the fine-tuned checkpoint ["sanchit-gandhi/whisper-small-dv"](https://huggingface.co/sanchit-gandhi/whisper-small-dv):

In [63]:
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="sanchit-gandhi/whisper-small-dv")


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.