## Fine-tuning the Whisper model for Swedish audio transcription

This is part 1 of my assignment in ID2223 Scalable Machine Learning and Deep Learning at KTH.

The assignment was to fine-tune the [Whisper](https://openai.com/blog/whisper/) model for Swedish audio transcription. The task was broken down into 3 parts:

- feature pipeline: create features as inputs to the network, and store these features
- training pipeline: train the model on the data
- create an interface through which people can interact with the model

For part 2, check out [this notebook](https://github.com/tcsmaster/ID2223/tree/main/Lab2/swedish-fine-tune-training-pipeline.ipynb).

## Outline

In this notebook:
- I acquire the data for fine-tuning: the Swedish data from the Common Voice dataset, which can be found [here](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0/viewer/sv-SE/train)
- I prepare the data using built-in classes from Hugging Face:
   - feature extractor
   - tokenizer
   - processor, which combines the two above
- I store the data in Google Drive

Note: this notebook is an adaptation of Sanchit Gandhi's [Fine-Tune Whisper for Multilingual ASR with 🤗 Transformers](https://huggingface.co/blog/fine-tune-whisper)

In [None]:
!add-apt-repository -y ppa:jonathonf/ffmpeg-4
!apt update
!apt install -y ffmpeg

Next, we install the Python packages we need.

In [None]:
!pip install "datasets>=2.6.1"
!pip install git+https://github.com/huggingface/transformers
!pip install librosa

We can download and prepare each Common Voice split in a single line of code.

First, ensure you have accepted the terms of use on the [Hugging Face Hub](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0). Once you have accepted the terms, you will have full access to the dataset and be able to download the data locally.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
from datasets import load_dataset, DatasetDict, Audio

common_voice = DatasetDict()

common_voice["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "sv-SE", split="train", use_auth_token=True)
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "sv-SE", split="test", use_auth_token=True)

Next, we remove the unnecessary metadata columns from the dataset, as we only need the audio and the transcribed text.

In [None]:
common_voice = common_voice.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])

The ASR pipeline can be decomposed into three stages: 
- A feature extractor which pre-processes the raw audio inputs
- The model which performs the sequence-to-sequence mapping 
- A tokenizer which post-processes the model outputs to text format

The Whisper model has an associated feature extractor and tokenizer, called `WhisperFeatureExtractor` and `WhisperTokenizer`, respectively.

The Whisper feature extractor performs two operations:

- Pads / truncates the audio inputs to 30s: any audio inputs shorter than 30s are padded to 30s with silence (zeros), and those longer than 30s are truncated to 30s
- Converts the audio inputs to log-Mel spectrogram input features, a visual representation of the audio and the form of the input expected by the Whisper model
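
The padding / truncation step and the resulting feature shape can be illustrated with a minimal, dependency-free sketch. This is not the `WhisperFeatureExtractor` implementation: the helper name `pad_or_trim` is hypothetical, and the constants (hop length of 160 samples, 80 Mel bins) are the values I understand the `openai/whisper-small` checkpoint to use, so treat them as assumptions.

```python
SAMPLING_RATE = 16_000  # Hz, the rate the feature extractor expects
CHUNK_SECONDS = 30      # fixed length of one input window
HOP_LENGTH = 160        # samples between spectrogram frames (assumed)
N_MELS = 80             # number of Mel bins (assumed)

N_SAMPLES = SAMPLING_RATE * CHUNK_SECONDS  # samples in one 30 s window


def pad_or_trim(samples, length=N_SAMPLES):
    """Hypothetical helper: return a copy of `samples` exactly `length` long."""
    if len(samples) >= length:
        # Longer clips are cut off after 30 s
        return list(samples[:length])
    # Shorter clips are padded with silence (zeros) at the end
    return list(samples) + [0.0] * (length - len(samples))


short_clip = [0.1] * (5 * SAMPLING_RATE)   # a 5 s clip
long_clip = [0.1] * (45 * SAMPLING_RATE)   # a 45 s clip

padded = pad_or_trim(short_clip)
trimmed = pad_or_trim(long_clip)

# Number of spectrogram frames for one 30 s window
n_frames = N_SAMPLES // HOP_LENGTH

print(len(padded), len(trimmed))  # 480000 480000
print((N_MELS, n_frames))         # (80, 3000)
```

Under these assumptions, every example ends up as an 80 x 3000 log-Mel matrix regardless of the original clip length, which is why the stored features all have the same size.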

The Whisper model outputs a sequence of token ids. The tokenizer maps each of these token ids to its corresponding text string. For Swedish, we can load the pre-trained tokenizer and use it for fine-tuning without any further modifications. We simply have to specify the target language and the task. These arguments tell the tokenizer to prefix the language and task tokens to the start of the encoded label sequences.
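
To make the prefixing concrete, here is a plain-Python sketch of how a label sequence is laid out as text. It does not call the real tokenizer: `build_label_text` is a hypothetical helper, the special-token strings follow the format shown in the Hugging Face blog post this notebook adapts, and `<|sv|>` is assumed to be the Swedish language token.

```python
def build_label_text(sentence, language_token="<|sv|>", task_token="<|transcribe|>"):
    """Hypothetical helper: show the special tokens placed around a transcript."""
    prefix = [
        "<|startoftranscript|>",  # marks the start of the label sequence
        language_token,           # target language (assumed: Swedish)
        task_token,               # transcribe rather than translate
        "<|notimestamps|>",       # labels carry no timestamp tokens
    ]
    return "".join(prefix) + sentence + "<|endoftext|>"


label_text = build_label_text("Hej världen")
print(label_text)
```

The real tokenizer returns integer ids rather than strings; decoding those ids without skipping special tokens should show a layout like the one printed here.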

To simplify using the feature extractor and tokenizer, we can _wrap_ 
both into a single `WhisperProcessor` class. This processor object 
combines a `WhisperFeatureExtractor` and a `WhisperTokenizer`, 
and can be used on the audio inputs and model predictions as required. 
In doing so, we only need to keep track of two objects during training: 
the `processor` and the `model`:

In [None]:
from transformers import WhisperFeatureExtractor
from transformers import WhisperTokenizer
from transformers import WhisperProcessor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="Swedish", task="transcribe")
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Swedish", task="transcribe")

Then, since the feature extractor expects the audio to be sampled at 16 kHz, we resample the audio.

In [None]:
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

Finally, we define a preprocessing function that loads the audio (resampled to 16 kHz by the cast above), applies the feature extractor to the inputs and the tokenizer to the labels.

In [None]:
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array 
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids 
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

In [None]:
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=3)

Last but not least, we need to store our data somewhere. There are many ways to do that; here, we use Google Drive.

In [None]:
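import os
from google.colab import drive

# Mount Google Drive so the processed dataset outlives the Colab session
drive.mount('/content/gdrive')

# Keep a copy on the local Colab disk
common_voice.save_to_disk("common_voice")

# Create the target folder if it does not exist yet, then save a copy to Drive
os.makedirs("/content/gdrive/My Drive/common_voice", exist_ok=True)
common_voice.save_to_disk("/content/gdrive/My Drive/common_voice/")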