In [5]:
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

common_voice["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "nl", split="train+validation", use_auth_token=True)
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "nl", split="test", use_auth_token=True)

print(common_voice)

Found cached dataset common_voice_11_0 (/Users/hannatoenbreker/.cache/huggingface/datasets/mozilla-foundation___common_voice_11_0/nl/11.0.0/2c65b95d99ca879b1b1074ea197b65e0497848fd697fdb0582e0f6b75b6f4da0)
Found cached dataset common_voice_11_0 (/Users/hannatoenbreker/.cache/huggingface/datasets/mozilla-foundation___common_voice_11_0/nl/11.0.0/2c65b95d99ca879b1b1074ea197b65e0497848fd697fdb0582e0f6b75b6f4da0)


DatasetDict({
    train: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 41054
    })
    test: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 10743
    })
})


In [6]:
from huggingface_hub import notebook_login

notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (osxkeychain).
Your token has been saved to /Users/hannatoenbreker/.cache/huggingface/token
Login successful


In [7]:
common_voice = common_voice.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])

## Prepare feature extractor and tokenizer 
### The ASR pipeline can be decomposed into three components:

- A feature extractor which pre-processes the raw audio inputs
- The model which performs the sequence-to-sequence mapping
- A tokenizer which post-processes the model outputs to text format

## Feature extractor
- First, the audio samples are padded or truncated so that every sample has an input length of 30 s: shorter clips are padded with zeros (silence), longer ones are cut off at 30 s.
- Second, the Whisper feature extractor converts the padded audio arrays to log-Mel spectrograms. These spectrograms are a visual representation of the frequencies in a signal over time.
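
As a rough illustration of the first step, the padding/truncation can be sketched in plain Python. This is not the WhisperFeatureExtractor implementation; the helper name `pad_or_trim` is made up for this example, and the hop length of 160 samples is an assumption used only to estimate how many spectrogram frames a 30 s window yields.

```python
SAMPLING_RATE = 16_000   # Hz, the rate Whisper expects
CHUNK_SECONDS = 30       # fixed input window in seconds
HOP_LENGTH = 160         # assumed spectrogram hop size, in samples


def pad_or_trim(samples, target_len=SAMPLING_RATE * CHUNK_SECONDS):
    """Zero-pad or truncate a list of samples to exactly target_len (hypothetical helper)."""
    trimmed = samples[:target_len]
    return trimmed + [0.0] * (target_len - len(trimmed))


short_clip = [0.1] * (5 * SAMPLING_RATE)    # 5 s of audio
long_clip = [0.1] * (40 * SAMPLING_RATE)    # 40 s of audio

padded = pad_or_trim(short_clip)    # zeros appended up to 30 s
trimmed = pad_or_trim(long_clip)    # cut off at 30 s
num_frames = len(padded) // HOP_LENGTH

print(len(padded), len(trimmed), num_frames)  # 480000 480000 3000
```

Because every clip ends up with the same number of samples, every spectrogram has the same number of frames, which is why the input features need no further padding later on.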

In [8]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

## Whisper Tokenizer
The Whisper model outputs text tokens, each of which is the index of the predicted text in the dictionary of vocabulary items. The tokenizer maps a sequence of token ids to the actual text string (e.g. [1169, 3797, 3332] -> "de kat zat"). This direction is called decoding; the reverse, turning text into token ids, is encoding.
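
The two directions can be illustrated with a toy lookup table. The vocabulary below is invented for this sketch (it reuses the ids from the example above but is not Whisper's real vocabulary), and the real tokenizer works on sub-word pieces rather than whole words.

```python
# Toy vocabulary: ids and pieces are made up, not taken from the Whisper tokenizer.
TOY_VOCAB = {1169: "de", 3797: " kat", 3332: " zat"}
PIECE_TO_ID = {piece: idx for idx, piece in TOY_VOCAB.items()}


def decode(token_ids):
    """Token ids -> text (applied to model outputs)."""
    return "".join(TOY_VOCAB[i] for i in token_ids)


def encode(text):
    """Text -> token ids (applied to reference transcriptions)."""
    words = text.split(" ")
    # Every word except the first carries its leading space, as in the toy vocabulary.
    pieces = [words[0]] + [" " + w for w in words[1:]]
    return [PIECE_TO_ID[p] for p in pieces]


print(decode([1169, 3797, 3332]))  # de kat zat
print(encode("de kat zat"))        # [1169, 3797, 3332]
```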

In [9]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="Dutch", task="transcribe")



When encoding the transcriptions, the tokenizer adds 'special tokens' to the start and end of the sequence, including the start/end-of-transcript tokens, the language token and the task token (as specified by the arguments in the previous step). When decoding the label ids, we have the option of 'skipping' these special tokens, so that we get back a string in the form of the original input.

In [10]:
input_str = common_voice["train"][0]["sentence"]
labels = tokenizer(input_str).input_ids
decoded_with_special = tokenizer.decode(labels, skip_special_tokens=False)
decoded_str = tokenizer.decode(labels, skip_special_tokens=True)

print(f"Input:                 {input_str}")
print(f"Decoded w/ special:    {decoded_with_special}")
print(f"Decoded w/out special: {decoded_str}")
print(f"Are equal:             {input_str == decoded_str}")

To support 'mp3' decoding with `torchaudio>=0.12.0`, please install `ffmpeg4` system package. On Google Colab you can run:

	!add-apt-repository -y ppa:jonathonf/ffmpeg-4 && apt update && apt install -y ffmpeg

and restart your runtime. Alternatively, you can downgrade `torchaudio`:

	pip install "torchaudio<0.12"`.

Otherwise 'mp3' files will be decoded with `librosa`.


Input:                 Wij hebben ons nauwgezet aan die wens gehouden.
Decoded w/ special:    <|startoftranscript|><|nl|><|transcribe|><|notimestamps|>Wij hebben ons nauwgezet aan die wens gehouden.<|endoftext|>
Decoded w/out special: Wij hebben ons nauwgezet aan die wens gehouden.
Are equal:             True


To simplify using the feature extractor and tokenizer, we can wrap both in a single WhisperProcessor class. The processor object wraps a WhisperFeatureExtractor and a WhisperTokenizer and can be used on the audio inputs and the model predictions as required. This way we only need to keep track of two objects during training: the processor and the model.

In [11]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Dutch", task="transcribe")

## Prepare our data

In [12]:
print(common_voice["train"][0])

{'audio': {'path': '/Users/hannatoenbreker/.cache/huggingface/datasets/downloads/extracted/bd6dfdacc71b2586341c3d59e48e79c8959528a3ed078349419a94a89b10e878/common_voice_nl_23373535.mp3', 'array': array([], dtype=float32), 'sampling_rate': 48000}, 'sentence': 'Wij hebben ons nauwgezet aan die wens gehouden.'}


We can see that we have a 1-dimensional input audio array and the corresponding transcription. The sampling rate of our audio has to match that of the Whisper model (16 kHz). Since our input audio is sampled at 48 kHz, we need to downsample it to 16 kHz before passing it to the Whisper feature extractor. We set the audio inputs to the correct sampling rate with the dataset's cast_column method.
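
To make the effect of downsampling concrete, here is a small arithmetic sketch. The helper `resampled_length` is hypothetical and only estimates the sample count; actual resampling libraries also low-pass filter the signal and may round the length differently.

```python
from fractions import Fraction


def resampled_length(num_samples, orig_sr, target_sr):
    """Approximate sample count after resampling; the clip duration stays the same."""
    return int(num_samples * Fraction(target_sr, orig_sr))


orig_sr, target_sr = 48_000, 16_000
n_orig = 5 * orig_sr                                   # a 5 s clip at 48 kHz
n_new = resampled_length(n_orig, orig_sr, target_sr)   # the same clip at 16 kHz

print(n_orig, n_new)                        # 240000 80000
print(n_orig / orig_sr, n_new / target_sr)  # 5.0 5.0
```

Going from 48 kHz to 16 kHz keeps one third of the samples while the duration in seconds is unchanged.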

In [13]:
from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))
print(common_voice["train"][0])

{'audio': {'path': '/Users/hannatoenbreker/.cache/huggingface/datasets/downloads/extracted/bd6dfdacc71b2586341c3d59e48e79c8959528a3ed078349419a94a89b10e878/common_voice_nl_23373535.mp3', 'array': array([], dtype=float32), 'sampling_rate': 16000}, 'sentence': 'Wij hebben ons nauwgezet aan die wens gehouden.'}


In [14]:
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array 
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids 
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch



In [15]:
common_voice = common_voice.map(prepare_dataset, num_proc=4)


## Define a Data Collator
### We can leverage the WhisperProcessor we defined earlier to perform both the feature extractor and the tokenizer operations

The input_features are already padded to 30 s and converted to a log-Mel spectrogram of fixed dimension, so all we have to do is convert them to batched PyTorch tensors. Stacking the spectrograms into one tensor lets the model process a whole batch in a single forward pass. We do this with the feature extractor's .pad method with return_tensors="pt".

Machine learning models usually process data in batches for efficiency. To form a batch, all sequences in it must have the same length; padding guarantees this, so the batch can be processed in parallel during training and inference.

The labels are not padded yet. We first pad the sequences to the maximum length in the batch with the tokenizer's .pad method. The padding tokens are then replaced by -100 so that these tokens are not taken into account when computing the loss. We then cut the start-of-transcript token from the beginning of the label sequence, since it is added again later during training. We can leverage the WhisperProcessor defined earlier to perform both the feature extractor and the tokenizer operations.
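
The label handling described above can be traced on a tiny batch in plain Python. This is only a sketch of the idea: the token ids, `PAD_ID` and `BOS_ID` are made-up values, and the helper `collate_labels` is hypothetical (the actual collator in the next cell does the same with PyTorch tensors).

```python
PAD_ID = 0           # hypothetical padding id for this toy example
BOS_ID = 1           # hypothetical start-of-transcript id
IGNORE_INDEX = -100  # label value that the loss skips


def collate_labels(label_seqs):
    """Pad label sequences, mask the padding with -100, drop a shared leading BOS."""
    max_len = max(len(seq) for seq in label_seqs)
    padded, masks = [], []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        padded.append(seq + [PAD_ID] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding

    # Replace padded positions with IGNORE_INDEX, using the mask rather than the id.
    labels = [
        [tok if m == 1 else IGNORE_INDEX for tok, m in zip(row, mask)]
        for row, mask in zip(padded, masks)
    ]

    # Cut the BOS token only if every sequence in the batch starts with it.
    if all(row[0] == BOS_ID for row in labels):
        labels = [row[1:] for row in labels]
    return labels


batch = collate_labels([[1, 7, 8, 9, 2], [1, 5, 2]])
print(batch)  # [[7, 8, 9, 2], [5, 2, -100, -100]]
```

Masking by attention mask instead of by token id matters when the padding id coincides with a real token id, since real tokens must never be ignored.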

In [16]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

# @dataclass generates __init__ (and __repr__, __eq__) from the annotated fields, so the collator only needs to declare `processor`.
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any  # Processor will be used for padding our features

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # We separate input features (audio) and labels (text) as they need different padding

        # Prepare the audio inputs
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        
        # Pad the audio inputs to ensure they all have the same length
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        # Pad the labels to the maximum length label in our batch
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # Replace padding tokens with -100 so they are ignored when calculating loss
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was prepended in the previous tokenization step,
        # cut it here, as it is added again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        # Add the prepared labels to our batch
        batch["labels"] = labels

        # Return the batch which now contains our input features and labels, both correctly padded
        return batch

In [17]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
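
Why -100? PyTorch's `CrossEntropyLoss` uses `ignore_index=-100` by default, so positions carrying that label contribute nothing to the loss. The plain-Python sketch below mimics that behaviour on toy numbers; `masked_cross_entropy` is a hypothetical helper written for illustration, not the function used during training.

```python
import math

IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX.

    Assumes at least one position is not ignored.
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # padded position: skipped entirely
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[label]
        count += 1
    return total / count


# Three positions over a 3-token vocabulary; the last position is padding.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [5.0, 0.0, 0.0]]
loss_padded = masked_cross_entropy(logits, [0, 1, IGNORE_INDEX])
loss_unpadded = masked_cross_entropy(logits[:2], [0, 1])

print(loss_padded == loss_unpadded)  # True
```

The padded batch yields the same loss as the unpadded one, which is the purpose of the replacement in the collator.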


## Evaluate our results with WER

In [None]:
import evaluate

# defining metric for evaluation
metric = evaluate.load("wer")

In [None]:
def compute_metrics(pred):
    # Get predictions and labels from the output of the model
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # Replace -100 (typically representing ignored tokens) with the pad_token_id.
    # This is done to exclude such tokens from the Word Error Rate (WER) calculation.
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # Decoding the tokenized text back to readable text.
    # `skip_special_tokens=True` will discard special tokens (like padding, start, end tokens) during decoding.
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    
    # Compute Word Error Rate (WER), which is a common metric for speech recognition systems.
    # It's calculated as the distance between predicted and reference sequences divided by the total number of reference words.
    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
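
To show what the metric measures, here is a minimal from-scratch sketch of WER as word-level edit distance divided by the number of reference words. It is an illustration only: the `evaluate` metric loaded above is what the notebook actually uses, and the function names here are made up.

```python
def word_edit_distance(ref_words, hyp_words):
    """Minimum number of word substitutions, insertions and deletions (Levenshtein)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, predictions):
    """Total word errors divided by total reference words, over all sentence pairs."""
    errors = sum(word_edit_distance(r.split(), p.split())
                 for r, p in zip(references, predictions))
    total_words = sum(len(r.split()) for r in references)
    return errors / total_words


reference = "wij hebben ons nauwgezet aan die wens gehouden"
prediction = "wij hebben nauwgezet aan die wens gehouden"   # "ons" is missing
print(100 * wer([reference], [prediction]))  # 12.5
```

One deleted word out of eight reference words gives a WER of 12.5%, so lower values are better.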

## Initializing our model
Here we load the pretrained Whisper small checkpoint from OpenAI. Whisper is designed for speech recognition and pretrained on a large corpus of data, which makes it a good starting point for further fine-tuning.

In [None]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

In [None]:
# Do not force the decoder to start with specific token ids during generation.
model.config.forced_decoder_ids = None
# Do not suppress any tokens (i.e. forbid them from being generated) during generation.
model.config.suppress_tokens = []

## Training our model
This section defines the arguments to be used for training the model, including learning rate, maximum number of steps, gradient accumulation steps, and more. The Seq2SeqTrainingArguments class from the transformers library provides a standard way of encapsulating these arguments.

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-dutch",  # change to a repo name of your choice
    per_device_train_batch_size=8,  # Batch size per device during training; adjust to the available memory.
    gradient_accumulation_steps=2,  # increase by 2x for every 2x decrease in per_device_train_batch_size
    learning_rate=1e-5,  # Peak learning rate: controls how much the weights change at each update.
    warmup_steps=500,  # Steps over which the learning rate ramps up from 0; helps prevent divergence early in training.
    max_steps=4000,  # Total number of training steps to perform.
    gradient_checkpointing=True,  # Save memory by recomputing activations during the backward pass (the gradient computation), at the cost of a slower backward pass.
    fp16=False,  # 16-bit (mixed) precision is disabled here, so training runs in 32-bit; on a supported GPU, fp16=True is faster and uses less memory.
    evaluation_strategy="steps",  # Evaluate every `eval_steps` training steps.
    per_device_eval_batch_size=8,  # Number of samples processed at once per device during evaluation.
    predict_with_generate=True,  # Use generate() for predictions during evaluation; if False, evaluation returns logits from a forward pass instead of generated token ids.
    generation_max_length=225,  # Maximum length (in tokens) of the sequences generated during evaluation.
    save_steps=1000,  # Save a checkpoint (a snapshot of the model weights from which training can resume) every 1000 steps.
    eval_steps=1000,  # Evaluate the model every 1000 steps.
    logging_steps=25,  # Log training metrics every 25 steps.
    report_to=["tensorboard"], 
    load_best_model_at_end=True, # Whether to load the best model found during training at the end of training.
    metric_for_best_model="wer", # wer is the metric we want to use to compare the best models
    greater_is_better=False, # Indicates whether a higher metric value is better. 'False' means a lower WER is better.
    push_to_hub=True
)
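
To see how `learning_rate`, `warmup_steps` and `max_steps` interact, the sketch below computes the learning rate at a given step. It assumes the default linear schedule (linear warmup from 0, then linear decay to 0); the function `linear_schedule_lr` is written for illustration and is not part of transformers.

```python
def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=500, max_steps=4000):
    """Learning rate at `step` under linear warmup followed by linear decay (assumed schedule)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = (max_steps - step) / (max_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


for step in (0, 250, 500, 2250, 4000):
    print(step, linear_schedule_lr(step))
```

The rate starts at 0, reaches its peak of 1e-5 at step 500, is back at half the peak by step 2250 and returns to 0 at step 4000.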

The Seq2SeqTrainer class from the transformers library is used to initialize the trainer. It receives the training arguments, model, training and evaluation datasets, and the metrics computation function defined earlier. The Seq2SeqTrainer is specifically designed for sequence-to-sequence models (like speech recognition models).

In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,  # passed so the feature extractor config is saved with each checkpoint
)

In [None]:
trainer.train()
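
For a sense of scale, a back-of-the-envelope calculation of what this run covers, using the values from the training arguments and the dataset printout above. The single-device setup (`num_devices = 1`) is an assumption.

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
max_steps = 4000
eval_steps = 1000
num_train_rows = 41054   # size of the train split printed earlier
num_devices = 1          # assumption: training on a single device

# Samples that contribute to one optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
samples_seen = effective_batch * max_steps
epochs = samples_seen / num_train_rows
num_evaluations = max_steps // eval_steps

print(effective_batch, samples_seen, round(epochs, 2), num_evaluations)  # 16 64000 1.56 4
```

Under these assumptions the model sees each training example about one and a half times, and WER is computed on the test split four times.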

## Share our results
### If we would like to share our training results on the hub
#### This requires `push_to_hub=True` in the training arguments above, which is already set

In [None]:
kwargs = {
    "dataset_tags": "mozilla-foundation/common_voice_11_0",
    "dataset": "Common Voice 11.0",  # a 'pretty' name for the training dataset
    "dataset_args": "config: nl, split: test",
    "language": "nl",
    "model_name": "Whisper Dutch - RTL",  # a 'pretty' name for your model
    "finetuned_from": "openai/whisper-small",
    "tasks": "automatic-speech-recognition",
    "tags": "hf-asr-leaderboard",
}

trainer.push_to_hub(**kwargs)

This code loads the fine-tuned model and its processor from the Hugging Face Hub. The WhisperProcessor handles the data processing steps the Whisper model requires.

In [None]:
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("hannatoenbreker/whisper-dutch")
processor = WhisperProcessor.from_pretrained("hannatoenbreker/whisper-dutch")

Next, a speech recognition pipeline is built from the fine-tuned model, and a transcribe function uses it to turn audio input into text. A Gradio interface is then created and launched. Gradio is a Python library for quickly building user interfaces to prototype machine learning models. Here it uses a microphone input that lets users record audio directly from their microphone.

In [None]:
from transformers import pipeline
import gradio as gr

pipe = pipeline(model="hannatoenbreker/whisper-dutch")  # change to "your-username/the-name-you-picked"

def transcribe(audio):
    text = pipe(audio)["text"]
    return text

iface = gr.Interface(
    fn=transcribe, 
    inputs=gr.Audio(source="microphone", type="filepath"), 
    outputs="text",
    title="Whisper Small Dutch",
    description="Realtime demo for Dutch speech recognition using a fine-tuned Whisper small model.",
)

iface.launch()