<a href="https://colab.research.google.com/github/sawadogosalif/audio_processing_playground/blob/main/fine_tuning_whisper_ASR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The ASR pipeline can be decomposed into three components:

+ A feature extractor, which pre-processes the raw audio inputs
+ The model, which performs the sequence-to-sequence mapping
+ A tokenizer, which post-processes the model outputs into text

Install all the dependencies

In [None]:
!pip install datasets
!pip install jiwer
!pip install transformers
!pip install gradio
!pip install huggingface_hub
!pip install torch
!pip install evaluate



Log in to Hugging Face through the terminal

In [None]:
# The output will ask you to paste your HF token.
!uv pip install datasets
!huggingface-cli login

Using Python 3.11.11 environment at: /usr
Audited 1 package in 170ms

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token

Define Values for Fine Tuning

https://huggingface.co/blog/fine-tune-whisper

In [None]:
whisper_model = "openai/whisper-large-v3-turbo"
dataset_name = "sawadogosalif/MooreFRCollectionsAudios"
audio_column = "audio"  # dataset column holding the audio
text_column = "transcript"  # dataset column holding the reference text

In [None]:
from datasets import load_dataset


Load essential data

In [None]:
dataset = load_dataset(dataset_name, split='train')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Resolving data files:   0%|          | 0/23 [00:00<?, ?it/s]

Loading dataset shards:   0%|          | 0/19 [00:00<?, ?it/s]

In [None]:
dataset

Dataset({
    features: ['audio', 'transcript', 'page', 'audio_sequence', 'duration'],
    num_rows: 14453
})

In [None]:

import re
BAD_CHARS_REGEX = r"[\\\\\!\-\;\:\"%\’\'\�\»\«\+\”\“\(\)\‘\*]"

# For batched processing
def clean_transcript_batch(examples):
    transcripts = examples.get("transcript", [])
    cleaned_transcripts = []

    for transcript in transcripts:
        # If not a string, add empty string
        if not isinstance(transcript, str):
            cleaned_transcripts.append("")
            continue

        try:
            # Avoid null bytes
            transcript = transcript.replace("\x00", "")

            # Drop transcripts whose UTF-8 encoding exceeds 1 MB (replaced by an empty string)
            if len(transcript.encode("utf-8")) > 1_000_000:
                cleaned_transcripts.append("")
                continue

            # Cleaning
            cleaned = re.sub(BAD_CHARS_REGEX, "", transcript)
            cleaned = re.sub(r"\s+", " ", cleaned).strip()

            cleaned_transcripts.append(cleaned)
        except Exception as e:
            print("Error on transcript:", e)
            cleaned_transcripts.append("")

    return {"transcript": cleaned_transcripts}


dataset = dataset.map(
    clean_transcript_batch,
    num_proc=8,
    batched=True,
    batch_size=10,
    remove_columns=["page", "duration", "audio_sequence"],
)
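To make the effect of the cleaning step concrete, here is a minimal standalone sketch that applies the same regular expression and whitespace normalization to a single string. The helper name `clean_text` and the sample sentence are illustrative assumptions and are not part of the notebook.

```python
import re

# Same character class as in the cell above.
BAD_CHARS_REGEX = r"[\\\\\!\-\;\:\"%\’\'\�\»\«\+\”\“\(\)\‘\*]"


def clean_text(text: str) -> str:
    """Hypothetical single-string version of the batched cleaner."""
    text = text.replace("\x00", "")          # drop null bytes
    text = re.sub(BAD_CHARS_REGEX, "", text)  # remove unwanted punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


sample = '  «Bonjour»  :   "a-b"  (test)!  '
cleaned = clean_text(sample)
print(cleaned)  # Bonjour ab test
```

Note that removing the hyphen joins the two halves of a hyphenated word ("a-b" becomes "ab"), which may or may not be desirable for a given transcript convention.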

Tokenize the Dataset

In [None]:
from transformers.models.whisper.tokenization_whisper import TO_LANGUAGE_CODE, TASK_IDS
from transformers import (
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    WhisperProcessor,
    WhisperTokenizer,
)


In [None]:
nearest_language = "hausa"  # Whisper-supported language used as the nearest stand-in; "yoruba" is an alternative
processor = WhisperProcessor.from_pretrained(whisper_model, language=nearest_language, task="transcribe")
tokenizer = WhisperTokenizer.from_pretrained(whisper_model, language=nearest_language, task="transcribe")
feature_extractor = WhisperFeatureExtractor.from_pretrained(whisper_model)
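Setting `language` and `task` on the tokenizer determines the special tokens placed in front of every label sequence. The sketch below only illustrates that prefix as a string, assuming the default setting without timestamp prediction. The helper `decoder_prefix` and the two-entry `LANGUAGE_CODES` table are hypothetical and do not call the library.

```python
# Hypothetical lookup with only the two languages mentioned in the notebook.
LANGUAGE_CODES = {"hausa": "ha", "yoruba": "yo"}


def decoder_prefix(language: str, task: str = "transcribe") -> str:
    """Build the special-token prefix expected for a language/task pair."""
    code = LANGUAGE_CODES[language]
    return f"<|startoftranscript|><|{code}|><|{task}|><|notimestamps|>"


prefix = decoder_prefix("hausa")
print(prefix)  # <|startoftranscript|><|ha|><|transcribe|><|notimestamps|>
```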


In [None]:
import torch
import gc

# Clear the PyTorch CUDA cache and run garbage collection
torch.cuda.empty_cache()
gc.collect()


71

In [None]:
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["transcript"], truncation=True, max_length=448).input_ids
    return batch


from datasets import Audio

# Decode the audio at 16 kHz, the sampling rate used for feature extraction
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names, num_proc=1)
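As a rough sanity check on what `prepare_dataset` produces, the arithmetic below works out the expected shape of one `input_features` entry. The 30-second window, hop length of 160 samples, and 128 mel bins are assumptions based on the default Whisper feature extractor settings for the large-v3 family; check `feature_extractor` itself for the actual values.

```python
SOURCE_RATE = 48_000   # original sampling rate mentioned in the comment above
TARGET_RATE = 16_000   # rate set by cast_column
CHUNK_SECONDS = 30     # assumed: audio is padded or truncated to 30 s
HOP_LENGTH = 160       # assumed: samples between successive frames
N_MELS = 128           # assumed: mel bins for the large-v3 family

downsample_factor = SOURCE_RATE // TARGET_RATE
samples_per_chunk = TARGET_RATE * CHUNK_SECONDS
frames_per_chunk = samples_per_chunk // HOP_LENGTH
feature_shape = (N_MELS, frames_per_chunk)

print(downsample_factor)   # 3
print(samples_per_chunk)   # 480000
print(feature_shape)       # (128, 3000)
```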


Split the dataset

In [None]:
dataset = dataset.train_test_split(test_size=0.05, seed=2024)

Load the model

In [None]:
model = WhisperForConditionalGeneration.from_pretrained(whisper_model)
model.generation_config.language = nearest_language
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None

Create a Data Collator

In [None]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was prepended in the previous tokenization step,
        # cut it here since it is added again later anyway
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)
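The label handling in the collator can be illustrated without tensors. The sketch below mirrors the three steps (pad to the longest sequence, mark padding with -100 so the loss ignores it, strip a shared start token) on plain Python lists. The function `pad_and_mask` and the token ids are made up for illustration.

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def pad_and_mask(labels: List[List[int]], start_token_id: int) -> List[List[int]]:
    """List-based sketch of the collator's label processing."""
    max_len = max(len(seq) for seq in labels)
    # Pad every sequence to the batch maximum, using -100 for padded slots.
    padded = [seq + [IGNORE_INDEX] * (max_len - len(seq)) for seq in labels]
    # If every row starts with the start token, drop that first column.
    if all(row[0] == start_token_id for row in padded):
        padded = [row[1:] for row in padded]
    return padded


batch = pad_and_mask([[1, 10, 11, 2], [1, 20, 2]], start_token_id=1)
print(batch)  # [[10, 11, 2], [20, 2, -100]]
```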



Training area

In [None]:
import evaluate
metric = evaluate.load("wer")

In [None]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
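For intuition about the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The following is a minimal sketch for a single sentence pair with a non-empty reference. It is not the `evaluate` implementation, which aggregates over the whole evaluation set, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer_percent = 100 * word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer_percent, 2))  # 33.33
```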


In [None]:
import torch

torch.cuda.empty_cache()  # Free unused blocks from the CUDA cache
torch.cuda.ipc_collect()  # Clean up inter-process memory


In [None]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="SaChi-ASR",  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=250,
    max_steps=2000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=16,
    predict_with_generate=True,
    save_steps=2000,
    eval_steps=400,
    logging_steps=25,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)



trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

trainer.train()


wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: sawallesalfo (sawalle) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`...


Step,Training Loss,Validation Loss,Wer
400,0.354,0.325892,25.317925


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
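The training arguments imply a fixed training budget, which the arithmetic below spells out. It is a sketch under stated assumptions: a single GPU, a test split rounded up to the next whole row, and the default linear warmup-then-decay learning-rate schedule. The function `linear_warmup_decay` is a hypothetical re-implementation for illustration, not a library call.

```python
import math

# Values copied from the cells above.
NUM_ROWS = 14_453
TEST_FRACTION = 0.05
PER_DEVICE_BATCH = 16
GRAD_ACCUM_STEPS = 1
MAX_STEPS = 2000
WARMUP_STEPS = 250
EVAL_STEPS = 400
BASE_LR = 1e-5
NUM_DEVICES = 1  # assumption: a single GPU

# Assumption: the test split is rounded up and the rest is used for training.
test_rows = math.ceil(NUM_ROWS * TEST_FRACTION)
train_rows = NUM_ROWS - test_rows

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES
samples_seen = effective_batch * MAX_STEPS
approx_epochs = samples_seen / train_rows
eval_points = list(range(EVAL_STEPS, MAX_STEPS + 1, EVAL_STEPS))


def linear_warmup_decay(step: int) -> float:
    """Learning rate at `step`, assuming linear warmup then linear decay."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    return BASE_LR * max(0, MAX_STEPS - step) / (MAX_STEPS - WARMUP_STEPS)


print(train_rows, test_rows)       # 13730 723
print(effective_batch)             # 16
print(samples_seen)                # 32000
print(round(approx_epochs, 2))     # 2.33
print(eval_points)                 # [400, 800, 1200, 1600, 2000]
print(linear_warmup_decay(125))    # halfway through warmup
print(linear_warmup_decay(2000))   # 0.0 at the final step
```

Under these assumptions the run covers a little over two passes through the training split, and the logged row at step 400 is the first of five evaluations.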


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Save the processor so that the preprocessor_config.json file is included

In [None]:
1

Real deal

Pushing to Hugging Face

In [None]:
kwargs = {
    "dataset_tags": dataset_name,
    "dataset": dataset_name,  # a 'pretty' name for the training dataset
    "dataset_args": "config: train, split: train",
    "model_name": "SaCHi_ASR",  # a 'pretty' name for our model
    "finetuned_from": whisper_model,
    "tasks": "automatic-speech-recognition"
}

In [None]:
trainer.push_to_hub(**kwargs)
print("\nALL DONE!!")