<a href="https://colab.research.google.com/github/sumanlaraee/AI-ML/blob/main/peft_bnb_whisper_large_v2_training_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning Whisper-large-V2 on Colab using PEFT-LoRA + BNB INT8 training

In this Colab, we present a step-by-step guide to fine-tuning Whisper on any multilingual ASR dataset using Hugging Face 🤗 Transformers and 🤗 PEFT. With 🤗 PEFT and `bitsandbytes`, you can train `whisper-large-v2` on a Colab T4 GPU (16 GB VRAM). Most of this notebook is adapted from [fine_tune_whisper.ipynb](https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/fine_tune_whisper.ipynb#scrollTo=BRdrdFIeU78w) to train with PEFT LoRA + BNB INT8.

For more details on the model, datasets and metrics, refer to the blog post [Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers](https://huggingface.co/blog/fine-tune-whisper).



## Initial Setup

In [None]:
!pip install "datasets>=2.6.1"          #required (quoted so the shell does not treat >= as a redirect)
!pip install git+https://github.com/huggingface/transformers
!pip install librosa
!pip install "evaluate>=0.3.0"
!pip install jiwer
!pip install gradio
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git@main

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-7l10z04u
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-7l10z04u
  Resolved https://github.com/huggingface/transformers to commit 5275ef6f3d4a1a78a25e958496cde48fd0257dc2
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


Linking the notebook to the Hub is straightforward - it simply requires entering your Hub authentication token when prompted. Find your Hub authentication token [here](https://huggingface.co/settings/tokens):

In [None]:
# Select CUDA device index       #required
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
model_name_or_path = "openai/whisper-large-v2"
language = "pashto"
language_abbr = "ps"  # ISO 639-1 code for Pashto ("mr" is Marathi)
task = "transcribe"
dataset_name = "mozilla-foundation/common_voice_11_0"

## Load Dataset

In [None]:
import os
import torch             #required
import torchaudio
import json
from datasets import Dataset, DatasetDict, Audio
from transformers import WhisperFeatureExtractor, WhisperTokenizer, WhisperProcessor, TrainingArguments, Trainer
from dataclasses import dataclass
from typing import Any, Dict, List, Union

# Set model details
model_name_or_path = "openai/whisper-large-v2"
language = "pashto"
task = "transcribe"

# Load processor components
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_name_or_path)
tokenizer = WhisperTokenizer.from_pretrained(model_name_or_path, language=language, task=task)
processor = WhisperProcessor.from_pretrained(model_name_or_path, language=language, task=task)

# Define paths
AUDIO_DIR = "/content/drive/MyDrive/audios"
PROCESSED_DIR = "/content/processed_audio"
TRANSCRIPT_PATH = "/content/sentences.json"
os.makedirs(PROCESSED_DIR, exist_ok=True)

# Convert `.mpeg` and `.wav` files to 16kHz `.wav`
def preprocess_audio():
    for file in os.listdir(AUDIO_DIR):
        if file.endswith((".mpeg", ".wav")):
            input_path = os.path.join(AUDIO_DIR, file)
            output_path = os.path.join(PROCESSED_DIR, file.rsplit(".", 1)[0] + ".wav")
            waveform, sample_rate = torchaudio.load(input_path)
            if sample_rate != 16000:
                transform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
                waveform = transform(waveform)
            torchaudio.save(output_path, waveform, 16000, encoding="PCM_S", bits_per_sample=16)
preprocess_audio()

# Load transcriptions
with open(TRANSCRIPT_PATH, "r", encoding="utf-8") as f:
    data = json.load(f)
sentences = data[2]["data"]

# Match audio files with transcriptions
dataset_dict = {"audio": [], "text": []}
for entry in sentences:
    for file in os.listdir(PROCESSED_DIR):
        if entry["sentence_id"] in file:
            dataset_dict["audio"].append(os.path.join(PROCESSED_DIR, file))
            dataset_dict["text"].append(entry["sentence"])
            break

# Convert dictionary to Hugging Face Dataset
dataset = Dataset.from_dict(dataset_dict).cast_column("audio", Audio(sampling_rate=16000))
split_dataset = dataset.train_test_split(test_size=0.2, seed=42)
dataset = DatasetDict({"train": split_dataset["train"], "validation": split_dataset["test"]})

# Function to process dataset
import numpy as np

def prepare_dataset(batch):
    audio = batch["audio"]

    # Extract features
    input_features = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]

    # ✅ Ensure fixed length of 3000
    expected_seq_length = 3000
    current_length = input_features.shape[-1]

    if current_length > expected_seq_length:
        input_features = input_features[..., :expected_seq_length]  # Truncate
    elif current_length < expected_seq_length:
        pad_width = expected_seq_length - current_length
        input_features = np.pad(
            input_features, ((0, 0), (0, pad_width)), mode="constant", constant_values=0
        )  # Pad with zeros

    batch["input_features"] = input_features
    batch["labels"] = tokenizer(batch["text"]).input_ids

    return batch



# Apply to dataset
dataset = dataset.map(prepare_dataset, remove_columns=["audio", "text"])

# ✅ Modified Data Collator to Fix Padding Issues
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Extract input features
        input_features = [{"input_features": feature["input_features"]} for feature in features if "input_features" in feature]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Extract labels and pad them
        label_features = [{"input_ids": feature["labels"]} for feature in features if "labels" in feature]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # Ensure labels are properly masked
        labels = labels_batch["input_ids"]
        labels[labels_batch.attention_mask == 0] = -100  # Ignore padding tokens in loss calculation

        # The input features have shape (batch, n_mels, 3000) and are independent of the
        # label length, so they must not be padded or cropped to match the labels.

        batch["labels"] = labels

        return batch

# Use the fixed data collator
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Map:   0%|          | 0/80 [00:00<?, ? examples/s]

Map:   0%|          | 0/20 [00:00<?, ? examples/s]
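The fixed length of 3000 enforced in `prepare_dataset` follows from Whisper's input window. As a rough check, assuming the usual Whisper feature-extractor settings (16 kHz audio, 30-second chunks, and a hop of 160 samples between spectrogram frames, none of which are spelled out in this notebook), the frame count works out as follows:

```python
# Assumed Whisper feature-extractor settings (not stated in this notebook).
SAMPLING_RATE = 16_000   # samples per second
CHUNK_LENGTH_S = 30      # seconds of audio per model input
HOP_LENGTH = 160         # samples between consecutive spectrogram frames

samples_per_chunk = SAMPLING_RATE * CHUNK_LENGTH_S
frames_per_chunk = samples_per_chunk // HOP_LENGTH

print(samples_per_chunk)  # 480000
print(frames_per_chunk)   # 3000
```

Under these assumptions, any clip shorter than 30 seconds is padded up to 3000 frames and any longer clip is cut off at that point, which is what the truncate/pad branch in `prepare_dataset` guards.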

In [None]:
import evaluate

metric = evaluate.load("wer")

We then simply have to define a function that takes our model
predictions and returns the WER metric. This function, called
`compute_metrics`, first replaces `-100` with the `pad_token_id`
in the `label_ids` (undoing the step we applied in the
data collator to ignore padded tokens correctly in the loss).
It then decodes the predicted and label ids to strings. Finally,
it computes the WER between the predictions and reference labels:

In [None]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
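To make the metric concrete, here is a minimal pure-Python sketch of what a word error rate measures: the word-level edit distance divided by the number of reference words. The helpers `word_error_rate` and `restore_padding` are hypothetical names used for illustration only; the notebook itself relies on `evaluate.load("wer")`, whose implementation and text handling may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumes a non-empty reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def restore_padding(label_ids, pad_token_id, ignore_index=-100):
    """Undo the collator's masking so the labels can be decoded again."""
    return [[pad_token_id if t == ignore_index else t for t in row] for row in label_ids]


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
wer = 100 * word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 2))  # 33.33

print(restore_padding([[5, 6, -100]], pad_token_id=0))  # [[5, 6, 0]]
```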

### Load a Pre-Trained Checkpoint

Now let's load the pre-trained Whisper `large-v2` checkpoint. Again, this
is trivial through use of 🤗 Transformers!

In [None]:
from transformers import WhisperForConditionalGeneration    #required
model = WhisperForConditionalGeneration.from_pretrained(model_name_or_path)

Override generation arguments - no tokens are forced as decoder outputs (see [`forced_decoder_ids`](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate.forced_decoder_ids)), no tokens are suppressed during generation (see [`suppress_tokens`](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate.suppress_tokens)):

In [None]:
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

In [None]:
!pip install --upgrade peft #required




In [None]:
import torch
model = torch.compile(model)


### Post-processing on the model

Finally, we need to apply some post-processing to the 8-bit model to enable training: we freeze all layers and cast the layer norms to `float32` for stability. We also cast the output of the last layer to `float32` for the same reason.

### Apply LoRA

Here comes the magic with `peft`! Let's load a `PeftModel` and specify that we are going to use low-rank adapters (LoRA) via the `get_peft_model` utility function from `peft`.

In [None]:
from transformers import AutoModelForSpeechSeq2Seq, BitsAndBytesConfig
import torch

# Define quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Use 4-bit quantization
    bnb_4bit_compute_dtype=torch.float16,  # FP16 computation
    bnb_4bit_use_double_quant=True,  # Additional memory optimization
    llm_int8_enable_fp32_cpu_offload=True  # Offload FP32 layers to CPU
)

# Load model with quantization
model_name = "openai/whisper-large-v2"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name,
    quantization_config=quantization_config,  # Apply quantization
    device_map="auto"  # Auto-assign model to available GPUs
)
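The cell above only loads the quantized base model. The step described in the text, wrapping it with LoRA adapters, could look roughly like the sketch below. The hyperparameters in `LORA_KWARGS` and the helper name `attach_lora` are assumptions chosen for illustration, not values taken from this notebook.

```python
# Hypothetical LoRA hyperparameters, chosen for illustration only.
LORA_KWARGS = {
    "r": 32,                                 # rank of the low-rank update
    "lora_alpha": 64,                        # scaling factor applied to the update
    "target_modules": ["q_proj", "v_proj"],  # attention projections to adapt
    "lora_dropout": 0.05,
    "bias": "none",
}


def attach_lora(quantized_model):
    """Freeze the quantized base model and wrap it with trainable LoRA adapters."""
    # Imported lazily so the sketch can be loaded without `peft` installed.
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # Freezes the base weights and upcasts the remaining half-precision
    # parameters (such as layer norms) to float32 for stability.
    prepared = prepare_model_for_kbit_training(quantized_model)
    peft_model = get_peft_model(prepared, LoraConfig(**LORA_KWARGS))
    peft_model.print_trainable_parameters()
    return peft_model
```

In the notebook this would be called once, right after loading, as `model = attach_lora(model)`, so that the `Trainer` only updates the adapter weights.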


With LoRA, we train only about **1%** of the model's parameters, thereby performing **Parameter-Efficient Fine-Tuning**.
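The 1% figure can be sanity-checked with a little arithmetic. The sketch below assumes rank-32 adapters on the query and value projections of every attention block, and an approximate base size of 1.55 billion parameters; both are assumptions for illustration, since the LoRA settings are not shown in this notebook. The model dimensions come from the `WhisperConfig` printed in the evaluation section.

```python
# Whisper-large-v2 dimensions, as printed in the model config later in this notebook.
D_MODEL = 1280
ENCODER_LAYERS = 32
DECODER_LAYERS = 32

# Assumptions for illustration: rank-32 adapters on q_proj and v_proj only.
RANK = 32
TARGETS_PER_ATTENTION = 2
BASE_PARAMS_APPROX = 1.55e9  # approximate size of Whisper large-v2

# Each encoder layer has one attention block; each decoder layer has two
# (self-attention and cross-attention).
attention_blocks = ENCODER_LAYERS * 1 + DECODER_LAYERS * 2

# A LoRA adapter on a d x d projection adds two matrices: (d x r) and (r x d).
params_per_adapter = RANK * (D_MODEL + D_MODEL)

lora_params = attention_blocks * TARGETS_PER_ATTENTION * params_per_adapter
print(lora_params)  # 15728640

trainable_percent = 100 * lora_params / BASE_PARAMS_APPROX
print(round(trainable_percent, 1))  # 1.0

# Stored as 32-bit floats, the adapter weights take roughly this many megabytes.
adapter_megabytes = lora_params * 4 / 1e6
print(round(adapter_megabytes, 1))  # 62.9
```

Under these assumptions, the adapter size is in line with the roughly 63 MB `adapter_model.bin` reported in the upload output further down.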

### Define the Training Configuration

In the final step, we define all the parameters related to training. For more detail on the training arguments, refer to the Seq2SeqTrainingArguments [docs](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments).

In [None]:
model.gradient_checkpointing_enable()


In [None]:
import torch
torch.cuda.empty_cache()


In [None]:
# Reuse the quantized `model` loaded above, with LoRA adapters attached via
# `get_peft_model` as described under "Apply LoRA". Reloading the full-precision
# checkpoint with `WhisperForConditionalGeneration.from_pretrained` here would
# replace the quantized model and exhaust the T4's memory (see the out-of-memory
# error recorded after the training cell).

training_args = TrainingArguments(
    output_dir="./whisper-checkpoints",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    evaluation_strategy="steps",
    save_steps=500,
    eval_steps=500,
    logging_steps=100,
    learning_rate=1e-5,
    warmup_steps=500,
    save_total_limit=2,
    num_train_epochs=3,
    gradient_accumulation_steps=8,
    fp16=True,
    report_to="none",  # ✅ Disables wandb properly
    remove_unused_columns=False,
    label_names=["labels"],  # required for PeftModel, see the notes below
)
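It helps to check how these step-based settings relate to the size of the dataset. The sketch below assumes a single GPU and the 80-example training split reported by the `Map` progress bar earlier in the notebook; with a different dataset the numbers change accordingly.

```python
# Values from the TrainingArguments above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_train_epochs = 3
warmup_steps = 500
eval_steps = 500

# Assumptions: one GPU and the 80-example training split shown earlier.
num_gpus = 1
train_examples = 80

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
batches_per_epoch = train_examples // (per_device_train_batch_size * num_gpus)
optimizer_steps_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_optimizer_steps = optimizer_steps_per_epoch * num_train_epochs

print(effective_batch_size)                   # 8
print(total_optimizer_steps)                  # 30
print(warmup_steps > total_optimizer_steps)   # True
print(eval_steps > total_optimizer_steps)     # True
```

Under these assumptions the run ends after 30 optimizer steps, so the learning rate would still be ramping up when training stops and the 500-step evaluation and checkpoint intervals would never be reached. For a dataset this small, noticeably smaller values for `warmup_steps`, `eval_steps` and `save_steps` are worth considering.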






**Few Important Notes:**
1. `remove_unused_columns=False` and `label_names=["labels"]` are required because the PeftModel's forward doesn't have the same signature as the base model's forward.

2. INT8 training requires autocasting. `predict_with_generate` can't be passed to the Trainer because it internally calls Transformers' `generate` without autocasting, which leads to errors.

3. Because of point 2, `compute_metrics` isn't passed to the `Trainer` created below.

In [None]:
# ✅ Fix Trainer to use correct model inputs
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=processor.feature_extractor,  # ✅ Use feature extractor, not tokenizer
    data_collator=data_collator,
)

# Run training
trainer.train()


OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 6.12 MiB is free. Process 710807 has 14.73 GiB memory in use. Of the allocated memory 14.15 GiB is allocated by PyTorch, and 483.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
model_name_or_path = "openai/whisper-large-v2"
peft_model_id = "smangrul/" + f"{model_name_or_path}-{model.peft_config.peft_type}-colab".replace("/", "-")
model.push_to_hub(peft_model_id)
print(peft_model_id)

Uploading the following files to smangrul/openai-whisper-large-v2-LORA-colab: adapter_model.bin,adapter_config.json


Upload 1 LFS files:   0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.bin:   0%|          | 0.00/63.1M [00:00<?, ?B/s]

smangrul/openai-whisper-large-v2-LORA-colab


# Evaluation and Inference

In [None]:
from peft import PeftModel, PeftConfig
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainer

peft_model_id = "smangrul/openai-whisper-large-v2-LORA-colab"
peft_config = PeftConfig.from_pretrained(peft_model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path, load_in_8bit=True, device_map="auto"
)
model = PeftModel.from_pretrained(model, peft_model_id)

Downloading (…)/adapter_config.json:   0%|          | 0.00/358 [00:00<?, ?B/s]

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--openai--whisper-large-v2/snapshots/bd0efe4d58db161e5ca3940e7c5940221e1b9646/config.json
Model config WhisperConfig {
  "_name_or_path": "openai/whisper-large-v2",
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "architectures": [
    "WhisperForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "begin_suppress_tokens": [
    220,
    50257
  ],
  "bos_token_id": 50257,
  "d_model": 1280,
  "decoder_attention_heads": 20,
  "decoder_ffn_dim": 5120,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 32,
  "decoder_start_token_id": 50258,
  "dropout": 0.0,
  "encoder_attention_heads": 20,
  "encoder_ffn_dim": 5120,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 32,
  "eos_token_id": 50257,
  "forced_decoder_ids": [
    [
      1,
      50259
    ],
    [
      2,
      50359
    ],
    [
      3,
      50363
    ]
  ],
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "max_lengt

Downloading (…)"adapter_model.bin";:   0%|          | 0.00/63.1M [00:00<?, ?B/s]

In [None]:
from torch.utils.data import DataLoader
from tqdm import tqdm
import numpy as np
import gc

# `common_voice` is not defined in this notebook; evaluate on the validation split built above.
eval_dataloader = DataLoader(dataset["validation"], batch_size=8, collate_fn=data_collator)

model.eval()
for step, batch in enumerate(tqdm(eval_dataloader)):
    with torch.cuda.amp.autocast():
        with torch.no_grad():
            generated_tokens = (
                model.generate(
                    input_features=batch["input_features"].to("cuda"),
                    decoder_input_ids=batch["labels"][:, :4].to("cuda"),
                    max_new_tokens=255,
                )
                .cpu()
                .numpy()
            )
            labels = batch["labels"].cpu().numpy()
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
            decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
            metric.add_batch(
                predictions=decoded_preds,
                references=decoded_labels,
            )
    del generated_tokens, labels, batch
    gc.collect()
wer = 100 * metric.compute()
print(f"{wer=}")

100%|██████████| 227/227 [1:48:29<00:00, 28.68s/it]

wer=36.74811424150603





## Using AutomaticSpeechRecognitionPipeline

**Few important notes:**
1. `pipe()` should be in the autocast context manager `with torch.cuda.amp.autocast():`
2. `forced_decoder_ids` specifying the `language` being transcribed should be provided in `generate_kwargs` dict.
3. You will get a warning along the lines of the one below, which is **safe to ignore**.
```
The model 'PeftModel' is not supported for . Supported models are ['SpeechEncoderDecoderModel', 'Speech2TextForConditionalGeneration', 'SpeechT5ForSpeechToText', 'WhisperForConditionalGeneration', 'Data2VecAudioForCTC', 'HubertForCTC', 'MCTCTForCTC', 'SEWForCTC', 'SEWDForCTC', 'UniSpeechForCTC', 'UniSpeechSatForCTC', 'Wav2Vec2ForCTC', 'Wav2Vec2ConformerForCTC', 'WavLMForCTC'].

```

In [None]:
import torch
import gradio as gr
from transformers import (
    AutomaticSpeechRecognitionPipeline,
    WhisperForConditionalGeneration,
    WhisperTokenizer,
    WhisperProcessor,
)
from peft import PeftModel, PeftConfig


peft_model_id = "smangrul/openai-whisper-large-v2-LORA-colab"
language = "Marathi"
task = "transcribe"
peft_config = PeftConfig.from_pretrained(peft_model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path, load_in_8bit=True, device_map="auto"
)

model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = WhisperTokenizer.from_pretrained(peft_config.base_model_name_or_path, language=language, task=task)
processor = WhisperProcessor.from_pretrained(peft_config.base_model_name_or_path, language=language, task=task)
feature_extractor = processor.feature_extractor
forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task=task)
pipe = AutomaticSpeechRecognitionPipeline(model=model, tokenizer=tokenizer, feature_extractor=feature_extractor)


def transcribe(audio):
    with torch.cuda.amp.autocast():
        text = pipe(audio, generate_kwargs={"forced_decoder_ids": forced_decoder_ids}, max_new_tokens=255)["text"]
    return text


iface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["microphone"], type="filepath"),  # Gradio 4+ expects a list in `sources`
    outputs="text",
    title="PEFT LoRA + INT8 Whisper Large V2 Marathi",
    description="Realtime demo for Marathi speech recognition using `PEFT-LoRA+INT8` fine-tuned Whisper Large V2 model.",
)

iface.launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://ee6fef1c-b214-4067.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces


