**Finetuning Whisper with ATC0**

Based on Jianhua's code at

https://github.com/eraus-projs/salai-docs/blob/doc-brch/whisper/finetune_whisper_atc0_all.md

# **Using the HF token**

In [1]:
%%writefile .env
HF_TOKEN_NEW1 = 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

Overwriting .env


In [2]:
!pip install python-dotenv
!pip install datasets==2.16.1



In [3]:
import os
from dotenv import load_dotenv

load_dotenv()
hf_token = os.getenv("HF_TOKEN_NEW1")
print("HF token loaded:", hf_token is not None)  # do not print the secret itself

HF token loaded: True
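
Printing a token in full leaves it in the saved notebook output. If some confirmation of the loaded value is still wanted, a masked version can be shown instead. The helper below is a minimal sketch; `mask_token` is a hypothetical name and is not part of `python-dotenv` or any Hugging Face library.

```python
from typing import Optional


def mask_token(token: Optional[str], visible: int = 3) -> str:
    """Return the token with everything after the first `visible` characters hidden."""
    if not token:
        return "<missing>"
    hidden = max(len(token) - visible, 0)
    return token[:visible] + "*" * hidden


# Example with a made-up token string (not a real credential)
print(mask_token("hf_abcdefghijkl"))
print(mask_token(None))
```

In the notebook, `print(mask_token(hf_token))` would confirm that the variable was read from `.env` without exposing it.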


# **Defining all parameters**

In [4]:
# the base model name or path
model_name_or_path = "openai/whisper-medium.en"
output_dir = "whisper-lora-atc0-all"
org = "HF-SaLAI"
trained_model_name = "whisper-medium.en-finetuned-on-atc0-all"

# overridden later in this notebook with checkpoint-117
adapter_to_choose = f"{output_dir}/checkpoint-28120"
trained_model_local = output_dir + '/' + trained_model_name
trained_model_repo = org + '/' + trained_model_name

# **Loading the dataset**

In [5]:
#Loading the dataset
from datasets import DatasetDict, load_dataset, concatenate_datasets

atc0 = load_dataset("HF-SaLAI/salai_atc0", "base", token=hf_token)
atc0p2 = load_dataset("HF-SaLAI/salai_atc0", "part2", token=hf_token)
atc0p3 = load_dataset("HF-SaLAI/salai_atc0", "part3", token=hf_token)

dataset = DatasetDict()
print(atc0)
print(atc0p2)
print(atc0p3)


You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.



DatasetDict({
    train: Dataset({
        features: ['audio', 'text', 'file', 'speaker_id', 'bgn_time', 'end_time'],
        num_rows: 6853
    })
    validation: Dataset({
        features: ['audio', 'text', 'file', 'speaker_id', 'bgn_time', 'end_time'],
        num_rows: 3395
    })
    test: Dataset({
        features: ['audio', 'text', 'file', 'speaker_id', 'bgn_time', 'end_time'],
        num_rows: 4007
    })
})
DatasetDict({
    train: Dataset({
        features: ['audio', 'text', 'file', 'speaker_id', 'bgn_time', 'end_time'],
        num_rows: 7510
    })
})
DatasetDict({
    train: Dataset({
        features: ['audio', 'text', 'file', 'speaker_id', 'bgn_time', 'end_time'],
        num_rows: 8668
    })
})


In [6]:
train_dataset = concatenate_datasets([atc0["train"],
                                         atc0p2["train"],
                                         atc0p3["train"]]).shuffle(seed=42)

# For a quick demonstration, keep only 100 of the shuffled training examples.
dataset["train"] = train_dataset.select(range(100))

# To train on the full set, use the whole shuffled dataset instead:
# dataset["train"] = train_dataset

shuffled_dataset = atc0["validation"].shuffle(seed=42)
dataset["validation"] = shuffled_dataset.select(range(100))

print(dataset)

DatasetDict({
    train: Dataset({
        features: ['audio', 'text', 'file', 'speaker_id', 'bgn_time', 'end_time'],
        num_rows: 100
    })
    validation: Dataset({
        features: ['audio', 'text', 'file', 'speaker_id', 'bgn_time', 'end_time'],
        num_rows: 100
    })
})


# **Creating a text normalizer**

XL: Is the normalizer meant to lowercase the text and remove punctuation such as commas and periods?

In [7]:
import transformers.models.whisper.english_normalizer as en

english_text_normalizer = en.EnglishTextNormalizer({})
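
To make the effect of normalization concrete, here is a deliberately simplified stand-in written with the standard library only. It is not the `EnglishTextNormalizer` implementation, which does more (for example handling contractions and number words); `simple_normalize` is a hypothetical helper that only lowercases, strips punctuation, and collapses whitespace.

```python
import re


def simple_normalize(text: str) -> str:
    """Simplified stand-in: lowercase, drop punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)       # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace


print(simple_normalize("Delta 123, turn LEFT heading 270."))
print(repr(simple_normalize("...")))  # punctuation-only input becomes an empty string
```

The second call shows how a transcript can end up empty after normalization, which is the case the filtering step has to handle.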

# **Filtering the datasets**


Some examples will have an empty string after normalization, which will cause issues with the WER calculation. Here, we remove these examples.

In [8]:
def is_transcript_nonempty(transcript):
    # keep an example only if its transcript is non-empty after normalization
    normalized_transcript = english_text_normalizer(transcript)
    return len(normalized_transcript) > 0

dataset["train"] = dataset["train"].filter(is_transcript_nonempty,
        input_columns=["text"])
dataset["validation"] = dataset["validation"].filter(is_transcript_nonempty,
        input_columns=["text"])
print(dataset)


DatasetDict({
    train: Dataset({
        features: ['audio', 'text', 'file', 'speaker_id', 'bgn_time', 'end_time'],
        num_rows: 97
    })
    validation: Dataset({
        features: ['audio', 'text', 'file', 'speaker_id', 'bgn_time', 'end_time'],
        num_rows: 99
    })
})


# **Creating a processor, feature extractor, and tokenizer**


In [9]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(model_name_or_path)
feature_extractor = processor.feature_extractor
tokenizer = processor.tokenizer

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



# **Creating input features from audio data**

In [10]:
def prepare_dataset(batch):
    # compute log-Mel input features from input audio array
    audio = batch["audio"]
    batch["input_features"] = feature_extractor(audio["array"],
            sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(english_text_normalizer(
            batch["text"])).input_ids
    return batch

dataset = dataset.map(prepare_dataset,
                      remove_columns=dataset.column_names["train"],
                      num_proc=1)

print(dataset)


DatasetDict({
    train: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 97
    })
    validation: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 99
    })
})
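
The shape of each `input_features` entry can be worked out by hand. The sketch below assumes the usual Whisper feature-extractor settings (16 kHz audio, 30-second windows, a hop of 160 samples, 80 mel bins); these values are assumptions, not read from this notebook, although the 80 mel bins and the 1500 encoder positions agree with the `Conv1d(80, ...)` and `Embedding(1500, 1024)` lines in the model printout further down.

```python
# Assumed Whisper feature-extractor settings (not stated in this notebook)
SAMPLING_RATE = 16_000   # Hz
CHUNK_LENGTH_S = 30      # seconds per example, after padding or truncation
HOP_LENGTH = 160         # samples between successive spectrogram frames
N_MELS = 80              # mel bins per frame

n_samples = SAMPLING_RATE * CHUNK_LENGTH_S   # samples in one padded window
n_frames = n_samples // HOP_LENGTH           # spectrogram frames per window
feature_shape = (N_MELS, n_frames)

# The encoder's second convolution has stride 2, halving the time axis
encoder_positions = n_frames // 2

print(feature_shape, encoder_positions)
```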


In [11]:
print(dataset["train"][1]["input_features"])


# **Training and Evaluation**



**Define a Data Collator**

In [12]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self,
                 features: List[Dict[str, Union[List[int], torch.Tensor]]]) \
                        -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths
        # and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]}
                          for feature in features]
        batch = self.processor.feature_extractor.pad(input_features,
                                                     return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]}
                          for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features,
                                                    return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(
                labels_batch.attention_mask.ne(1), -100)

        # if the bos token was added in the previous tokenization step,
        # cut it here, as it is appended again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().\
                            cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

# Note: the following warning appears during training:
#   "The attention mask is not set and cannot be inferred from input because
#   pad token is same as eos token. As a consequence, you may observe
#   unexpected behavior. Please pass your input's `attention_mask` to obtain
#   reliable results."
# The issue may be related to the padding done by the feature extractor.
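
The label-padding step in the collator can be illustrated without any tensors. The sketch below pads variable-length label sequences to a common length and fills the padded positions with -100, the value that PyTorch's cross-entropy loss ignores by default. `pad_labels` is a hypothetical helper and the token ids are made up for illustration.

```python
from typing import List

IGNORE_INDEX = -100  # positions with this label do not contribute to the loss


def pad_labels(label_seqs: List[List[int]],
               ignore_index: int = IGNORE_INDEX) -> List[List[int]]:
    """Right-pad every sequence to the longest one, using ignore_index as filler."""
    max_len = max(len(seq) for seq in label_seqs)
    return [seq + [ignore_index] * (max_len - len(seq)) for seq in label_seqs]


# Two made-up label sequences of different lengths
for row in pad_labels([[1, 11, 12, 13, 2], [1, 21, 2]]):
    print(row)
```

The collator above reaches the same result in two steps: the tokenizer pads with its pad token, and `masked_fill` then replaces every padded position with -100.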

**Define Evaluation Metrics**

In [13]:
!pip install jiwer
!pip install evaluate


import evaluate

metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
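
For reference, word error rate is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function below is a minimal pure-Python sketch for a single reference/hypothesis pair; it is not the implementation behind `evaluate.load("wer")`, and `word_error_rate` is a hypothetical name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference is empty; WER is undefined")

    # prev[j] holds the edit distance between the processed part of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against six reference words
print(100 * word_error_rate("turn left heading two seven zero",
                            "turn right heading two seven"))
```

The division by the reference length is why empty references are a problem for the metric and were filtered out earlier.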


**Load a pre-trained checkpoint**

In [14]:
import torch
torch.cuda.empty_cache() # XL added this to free some memory

print(f"{torch.cuda.is_available() = }")

from transformers import WhisperForConditionalGeneration

base_model = WhisperForConditionalGeneration.from_pretrained(
        model_name_or_path).to("cuda")


torch.cuda.is_available() = True



Override generation arguments - no tokens are forced as decoder outputs (see forced_decoder_ids), no tokens are suppressed during generation (see suppress_tokens):

In [15]:
base_model.config.forced_decoder_ids = None
base_model.config.suppress_tokens = []

**Apply LoRA**

In [16]:
!pip install peft

from peft import LoraConfig, get_peft_model

config = LoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"],
                    lora_dropout=0.05, bias="none")

model = get_peft_model(base_model, config)

model.print_trainable_parameters()

trainable params: 9,437,184 || all params: 773,294,080 || trainable%: 1.2204
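
The trainable-parameter count reported above can be reproduced by hand. The sketch assumes a hidden size of 1024 and 24 encoder layers (both visible in the model printout later in this notebook) and 24 decoder layers, each with a self-attention and a cross-attention block; the decoder depth is an assumption, not shown in the notebook output.

```python
HIDDEN = 1024         # in/out features of q_proj and v_proj
RANK = 32             # r in LoraConfig
ENCODER_LAYERS = 24   # one attention block per encoder layer
DECODER_LAYERS = 24   # assumed; two attention blocks (self and cross) per layer
TARGET_MODULES = 2    # q_proj and v_proj

attention_blocks = ENCODER_LAYERS * 1 + DECODER_LAYERS * 2

# Each adapted Linear gets two low-rank matrices: A (r x in) and B (out x r)
params_per_adapter = RANK * HIDDEN + HIDDEN * RANK

trainable = attention_blocks * TARGET_MODULES * params_per_adapter
all_params = 773_294_080  # total reported by print_trainable_parameters()

print(f"trainable params: {trainable:,} ({100 * trainable / all_params:.4f}%)")
```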


**Define the Training Configuration**


In [17]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,          # change to a repo name of your choice
    per_device_train_batch_size=8,  # increase to 16 for larger datasets
    gradient_accumulation_steps=1,  # double this each time the batch size is halved
    learning_rate=1e-3,
    report_to="none",
    # warmup_steps=50,
    num_train_epochs=10,
    eval_strategy="epoch",
    fp16=True,
    per_device_eval_batch_size=1,
    generation_max_length=128,
    logging_steps=1,
    remove_unused_columns=False,  # required as the PeftModel forward doesn't
            # have the signature of the wrapped model's forward
    label_names=["labels"],       # same reason as above
    predict_with_generate=True,
    save_steps=0.1,               # a float below 1 is a fraction of total training steps
)
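
The step counts implied by these settings can be estimated before training. The sketch below assumes a single GPU, the 97 training examples left after filtering, and that a fractional `save_steps` is turned into a step interval by rounding up the fraction of the total steps; the rounding rule is an assumption about the `Trainer` internals.

```python
import math

num_train_examples = 97   # training examples after filtering
per_device_batch = 8
grad_accum_steps = 1
num_epochs = 10
save_fraction = 0.1       # save_steps given as a fraction of total steps

steps_per_epoch = math.ceil(num_train_examples / per_device_batch) // grad_accum_steps
total_steps = steps_per_epoch * num_epochs
save_every = math.ceil(total_steps * save_fraction)
checkpoint_steps = list(range(save_every, total_steps + 1, save_every))

print(f"{steps_per_epoch = }, {total_steps = }, {save_every = }")
print(checkpoint_steps)
```

Under these assumptions the total matches the `global_step=130` reported by `trainer.train()`, and `checkpoint-117`, which is loaded later in this notebook, corresponds to the end of epoch 9.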

In [18]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
model.config.use_cache = False  # silence warnings; re-enable for inference!



**Train the adapter**

In [19]:
trainer.train()

| Epoch | Training Loss | Validation Loss | WER (%) |
|---|---|---|---|
| 1 | 1.5313 | 1.480372 | 72.758229 |
| 2 | 0.0179 | 1.284232 | 50.964813 |
| 3 | 0.2405 | 1.433681 | 38.592509 |
| 4 | 0.5167 | 1.35081 | 36.662883 |
| 5 | 0.0084 | 1.301419 | 33.484677 |
| 6 | 0.0035 | 1.308257 | 33.825199 |
| 7 | 0.0425 | 1.297612 | 33.598184 |
| 8 | 0.003 | 1.324401 | 34.506243 |
| 9 | 0.003 | 1.30693 | 34.506243 |
| 10 | 0.003 | 1.305545 | 34.165721 |


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.

TrainOutput(global_step=130, training_loss=0.3911855651674649, metrics={'train_runtime': 947.1256, 'train_samples_per_second': 1.024, 'train_steps_per_second': 0.137, 'total_flos': 1.0031686189056e+18, 'train_loss': 0.3911855651674649, 'epoch': 10.0})
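
One way to choose an adapter checkpoint is to take the epoch with the lowest validation WER. The sketch below uses the WER values reported above and assumes that a checkpoint was saved at the end of every epoch, 13 steps apart (130 steps over 10 epochs). This is only an illustration; it is not necessarily how the checkpoint used in the next section was chosen, and with 99 validation examples the differences between later epochs are small.

```python
# Validation WER (%) per epoch, copied from the training log above
wer_by_epoch = {
    1: 72.758229, 2: 50.964813, 3: 38.592509, 4: 36.662883, 5: 33.484677,
    6: 33.825199, 7: 33.598184, 8: 34.506243, 9: 34.506243, 10: 34.165721,
}
STEPS_PER_EPOCH = 13  # assumed: 130 total steps / 10 epochs

best_epoch = min(wer_by_epoch, key=wer_by_epoch.get)
best_checkpoint = f"checkpoint-{best_epoch * STEPS_PER_EPOCH}"

print(f"{best_epoch = }, WER = {wer_by_epoch[best_epoch]:.2f}, {best_checkpoint = }")
```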

# **Saving the finetuned model locally and pushing to Hugging Face**

In [20]:
from transformers import WhisperForConditionalGeneration
from peft import PeftModel

# base_model = WhisperForConditionalGeneration.from_pretrained(
#              model_name_or_path, load_in_8bit=False, device_map="auto")
base_model = WhisperForConditionalGeneration.from_pretrained(
        model_name_or_path).to("cuda")

In [21]:
adapter_to_choose = f"{output_dir}/checkpoint-117"
print(f"{adapter_to_choose = } \n\n")

# attach the LoRA adapter weights from the chosen checkpoint to the base model
model = PeftModel.from_pretrained(base_model, adapter_to_choose)
print(f"{model = } \n\n")

# model.merge_and_unload() merges the adapter parameters with the base model
# parameters and unloads the adapter. This typically results in a standard
# model that can be used without needing the PEFT infrastructure.
model = model.merge_and_unload()
print(f"{model = } \n\n")

adapter_to_choose = 'whisper-lora-atc0-all/checkpoint-117' 


model = PeftModel(
  (base_model): LoraModel(
    (model): WhisperForConditionalGeneration(
      (model): WhisperModel(
        (encoder): WhisperEncoder(
          (conv1): Conv1d(80, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
          (conv2): Conv1d(1024, 1024, kernel_size=(3,), stride=(2,), padding=(1,))
          (embed_positions): Embedding(1500, 1024)
          (layers): ModuleList(
            (0-23): 24 x WhisperEncoderLayer(
              (self_attn): WhisperSdpaAttention(
                (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
                (v_proj): lora.Linear(
                  (base_layer): Linear(in_features=1024, out_features=1024, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.05, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=1024, out_feat

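Copy the trained model to Google Drive. Run this cell only after the "Saving locally" cell; the `cp` error shown here arises because the model directory has not been written yet.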

In [22]:
from google.colab import drive
drive.mount('/content/drive')

!cp -r /content/whisper-lora-atc0-all/whisper-medium.en-finetuned-on-atc0-all /content/drive/My\ Drive/

Mounted at /content/drive
cp: cannot stat '/content/whisper-lora-atc0-all/whisper-medium.en-finetuned-on-atc0-all': No such file or directory


**Saving locally**

In [23]:
model.save_pretrained(trained_model_local)
processor.save_pretrained(trained_model_local)
print(trained_model_local)



whisper-lora-atc0-all/whisper-medium.en-finetuned-on-atc0-all


**Pushing to Hugging Face (XL: temporarily commented out)**

In [24]:
# model.push_to_hub(trained_model_repo, safe_serialization=True)
# processor = WhisperProcessor.from_pretrained(trained_model_local)
# processor.push_to_hub(trained_model_repo)

**Notes about model loading**

Note that we can use two approaches for loading a model:

model1 = WhisperForConditionalGeneration.from_pretrained(model_name_or_path, load_in_8bit=False, device_map="auto")
model2 = WhisperForConditionalGeneration.from_pretrained(model_name_or_path).to("cuda")

Here is a comparison of the two approaches:

- Neither call sets `torch_dtype`, so both load the weights at the library's default precision (float32 in transformers 4.x releases, not 16-bit); `load_in_8bit=False` simply leaves 8-bit quantization off.
- The first uses `device_map="auto"` to distribute the model layers automatically across the available hardware, which can help memory usage, especially in multi-GPU setups.
- The second uses `.to("cuda")` to move the entire model to a single GPU, which is straightforward but does not spread the model across multiple GPUs.

The second approach is preferred for the single-GPU case. With multiple GPUs, we can use the first one.
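
The precision of the loaded weights can be sanity-checked against the download size. The sketch below takes the parameter counts printed in the LoRA section, subtracts the adapter parameters to get the base model size, and estimates the weight footprint at 4 and 2 bytes per parameter. `weight_size_gb` is a hypothetical helper, and the estimate ignores file metadata and non-parameter buffers.

```python
ALL_PARAMS_WITH_LORA = 773_294_080   # from print_trainable_parameters()
LORA_PARAMS = 9_437_184              # trainable adapter parameters

base_params = ALL_PARAMS_WITH_LORA - LORA_PARAMS


def weight_size_gb(n_params: int, bytes_per_param: int) -> float:
    """Approximate size of the weights in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


for dtype, nbytes in {"float32": 4, "float16": 2}.items():
    print(f"{dtype}: {weight_size_gb(base_params, nbytes):.2f} GB")
```

The float32 estimate is close to the 3.06G `model.safetensors` download shown when the base model was loaded, which is consistent with 4-byte weights.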