### Explanatory Note

Task: ASR\
Model: Whisper small (https://huggingface.co/openai/whisper-small)\
Languages: Turkish, Azerbaijani, Swahili\
Data: mozilla-foundation/common_voice_16_1 (https://huggingface.co/datasets/mozilla-foundation/common_voice_16_1)

Content:
1. Get data
2. Whisper small
3. Whisper small Peft
4. Evaluation

Whisper was chosen as the most popular model, with many examples and frequent updates. It comes in several sizes, so the same pipeline can serve both research and production by changing a single line: the model name.
The dataset was chosen for the same reasons: it is popular, regularly updated, covers the required languages, and has many preprocessing examples.

At first I tried fine-tuning the model without PEFT, but training took too long and I stopped the run early. The experiment still showed two things:
1. Without fine-tuning, WER (word error rate; lower is better) is ~100.
2. Training loss and validation loss both go down, which means training is going as planned.
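
Since WER is the only quality metric used in this notebook, here is a minimal pure-Python sketch of how it is defined: the word-level edit distance divided by the number of reference words. The notebook itself uses the `wer` metric from the Hugging Face libraries; the function name `word_error_rate` and the shortened hypothesis sentence below are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j]: edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "blob anapenda kutembea usiku kuliko mchana"
# Hypothetical model output with one word dropped (not a real prediction).
hypothesis = "blob anapenda kutembea usiku mchana"
wer_percent = 100 * word_error_rate(reference, hypothesis)
print(f"{wer_percent:.2f}")
```

Because insertions also count as errors, WER is not capped at 100. The sketch assumes a non-empty reference.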

Then I trained the model with the PEFT method LoRA. Results:
1. Training is about twice as fast as in the previous experiment.
2. WER is ~20% worse: WER = 60 after 1000 steps, roughly what normal (full) fine-tuning reaches after about 750 steps.
3. The PEFT model was saved to the Hugging Face Hub (https://huggingface.co/voodyara/openai-whisper-small-swahili-LORA-colab)

In the Evaluation block I only calculate WER for the model from the previous step.

Conclusions: the model has potential for tuning to a specific language. After fine-tuning for only 1000 steps, the small version shows good results that are not far from SOTA (https://huggingface.co/spaces/hf-audio/open_asr_leaderboard), and switching to a bigger Whisper model and training for longer should improve them further.

In [1]:
!nvidia-smi

Fri Mar 22 19:57:13 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 2070 ...    Off | 00000000:01:00.0 Off |                  N/A |
| N/A   48C    P8               6W /  80W |      6MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [8]:
# !pip install datasets
# !pip install git+https://github.com/huggingface/transformers
# !pip install librosa
# !pip install evaluate
# !pip install jiwer
# !pip install gradio

# !nvcc --version
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu115
# !pip install --upgrade datasets transformers accelerate soundfile librosa evaluate jiwer tensorboard gradio

#!pip install ipywidgets

# Get data

In [2]:
# python -c "from huggingface_hub.hf_api import HfFolder; HfFolder.save_token('')"
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [4]:
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()
common_voice["train"] = load_dataset("mozilla-foundation/common_voice_16_1", "sw", split="train+validation", use_auth_token=True)
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_16_1", "sw", split="test", use_auth_token=True)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [5]:
common_voice = common_voice.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes", "variant"])

### Data example

In [6]:
example0 = common_voice['train'][0]
example0_arr_len = example0['audio']['array'].shape

train_len = common_voice['train'].shape, 
test_len = common_voice['test'].shape, 

example0, example0_arr_len, train_len, test_len

({'audio': {'path': '/home/mnemonic/.cache/huggingface/datasets/downloads/extracted/29b8bd879ffa4303988fa7658a0b717fcd75875e1b9b59cb379f316c322e370e/sw_train_0/common_voice_sw_30574257.mp3',
   'array': array([ 8.52651283e-14,  7.38964445e-13,  1.16529009e-12, ...,
           1.15293153e-06, -1.70660792e-06, -2.12845862e-06]),
   'sampling_rate': 48000},
  'sentence': 'blob anapenda kutembea usiku kuliko mchana'},
 (255744,),
 ((58440, 2),),
 ((12234, 2),))

In [7]:
from transformers import WhisperFeatureExtractor, WhisperTokenizer

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="Swahili", task="transcribe")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Data example

In [7]:
input_ = common_voice["train"][0]["sentence"]
labels = tokenizer(input_).input_ids
decoded = tokenizer.decode(labels, skip_special_tokens=True)
input_, labels, decoded

('blob anapenda kutembea usiku kuliko mchana',
 [50258,
  50318,
  50359,
  50363,
  15962,
  65,
  364,
  569,
  7639,
  350,
  325,
  443,
  650,
  64,
  505,
  24320,
  27576,
  10770,
  275,
  339,
  2095,
  50257],
 'blob anapenda kutembea usiku kuliko mchana')
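
The id list above can be read as a four-token prompt prefix, the text tokens, and an end token. The sketch below splits it that way. The token names in `ASSUMED_SPECIAL_TOKENS` are my reading of the multilingual Whisper vocabulary and should be treated as an assumption; `tokenizer.convert_ids_to_tokens` would confirm them.

```python
# Label ids copied from the tokenizer output above.
labels = [50258, 50318, 50359, 50363, 15962, 65, 364, 569, 7639, 350, 325,
          443, 650, 64, 505, 24320, 27576, 10770, 275, 339, 2095, 50257]

# Assumed names for the special ids (not printed anywhere in this notebook).
ASSUMED_SPECIAL_TOKENS = {
    50257: "<|endoftext|>",
    50258: "<|startoftranscript|>",
    50318: "<|sw|>",
    50359: "<|transcribe|>",
    50363: "<|notimestamps|>",
}

prompt_prefix = labels[:4]     # start token, language, task, timestamp mode
text_token_ids = labels[4:-1]  # the sentence itself
end_token = labels[-1]

print([ASSUMED_SPECIAL_TOKENS[i] for i in prompt_prefix])
print(len(text_token_ids), ASSUMED_SPECIAL_TOKENS[end_token])
```

The same four-token slice appears later in the evaluation loop, where `batch["labels"][:, :4]` is passed to `generate` as `decoder_input_ids`.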

In [8]:
from datasets import Audio
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

### Data example

In [9]:
example0 = common_voice['train'][0]
example0_rate = example0['audio']['sampling_rate']
example0, example0_rate

({'audio': {'path': '/home/mnemonic/.cache/huggingface/datasets/downloads/extracted/29b8bd879ffa4303988fa7658a0b717fcd75875e1b9b59cb379f316c322e370e/sw_train_0/common_voice_sw_30574257.mp3',
   'array': array([ 5.22959454e-12, -4.54747351e-12,  6.82121026e-12, ...,
           1.84139935e-06, -5.95064193e-07, -3.07081791e-07]),
   'sampling_rate': 16000},
  'sentence': 'blob anapenda kutembea usiku kuliko mchana'},
 16000)
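
As a quick sanity check on the resampling, the clip duration and the expected length of the 16 kHz array can be worked out from the numbers printed in the two data examples. This is plain arithmetic, not a call into `datasets`; the actual resampled array may differ from the estimate by a sample or so depending on the resampler.

```python
ORIGINAL_RATE_HZ = 48_000       # sampling rate before cast_column
TARGET_RATE_HZ = 16_000         # rate requested via Audio(sampling_rate=16000)
original_num_samples = 255_744  # array length printed in the first data example

duration_s = original_num_samples / ORIGINAL_RATE_HZ
expected_num_samples = round(duration_s * TARGET_RATE_HZ)

print(f"duration: {duration_s:.3f} s")
print(f"expected samples at 16 kHz: {expected_num_samples}")
```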

In [9]:
import os
num_proc = os.cpu_count()-1
print(num_proc)

def prepare_dataset(batch):
    audio = batch["audio"]
    # compute log-Mel input features from input audio array 
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # encode target text to label ids 
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=num_proc, load_from_cache_file=True)

11


In [10]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union
from transformers import WhisperProcessor

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]
        batch["labels"] = labels

        return batch


processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Swahili", task="transcribe")
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
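
To make the label handling in the collator easier to follow, here is a dependency-free sketch of the same two steps on plain lists: pad every label sequence to the batch maximum with `-100` (the default `ignore_index` of PyTorch's cross-entropy loss), then drop the first column if every sequence starts with the BOS id. The function name `pad_and_mask` and the token ids are hypothetical; the real collator works on tensors and masks padding through the attention mask.

```python
from typing import List

IGNORE_INDEX = -100      # padding positions are excluded from the loss
HYPOTHETICAL_BOS_ID = 1  # placeholder id, not the real Whisper value


def pad_and_mask(label_lists: List[List[int]], bos_id: int) -> List[List[int]]:
    """Pad to the longest sequence with IGNORE_INDEX, then strip a shared BOS."""
    max_len = max(len(ids) for ids in label_lists)
    padded = [ids + [IGNORE_INDEX] * (max_len - len(ids)) for ids in label_lists]
    # Drop the first column only when every row starts with the BOS id.
    if all(row[0] == bos_id for row in padded):
        padded = [row[1:] for row in padded]
    return padded


batch_labels = pad_and_mask([[1, 7, 8, 9, 2], [1, 5, 2]], bos_id=HYPOTHETICAL_BOS_ID)
print(batch_labels)
```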


# Whisper small

In [9]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
model.generation_config.language = "sw" 
# model

In [10]:
from datasets import load_metric
metric = load_metric('wer')

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    label_ids[label_ids == -100] = tokenizer.pad_token_id
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

  metric = load_metric('wer')
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


In [13]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-swahili",
    per_device_train_batch_size=4, # 16 def
    gradient_accumulation_steps=1, # increase by 2x for every 2x decrease in batch size # 1 def
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=5000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=4, # 8 def
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=100,
    eval_steps=100,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,

)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
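
The comment on `gradient_accumulation_steps` says to double it for every halving of the batch size. A small sketch of that rule, assuming a single GPU and the reference batch size of 16 noted in the comments; the helper name `accumulation_steps_for` is made up for illustration.

```python
def accumulation_steps_for(target_effective_batch: int,
                           per_device_batch: int,
                           num_devices: int = 1) -> int:
    """Accumulation steps needed to reach the target effective batch size."""
    per_step = per_device_batch * num_devices
    if target_effective_batch % per_step != 0:
        raise ValueError("target batch must be a multiple of the per-step batch")
    return target_effective_batch // per_step


REFERENCE_BATCH = 16  # the "16 def" noted next to per_device_train_batch_size
steps_needed = accumulation_steps_for(REFERENCE_BATCH, per_device_batch=4)
print(steps_needed)
```

With `per_device_train_batch_size=4` and `gradient_accumulation_steps=1`, as configured above, the effective batch size of this run is 4 rather than 16.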



In [14]:
trainer.train()

`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`...


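| Step | Training Loss | Validation Loss | WER |
|------|---------------|-----------------|-----------|
| 100 | 2.3369 | 2.210492 | 95.853908 |
| 200 | 1.6458 | 1.759184 | 83.513039 |
| 300 | 1.3311 | 1.472982 | 72.928401 |
| 400 | 1.0394 | 1.16728 | 68.523841 |
| 500 | 0.854 | 1.003059 | 60.994522 |
| 600 | 0.7703 | 0.95593 | 59.352095 |
| 700 | 0.7665 | 0.911541 | 66.273821 |
| 800 | 0.7724 | 0.87356 | 56.288095 |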
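
With `metric_for_best_model="wer"` and `greater_is_better=False`, the best checkpoint is the one with the lowest WER. The sketch below applies that selection rule by hand to the rows logged above; it only illustrates the rule and does not reproduce the Trainer's internal logic.

```python
# (step, training loss, validation loss, WER) rows copied from the log above.
eval_log = [
    (100, 2.3369, 2.210492, 95.853908),
    (200, 1.6458, 1.759184, 83.513039),
    (300, 1.3311, 1.472982, 72.928401),
    (400, 1.0394, 1.16728, 68.523841),
    (500, 0.854, 1.003059, 60.994522),
    (600, 0.7703, 0.95593, 59.352095),
    (700, 0.7665, 0.911541, 66.273821),
    (800, 0.7724, 0.87356, 56.288095),
]

# Lowest WER wins because greater_is_better=False.
best_step, _, _, best_wer = min(eval_log, key=lambda row: row[3])

# Steps where WER got worse although validation loss kept improving.
regressions = [
    curr[0]
    for prev, curr in zip(eval_log, eval_log[1:])
    if curr[3] > prev[3] and curr[2] < prev[2]
]

print(best_step, best_wer)
print(regressions)
```

The regression at step 700 shows that validation loss and WER do not always move together, which is one reason to select checkpoints on WER directly.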


Non-default generation parameters: {'max_length': 448, 'suppress_tokens': [], 'begin_suppress_tokens': [220, 50257]}


KeyboardInterrupt: 

In [None]:
!tensorboard --logdir='./whisper-small-swahili/runs/' --port=6006

TensorFlow installation not found - running with reduced feature set.

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.16.2 at http://localhost:6006/ (Press CTRL+C to quit)
E0321 11:56:43.155679 139670943929920 _internal.py:96] Error on request:
Traceback (most recent call last):
  File "/home/mnemonic/arhiv/projects/profit_center/pct2/lib/python3.9/site-packages/werkzeug/serving.py", line 362, in run_wsgi
    execute(self.server.app)
  File "/home/mnemonic/arhiv/projects/profit_center/pct2/lib/python3.9/site-packages/werkzeug/serving.py", line 323, in execute
    application_iter = app(environ, start_response)
  File "/home/mnemonic/arhiv/projects/profit_center/pct2/lib/python3.9/site-packages/tensorboard/backend/application.py", line 528,

In [14]:
# import gc
# torch.cuda.empty_cache()
# gc.collect()

# with torch.no_grad():
#     torch.cuda.empty_cache()

from numba import cuda
 
cuda.select_device(0)  # select the first GPU (index 0)
cuda.close()

# Whisper small Peft

In [4]:
# !add-apt-repository -y ppa:jonathonf/ffmpeg-4
# !apt update
# !apt install -y ffmpeg

In [5]:
# !pip install "datasets>=2.6.1"
# !pip install git+https://github.com/huggingface/transformers
# !pip install librosa
# !pip install "evaluate>=0.30"
# !pip install jiwer
# !pip install gradio
# !pip install -q bitsandbytes datasets accelerate
# !pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git@main

In [13]:
model_name_or_path = "openai/whisper-small"
language = "Swahili"
language_abbr = "sw"
task = "transcribe"

In [14]:
import torch
from transformers import WhisperForConditionalGeneration, BitsAndBytesConfig

model = WhisperForConditionalGeneration.from_pretrained(model_name_or_path, quantization_config=BitsAndBytesConfig(load_in_8bit=True))

`low_cpu_mem_usage` was None, now set to True since model is quantized.


In [15]:
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
model.generation_config.language = language_abbr

In [16]:
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

In [17]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, bias="none")

model = get_peft_model(model, config)
model.print_trainable_parameters()

trainable params: 3,538,944 || all params: 245,273,856 || trainable%: 1.442854145857274
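
The trainable parameter count printed above can be reproduced by hand. The sketch assumes the whisper-small architecture has a hidden size of 768 with 12 encoder and 12 decoder layers (values from my recollection of the model config, not printed in this notebook), and that every `q_proj` and `v_proj` in both self-attention and cross-attention receives an adapter.

```python
D_MODEL = 768              # assumed hidden size of whisper-small
ENCODER_LAYERS = 12        # assumed
DECODER_LAYERS = 12        # assumed
LORA_RANK = 32             # r in the LoraConfig above
TARGETS_PER_ATTENTION = 2  # q_proj and v_proj

# One attention block per encoder layer; two per decoder layer
# (self-attention and cross-attention).
attention_blocks = ENCODER_LAYERS * 1 + DECODER_LAYERS * 2
adapted_modules = attention_blocks * TARGETS_PER_ATTENTION

# Each adapter adds A (r x d_in) and B (d_out x r); here d_in == d_out == D_MODEL.
params_per_module = LORA_RANK * (D_MODEL + D_MODEL)
trainable = adapted_modules * params_per_module

ALL_PARAMS = 245_273_856   # "all params" printed above
trainable_percent = 100 * trainable / ALL_PARAMS

print(f"{trainable:,}")
print(f"{trainable_percent:.4f}")
```

Under these assumptions the count matches the 3,538,944 reported by `print_trainable_parameters()`.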


In [19]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-swahili-peft",
    per_device_train_batch_size=16, # 16 def
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-3,
    warmup_steps=50,
    max_steps=1000,
    #num_train_epochs=3,
    evaluation_strategy="steps",
    fp16=True,
    save_steps=100,
    eval_steps=100,
    per_device_eval_batch_size=16,
    report_to=["tensorboard"],
    generation_max_length=128,
    logging_steps=25,
    remove_unused_columns=False,  # required as the PeftModel forward doesn't have the signature of the wrapped model's forward
    label_names=["labels"],  # same reason as above
    push_to_hub=False,
)

In [20]:
import os  # needed by SavePeftModelCallback for path handling

from transformers import Seq2SeqTrainer, TrainerCallback, TrainingArguments, TrainerState, TrainerControl
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR


class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control

In [21]:
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    # compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

In [22]:
trainer.train()



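| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 1.2436 | 1.373525 |
| 200 | 1.1103 | 1.192044 |
| 300 | 0.867 | 1.11535 |
| 400 | 0.86 | 1.064793 |
| 500 | 0.8363 | 1.028168 |
| 600 | 0.8078 | 0.984833 |
| 700 | 0.816 | 0.948149 |
| 800 | 0.735 | 0.92495 |
| 900 | 0.7489 | 0.907668 |
| 1000 | 0.701 | 0.899649 |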




TrainOutput(global_step=1000, training_loss=0.9657648029327393, metrics={'train_runtime': 29028.277, 'train_samples_per_second': 0.551, 'train_steps_per_second': 0.034, 'total_flos': 4.69890367488e+18, 'train_loss': 0.9657648029327393, 'epoch': 0.27})
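
The throughput figures in `TrainOutput` follow from the training arguments and the runtime. A short check, assuming a single GPU and that the training split still has the 58,440 rows shown in the first data example:

```python
max_steps = 1000
per_device_train_batch_size = 16
train_runtime_s = 29028.277  # 'train_runtime' from TrainOutput above
train_set_size = 58_440      # rows in common_voice["train"] (assumed unchanged)

samples_seen = max_steps * per_device_train_batch_size
samples_per_second = samples_seen / train_runtime_s
steps_per_second = max_steps / train_runtime_s
epochs = samples_seen / train_set_size
hours = train_runtime_s / 3600

print(f"{samples_per_second:.3f} samples/s, {steps_per_second:.3f} steps/s")
print(f"{epochs:.2f} epochs in {hours:.1f} h")
```

So the 1000 steps cover only about a quarter of one pass over the training data.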

In [23]:
model_name_or_path = "openai/whisper-small-swahili"
peft_model_id = "voodyara/" + f"{model_name_or_path}-{model.peft_config['default'].peft_type}-colab".replace("/", "-")
model.push_to_hub(peft_model_id)
print(peft_model_id)

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/14.2M [00:00<?, ?B/s]

voodyara/openai-whisper-small-swahili-LORA-colab


# Evaluation

In [11]:
from peft import PeftModel, PeftConfig

from transformers import WhisperForConditionalGeneration, Seq2SeqTrainer, BitsAndBytesConfig

peft_model_id = "voodyara/openai-whisper-small-swahili-LORA-colab"
peft_config = PeftConfig.from_pretrained(peft_model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path, quantization_config=BitsAndBytesConfig(load_in_8bit=True), device_map="auto"
)
model = PeftModel.from_pretrained(model, peft_model_id)

model.generation_config.language = "sw"

In [12]:
from torch.utils.data import DataLoader
from tqdm import tqdm
import numpy as np
import gc

In [13]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union
from transformers import WhisperProcessor

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]
        batch["labels"] = labels

        return batch


processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Swahili", task="transcribe")
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [14]:
eval_dataloader = DataLoader(common_voice["test"], batch_size=8, collate_fn=data_collator)

In [15]:
from datasets import load_metric
metric = load_metric('wer')

  metric = load_metric('wer')
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


In [17]:
model.eval()
for step, batch in enumerate(tqdm(eval_dataloader)):
    with torch.cuda.amp.autocast():
        with torch.no_grad():
            generated_tokens = (
                model.generate(
                    input_features=batch["input_features"].to("cuda"),
                    decoder_input_ids=batch["labels"][:, :4].to("cuda"),
                    max_new_tokens=255,
                )
                .cpu()
                .numpy()
            )
            labels = batch["labels"].cpu().numpy()
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
            decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
            metric.add_batch(
                predictions=decoded_preds,
                references=decoded_labels,
            )
    del generated_tokens, labels, batch
    gc.collect()
wer = 100 * metric.compute()
print(f"{wer=}")

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1530/1530 [3:49:18<00:00,  8.99s/it]


wer=60.185363783658666
