## Whisper Base Fine Tune

In this notebook we fine-tune the whisper-base model on the medical-speech-transcription-and-intent dataset using the Hugging Face libraries.

Logging in to Hugging Face

An authentication token is required; a token of type write or fine-grained will work.

In [7]:

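import os

from huggingface_hub import login

# Read the token from the HF_TOKEN environment variable instead of
# hard-coding it: a token committed in a notebook is exposed to every reader.
login(token=os.environ["HF_TOKEN"])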


## Loading Dataset

We use the medical-speech-transcription-and-intent dataset, which is available on Hugging Face at https://huggingface.co/datasets/Hani89/medical_asr_recording_dataset

In [8]:
from datasets import load_dataset

# Load the dataset from Hugging Face
dataset = load_dataset("Hani89/medical_asr_recording_dataset")

# Check the structure of the dataset
print(dataset)


DatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 5328
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 1333
    })
})


## Prepare Feature Extractor, Tokenizer and Data
The ASR pipeline can be decomposed into three stages:

1. A feature extractor which pre-processes the raw audio inputs

2. The model which performs the sequence-to-sequence mapping

3. A tokenizer which post-processes the model outputs to text format

In Transformers, the Whisper model has an associated feature extractor and tokenizer, called WhisperFeatureExtractor and WhisperTokenizer respectively. WhisperProcessor wraps both in a single object.

In [11]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")

In [12]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-base", language="English", task="transcribe")

In [13]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-base", language="English", task="transcribe")

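Now we can write a function that prepares our data for the model:

1. We load the audio data by accessing batch["audio"], which decodes the audio into an array together with its sampling rate. Whisper expects 16 kHz input, so audio at any other rate must be resampled first.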

2. We use the feature extractor to compute the log-Mel spectrogram input features from our 1-dimensional audio array.

3. We encode the transcriptions to label ids through the use of the tokenizer.
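To make step 2 concrete, the sketch below works through the arithmetic behind Whisper's fixed-size input. It assumes the feature extractor's usual settings (16 kHz audio, 30-second windows, a hop of 160 samples, 80 mel bins for whisper-base). The helper `pad_or_trim` is a hypothetical name used for illustration and is not part of Transformers.

```python
SAMPLING_RATE = 16_000  # Hz (assumed Whisper input rate)
CHUNK_SECONDS = 30      # every clip is padded or cut to this duration
HOP_LENGTH = 160        # samples between spectrogram frames (assumed default)
N_MELS = 80             # mel bins (assumed for whisper-base)

N_SAMPLES = SAMPLING_RATE * CHUNK_SECONDS  # samples in one 30 s window
N_FRAMES = N_SAMPLES // HOP_LENGTH         # spectrogram frames per window


def pad_or_trim(samples, length=N_SAMPLES):
    """Zero-pad or truncate a list of samples to exactly `length` items."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


clip = [0.1] * (5 * SAMPLING_RATE)  # a synthetic 5-second clip
padded = pad_or_trim(clip)

print("samples after padding:", len(padded))
print("expected log-Mel shape:", (N_MELS, N_FRAMES))
```

Under these assumptions every example ends up with the same feature shape regardless of the original clip length, which is why the audio side of the batch needs no further padding later on.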

In [15]:
def prepare_dataset(batch):
    # load the decoded audio; Whisper's feature extractor expects 16 kHz input
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

In [16]:
dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names["train"], num_proc=2)

## Training and Evaluation
Now that we've prepared our data, we're ready to dive into the training pipeline. The Trainer will do much of the heavy lifting for us. All we have to do is:

1. Load a pre-trained checkpoint: we need to load a pre-trained checkpoint and configure it correctly for training.

2. Define a data collator: the data collator takes our pre-processed data and prepares PyTorch tensors ready for the model.

3. Evaluation metrics: during evaluation, we want to evaluate the model using the word error rate (WER) metric. We need to define a compute_metrics function that handles this computation.

4. Define the training configuration: this will be used by the Trainer to define the training schedule.

Once we've fine-tuned the model, we will evaluate it on the test data to verify that it has learned to transcribe speech correctly.

In [19]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")

In [20]:
model.generation_config.language = "english"
model.generation_config.task = "transcribe"

model.generation_config.forced_decoder_ids = None

## Define a Data Collator
The data collator for a sequence-to-sequence speech model is unique in the sense that it treats the input_features and labels independently: the input_features must be handled by the feature extractor and the labels by the tokenizer.

The input_features are already padded to 30s and converted to a log-Mel spectrogram of fixed dimension by the feature extractor, so all we have to do is convert the input_features to batched PyTorch tensors. We do this using the feature extractor's .pad method with return_tensors="pt".

The labels, by contrast, are un-padded: we pad them to the longest sequence in the batch with the tokenizer's .pad method, then replace the padding tokens with -100 so that they are ignored when computing the loss.

We can leverage the WhisperProcessor we defined earlier to perform both the feature extractor and the tokenizer operations.

In [21]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was prepended in the earlier tokenization step,
        # cut it here, since it is added again later anyway
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [22]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)
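The label handling in the collator can be illustrated without any tensors. The sketch below is a simplified stand-in that uses plain Python lists, not the collator itself. The token ids and the helper names `pad_and_mask` and `strip_start_token` are made up for illustration. The value -100 matches the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def pad_and_mask(label_seqs, ignore_index=IGNORE_INDEX):
    """Right-pad every label sequence to the batch maximum with ignore_index."""
    max_len = max(len(seq) for seq in label_seqs)
    return [seq + [ignore_index] * (max_len - len(seq)) for seq in label_seqs]


def strip_start_token(label_seqs, start_id):
    """Drop the first token only if every sequence begins with start_id."""
    if all(seq and seq[0] == start_id for seq in label_seqs):
        return [seq[1:] for seq in label_seqs]
    return label_seqs


START_ID = 1  # hypothetical decoder start token id
labels = [[1, 7, 8, 9, 2], [1, 5, 2]]  # hypothetical tokenized transcripts

batch_labels = strip_start_token(pad_and_mask(labels), START_ID)
print(batch_labels)  # [[7, 8, 9, 2], [5, 2, -100, -100]]
```

The shorter sequence is filled with -100 rather than a real pad token, so padded positions contribute nothing to the training loss.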

## Evaluation Metrics
We'll use the word error rate (WER), the de facto standard metric for assessing ASR systems.
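For intuition, WER is the word-level edit distance (substitutions, deletions and insertions) between the hypothesis and the reference, divided by the number of reference words. The sketch below is a minimal pure-Python version written for illustration. The function name and the sample sentences are ours, and the evaluate library's implementation may differ in details such as text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


reference = "patient reports sharp chest pain"
hypothesis = "patient report sharp pain"

# one substitution (reports -> report) and one deletion (chest):
# 2 errors over 5 reference words
print(word_error_rate(reference, hypothesis))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.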

In [25]:
import evaluate

metric = evaluate.load("wer")

In [26]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

## Define the Training Configuration
In the final step, we define all the parameters related to training.

In [28]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-base-shantanu",  
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)





In [30]:
!pip install tensorboard




In [31]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

max_steps is given, it will override any value given in num_train_epochs


In [32]:
processor.save_pretrained(training_args.output_dir)

[]

## Training
Training took about 2 hours in the run below (a train_runtime of roughly 7,250 seconds); the exact time depends on your GPU.

In [33]:
trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`...


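| Step | Training Loss | Validation Loss | WER |
|------|---------------|-----------------|----------|
| 1000 | 0.0544 | 0.127506 | 7.140255 |
| 2000 | 0.007 | 0.114664 | 6.404372 |
| 3000 | 0.0007 | 0.118265 | 5.938069 |
| 4000 | 0.0004 | 0.11945 | 5.945355 |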
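The training configuration sets load_best_model_at_end=True with metric_for_best_model="wer" and greater_is_better=False, so the checkpoint with the lowest WER is the one kept. The sketch below mimics that selection on the logged values. It is an illustration of the rule, not the Trainer's internal code, and the dictionary keys and the `pick_best` helper are assumptions made for the example.

```python
# Evaluation results copied from the training log above
eval_log = [
    {"step": 1000, "eval_loss": 0.127506, "eval_wer": 7.140255},
    {"step": 2000, "eval_loss": 0.114664, "eval_wer": 6.404372},
    {"step": 3000, "eval_loss": 0.118265, "eval_wer": 5.938069},
    {"step": 4000, "eval_loss": 0.11945, "eval_wer": 5.945355},
]


def pick_best(log, metric="eval_wer", greater_is_better=False):
    """Return the log row with the best value of `metric`."""
    chooser = max if greater_is_better else min
    return chooser(log, key=lambda row: row[metric])


best_by_wer = pick_best(eval_log)
best_by_loss = pick_best(eval_log, metric="eval_loss")

print("best by WER:", best_by_wer["step"])
print("best by validation loss:", best_by_loss["step"])
```

The two criteria disagree here: validation loss is lowest at step 2000, while WER keeps improving until step 3000, which is why the choice of metric_for_best_model matters.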


You have passed task=transcribe, but also have set `forced_decoder_ids` to [[1, 50259], [2, 50359], [3, 50363]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.

TrainOutput(global_step=4000, training_loss=0.07025776202371344, metrics={'train_runtime': 7249.5923, 'train_samples_per_second': 8.828, 'train_steps_per_second': 0.552, 'total_flos': 4.15103975424e+18, 'train_loss': 0.07025776202371344, 'epoch': 12.012012012012011})
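The epoch count and throughput reported in TrainOutput follow from the dataset size and the training arguments. The sketch below reproduces them, assuming a single GPU (the notebook does not state the device count).

```python
import math

train_rows = 5328          # rows in the train split
per_device_batch = 16      # per_device_train_batch_size
grad_accum = 1             # gradient_accumulation_steps
num_devices = 1            # assumption: one GPU
max_steps = 4000
train_runtime = 7249.5923  # seconds, from TrainOutput

effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(train_rows / effective_batch)
epochs = max_steps / steps_per_epoch

steps_per_second = max_steps / train_runtime
samples_per_second = max_steps * effective_batch / train_runtime

print(f"steps per epoch:    {steps_per_epoch}")
print(f"epochs trained:     {epochs:.3f}")
print(f"steps per second:   {steps_per_second:.3f}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"runtime in hours:   {train_runtime / 3600:.2f}")
```

With these numbers the model sees the training set about 12 times, which is consistent with the training loss falling far below the validation loss by the last evaluation steps.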

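These are the final results. The lowest WER, about 5.94%, is reached at step 3000; the last checkpoint at step 4000 scores about 5.95%. Since load_best_model_at_end=True and metric_for_best_model="wer", the step-3000 checkpoint should be the one loaded when training finishes.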

## Push Fine Tuned Model to Hugging Face

We now push the fine-tuned model to the Hugging Face Hub.

In [37]:
# Metadata used to fill in the model card on the Hub
kwargs = {
    "dataset_tags": "Hani89/medical_asr_recording_dataset",
    "dataset": "medical-speech-transcription-and-intent",
    "dataset_args": "config: en, split: test",
    "language": "en",
    "model_name": "Whisper Base - Shantanu",
    "finetuned_from": "openai/whisper-base",
    "tasks": "automatic-speech-recognition",
}

In [38]:
trainer.push_to_hub(**kwargs)

CommitInfo(commit_url='https://huggingface.co/shantanu007/whisper-base-shantanu/commit/91ceda49aac5aa48229a73f7c166eebedc3a332e', commit_message='End of training', commit_description='', oid='91ceda49aac5aa48229a73f7c166eebedc3a332e', pr_url=None, repo_url=RepoUrl('https://huggingface.co/shantanu007/whisper-base-shantanu', endpoint='https://huggingface.co', repo_type='model', repo_id='shantanu007/whisper-base-shantanu'), pr_revision=None, pr_num=None)

## Live Demo

You can provide input either by uploading an audio file or by recording through a microphone.

In [43]:
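from transformers import pipeline
import gradio as gr
import numpy as np
import torch

# Use the first GPU if one is available, otherwise fall back to the CPU
device = 0 if torch.cuda.is_available() else -1
pipe = pipeline("automatic-speech-recognition", model="shantanu007/whisper-base-shantanu", device=device)

def transcribe(audio):
    # Gradio passes None when nothing was recorded or uploaded
    if audio is None:
        return "No audio input received"

    # With type="numpy", Gradio passes a (sample_rate, samples) tuple
    sampling_rate, samples = audio
    samples = samples.astype(np.float32)

    # Down-mix multi-channel audio to mono
    if samples.ndim > 1:
        samples = samples.mean(axis=1)

    # Scale to [-1, 1]; skip silent clips to avoid dividing by zero
    peak = np.max(np.abs(samples))
    if peak > 0:
        samples = samples / peak

    # Pass the sampling rate along with the raw samples
    text = pipe({"sampling_rate": sampling_rate, "raw": samples})["text"]
    return text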

iface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="numpy"),  # Set type to numpy for compatibility
    outputs="text",
    title="Whisper Base Real Time",
    description="Realtime demo for Medical speech recognition using a fine-tuned Whisper base model.",
)

iface.launch()


* Running on local URL:  http://127.0.0.1:7861

To create a public link, set `share=True` in `launch()`.


