## Wav2Vec 2.0 Fine-Tuning

In this notebook we fine-tune the wav2vec2-base-960h model on the medical-speech-transcription-and-intent dataset using the Hugging Face libraries.

Logging in to Hugging Face

An authentication token is required; a token of type write or fine-grained will work.

In [2]:
from huggingface_hub import notebook_login

notebook_login()


## Loading Dataset

We use the medical-speech-transcription-and-intent dataset, which is available on Hugging Face at https://huggingface.co/datasets/Hani89/medical_asr_recording_dataset

In [31]:
from datasets import load_dataset

# Load the dataset from Hugging Face
dataset = load_dataset("Hani89/medical_asr_recording_dataset")

# Check the structure of the dataset
print(dataset)


DatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 5328
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 1333
    })
})


## Prepare Feature Extractor, Tokenizer and Data
The ASR pipeline can be decomposed into three stages:

1. A feature extractor, which pre-processes the raw audio inputs

2. The model, which maps the processed audio to a sequence of per-frame character logits

3. A tokenizer, which post-processes the model outputs into text

In wav2vec 2.0, Wav2Vec2Processor bundles the feature extractor and the tokenizer, and Wav2Vec2ForCTC is the model.
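To make stage 3 concrete, here is a minimal sketch of greedy CTC decoding in plain Python: take the most likely id per frame, collapse repeats, drop the blank token, and join the rest. The toy vocabulary and ids are made up for illustration; only the conventions (the pad token doubling as the CTC blank, `|` as the word delimiter) follow the wav2vec2 vocabulary. In practice `processor.batch_decode` does this work.

```python
from itertools import groupby

# Hypothetical toy vocabulary: id 0 acts as the CTC blank/pad token,
# "|" marks a word boundary.
VOCAB = {0: "<pad>", 1: "|", 2: "H", 3: "I", 4: "T"}
BLANK_ID = 0


def greedy_ctc_decode(frame_ids):
    """Collapse repeated ids, drop blanks, and map the rest to text."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    chars = [VOCAB[i] for i in collapsed if i != BLANK_ID]
    return "".join(chars).replace("|", " ").strip()


# One id per audio frame, e.g. the argmax over the model logits
frame_ids = [2, 2, 0, 3, 3, 1, 1, 0, 3, 0, 4, 4]
print(greedy_ctc_decode(frame_ids))  # prints "HI IT"
```

A blank between two identical ids is what lets CTC emit a doubled letter, so `[2, 0, 2]` decodes to `HH` rather than `H`.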

In [4]:
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, Trainer, TrainingArguments
import torch

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")


Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Now we can write a function that prepares our data for the model:

1. We load the audio data by accessing batch["audio"]; datasets decodes the file on access, and resamples it when the column has been cast to a target sampling rate.

2. We use the feature extractor to turn the 1-dimensional raw audio array into normalized input_values. wav2vec 2.0 consumes the raw waveform directly, not a log-Mel spectrogram.

3. We encode the transcriptions to label ids using the tokenizer.
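As a rough illustration of step 2: for this checkpoint the feature extractor essentially rescales each waveform to zero mean and unit variance. The sketch below reproduces that idea in plain Python; the sample values are made up, and the small `eps` term is an assumption added for numerical safety.

```python
import math


def normalize_waveform(samples, eps=1e-7):
    """Scale a 1-D waveform to zero mean and (approximately) unit variance."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    scale = math.sqrt(variance + eps)
    return [(x - mean) / scale for x in samples]


waveform = [0.1, -0.2, 0.4, 0.0, -0.3]  # made-up audio samples
normalized = normalize_waveform(waveform)
print([round(x, 3) for x in normalized])
```

In the notebook itself this step is handled by the `processor(...)` call, so there is no need to normalize by hand.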

In [32]:
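from datasets import Audio

# wav2vec2-base-960h expects 16 kHz audio; casting makes `datasets` resample on access
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

def prepare_dataset(batch):
    audio = batch["audio"]

    # batched output is "un-batched" to ensure mapping is correct
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["input_length"] = len(batch["input_values"])

    # encode the transcription to label ids; this checkpoint's vocabulary is upper-case
    batch["labels"] = processor.tokenizer(batch["sentence"].upper()).input_ids
    return batch

# Apply the preprocessing function to the dataset
dataset = dataset.map(prepare_dataset, remove_columns=["audio", "sentence"])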



## Training and Evaluation
Now that we've prepared our data, we're ready to dive into the training pipeline. The Trainer will do much of the heavy lifting for us.

### Define a Data Collator
The data collator for a CTC speech model is unique in the sense that it treats the input_values and labels independently: the input_values must be padded by the feature extractor and the labels by the tokenizer.

In [33]:
import torch

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:
    """Pad audio inputs and label ids separately for CTC training."""

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

In [34]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
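The label handling in the collator can be hard to picture, so here is a small plain-Python sketch of the same idea: pad label sequences to a common length, then replace the padded positions with -100 so the loss ignores them. The helper name `pad_labels` and the ids are hypothetical; the real collator does this with tensors via `processor.pad` and `masked_fill`.

```python
IGNORE_INDEX = -100  # value the loss skips, as in the collator above


def pad_labels(label_seqs, pad_id=0):
    """Right-pad label id lists to a common length, then mask the padding."""
    max_len = max(len(seq) for seq in label_seqs)
    padded, mask = [], []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    # Same effect as masked_fill(attention_mask.ne(1), -100)
    return [
        [tok if keep else IGNORE_INDEX for tok, keep in zip(row, m)]
        for row, m in zip(padded, mask)
    ]


print(pad_labels([[5, 6, 7], [8]]))  # prints [[5, 6, 7], [8, -100, -100]]
```

Masking by the attention mask rather than by the pad id keeps a genuine label that happens to equal the pad id from being ignored.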

## Evaluation Metrics
We'll use the word error rate (WER), the de facto standard metric for assessing ASR systems.

In [36]:
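import evaluate
import numpy as np  # used by compute_metrics for the argmax over logits

# the name must match the one used inside compute_metrics
wer_metric = evaluate.load("wer")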

In [37]:
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
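To see what the WER number means, the sketch below computes it by hand for a single sentence pair: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. This is an illustrative re-implementation, not the code used by `evaluate`; the function name and example sentences are made up, and it assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (has -> had) and one deletion (a) over 5 reference words
print(word_error_rate("the patient has a fever", "the patient had fever"))  # prints 0.4
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.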

## Define the Training Configuration
In the final step, we define all the parameters related to training.

In [46]:
from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir='./wev2vec-base-shantanu',
  group_by_length=True,
  per_device_train_batch_size=1,
  per_device_eval_batch_size=1,
  evaluation_strategy="steps",
  num_train_epochs=3,
  fp16=True,
)



In [47]:
# Load the WER metric used by the compute_metrics function
import evaluate

wer_metric = evaluate.load("wer")

## Training

In [48]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=processor.feature_extractor,
)



In [49]:
trainer.train()



OutOfMemoryError: CUDA out of memory. Tried to allocate 30.00 MiB. GPU 0 has a total capacity of 5.77 GiB of which 41.56 MiB is free. Process 4606 has 2.99 GiB memory in use. Including non-PyTorch memory, this process has 1.22 GiB memory in use. Of the allocated memory 936.57 MiB is allocated by PyTorch, and 159.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
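The error above reports that another process already holds 2.99 GiB of the 5.77 GiB GPU, so freeing that process is the first thing to try. Beyond that, the sketch below lists a few memory-saving settings that could be tried. Every key is a standard TrainingArguments parameter, but the values, the 10-second cut-off, and the helper `is_short_enough` are untested assumptions for illustration, not settings confirmed to fix this run.

```python
# Hypothetical overrides; the values are untuned guesses for a ~6 GiB GPU.
memory_saving_overrides = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,  # larger effective batch, same peak memory
    "gradient_checkpointing": True,    # recompute activations instead of storing them
    "fp16": True,
    "eval_accumulation_steps": 4,      # move eval outputs off the GPU periodically
}

MAX_SECONDS = 10        # assumed cut-off for clip length
SAMPLING_RATE = 16_000  # rate wav2vec2-base-960h was trained on


def is_short_enough(input_length, max_seconds=MAX_SECONDS, rate=SAMPLING_RATE):
    """True if a clip of `input_length` samples fits the duration budget."""
    return input_length <= max_seconds * rate


effective_batch = (
    memory_saving_overrides["per_device_train_batch_size"]
    * memory_saving_overrides["gradient_accumulation_steps"]
)
print(effective_batch)  # prints 8
```

The dictionary could be unpacked into the existing call as `TrainingArguments(output_dir=..., **memory_saving_overrides)`, and long clips could be dropped with `dataset["train"].filter(is_short_enough, input_columns=["input_length"])`, since activation memory grows with clip length.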