In [1]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"wasifijaz","key":"<redacted>"}'}

# Approach 1

In [None]:
!pip install transformers datasets librosa torch soundfile

In [None]:
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer, Wav2Vec2Processor
from transformers import TrainingArguments, Trainer
import soundfile as sf
import numpy as np

This cell loads the train.100 split of the "clean" LibriSpeech configuration, reads each audio file into an array, and initializes a tokenizer and model from the Wav2Vec 2.0 base checkpoint trained on 960 hours of speech. The map_to_array function reads each audio file and stores its waveform in the dataset for later processing.

In [None]:
def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

dataset = load_dataset("librispeech_asr", "clean", split="train.100")
dataset = dataset.map(map_to_array)

tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

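This cell converts the speech waveforms and their transcripts into model inputs and labels. The call interface of Wav2Vec2Tokenizer expects raw audio, so a Wav2Vec2Processor (feature extractor plus CTC tokenizer) is loaded to encode the text as well. The prepare_dataset function normalizes each waveform into input values and encodes each transcript as character-level label IDs; padding is deferred to batching time so that no transcript is truncated. After mapping, the dataset retains only the processed inputs and labels, discarding the original columns.

In [None]:
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

def prepare_dataset(batch):
    # LibriSpeech audio is sampled at 16 kHz, the rate the model expects.
    batch["input_values"] = processor(batch["speech"], sampling_rate=16_000).input_values[0]
    # Character-level label IDs for the CTC loss; padding happens per batch.
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch

dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)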

The code below configures and starts fine-tuning of the Wav2Vec 2.0 model on the LibriSpeech dataset. TrainingArguments sets the batch size, gradient accumulation, learning rate, warmup, checkpointing and logging intervals, and mixed-precision (FP16) training. No evaluation set is passed to the Trainer, so periodic evaluation is left disabled. Tokenizers do not provide a data_collator attribute, so a small collate function pads each batch instead: waveforms are zero-padded and labels are padded with -100, which the CTC loss ignores. With these settings, the model is trained for one epoch.

In [None]:
training_args = TrainingArguments(
  output_dir="./wav2vec2-finetuned-librispeech",
  group_by_length=True,
  per_device_train_batch_size=16,
  gradient_accumulation_steps=2,
  num_train_epochs=1,
  fp16=True,
  save_steps=400,
  logging_steps=400,
  learning_rate=1e-4,
  warmup_steps=500,
  save_total_limit=2,
)

def collate_ctc(features):
    # Zero-pad the waveforms and pad the labels with -100, which the CTC loss ignores.
    inputs = [torch.tensor(f["input_values"], dtype=torch.float32) for f in features]
    labels = [torch.tensor(f["labels"], dtype=torch.long) for f in features]
    return {
        "input_values": torch.nn.utils.rnn.pad_sequence(inputs, batch_first=True),
        "labels": torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=-100),
    }

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    data_collator=collate_ctc,
)

trainer.train()

In [None]:
trainer.save_model('speech2text')

This function converts speech from an audio file into text using a pretrained Wav2Vec 2.0 model. It reads the audio file, processes the speech waveform through the tokenizer to generate model input values, and feeds these to the model to get logits. The most likely token IDs are selected from the logits, and these IDs are decoded into a transcription of the audio. Finally, it returns the transcribed text, providing a straightforward way to apply speech recognition to an audio file.

In [None]:
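def speech_to_text(audio_file):
    # The model expects 16 kHz mono audio; sf.read does not resample.
    speech, _ = sf.read(audio_file, dtype="float32")
    input_values = tokenizer(speech, return_tensors="pt").input_values
    # Run on the same device as the model and skip gradient tracking for inference.
    with torch.no_grad():
        logits = model(input_values.to(model.device)).logits
    # Greedy decoding: most likely token per frame, then collapse into text.
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = tokenizer.batch_decode(predicted_ids)[0]
    return transcription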

audio_file = "test audio.wav"
print(speech_to_text(audio_file))
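
The argmax-then-decode step can be illustrated without the model. The following is a minimal sketch, assuming a toy five-token vocabulary (hypothetical; the real mapping ships with the checkpoint) in which ID 0 is the CTC blank and "|" marks word boundaries. Greedy CTC decoding takes the most likely ID per frame, collapses consecutive repeats, and then drops blanks.

```python
# Toy vocabulary for illustration only; the real mapping comes from the checkpoint.
BLANK_ID = 0
ID_TO_CHAR = {0: "<pad>", 1: "|", 2: "H", 3: "I", 4: "O"}

def greedy_ctc_decode(frame_scores):
    """Decode per-frame scores (one list of scores per frame) into text."""
    # 1. Pick the highest-scoring token ID for every frame (argmax).
    ids = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # 2. Collapse consecutive repeats of the same ID.
    collapsed = [i for k, i in enumerate(ids) if k == 0 or i != ids[k - 1]]
    # 3. Drop blanks, then map IDs to characters.
    chars = [ID_TO_CHAR[i] for i in collapsed if i != BLANK_ID]
    # 4. Turn the word delimiter into a space.
    return "".join(chars).replace("|", " ").strip()

def one_hot(index, size=5):
    """Build a score vector whose argmax is `index`."""
    return [1.0 if j == index else 0.0 for j in range(size)]

frames = [one_hot(i) for i in [2, 2, 0, 3, 3, 1, 2, 0, 4, 4]]
print(greedy_ctc_decode(frames))
```

A blank between two identical IDs keeps them as two separate characters, which is how CTC represents doubled letters.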

# Approach 2

In [2]:
!pip install kaggle



In [3]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [4]:
!chmod 600 ~/.kaggle/kaggle.json

In [5]:
!kaggle datasets download -d mozillaorg/common-voice

Downloading common-voice.zip to /content
100% 12.0G/12.0G [10:41<00:00, 22.1MB/s]
100% 12.0G/12.0G [10:41<00:00, 20.2MB/s]


In [6]:
!unzip common-voice.zip

Streaming output truncated to the last 5000 lines.
  inflating: cv-valid-train/cv-valid-train/sample-190776.mp3  
  inflating: cv-valid-train/cv-valid-train/sample-190777.mp3  
  inflating: cv-valid-train/cv-valid-train/sample-190778.mp3  
  inflating: cv-valid-train/cv-valid-train/sample-190779.mp3  
  inflating: cv-valid-train/cv-valid-train/sample-190780.mp3  
  inflating: cv-valid-train/cv-valid-train/sample-190781.mp3  
  inflating: cv-valid-train/cv-valid-train/sample-190782.mp3  
  inflating: cv-valid-train/cv-valid-train/sample-190783.mp3  
  inflating: cv-valid-train/cv-valid-train/sample-190784.mp3  
  inflating: cv-valid-train/cv-valid-train/sample-190785.mp3  
  inflating: cv-valid-train/cv-valid-train/sample-190786.mp3  
  inflating: cv-valid-train/cv-valid-train/sample-190787.mp3  
  inflating: cv-valid-train/cv-valid-train/sample-190788.mp3  
  inflating: cv-valid-train/cv-valid-train/sample-190789.mp3  
  inflating: cv-valid-train/cv-valid-train/sample-190

In [7]:
!pip install datasets jiwer torch deepspeed
!pip install transformers[torch] -U
!pip install -U accelerate
!pip install -U transformers

Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting jiwer
  Downloading jiwer-3.0.3-py3-none-any.whl (21 kB)
Collecting deepspeed
  Downloading deepspeed-0.14.0.tar.gz (1.3 MB)
  Preparing metadata (setup.py) ... done
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)

In [10]:
!python -c "from accelerate.utils import write_basic_config; write_basic_config(mixed_precision='fp16')"

In [11]:
import pandas as pd
df = pd.read_csv("/content/cv-valid-train.csv")

subset_df = df.sample(10000, replace=False, random_state=24)

listnames = subset_df["filename"]
listnames = ['/content/cv-valid-train/'+x for x in listnames]

In [12]:
from datasets import load_dataset, load_metric, Audio
dataset = load_dataset("audiofolder", data_files={"train": listnames})

Resolving data files:   0%|          | 0/10000 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [13]:
print(dataset)
dataset = dataset["train"]
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
dataset = dataset.add_column("text", subset_df["text"])
dataset = dataset.add_column("file", subset_df["filename"])

DatasetDict({
    train: Dataset({
        features: ['audio'],
        num_rows: 10000
    })
})


In [14]:
cv_ds = dataset.train_test_split(test_size=0.3)
cv_ds["train"][1]

{'audio': {'path': '/content/cv-valid-train/cv-valid-train/sample-189365.mp3',
  'array': array([-5.09317033e-11,  1.89174898e-10,  2.40106601e-10, ...,
          1.96028594e-03,  1.56119280e-03,  1.88966468e-03]),
  'sampling_rate': 16000},
 'text': 'the boy offered his bottle hoping that the old man would leave him alone',
 'file': 'cv-valid-train/sample-189365.mp3'}

This cell defines a function that cleans the transcripts by removing punctuation and converting the text to uppercase. A regular expression matches commas, question marks, periods, exclamation marks, hyphens, semicolons, colons, and double quotes. The function is then applied to every example in cv_ds with the .map method, which returns a dataset with the cleaned text.

In [15]:
import re
# Raw string so the backslashes reach the regex engine without invalid-escape warnings.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'

def remove_special_characters(batch):
    batch["text"] = re.sub(chars_to_ignore_regex, '', batch["text"]).upper()
    return batch

cv_ds = cv_ds.map(remove_special_characters)

Map:   0%|          | 0/7000 [00:00<?, ? examples/s]

Map:   0%|          | 0/3000 [00:00<?, ? examples/s]
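
To see what the cleaning step does to a single transcript, here is a small standalone sketch. It reuses the character class from the cell above; the sample sentence is made up for illustration. Note that apostrophes are kept and that removing a hyphen joins the two parts of a word.

```python
import re

# Same character class as in the notebook cell, written as a raw string.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'

def clean(text):
    """Remove the listed punctuation and uppercase the result."""
    return re.sub(chars_to_ignore_regex, '', text).upper()

# Hypothetical sample sentence, not taken from the dataset.
print(clean("Don't stop, it's re-usable!"))
```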

This cell prepares the audio dataset for training and evaluation with a Wav2Vec 2.0 model from the transformers library:

- Processor Initialization: A Wav2Vec2Processor is loaded from the pretrained "facebook/wav2vec2-large-960h" checkpoint. The processor combines a feature extractor for the audio with a tokenizer that encodes the text labels.
- Dataset Preparation Function: The prepare_dataset function processes one example at a time. It converts the audio array, together with its sampling rate, into model-ready input values. Inside the as_target_processor() context, the processor switches to its tokenizer and encodes the text into input IDs, which serve as labels for supervised training.
- Dataset Mapping: The .map method applies prepare_dataset across both splits of cv_ds and removes the original columns, keeping only the processed features (input_values) and labels (labels). The num_proc=2 argument runs the mapping in 2 parallel processes.
- Printing the Prepared Dataset: Finally, the cell prints the prepared dataset (cv_prepared), which holds 7,000 training and 3,000 test examples.

After this step, the raw audio and text are in the form the model consumes directly.

In [16]:
from transformers import Wav2Vec2Processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")

def prepare_dataset(batch):
    audio = batch["audio"]

    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]

    with processor.as_target_processor():
        batch["labels"] = processor(batch["text"]).input_ids
    return batch

cv_prepared = cv_ds.map(prepare_dataset, remove_columns=cv_ds.column_names["train"], num_proc=2)
print(cv_prepared)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


preprocessor_config.json:   0%|          | 0.00/159 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/163 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/843 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/291 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

Map (num_proc=2):   0%|          | 0/7000 [00:00<?, ? examples/s]



Map (num_proc=2):   0%|          | 0/3000 [00:00<?, ? examples/s]



DatasetDict({
    train: Dataset({
        features: ['input_values', 'labels'],
        num_rows: 7000
    })
    test: Dataset({
        features: ['input_values', 'labels'],
        num_rows: 3000
    })
})


In [17]:
import random
import IPython.display as ipd
import numpy as np

# random.randint includes both endpoints, so subtract 1 to stay within the valid index range.
rand_int = random.randint(0, len(cv_prepared["train"]) - 1)

print("Target text:", cv_ds["train"][rand_int]["text"])
print("Target text:", cv_prepared["train"][rand_int]["labels"])
print("Input array shape:", np.asarray(cv_prepared["train"][rand_int]["input_values"]).shape)

ipd.Audio(data=np.asarray(cv_prepared["train"][rand_int]["input_values"]), autoplay=False, rate=16000)

Target text: AT MOST THEY THOUGHT THAT ANOTHER METEORITE HAD DESCENDED
Target text: [7, 6, 4, 17, 8, 12, 6, 4, 6, 11, 5, 22, 4, 6, 11, 8, 16, 21, 11, 6, 4, 6, 11, 7, 6, 4, 7, 9, 8, 6, 11, 5, 13, 4, 17, 5, 6, 5, 8, 13, 10, 6, 5, 4, 11, 7, 14, 4, 14, 5, 12, 19, 5, 9, 14, 5, 14]
Input array shape: (91008,)
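
The label IDs printed above are a character-level encoding of the transcript. As a rough sketch, the table below is a partial character-to-ID mapping read off that printed example (the full vocabulary lives in the checkpoint's vocab.json and is not reproduced here); the encode helper is hypothetical and only mimics what the tokenizer does for these characters.

```python
# Partial mapping inferred from the printed example; "|" is the word delimiter.
VOCAB = {"|": 4, "E": 5, "T": 6, "A": 7, "O": 8, "N": 9, "I": 10, "H": 11,
         "S": 12, "R": 13, "D": 14, "U": 16, "M": 17, "C": 19, "G": 21, "Y": 22}

def encode(text):
    """Map each character to its ID, with spaces replaced by the word delimiter."""
    return [VOCAB[ch] for ch in text.replace(" ", "|")]

# First three words of the transcript shown above.
print(encode("AT MOST THEY"))
```

The result matches the first twelve IDs in the printed label list.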


The code below defines a custom data collator, DataCollatorCTCWithPadding, for CTC-based speech recognition with models such as Wav2Vec 2.0. It dynamically pads each batch of audio samples and their labels to the longest sequence in the batch, or to a specified maximum length, so that the batch can be stacked into tensors. Here's a breakdown of its components:

- Initialization: The collator takes a Wav2Vec2Processor for processing the input data and optional arguments for padding control. These include enabling/disabling padding, setting maximum lengths for inputs and labels, and specifying if padding should be added to make sequences a multiple of a specific number.
- Call Method: When called with a batch of data, it separates input features (audio data) from label features (transcriptions), then pads both using the processor's .pad method. This padding aligns sequences within a batch to the same length, a requirement for efficient batch processing in deep learning models.
- Labels Masking: The attention mask generated during padding of labels is used to replace padding tokens in labels with -100. This value is commonly used in PyTorch models to ignore specific tokens when calculating the loss, effectively excluding padded areas from contributing to the model's training.
- Batch Assembly: Finally, it combines the padded inputs and masked labels into a single batch and returns it as a dictionary of tensors, ready for model training.

This collator is essential for preparing batches in tasks where input sequences vary significantly in length, such as in speech recognition, ensuring that the model receives uniformly shaped input batches for efficient and effective training.

In [18]:
import torch

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:
    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                max_length=self.max_length_labels,
                pad_to_multiple_of=self.pad_to_multiple_of_labels,
                return_tensors="pt",
            )

        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch
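
The label-masking step is easy to check on plain lists. The following is a minimal sketch in pure Python, not the processor's implementation: pad_labels is a hypothetical helper that pads label lists to the longest one, builds the matching attention mask, and replaces every padded position with -100, mirroring the masked_fill call above.

```python
IGNORE_INDEX = -100  # negative label values are excluded from the CTC loss

def pad_labels(label_lists, pad_id=0):
    """Pad label ID lists to the longest one and mask the padding with IGNORE_INDEX."""
    max_len = max(len(labels) for labels in label_lists)
    padded, masks = [], []
    for labels in label_lists:
        n_pad = max_len - len(labels)
        padded.append(labels + [pad_id] * n_pad)
        masks.append([1] * len(labels) + [0] * n_pad)
    # Same effect as masked_fill(attention_mask.ne(1), -100) on tensors.
    return [[tok if m == 1 else IGNORE_INDEX for tok, m in zip(row, mask)]
            for row, mask in zip(padded, masks)]

print(pad_labels([[7, 6, 4], [11, 5]]))
```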

The code below instantiates the custom data collator for the CTC (Connectionist Temporal Classification) model and loads the metric used to evaluate it, the Word Error Rate (WER).

DataCollatorCTCWithPadding
The collator defined above is created with the Wav2Vec2Processor and padding=True, so each batch is padded to its longest input and its longest label sequence. Padded label positions are set to -100 so that they are ignored during loss computation.

Compute Metrics Function
The compute_metrics function calculates the WER, a common metric for speech recognition systems: the number of word-level errors (insertions, deletions, and substitutions) divided by the number of words in the reference.
The function takes the model's predictions (logits), selects the most likely token ID per frame with argmax, and decodes these IDs into text.
Label positions masked with -100 are first replaced with the pad token ID so that the labels can be decoded; the labels are decoded with group_tokens=False so that repeated characters are not collapsed.
Finally, it computes the WER with the metric loaded through load_metric from the datasets library and returns the result in a dictionary.

Together, the collator and the metric cover batch preparation and evaluation for the training run that follows.

In [19]:
from datasets import load_metric

data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
wer_metric = load_metric("wer")

  wer_metric = load_metric("wer")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/1.90k [00:00<?, ?B/s]

In [20]:
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
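
To make the WER definition concrete, here is a small self-contained sketch that computes it for a single reference/hypothesis pair with a word-level edit distance. It is an illustration only, not the implementation behind the "wer" metric, and the hypothesis sentence is invented for the example.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

reference = "THE BOY OFFERED HIS BOTTLE"
hypothesis = "THE BOY OFFER HIS BOTTLE NOW"  # one substitution, one insertion
print(word_error_rate(reference, hypothesis))
```

Two errors against five reference words give a WER of 0.4. Because insertions count as errors, the WER can exceed 1.0.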

Training pipeline for a Wav2Vec 2.0 model using the Hugging Face Transformers library, specifically configured for a speech recognition task using the Connectionist Temporal Classification (CTC) approach. Here's a breakdown of the key components:

- Model Initialization: Wav2Vec2ForCTC is instantiated from the "facebook/wav2vec2-large-960h" checkpoint on Hugging Face's model hub, which is already fine-tuned for English speech recognition. Calling freeze_feature_extractor() (a deprecated alias of freeze_feature_encoder()) freezes the convolutional feature encoder, so only the transformer layers and the CTC head are updated. This reduces the compute and memory needed for fine-tuning.
- Training Arguments: TrainingArguments specifies the output directory, the batch size and gradient accumulation, the evaluation, saving, and logging intervals, the learning rate, weight decay, and warmup, and mixed-precision training with fp16=True. It also enables gradient checkpointing, which trades extra computation for lower memory use.
- Memory Management: Before training, garbage collection is invoked and the CUDA cache is cleared. The environment variable PYTORCH_CUDA_ALLOC_CONF is set to enable expandable segments, which can reduce memory fragmentation; the setting is only likely to take effect if it is in place before PyTorch makes its first CUDA allocation.
- Trainer Setup: The Trainer class orchestrates the training process, taking the model, data collator for batching, training arguments, metric computation function, datasets, and tokenizer. The train_dataset and eval_dataset are subsets prepared earlier, ready for training and evaluation.
- Training Execution: Finally, trainer.train() initiates the training process, following the specified arguments and using the provided datasets. The trainer utilizes the custom data collator for efficient data loading and batching, and computes metrics to evaluate model performance during training.

This comprehensive setup demonstrates how to fine-tune a sophisticated speech recognition model on a specific dataset, utilizing best practices for memory management, training efficiency, and performance evaluation.

In [21]:
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-960h",
)
model.freeze_feature_extractor()

pytorch_model.bin:   0%|          | 0.00/1.26G [00:00<?, ?B/s]

Some weights of the model checkpoint at facebook/wav2vec2-large-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.masked_spec_embed']
You s

In [22]:
from transformers import TrainingArguments

repo_name = '/content'

training_args = TrainingArguments(
  output_dir=repo_name,
  group_by_length=True,
  per_device_train_batch_size=8,
  gradient_accumulation_steps=2,
  evaluation_strategy="steps",
  max_steps=100,
  fp16=True,
  gradient_checkpointing=True,
  save_steps=20,
  eval_steps=20,
  logging_steps=20,
  learning_rate=1e-6,
  weight_decay=0.005,
  warmup_steps=20,
  save_total_limit=3,
  report_to=["tensorboard"],
)
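
Two quantities implied by these arguments are worth working out: the effective batch size and the learning rate at each step. The sketch below assumes a single GPU and the Trainer's default linear scheduler (the cell does not set lr_scheduler_type); linear_schedule_lr is a hypothetical helper that reproduces linear warmup followed by linear decay, not a function from the library.

```python
def linear_schedule_lr(step, base_lr=1e-6, warmup_steps=20, max_steps=100):
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))

# per_device_train_batch_size * gradient_accumulation_steps, assuming one GPU.
effective_batch = 8 * 2

for step in (0, 10, 20, 60, 100):
    print(step, linear_schedule_lr(step))
print("examples per optimizer step:", effective_batch)
```

With max_steps=100, the run therefore sees at most 1,600 training examples, and the learning rate only reaches its peak of 1e-6 at step 20.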

In [23]:
from transformers import Trainer
import gc
import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'


gc.collect()
torch.cuda.reset_max_memory_allocated()
torch.cuda.reset_accumulated_memory_stats()
torch.cuda.empty_cache()

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=cv_prepared["train"],
    eval_dataset=cv_prepared["test"],
    tokenizer=processor.feature_extractor,
)

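# DataLoaderConfiguration must be imported before use. Note that this object is
# never passed to the Trainer, so it does not change the training run.
from accelerate.utils import DataLoaderConfiguration
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)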


In [24]:
trainer.train()



OutOfMemoryError: CUDA out of memory. Tried to allocate 6.80 GiB. GPU 0 has a total capacity of 39.56 GiB of which 1.19 GiB is free. Process 72285 has 38.37 GiB memory in use. Of the allocated memory 31.00 GiB is allocated by PyTorch, and 6.86 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Create DataFrame from Log History: First, it converts the log_history attribute of the Trainer object into a pandas DataFrame. This log history contains a variety of metrics and statistics collected during training and evaluation, including loss values at different steps.

Extract Training Loss Data: It filters out rows where the "loss" column (representing training loss) is not null, creating a new DataFrame trainloss_df. This DataFrame now specifically holds the history of training loss values along with the steps or epochs they were recorded at.

Extract Evaluation Loss Data: Similarly, it creates evalloss_df by filtering for rows where the "eval_loss" column (representing evaluation loss) is present. This DataFrame focuses on the loss values recorded during the evaluation phases.

By isolating these subsets, I can easily visualize trends in training and evaluation loss, which are key indicators of model learning and generalization performance. Plotting these DataFrames can help identify overfitting, underfitting, or other patterns that might inform adjustments to the training process for improved results.

In [None]:
loss_df = pd.DataFrame(trainer.state.log_history)
trainloss_df = loss_df.dropna(axis=0, subset="loss")
evalloss_df = loss_df.dropna(axis=0, subset="eval_loss")
trainloss_df
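
The log history is a list of dictionaries in which training and evaluation entries carry different keys, which is why filtering on "loss" and "eval_loss" separates them. A minimal sketch in plain Python follows; the entries are hypothetical placeholders shaped like log records, and the numbers are made up rather than taken from a real run.

```python
# Hypothetical entries shaped like trainer.state.log_history; values are invented.
log_history = [
    {"step": 20, "loss": 2.31, "learning_rate": 1e-6},
    {"step": 20, "eval_loss": 2.10, "eval_wer": 0.95},
    {"step": 40, "loss": 1.87, "learning_rate": 7.5e-7},
    {"step": 40, "eval_loss": 1.75, "eval_wer": 0.80},
]

# Training records have a "loss" key; evaluation records have "eval_loss".
train_rows = [entry for entry in log_history if "loss" in entry]
eval_rows = [entry for entry in log_history if "eval_loss" in entry]

print([(row["step"], row["loss"]) for row in train_rows])
print([(row["step"], row["eval_loss"]) for row in eval_rows])
```

In the DataFrame version, a key missing from an entry becomes NaN in that row, so dropna on the corresponding column performs the same split.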

In [None]:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 2, figsize=(10,6))

axs[0].plot(trainloss_df["step"], trainloss_df["loss"])
axs[0].set_title("Training Loss")
axs[0].set_xlabel('Steps')
axs[0].set_ylabel('Loss')

axs[1].plot(evalloss_df["step"], evalloss_df["eval_loss"])
axs[1].set_title("Validation Loss")
axs[1].set_xlabel('Steps')
axs[1].set_ylabel('Loss')

Create a ZIP Archive: Using the pathlib and zipfile modules, it specifies a directory containing a model checkpoint (/content/checkpoint-100). It then creates a new ZIP file named wav2vec2-large-960h-cv.zip and iterates through each file in the checkpoint directory, adding these files to the ZIP archive. This step is useful for compressing the model checkpoint for easy storage or transfer.

Load the Model from the Checkpoint: After the ZIP file is created, the code loads the model from the checkpoint using Wav2Vec2ForCTC.from_pretrained(), specifying the path to the checkpoint directory. This demonstrates how to restore a model from a checkpoint for inference or continued training.

Prepare the Model for Use: Finally, it moves the model to a CUDA-enabled device using .to('cuda:0') for GPU-accelerated operations. This step is essential for leveraging hardware acceleration during inference or further training.

In [None]:
import pathlib
from zipfile import ZipFile

directory = pathlib.Path("/content/checkpoint-100")

with ZipFile('wav2vec2-large-960h-cv.zip', 'w') as myzip:
    for file_path in directory.iterdir():
        myzip.write(file_path, arcname=file_path.name)
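
The archiving pattern can be tried on a throwaway folder before pointing it at a real checkpoint. The sketch below is a standalone variant under stated assumptions: zip_directory is a hypothetical helper, and the temporary folder and file names are stand-ins for a checkpoint directory. It skips subdirectories, since writing a directory path adds only an empty entry to the archive.

```python
import pathlib
import tempfile
from zipfile import ZipFile

def zip_directory(directory, archive_path):
    """Write every regular file directly inside `directory` into a flat ZIP archive."""
    directory = pathlib.Path(directory)
    with ZipFile(archive_path, "w") as archive:
        for file_path in sorted(directory.iterdir()):
            if file_path.is_file():
                archive.write(file_path, arcname=file_path.name)
    # Reopen the archive to report what it contains.
    with ZipFile(archive_path) as archive:
        return archive.namelist()

with tempfile.TemporaryDirectory() as tmp:
    checkpoint = pathlib.Path(tmp) / "checkpoint-demo"  # stand-in for a checkpoint folder
    checkpoint.mkdir()
    (checkpoint / "config.json").write_text("{}")
    (checkpoint / "weights.bin").write_bytes(b"\x00" * 16)
    names = zip_directory(checkpoint, pathlib.Path(tmp) / "demo.zip")

print(names)
```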

In [None]:
model = Wav2Vec2ForCTC.from_pretrained(repo_name+'/checkpoint-100')
model.to('cuda:0')

Dataset Loading: It loads an evaluation dataset stored in a specified directory (/content/cv-valid-dev/cv-valid-dev) using the load_dataset function from the datasets library, specifying the format as "audiofolder". This implies the dataset consists of audio files organized in a folder structure.

Text Preprocessing: A DataFrame is created by reading metadata from a CSV file (/content/cv-valid-dev.csv), where each row corresponds to an audio file. The "text" column, which contains the transcripts associated with each audio file, is converted to uppercase. This step standardizes the case of the transcripts, making them consistent with the model's expected input format.

Dataset Modification: The loaded dataset (eval_ds), initially recognized as a collection of audio files, is then cast to have a specific "audio" column with a sampling rate of 16,000 Hz, which matches the expected input format of the Wav2Vec 2.0 model. Additional columns for text transcripts and filenames are added from the prepared DataFrame, ensuring that each audio sample is associated with its correct transcript and file name.

Data Cleaning: The dataset undergoes cleaning through the remove_special_characters function, which removes specific punctuation marks and other special characters from the text transcripts. This step helps reduce variability in the dataset and aligns the transcripts closer to the model's training data.

Dataset Preparation for Model: Finally, the dataset is processed through the prepare_dataset function, which applies the necessary transformations for the Wav2Vec 2.0 model, including feature extraction from audio and tokenization of text transcripts. This preparation step is performed with parallel processing (num_proc=2) to speed up the execution.

In [None]:
eval_ds = load_dataset("audiofolder", data_dir="/content/cv-valid-dev/cv-valid-dev")
eval_df = pd.read_csv("/content/cv-valid-dev.csv")

eval_df["text"] = eval_df["text"].apply(lambda x: x.upper())
eval_ds = eval_ds["train"]
eval_ds = eval_ds.cast_column("audio", Audio(sampling_rate=16_000))
eval_ds = eval_ds.add_column("text", eval_df["text"])
eval_ds = eval_ds.add_column("file", eval_df["filename"])

eval_ds[1]

In [None]:
eval_ds = eval_ds.map(remove_special_characters)
eval_prepared = eval_ds.map(prepare_dataset, num_proc=2)
print(eval_prepared)

Evaluates a speech recognition model on a dataset, generating predictions for audio inputs, and calculates the Word Error Rate (WER) as a performance metric. It defines a function map_to_result to process each data entry, predicting text from audio with the model and decoding both predicted and actual text. This function is applied to the dataset, updating each entry with its predicted text. Finally, it computes the WER between the model's predictions and the actual text, providing a measure of the model's accuracy.

In [None]:
def map_to_result(batch):
  with torch.no_grad():
    input_values = torch.tensor(batch["input_values"], device="cuda").unsqueeze(0)
    logits = model(input_values).logits

  pred_ids = torch.argmax(logits, dim=-1)
  batch["pred_str"] = processor.batch_decode(pred_ids)[0]
  batch["text"] = processor.decode(batch["labels"], group_tokens=False)
  
  return batch

results = eval_prepared.map(map_to_result)
print(results["pred_str"])
print(results[2]["text"])
print(results[2]["pred_str"])
print("Test WER: {:.3f}".format(wer_metric.compute(predictions=results["pred_str"], references=results["text"])))

In [None]:
eval_df["generated_text"] = results["pred_str"]
eval_df.to_csv("/content/cv-valid-dev_finetuned.csv")

# Approach 3: Pretrained Model

In [1]:
import librosa
import torch
import IPython.display as display
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
import numpy as np

In [2]:
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoi

In [5]:
audio, sampling_rate = librosa.load("test audio.wav",sr=16000)
audio, sampling_rate

(array([4.0927262e-12, 2.2737368e-12, 0.0000000e+00, ..., 2.3358295e-04,
        2.2082066e-04, 2.4543423e-04], dtype=float32),
 16000)

In [4]:
display.Audio("test audio.wav", autoplay=True)

In [6]:
input_values = tokenizer(audio, return_tensors = 'pt').input_values
input_values

tensor([[-8.3914e-05, -8.3914e-05, -8.3914e-05,  ...,  3.9368e-03,
          3.7171e-03,  4.1408e-03]])

In [7]:
logits = model(input_values).logits
logits

tensor([[[ 13.8998, -27.6967, -27.3771,  ...,  -6.3001,  -7.0616,  -8.5118],
         [ 14.0895, -27.9082, -27.5821,  ...,  -6.6897,  -7.3542,  -8.6042],
         [ 13.9620, -27.7169, -27.3943,  ...,  -6.4439,  -7.2200,  -8.5072],
         ...,
         [ 13.5507, -27.2958, -26.9817,  ...,  -6.4635,  -7.9118,  -8.6638],
         [ 13.5540, -27.7158, -27.3963,  ...,  -6.9251,  -8.1390,  -8.8124],
         [ 13.7476, -28.0836, -27.7644,  ...,  -6.9984,  -8.1285,  -8.8816]]],
       grad_fn=<ViewBackward0>)

In [8]:
predicted_ids = torch.argmax(logits, dim =-1)
transcriptions = tokenizer.decode(predicted_ids[0])

In [9]:
transcriptions

'YOU HAVE GOT TO CHANGE THE WAY YOU THINK IT IS THE WHOLE DETERMINING FACTOR OF WHAY YOU GO IN LIFE WE ARE ALL WHERE WE ARE TO DAY BECAUSE WE THOUGHT OURSELF TO THIS POSITION'