### Getting Started

This notebook was originally written and executed in Google Colab to leverage GPU resources for training. To use this script, follow the instructions below:

1. Copy the necessary contents in the `asr-train/` folder to your Google Drive.
2. Open Google Colab and mount your Google Drive.
3. Ensure that your directory structure resembles the following:

```bash
My Drive/
├── Colab Notebooks/
│   ├── htx-tha/                      # Directory for fine-tuning the ASR model
│   │   ├── cv-train-2a.ipynb         # Jupyter notebook for fine-tuning the ASR model
│   │   ├── cv-valid-test-result.csv  # Generated from predicting on cv-valid-test.csv
│   │   └── wav2vec2-large-960h-cv/   # Model checkpoint directory (generated during training)    
```

### Overview

This notebook guides you through the process of fine-tuning a pre-trained `Wav2Vec2-large-960h` model on the Common Voice dataset. The following steps outline the workflow:

1. **Dataset Loading and Splitting**:
   - The Common Voice dataset is loaded from Kaggle Hub, and the `cv-valid-train.csv` file is read.
   - The data is split into a training set (70%) and a validation set (30%).

2. **Data Loading and Transformation**:
   - Using the `datasets` library, the audio and corresponding text data are efficiently loaded.
   - The data is structured into a dictionary format, where each entry corresponds to an audio sample and its associated transcription (ground-truth), ready for model training.

3. **Data Preprocessing**:
   - The audio and text data are preprocessed to ensure compatibility with the ASR model:
     - **Audio Preprocessing**: The audio is converted into the appropriate format (mono-channel, 16kHz sample rate) and tokenized for model input.
     - **Text Preprocessing**: The text is tokenized and processed to match the model's input format.

4. **Subset Selection**:
   - Due to constraints such as limited training time, memory, and computational resources, a subset of the training and validation sets is used (10% and 5% respectively).

5. **Model Training**:
   - The preprocessed data is fed into the model for training, and the model is fine-tuned based on the selected subset.

6. **Model Testing**:
   - The fine-tuned model is evaluated on the test set (`cv-valid-test.csv`) to assess its performance on unseen data.

#### Mount Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#### Import Common Voice Dataset form Kaggle 

**IMPORTANT:** When you import the Common Voice dataset from Kaggle into Google Colab, it is stored in the Colab disk (temporary storage). This storage only stay while your Colab runtime is active. If your runtime disconnects or resets, the dataset in the cache folder will be lost, and you will need to re-download it.

In [23]:
# Import dataset from kaggle
import kagglehub

path = kagglehub.dataset_download("mozillaorg/common-voice")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/mozillaorg/common-voice?dataset_version_number=2...


100%|██████████| 12.0G/12.0G [00:57<00:00, 223MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/mozillaorg/common-voice/versions/2


### Import Libraries

In [2]:
!pip install pydub evaluate jiwer

Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting jiwer
  Downloading jiwer-3.0.5-py3-none-any.whl.metadata (2.7 kB)
Collecting datasets>=2.0.0 (from evaluate)
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.9-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from evaluate)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.17-py311-none-any.whl.metadata (7.2 kB)
Collecting rapidfuzz<4,>=3 (from jiwer)
  Downloading rapidfuzz-3.11.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70

In [1]:
import os
import evaluate
import torch
import torchaudio
import pandas as pd

from datasets import Audio, Dataset
from sklearn.model_selection import train_test_split
from transformers import (
    Wav2Vec2Processor,
    Wav2Vec2ForCTC,
    TrainingArguments,
    Trainer
)
from tqdm import tqdm
from jiwer import wer

In [7]:
# Check if our dataset has been loaded
os.listdir('/root/.cache/kagglehub/datasets/mozillaorg/common-voice/versions/2')

['cv-other-test.csv',
 'cv-valid-test.csv',
 'cv-valid-train',
 'cv-valid-train.csv',
 'cv-other-test',
 'cv-other-train',
 'cv-valid-dev',
 'cv-valid-dev.csv',
 'cv-other-dev',
 'cv-other-dev.csv',
 'README.txt',
 'LICENSE.txt',
 'cv-valid-test',
 'cv-invalid',
 'cv-invalid.csv',
 'cv-other-train.csv']

### Load Dataset

1. Convert the text in the dataset to uppercase for standardisation since the model predicts texts in uppercase.

2. Split the dataset into a 70-30 ratio for training and validation respectively.

In [4]:
# Load train csv file
cv_valid_train = pd.read_csv('/root/.cache/kagglehub/datasets/mozillaorg/common-voice/versions/2/cv-valid-train.csv')

# Convert text to uppercase
cv_valid_train["text"] = cv_valid_train["text"].str.upper()

cv_valid_train.head()

In [6]:
# Split train into 70-30 ratio
train_df, val_df = train_test_split(cv_valid_train, test_size=0.3, random_state=42)

# Remove unecessary columns
train_df = train_df.copy().drop(columns=["up_votes", "down_votes", "age", "gender", "accent", "duration"])
val_df = val_df.copy().drop(columns=["up_votes", "down_votes", "age", "gender", "accent", "duration"])

print(f"cv-valid-train shape: {cv_valid_train.shape}")
print(f"train (70%) shape: {train_df.shape}")
print(f"val (30%) shape: {val_df.shape}")

cv-valid-train shape: (195776, 8)
train (70%) shape: (137043, 2)
val (30%) shape: (58733, 2)


### Pre-processing audio and text

In the clean dataset, only two columns are included: `filename` and `text`. The `filename` states the path of the audio file from the `cv-valid-train` folder while `text` is the label (or ground-truth) of the audio.

In [6]:
# Load pre-trained model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("facebook/wav2vec2-large-960h")

# feature_extractor = Wav2Vec2FeatureExtractor(
#     feature_size=1,
#     sampling_rate=16000,
#     padding_value=0.0,
#     do_normalize=True,
#     return_attention_mask=False
# )

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")
model.freeze_feature_extractor()

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Audio Preprocessing

Since the audio files are in `.mp3` format, I used pydub's `AutoSegment` library to read the audio file. `Wav2Vec2Processor` with a `sampling_rate` of 16000 is used to resample the audio to 16kHz since that format is required to run `Wav2Vec2` ASR model. The `Dataset` library is used to convert the audio files into arrays.

In [8]:
train_audio_dir = [os.path.join('/root/.cache/kagglehub/datasets/mozillaorg/common-voice/versions/2/cv-valid-train', filename) for filename in train_df['filename'].tolist()]
train_ds = Dataset.from_dict({
    "audio": train_audio_dir,
    "text": train_df['text'].tolist()
}).cast_column("audio", Audio(sampling_rate=16000))

val_audio_dir = [os.path.join('/root/.cache/kagglehub/datasets/mozillaorg/common-voice/versions/2/cv-valid-train', filename) for filename in val_df['filename'].tolist()]
val_ds = Dataset.from_dict({
    "audio": val_audio_dir,
    "text": val_df['text'].tolist()
}).cast_column("audio", Audio(sampling_rate=16000))

In [9]:
print("Example of the dataset:")

train_ds[1], val_ds[1]

Example of the dataset:


({'audio': {'path': '/root/.cache/kagglehub/datasets/mozillaorg/common-voice/versions/2/cv-valid-train/cv-valid-train/sample-168647.mp3',
   'array': array([-1.70530257e-12,  2.27373675e-13,  4.54747351e-13, ...,
          -6.72709632e-07,  1.20441780e-06,  9.20904654e-07]),
   'sampling_rate': 16000},
  'text': 'DENSE CLOUDS OF SMOKE OR DUST CAN BE SEEN THROUGH A POWERFUL TELESCOPE'},
 {'audio': {'path': '/root/.cache/kagglehub/datasets/mozillaorg/common-voice/versions/2/cv-valid-train/cv-valid-train/sample-114956.mp3',
   'array': array([-2.03726813e-10,  1.81898940e-10, -4.00177669e-11, ...,
           6.15763362e-04, -1.39448233e-03,  1.47150655e-03]),
   'sampling_rate': 16000},
  'text': 'THE MIXTURE TOOK ON A REDDISH COLOR ALMOST THE COLOR OF BLOOD'})

Listen to an example of the audio.

In [10]:
import IPython.display as ipd
import numpy as np
import random

rand_int = random.randint(0, len(train_ds))

print(train_ds[rand_int]["text"])
ipd.Audio(data=np.asarray(train_ds[rand_int]["audio"]["array"]), autoplay=True, rate=16000)

WE HAD TO RUSH DOWN THE MINUTE IT HAPPENED


In [11]:
rand_int = random.randint(0, len(train_ds))

print("Target text:", train_ds[rand_int]["text"])
print("Input array shape:", np.asarray(train_ds[rand_int]["audio"]["array"]).shape)
print("Sampling rate:", train_ds[rand_int]["audio"]["sampling_rate"])

Target text: THE BOY SAW A MAN APPEAR BEHIND THE COUNTER
Input array shape: (83328,)
Sampling rate: 16000


### Preparing the dataset

The function does the following:

1. Trims the audio arrays to remove leading and trailing zeros.
2. Extract the audio features using `processor.feature_extractor`.
3. Tokenizes the text labels using `processor.tokenizer`. 
4. Prepares additional information such as input length for model input.

In [12]:
def prepare_dataset(batch):
    """
    Process each batch of data in the dataset to make it compatible with the model.
    """
    # Trim leading and trailing zeros from the audio array
    batch["audio"]["array"] = np.trim_zeros(batch["audio"]["array"], "fb")
    audio = batch["audio"]

    # Extract audio features
    audio_features = processor.feature_extractor(
        audio["array"],
        sampling_rate=audio["sampling_rate"],
        return_tensors="pt"
    )
    batch["input_values"] = audio_features.input_values[0]
    batch["input_length"] = len(batch["input_values"])

    # Tokenize text labels
    with processor.as_target_processor():
        labels = processor.tokenizer(
            batch["text"],
            padding=True,
            return_tensors="pt"
        )
    batch["labels"] = labels.input_ids[0]

    return batch


train_ds = train_ds.map(prepare_dataset, remove_columns=train_ds.column_names)
val_ds = val_ds.map(prepare_dataset, remove_columns=val_ds.column_names)

Map:   0%|          | 0/137043 [00:00<?, ? examples/s]

  normed_input_values = [(x - x.mean()) / np.sqrt(x.var() + 1e-7) for x in input_values]
  ret = ret.dtype.type(ret / rcount)
  normed_input_values = [(x - x.mean()) / np.sqrt(x.var() + 1e-7) for x in input_values]
  arrmean = um.true_divide(arrmean, div, out=arrmean,
  ret = ret.dtype.type(ret / rcount)


Map:   0%|          | 0/58733 [00:00<?, ? examples/s]

In [13]:
# Save dataset to disk
train_ds.save_to_disk('/content/drive/MyDrive/Colab Notebooks/htx-tha/train')
val_ds.save_to_disk('/content/drive/MyDrive/Colab Notebooks/htx-tha/val')

Saving the dataset (0/79 shards):   0%|          | 0/137043 [00:00<?, ? examples/s]

Saving the dataset (0/34 shards):   0%|          | 0/58733 [00:00<?, ? examples/s]

In [4]:
# Load saved dataset from disk
from datasets import load_from_disk

  # Load the datasets
train_ds = load_from_disk('/content/drive/MyDrive/Colab Notebooks/htx-tha/train')
val_ds = load_from_disk('/content/drive/MyDrive/Colab Notebooks/htx-tha/val')

print(train_ds)
print(val_ds)

Loading dataset from disk:   0%|          | 0/79 [00:00<?, ?it/s]

Loading dataset from disk:   0%|          | 0/34 [00:00<?, ?it/s]

Dataset({
    features: ['input_values', 'input_length', 'labels'],
    num_rows: 137043
})
Dataset({
    features: ['input_values', 'input_length', 'labels'],
    num_rows: 58733
})


### Fine-tuning model, Training and Evaluation

The Data Collator (adapted from this [example](https://github.com/huggingface/transformers/blob/9a06b6b11bdfc42eea08fa91d0c737d1863c99e3/examples/research_projects/wav2vec2/run_asr.py#L81)) is used to process and prepare batches of data for input into the model during training. 

In [8]:
from dataclasses import dataclass
from typing import Dict, List, Union

# Data Collator
@dataclass
class DataCollator:
    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        filtered_features = []

        for feature in features:
            try:
                # Convert input_values and labels to tensors
                if not isinstance(feature["input_values"], torch.Tensor):
                    feature["input_values"] = torch.tensor(feature["input_values"])
                if "labels" in feature and not isinstance(feature["labels"], torch.Tensor):
                    feature["labels"] = torch.tensor(feature["labels"])

                # Only include valid features
                if feature["input_values"].nelement() > 0 and feature["labels"].nelement() > 0:
                    filtered_features.append(feature)
            except Exception as e:
                print(f"Error processing feature: {feature}. Skipping. Error: {e}")

        if not filtered_features:
            raise "No values in this batch"  # For debugging

        input_features = [{"input_values": feature["input_values"]} for feature in filtered_features]
        label_features = [{"input_ids": feature["labels"]} for feature in filtered_features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                return_tensors="pt",
            )

        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

data_collator = DataCollator(processor=processor, padding=True)

* A `batch_size=1` is used to limit memory usage for each forward or backward pass.

* `gradient_accumulation=32` is used to simulate an overall batch size of 32, which ensures training is stabalised without exceeding memory limits.

* `learning_rate=3e-4` is chosen as it is high enough to enable efficient training but not too high.

* Due to time constraint and given the substantial size of the dataset (~200k samples), training for `epoch=1` is considered sufficient to obtain meaningful results while keeping the training duration manageable.

* `weight_decay=0.005` is used to provide regularisation by penalising large weights to prevent overfitting and promotes bettwe generalisation on such a large dataset.

* Since our dataset is handling text, Word Rate Error (WER) is used as a metric to quantify how well our model is performing. It calculates the proportion of errors relative to the total number of words in the reference text.

In [7]:
wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    """
    Calculates the word error rate (WER).
    """
    pred_logits = pred.predictions
    pred_id = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.batch_decode(pred_id)
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer_score = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer_score}

In [12]:
# Getting a subset of the training and validation data
train_subset = train_ds.select(range(13700))
val_subset = val_ds.select(range(2950))

# Filter out samples that have empty values
train_subset = train_subset.filter(lambda x: len(x["input_values"]) > 0 and len(x["labels"]) > 0)
val_subset = val_subset.filter(lambda x: len(x["input_values"]) > 0 and len(x["labels"]) > 0)

print(train_subset)
print(val_subset)

Filter:   0%|          | 0/13700 [00:00<?, ? examples/s]

Filter:   0%|          | 0/2950 [00:00<?, ? examples/s]

Dataset({
    features: ['input_values', 'input_length', 'labels'],
    num_rows: 13699
})
Dataset({
    features: ['input_values', 'input_length', 'labels'],
    num_rows: 2950
})


In [13]:
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/Colab Notebooks/htx-tha/wav2vec2-large-960h-cv",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=16,
    eval_strategy="steps",
    logging_strategy="steps",
    logging_steps=100,
    eval_steps=100,
    save_steps=100,
    num_train_epochs=1,
    fp16=True,
    gradient_checkpointing=True,
    learning_rate=3e-4,
    weight_decay=0.005,
    warmup_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    report_to=["tensorboard"],
    logging_dir="/content/drive/MyDrive/Colab Notebooks/htx-tha/logs"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_subset,
    eval_dataset=val_subset,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [14]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

trainer.train()



Step,Training Loss,Validation Loss,Wer
100,539.0159,26.496565,0.193386
200,713.9859,26.759058,0.317948
300,704.9935,33.501598,0.340508
400,659.1463,24.218805,0.268931
500,602.5616,20.556421,0.250912
600,505.0108,18.42569,0.206686
700,463.9501,16.767374,0.186307
800,382.1096,13.907246,0.149482




TrainOutput(global_step=856, training_loss=557.3135769567757, metrics={'train_runtime': 3957.2054, 'train_samples_per_second': 3.462, 'train_steps_per_second': 0.216, 'total_flos': 1.8294678026642857e+18, 'train_loss': 557.3135769567757, 'epoch': 0.9997810059128404})

> _**Note:** For this assessment, given the constraints on training time, GPU compute resources, and CUDA memory, I decided to train only on a representative subset instead of the full dataset. The complete training dataset contains 137,043 samples, and training on the entire set would have significantly extended the time required and increased the risk of computational inefficiencies or memory issues. As such, I selected a 10% subset of the training data and 5% of the validation data. This approach provided a balanced solution, enabling meaningful training and fine-tuning while respecting resource limitations. By carefully ensuring the subset retained the diversity of the original dataset, I was able to perform a reliable evaluation of the model’s performance within the available compute budget._

In [21]:
# Save the fine-tuned model
model.save_pretrained("/content/drive/MyDrive/Colab Notebooks/htx-tha/wav2vec2-large-960h-cv")
processor.save_pretrained("/content/drive/MyDrive/Colab Notebooks/htx-tha/wav2vec2-large-960h-cv")

[]

#### Transcribe `cv-valid-test` using fine-tuned model

In [35]:
# Load fine-tuned model
model = Wav2Vec2ForCTC.from_pretrained("/content/drive/MyDrive/Colab Notebooks/htx-tha/wav2vec2-large-960h-cv")
processor = Wav2Vec2Processor.from_pretrained("/content/drive/MyDrive/Colab Notebooks/htx-tha/wav2vec2-large-960h-cv")

# Load test set
cv_valid_test = pd.read_csv('/root/.cache/kagglehub/datasets/mozillaorg/common-voice/versions/2/cv-valid-test.csv')
cv_valid_test.shape

(3995, 8)

In [36]:
# Transcribe
def transcribe_audio(file_path):
    """
    Transcribes an audio file using a pre-trained model.

    Args:
      file_path (str): Path to audio file.
    """
    waveform, sample_rate = torchaudio.load(file_path)

    if sample_rate != 16000:
        waveform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(waveform)

    input_values = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt").input_values

    # Move input values and model to the same device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    input_values = input_values.to(device)
    model.to(device)

    # Generate predictions
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits
    predicted_ids = torch.argmax(logits, dim=-1)

    transcription = processor.batch_decode(predicted_ids)[0]
    return transcription


transcriptions = []
texts = []
for idx, row in tqdm(cv_valid_test.iterrows(), total=cv_valid_test.shape[0], desc="Processing audio..."):
    audio_path = os.path.join('/root/.cache/kagglehub/datasets/mozillaorg/common-voice/versions/2/cv-valid-test', row['filename'])
    text = row["text"]

    try:
        transcription = transcribe_audio(audio_path)  # Get transcription
        transcriptions.append(transcription)
        texts.append(text)
    except Exception as e:
        print(f"Error transcribing: {e}")

Processing audio...: 100%|██████████| 3995/3995 [02:10<00:00, 30.61it/s]


In [37]:
# Get overall performance
wer_score = wer(texts, transcriptions)
print(f"WER Score: {wer_score}")

cv_valid_test["generated_text"] = transcriptions
cv_valid_test.to_csv("/content/drive/MyDrive/Colab Notebooks/htx-tha/cv-valid-test-result.csv", index=False)

WER Score: 1.023469091101303


In [38]:
cv_valid_test.head()

Unnamed: 0,filename,text,up_votes,down_votes,age,gender,accent,duration,generated_text
0,cv-valid-test/sample-000000.mp3,without the dataset the article is useless,1,0,,,,,WE FOUND THE THATTE SAT THE ARTICLE ISUSEDLE
1,cv-valid-test/sample-000001.mp3,i've got to go to him,1,0,twenties,male,,,I'VE GOT GIRL TO HIM
2,cv-valid-test/sample-000002.mp3,and you know it,1,0,,,,,HOW DO YOU KNOW IT
3,cv-valid-test/sample-000003.mp3,down below in the darkness were hundreds of pe...,4,0,twenties,male,us,,DWN BELOW IN THE DARKNESS WERE HUNDREDS OF PEO...
4,cv-valid-test/sample-000004.mp3,hold your nose to keep the smell from disablin...,2,0,,,,,HOLD YOUR KNOSE TO KEEP THIS SMELL FROM DISCOP...
