## Tuning facebook:wav2vec2-large-960h

Here, we finetune the facebook:wav2vec2-large-960h model from huggingface using the `cv-valid-train` common_voice dataset. This notebook follows the finetuning framework from this [hugginface blog](https://huggingface.co/blog/fine-tune-wav2vec2-english) with minor adaptations. First, we import the required libraries.

In [1]:
# Imports
import os
import random
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union
import gc

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from IPython.display import Audio as PlayAudio

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor
from transformers import TrainingArguments, Trainer
from datasets import load_dataset, Audio, DatasetDict, load_from_disk
import evaluate

import torch
from torch.utils.data import DataLoader
import torchaudio
from transformers import get_linear_schedule_with_warmup
from torch.utils.tensorboard import SummaryWriter
from torch.optim import AdamW
from torch.amp import autocast, GradScaler
from tqdm import tqdm

from pydub import AudioSegment
import soundfile as sf

from jiwer import wer

  from .autonotebook import tqdm as notebook_tqdm


### Pre-processing

We first convert all mp3 files to wav files, which the wav2vec2 model assumes. This may take some time.

In [2]:
# # Function to convert mp3 to wav
# def convert_mp3_to_wav(mp3_file):
#     # Generate the output wav file path
#     wav_file = mp3_file.replace('.mp3', '.wav')
    
#     # Convert mp3 to wav if wav file does not exist
#     if not os.path.exists(wav_file):
#         waveform, sample_rate = torchaudio.load(mp3_file)
#         torchaudio.save(wav_file, waveform, sample_rate)
    
#     return wav_file

# # File locations assumed in parent directory
# transcription_file = os.path.expanduser(
#     '~/asr_project/common_voice/cv-valid-train.csv')              # Transcription file location
# audio_folder = os.path.expanduser(
#     '~/asr_project/common_voice/cv-valid-train')   # Audio files directory
# df = pd.read_csv(transcription_file)[['filename','text']]         # Read transcription file

# # Convert mp3 to wav. Change mp3 file extension in df accordingly
# df['filename'] = df['filename'].apply(
#     lambda filename: convert_mp3_to_wav(
#         os.path.join(audio_folder, filename)))
# df.to_csv('temp.csv',index=False)                                 # Save temp copy of csv

We create a `DatasetDict` for pre-processing, following the approach in the [hugginface blog](https://huggingface.co/blog/fine-tune-wav2vec2-english) mentioned earlier.

In [3]:
# Load csv file with wav filenames as dataset
dataset = load_dataset('csv', data_files='temp.csv', split='train')
dataset = dataset.cast_column("filename",
                              Audio(sampling_rate=16000))         # Cast audio files with 16kHz sampling rate

# train-val 70-30 split
dataset = dataset.train_test_split(test_size=0.3, seed=42)        # Split to train-val

# Final, combined dataset
dataset = DatasetDict({
    'train': dataset['train'],
    'val': dataset['test']})

dataset

DatasetDict({
    train: Dataset({
        features: ['filename', 'text'],
        num_rows: 137043
    })
    val: Dataset({
        features: ['filename', 'text'],
        num_rows: 58733
    })
})

We first create a vocab required for a tokenizer.

In [4]:
# Get unique characters across all splits and provide unique id.
all_text_train = " ".join(dataset["train"]["text"])
all_text_val = " ".join(dataset["val"]["text"])

all_text = " ".join([all_text_train, all_text_val])
vocab = list(set(all_text))

vocab_dict = {v: k for k, v in enumerate(vocab)}
vocab_dict


{'q': 0,
 'n': 1,
 'u': 2,
 'l': 3,
 'i': 4,
 'h': 5,
 's': 6,
 'w': 7,
 'g': 8,
 'r': 9,
 'y': 10,
 'x': 11,
 'e': 12,
 'o': 13,
 'd': 14,
 'a': 15,
 'p': 16,
 'k': 17,
 "'": 18,
 'b': 19,
 'c': 20,
 ' ': 21,
 'v': 22,
 'z': 23,
 't': 24,
 'f': 25,
 'm': 26,
 'j': 27}

In [5]:
# Creating special tokens
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)
print(len(vocab_dict))

30


In [6]:
# Saving in json file.
import json
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

Then, we create the tokenizer using the vocab created earlier.

In [7]:
tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

Next, we create the `feature_extractor`. We follow the instructions to use `return_attention_mask=True`.

In [8]:
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=True)

We wrap everything as a processor.

In [9]:
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

For a quick check, play a random audio file below...

In [10]:
rand_int = random.randint(0, len(dataset["train"]))
print(dataset["train"]["text"][rand_int])

audio_data = dataset["train"][rand_int]["filename"]["array"]
PlayAudio(data=audio_data, rate=16000)

she comes by it naturally


... and check the data formats, e.g. 1-D waveform.

In [11]:
rand_int = random.randint(0, len(dataset["train"]))

print("Target text:", dataset["train"][rand_int]["text"])
print("Input array shape:", np.asarray(dataset["train"][rand_int]["filename"]["array"]).shape)
print("Sampling rate:", dataset["train"][rand_int]["filename"]["sampling_rate"])


Target text: i heard a peculiar humming sound from the pit
Input array shape: (72576,)
Sampling rate: 16000


Finally, we prepare the dataset into the format expected by the model.

In [12]:
# def prepare_dataset(batch):
#     audio = batch["filename"]

#     # batched output is "un-batched" to ensure mapping is correct
#     batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    
#     with processor.as_target_processor():
#         batch["labels"] = processor(batch["text"]).input_ids
#     return batch

# dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names["train"], num_proc=4)


In [13]:
# # Save the dataset to a directory
# dataset.save_to_disk("temp_dataset")

### Training

As elaborated [here](https://huggingface.co/blog/fine-tune-wav2vec2-english), a data collator with dynamic padding is more efficient for ASR applications, considering the lengths of the input sequences.

In [14]:
# Load dataset
dataset = load_from_disk("temp_dataset")

In [15]:
@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for proccessing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence if provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
        max_length (:obj:`int`, `optional`):
            Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
        max_length_labels (:obj:`int`, `optional`):
            Maximum length of the ``labels`` returned list and optionally padding length (see above).
        pad_to_multiple_of (:obj:`int`, `optional`):
            If set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.5 (Volta).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                max_length=self.max_length_labels,
                pad_to_multiple_of=self.pad_to_multiple_of_labels,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

Using the WER metric.

In [16]:
wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

Finally, loading the pre-trained model and setting up trainer so that we can begin finetuning.

In [17]:
!rm -rf ~/asr_project/asr-train/logs/*

model_dir = os.path.expanduser('~/asr_project/asr-train/model_outputs/wav2vec2-finetuned-smol')
tokenizer.save_pretrained(model_dir)

# Load the processor and model
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-960h", 
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id,
)

# Freeze feature extractor layers
model.freeze_feature_encoder()

# # Freeze all layers except the head
# for param in model.parameters():
#     param.requires_grad = False  # Freeze all parameters

# # Assuming the head is the `classifier` in Wav2Vec2ForCTC
# for param in model.lm_head.parameters():  # For the head (classifier) layer
#     param.requires_grad = True  # Unfreeze the head

# Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Define the training arguments
training_args = TrainingArguments(
    output_dir=os.path.expanduser('~/asr_project/asr-train/model_outputs'),
    logging_dir=os.path.expanduser('~/asr_project/asr-train/logs'),
    per_device_train_batch_size=8,              # batch size for training
    per_device_eval_batch_size=8,               # batch size for evaluation
    num_train_epochs=6,                           # total number of training epochs
    logging_steps=50,                            # log every 100 steps
    eval_strategy="steps",                  # evaluate during training
    save_steps=250,                               # save checkpoint every 500 steps
    eval_steps=250,                               # evaluate every 500 steps
    load_best_model_at_end=True,                  # load the best model at the end of training
    fp16=True
)

# # Create the Trainer instance
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=dataset['train'],                  # your training dataset
#     eval_dataset=dataset['val'],                      # your validation dataset
# )


# Define huggingface trainer
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=dataset["train"].select(range(2000)),
    eval_dataset=dataset["val"].select(range(200)),
    processing_class= processor.feature_extractor
)

# Start the training
trainer.train()

# Save the final model
trainer.save_model(model_dir)

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  3%|▎         | 50/1500 [00:14<06:28,  3.74it/s]

{'loss': 7.3456, 'grad_norm': 10.738499641418457, 'learning_rate': 4.8433333333333336e-05, 'epoch': 0.2}


  7%|▋         | 100/1500 [00:28<07:05,  3.29it/s]

{'loss': 6.4887, 'grad_norm': 22.676118850708008, 'learning_rate': 4.676666666666667e-05, 'epoch': 0.4}


 10%|█         | 150/1500 [00:43<06:12,  3.62it/s]

{'loss': 5.4989, 'grad_norm': 7.022335529327393, 'learning_rate': 4.5100000000000005e-05, 'epoch': 0.6}


 13%|█▎        | 200/1500 [00:57<06:36,  3.28it/s]

{'loss': 5.6937, 'grad_norm': 10.28407096862793, 'learning_rate': 4.3433333333333336e-05, 'epoch': 0.8}


 17%|█▋        | 250/1500 [01:12<07:49,  2.66it/s]

{'loss': 5.4204, 'grad_norm': 1.9037978649139404, 'learning_rate': 4.176666666666667e-05, 'epoch': 1.0}


                                                  
 17%|█▋        | 250/1500 [01:15<07:49,  2.66it/s]

{'eval_loss': 5.385029315948486, 'eval_wer': 1.0, 'eval_runtime': 3.6403, 'eval_samples_per_second': 54.941, 'eval_steps_per_second': 6.868, 'epoch': 1.0}


 20%|██        | 300/1500 [01:32<05:51,  3.42it/s]

{'loss': 5.4364, 'grad_norm': 9.899394035339355, 'learning_rate': 4.0100000000000006e-05, 'epoch': 1.2}


 23%|██▎       | 350/1500 [01:47<05:06,  3.75it/s]

{'loss': 5.2742, 'grad_norm': 7.543509483337402, 'learning_rate': 3.843333333333334e-05, 'epoch': 1.4}


 27%|██▋       | 400/1500 [02:01<05:16,  3.48it/s]

{'loss': 5.1096, 'grad_norm': 6.498869895935059, 'learning_rate': 3.676666666666667e-05, 'epoch': 1.6}


 30%|███       | 450/1500 [02:15<05:49,  3.00it/s]

{'loss': 5.2477, 'grad_norm': 1.3148185014724731, 'learning_rate': 3.51e-05, 'epoch': 1.8}


 33%|███▎      | 500/1500 [02:29<04:03,  4.11it/s]

{'loss': 4.8424, 'grad_norm': 11.746952056884766, 'learning_rate': 3.343333333333333e-05, 'epoch': 2.0}


                                                  
 33%|███▎      | 500/1500 [02:32<04:03,  4.11it/s]

{'eval_loss': 5.293658256530762, 'eval_wer': 1.0, 'eval_runtime': 3.6049, 'eval_samples_per_second': 55.48, 'eval_steps_per_second': 6.935, 'epoch': 2.0}


 37%|███▋      | 550/1500 [02:49<04:36,  3.43it/s]

{'loss': 5.0363, 'grad_norm': 16.386117935180664, 'learning_rate': 3.176666666666667e-05, 'epoch': 2.2}


 40%|████      | 600/1500 [03:04<04:06,  3.65it/s]

{'loss': 5.1616, 'grad_norm': 2.575950860977173, 'learning_rate': 3.01e-05, 'epoch': 2.4}


 43%|████▎     | 650/1500 [03:18<03:51,  3.67it/s]

{'loss': 5.3157, 'grad_norm': 5.570316314697266, 'learning_rate': 2.8433333333333334e-05, 'epoch': 2.6}


 47%|████▋     | 700/1500 [03:32<03:54,  3.41it/s]

{'loss': 4.717, 'grad_norm': 2.1007020473480225, 'learning_rate': 2.676666666666667e-05, 'epoch': 2.8}


 50%|█████     | 750/1500 [03:45<03:05,  4.05it/s]

{'loss': 5.1545, 'grad_norm': 5.170818328857422, 'learning_rate': 2.51e-05, 'epoch': 3.0}


                                                  
 50%|█████     | 750/1500 [03:48<03:05,  4.05it/s]

{'eval_loss': 5.313230514526367, 'eval_wer': 1.0, 'eval_runtime': 3.5318, 'eval_samples_per_second': 56.629, 'eval_steps_per_second': 7.079, 'epoch': 3.0}


 53%|█████▎    | 800/1500 [04:06<03:11,  3.65it/s]

{'loss': 5.0283, 'grad_norm': 4.516280651092529, 'learning_rate': 2.3433333333333335e-05, 'epoch': 3.2}


 57%|█████▋    | 850/1500 [04:20<03:10,  3.41it/s]

{'loss': 5.2261, 'grad_norm': 4.357248783111572, 'learning_rate': 2.18e-05, 'epoch': 3.4}


 60%|██████    | 900/1500 [04:34<02:31,  3.96it/s]

{'loss': 5.453, 'grad_norm': 19.7100772857666, 'learning_rate': 2.0133333333333336e-05, 'epoch': 3.6}


 63%|██████▎   | 950/1500 [04:48<02:35,  3.53it/s]

{'loss': 5.3564, 'grad_norm': 9.602667808532715, 'learning_rate': 1.8466666666666667e-05, 'epoch': 3.8}


 67%|██████▋   | 1000/1500 [05:02<02:10,  3.84it/s]

{'loss': 4.9006, 'grad_norm': 8.760149955749512, 'learning_rate': 1.6800000000000002e-05, 'epoch': 4.0}


                                                   
 67%|██████▋   | 1000/1500 [05:05<02:10,  3.84it/s]

{'eval_loss': 5.355025768280029, 'eval_wer': 1.0, 'eval_runtime': 3.5516, 'eval_samples_per_second': 56.313, 'eval_steps_per_second': 7.039, 'epoch': 4.0}


 70%|███████   | 1051/1500 [05:22<01:55,  3.88it/s]

{'loss': 5.1488, 'grad_norm': 0.9166431427001953, 'learning_rate': 1.5133333333333333e-05, 'epoch': 4.2}


 73%|███████▎  | 1100/1500 [05:36<01:51,  3.58it/s]

{'loss': 4.9565, 'grad_norm': 1.7678842544555664, 'learning_rate': 1.3466666666666666e-05, 'epoch': 4.4}


 77%|███████▋  | 1150/1500 [05:50<01:49,  3.19it/s]

{'loss': 5.2498, 'grad_norm': 0.9555894136428833, 'learning_rate': 1.18e-05, 'epoch': 4.6}


 80%|████████  | 1200/1500 [06:05<01:23,  3.60it/s]

{'loss': 4.9125, 'grad_norm': 2.0177369117736816, 'learning_rate': 1.0133333333333333e-05, 'epoch': 4.8}


 83%|████████▎ | 1250/1500 [06:18<01:00,  4.10it/s]

{'loss': 4.952, 'grad_norm': 1.9424035549163818, 'learning_rate': 8.466666666666666e-06, 'epoch': 5.0}


                                                   
 83%|████████▎ | 1250/1500 [06:22<01:00,  4.10it/s]

{'eval_loss': 5.36012077331543, 'eval_wer': 1.0, 'eval_runtime': 3.5295, 'eval_samples_per_second': 56.666, 'eval_steps_per_second': 7.083, 'epoch': 5.0}


 87%|████████▋ | 1301/1500 [06:38<00:51,  3.88it/s]

{'loss': 5.0425, 'grad_norm': 1.4594476222991943, 'learning_rate': 6.800000000000001e-06, 'epoch': 5.2}


 90%|█████████ | 1350/1500 [06:52<00:40,  3.75it/s]

{'loss': 5.2433, 'grad_norm': 1.1296777725219727, 'learning_rate': 5.133333333333334e-06, 'epoch': 5.4}


 93%|█████████▎| 1400/1500 [07:06<00:28,  3.48it/s]

{'loss': 5.0715, 'grad_norm': 1.506329894065857, 'learning_rate': 3.466666666666667e-06, 'epoch': 5.6}


 97%|█████████▋| 1450/1500 [07:21<00:12,  3.89it/s]

{'loss': 5.0711, 'grad_norm': 2.138375997543335, 'learning_rate': 1.8e-06, 'epoch': 5.8}


100%|██████████| 1500/1500 [07:35<00:00,  3.46it/s]

{'loss': 5.0311, 'grad_norm': 0.991845428943634, 'learning_rate': 1.3333333333333334e-07, 'epoch': 6.0}


                                                   
100%|██████████| 1500/1500 [07:38<00:00,  3.46it/s]

{'eval_loss': 5.36480712890625, 'eval_wer': 1.0, 'eval_runtime': 3.5382, 'eval_samples_per_second': 56.525, 'eval_steps_per_second': 7.066, 'epoch': 6.0}


100%|██████████| 1500/1500 [07:41<00:00,  3.25it/s]


{'train_runtime': 461.9248, 'train_samples_per_second': 25.978, 'train_steps_per_second': 3.247, 'train_loss': 5.279536122639974, 'epoch': 6.0}


In [20]:
# Load the processor and model
model_dir = os.path.expanduser('~/asr_project/asr-train/model_outputs/wav2vec2-finetuned-smol')
processor = Wav2Vec2Processor.from_pretrained('facebook/wav2vec2-large-960h')
model = Wav2Vec2ForCTC.from_pretrained('facebook/wav2vec2-large-960h')

# Load audio file
audio_file = "/home/tfc/asr_project/common_voice/cv-valid-train/cv-valid-train/sample-000000.wav"  # Replace with your audio file path
waveform, sample_rate = torchaudio.load(audio_file)


# If the sample rate is not 16kHz, resample it
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Convert to the right format for the model
input_values = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt").input_values

# Get logits from the model
with torch.no_grad():
    logits = model(input_values).logits

# Get predicted ids
predicted_ids = logits.argmax(dim=-1)

# Decode the predicted ids to text
transcription = processor.batch_decode(predicted_ids)

print(transcription)  # Print the transcription result


Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


['LEARNED TO RECOGNIZE OMENS AND FOLLOW THEM THE OLD KING HAD SAID']


In [None]:
# Load the processor and model
model_dir = os.path.expanduser('~/asr_project/asr-train/model_outputs/wav2vec2-finetuned-smol')
processor = Wav2Vec2Processor.from_pretrained(model_dir)
model = Wav2Vec2ForCTC.from_pretrained(model_dir)

# Load audio file
audio_file = "/home/tfc/asr_project/common_voice/cv-valid-train/cv-valid-train/sample-000000.wav"  # Replace with your audio file path
waveform, sample_rate = torchaudio.load(audio_file)


# If the sample rate is not 16kHz, resample it
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Convert to the right format for the model
input_values = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt").input_values

# Get logits from the model
with torch.no_grad():
    logits = model(input_values).logits

# Get predicted ids
predicted_ids = logits.argmax(dim=-1)

# Decode the predicted ids to text
transcription = processor.batch_decode(predicted_ids)

print(transcription)  # Print the transcription result


In [None]:
# File locations assumed in parent directory
transcription_file = os.path.expanduser(
    '~/asr_project/common_voice/cv-valid-train.csv')              # Transcription file location
audio_folder = os.path.expanduser(
    '~/asr_project/common_voice/cv-valid-train')   # Audio files directory
df = pd.read_csv(transcription_file)[['filename','text']]         # Read transcription file

# Convert mp3 to wav. Change mp3 file extension in df accordingly
df['filename'] = df['filename'].apply(
    lambda filename: convert_mp3_to_wav(
        os.path.join(audio_folder, filename)))
df.to_csv('temp.csv',index=False)                                 # Save temp copy of csv

In [None]:
# Accessing log history to find train and val losses
train_losses = []
val_losses = []

for log in trainer.state.log_history:
    if 'loss' in log:
        train_losses.append(log['loss'])  # Training loss
    if 'eval_loss' in log:
        val_losses.append(log['eval_loss'])  # Validation loss

# Get the final train and val losses
final_train_loss = train_losses[-1] if train_losses else None
final_val_loss = val_losses[-1] if val_losses else None

print(f"Final Training Loss: {final_train_loss}")
print(f"Final Validation Loss: {final_val_loss}")


250.0

In [22]:
trainer.state.log_history

[{'loss': 16.7967,
  'grad_norm': 6.2722487449646,
  'learning_rate': 4.8433333333333336e-05,
  'epoch': 0.2,
  'step': 50},
 {'loss': 15.7987,
  'grad_norm': 14.26067066192627,
  'learning_rate': 4.676666666666667e-05,
  'epoch': 0.4,
  'step': 100},
 {'loss': 14.7088,
  'grad_norm': 12.909340858459473,
  'learning_rate': 4.5100000000000005e-05,
  'epoch': 0.6,
  'step': 150},
 {'loss': 13.7017,
  'grad_norm': 13.007761001586914,
  'learning_rate': 4.3433333333333336e-05,
  'epoch': 0.8,
  'step': 200},
 {'loss': 13.6914,
  'grad_norm': 4.1002326011657715,
  'learning_rate': 4.176666666666667e-05,
  'epoch': 1.0,
  'step': 250},
 {'eval_loss': 27.835580825805664,
  'eval_wer': 1.0,
  'eval_runtime': 3.6111,
  'eval_samples_per_second': 55.385,
  'eval_steps_per_second': 6.923,
  'epoch': 1.0,
  'step': 250},
 {'loss': 13.462,
  'grad_norm': 4.646790027618408,
  'learning_rate': 4.013333333333333e-05,
  'epoch': 1.2,
  'step': 300},
 {'loss': 11.7557,
  'grad_norm': 4.622196674346924,


In [21]:
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the processor and model
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base", 
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id,
)

# Access the vocabulary
vocab = processor.tokenizer.get_vocab()

# Print the vocabulary size and a sample of the vocabulary
print(f"Vocabulary size: {len(vocab)}")
print("Sample vocabulary items:")
for token, index in list(vocab.items()):  # Print the first 10 tokens
    print(f"Token: {token}, Index: {index}")


Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['lm_head.bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Vocabulary size: 32
Sample vocabulary items:
Token: <pad>, Index: 0
Token: <s>, Index: 1
Token: </s>, Index: 2
Token: <unk>, Index: 3
Token: |, Index: 4
Token: E, Index: 5
Token: T, Index: 6
Token: A, Index: 7
Token: O, Index: 8
Token: N, Index: 9
Token: I, Index: 10
Token: H, Index: 11
Token: S, Index: 12
Token: R, Index: 13
Token: D, Index: 14
Token: L, Index: 15
Token: U, Index: 16
Token: M, Index: 17
Token: W, Index: 18
Token: C, Index: 19
Token: F, Index: 20
Token: G, Index: 21
Token: Y, Index: 22
Token: P, Index: 23
Token: B, Index: 24
Token: V, Index: 25
Token: K, Index: 26
Token: ', Index: 27
Token: X, Index: 28
Token: J, Index: 29
Token: Q, Index: 30
Token: Z, Index: 31


In [17]:
!rm -rf ~/asr_project/asr-train/logs/*

# Load the processor and model
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-960h", 
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id,
)

# Freeze feature extractor layers
# model.freeze_feature_encoder()

# Freeze all layers except the head
for param in model.parameters():
    param.requires_grad = False  # Freeze all parameters

# Assuming the head is the `classifier` in Wav2Vec2ForCTC
for param in model.lm_head.parameters():  # For the head (classifier) layer
    param.requires_grad = True  # Unfreeze the head

# Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Define the training arguments
training_args = TrainingArguments(
    output_dir=os.path.expanduser('~/asr_project/asr-train/model_outputs'),
    logging_dir=os.path.expanduser('~/asr_project/asr-train/logs'),
    per_device_train_batch_size=8,              # batch size for training
    per_device_eval_batch_size=8,               # batch size for evaluation
    num_train_epochs=3,                           # total number of training epochs
    logging_steps=50,                            # log every 100 steps
    eval_strategy="steps",                  # evaluate during training
    save_steps=500,                               # save checkpoint every 500 steps
    eval_steps=500,                               # evaluate every 500 steps
    load_best_model_at_end=True,                  # load the best model at the end of training
    fp16=True
)

# # Create the Trainer instance
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=dataset['train'],                  # your training dataset
#     eval_dataset=dataset['val'],                      # your validation dataset
# )


# Define huggingface trainer
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=dataset["train"],
    eval_dataset=dataset["val"],
    processing_class= processor.feature_extractor
)

# Start the training
trainer.train()

# Save the final model
trainer.save_model(os.path.expanduser('~/asr_project/asr-train/model_outputs/wav2vec2-finetuned-final'))



Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 50/51393 [00:13<3:40:37,  3.88it/s]

{'loss': 16.6252, 'grad_norm': 6.919604778289795, 'learning_rate': 4.995427392835601e-05, 'epoch': 0.0}


  0%|          | 100/51393 [00:25<3:40:29,  3.88it/s]

{'loss': 15.7953, 'grad_norm': 5.722819805145264, 'learning_rate': 4.990562917128792e-05, 'epoch': 0.01}


  0%|          | 150/51393 [00:38<3:56:36,  3.61it/s]

{'loss': 15.4045, 'grad_norm': 5.995153903961182, 'learning_rate': 4.9856984414219837e-05, 'epoch': 0.01}


  0%|          | 200/51393 [00:51<3:03:57,  4.64it/s]

{'loss': 14.2438, 'grad_norm': 4.1416754722595215, 'learning_rate': 4.980833965715175e-05, 'epoch': 0.01}


  0%|          | 251/51393 [01:03<2:57:46,  4.79it/s]

{'loss': 12.9417, 'grad_norm': 10.482752799987793, 'learning_rate': 4.975969490008367e-05, 'epoch': 0.01}


  1%|          | 300/51393 [01:16<3:45:51,  3.77it/s]

{'loss': 12.661, 'grad_norm': 9.821016311645508, 'learning_rate': 4.9712023038156955e-05, 'epoch': 0.02}


  1%|          | 351/51393 [01:29<3:09:16,  4.49it/s]

{'loss': 12.0799, 'grad_norm': 5.653285980224609, 'learning_rate': 4.966337828108887e-05, 'epoch': 0.02}


  1%|          | 401/51393 [01:42<2:58:17,  4.77it/s]

{'loss': 11.4288, 'grad_norm': 5.822386741638184, 'learning_rate': 4.961473352402078e-05, 'epoch': 0.02}


  1%|          | 451/51393 [01:54<3:54:32,  3.62it/s]

{'loss': 11.6048, 'grad_norm': 8.978076934814453, 'learning_rate': 4.95660887669527e-05, 'epoch': 0.03}


  1%|          | 500/51393 [02:07<4:07:09,  3.43it/s]

{'loss': 11.1998, 'grad_norm': 8.123398780822754, 'learning_rate': 4.951744400988462e-05, 'epoch': 0.03}




OutOfMemoryError: CUDA out of memory. Tried to allocate 2.35 GiB. GPU 0 has a total capacity of 15.69 GiB of which 1.63 GiB is free. Process 1941 has 249.95 MiB memory in use. Including non-PyTorch memory, this process has 13.38 GiB memory in use. Of the allocated memory 10.55 GiB is allocated by PyTorch, and 2.52 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
# Load the processor and model
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-960h", 
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id,
)

# Function to count trainable parameters
def count_trainable_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Get the number of trainable parameters
num_trainable_params = count_trainable_parameters(model)

print(f'Number of trainable parameters: {num_trainable_params}')

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Number of trainable parameters: 315461792


In [20]:
# Load the processor and model
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-960h", 
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id,
)

model.freeze_feature_encoder()

# Function to count trainable parameters
def count_trainable_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Get the number of trainable parameters
num_trainable_params = count_trainable_parameters(model)

print(f'Number of trainable parameters: {num_trainable_params}')

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Number of trainable parameters: 311261344


In [None]:
# Load the processor and model
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-960h", 
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id,
)

# Freeze all layers except the head
for param in model.parameters():
    param.requires_grad = False  # Freeze all parameters

# Assuming the head is the `classifier` in Wav2Vec2ForCTC
for param in model.lm_head.parameters():  # For the head (classifier) layer
    param.requires_grad = True  # Unfreeze the head

# Function to count trainable parameters
def count_trainable_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Get the number of trainable parameters
num_trainable_params = count_trainable_parameters(model)

print(f'Number of trainable parameters: {num_trainable_params}')

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Number of trainable parameters: 32800


In [None]:
# Importing pretrained checkpoint
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-960h", 
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id,
)

# Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

# Unfreeze the parameters in the head
for param in model.lm_head.parameters():
    param.requires_grad = True

# Defining training arguments
training_args = TrainingArguments(
    output_dir=os.path.expanduser('~/asr_project/asr-train/model_outputs'),
    logging_dir=os.path.expanduser('~/asr_project/asr-train/logs'),
    group_by_length=True,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    logging_strategy = "steps",
    eval_strategy="steps",
    num_train_epochs=30,
    fp16=True,
    gradient_checkpointing=True,
    save_steps=500,
    eval_steps=500,
    logging_steps=1,
    learning_rate=1e-4,
    weight_decay=0.005,
    warmup_steps=1000,
    save_total_limit=2,
    # report_to="tensorboard"
)

# Define huggingface trainer
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=dataset["train"],
    eval_dataset=dataset["val"],
    processing_class= processor.feature_extractor
)

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
writer = SummaryWriter(log_dir=os.path.expanduser('~/asr_project/asr-train/logs'))

num_epochs = 30
num_warmup_steps = 1000
num_log_steps = 100

# Importing pretrained checkpoint
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-960h", 
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id,
)

# Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

# Unfreeze the parameters in the head
for param in model.lm_head.parameters():
    param.requires_grad = True

# Set up DataLoader for training and validation sets
train_dataloader = DataLoader(dataset["train"], batch_size=8, shuffle=True, collate_fn=data_collator)
val_dataloader = DataLoader(dataset["val"], batch_size=8, shuffle=False, collate_fn=data_collator)

# Initialize optimizer and gradient scaler for mixed precision
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.005)
scaler = torch.amp.GradScaler()  # For mixed precision training

# Scheduler
num_training_steps = len(train_dataloader) * num_epochs  # epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)

# Training and evaluation loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
global_step = 0

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    with tqdm(train_dataloader, desc=f"Epoch {epoch+1}/{num_epochs}") as progress_bar:
        for batch in progress_bar:
            # Move data to the device
            input_values = batch["input_values"].to(device)
            labels = batch["labels"].to(device)
            
            # Forward pass with mixed precision
            with autocast(device_type="cuda"):
                outputs = model(input_values, labels=labels)
                loss = outputs.loss

            # Backward pass
            optimizer.zero_grad()
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

            # Logging
            total_loss += loss.item()
            progress_bar.set_postfix({"train_loss": total_loss / (progress_bar.n + 1)})
            global_step += 1

            if global_step % 20 == 0:
                writer.add_scalar("train/loss", total_loss/ (progress_bar.n + 1), global_step)

    writer.add_scalar("train/loss", total_loss.item()/ len(train_dataloader), global_step)

    # Evaluation loop
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for batch in val_dataloader:
            input_values = batch["input_values"].to(device)
            labels = batch["labels"].to(device)

            with autocast(device_type="cuda"):
                outputs = model(input_values, labels=labels)
                val_loss += outputs.loss.item()

    writer.add_scalar("train/loss", val_loss/ len(val_dataloader), global_step)

    print(f"Epoch {epoch+1} - Training Loss: {total_loss / len(train_dataloader):.4f}, Validation Loss: {val_loss / len(val_dataloader):.4f}")

    # Save checkpoint
    model_dir = os.path.expanduser("~/asr_project/asr-train/model_outputs")
    os.makedirs(model_dir, exist_ok=True)
    model.save_pretrained(model_dir)


Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch 1/30:   6%|▌         | 999/17131 [03:16<52:51,  5.09it/s, train_loss=26.8]  


OutOfMemoryError: CUDA out of memory. Tried to allocate 6.60 GiB. GPU 0 has a total capacity of 15.69 GiB of which 3.30 GiB is free. Process 1941 has 249.95 MiB memory in use. Including non-PyTorch memory, this process has 11.79 GiB memory in use. Of the allocated memory 11.19 GiB is allocated by PyTorch, and 293.71 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
trainer.train()

KeyboardInterrupt: 

In [None]:
torch.cuda.empty_cache()
gc.collect()

0

In [None]:
# Extract losses and epochs
epochs = []
train_losses = []
eval_losses = []

training_logs = trainer.state.log_history

for log in training_logs:
    if 'loss' in log:  # Training loss
        train_losses.append(log['loss'])
        epochs.append(log['epoch'])
    if 'eval_loss' in log:  # Validation loss
        eval_losses.append(log['eval_loss'])

# Plot training and validation loss
plt.plot(epochs[:len(train_losses)], train_losses, label="Training Loss")
plt.plot(epochs[:len(eval_losses)], eval_losses, label="Validation Loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.title("Training and Validation Loss over Epochs")
plt.show()


In [None]:
trainer.train()