# Fine Tuning Pre-trained WAV2VEC Using DSing Train

## Summary

1. Loading WAV2VEC model
2. Creating a Huggingface dataset for batch processing by Cropping the DSing Dataset described below (8 songs for each split, 1.5seconds max)

The dataset is cropped to allow for processing examples for this notebook.  in the actual processing, it will look different.

## DSing Dataset

Reading data from DSing Dataset.  Filesystem formatted this way to convert easily to huggingface dataset.

where: <br>
dev/test/trainX are datasets split.<br>
\[split\]_text contains transcript for the snippet.<br>
\[split\]_spk2gender contains information about gender for snippet.<br>

Tests Split: 480 Utterances, 48 minutes<br>
Dev Split: 482 Utterances, 41 minutes<br>
Train1 Split: 8794 Utterances, 15.1 hours<br>
Train3 Split: 25526 Utterances, 44.7 hours<br>
Train30 Split: 268,392 Utterances, 149.1 hours<br>

sing_300x30x2/dataset/<br>
├── dev/<br>
├───| metadata.csv<br>
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&nbsp;| \<audio files\>.wav<br>
├── dev_spk2gender<br>
├── dev_text<br>
├── test/<br>
├───| metadata.csv<br>
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&nbsp;| \<audio files\>.wav<br>
├── test_spk2gender<br>
├── test_text<br>
├── train1/<br>
├───| metadata.csv<br>
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&nbsp;| \<audio files\>.wav<br>
├── train1_spk2gender<br>
├── train1_text<br>
├── train3/<br>
├───| metadata.csv<br>
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&nbsp;| \<audio files\>.wav<br>
├── train3_spk2gender<br>
└── train3_text<br>
├── train30/<br>
├───| metadata.csv<br>
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&nbsp;| \<audio files\>.wav<br>
├── train30_spk2gender<br>
├── train30_text<br>


# For Colab Training

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [13]:
#
# Curated DAMP 300x30x2 dataset (DSing) unpacked and preprocessed into
# child folder dataset/. DSing is around 1.9GB after being preprocessed.
#

!unzip /content/drive/MyDrive/damp_dataset.zip -d /
#!unzip /content/drive/MyDrive/dali_test_utt_dataset.zip -d /

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  inflating: /dataset/train1/F1055725605-3769810_3769810-432296257_1544391567-GB-F-012.wav  
  inflating: /dataset/train1/F442868307-33905583_146537-33464602_1633090021-GB-F-049.wav  
  inflating: /dataset/train1/F361482855-370582646_496045-418380481_1685916908-GB-F-060.wav  
  inflating: /dataset/train1/M555404170-4212785_4212785-1083723563_1664781629-GB-M-055.wav  
  inflating: /dataset/train1/F566280723-36822194_1763021-971347342_1481501273-GB-F-027.wav  
  inflating: /dataset/train1/M121639960-179962242_116514-121642433_1556101705-GB-M-071.wav  
  inflating: /dataset/train1/F647568384-413929848_343398-647571369_1640878322-GB-F-032.wav  
  inflating: /dataset/train1/F1299168138-393858919_176664-901508571_1588438614-GB-F-017.wav  
  inflating: /dataset/train1/M496682159-413929848_343398-496674697_1567266163-GB-M-026.wav  
  inflating: /dataset/train1/F1261578186-3771456_3771456-1221818620_1511368441-GB-F-033.wav  
  inf

In [3]:
!pip install transformers==4.39.1
!pip install accelerate==0.28.0
!pip install evaluate
!pip install torchaudio
!pip install flashlight-text
!pip install jiwer

Collecting transformers==4.39.1
  Downloading transformers-4.39.1-py3-none-any.whl (8.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.8/8.8 MB[0m [31m55.3 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.38.2
    Uninstalling transformers-4.38.2:
      Successfully uninstalled transformers-4.38.2
Successfully installed transformers-4.39.1
Collecting accelerate==0.28.0
  Downloading accelerate-0.28.0-py3-none-any.whl (290 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m290.1/290.1 kB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0m
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.10.0->accelerate==0.28.0)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.10.0->accelerate==0.28.0)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-

# Useful Memory Clearing Commands

In [None]:
#del model
#del dsing_train
#del dsing_val
#del data_collator
#del asr_pipeline
#del training_args
#del trainer
#del wer_metric

In [None]:
import torch, gc
gc.collect()
torch.cuda.empty_cache()

# Basic Setup

In [4]:
import os
import sys
import time
import glob
import re
import json
import random
import tqdm

import torch
import torchaudio
import evaluate

import pandas as pd
import numpy as np
import IPython
from IPython.display import display, HTML, Audio

from datasets import load_dataset, load_metric, ClassLabel, DatasetDict

from torch.utils.data import DataLoader

from torchaudio.models.decoder import ctc_decoder
from torchaudio.models.decoder import download_pretrained_files

from transformers import Wav2Vec2CTCTokenizer
from transformers import Wav2Vec2FeatureExtractor
from transformers import pipeline

from transformers import TrainingArguments, Trainer
from transformers.utils import logging
from transformers import AutoModel




In [14]:
#############################################
# Special Variables...
#############################################
root_folder = "/content/drive/MyDrive"

# Google Colab
#
tokens_file=f"{root_folder}/tokens.txt"
#dataset_folder = "/dali_datasets"
dataset_folder = "/dataset"

# Local
#
# tokens_file="./tokens.txt"
# dataset_folder = "../sing_300x30x2/damp_dataset"
# root_folder = "./"


# read labels for all song utterances for DSING splits (test, dev (aka validation), train1 (aka train)
BATCH_SIZE=1
CALC_HOURS=True

In [6]:

#
# Create Pipeline
#

# Simpler, smaller model.  Ou paper, the loss values fine tuned on DSing
# were comparable to the loss values with lv60-self.
#model_checkpoint="facebook/wav2vec2-base"
model_checkpoint="facebook/wav2vec2-large-960h-lv60-self"

asr_pipeline = pipeline("automatic-speech-recognition", model=model_checkpoint, model_kwargs={"ctc_loss_reduction": 'mean'})
model = asr_pipeline.model
model.freeze_feature_encoder()

#
# Gradient Accumulation
# Used in example here: https://huggingface.co/blog/fine-tune-wav2vec2-english
# Per the code: By default, non-reentrant checkpoint stops recomputation as soon as it
#               has computed all needed Tensors.
#
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})

# Note: I want to use a beam search library that has been pre-trained on librispeech.  the pretrained beam search
#       has been created with all lowercase characters (i.e. phoneme settings, etc.).  I am changing the default
#       model tokenizer vocab settings to accept lower case inputs so i can use the decoder.  However, it will also
#       accept upper case, so this may also impact performance but only for inference.  (i.e. not sure where this
#       is normalized to .
asr_pipeline.tokenizer.do_lower_case = True

target_sampling_rate = asr_pipeline.feature_extractor.sampling_rate

# Count the number of trainable parameters in the model
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Trainable parameters:", trainable_params)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.61k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.26G [00:00<?, ?B/s]

Some weights of the model checkpoint at facebook/wav2vec2-large-960h-lv60-self were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h-lv60-self and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.maske

tokenizer_config.json:   0%|          | 0.00/162 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/291 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/158 [00:00<?, ?B/s]

Trainable parameters: 311261344


In [None]:
asr_pipeline.feature_extractor

Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
  "sampling_rate": 16000
}

In [None]:
asr_pipeline.tokenizer

Wav2Vec2CTCTokenizer(name_or_path='facebook/wav2vec2-large-960h-lv60-self', vocab_size=32, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<pad>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=False),
	1: AddedToken("<s>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=False),
	2: AddedToken("</s>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=False),
	3: AddedToken("<unk>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=False),
}

## Decoders

In [7]:
files = download_pretrained_files("librispeech-4-gram")
# Found from Fairseq for Wav2Vec2 -
# https://github.com/facebookresearch/fairseq/blob/main/examples/wav2vec/config/finetuning/vox_960h_2_aws.yaml
# Note: I am not using the same lexicon file though...
# LM_WEIGHT = 2.0
# WORD_SCORE = 0
# SIL_SCORE = -1

LM_WEIGHT = 3.23
WORD_SCORE = -0.26
SIL_SCORE = 0

beam_search_decoder = ctc_decoder(
    lexicon=files.lexicon,
    tokens=tokens_file,
    lm=files.lm,
    nbest=1,
    beam_size=512, #per the paper by Ou
    lm_weight=LM_WEIGHT,
    word_score=WORD_SCORE,
    sil_score=SIL_SCORE,
    blank_token='<pad>',
    unk_word='<unk>'
)

greedy_decoder = ctc_decoder(
    lexicon=files.lexicon,
    tokens=tokens_file,
    lm=files.lm,
    nbest=1,
    beam_size=1,
    lm_weight=LM_WEIGHT,
    word_score=WORD_SCORE,
    sil_score=SIL_SCORE,
    blank_token='<pad>',
    unk_word='<unk>'
)

100%|██████████| 4.97M/4.97M [00:00<00:00, 113MB/s]
100%|██████████| 57.0/57.0 [00:00<00:00, 75.0kB/s]
100%|██████████| 2.91G/2.91G [00:11<00:00, 271MB/s]


# Generate Hugginface Dataset Object from DAMP 300x30x2 dataset

Reference: https://huggingface.co/docs/datasets/audio_load

Other pre-processing is heavily inspired from: https://colab.research.google.com/drive/1nCC5Ci-81U5opK_VuXDiZlmcAuATreF2#scrollTo=RBDRAAYxRE6n



# Data Cleaning

In [8]:
def prepare_dataset(batch,tokenizer,feature_extractor):
    """
    Creating a new dataset with the map function to generate the
    keys below.  Padding will occur in the data collator on a per
    batch basis.

    Inputs (i.e. feature extractor):
    input_values   - tensor array for audio samples (shape=(n,) - where n is the number of audio samples)
    attention_mask - used for expressing where there are padded samples

    Outputs (i.e. tokenizer related)
    labels - tensor array for text output tokens (i.e. not transcript).  (shape=(m,) - where m is the number of character tokens)
    """
    chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"]'
    batch["transcription"] = re.sub(chars_to_ignore_regex, '', batch["transcription"]).lower() + " "
    batch["transcription"] = batch["transcription"].replace("in' ", "in ")

    audio = batch["audio"]

    # batched output is "un-batched" to ensure mapping is correct

    # Feature Extractor manipulation
    #
    # this object will return a list of lists because the
    # transcriptions are not padded (i.e. as opposed to a
    # Tensor of tensors when using return_tensors='pt').
    # Padding is done per batch to optimize the size for inference and
    # training.
    #
    # data_collator is responsible for padding the data.
    inputs_values_pt = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"])

    #
    # use this condition because some Wav2Vec2 models require attention_masks
    # for better performance while others do not.
    #
    if "attention_mask" in inputs_values_pt:
        batch["attention_mask"] = inputs_values_pt.attention_mask

    batch["input_values"] = inputs_values_pt.input_values[0]
    batch["input_length"] = len(batch["input_values"])

    # Tokenizer manipulation
    #
    # this object will return a list of lists because the
    # transcriptions are not padded (i.e. as opposed to a
    # Tensor of tensors when using return_tensors='pt').
    # Padding is done per batch to optimize the size for inference and
    # training.
    #
    # data_collator is responsible for padding the data.
    labels_pt = tokenizer(batch["transcription"])
    batch["labels"] = labels_pt['input_ids']

    return batch

In [9]:
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the already tokenized inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for proccessing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence if provided).

            Other Options in the pad method that are NOT implemented for this class (i.e. I always want to pad to longest for the
            input and the labels)
            * (not implemented) :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * (not implemented) :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).

    Reference Code here:
    https://huggingface.co/blog/fine-tune-wav2vec2-english


    Note: in the example referenced above, there were parameters for padding max length, etc.  I have created some logic
    in the prepare_dataset to support truncation of data for testing and benchmarking.  I do not think i need max_length
    options for collator at this time.

    """

    tokenizer: Wav2Vec2CTCTokenizer
    feature_extractor: Wav2Vec2FeatureExtractor
    padding: Union[bool, str] = "longest"

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:

        # Features in this case is a list of batch size that contains DataSet objects from the train split
        # (including pretokenized labels). the output batch has been changed from a list back to a dictionary
        # with the respective data objects.
        #
        # Note for future self:
        # pad is being called from PreTrainedTokenizerBase.pad.  From docs:
        #      Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence length
        #      in the batch.
        #
        #    Padding side (left/right) padding token ids are defined at the tokenizer level (with `self.padding_side`,
        #    `self.pad_token_id` and `self.pad_token_type_id`).

        #    Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the
        #    text followed by a call to the `pad` method to get a padded encoding.
        #
        #         <Tip>
        #
        #         If the `encoded_inputs` passed are dictionary of numpy arrays, PyTorch tensors or TensorFlow tensors, the
        #         result will use the same type unless you provide a different tensor type with `return_tensors`. In the case of
        #         PyTorch tensors, you will lose the specific device of your tensors however.
        #
        #         </Tip>

        # Audio Input Data (not tokenized)
        input_features = [{"input_values": feature["input_values"]} for feature in features]

        # batch is a dictionary-like type.
        batch = self.feature_extractor.pad(
            input_features,
            padding=self.padding,
            return_tensors="pt",
        )

        # Tokenized Transcript Labels (character level tokens)
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.tokenizer.pad(
            label_features,
            padding=self.padding,
            return_tensors="pt",
        )

        # replace padding with -100 to ignore loss correctly
        batch["labels"] = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        return batch

In [10]:
wer_metric = evaluate.load("wer")

def compute_metrics_dummy(eval_pred):
    return {'wer':1.0}


def compute_metrics(eval_pred,kind='beam',compute=True):
    """
    Calculates WER for a batch of logits and their labels.

    eval_pred - tuple (logit output from the model, token labels from dataset)
    kind - can compare between beam search and greedy search.  both using kenlm

    compute - bool - for training this will compute WER every time its logged.
                     this is nice for understanding if the training is working.
                     for evaluation, this is set to false so compute is run after
                     all batches are processed.

    output is the WER computed from the batch.  if the model is run multiple times, the
    batch WERs are aggregated.

    Note: add_batch and then doing compute will clear the previously cached batch results.
    """
    logits, labels = eval_pred
    #print(f"Logit Type: {type(logits)}")

    # In some scenarios, the input the compute_metrics is a tensor.
    if type(logits) is np.ndarray:
        logits = torch.Tensor(logits)
    else:
        # copy this tensor for computing things...
        logits = logits.clone().detach().requires_grad_(False)
    #print(f"Changing Logit Type to: {type(logits)}")
    #print(f"{logits.shape}")

    if kind=='beam':
        # Creates a list of lists that are of size [batch_size,emissions,vocab_size]
        #
        # Where output[0][0] gives you the CTCHypothesis object.
        #
        # Extract transcript from output[0][0].words (i.e. list of words).
        # May need to join depending on objective.
        #
        predictions = beam_search_decoder(logits)
    elif kind=='greedy':
        # Creates a list of lists that are of size [batch_size,1]
        #
        # Where output[0][0] gives you the CTCHypothesis object.
        #
        # Extract transcript from output[0][0].words (i.e. list of words).
        # May need to join depending on objective.
        #
        predictions = greedy_decoder(logits)
    else:
        print(f"Error passing in decoder kind: {kind}")
        sys.exit()

    ref = asr_pipeline.tokenizer.batch_decode(labels)
    pred = [" ".join(prediction[0].words) for prediction in predictions]

    wer_metric.add_batch(predictions=pred, references=ref)

    if compute:
        return {"wer":wer_metric.compute()}
    else:
      return {"wer":None}

Downloading builder script:   0%|          | 0.00/4.49k [00:00<?, ?B/s]

# Load Dataset

In [15]:

data_collator = DataCollatorCTCWithPadding(
    tokenizer=asr_pipeline.tokenizer,
    feature_extractor=asr_pipeline.feature_extractor,
)

#
dali_test = load_dataset("audiofolder", data_dir=dataset_folder, split=f'test')
if CALC_HOURS:
    arr_lens = [len(d['array']) for d in dali_test['audio']]
    print(f"Total Hours of Training Data: {np.sum(arr_lens)/ target_sampling_rate / 3600:.2f}")

dali_test = dali_test.to_iterable_dataset()
dali_test = dali_test.with_format('torch')
# make changes to dataset object to prepare for Wav2Vec2 model
dali_test = dali_test.map(
    prepare_dataset,
    remove_columns=["audio","transcription"],
    fn_kwargs={'tokenizer':asr_pipeline.tokenizer, 'feature_extractor':asr_pipeline.feature_extractor}
)

Resolving data files:   0%|          | 0/8797 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/485 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/483 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Total Hours of Training Data: 0.80


## Evaluation Output

In [16]:
from transformers import Wav2Vec2ForCTC

# Load the model checkpoint
model_checkpoint_path = "/content/drive/MyDrive/test_trainer_dali/checkpoint-22000/"
model = Wav2Vec2ForCTC.from_pretrained(model_checkpoint_path)


data_loader = DataLoader(
    dali_test,
    batch_size=32,  #32 with A100 (~5 min), 16 with V100 (~10 min)
    shuffle=False,
    num_workers=0,
    collate_fn=data_collator,
    pin_memory=True,
)

model.to('cuda')

total_start = time.time()
batch_inference_time = []
for i, batch in enumerate(data_loader):

    batch = {k:v.to('cuda') for k,v in batch.items()}
    labels = batch.pop("labels")

    # measure inference time
    start = time.time()
    with torch.no_grad():
        logits = model(**batch).logits

    # add batch, compute later...
    wer = compute_metrics((logits.to('cpu'),labels),compute=False)
    finish = time.time()
    batch_inference_time.append(finish-start)
    #print("*******************")
    print(f"{i}:{' ' if wer['wer'] is None else wer['wer']}",end='')
    print(f"({finish-start:.1f}s)")

total_finish = time.time()
total_processing_time = total_finish-total_start
total_inference_time = np.sum(batch_inference_time)
total_dataloading_time = total_processing_time - total_inference_time
print(f"Batch Inference took: {total_processing_time:.1f} seconds.")
print(f"Inference: {total_inference_time:.1f} seconds, DataLoading: {total_dataloading_time:.1f} seconds")
print(f"WER for all batches (lower is better): {wer_metric.compute()*100:.1f}%")

0: (91.0s)
1: (60.0s)
2: (49.7s)
3: (54.6s)
4: (55.7s)
5: (76.8s)
6: (50.2s)
7: (78.1s)
8: (40.5s)
9: (72.4s)
10: (62.5s)
11: (49.7s)
12: (79.0s)
13: (72.5s)
14: (85.6s)
Batch Inference took: 980.8 seconds.
Inference: 978.3 seconds, DataLoading: 2.4 seconds
WER for all batches (lower is better): 39.4%
