# Fine-Tuning Pre-trained WAV2VEC Using DSing Train

## Summary

1. Load the pre-trained WAV2VEC model.
2. Create a Hugging Face dataset for batch processing by cropping the DSing dataset described below (8 songs per split, 1.5 seconds max).

The dataset is cropped only so that the examples in this notebook can be processed quickly. The actual processing run will look different.

## DSing Dataset 

The data is read from the DSing dataset. The filesystem is laid out as shown below so that it converts easily to a Hugging Face dataset.

where: <br>
dev/test/trainX are the dataset splits.<br>
\[split\]_text contains the transcript for each snippet.<br>
\[split\]_spk2gender contains gender information for each snippet.<br>

Test Split: 480 Utterances, 48 minutes<br>
Dev Split: 482 Utterances, 41 minutes<br>
Train1 Split: 8794 Utterances, 15.1 hours<br>
Train3 Split: 25526 Utterances, 44.7 hours<br>
Train30 Split: 268,392 Utterances, 149.1 hours<br>

sing_300x30x2/dataset/<br>
├── dev/<br>
├───| metadata.csv<br>
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&nbsp;| \<audio files\>.wav<br>
├── dev_spk2gender<br>
├── dev_text<br>
├── test/<br>
├───| metadata.csv<br>
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&nbsp;| \<audio files\>.wav<br>
├── test_spk2gender<br>
├── test_text<br>
├── train1/<br>
├───| metadata.csv<br>
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&nbsp;| \<audio files\>.wav<br>
├── train1_spk2gender<br>
├── train1_text<br>
├── train3/<br>
├───| metadata.csv<br>
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&nbsp;| \<audio files\>.wav<br>
├── train3_spk2gender<br>
├── train3_text<br>
├── train30/<br>
├───| metadata.csv<br>
&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&nbsp;| \<audio files\>.wav<br>
├── train30_spk2gender<br>
└── train30_text<br>
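
The `audiofolder` loader pairs each audio file with the `metadata.csv` row whose `file_name` matches it. The sketch below shows one way a `[split]_text` file could be turned into such a CSV. It assumes Kaldi-style lines of the form `<utterance-id> <transcript>` and audio files named `<utterance-id>.wav`; neither assumption is confirmed by the notebook, and `write_metadata` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def write_metadata(text_file, split_dir):
    """Write split_dir/metadata.csv from a Kaldi-style text file.

    Assumption (not confirmed by the dataset): each line reads
    "<utterance-id> <transcript>", and the matching audio is stored
    as "<utterance-id>.wav" inside split_dir.
    """
    rows = []
    for line in Path(text_file).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        utt_id, _, transcript = line.strip().partition(" ")
        rows.append({"file_name": f"{utt_id}.wav", "transcription": transcript.strip()})

    out_path = Path(split_dir) / "metadata.csv"
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file_name", "transcription"])
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

The `transcription` column name is chosen to match the feature name that appears when the dataset object is printed later in the notebook.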


# For Colab Training

In [None]:
#from google.colab import drive
#drive.mount('/content/drive')

In [None]:
#!unzip /content/drive/MyDrive/damp_dataset.zip -d /

In [None]:
#!pip install transformers==4.39.1
#!pip install accelerate==0.28.0
#!pip install evaluate
#!pip install torchaudio
#!pip install flashlight-text
#!pip install jiwer

In [None]:
import gc
import torch

# Release unreferenced Python objects, then free cached GPU memory.
gc.collect()
torch.cuda.empty_cache()

# Basic Setup

In [1]:
import os
import sys
import time
import glob
import re
import json
import random
import tqdm

import torch
import torchaudio
import evaluate

import pandas as pd
import numpy as np
import IPython

from IPython.display import display, HTML, Audio

from datasets import load_dataset, ClassLabel, DatasetDict

from torch.utils.data import DataLoader

from torchaudio.models.decoder import ctc_decoder
from torchaudio.models.decoder import download_pretrained_files

from transformers import Wav2Vec2CTCTokenizer
from transformers import Wav2Vec2FeatureExtractor
from transformers import pipeline

from transformers import TrainingArguments, Trainer
from transformers.utils import logging




In [2]:
#
# Dataset unpacked and preprocessed into child folder dataset/
# DSing is around 1.9GB after being preprocessed.
#
tokens_file="./tokens.txt"
dataset_folder = "../sing_300x30x2/damp_dataset"

In [3]:
# Create Pipeline
model_checkpoint="facebook/wav2vec2-large-960h-lv60-self"
asr_pipeline = pipeline("automatic-speech-recognition", model=model_checkpoint)
model = asr_pipeline.model

# Note: I want to use a beam search decoder whose lexicon and language model were built from LibriSpeech.
#       Those pretrained files were created with all-lowercase text. I am changing the default model
#       tokenizer settings to accept lowercase input so that I can use the decoder. The tokenizer will
#       still accept uppercase input, so this may also affect performance, but only for inference
#       (I am not sure where the case is normalized).
asr_pipeline.tokenizer.do_lower_case = True

target_sampling_rate = asr_pipeline.feature_extractor.sampling_rate

Some weights of the model checkpoint at facebook/wav2vec2-large-960h-lv60-self were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h-lv60-self and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.maske

In [4]:
asr_pipeline.feature_extractor

Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
  "sampling_rate": 16000
}

In [5]:
asr_pipeline.tokenizer

Wav2Vec2CTCTokenizer(name_or_path='facebook/wav2vec2-large-960h-lv60-self', vocab_size=32, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<pad>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=False),
	1: AddedToken("<s>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=False),
	2: AddedToken("</s>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=False),
	3: AddedToken("<unk>", rstrip=True, lstrip=True, single_word=False, normalized=False, special=False),
}
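
The tokenizer above works at the character level (`vocab_size=32`), with `|` standing in for the space between words. The sketch below illustrates that mapping with a toy vocabulary. The token IDs are hypothetical; the real ones come from the checkpoint's own vocabulary file, and `encode` / `decode` are illustrative helpers, not tokenizer methods.

```python
# Hypothetical toy vocabulary: 5 special tokens plus 26 letters and the
# apostrophe. The real checkpoint defines its own IDs.
SPECIAL = ["<pad>", "<s>", "</s>", "<unk>", "|"]
LETTERS = list("abcdefghijklmnopqrstuvwxyz'")
VOCAB = {token: idx for idx, token in enumerate(SPECIAL + LETTERS)}
ID_TO_TOKEN = {idx: token for token, idx in VOCAB.items()}


def encode(transcript):
    """Map a transcript to character IDs, using "|" as the word delimiter."""
    tokens = ["|" if ch == " " else ch for ch in transcript.lower()]
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokens]


def decode(ids):
    """Invert encode(): drop special tokens and turn "|" back into spaces."""
    chars = []
    for idx in ids:
        token = ID_TO_TOKEN[idx]
        if token == "|":
            chars.append(" ")
        elif token not in ("<pad>", "<s>", "</s>", "<unk>"):
            chars.append(token)
    return "".join(chars)
```

For example, `encode("a b")` yields three IDs (letter, delimiter, letter), and `decode` maps them back to `"a b"`.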

## Decoders

In [73]:
files = download_pretrained_files("librispeech-4-gram")
# Found from Fairseq for Wav2Vec2 - 
# https://github.com/facebookresearch/fairseq/blob/main/examples/wav2vec/config/finetuning/vox_960h_2_aws.yaml
# Note: I am not using the same lexicon file though...
# LM_WEIGHT = 2.0
# WORD_SCORE = 0
# SIL_SCORE = -1

LM_WEIGHT = 3.23
WORD_SCORE = -0.26
SIL_SCORE = 0

beam_search_decoder = ctc_decoder(
    lexicon=files.lexicon,
    tokens=tokens_file,
    lm=files.lm,
    nbest=1,
    beam_size=1500,
    lm_weight=LM_WEIGHT,
    word_score=WORD_SCORE,
    sil_score=SIL_SCORE,
    blank_token='<pad>',
    unk_word='<unk>'
)

greedy_decoder = ctc_decoder(
    lexicon=files.lexicon,
    tokens=tokens_file,
    lm=files.lm,
    nbest=1,
    beam_size=1,
    lm_weight=LM_WEIGHT,
    word_score=WORD_SCORE,
    sil_score=SIL_SCORE,
    blank_token='<pad>',
    unk_word='<unk>'
)
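
Both decoders above consume frame-level emissions and treat `<pad>` as the CTC blank. As intuition for what any CTC decoder has to undo, the sketch below implements the simplest rule, best-path decoding: take the top token per frame, merge consecutive repeats, then drop blanks. This is a simplified stand-in, not what `ctc_decoder` does internally; the decoders configured above additionally use the lexicon and the language model. `best_path_decode` is a hypothetical helper.

```python
from itertools import groupby

BLANK = "<pad>"  # blank symbol, matching blank_token in the decoders above


def best_path_decode(emissions, tokens):
    """Best-path CTC decoding for a single utterance.

    emissions: list of per-frame score lists, one score per token.
    tokens:    list of token strings, index-aligned with the scores.
    """
    # 1. Pick the highest-scoring token in every frame.
    frame_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in emissions]
    # 2. Merge consecutive repeats, then 3. drop blanks.
    collapsed = [idx for idx, _ in groupby(frame_ids)]
    chars = [tokens[idx] for idx in collapsed if tokens[idx] != BLANK]
    # "|" marks word boundaries in this vocabulary.
    return "".join(chars).replace("|", " ").strip()
```

A blank between two identical letters is what lets CTC emit a genuine double letter: the frame sequence `a a <pad> a` decodes to `aa`, while `a a a` decodes to `a`.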

# Generate Hugging Face Dataset Object from DAMP 300x30x2 dataset

Reference: https://huggingface.co/docs/datasets/audio_load

Other pre-processing is heavily inspired from: https://colab.research.google.com/drive/1nCC5Ci-81U5opK_VuXDiZlmcAuATreF2#scrollTo=RBDRAAYxRE6n



In [7]:
# Read labels for all song utterances in the DSing splits: test, dev (aka validation), and train1 (aka train).

# This object will create a datasetDict for train, validation and test
dali = load_dataset("audiofolder", data_dir="../dali-dataset/dali_datasets")

# Note: creating small dataset objects to ensure training goes well.
#       However, a limitation of using the audiofolder is that I cannot specify how much of
#       the data to read in.
#dsing_small_train, dsing_small_val, dsing_small_test = load_dataset("audiofolder", data_dir=dataset_folder, split=['train[:8]','validation[:8]','test[:8]'])
# When using split with a list, the return objects are DataSet objects of size of list.
# Recasting back to DatasetDict for ease of use downstream.
#dsing = DatasetDict({"train": dsing_small_train, "validation": dsing_small_val, "test":dsing_small_test})


Resolving data files:   0%|          | 0/10187 [00:00<?, ?it/s]

In [8]:
dali

DatasetDict({
    test: Dataset({
        features: ['audio', 'transcription'],
        num_rows: 10186
    })
})

# Data Cleaning

In [6]:
def prepare_dataset(batch,tokenizer,feature_extractor):
    """
    Creating a new dataset with the map function to generate the 
    keys below.  Padding will occur in the data collator on a per
    batch basis. 

    Inputs (i.e. feature extractor):
    input_values   - tensor array for audio samples (shape=(n,) - where n is the number of audio samples)
    attention_mask - used for expressing where there are padded samples 

    Outputs (i.e. tokenizer related)
    labels - tensor array for text output tokens (i.e. not transcript).  (shape=(m,) - where m is the number of character tokens)
    """
    # Raw string so the backslashes reach the regex engine unchanged.
    chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'
    batch["transcription"] = re.sub(chars_to_ignore_regex, '', batch["transcription"]).lower() + " "

    audio = batch["audio"]

    # The feature extractor returns a batch of size one; it is "un-batched"
    # below (input_values[0]) so each example maps to a single array.

    # Feature Extractor manipulation
    #
    # This call returns a list of unpadded arrays because the
    # audio is not padded here (i.e. as opposed to a
    # tensor of tensors when using return_tensors='pt').
    # Padding is done per batch to optimize the size for inference and 
    # training.
    #
    # data_collator is responsible for padding the data.
    inputs_values_pt = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"])
    batch["attention_mask"] = inputs_values_pt.attention_mask
    batch["input_values"] = inputs_values_pt.input_values[0]
    batch["input_length"] = len(batch["input_values"])
    
    # Tokenizer manipulation
    #
    # this object will return a list of lists because the 
    # transcriptions are not padded (i.e. as opposed to a 
    # Tensor of tensors when using return_tensors='pt').
    # Padding is done per batch to optimize the size for inference and 
    # training.
    #
    # data_collator is responsible for padding the data.
    labels_pt = tokenizer(batch["transcription"])
    batch["labels"] = labels_pt['input_ids']
    
    return batch
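
The text normalization at the top of `prepare_dataset` can be tried in isolation. The sketch below uses the same character class (written as a raw string); `normalize_transcript` is a hypothetical helper name introduced only for illustration.

```python
import re

# Same character class as chars_to_ignore_regex above, as a raw string.
CHARS_TO_IGNORE = r'[\,\?\.\!\-\;\:\"]'


def normalize_transcript(text):
    """Strip the listed punctuation, lowercase, and append a trailing space."""
    return re.sub(CHARS_TO_IGNORE, "", text).lower() + " "
```

Note that apostrophes are kept (`don't` stays intact), while hyphens are removed without inserting a space, so a hyphenated word such as `well-known` becomes `wellknown`.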

In [7]:
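# Create dataset keys that are expected by model for each split.
# Note: `dsing` is the DatasetDict (train/validation/test) built by the DSing loading code that is
#       commented out in the loading cell above; it must be defined before this cell runs.
dsing_dataset = dsing.map(prepare_dataset, remove_columns=["audio","transcription"], fn_kwargs={'tokenizer':asr_pipeline.tokenizer, 'feature_extractor':asr_pipeline.feature_extractor})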


Map:   0%|          | 0/8794 [00:00<?, ? examples/s]

Map:   0%|          | 0/482 [00:00<?, ? examples/s]

Map:   0%|          | 0/480 [00:00<?, ? examples/s]

In [17]:
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the already tokenized inputs received.
    Args:
        tokenizer (:class:`~transformers.Wav2Vec2CTCTokenizer`)
            The tokenizer used for padding the tokenized labels.
        feature_extractor (:class:`~transformers.Wav2Vec2FeatureExtractor`)
            The feature extractor used for padding the audio inputs.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).

            Other Options in the pad method that are NOT implemented for this class (i.e. I always want to pad to longest for the 
            input and the labels)
            * (not implemented) :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * (not implemented) :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).

    Reference Code here:
    https://huggingface.co/blog/fine-tune-wav2vec2-english

    
    Note: in the example referenced above, there were parameters for padding max length, etc.  I have created some logic 
    in the prepare_dataset to support truncation of data for testing and benchmarking.  I do not think i need max_length 
    options for collator at this time.

    """
    
    tokenizer: Wav2Vec2CTCTokenizer
    feature_extractor: Wav2Vec2FeatureExtractor
    padding: Union[bool, str] = "longest"

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        
        # Features in this case is a list of batch size that contains DataSet objects from the train split
        # (including pretokenized labels). the output batch has been changed from a list back to a dictionary 
        # with the respective data objects.
        #
        # Note for future self: 
        # pad is being called from PreTrainedTokenizerBase.pad.  From docs:
        #      Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence length
        #      in the batch.
        #      
        #    Padding side (left/right) padding token ids are defined at the tokenizer level (with `self.padding_side`,
        #    `self.pad_token_id` and `self.pad_token_type_id`).

        #    Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the
        #    text followed by a call to the `pad` method to get a padded encoding.
        # 
        #         <Tip>
        # 
        #         If the `encoded_inputs` passed are dictionary of numpy arrays, PyTorch tensors or TensorFlow tensors, the
        #         result will use the same type unless you provide a different tensor type with `return_tensors`. In the case of
        #         PyTorch tensors, you will lose the specific device of your tensors however.
        # 
        #         </Tip>

        # Audio Input Data (not tokenized)
        input_features = [{"input_values": feature["input_values"]} for feature in features]

        # batch is a dictionary-like type.
        batch = self.feature_extractor.pad(
            input_features,
            padding=self.padding,
            return_tensors="pt",
        )
        
        # Tokenized Transcript Labels (character level tokens)
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.tokenizer.pad(
            label_features,
            padding=self.padding,
            return_tensors="pt",
        )
        
        # replace padding with -100 to ignore loss correctly
        batch["labels"] = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        
        return batch
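
The last step of the collator, replacing label padding with `-100`, matters because the pad ID would otherwise be treated as a real target by the CTC loss. The sketch below reproduces that step with plain Python lists so the effect is easy to inspect. `pad_labels` is a hypothetical helper, and `pad_id=0` is taken from the `<pad>` entry in the tokenizer output shown earlier.

```python
IGNORE_INDEX = -100  # label value the loss skips, as in the collator above


def pad_labels(label_sequences, pad_id=0):
    """Right-pad label ID lists to the longest one, then mask the padding.

    Returns (padded_ids, attention_mask, masked_labels) as lists of lists.
    """
    longest = max(len(seq) for seq in label_sequences)
    padded, mask, labels = [], [], []
    for seq in label_sequences:
        n_pad = longest - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
        # Same effect as masked_fill(attention_mask.ne(1), -100)
        labels.append(list(seq) + [IGNORE_INDEX] * n_pad)
    return padded, mask, labels
```

With two label sequences of lengths 3 and 1, the shorter one comes back as `[id, -100, -100]`, so only its first position contributes to the loss.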

In [74]:
wer_metric = evaluate.load("wer")

def compute_metrics(eval_pred,kind='beam',compute=True):
    """
    Calculates WER for a batch of logits and their labels.

    eval_pred - tuple (logit output from the model, token labels from dataset)
    kind - can compare between beam search and greedy search.  both using kenlm 

    compute - bool - for training this will compute WER every time its logged.  
                     this is nice for understanding if the training is working.
                     for evaluation, this is set to false so compute is run after
                     all batches are processed.

    output is the WER computed from the batch.  if the model is run multiple times, the 
    batch WERs are aggregated.

    Note: add_batch and then doing compute will clear the previously cached batch results.
    """
    logits, labels = eval_pred
    #print(f"Logit Type: {type(logits)}")

    # Depending on the caller, the logits arrive as a NumPy array or as a tensor.
    if isinstance(logits, np.ndarray):
        logits = torch.Tensor(logits)
    else:
        # Work on a detached copy so decoding never touches the autograd graph.
        logits = logits.clone().detach().requires_grad_(False)    
    #print(f"Changing Logit Type to: {type(logits)}")
    #print(f"{logits.shape}")
    
    if kind=='beam':
        # Creates a list of lists that are of size [batch_size,emissions,vocab_size]
        #
        # Where output[0][0] gives you the CTCHypothesis object.
        #
        # Extract transcript from output[0][0].words (i.e. list of words).  
        # May need to join depending on objective.
        #
        predictions = beam_search_decoder(logits)
    elif kind=='greedy':
        # Creates a list of lists that are of size [batch_size,1]
        #
        # Where output[0][0] gives you the CTCHypothesis object.
        #
        # Extract transcript from output[0][0].words (i.e. list of words).  
        # May need to join depending on objective.
        #
        predictions = greedy_decoder(logits)
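    else:
        # Raise instead of calling sys.exit(), which would kill the notebook kernel.
        raise ValueError(f"Unknown decoder kind: {kind!r} (expected 'beam' or 'greedy')")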

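    # Padded label positions hold -100 (ignored by the loss); restore the pad ID before decoding.
    # group_tokens=False keeps doubled letters intact ("fool" rather than "fol").
    if isinstance(labels, torch.Tensor):
        labels = labels.cpu().numpy()
    labels = np.asarray(labels)
    labels = np.where(labels == -100, asr_pipeline.tokenizer.pad_token_id, labels)
    ref = asr_pipeline.tokenizer.batch_decode(labels, group_tokens=False)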
    pred = [" ".join(prediction[0].words) for prediction in predictions]

    wer_metric.add_batch(predictions=pred, references=ref)

    if compute: 
        return {"wer":wer_metric.compute()}

    return None
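
For reference, word error rate is the word-level edit distance between prediction and reference, divided by the number of reference words. The sketch below is a plain-Python version that pools edits and word counts over the whole batch. It is meant as an illustration of the metric, not as a replacement for `evaluate.load("wer")`, and `word_error_rate` is a hypothetical helper.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Levenshtein distance over words, computed one row at a time.
        previous = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            current = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                substitution = previous[j - 1] + (ref_word != hyp_word)
                current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
            previous = current
        total_edits += previous[-1]
        total_words += len(ref_words)
    return total_edits / total_words
```

A single wrong word in a seven-word reference gives a WER of 1/7, roughly 0.143.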

In [81]:
#logging.set_verbosity_info()
#logger = logging.get_logger("transformers")
#logger.warning("INFO")

training_args = TrainingArguments(
    output_dir="test_trainer",
    overwrite_output_dir = True,
    # gradient_checkpointing
    # Trades compute for memory: activation
    # memory drops to roughly O(sqrt(n)), at
    # the cost of performing about one
    # additional forward pass.
    gradient_checkpointing=True,   
    use_cpu=True,
    #fp16=True,                      #use when we are doing the GPU based training
    #resume_from_checkpoint=True,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    evaluation_strategy="steps",
    num_train_epochs=1,
    eval_steps=1,
#    logging_dir='./logs',
    learning_rate=1e-4,              # Based on fairseq yaml
    weight_decay=0.005,
    warmup_steps=1,
    save_strategy='epoch',
    save_total_limit=2,
#    report_to='all',                 #logging thing.  there was a warning.
#    logging_steps=1,
#    logging_strategy='steps',
#    log_level='warning'
)

INFO
PyTorch: setting up devices
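
The step count and learning-rate curve implied by these arguments can be worked out by hand. The sketch below assumes the 8-example cropped training split described in the summary, a single device, no gradient accumulation, and a linear warmup followed by linear decay (which, to my knowledge, is the Trainer's default scheduler). Both function names are hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, epochs=1):
    """Optimizer steps for one device and no gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate after `step` updates: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * (step / max(1, warmup_steps))
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))
```

With `per_device_train_batch_size=2`, 8 examples give 4 steps per epoch. With `warmup_steps=1` and `learning_rate=1e-4`, the rate reaches its peak after the first update and then decays linearly to zero by step 4.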


In [82]:
model.freeze_feature_encoder()

data_collator = DataCollatorCTCWithPadding(
    tokenizer=asr_pipeline.tokenizer, 
    feature_extractor=asr_pipeline.feature_extractor, 
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dsing_dataset['train'],
    eval_dataset=dsing_dataset['validation'],
    compute_metrics=compute_metrics,
    tokenizer=asr_pipeline.feature_extractor,
)

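# Note: this configuration object is created here but is not passed to the Trainer above.
from accelerate.utils import DataLoaderConfiguration
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)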


In [85]:
start = time.time()
trainer.train()
finish = time.time()
print(f"Finished in {(finish-start)/60} minutes.") 



| Step | Training Loss | Validation Loss | WER |
|------|---------------|-----------------|-----|
| 1 | 118.9055 | 27.841282 | 0.263889 |
| 2 | 149.0431 | 25.975536 | 0.25 |
| 3 | 376.1458 | 26.509171 | 0.25 |
| 4 | 206.9452 | 27.088604 | 0.25 |

Finished in 1.9081311305363973 minutes.


## Example Output

In [25]:
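labels = dsing_dataset['test']['labels']
# Note: batch_decode merges consecutive repeated tokens by default (group_tokens=True), which is
#       why doubled letters are collapsed in the output below ("fol", "shoping", "mised", "dres").
#       Passing group_tokens=False decodes the labels verbatim.
ref = asr_pipeline.tokenizer.batch_decode(labels)
ref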

['stop making a fol out of me',
 'valerie',
 'are you shoping anywhere',
 "and i've mised your ginger hair",
 'and the way you like to dres',
 "why don't you come on over valerie",
 'valerie valerie valerie',
 'and in my head i paint a picture']
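
The references above read `fol`, `shoping`, `mised` and `dres` where `fool`, `shopping`, `missed` and `dress` would be expected. That pattern is consistent with CTC-style grouping of repeated tokens being applied while decoding the label IDs. The sketch below is a simplified illustration of that grouping, not the tokenizer's actual implementation; `decode_chars` is a hypothetical helper.

```python
from itertools import groupby


def decode_chars(chars, group_tokens=True):
    """Join character tokens, optionally merging consecutive repeats first."""
    if group_tokens:
        chars = [char for char, _ in groupby(chars)]
    return "".join(chars).replace("|", " ").strip()
```

Grouping is the right behavior for model output, where repeated frames must be merged, but ground-truth labels contain no repeated frames, so they should be decoded without it.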

In [26]:
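model.eval()  # `logits` was never defined above, so compute it from the test split first
with torch.no_grad():
    logits = model(**data_collator(list(dsing_dataset['test']))).logits
predictions = beam_search_decoder(logits)
pred = [" ".join(prediction[0].words) for prediction in predictions]
pred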

In [None]:
arr = np.array([[1.,2.,3.,4.]])

In [None]:
torch.tensor(arr)