## What is `wav2vec2`

Wav2Vec2 is a pretrained model for Automatic Speech Recognition (ASR) and was released in [September 2020](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) by Alexei Baevski, Michael Auli, and Alex Conneau.

Using a novel contrastive pretraining objective, Wav2Vec2 learns powerful speech representations from more than 50,000 hours of unlabeled speech. Similar to [BERT's masked language modeling](http://jalammar.github.io/illustrated-bert/), the model learns contextualized speech representations by randomly masking feature vectors before passing them to a transformer network.

![wav2vec2_structure](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/wav2vec2.png)
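
To make the masking step concrete, here is a minimal sketch of span masking over a sequence of latent frames. It is a simplification and not the fairseq implementation: each frame independently becomes a span start with probability `start_prob`, and the following `span_length` frames are masked (spans may overlap). The defaults 0.065 and 10 follow the values reported in the paper for pretraining; the function name and the `seed` argument are made up for this illustration.

```python
import random


def sample_span_mask(num_frames, start_prob=0.065, span_length=10, seed=0):
    """Return a list of booleans marking which latent frames are masked."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    mask = [False] * num_frames
    for start in range(num_frames):
        # Each frame is chosen as a span start with probability start_prob.
        if rng.random() < start_prob:
            # Mask the start frame and the frames that follow it.
            for t in range(start, min(start + span_length, num_frames)):
                mask[t] = True
    return mask


mask = sample_span_mask(num_frames=100)
print(sum(mask), "of", len(mask), "frames masked")
```

During pretraining, the transformer sees the masked sequence and has to pick the true quantized latent for each masked frame from a set of distractors.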

For the first time, it has been shown that pretraining followed by fine-tuning on very little labeled speech data achieves results competitive with state-of-the-art ASR systems. Using as little as 10 minutes of labeled data, Wav2Vec2 yields a word error rate (WER) of less than 5% on the clean test set of [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) (*cf.* Table 9 of the [paper](https://arxiv.org/pdf/2006.11477.pdf)).

## Installs and Imports

In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="5"
from functools import partial
import pandas as pd
import numpy as np
from datasets import (
    load_dataset, 
    load_from_disk,
    load_metric,)
# from datasets.filesystems import S3FileSystem
from transformers import (
    Wav2Vec2CTCTokenizer, 
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    Wav2Vec2ForCTC,
    TrainingArguments,
    Trainer,
)
import torchaudio
import re
import json
from pythainlp.tokenize import word_tokenize, syllable_tokenize

## Data Preparation

### Clean

### Load Dataset

In [5]:
datasets = load_dataset("../scripts/th_common_voice_70.py", "ur")
datasets

Reusing dataset common_voice (/home/saad/.cache/huggingface/datasets/common_voice/ur/7.0.0/22f498ef791fb449d0ba8185660e0dd2cee815f02b179584bf3fa87098086dac)



DatasetDict({
    train: Dataset({
        features: ['path', 'sentence'],
        num_rows: 3156
    })
    test: Dataset({
        features: ['path', 'sentence'],
        num_rows: 4
    })
    validation: Dataset({
        features: ['path', 'sentence'],
        num_rows: 341
    })
})

In [6]:
def preprocess_data(example, tok_func = word_tokenize):
    example['sentence'] = ' '.join(tok_func(example['sentence']))
    return example

datasets = datasets.map(preprocess_data)
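
One side effect of joining tokens with `' '` is worth illustrating. The transcripts printed later in this notebook show three spaces between words, which suggests that the tokenizer returns whitespace as separate tokens. The sketch below reproduces that effect with a hypothetical stand-in tokenizer (`toy_tokenize` is not part of PyThaiNLP) and shows one way to avoid it by dropping whitespace tokens before joining.

```python
import re


def toy_tokenize(text):
    # Hypothetical stand-in for a word tokenizer that keeps whitespace tokens.
    return re.findall(r"\s+|\S+", text)


sentence = "one two three"

# Joining every token, including the whitespace tokens, triples each space.
joined = " ".join(toy_tokenize(sentence))

# Dropping whitespace tokens first gives single spaces between words.
cleaned = " ".join(tok for tok in toy_tokenize(sentence) if not tok.isspace())

print(repr(joined))
print(repr(cleaned))
```

With CTC decoding the repeated delimiter collapses to a single space, so the extra spaces mostly matter when comparing raw reference strings.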


### Exploration

#### `sentence`; transcripts

In [7]:
#show random sentences
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    display(HTML(df.to_html()))
    
show_random_elements(datasets["train"].remove_columns(["path"]), num_examples=20)

| | sentence |
|---|---|
| 0 | ہمیں اسی جذبے کے ساتھ اس معاملے کو دیکھنا ہو گا۔ |
| 1 | یہاں عروج وزوال کا قانون کیا ہے۔ |
| 2 | اس کا کچھ کارن شکستہ کا ستر رہا ہے تو کچھ راج نہیں دیکھا |
| 3 | ناشتے کے بعد سب تیار ہو گئے |
| 4 | نیلو کے ڈانسز دیکھیں۔ |
| 5 | یاد کیے جائیں گے۔ |
| 6 | اتنا انوکھا نہیں۔ |
| 7 | خواہش دلوں کو خودسر اور جنگجو بنا دیتی ہے |
| 8 | ان میں سے ایک کا نام تھا دکش |
| 9 | آرکائیوز کے ہجوں کو بازیچہ بنانا غایت سماوی قصور نہیں |


In [9]:
train_df = pd.DataFrame({'sentence':datasets['train']['sentence']})
train_df['nb_words'] = train_df.sentence.map(lambda x: len(x.split()))
# train_df.nb_words.hist(bins=30)

In [10]:
validation_df = pd.DataFrame({'sentence':datasets['validation']['sentence']})
validation_df['nb_words'] = validation_df.sentence.map(lambda x: len(x.split()))
# validation_df.nb_words.hist(bins=30)

In [11]:
test_df = pd.DataFrame({'sentence':datasets['test']['sentence']})
test_df['nb_words'] = test_df.sentence.map(lambda x: len(x.split()))
# test_df.nb_words.hist(bins=30)

#### `path`; mp3 files

In [12]:
train_df = pd.DataFrame({'path':datasets['train']['path']})
train_df['sample_rate'] = train_df.path.map(lambda x: torchaudio.info(x).sample_rate)
train_df['num_frames'] = train_df.path.map(lambda x: torchaudio.info(x).num_frames)
train_df['seconds'] = train_df.num_frames / train_df.sample_rate
# train_df.seconds.hist(bins=30)

In [13]:
validation_df = pd.DataFrame({'path':datasets['validation']['path']})
validation_df['sample_rate'] = validation_df.path.map(lambda x: torchaudio.info(x).sample_rate)
validation_df['num_frames'] = validation_df.path.map(lambda x: torchaudio.info(x).num_frames)
validation_df['seconds'] = validation_df.num_frames / validation_df.sample_rate
# validation_df.seconds.hist(bins=30)

In [14]:
test_df = pd.DataFrame({'path':datasets['test']['path']})
test_df['sample_rate'] = test_df.path.map(lambda x: torchaudio.info(x).sample_rate)
test_df['num_frames'] = test_df.path.map(lambda x: torchaudio.info(x).num_frames)
test_df['seconds'] = test_df.num_frames / test_df.sample_rate
# test_df.seconds.hist(bins=30)

### Create Wav2Vec2CTCTokenizer

A [Connectionist Temporal Classification (CTC)](https://distill.pub/2017/ctc/) tokenizer is a character-level tokenizer. We use the space character (represented by the `|` token) as the word delimiter and `[PAD]` as the blank token.

In [15]:
def extract_all_chars(batch, text_col = "sentence"):
    all_text = " ".join(batch[text_col])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}

vocabs = datasets.map(extract_all_chars, 
                   batched=True, 
                   batch_size=-1, 
                   keep_in_memory=True, 
                   remove_columns=datasets.column_names["train"])

vocab_list = list(set(vocabs["train"]["vocab"][0]) | set(vocabs["validation"]["vocab"][0]) |set(vocabs["test"]["vocab"][0]))
# vocab_list = list(set(vocabs["train"]["vocab"][0])) #strictly no leakage
vocab_dict = {v: k for k, v in enumerate(vocab_list)}
len(vocab_dict), vocab_dict


(67,
 {'ۃ': 0,
  '،': 1,
  'ے': 2,
  'ئ': 3,
  'ف': 4,
  '؟': 5,
  'ٹ': 6,
  'ٰ': 7,
  'ڈ': 8,
  'ء': 9,
  '!': 10,
  '-': 11,
  'َ': 12,
  'ل': 13,
  'ں': 14,
  ' ': 15,
  'ً': 16,
  'ي': 17,
  'ز': 18,
  'و': 19,
  'د': 20,
  'ڑ': 21,
  'ھ': 22,
  '’': 23,
  'گ': 24,
  'چ': 25,
  'ب': 26,
  'س': 27,
  'م': 28,
  'ۂ': 29,
  '‘': 30,
  'ذ': 31,
  'ت': 32,
  'ُ': 33,
  'ط': 34,
  'ن': 35,
  'ع': 36,
  'ر': 37,
  'ض': 38,
  'ہ': 39,
  'ک': 40,
  'ص': 41,
  'ح': 42,
  'ؤ': 43,
  'غ': 44,
  'ی': 45,
  'ۓ': 46,
  'ٔ': 47,
  'ّ': 48,
  'خ': 49,
  'ظ': 50,
  ':': 51,
  'آ': 52,
  'ث': 53,
  'ژ': 54,
  'ٓ': 55,
  'ى': 56,
  '"': 57,
  'ا': 58,
  'ق': 59,
  '۔': 60,
  'ج': 61,
  'پ': 62,
  'ِ': 63,
  "'": 64,
  'ش': 65,
  '\u200f': 66})

In [16]:
#make space = |
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]

In [17]:
#padding token serves as blank token
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)
len(vocab_dict), vocab_dict

(69,
 {'ۃ': 0,
  '،': 1,
  'ے': 2,
  'ئ': 3,
  'ف': 4,
  '؟': 5,
  'ٹ': 6,
  'ٰ': 7,
  'ڈ': 8,
  'ء': 9,
  '!': 10,
  '-': 11,
  'َ': 12,
  'ل': 13,
  'ں': 14,
  'ً': 16,
  'ي': 17,
  'ز': 18,
  'و': 19,
  'د': 20,
  'ڑ': 21,
  'ھ': 22,
  '’': 23,
  'گ': 24,
  'چ': 25,
  'ب': 26,
  'س': 27,
  'م': 28,
  'ۂ': 29,
  '‘': 30,
  'ذ': 31,
  'ت': 32,
  'ُ': 33,
  'ط': 34,
  'ن': 35,
  'ع': 36,
  'ر': 37,
  'ض': 38,
  'ہ': 39,
  'ک': 40,
  'ص': 41,
  'ح': 42,
  'ؤ': 43,
  'غ': 44,
  'ی': 45,
  'ۓ': 46,
  'ٔ': 47,
  'ّ': 48,
  'خ': 49,
  'ظ': 50,
  ':': 51,
  'آ': 52,
  'ث': 53,
  'ژ': 54,
  'ٓ': 55,
  'ى': 56,
  '"': 57,
  'ا': 58,
  'ق': 59,
  '۔': 60,
  'ج': 61,
  'پ': 62,
  'ِ': 63,
  "'": 64,
  'ش': 65,
  '\u200f': 66,
  '|': 15,
  '[UNK]': 67,
  '[PAD]': 68})

In [18]:
#save as json; create tokenizer for the first time and upload to Hugging Face hub
# with open('../data/vocab.json', 'w') as vocab_file:
#     json.dump(vocab_dict, vocab_file)
tokenizer = Wav2Vec2CTCTokenizer("../data/vocab.json", 
                                 unk_token="[UNK]", 
                                 pad_token="[PAD]", 
                                 word_delimiter_token="|")
tokenizer.save_pretrained('../wav2vec2-large-xlsr-53-ur')

('../wav2vec2-large-xlsr-53-ur/tokenizer_config.json',
 '../wav2vec2-large-xlsr-53-ur/special_tokens_map.json',
 '../wav2vec2-large-xlsr-53-ur/vocab.json',
 '../wav2vec2-large-xlsr-53-ur/added_tokens.json')

In [19]:
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("../wav2vec2-large-xlsr-53-ur", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [20]:
datasets['train'][0]['sentence']

'نیلم   نے   سالگرہ   پر   ہیڈ   سیسموگراف   اسود   قریشی   کے   ماتھے   پر   اینٹھن   اور   غم   کی   آتشیں   رو   محسوس   کی'

In [21]:
tokenizer(datasets['train'][0]['sentence'])

{'input_ids': [55, 44, 49, 80, 92, 92, 92, 55, 12, 92, 92, 92, 85, 60, 49, 11, 25, 9, 92, 92, 92, 84, 25, 92, 92, 92, 9, 44, 43, 92, 92, 92, 85, 44, 85, 80, 3, 11, 25, 60, 30, 92, 92, 92, 60, 85, 3, 89, 92, 92, 92, 86, 25, 44, 50, 44, 92, 92, 92, 31, 12, 92, 92, 92, 80, 60, 19, 6, 12, 92, 92, 92, 84, 25, 92, 92, 92, 60, 44, 55, 54, 6, 55, 92, 92, 92, 60, 3, 25, 92, 92, 92, 67, 80, 92, 92, 92, 31, 44, 92, 92, 92, 71, 19, 50, 44, 57, 92, 92, 92, 25, 3, 92, 92, 92, 80, 18, 85, 3, 85, 92, 92, 92, 31, 44], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [22]:
tokenizer.decode(tokenizer(datasets['train'][0]['sentence']).input_ids)

'نیلم نے سالگرہ پر ہیڈ سیسموگراف اسود قریشی کے ماتھے پر اینٹھن اور غم کی آتشیں رو محسوس کی'
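
The round trip above turns the three-space gaps of the input into single spaces: the encoded ids contain runs of `92, 92, 92` (presumably the id of `|`), and decoding merges them. The following sketch shows the CTC-style collapsing that explains this behaviour. It is an illustration with a made-up four-token vocabulary, not the `Wav2Vec2CTCTokenizer` source; the function name and ids are assumptions.

```python
from itertools import groupby


def ctc_collapse(token_ids, id_to_token, blank="[PAD]", delimiter="|"):
    """Merge repeated ids, drop the blank token, map the delimiter to a space."""
    merged = [key for key, _ in groupby(token_ids)]  # 1) merge runs of equal ids
    tokens = [id_to_token[i] for i in merged]
    # 2) drop blanks and 3) turn the word delimiter into a space
    chars = [" " if tok == delimiter else tok for tok in tokens if tok != blank]
    return "".join(chars).strip()


toy_vocab = {0: "[PAD]", 1: "|", 2: "a", 3: "b"}  # hypothetical vocabulary
ids = [2, 2, 0, 2, 1, 1, 1, 3, 0]
print(ctc_collapse(ids, toy_vocab))  # aa b
```

The blank token between the two groups of `a` is what lets CTC emit a doubled letter, and the three consecutive `|` ids collapse into one word boundary.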

In [23]:
tokenizer.vocab_size

93

### Create Wav2Vec2 Feature Extractor

Wav2Vec2 was pretrained on audio from [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) and LibriVox, both of which are sampled at 16 kHz. [Common Voice Corpus 7.0](https://commonvoice.mozilla.org/en/datasets) has a 32 kHz sampling rate.

A Wav2Vec2 feature extractor object requires the following parameters to be instantiated:

- `feature_size`: Speech models take a sequence of feature vectors as an input. While the length of this sequence obviously varies, the feature size should not. In the case of Wav2Vec2, the feature size is 1 because the model was trained on the raw speech signal.
- `sampling_rate`: The sampling rate at which the model was trained.
- `padding_value`: For batched inference, shorter inputs need to be padded with a specific value.
- `do_normalize`: Whether the input should be *zero-mean-unit-variance* normalized or not. Usually, speech models perform better when the input is normalized.
- `return_attention_mask`: Whether the model should make use of an `attention_mask` for batched inference. In general, models should **always** make use of the `attention_mask` to mask padded tokens. However, due to a very specific design choice of `Wav2Vec2`'s "base" checkpoint, better results are achieved when using no `attention_mask`. This is **not** recommended for other speech models. For more information, one can take a look at [this](https://github.com/pytorch/fairseq/issues/3227) issue. **Important:** If you want to use this notebook to fine-tune [large-lv60](https://huggingface.co/facebook/wav2vec2-large-lv60), this parameter should be set to `True`.
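
As a rough illustration of what `do_normalize`, `padding_value` and `return_attention_mask` refer to, the sketch below normalizes two toy waveforms and pads them to a common length in plain Python. The helper names and the `eps` constant are assumptions made for this example; the real feature extractor works on NumPy arrays.

```python
import math


def zero_mean_unit_var(x, eps=1e-7):
    """Shift a waveform to zero mean and scale it to unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pad_batch(batch, padding_value=0.0):
    """Right-pad every waveform to the longest one and build a 0/1 mask."""
    longest = max(len(x) for x in batch)
    padded = [x + [padding_value] * (longest - len(x)) for x in batch]
    mask = [[1] * len(x) + [0] * (longest - len(x)) for x in batch]
    return padded, mask


waveforms = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5]]
normalized = [zero_mean_unit_var(w) for w in waveforms]
padded, attention_mask = pad_batch(normalized)
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The mask marks which positions hold real audio, which is the information a model uses when `return_attention_mask=True`.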

In [24]:
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, 
                                             sampling_rate=16000, 
                                             padding_value=0.0, 
                                             do_normalize=True, 
                                             return_attention_mask=False)

In [25]:
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

### Preprocess Data

We resample the audio to 16 kHz, the sampling rate at which `wav2vec2` was pretrained.
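
To show what resampling does to the number of samples, here is a deliberately naive resampler based on linear interpolation. This is only a sketch of the idea: `torchaudio.transforms.Resample` uses a higher-quality band-limited interpolation and should be preferred in practice. The function name is made up for this example.

```python
def resample_linear(samples, orig_sr, target_sr):
    """Naive resampling by linear interpolation between neighbouring samples."""
    if orig_sr == target_sr:
        return list(samples)
    n_out = int(len(samples) * target_sr / orig_sr)
    ratio = orig_sr / target_sr  # step through the input for each output sample
    out = []
    for i in range(n_out):
        pos = i * ratio
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


ramp = [float(v) for v in range(8)]  # 8 samples at a nominal 32 kHz
print(resample_linear(ramp, orig_sr=32000, target_sr=16000))  # [0.0, 2.0, 4.0, 6.0]
```

Going from 32 kHz to 16 kHz halves the number of samples while the duration in seconds stays the same.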

In [26]:
def speech_file_to_array_fn(batch, 
                            text_col="sentence", 
                            fname_col="path",
                            resampling_to=16000):
    speech_array, sampling_rate = torchaudio.load(batch[fname_col])
    resampler=torchaudio.transforms.Resample(sampling_rate, resampling_to)
    batch["speech"] = resampler(speech_array)[0].numpy()
    batch["sampling_rate"] = resampling_to
    batch["target_text"] = batch[text_col]
    return batch

In [27]:
speech_datasets = datasets.map(speech_file_to_array_fn, 
                                   remove_columns=datasets.column_names["train"])
speech_datasets

Loading cached processed dataset at /home/saad/.cache/huggingface/datasets/common_voice/ur/7.0.0/22f498ef791fb449d0ba8185660e0dd2cee815f02b179584bf3fa87098086dac/cache-19b410dddaad8d33.arrow
Loading cached processed dataset at /home/saad/.cache/huggingface/datasets/common_voice/ur/7.0.0/22f498ef791fb449d0ba8185660e0dd2cee815f02b179584bf3fa87098086dac/cache-721b0e73fa2b0393.arrow
Loading cached processed dataset at /home/saad/.cache/huggingface/datasets/common_voice/ur/7.0.0/22f498ef791fb449d0ba8185660e0dd2cee815f02b179584bf3fa87098086dac/cache-321cd90758916757.arrow


DatasetDict({
    train: Dataset({
        features: ['speech', 'sampling_rate', 'target_text'],
        num_rows: 3156
    })
    test: Dataset({
        features: ['speech', 'sampling_rate', 'target_text'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['speech', 'sampling_rate', 'target_text'],
        num_rows: 341
    })
})

In [28]:
#sample sounds
import IPython.display as ipd
import numpy as np
import random

rand_int = random.randint(0, len(speech_datasets["train"]) - 1)  # randint is inclusive on both ends
print(speech_datasets["train"][rand_int]["target_text"])
ipd.Audio(data=np.asarray(speech_datasets["train"][rand_int]["speech"]), autoplay=True, rate=16000)

اس   کا   پروگرام   کی   تاہنگ   رومی   ہوا   ہے


Then we prepare `input_values` using the processor and `labels` from `target_text`.

In [29]:
def prepare_dataset(batch):
    # check that all files have the correct sampling rate
    assert (
        len(set(batch["sampling_rate"])) == 1
    ), f"Make sure all inputs have the same sampling rate of {processor.feature_extractor.sampling_rate}."

    batch["input_values"] = processor(batch["speech"], sampling_rate=batch["sampling_rate"][0]).input_values
    
    with processor.as_target_processor():
        batch["labels"] = processor(batch["target_text"]).input_ids
    return batch

In [30]:
prepared_datasets = speech_datasets.map(prepare_dataset, 
                                        remove_columns=speech_datasets.column_names["train"], 
                                        batch_size=16,
                                        batched=True)


In [31]:
# credentials = pd.read_csv('../data/rootkey.csv',header=None)
# aws_access_key_id = credentials.iloc[0,0].split('=')[-1]
# aws_secret_access_key = credentials.iloc[1,0].split('=')[-1]
# s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)  
# prepared_datasets.save_to_disk('s3://sagemaker-studio-g2rfihg7k9q/wav2vec2-large-xlsr-th/', fs=s3)  
# !aws s3 ls --summarize --human-readable --recursive s3://sagemaker-studio-g2rfihg7k9q/wav2vec2-large-xlsr-th/

In [32]:
# credentials = pd.read_csv('../data/rootkey.csv',header=None)
# aws_access_key_id = credentials.iloc[0,0].split('=')[-1]
# aws_secret_access_key = credentials.iloc[1,0].split('=')[-1]
# s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)  
# prepared_datasets = load_from_disk('s3://sagemaker-studio-g2rfihg7k9q/wav2vec2-large-xlsr-th/', fs=s3)

In [33]:
prepared_datasets['train']

Dataset({
    features: ['input_values', 'labels'],
    num_rows: 3156
})

## Training

### Data Collator

In [34]:
import torch

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
        max_length (:obj:`int`, `optional`):
            Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
        max_length_labels (:obj:`int`, `optional`):
            Maximum length of the ``labels`` returned list and optionally padding length (see above).
        pad_to_multiple_of (:obj:`int`, `optional`):
            If set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.5 (Volta).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                max_length=self.max_length_labels,
                pad_to_multiple_of=self.pad_to_multiple_of_labels,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

In [35]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
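
The label handling in the collator can be summarized with a small pure-Python sketch: labels are padded to the longest sequence in the batch, and the padded positions are set to `-100` so that they are ignored by the loss. Before decoding, the `-100` entries are mapped back to the pad token id, as done in `compute_metrics`. The helper names are made up, and the ids are arbitrary apart from `68`, the `[PAD]` id printed in the vocabulary above.

```python
def pad_labels(label_lists, ignore_index=-100):
    """Right-pad label id lists to the longest one using the ignore index."""
    longest = max(len(labels) for labels in label_lists)
    return [labels + [ignore_index] * (longest - len(labels)) for labels in label_lists]


def restore_pad(label_rows, pad_token_id, ignore_index=-100):
    """Map ignored positions back to the pad token id before decoding."""
    return [[pad_token_id if t == ignore_index else t for t in row] for row in label_rows]


batch_labels = pad_labels([[5, 6, 7], [8]])
print(batch_labels)                     # [[5, 6, 7], [8, -100, -100]]
print(restore_pad(batch_labels, 68))    # [[5, 6, 7], [8, 68, 68]]
```

Input audio and labels are padded separately because they have different lengths and use different padding values.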

### Metric

We use word error rate with space as word boundary. We created those spaces using `pythainlp.tokenize.word_tokenize` (2.3.1). We also use character error rate without word boundaries.
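
For readers unfamiliar with the two metrics, the sketch below computes them for a single sentence pair using the Levenshtein edit distance: WER counts word-level edits over the number of reference words, and CER counts character-level edits after removing spaces. This is an illustration only; the `wer` and `cer` metrics loaded below aggregate errors over the whole dataset and may differ in details, and the function names are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, prediction):
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / len(ref_words)


def cer(reference, prediction):
    ref_chars = reference.replace(" ", "")  # CER without word boundaries
    return edit_distance(ref_chars, prediction.replace(" ", "")) / len(ref_chars)


print(wer("the cat sat", "the hat sat down"))  # 2 edits / 3 words
print(cer("the cat sat", "the hat sat down"))  # 5 edits / 9 characters
```

Because insertions count as errors, both metrics can exceed 1.0 when the prediction is much longer than the reference.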

In [36]:
wer_metric = load_metric("wer")

In [37]:
# wer_metric.compute(predictions=['สวัสดี ค่า ทุก โคน'],references=['สวัสดี ค่ะ ทุก คน'])

In [38]:
cer_metric = load_metric('cer')

In [39]:
# cer_metric.compute(predictions=['สวัสดี ค่า ทุก โคน'],references=['สวัสดี ค่ะ ทุก คน'])

In [40]:
# cer_metric.compute(predictions=['สวัสดีค่าทุกโคน'],references=['สวัสดีค่ะทุกคน'])

In [41]:
def compute_metrics(pred, processor, metric):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = metric.compute(predictions=pred_str, references=label_str)  # use the metric passed in via partial()

    return {"wer": wer}

### Model

We use the pretrained `facebook/wav2vec2-large-xlsr-53`. The training script is `scripts/wav2vec_finetune.py`.

In [42]:
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    attention_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.0,
    mask_time_prob=0.05,
    layerdrop=0.1,
    gradient_checkpointing=True,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer)
)

Some weights of the model checkpoint at facebook/wav2vec2-large-xlsr-53 were not used when initializing Wav2Vec2ForCTC: ['project_hid.weight', 'project_q.weight', 'quantizer.weight_proj.bias', 'quantizer.codevectors', 'project_hid.bias', 'quantizer.weight_proj.weight', 'project_q.bias']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-xlsr-53 and are newly initialized: ['lm_head.weight', 'lm_head.bias']
You should probably TRAIN this model on a down-stream task to be able to u

We do not fine-tune the feature encoder (the convolutional feature extraction layers), so we freeze it.

In [43]:
model.freeze_feature_encoder()
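
For context on what the frozen feature encoder does to the input length, the sketch below computes how many output frames the convolutional stack produces for a given number of audio samples. It assumes the default `Wav2Vec2Config` kernel sizes and strides (seven layers); the model printout further down in this notebook is truncated after the fourth layer, so the last three entries are an assumption rather than something shown here.

```python
# Assumed defaults of Wav2Vec2Config (conv_kernel / conv_stride).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples):
    """Length of the feature sequence after the unpadded 1D convolutions."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


total_stride = 1
for stride in CONV_STRIDES:
    total_stride *= stride

print(num_output_frames(16000))      # 49 frames for one second of 16 kHz audio
print(total_stride / 16000 * 1000)   # 20.0 ms between consecutive frames
```

Each frame fed to the transformer and to the CTC head therefore covers a hop of about 20 ms of audio.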

In [44]:
training_args = TrainingArguments(
    output_dir="../data/wav2vec2-large-xlsr-53-ur",
    group_by_length=True,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    per_device_eval_batch_size=8,
    metric_for_best_model='wer',
    evaluation_strategy="steps",
    eval_steps=500,
    logging_strategy="steps",
    logging_steps=100,
    save_strategy="steps",
    save_steps=1000,
    num_train_epochs=60,
    fp16=True,
    learning_rate=1e-4,
    warmup_steps=1000,
    save_total_limit=3
)
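
The step counts implied by these arguments can be checked with a little arithmetic. The sketch below assumes a single GPU and the `Trainer` default linear learning-rate schedule with warmup (no `lr_scheduler_type` is set above); `linear_schedule` is a made-up helper that mirrors that schedule, not a library function.

```python
import math

num_examples = 3156   # size of the train split
batch_size = 16       # per_device_train_batch_size with one device
epochs = 60
warmup_steps = 1000
peak_lr = 1e-4

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 198 11880


def linear_schedule(step):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


for step in (500, 1000, 6440, 11880):
    print(step, linear_schedule(step))
```

The total of 11880 matches the "Total optimization steps" line in the training log, and with `save_steps=1000` the last saved checkpoint is at step 11000.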

In [45]:
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=partial(compute_metrics, metric=wer_metric, processor=processor),
    train_dataset=prepared_datasets["train"],
    eval_dataset=prepared_datasets["validation"],
    tokenizer=processor.feature_extractor,
    
)

Using amp half precision backend


In [46]:
# torch.cuda.empty_cache()

In [47]:

trainer.train()

***** Running training *****
  Num examples = 3156
  Num Epochs = 60
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 11880


| Step | Training Loss | Validation Loss | WER |
|---|---|---|---|
| 500 | 2.8252 | 3.309466 | 1.0 |
| 1000 | 2.5837 | 2.8238 | 1.0 |
| 1500 | 1.8148 | 1.421954 | 0.844166 |
| 2000 | 0.8559 | 0.552761 | 0.525232 |
| 2500 | 0.628 | 0.339862 | 0.42713 |
| 3000 | 0.5154 | 0.251664 | 0.362939 |
| 3500 | 0.4354 | 0.188251 | 0.308438 |
| 4000 | 0.3771 | 0.152009 | 0.28583 |
| 4500 | 0.3477 | 0.121236 | 0.254744 |
| 5000 | 0.3026 | 0.095221 | 0.24021 |


***** Running Evaluation *****
  Num examples = 341
  Batch size = 8
***** Running Evaluation *****
  Num examples = 341
  Batch size = 8
Saving model checkpoint to ../data/wav2vec2-large-xlsr-53-ur/checkpoint-1000
Configuration saved in ../data/wav2vec2-large-xlsr-53-ur/checkpoint-1000/config.json
Model weights saved in ../data/wav2vec2-large-xlsr-53-ur/checkpoint-1000/pytorch_model.bin
Feature extractor saved in ../data/wav2vec2-large-xlsr-53-ur/checkpoint-1000/preprocessor_config.json
***** Running Evaluation *****
  Num examples = 341
  Batch size = 8
***** Running Evaluation *****
  Num examples = 341
  Batch size = 8
Saving model checkpoint to ../data/wav2vec2-large-xlsr-53-ur/checkpoint-2000
Configuration saved in ../data/wav2vec2-large-xlsr-53-ur/checkpoint-2000/config.json
Model weights saved in ../data/wav2vec2-large-xlsr-53-ur/checkpoint-2000/pytorch_model.bin
Feature extractor saved in ../data/wav2vec2-large-xlsr-53-ur/checkpoint-2000/preprocessor_config.json
Deleting older

TrainOutput(global_step=11880, training_loss=0.819796867563267, metrics={'train_runtime': 12457.5357, 'train_samples_per_second': 15.2, 'train_steps_per_second': 0.954, 'total_flos': 2.8627340719717073e+19, 'train_loss': 0.819796867563267, 'epoch': 60.0})

## Inference and Evaluation

We load the test split, which was re-split from the official Common Voice splits in order to 1) avoid data leakage caused by the random sampling used for the official splits and 2) increase the training set size, following [ekapolc/Thai_commonvoice_split](https://github.com/ekapolc/Thai_commonvoice_split).

In [3]:
test_dataset = load_dataset("../scripts/th_common_voice_70.py", "ur", split="test")


Reusing dataset common_voice (/home/saad/.cache/huggingface/datasets/common_voice/ur/7.0.0/22f498ef791fb449d0ba8185660e0dd2cee815f02b179584bf3fa87098086dac)


In [4]:
from transformers import AutoProcessor

Load the fine-tuned model checkpoint and its processor to process the test dataset.

In [5]:
processor = AutoProcessor.from_pretrained("../data/wav2vec2-large-xlsr-53-ur/checkpoint-11000")
model = Wav2Vec2ForCTC.from_pretrained("../data/wav2vec2-large-xlsr-53-ur/checkpoint-11000")


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [25]:
" ".join(sorted(processor.tokenizer.get_vocab()))

'</s> <s> [PAD] [UNK] | \u0600 \u0601 \u0602 \u0603 ، ؍ ؎ ؏ ؐ ؑ ؒ ؓ ؔ ؕ ؛ ؟ ء آ أ ؤ ئ ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ل م ن و ً ٌ ٍ َ ُ ِ ّ ْ ٓ ٔ ٖ ٗ ٘ ٪ ٫ ٬ ٰ ٹ پ چ ڈ ڑ ژ ک گ ں ھ ہ ۂ ۃ ی ے ۓ ۔ ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹'

In [26]:
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab_dict = {k.lower(): v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}

In [27]:
from pyctcdecode import build_ctcdecoder
import kenlm
ken = "../data/model_with_lm1/language_model/urdu.bin"
# ken = "../urdu_saad.arpa"

decoder = build_ctcdecoder(
    labels=list(sorted_vocab_dict.keys()),
    kenlm_model_path=ken,  # tuned on a val set
)

Unigrams not provided and cannot be automatically determined from LM file (only arpa format). Decoding accuracy might be reduced.
Found entries of length > 1 in alphabet. This is unusual unless style is BPE, but the alphabet was not recognized as BPE type. Is this correct?
No known unigrams provided, decoding results might be a lot worse.
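
To give an intuition for why a language model helps decoding, the sketch below rescores two candidate transcripts by combining an acoustic score with a weighted language-model score and a word-count bonus. This is a toy version of shallow fusion, not pyctcdecode's actual beam search; the function name, the weights `alpha` and `beta`, and all scores are made up for illustration.

```python
def fused_score(acoustic_logp, lm_logp, num_words, alpha=0.5, beta=1.5):
    """Combine acoustic and LM log-probabilities with a word insertion bonus."""
    return acoustic_logp + alpha * lm_logp + beta * num_words


# (text, acoustic log-prob, LM log-prob); every number here is hypothetical.
hypotheses = [
    ("i scream", -4.0, -9.0),
    ("ice cream", -4.2, -5.0),
]

best_acoustic = max(hypotheses, key=lambda h: h[1])
best_fused = max(hypotheses, key=lambda h: fused_score(h[1], h[2], len(h[0].split())))

print("acoustic only:", best_acoustic[0])  # i scream
print("with LM:      ", best_fused[0])     # ice cream
```

The acoustic model alone slightly prefers the first candidate, while the language model tips the balance towards the word sequence that is more plausible as text.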


In [28]:
from transformers import Wav2Vec2ProcessorWithLM

processor= Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder
)

In [29]:
# processor.save_pretrained("../data/model_with_lm1")

In [30]:
def speech_file_to_array_fn(batch, 
                            text_col="sentence", 
                            fname_col="path",
                            resampling_to=16000):
    speech_array, sampling_rate = torchaudio.load(batch[fname_col])
    resampler=torchaudio.transforms.Resample(sampling_rate, resampling_to)
    batch["speech"] = resampler(speech_array)[0].numpy()
    batch["sampling_rate"] = resampling_to
    batch["target_text"] = batch[text_col]
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:4], sampling_rate=16_000, return_tensors="pt", padding=True)


In [31]:
model.to("cpu")

Wav2Vec2ForCTC(
  (wav2vec2): Wav2Vec2Model(
    (feature_extractor): Wav2Vec2FeatureEncoder(
      (conv_layers): ModuleList(
        (0): Wav2Vec2LayerNormConvLayer(
          (conv): Conv1d(1, 512, kernel_size=(10,), stride=(5,))
          (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation): GELUActivation()
        )
        (1): Wav2Vec2LayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,))
          (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation): GELUActivation()
        )
        (2): Wav2Vec2LayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,))
          (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation): GELUActivation()
        )
        (3): Wav2Vec2LayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,))
          (layer_norm): LayerNorm((512,), eps=1e-05, elemen

In [32]:
len(test_dataset)


4

In [21]:
import torch
with torch.no_grad():
    logits = model(inputs.input_values,).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(logits.numpy()).text)
# print("Reference:", test_dataset["sentence"][:3])

Prediction: ['سبندر میں اس امید سے گودے کہ جلد مدد مل جائی کی مگر مدد ملنے تک دو پاکستانی لا پتا تھے', 'نہیں یہ ہماری بصریت ہے اور نیازی دیکھ رہا ہے', 'طیفا پیش کر کے حاقد کیا خان کہیں گے تو سیاست میں نظر نہیں آوں گا', 'ان میں جنگ ایک اہم موڑ پر ہے لیکن پہلی بار دوری میں ہونے والے عمل مذاکرات میں ایک باریک روح کی امید پیدا ہوئی ہے پاکستان کے وزیراعظم عمران خان نبی جب صدر یوکرین کے صدر سے فون پر بات کی عما کی بات کی لیکن کیا صدوقان روس اور یوکرین کو جنگ ختم کرنے کے لیے راضی کر سکیں گے']


In [33]:
import torch
with torch.no_grad():
    logits = model(inputs.input_values,).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(logits.numpy()).text)
# print("Reference:", test_dataset["sentence"][:3])

Prediction: ['سبندر میں اس امید سے گودے کہ جلد مدد مل جائی کی مگر مدد ملنے تک دو پاکستانی لا پتا تھے', 'نہیں یہ ہماری بصریت ہے اور نیازی دیکھ رہا ہے', 'طیفا پیش کر کے حاقد کیا خان کہیں گے تو سیاست میں نظر نہیں آوں گا', 'ن میں جنگ ایک اہم موڑ پر ہے لیکن پہلی بار دوری میں ہونے والے عمل مذاکرات میں ایک باریک روح کی امید پیدا ہوئی ہے پاکستان کے وزیراعظم عمران خان نبی جب صدر یوکرین کے صدر سے فون پر بات کی عما کی بات کی لیکن کیا صدرغان روس اور یوکرین کو جنگ ختم کرنے کے لیے راضی کر سکیں گے']


### Infer a few examples

In [None]:
import torch
with torch.no_grad():
    logits = model(inputs.input_values,).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(logits.numpy()).text)
# print("Reference:", test_dataset["sentence"][:3])

Prediction: ['گری ایکسپیکٹیشن بچا دکھنزتہری اور اور پی کا سید عرفان علی بجلی جیو ڈر یو ڈور سی عرفان علی ڈار کا میں اپنی امی ابو اور پان چھوٹے بھائیوں سے ملنے آیا جج کے روس تا میں جہاں وہ سب دن تھے بہاں کا ما اور خوف نا تھا آسمان کے با دنوں کی لا سے ان تھی اجر کے بی ھے موجود یا رام کا دو تھا پھیلا ہوا دلدلی لا ا مک مل خام موچی میرے امی ابو اور ان کے گرد پانچ چھوٹی چھوٹی میرے بھائیوں کی قبریں دل خوف زدہ ہوا اور میں رونے لگااسی وقت خبر کے در ان سے ایک شخص نے مجھے سختی سے پکڑ لیا چھٹی شیطان اپنی آواز بند کرو بنا اپنا گلا کٹوا لو گے مجھ سے اس شخص کی حالت بے حد خراب تھی ہٹے پرانے کپڑے جن پر کی چڑ لگتی تھی بی شد زد کر نے والی ان تھیں کا تھا ہوا جس چہرے پر سختی میں گلا ک نے کے خوف سے ڈر کر بولا مجھے چھوڑ دے مجھے نہ آ ہی برا نام کیا ہے جن میں ھو اس نے مجھے پکڑ کر الٹا کیا اور جب وہ میں ہار ڈالا جہاں سے صرف ڈبل روٹی کا ایک چھکڑا ملا جو اس نے بے صبری سے کھا لیا مجوزہ ڈر کے بی و رہا تیرے گال جو موٹے تازے ہیں انہیں کھا جاؤں میں نے اس قبر کے بد تھر کو جہاں بیٹھا تھا سختی سے پکڑ لیا تاکہ وہ آدمی میرے

### Evaluate on test set

We evaluate the test set on WER with PyThaiNLP 2.3.1 word boundaries and CER without spaces.

In [None]:
def evaluate(batch):
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to(device),).logits

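    # `processor` is a Wav2Vec2ProcessorWithLM at this point, so it decodes the
    # logits directly (as in the cells above) rather than argmax token ids
    batch["pred_sentence"] = processor.batch_decode(logits.cpu().numpy()).text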
    return batch

wer_metric = load_metric("wer")
cer_metric = load_metric("cer")

In [None]:
model.to("cuda")

result = test_dataset.map(evaluate, batched=True, batch_size=8)



  0%|          | 0/43 [00:00<?, ?ba/s]


In [None]:
result_df = pd.DataFrame({'sentence':result['sentence'], 
                           'pred_sentence_tok': result['pred_sentence']})
result_df['sentence_tok'] = result_df.sentence.map(lambda x: ' '.join(word_tokenize(x)))
result_df['pred_sentence'] = result_df.pred_sentence_tok.map(lambda x: ' '.join(x.split()))
#change tokenization to fit pythainlp tokenization
result_df['pred_sentence_tok'] = result_df.pred_sentence.map(lambda x: ' '.join(word_tokenize(x)))





result_df.to_csv('../data/result_cv70.csv',index=False)

In [None]:
wer_metric.compute(predictions=result_df.pred_sentence_tok,references=result_df.sentence_tok)

0.4642857142857143

In [None]:
cer_metric.compute(predictions=result_df.pred_sentence,references=result_df.sentence)

0.18461538461538463

In [None]:
#wer


AttributeError: 'DataFrame' object has no attribute 'pred_sentence_tok'

In [None]:
#cer


0.028130193905817176

We can further improve the results by applying spell correction using n-grams from [TNC](http://www.arts.chula.ac.th/ling/tnc/).

In [None]:
# #install pre version of pythainlp to use; will be available in PyThaiNLP 3.0
# %pip uninstall pythainlp --yes
# %pip install --ignore-requires-python  https://github.com/PyThaiNLP/pythainlp/archive/dev.zip
# %pip install symspellpy

In [None]:
#wer


0.1799639686012096

In [None]:
#cer


0.05225761772853186