<a href="https://colab.research.google.com/github/vasudevgupta7/gsoc-wav2vec2/blob/export-v2/notebooks/wav2vec2_saved_model_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to train a TensorFlow SavedModel with an extra head

In this notebook, we will load the pre-trained wav2vec2 model from [TFHub](https://tfhub.dev) and train it on the [LibriSpeech dataset](https://huggingface.co/datasets/librispeech_asr) by appending a language modeling (LM) head on top of the pre-trained model.

## Setting Up

Before diving in, let's check which GPU we got using `nvidia-smi`.

In [1]:
!nvidia-smi

Wed Jul 21 03:22:56 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

The following cell will clone my code repository ([`gsoc-wav2vec2`](https://github.com/vasudevgupta7/gsoc-wav2vec2)) and will install all the dependencies.

In [2]:
!git clone https://github.com/vasudevgupta7/gsoc-wav2vec2 --branch=export-v2

import sys
import os

os.chdir("gsoc-wav2vec2")
sys.path.append("src")

!pip3 install -qe .

Cloning into 'gsoc-wav2vec2'...
remote: Enumerating objects: 400, done.
remote: Counting objects: 100% (400/400), done.
remote: Compressing objects: 100% (255/255), done.
remote: Total 400 (delta 213), reused 309 (delta 135), pack-reused 0
Receiving objects: 100% (400/400), 2.88 MiB | 26.34 MiB/s, done.
Resolving deltas: 100% (213/213), done.
  Building wheel for python-Levenshtein (setup.py) ... done
  Building wheel for subprocess32 (setup.py) ... done
  Building wheel fo

In [3]:
# This cell will be removed after the model gets exported to TFHub
!wget https://huggingface.co/vasudevgupta/tf-wav2vec2-base/resolve/main/wav2vec2-base.tar.gz
!tar -xf wav2vec2-base.tar.gz

--2021-07-21 03:23:11--  https://huggingface.co/vasudevgupta/tf-wav2vec2-base/resolve/main/wav2vec2-base.tar.gz
Resolving huggingface.co (huggingface.co)... 15.197.130.34
Connecting to huggingface.co (huggingface.co)|15.197.130.34|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/vasudevgupta/tf-wav2vec2-base/6dc0b96ec02586bcf863f5368b4c2a9ef05d924625c7e16708eec037577bb072 [following]
--2021-07-21 03:23:11--  https://cdn-lfs.huggingface.co/vasudevgupta/tf-wav2vec2-base/6dc0b96ec02586bcf863f5368b4c2a9ef05d924625c7e16708eec037577bb072
Resolving cdn-lfs.huggingface.co (cdn-lfs.huggingface.co)... 54.239.152.94, 54.239.152.92, 54.239.152.18, ...
Connecting to cdn-lfs.huggingface.co (cdn-lfs.huggingface.co)|54.239.152.94|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 224929720 (215M) [application/x-gzip]
Saving to: ‘wav2vec2-base.tar.gz’


2021-07-21 03:23:18 (32.0 MB/s) - ‘wav2vec2-base.tar.gz’ saved [224

## Model setup using `TFHub`

We will start by importing all the important libraries & modules.

In [4]:
import tensorflow as tf
import tensorflow_hub as hub

from wav2vec2 import Wav2Vec2Config

config = Wav2Vec2Config()

We will load the pre-trained saved-model with [`hub.load(...)`](https://www.tensorflow.org/hub/api_docs/python/hub/load). When given a TFHub handle, it first downloads the pre-trained model and then calls [`tf.saved_model.load(...)`](https://www.tensorflow.org/api_docs/python/tf/saved_model/load) on the downloaded files. For now, the cell below loads from the local `saved-model` directory extracted earlier.

In [5]:
# TODO: update it to load from TFHub later
loaded = hub.load("saved-model")
print("Available signatures are:", list(loaded.signatures.keys()))

Available signatures are: ['wav2vec2']


We can see the available model signatures above. These signatures were passed while saving the model with [`tf.saved_model.save(...)`](https://www.tensorflow.org/api_docs/python/tf/saved_model/save); you can refer to this [script](https://github.com/vasudevgupta7/gsoc-wav2vec2/blob/main/src/export2hub.py) for details. We will be using the `wav2vec2` signature in this notebook.

First, we will wrap our model signature with [`hub.KerasLayer`](https://www.tensorflow.org/hub/api_docs/python/hub/KerasLayer) so that we can use this model like any other Keras layer.

In [6]:
pretrained_layer = hub.KerasLayer(loaded.signatures["wav2vec2"], trainable=False)

The `pretrained_layer` object is the frozen version of [`Wav2Vec2Model`](https://github.com/vasudevgupta7/gsoc-wav2vec2/blob/main/src/wav2vec2/modeling.py). The pre-trained weights were converted from the HuggingFace PyTorch [pre-trained weights](https://huggingface.co/facebook/wav2vec2-base) using [this script](https://github.com/vasudevgupta7/gsoc-wav2vec2/blob/main/src/convert_torch_to_tf.py).

Originally, wav2vec2 was pre-trained with a masked language modelling approach, with the objective of identifying the true quantized latent speech representation for a masked time step. You can read more about the training objective in the paper: [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477).

Now, we will define a few constants and hyper-parameters which will be useful in the next few cells. `AUDIO_MAXLEN` is intentionally set to `246000`, as the model signature only accepts a static sequence length of `246000`.

In [7]:
AUDIO_MAXLEN = 246000
LABEL_MAXLEN = 256
BATCH_SIZE = 2
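
To get a feel for what a fixed input length of `246000` samples means for the model output, here is a rough sketch that computes how many time steps come out of the convolutional feature encoder. The kernel sizes and strides below are an assumption (the values commonly listed for the wav2vec2-base architecture); they are not stated in this notebook, so treat the result as a sanity check rather than a specification.

```python
# Assumed wav2vec2-base feature-encoder configuration (not stated in this notebook).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)

def num_output_frames(num_samples):
    """Length of the time axis after a stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length

frames = num_output_frames(246000)
print(frames)  # 768
```

Under these assumptions, the number of frames matches the time dimension that `model.summary()` reports for this notebook's model, `(None, 768, 32)`.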

In the following cell, we will combine `pretrained_layer` and a dense layer (the LM head) using the [Keras Functional API](https://www.tensorflow.org/guide/keras/functional).

In [8]:
inputs = tf.keras.Input(shape=(AUDIO_MAXLEN,))
hidden_states = pretrained_layer(inputs)["output_0"]
outputs = tf.keras.layers.Dense(config.vocab_size)(hidden_states)

model = tf.keras.Model(inputs=inputs, outputs=outputs)

The dense layer (defined above) has an output dimension of `vocab_size`, as we want to predict a score for each token in the vocabulary at each time step.

## Setting up training state

Alright, let's define our training forward pass by calling the model with `training=True` and wrapping it with `tf.function(...)`. Wrapping it with `tf.function(...)` is important for getting the performance benefits of graph execution during training.

Additionally, we will pass `jit_compile=True` so that XLA compiles our model graph for the accelerator (i.e. GPUs/TPUs) and fuses many operations, which usually improves performance out of the box.

In [9]:
@tf.function(jit_compile=True)
def forward(batch):
    return model(batch, training=True)

In TensorFlow, model weights are built only when `model.__call__` is called for the first time, so the following cell will build the model weights for us. Then we will run `model.summary()` to check the total number of trainable parameters.

In [10]:
forward(tf.random.uniform(shape=(BATCH_SIZE, AUDIO_MAXLEN)))
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 246000)]          0         
_________________________________________________________________
keras_layer (KerasLayer)     {'output_0': (None, 768,  94370944  
_________________________________________________________________
dense (Dense)                (None, 768, 32)           24608     
Total params: 94,395,552
Trainable params: 24,608
Non-trainable params: 94,370,944
_________________________________________________________________
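
The trainable parameter count in the summary can be checked by hand. The sketch below assumes that the hidden states feeding the dense layer have width `768` (the usual hidden size of the wav2vec2-base architecture, not stated in this notebook) and takes the vocabulary size `32` from the last dimension of the dense layer's output shape.

```python
hidden_size = 768  # assumed width of the wav2vec2-base hidden states
vocab_size = 32    # last dimension of the dense layer's output in the summary

# A dense layer holds one kernel of shape (hidden_size, vocab_size) plus one bias vector.
dense_params = hidden_size * vocab_size + vocab_size
print(dense_params)  # 24608

# The frozen pre-trained layer contributes the remaining (non-trainable) parameters.
total_params = 94370944 + dense_params
print(total_params)  # 94395552
```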


Now, we need to define `loss_fn` and an optimizer to be able to train the model. The following cell will do that for us. We will be using the `Adam` optimizer for simplicity. `CTCLoss` is a very common loss type for tasks (like `ASR`) where sub-parts of the input can't be easily aligned with sub-parts of the output. You can read more about CTC loss in this amazing [blog post](https://distill.pub/2017/ctc/).

`CTCLoss` (from the [`gsoc-wav2vec2`](https://github.com/vasudevgupta7/gsoc-wav2vec2) package) accepts 3 arguments: `config`, `model_input_shape` & `division_factor`. If `division_factor=1`, the loss is simply summed over the batch, so set `division_factor` to the batch size to get the mean over the batch.

In [11]:
from wav2vec2 import CTCLoss

LEARNING_RATE = 1e-5

loss_fn = CTCLoss(config, (BATCH_SIZE, AUDIO_MAXLEN), division_factor=BATCH_SIZE)
optimizer = tf.keras.optimizers.Adam(LEARNING_RATE)
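
The effect of `division_factor` can be illustrated with plain numbers. This is only a sketch of the reduction described above, not the actual `CTCLoss` implementation; the helper `reduce_loss` and the per-example loss values are made up for illustration.

```python
def reduce_loss(per_example_losses, division_factor=1):
    """Sum the per-example losses, then divide by `division_factor`."""
    return sum(per_example_losses) / division_factor

losses = [12.0, 8.0]                     # hypothetical per-example CTC losses
summed = reduce_loss(losses)             # division_factor=1 gives the plain sum
mean = reduce_loss(losses, len(losses))  # division_factor=batch size gives the mean
print(summed, mean)  # 20.0 10.0
```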

## Loading & Pre-processing data

Let's now download the LibriSpeech dataset from the [official website](http://www.openslr.org/12) and set it up.

In [12]:
!wget https://www.openslr.org/resources/12/dev-clean.tar.gz -P ./data/train/
!tar -xf ./data/train/dev-clean.tar.gz -C ./data/train/

--2021-07-21 03:24:25--  https://www.openslr.org/resources/12/dev-clean.tar.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 337926286 (322M) [application/x-gzip]
Saving to: ‘./data/train/dev-clean.tar.gz’


2021-07-21 03:24:36 (29.9 MB/s) - ‘./data/train/dev-clean.tar.gz’ saved [337926286/337926286]



**Note:** We are using the `dev-clean` subset because this notebook is just for demonstration purposes and only needs a small amount of data.

In [13]:
ls ./data/train/

dev-clean.tar.gz  LibriSpeech/


Our dataset lies in the `LibriSpeech` directory. Let's narrow down further and choose a sub-directory to look at a few files.

In [14]:
data_dir = "./data/train/LibriSpeech/dev-clean/2428/83705/"
all_files = os.listdir(data_dir)

flac_files = [f for f in all_files if f.endswith(".flac")]
txt_files = [f for f in all_files if f.endswith(".txt")]

print("Transcription files:", txt_files, "\nSound files:", flac_files)

Transcription files: ['2428-83705.trans.txt'] 
Sound files: ['2428-83705-0022.flac', '2428-83705-0013.flac', '2428-83705-0005.flac', '2428-83705-0020.flac', '2428-83705-0024.flac', '2428-83705-0037.flac', '2428-83705-0012.flac', '2428-83705-0009.flac', '2428-83705-0036.flac', '2428-83705-0028.flac', '2428-83705-0006.flac', '2428-83705-0035.flac', '2428-83705-0034.flac', '2428-83705-0043.flac', '2428-83705-0017.flac', '2428-83705-0021.flac', '2428-83705-0033.flac', '2428-83705-0040.flac', '2428-83705-0010.flac', '2428-83705-0008.flac', '2428-83705-0016.flac', '2428-83705-0004.flac', '2428-83705-0029.flac', '2428-83705-0007.flac', '2428-83705-0026.flac', '2428-83705-0001.flac', '2428-83705-0025.flac', '2428-83705-0018.flac', '2428-83705-0030.flac', '2428-83705-0002.flac', '2428-83705-0015.flac', '2428-83705-0039.flac', '2428-83705-0000.flac', '2428-83705-0019.flac', '2428-83705-0038.flac', '2428-83705-0027.flac', '2428-83705-0032.flac', '2428-83705-0031.flac', '2428-83705-0011.flac', '24

Alright, so each sub-directory contains many `.flac` files and a single `.txt` file. The `.txt` file holds the text transcriptions for all the speech samples (i.e. `.flac` files) present in that sub-directory.

In the following cell, we will define a function for loading the text data into memory and formatting it.

In [15]:
def read_txt_file(path):
  with open(path, "r") as f:
    # Each line looks like "<utterance-id> <transcription words ...>".
    samples = f.read().split("\n")
    samples = {s.split()[0]: " ".join(s.split()[1:]) for s in samples if len(s.split()) > 2}
  return samples
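
To make the transcript format concrete, here is a small sketch that applies the same parsing rule as `read_txt_file` to an in-memory string. The helper name `parse_transcript`, the utterance ids and the sentences are all made up for illustration.

```python
def parse_transcript(text):
    """Map each utterance id to its transcription (same rule as `read_txt_file`)."""
    samples = {}
    for line in text.split("\n"):
        parts = line.split()
        if len(parts) > 2:  # skip empty or too-short lines
            samples[parts[0]] = " ".join(parts[1:])
    return samples

# Hypothetical file contents: one "<utterance-id> <words ...>" entry per line.
demo = "0000-00000-0000 HELLO THERE WORLD\n0000-00000-0001 GOOD MORNING TO YOU\n"
parsed = parse_transcript(demo)
print(parsed)
```

The keys of the resulting dictionary have the same form as the `.flac` file names without their extension, which is what lets speech and text be paired up later.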

Similarly, we will define a function for loading a speech sample from a `.flac` file.

`REQUIRED_SAMPLE_RATE` is set to `16000` because wav2vec2 was pre-trained on audio sampled at 16 kHz, and it's recommended to train it further without any major shift in the data distribution caused by a different sample rate.

In [16]:
import soundfile as sf

REQUIRED_SAMPLE_RATE = 16000

def read_flac_file(file_path):
  with open(file_path, "rb") as f:
    audio, sample_rate = sf.read(f)
  if sample_rate != REQUIRED_SAMPLE_RATE:
    raise ValueError(
        f"sample rate (={sample_rate}) of your files must be {REQUIRED_SAMPLE_RATE}"
    )
  file_id = os.path.split(file_path)[-1][:-len(".flac")]
  return {file_id: audio}
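
Since every file is sampled at `16000` Hz, the number of samples maps directly to a duration. As a quick sketch (the helper name `duration_seconds` is made up for illustration), the `246000`-sample limit used by the model corresponds to a little over 15 seconds of audio:

```python
SAMPLE_RATE = 16000  # samples per second, as required above

def duration_seconds(num_samples, sample_rate=SAMPLE_RATE):
    """Convert a number of audio samples into seconds."""
    return num_samples / sample_rate

max_duration = duration_seconds(246000)
print(max_duration)  # 15.375
```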

Next, we will define a function (in the next cell) that pairs every speech sample with its text transcription.

In [17]:
def fetch_sound_text_mapping(data_dir):
  all_files = os.listdir(data_dir)

  flac_files = [os.path.join(data_dir, f) for f in all_files if f.endswith(".flac")]
  txt_files = [os.path.join(data_dir, f) for f in all_files if f.endswith(".txt")]

  txt_samples = {}
  for f in txt_files:
    txt_samples.update(read_txt_file(f))

  speech_samples = {}
  for f in flac_files:
    speech_samples.update(read_flac_file(f))

  assert len(txt_samples) == len(speech_samples)

  samples = [(txt_samples[file_id], speech_samples[file_id]) for file_id in speech_samples.keys()]
  return samples

It's time to have a look at a few samples ...

In [18]:
samples = fetch_sound_text_mapping(data_dir)
samples[:5]

[("A BIRD IN THE HAND IS WORTH TWO IN A BUSH' AND IT WILL BE SOMETHING TO HAVE BY US",
  array([0.00085449, 0.00073242, 0.0005188 , ..., 0.00048828, 0.00054932,
         0.0005188 ])),
 ('IT IS FROM HER ACTION IN THAT MATTER THAT MY SUSPICION SPRINGS',
  array([-0.0007019 , -0.00057983, -0.00033569, ..., -0.00021362,
         -0.00015259, -0.00012207])),
 ('FOR INSTANCE LOOK AT THEIR BEHAVIOUR IN THE MATTER OF THE RING',
  array([-0.00201416, -0.0022583 , -0.00234985, ...,  0.00137329,
          0.0012207 ,  0.00109863])),
 ('IT WAS PLAIN THAT TOGETHER WE SHOULD MANAGE MOST COMFORTABLY DELIGHTFULLY IN FACT',
  array([-9.15527344e-05,  9.15527344e-05, -1.83105469e-04, ...,
         -5.79833984e-04, -4.88281250e-04, -3.96728516e-04])),
 ('AND I WILL SEE THAT THERE IS NO SHIRKING ABOUT THE BOYS OR ABOUT THE GIRLS EITHER',
  array([-1.49536133e-03, -9.76562500e-04, -3.05175781e-05, ...,
         -2.13623047e-04, -2.74658203e-04, -1.83105469e-04]))]

Let's pre-process the data now !!!

We will first define the tokenizer & processor using the `gsoc-wav2vec2` package. Then, we will do very simple pre-processing: `processor` will normalize the speech over the time axis, and `tokenizer` will convert the text into token ids.

In [19]:
from wav2vec2 import Wav2Vec2Processor
tokenizer = Wav2Vec2Processor(is_tokenizer=True)
processor = Wav2Vec2Processor(is_tokenizer=False)

def preprocess_text(text):
  label = tokenizer(text)
  return tf.constant(label, dtype=tf.int32)

def preprocess_speech(audio):
  audio = tf.constant(audio, dtype=tf.float32)
  return processor(tf.transpose(audio))

Downloading `vocab.json` from https://github.com/vasudevgupta7/gsoc-wav2vec2/raw/main/data/vocab.json ... DONE
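
Normalizing speech over the time axis can be pictured as shifting the waveform to zero mean and scaling it to roughly unit variance. The sketch below is an assumption about what such a step typically looks like, written in plain Python; it is not the actual `Wav2Vec2Processor` code, and the `eps` value and the sample values are made up.

```python
import math

def normalize(samples, eps=1e-5):
    """Scale a waveform to zero mean and (approximately) unit variance."""
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    # `eps` guards against division by zero for silent (all-constant) audio.
    return [(x - mean) / math.sqrt(variance + eps) for x in samples]

wave = [0.1, -0.2, 0.05, 0.3, -0.25]  # made-up audio samples
normalized = normalize(wave)
print(normalized)
```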


Now, we will define a Python generator that calls the preprocessing functions we defined in the cells above.

In [20]:
def inputs_generator(samples):
  for text, speech in samples:
    yield preprocess_text(text), preprocess_speech(speech)

In [21]:
from functools import partial
generator = partial(inputs_generator, samples=samples)
next(iter(generator()))

(<tf.Tensor: shape=(81,), dtype=int32, numpy=
 array([ 7,  4, 24, 10, 13, 14,  4, 10,  9,  4,  6, 11,  5,  4, 11,  7,  9,
        14,  4, 10, 12,  4, 18,  8, 13,  6, 11,  4,  6, 18,  8,  4, 10,  9,
         4,  7,  4, 24, 16, 12, 11, 27,  4,  7,  9, 14,  4, 10,  6,  4, 18,
        10, 15, 15,  4, 24,  5,  4, 12,  8, 17,  5,  6, 11, 10,  9, 21,  4,
         6,  8,  4, 11,  7, 25,  5,  4, 24, 22,  4, 16, 12], dtype=int32)>,
 <tf.Tensor: shape=(88400,), dtype=float32, numpy=
 array([0.01726047, 0.01499128, 0.0110202 , ..., 0.01045291, 0.0115875 ,
        0.0110202 ], dtype=float32)>)

## Setting up `tf.data.Dataset`

The following cell will set up a `tf.data.Dataset` object using its `.from_generator(...)` method. We will be using the `generator` object we defined in the cell above.

**Note:** For distributed training (especially on TPUs), `.from_generator(...)` doesn't work currently & it is recommended to train on data stored in `.tfrecord` format. You can refer to [this script](https://github.com/vasudevgupta7/gsoc-wav2vec2/blob/main/src/make_tfrecords.py) for more details on how to convert LibriSpeech data into tfrecords.

In [22]:
output_signature = (
    tf.TensorSpec(shape=(None,), dtype=tf.int32),    # 1-D labels of variable length
    tf.TensorSpec(shape=(None,), dtype=tf.float32),  # 1-D speech of variable length
)
dataset = tf.data.Dataset.from_generator(generator, output_signature=output_signature)

Let's shuffle the dataset using the `.shuffle(...)` method. The buffer size argument makes the shuffling approximate: only that many elements are held in memory at a time, because the complete dataset often can't fit into memory for a full shuffle (e.g. the complete LibriSpeech tfrecords take around 250 GB on disk).

In [23]:
BUFFER_SIZE = len(flac_files)
SEED = 42

dataset = dataset.shuffle(BUFFER_SIZE, seed=SEED)
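
To see why a shuffle buffer only gives approximate shuffling, here is a minimal pure-Python sketch of the idea. It is an illustration under simplifying assumptions, not the `tf.data` implementation, and `buffered_shuffle` is a made-up name.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a randomized order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
            continue
        # Emit a random buffered element and put the new item in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=3, seed=42))
print(shuffled)
```

An element can only move as far forward as the buffer allows, so a small buffer keeps the output close to the input order, while a buffer as large as the dataset (as in the cell above) gives a full shuffle.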

We will feed the dataset to the model in batches, so let's prepare the batches in the following cell. All the sequences in a batch must be padded to a constant length, and we will use the `.padded_batch(...)` method for that purpose. We also need to truncate the sequences to a maximum length, as some of them are very long.

In [24]:
dataset = dataset.map(lambda labels, speech: (labels[: LABEL_MAXLEN], speech[: AUDIO_MAXLEN]))
dataset = dataset.padded_batch(BATCH_SIZE, padded_shapes=(LABEL_MAXLEN, AUDIO_MAXLEN), padding_values=(0, 0.))
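
The truncate-then-pad logic of the cell above can be sketched on plain Python lists. The helper `pad_or_truncate` and the toy sequences are made up for illustration; the real pipeline operates on tensors with `LABEL_MAXLEN` and `AUDIO_MAXLEN`.

```python
def pad_or_truncate(sequence, max_len, pad_value=0):
    """Clip `sequence` to `max_len` items, then right-pad it with `pad_value`."""
    clipped = list(sequence[:max_len])
    return clipped + [pad_value] * (max_len - len(clipped))

# One sequence shorter than the limit, one longer.
batch = [pad_or_truncate(seq, 4) for seq in ([7, 4, 24], [1, 2, 3, 4, 5, 6])]
print(batch)  # [[7, 4, 24, 0], [1, 2, 3, 4]]
```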

Accelerators (like GPUs/TPUs) are very fast and often data-loading (& pre-processing) becomes the bottleneck during training as the data-loading part happens on CPUs. This can increase the training time significantly especially when there is a lot of online pre-processing involved or data is streamed online from GCS buckets. To handle those issues, `tf.data.Dataset` offers the `.prefetch(...)` method. This method helps in preparing the next few batches in parallel (on CPUs) while the model is making predictions (on GPUs/TPUs) on the current batch.

In [25]:
dataset = dataset.prefetch(tf.data.AUTOTUNE)
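
The idea behind prefetching can be sketched with a background thread and a bounded queue: the producer prepares upcoming items while the consumer is busy with the current one. This is a conceptual illustration only, not how `tf.data` is implemented, and the function name `prefetch` is made up.

```python
import queue
import threading

def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable` while a background thread loads ahead."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item

batches = list(prefetch(range(5)))
print(batches)  # [0, 1, 2, 3, 4]
```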

Since this notebook is made for demonstration purposes, we will take only the first `num_batches` batches and train on those. You are encouraged to train on the whole dataset though.

In [26]:
num_batches = 16
dataset = dataset.take(num_batches)

## Training

Let's define our `train_step` function now. There are 3 main steps in `train_step`: 
1. forward pass with variables tracking
2. backward pass for calculating gradients
3. variables update to minimize training loss

All the trainable variables used inside the scope of `tf.GradientTape(...)` are tracked during the forward pass. Then `.gradient(...)` computes the gradient of the loss with respect to those tracked variables, and `.apply_gradients(...)` updates the trainable variables using the `optimizer` defined above.

In [27]:
@tf.function
def train_step(speech, labels):
    with tf.GradientTape() as gtape:
        speech = forward(speech)
        loss = loss_fn(labels, speech)
    grads = gtape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
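
The same three steps can be written out by hand for a one-parameter model, which may help to see what `tf.GradientTape` and the optimizer are automating. This sketch uses plain gradient descent (not Adam) on a made-up squared-error problem; `sgd_step` and all numbers are hypothetical.

```python
def sgd_step(weight, x, y, learning_rate=0.1):
    prediction = weight * x                  # 1. forward pass
    loss = (prediction - y) ** 2
    grad = 2 * (prediction - y) * x          # 2. backward pass: d(loss)/d(weight)
    weight = weight - learning_rate * grad   # 3. update to reduce the loss
    return weight, loss

weight = 0.0
losses = []
for _ in range(20):
    weight, loss = sgd_step(weight, x=1.0, y=3.0)
    losses.append(loss)

print(weight, losses[0], losses[-1])
```

With these settings the weight moves towards `3.0` and the loss shrinks at every step.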

Let's kick start training finally !!!

We will iterate over our dataset (instance of `tf.data.Dataset`) and each batch will be fed to `train_step(...)` for calculating loss, gradients & updating parameters.

In [28]:
from tqdm.auto import tqdm
EPOCHS = 10

pbar = tqdm(range(EPOCHS), total=EPOCHS)
for e in pbar:
  running_loss, steps = tf.constant(0.), 0
  for labels, speech in dataset:
      loss = train_step(speech, labels)
      running_loss += loss
      steps += 1
  pbar.set_postfix(tr_loss=running_loss.numpy().item()/steps, epoch=e)


Instructions for updating:
Prefer tf.tensor_scatter_nd_add, which offers the same functionality with well-defined read-write semantics.
Instructions for updating:
Prefer tf.tensor_scatter_nd_update, which offers the same functionality with well-defined read-write semantics.



## Evaluation

Let's compute the loss over the validation dataset using `eval_step(...)`, defined in the following cell.

In [29]:
@tf.function(jit_compile=True)
def eval_fwd(batch):
  return model(batch, training=False)

@tf.function
def eval_step(speech, labels):
    speech = eval_fwd(speech)
    loss = loss_fn(labels, speech)
    return loss, tf.argmax(speech, axis=-1)

We need to compute `WER` (word error rate) over our validation data. We will use `load_metric(...)` function from [HuggingFace datasets](https://huggingface.co/docs/datasets/) library. Let's first install the `datasets` library using `pip` and then define the `metric` object.

In [30]:
!pip3 install -q datasets

from datasets import load_metric
metric = load_metric("wer")





It's time to run the evaluation on validation data now.

In [31]:
pbar = tqdm(dataset, total=num_batches)
for labels, speech in pbar:
    loss, predictions = eval_step(speech, labels)
    pbar.set_postfix(val_loss=loss.numpy().item())
    predictions = [tokenizer.decode(pred) for pred in predictions.numpy().tolist()]
    references = [tokenizer.decode(label, group_tokens=False) for label in labels.numpy().tolist()]
    metric.add_batch(references=references, predictions=predictions)





We are using the `tokenizer.decode(...)` method for decoding our predictions and labels back into the text and will add them to the metric for `WER` computation later.
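
To illustrate why predictions are decoded with token grouping while labels are decoded with `group_tokens=False`, here is a minimal sketch of greedy CTC decoding. The toy vocabulary and the choice of the padding id as the CTC blank are assumptions made for this example; the real mapping comes from `vocab.json`, and this is not the package's `decode` implementation.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one is loaded from `vocab.json`.
ID_TO_CHAR = {0: "<pad>", 1: " ", 2: "a", 3: "c", 4: "t"}
BLANK_ID = 0  # assumption: the padding token doubles as the CTC blank

def ctc_greedy_decode(ids, group_tokens=True):
    """Optionally collapse repeated ids, then drop blanks and map ids to text."""
    if group_tokens:
        ids = [token for token, _ in groupby(ids)]
    return "".join(ID_TO_CHAR[i] for i in ids if i != BLANK_ID)

frame_ids = [3, 3, 0, 2, 2, 2, 0, 4, 4]  # per-frame argmax output of a model
label_ids = [3, 2, 4, 4, 0, 0]           # padded reference with a genuine double letter

print(ctc_greedy_decode(frame_ids))                      # cat
print(ctc_greedy_decode(label_ids, group_tokens=False))  # catt
```

Collapsing repeats is right for frame-level predictions, where one character spans several frames, but it would wrongly merge genuine double letters in the reference labels.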

**Note:** We are using the same dataset just for demonstration purposes. In general, we should use separate data (generally called `validation/dev` data) sampled before initiating training.

`metric.compute()` will calculate the final WER score over all the batches added in the previous cell.

In [32]:
metric.compute()

0.998046875

The metric value here is not meaningful, as the model was trained on very little data, and ASR-like tasks often require a very large amount of data to learn a mapping from speech to text. You should train on much more data to get good results. This notebook is just for showing the workflow of training a saved model.
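
For reference, WER is the word-level edit distance (substitutions, insertions and deletions) divided by the number of words in the reference. The sketch below computes it for a single sentence pair; it only illustrates the definition, is not the `datasets` implementation (which aggregates over all added batches), and the example sentences are made up.

```python
def word_error_rate(reference, prediction):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), prediction.split()
    # previous[j]: edit distance between the reference words seen so far
    # (initially none) and the first j predicted words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("the cat sat", "the dog sat"))  # one substitution out of three words
```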

We have reached the end of this notebook, but not the end of learning TensorFlow for speech-related tasks: this [repository](https://github.com/vasudevgupta7/gsoc-wav2vec2) contains some more amazing tutorials. Feel free to go through them. You can also refer to [this repository](https://github.com/tulasiram58827/TTS_TFLite) for more tutorials on speech-related tasks. If you encounter any bug in this notebook, please create an issue [here](https://github.com/vasudevgupta7/gsoc-wav2vec2/issues).