<a href="https://colab.research.google.com/github/vasudevgupta7/gsoc-wav2vec2/blob/vg/notebooks/librispeech-evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wav2Vec2 inference on LibriSpeech dataset

In this notebook, we will evaluate the TensorFlow Wav2Vec2 model using the checkpoint fine-tuned on 960 hours of the LibriSpeech dataset.

In [1]:
!nvidia-smi

Sun Jun 27 03:43:40 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Let's start with the basic setup and install the `wav2vec2` package from this [repository](https://github.com/vasudevgupta7/gsoc-wav2vec2).

In [2]:
%%capture
!git clone https://github.com/vasudevgupta7/gsoc-wav2vec2 --branch=vg && cd gsoc-wav2vec2 && pip3 install .

import os
os.chdir("./gsoc-wav2vec2/src")

In [3]:
# rm -rf /content/gsoc-wav2vec2

Now that we have installed the required packages, let's download the validation dataset. The original data comes from the official LibriSpeech [website](https://www.openslr.org/12); here we fetch a TFRecord version hosted on the HuggingFace Hub. It may take a while, depending on your internet connection.

In [4]:
!wget https://huggingface.co/datasets/vasudevgupta/gsoc-librispeech-tfrecords/resolve/main/dev-clean.tfrecord -P /content/gsoc-wav2vec2/data/dev/

--2021-06-27 03:36:57--  https://huggingface.co/datasets/vasudevgupta/gsoc-librispeech-tfrecords/resolve/main/test-clean.zip
Resolving huggingface.co (huggingface.co)... 15.197.130.34
Connecting to huggingface.co (huggingface.co)|15.197.130.34|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/datasets/vasudevgupta/gsoc-librispeech-tfrecords/26d83690802f15478b4aa037534dfbd4375d64a661f3a76bffe0c7f01b1df86c [following]
--2021-06-27 03:36:57--  https://cdn-lfs.huggingface.co/datasets/vasudevgupta/gsoc-librispeech-tfrecords/26d83690802f15478b4aa037534dfbd4375d64a661f3a76bffe0c7f01b1df86c
Resolving cdn-lfs.huggingface.co (cdn-lfs.huggingface.co)... 13.225.221.107, 13.225.221.29, 13.225.221.68, ...
Connecting to cdn-lfs.huggingface.co (cdn-lfs.huggingface.co)|13.225.221.107|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 757888972 (723M) [application/octet-stream]
Saving to: ‘/content/gsoc-wav2vec2/data/test

In [5]:
ls /content/gsoc-wav2vec2/data/

SA2.TXT  sample.TXT  test-clean.tfrecord  vocab.json
SA2.wav  sample.wav  test-clean.zip


Let's import `Wav2Vec2Processor` and `Wav2Vec2ForCTC` from our installed `wav2vec2` package.

In [3]:
import tensorflow as tf
from wav2vec2 import Wav2Vec2Processor, Wav2Vec2ForCTC

Now we will instantiate the processor and tokenizer with their default configurations. The convenient `.from_pretrained(...)` method downloads the pre-trained/fine-tuned model weights automatically from the HuggingFace Hub.

In [4]:
model_id = "vasudevgupta/tf-wav2vec2-base-960h"

processor = Wav2Vec2Processor(is_tokenizer=False)
tokenizer = Wav2Vec2Processor(is_tokenizer=True)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

Loading weights locally from `vasudevgupta/tf-wav2vec2-base-960h`
Total number of loaded variables: 213


`processor` converts raw speech into the format expected by our `Wav2Vec2ForCTC` model, e.g. by normalizing the speech along the frames axis.
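
To make the normalization step concrete, here is a minimal pure-Python sketch of zero-mean, unit-variance scaling along the frames axis. It is not the package's implementation: the function name `normalize_waveform` and the `eps` value are assumptions chosen for illustration.

```python
import math
import statistics

def normalize_waveform(samples, eps=1e-5):
    """Scale one utterance to zero mean and (approximately) unit variance.

    `eps` is an assumed small constant that avoids division by zero on silence.
    """
    mean = statistics.fmean(samples)
    variance = statistics.pvariance(samples, mu=mean)
    scale = math.sqrt(variance + eps)
    return [(x - mean) / scale for x in samples]

# Toy waveform with four samples (real utterances have tens of thousands)
normalized = normalize_waveform([0.5, -0.25, 0.75, 0.0])
print(normalized)
```

After this step every utterance has a comparable scale regardless of recording volume.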

`tokenizer` converts our model outputs into strings and takes care of removing special tokens (depending on your tokenizer configuration).

To get out-of-the-box performance with TensorFlow 2, we decorate our forward pass with `tf.function(...)`. The argument `jit_compile=True` compiles the traced computation with **XLA**, which fuses operations to generate very efficient code for accelerators.

In [5]:
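@tf.function(jit_compile=True)
def tf_forward(speech, training=False):
  # Forward pass: per-frame scores over the vocabulary
  tf_out = model(speech, training=training)
  # Greedy decoding: keep the highest-scoring token id for every frame
  return tf.squeeze(tf.argmax(tf_out, axis=-1))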
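
Note that `tf_forward` returns token ids, not text: the argmax over the last axis picks one vocabulary index per frame. A minimal pure-Python sketch of that step on toy scores (the helper name `greedy_token_ids` and the numbers are made up for illustration):

```python
def greedy_token_ids(logits):
    """Return the index of the largest score in each frame (argmax over the last axis)."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in logits]

# Three frames, each scored over a toy vocabulary of three tokens
toy_logits = [
    [0.1, 2.0, -1.0],
    [3.0, 0.2, 0.5],
    [0.0, 0.1, 4.0],
]
print(greedy_token_ids(toy_logits))  # [1, 0, 2]
```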

It's time to write a function that iterates over the complete validation dataset. We will collect the predictions from each step in a `list`.

In [17]:
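from typing import Optional

from data_utils import LibriSpeechDataLoader, LibriSpeechDataLoaderArgs
from tqdm.auto import tqdm

def infer_librispeech(dataset: tf.data.Dataset, num_samples: Optional[int] = None):
  """Run the model over `dataset` and return decoded predictions and labels."""
  predictions = []
  labels = []
  # `total` only sets the progress-bar length; the loop advances one batch per step
  for batch in tqdm(dataset, total=num_samples, desc="LibriSpeech Inference ... "):
    speech, label = batch
    tf_out = tf_forward(speech, training=False)
    # Predictions are per-frame token ids, so repeated tokens are grouped while decoding
    predictions.extend([tokenizer.decode(pred, group_tokens=True) for pred in tf_out.numpy().tolist()])
    # Labels are already target token sequences, so they are decoded without grouping
    labels.extend([tokenizer.decode(tgt, group_tokens=False) for tgt in label.numpy().tolist()])
  return predictions, labels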
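
The `group_tokens` flag differs between the two calls because a CTC model emits one token per frame, so raw predictions contain repeats and blank tokens, while labels are already clean character sequences. Below is a minimal sketch of the collapsing step, assuming `group_tokens=True` corresponds to standard greedy CTC decoding. The toy vocabulary, `blank_id`, and the `|` word delimiter are illustrative assumptions, not the package's actual configuration.

```python
from itertools import groupby

def greedy_ctc_decode(token_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse consecutive repeats, drop blanks, and join characters into text."""
    # Step 1: merge runs of identical ids (e.g. 2, 2, 2 -> 2)
    collapsed = [token_id for token_id, _ in groupby(token_ids)]
    # Step 2: remove the blank token, which only separates genuine repeats
    chars = [id_to_token[token_id] for token_id in collapsed if token_id != blank_id]
    # Step 3: turn the word delimiter into spaces
    return "".join(chars).replace(word_delimiter, " ").strip()

# Hypothetical vocabulary for the example
toy_vocab = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}
frame_ids = [2, 2, 0, 3, 3, 4, 0, 4, 5, 5, 1, 1]
print(greedy_ctc_decode(frame_ids, toy_vocab))  # HELLO
```

The blank between the two `L` frames is what lets the decoder keep a double letter instead of merging it.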

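Now we will define the arguments for the `DataLoader` used in `infer_librispeech(...)` and run inference on the validation dataset. Note that `dataset.take(2)` below restricts the run to 2 batches; remove that line to evaluate on the complete validation dataset.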

In [20]:
args = LibriSpeechDataLoaderArgs(data_dir="../data/dev/", batch_size=32, audio_maxlen=500000, labels_maxlen=256)

dataloader = LibriSpeechDataLoader(args)
num_samples = 2000
dataset = dataloader(from_tfrecords=True, seed=None)
dataset = dataset.take(2) # this will take 2 batches
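
Since the loop in `infer_librispeech` advances one batch per iteration, the number of iterations is counted in batches rather than samples. A minimal sketch of that arithmetic, assuming the final partial batch is kept (the helper name `expected_num_batches` is hypothetical):

```python
import math

def expected_num_batches(num_samples: int, batch_size: int) -> int:
    """Number of loop iterations needed to cover num_samples (last batch may be partial)."""
    return math.ceil(num_samples / batch_size)

print(expected_num_batches(2000, 32))  # 63
```

With `dataset.take(2)` in place, the loop runs only 2 iterations regardless of this value.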

The following cell will take about 7 minutes.

In [21]:
predictions, labels = infer_librispeech(dataset, num_samples=num_samples)
list(zip(predictions, labels))


[('EVERY PLANT IN THE GRASS IS SET FORMERLY GROWS PERFECTLY AND MAY BE REALIZED COMPLETELY',
  'EVERY PLANT IN THE GRASS IS SET FORMALLY GROWS PERFECTLY AND MAY BE REALIZED COMPLETELY'),
 ('THEY UNITE EVERY QUALITY AND SOMETIMES YOU WILL FIND ME REFERING TO THEM AS COLORISTS SOMETIMES AS CHIARISCURISTS',
  'THEY UNITE EVERY QUALITY AND SOMETIMES YOU WILL FIND ME REFERRING TO THEM AS COLORISTS SOMETIMES AS CHIAROSCURISTS'),
 ('THE BROWN GROWN BENEATH IS LEFT FOR THE MOST PART ONE TOUCH OF BLACK IS PUT FOR THE HOLLOW TWO DELICATE LINES OF DARK GRAY TO FIND THE OUTER CURVE AND ONE LITTLE QUIVERING TOUCH OF WHITE DRAWS THE INNER EDGE OF THE MANDIBLE',
  'THE BROWN GROUND BENEATH IS LEFT FOR THE MOST PART ONE TOUCH OF BLACK IS PUT FOR THE HOLLOW TWO DELICATE LINES OF DARK GRAY DEFINE THE OUTER CURVE AND ONE LITTLE QUIVERING TOUCH OF WHITE DRAWS THE INNER EDGE OF THE MANDIBLE'),
 ('YOU KNOW I HAVE JUST BEEN TELLING YOU HOW THIS SCHOOL OF MATERIALISM IN CLAY INVOLVED ITSELF AT LAST IN CLOUD A

It's time to calculate the **Word Error Rate (WER)** to judge how well our model performed. We will use the `load_metric(...)` function from HuggingFace `datasets` to set up the metric for us. First, let's install the `datasets` library using `pip`.

In [None]:
%%capture
!pip3 install datasets

Let's load the WER script using `load_metric("wer")` and compute the metric over our predictions.

In [23]:
from datasets import load_metric

wer = load_metric("wer")
wer.compute(references=labels, predictions=predictions)


0.10880829015544041
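
For intuition about this number: WER is the word-level edit distance (substitutions, insertions, and deletions) between predictions and references, divided by the total number of reference words. The following is a minimal pure-Python sketch of that definition, not the `datasets` implementation. The function names are hypothetical, and the example reuses the first prediction/label pair from the output above.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (row-by-row dynamic programming)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def word_error_rate(references, predictions):
    """Total word errors divided by total reference words over all pairs."""
    errors = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words = ref.split()
        errors += word_edit_distance(ref_words, hyp.split())
        total_words += len(ref_words)
    return errors / total_words

reference = "EVERY PLANT IN THE GRASS IS SET FORMALLY GROWS PERFECTLY AND MAY BE REALIZED COMPLETELY"
prediction = "EVERY PLANT IN THE GRASS IS SET FORMERLY GROWS PERFECTLY AND MAY BE REALIZED COMPLETELY"
# One substituted word out of 15 reference words
print(word_error_rate([reference], [prediction]))
```

A WER of about 0.109 therefore means roughly one word error for every nine reference words in the evaluated batches.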