<a href="https://colab.research.google.com/github/vasudevgupta7/gsoc-wav2vec2/blob/notebook/notebooks/librispeech_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wav2Vec2 inference on LibriSpeech dataset

In this notebook, we will evaluate the TensorFlow Wav2Vec2 model using the checkpoint fine-tuned on 960 hours of the LibriSpeech dataset. Let's start with the basic setup and install the `wav2vec2` package from this [repository](https://github.com/vasudevgupta7/gsoc-wav2vec2).

In [1]:
!pip3 install -q git+https://github.com/vasudevgupta7/gsoc-wav2vec2@training

[?25l[K     |▏                               | 10 kB 29.3 MB/s eta 0:00:01[K     |▍                               | 20 kB 33.8 MB/s eta 0:00:01[K     |▌                               | 30 kB 20.7 MB/s eta 0:00:01[K     |▊                               | 40 kB 16.6 MB/s eta 0:00:01[K     |█                               | 51 kB 9.0 MB/s eta 0:00:01[K     |█                               | 61 kB 9.0 MB/s eta 0:00:01[K     |█▎                              | 71 kB 9.2 MB/s eta 0:00:01[K     |█▍                              | 81 kB 10.2 MB/s eta 0:00:01[K     |█▋                              | 92 kB 10.5 MB/s eta 0:00:01[K     |█▉                              | 102 kB 8.6 MB/s eta 0:00:01[K     |██                              | 112 kB 8.6 MB/s eta 0:00:01[K     |██▏                             | 122 kB 8.6 MB/s eta 0:00:01[K     |██▎                             | 133 kB 8.6 MB/s eta 0:00:01[K     |██▌                             | 143 kB 8.6 MB/s eta 0:00:01[

Now that we have installed the required packages, let's download the validation dataset from the official LibriSpeech [website](https://www.openslr.org/12). The archive is about 322 MB, so the download may take a little while depending on your internet connection.

In [2]:
!wget https://www.openslr.org/resources/12/dev-clean.tar.gz
!tar -xf dev-clean.tar.gz

DATA_DIR = "LibriSpeech/dev-clean"

--2021-08-06 01:40:03--  https://www.openslr.org/resources/12/dev-clean.tar.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 337926286 (322M) [application/x-gzip]
Saving to: ‘dev-clean.tar.gz’


2021-08-06 01:40:21 (18.5 MB/s) - ‘dev-clean.tar.gz’ saved [337926286/337926286]



Let's import `Wav2Vec2ForCTC` and `Wav2Vec2Config` from our installed `wav2vec2` package. `Wav2Vec2Processor` is imported further down, where the data is pre-processed.

In [12]:
import tensorflow as tf
from wav2vec2 import Wav2Vec2ForCTC, Wav2Vec2Config

Next, we will instantiate the model. The convenient `.from_pretrained(...)` method downloads the pre-trained/fine-tuned weights automatically from the HuggingFace Hub.

In [4]:
# from google.colab import auth
# auth.authenticate_user()

# model = Wav2Vec2ForCTC(Wav2Vec2Config())
# model(tf.random.uniform(shape=(1, 246000)))
# model.load_weights("gs://gsoc-weights/tf-wav2vec2-base/tf_model")

In [7]:
# TODO: delete this cell later & uncomment above cell
model = Wav2Vec2ForCTC.from_pretrained("vasudevgupta/gsoc-wav2vec2-base-960h")

Downloading model weights from https://huggingface.co/vasudevgupta/gsoc-wav2vec2-base-960h ... Done
Total number of loaded variables: 213


`processor` converts raw speech into the format accepted by our `Wav2Vec2ForCTC` model, e.g. by normalizing the speech along the frames axis.

`tokenizer` converts our model outputs into strings and takes care of removing special tokens (depending on your tokenizer configuration).
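
To make the normalization step concrete, here is a minimal NumPy sketch of zero-mean, unit-variance scaling along the frames axis. It is an assumption about what such a processor typically does, not the `wav2vec2` package's actual implementation; the helper name `normalize_speech` and the `eps` value are hypothetical.

```python
import numpy as np

def normalize_speech(audio: np.ndarray, eps: float = 1e-5) -> np.ndarray:
  """Scale a waveform to zero mean and (almost) unit variance along the last axis."""
  # `eps` is an assumed small constant that avoids division by zero on silent audio
  mean = audio.mean(axis=-1, keepdims=True)
  var = audio.var(axis=-1, keepdims=True)
  return (audio - mean) / np.sqrt(var + eps)

# toy waveform standing in for real 16 kHz speech samples
waveform = np.array([0.1, -0.2, 0.4, 0.0, -0.3], dtype=np.float32)
normalized = normalize_speech(waveform)
print(normalized.mean(), normalized.std())
```

After this step, every utterance has a comparable scale regardless of how loud the original recording was.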

<!-- For getting out of box performance with TensorFlow-2, we will be decorating our forward pass with `tf.function(...)`. Argument `jit_compile=True` will result in compilation of python code using **XLA** and will fuse operations to be able to generate very efficient code for accelerators. -->

In [8]:
def tf_forward(speech):
  tf_out = model(speech, training=False)
  return tf.squeeze(tf.argmax(tf_out, axis=-1))

It's time to write the functions for iterating over the complete validation dataset. We will collect and store the prediction for each step in a `list`.

In [9]:
import soundfile as sf
import os

REQUIRED_SAMPLE_RATE = 16000
SPLIT = "dev-clean"

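def read_txt_file(file_path):
  # each transcript line looks like "<utterance-id> <TRANSCRIPT WORDS ...>"
  with open(file_path, "r") as f:
    lines = f.read().split("\n")
  # note: lines with fewer than three tokens (id plus a single word) are skipped
  samples = {s.split()[0]: " ".join(s.split()[1:]) for s in lines if len(s.split()) > 2}
  return samples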

def read_flac_file(file_path):
  with open(file_path, "rb") as f:
    audio, sample_rate = sf.read(f)
  if sample_rate != REQUIRED_SAMPLE_RATE:
    raise ValueError(
      f"sample rate (={sample_rate}) of your files must be {REQUIRED_SAMPLE_RATE}"
    )
  file_id = os.path.split(file_path)[-1][:-len(".flac")]
  return {file_id: audio}
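
As a rough illustration of what `read_txt_file` produces, the sketch below applies the same parsing rule to an in-memory string instead of a file. The utterance ids and transcripts are made up for the example, and `parse_transcript_lines` is a hypothetical helper, not part of the notebook.

```python
def parse_transcript_lines(text):
  """Map utterance ids to transcripts, using the same filter as `read_txt_file`."""
  samples = {}
  for line in text.split("\n"):
    parts = line.split()
    if len(parts) > 2:  # id plus at least two words
      samples[parts[0]] = " ".join(parts[1:])
  return samples

# hypothetical transcript content: "<utterance-id> <TRANSCRIPT>" per line
example = (
  "spk-chap-0000 HELLO WORLD AGAIN\n"
  "spk-chap-0001 GOOD MORNING\n"
  "spk-chap-0002 YES\n"
)
parsed = parse_transcript_lines(example)
print(parsed)
```

Note that the single-word utterance `spk-chap-0002` is dropped by the `> 2` filter, so such samples never reach the evaluation loop.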

In [10]:
def fetch_sound_text_mapping():
  flac_files = tf.io.gfile.glob(f"LibriSpeech/{SPLIT}/*/*/*.flac")
  txt_files = tf.io.gfile.glob(f"LibriSpeech/{SPLIT}/*/*/*.txt")

  txt_samples = {}
  for f in txt_files:
    txt_samples.update(read_txt_file(f))

  speech_samples = {}
  for f in flac_files:
    speech_samples.update(read_flac_file(f))

  file_ids = set(speech_samples.keys()) & set(txt_samples.keys())
  samples = [(speech_samples[file_id], txt_samples[file_id]) for file_id in file_ids]
  return samples

In [11]:
samples = fetch_sound_text_mapping()

In [13]:
from wav2vec2 import Wav2Vec2Processor
tokenizer = Wav2Vec2Processor(is_tokenizer=True)
processor = Wav2Vec2Processor(is_tokenizer=False)

AUDIO_MAXLEN, LABEL_MAXLEN = 246000, 256
DO_PADDING = False

def preprocess_text(text):
  label = tokenizer(text)
  label = tf.constant(label, dtype=tf.int32)[None]
  if DO_PADDING:
    label = label[:, :LABEL_MAXLEN]
    padding = tf.zeros((label.shape[0], LABEL_MAXLEN - label.shape[1]), dtype=label.dtype)
    label = tf.concat([label, padding], axis=-1)
  return label

def preprocess_speech(audio):
  audio = tf.constant(audio, dtype=tf.float32)
  audio = processor(audio)[None]
  if DO_PADDING:
    audio = audio[:, :AUDIO_MAXLEN]
    padding = tf.zeros((audio.shape[0], AUDIO_MAXLEN - audio.shape[1]), dtype=audio.dtype)
    audio = tf.concat([audio, padding], axis=-1)
  return audio
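
The `DO_PADDING` branch in both functions follows the same truncate-then-pad pattern. Here is a small NumPy sketch of that pattern on toy arrays; `pad_or_truncate` is a hypothetical helper written for illustration, and it uses NumPy rather than TensorFlow only to keep the example lightweight.

```python
import numpy as np

def pad_or_truncate(batch: np.ndarray, max_len: int) -> np.ndarray:
  """Return a (batch, max_len) array: cut longer inputs, zero-pad shorter ones."""
  batch = batch[:, :max_len]  # truncation is a no-op for short inputs
  padding = np.zeros((batch.shape[0], max_len - batch.shape[1]), dtype=batch.dtype)
  return np.concatenate([batch, padding], axis=-1)

short = np.ones((1, 3), dtype=np.float32)
long = np.ones((1, 8), dtype=np.float32)
padded_short = pad_or_truncate(short, 5)
padded_long = pad_or_truncate(long, 5)
print(padded_short, padded_long.shape)
```

Fixed-length inputs like these are mainly useful when the forward pass is compiled for a static shape; with `DO_PADDING = False`, each sample keeps its natural length.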

In [14]:
def inputs_generator():
  for speech, text in samples:
    yield preprocess_speech(speech), preprocess_text(text)

In [15]:
output_signature = (
    # shape is left unspecified: each element is a (1, length) tensor of varying length
    tf.TensorSpec(shape=None, dtype=tf.float32),
    tf.TensorSpec(shape=None, dtype=tf.int32),
)
dataset = tf.data.Dataset.from_generator(inputs_generator, output_signature=output_signature)

In [18]:
from tqdm.auto import tqdm

def infer_librispeech(dataset: tf.data.Dataset, num_batches: int = None):
  predictions, labels = [], []
  for batch in tqdm(dataset, total=num_batches, desc="LibriSpeech Inference ... "):
    speech, label = batch
    tf_out = tf_forward(speech)
    predictions.append(tokenizer.decode(tf_out.numpy().tolist(), group_tokens=True))
    labels.append(tokenizer.decode(label.numpy().squeeze().tolist(), group_tokens=False))
    # print({"prediction": predictions[-1], "label": labels[-1]})
  return predictions, labels
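
The two `tokenizer.decode(...)` calls differ in `group_tokens`. A plausible reading is that frame-level predictions repeat the same token over consecutive frames and must be collapsed (greedy CTC decoding), whereas labels contain genuine double letters that must be kept. The sketch below illustrates this under stated assumptions: the toy vocabulary, the blank id `0`, and the `|` word delimiter are hypothetical and do not come from the `wav2vec2` package.

```python
from itertools import groupby

# hypothetical toy vocabulary; the real tokenizer ships its own mapping
ID_TO_TOKEN = {0: "<pad>", 1: "|", 2: "A", 3: "C", 4: "T"}
BLANK_ID = 0          # assumed id of the CTC blank / padding token
WORD_DELIMITER = "|"  # assumed token marking word boundaries

def greedy_ctc_decode(token_ids, group_tokens=True):
  """Collapse repeated ids (optional), drop blanks, and join tokens into text."""
  if group_tokens:
    token_ids = [tid for tid, _ in groupby(token_ids)]
  tokens = [ID_TO_TOKEN[tid] for tid in token_ids if tid != BLANK_ID]
  return "".join(tokens).replace(WORD_DELIMITER, " ").strip()

# per-frame argmax ids, as `tf_forward` would return them for one utterance
frame_ids = [3, 3, 0, 2, 2, 2, 0, 4, 4, 1, 1, 0, 2]
decoded = greedy_ctc_decode(frame_ids)
print(decoded)  # CAT A
```

A blank between two identical ids keeps them separate, so `[2, 2, 0, 2]` decodes to `AA` rather than `A`.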

Now, we will run `infer_librispeech(...)` over the complete validation dataset. The `num_batches` argument only sets the length of the progress bar; since the dataset yields one sample at a time, it corresponds to the number of samples.

In [19]:
predictions, labels = infer_librispeech(dataset, num_batches=2698)
list(zip(predictions, labels))

[('ONCE IN THE NIGHT I SLIPPED AWAY FROM THE BIVOAC AND HURRIED TO THE OLD TISHUMINGO HOTEL TO SEE A LIEUTENANT OF MY COMPANY WHO HAD BEEN SHOT THROUGH THE BREAST',
  'ONCE IN THE NIGHT I SLIPPED AWAY FROM THE BIVOUAC AND HURRIED TO THE OLD TISHIMINGO HOTEL TO SEE A LIEUTENANT OF MY COMPANY WHO HAD BEEN SHOT THROUGH THE BREAST'),
 ('BUT BY THE LARGESSE OF CELESTIAL GRACES WHICH HAVE SUCH LOFTY VAPOURS FOR THEIR REIGN THAT NEAR TO THEM OUR SIGHT APPROACHES NOT',
  'BUT BY THE LARGESS OF CELESTIAL GRACES WHICH HAVE SUCH LOFTY VAPOURS FOR THEIR RAIN THAT NEAR TO THEM OUR SIGHT APPROACHES NOT'),
 ('THE LADY AND THE GUITAR CERTAINLY PASSED THE NIGHT AT HILL VIEW VILLA BUT WHEN HIS MOTHER VERY ANGRY AND VERY FRIGHTENED CAME UP WITH HIM AT ABOUT NOON THE HOUSE LOOKED JUST AS USUAL AND NO ONE WAS THERE BUT THE CHAR WOMAN',
  'THE LADY AND THE GUITAR CERTAINLY PASSED THE NIGHT AT HILL VIEW VILLA BUT WHEN HIS MOTHER VERY ANGRY AND VERY FRIGHTENED CAME UP WITH HIM AT ABOUT NOON THE HOUSE LOOKED J

It's time to calculate the **Word Error Rate (WER)** to judge how well our model performed. We will use the `load_metric(...)` function from HuggingFace `datasets` to set up the metric for us. First, let's install the `datasets` library using `pip`.

In [20]:
!pip3 install -q datasets

from datasets import load_metric
wer = load_metric("wer")

The WER script was loaded above with `load_metric("wer")`; let's now compute the metric value over our predictions.

In [21]:
wer.compute(references=labels, predictions=predictions)

0.03167454087541592
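
For intuition about what this number measures, WER is commonly defined as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, so the value above corresponds to roughly 3.2 errors per 100 reference words. The pure-Python sketch below implements that definition on toy sentences; it is not the code behind `load_metric("wer")`, and the helper names are hypothetical.

```python
def word_edit_distance(ref_words, hyp_words):
  """Levenshtein distance between two word lists (row-by-row dynamic programming)."""
  prev = list(range(len(hyp_words) + 1))
  for i, ref_word in enumerate(ref_words, start=1):
    curr = [i]
    for j, hyp_word in enumerate(hyp_words, start=1):
      cost = 0 if ref_word == hyp_word else 1
      curr.append(min(
        prev[j] + 1,         # deletion
        curr[j - 1] + 1,     # insertion
        prev[j - 1] + cost,  # match or substitution
      ))
    prev = curr
  return prev[-1]

def word_error_rate(references, predictions):
  """Total word errors divided by total reference words over the whole corpus."""
  errors = sum(
    word_edit_distance(ref.split(), hyp.split())
    for ref, hyp in zip(references, predictions)
  )
  total_words = sum(len(ref.split()) for ref in references)
  return errors / total_words

refs = ["THE CAT SAT ON THE MAT", "HELLO WORLD"]
hyps = ["THE CAT SIT ON MAT", "HELLO WORLD"]
score = word_error_rate(refs, hyps)
print(score)  # 0.25
```

Here one substitution (`SAT` to `SIT`) and one deletion (`THE`) over eight reference words give 2 / 8 = 0.25.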