<a href="https://colab.research.google.com/github/vasudevgupta7/gsoc-wav2vec2/blob/notebook/notebooks/librispeech_saved_model_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wav2Vec2 evaluation on LibriSpeech dataset

In this notebook, we will evaluate the `Wav2Vec2` SavedModel, using the checkpoint fine-tuned on 960 hours of the LibriSpeech dataset.

Let's start by installing the `gsoc-wav2vec2` package from this [repository](https://github.com/vasudevgupta7/gsoc-wav2vec2).

In [1]:
# TODO: change branch name to main after merge of training-v2

!pip3 install -q git+https://github.com/vasudevgupta7/gsoc-wav2vec2@training-v2



Now that we have installed the required packages, let's download the test dataset from the official LibriSpeech [website](https://www.openslr.org/12). The archive is about 331 MB, so this may take a while depending on your internet connection.

In [2]:
!wget https://www.openslr.org/resources/12/test-clean.tar.gz
!tar -xf test-clean.tar.gz

--2021-08-10 17:33:02--  https://www.openslr.org/resources/12/test-clean.tar.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 346663984 (331M) [application/x-gzip]
Saving to: ‘test-clean.tar.gz’


2021-08-10 17:33:21 (18.4 MB/s) - ‘test-clean.tar.gz’ saved [346663984/346663984]



In [3]:
ls LibriSpeech/

BOOKS.TXT  CHAPTERS.TXT  LICENSE.TXT  README.TXT  SPEAKERS.TXT  test-clean/


Now, we will download the fine-tuned Wav2Vec2 SavedModel (hosted on the HuggingFace Hub for now) and load it with `hub.KerasLayer` for inference.

In [4]:
# TODO: change this to TFHub later

!wget https://huggingface.co/vasudevgupta/gsoc-wav2vec2-960h/resolve/main/saved-model.tar.gz
!tar -xf saved-model.tar.gz

import tensorflow as tf
import tensorflow_hub as hub

model = hub.KerasLayer("saved-model")

--2021-08-10 17:33:24--  https://huggingface.co/vasudevgupta/gsoc-wav2vec2-960h/resolve/main/saved-model.tar.gz
Resolving huggingface.co (huggingface.co)... 15.197.130.34
Connecting to huggingface.co (huggingface.co)|15.197.130.34|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/vasudevgupta/gsoc-wav2vec2-960h/2a93d38e08cf94ca6c9e5501ac61ea72aa29e244ef66a767024b70080478de4f [following]
--2021-08-10 17:33:25--  https://cdn-lfs.huggingface.co/vasudevgupta/gsoc-wav2vec2-960h/2a93d38e08cf94ca6c9e5501ac61ea72aa29e244ef66a767024b70080478de4f
Resolving cdn-lfs.huggingface.co (cdn-lfs.huggingface.co)... 13.227.73.107, 13.227.73.21, 13.227.73.71, ...
Connecting to cdn-lfs.huggingface.co (cdn-lfs.huggingface.co)|13.227.73.107|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 390685696 (373M) [application/octet-stream]
Saving to: ‘saved-model.tar.gz’


2021-08-10 17:33:29 (91.6 MB/s) - ‘saved-model.tar.gz’ saved 

In the following cell, we define the forward pass and wrap it with `tf.function(...)` for better performance. Passing `jit_compile=True` additionally compiles the function with XLA.

In [5]:
@tf.function(jit_compile=True)
def tf_forward(speech):
  tf_out = model(speech, training=False)
  return tf.squeeze(tf.argmax(tf_out, axis=-1))
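
`tf.argmax(tf_out, axis=-1)` keeps, for every output frame, the index of the highest-scoring vocabulary entry. The following is a minimal pure-Python sketch of that step on a toy score matrix. The helper name `argmax_per_frame` and the numbers are made up for illustration and are not part of the `wav2vec2` package.

```python
def argmax_per_frame(scores):
    """Return the index of the largest score in each frame (row)."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in scores]

# Toy scores: 4 frames, 3 hypothetical vocabulary entries.
toy_scores = [
    [0.1, 2.0, 0.3],
    [0.2, 1.5, 0.1],
    [3.0, 0.0, 0.5],
    [0.0, 0.1, 4.0],
]
token_ids = argmax_per_frame(toy_scores)
print(token_ids)  # [1, 1, 0, 2]
```

The result is one token id per frame, which still contains repeats and padding ids; those are handled later by the tokenizer's `decode` method.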

In the following cell, we define functions to read the sound (`.flac`) and transcription (`.txt`) files.

As the `Wav2Vec2` model was trained on speech sampled at 16 kHz, we need to evaluate at the same sampling rate; any change in sampling rate changes the input data distribution.

In [6]:
import soundfile as sf
import os

REQUIRED_SAMPLE_RATE = 16000
SPLIT = "test-clean"

def read_txt_file(path):
  # Each line is "<file_id> <transcription>"; map file_id -> transcription.
  with open(path, "r") as f:
    lines = f.read().split("\n")
  # NOTE: `> 2` also skips lines whose transcription is a single word.
  return {s.split()[0]: " ".join(s.split()[1:]) for s in lines if len(s.split()) > 2}

def read_flac_file(file_path):
  with open(file_path, "rb") as f:
    audio, sample_rate = sf.read(f)
  if sample_rate != REQUIRED_SAMPLE_RATE:
    raise ValueError(
      f"sample rate (={sample_rate}) of your files must be {REQUIRED_SAMPLE_RATE}"
    )
  # The file name without the ".flac" extension is the utterance id.
  file_id = os.path.split(file_path)[-1][:-len(".flac")]
  return {file_id: audio}
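
To make the transcription parsing concrete, here is a small sketch that runs the same splitting logic on an in-memory string instead of a file. The helper `parse_transcript`, its `min_tokens` parameter, and the utterance ids are hypothetical and only serve the illustration. It shows that the `len(s.split()) > 2` filter drops empty lines and also any utterance whose transcription is a single word.

```python
def parse_transcript(text, min_tokens=3):
    """Map utterance id -> transcription for lines with at least `min_tokens` tokens."""
    mapping = {}
    for line in text.split("\n"):
        tokens = line.split()
        if len(tokens) >= min_tokens:
            mapping[tokens[0]] = " ".join(tokens[1:])
    return mapping

# Made-up transcript content in the "<utterance-id> <WORDS...>" layout.
toy_text = "spk-chap-0000 HELLO WORLD\nspk-chap-0001 YES\n"

strict = parse_transcript(toy_text)                 # same filter as `len(s.split()) > 2`
lenient = parse_transcript(toy_text, min_tokens=2)  # also keeps one-word transcriptions
print(strict)
print(lenient)
```

Whether one-word utterances should be kept is a design choice; the numbers reported in this notebook were produced with the strict filter.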

Let's now pair every speech sample with its transcription in a `List[Tuple]` for further processing.

In [7]:
def fetch_sound_text_mapping():
  flac_files = tf.io.gfile.glob(f"LibriSpeech/{SPLIT}/*/*/*.flac")
  txt_files = tf.io.gfile.glob(f"LibriSpeech/{SPLIT}/*/*/*.txt")

  txt_samples = {}
  for f in txt_files:
    txt_samples.update(read_txt_file(f))

  speech_samples = {}
  for f in flac_files:
    speech_samples.update(read_flac_file(f))

  file_ids = set(speech_samples.keys()) & set(txt_samples.keys())
  print(f"{len(file_ids)} files are found in LibriSpeech/{SPLIT}")
  samples = [(speech_samples[file_id], txt_samples[file_id]) for file_id in file_ids]
  return samples
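
The pairing step relies on intersecting the key sets of two dictionaries, so an id that is missing on either side is silently dropped. A minimal sketch with made-up ids and values (none of them come from LibriSpeech):

```python
# Hypothetical id -> audio and id -> text mappings.
speech_by_id = {"utt-0": [0.0, 0.1], "utt-1": [0.2, 0.3], "utt-2": [0.4]}
text_by_id = {"utt-1": "HELLO", "utt-2": "WORLD", "utt-3": "UNUSED"}

# Keep only ids present in both mappings; sorting makes the order reproducible.
common_ids = sorted(set(speech_by_id) & set(text_by_id))
pairs = [(speech_by_id[i], text_by_id[i]) for i in common_ids]
print(common_ids)  # ['utt-1', 'utt-2']
print(pairs)
```

Sorting is an addition in this sketch: iterating over a plain `set` yields an arbitrary order, which does not affect the final metric but does affect which samples appear first.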

Note: The following cell loads the complete dataset into memory. This is fine because the test data is small.

In [8]:
samples = fetch_sound_text_mapping()

2618 files are found in LibriSpeech/test-clean


Let's have a look at a random sample:

In [9]:
from IPython.display import Audio
import soundfile as sf
import random

audio, text = random.choice(samples)
sf.write("sample.wav", audio, REQUIRED_SAMPLE_RATE)

print("Text Transcription:", text, "\nAudio:")
Audio(filename="sample.wav")

Text Transcription: AS THE AMBASSADOR OF A GOVERNMENT IS HONORED FOR HIS OFFICE AND NOT FOR HIS PRIVATE PERSON SO THE MINISTER OF CHRIST SHOULD EXALT HIS OFFICE IN ORDER TO GAIN AUTHORITY AMONG MEN 
Audio:


Now, we will perform the necessary processing on our test dataset.

`Wav2Vec2Processor(is_tokenizer=False)` normalizes the raw speech along the frames axis, and `Wav2Vec2Processor(is_tokenizer=True)` converts text to token ids and decodes model outputs back into strings, removing special tokens (depending on your tokenizer configuration).

In [10]:
from wav2vec2 import Wav2Vec2Processor

tokenizer = Wav2Vec2Processor(is_tokenizer=True)
processor = Wav2Vec2Processor(is_tokenizer=False)

Downloading `vocab.json` from https://github.com/vasudevgupta7/gsoc-wav2vec2/raw/main/data/vocab.json ... DONE
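
As an illustration of what normalizing along the frames axis can look like, here is a pure-Python sketch that rescales a signal to zero mean and roughly unit variance. This is an assumption about the kind of normalization involved; the actual `Wav2Vec2Processor` implementation may differ, and the helper `normalize` and its `eps` value are hypothetical.

```python
import math

def normalize(signal, eps=1e-5):
    """Scale a 1-D signal to zero mean and (approximately) unit variance."""
    mean = sum(signal) / len(signal)
    variance = sum((x - mean) ** 2 for x in signal) / len(signal)
    scale = math.sqrt(variance + eps)  # eps guards against division by zero
    return [(x - mean) / scale for x in signal]

# Toy waveform, not real audio.
toy_audio = [0.0, 0.5, 1.0, 0.5, 0.0]
normalized = normalize(toy_audio)
print([round(x, 3) for x in normalized])
```

Normalizing each utterance independently removes differences in recording volume before the audio reaches the model.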


In the following cell, `DO_PADDING=True` pads (or truncates) speech sequences to `AUDIO_MAXLEN` and labels to `LABEL_MAXLEN`. This is important because the Wav2Vec2 SavedModel only accepts sequences of length 246000.

In [11]:
AUDIO_MAXLEN, LABEL_MAXLEN = 246000, 256
DO_PADDING = True

def preprocess_text(text):
  label = tokenizer(text)
  label = tf.constant(label, dtype=tf.int32)[None]
  if DO_PADDING:
    label = label[:, :LABEL_MAXLEN]
    padding = tf.zeros((label.shape[0], LABEL_MAXLEN - label.shape[1]), dtype=label.dtype)
    label = tf.concat([label, padding], axis=-1)
  return label

def preprocess_speech(audio):
  audio = tf.constant(audio, dtype=tf.float32)
  audio = processor(audio)[None]
  if DO_PADDING:
    audio = audio[:, :AUDIO_MAXLEN]
    padding = tf.zeros((audio.shape[0], AUDIO_MAXLEN - audio.shape[1]), dtype=audio.dtype)
    audio = tf.concat([audio, padding], axis=-1)
  return audio
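
The truncate-then-pad pattern used by both preprocessing functions can be sketched on plain Python lists. The helper `pad_or_truncate` is hypothetical and only mirrors the slicing and zero-padding logic above on a smaller scale.

```python
def pad_or_truncate(sequence, max_len, pad_value=0):
    """Cut `sequence` to `max_len` items, then right-pad with `pad_value`."""
    clipped = list(sequence[:max_len])
    return clipped + [pad_value] * (max_len - len(clipped))

print(pad_or_truncate([7, 8, 9], max_len=5))           # [7, 8, 9, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=4))  # [1, 2, 3, 4]
```

Truncating first guarantees that the padding length is never negative, so every output has exactly `max_len` items.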

Now we will wrap everything in a `tf.data.Dataset` with the help of its `.from_generator` method.

In [12]:
def inputs_generator():
  for speech, text in samples:
    yield preprocess_speech(speech), preprocess_text(text)

output_signature = (
    tf.TensorSpec(shape=None, dtype=tf.float32),  # speech; shape left unspecified
    tf.TensorSpec(shape=None, dtype=tf.int32),  # label; shape left unspecified
)
dataset = tf.data.Dataset.from_generator(inputs_generator, output_signature=output_signature)

In the following cell, we define a function that takes the dataset as an argument and returns the model's predictions together with the corresponding labels.

In [13]:
from tqdm.auto import tqdm

def infer_librispeech(dataset: tf.data.Dataset, num_batches: int = None):
  predictions, labels = [], []
  for batch in tqdm(dataset, total=num_batches, desc="LibriSpeech Inference ... "):
    speech, label = batch
    tf_out = tf_forward(speech)
    predictions.append(tokenizer.decode(tf_out.numpy().tolist(), group_tokens=True))
    labels.append(tokenizer.decode(label.numpy().squeeze().tolist(), group_tokens=False))
  return predictions, labels
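
Predictions are decoded with `group_tokens=True` while labels use `group_tokens=False`. A sketch of the usual greedy CTC rule suggests why: frame-level outputs repeat the same token over many frames, whereas a label may contain genuine double letters. The code below assumes that id 0 is the blank/padding token and uses made-up ids. The real `tokenizer.decode` is not shown in this notebook, so its behaviour is assumed, not confirmed.

```python
from itertools import groupby

def ctc_collapse(token_ids, blank_id=0, group_tokens=True):
    """Optionally merge consecutive repeats, then drop the blank/padding id."""
    if group_tokens:
        token_ids = [token for token, _ in groupby(token_ids)]
    return [token for token in token_ids if token != blank_id]

# Hypothetical ids: 0 = blank/padding, 5 = "L", 6 = "O".
frame_ids = [5, 5, 0, 5, 6, 6, 0, 0]
print(ctc_collapse(frame_ids))                      # [5, 5, 6]

label_ids = [5, 5, 6, 0, 0]  # padded label with a genuine double "L"
print(ctc_collapse(label_ids, group_tokens=False))  # [5, 5, 6]
print(ctc_collapse(label_ids, group_tokens=True))   # [5, 6]
```

Grouping the label would wrongly merge the double letter, which is why grouping is only applied to the model output.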

Let's run the above function!

In [14]:
predictions, labels = infer_librispeech(dataset, num_batches=2618)
list(zip(predictions, labels))

[('I UNDERSTAND BARTLEY I WAS WRONG', 'I UNDERSTAND BARTLEY I WAS WRONG'),
 ("THINKS I O MYSELF I'V NEVER SEEN ANYTHING OSH POP EM GOOD AN MEND IF HE TOOK TIME ENOUGH IN GLUE ENOUG SO I CARRIED THIS LITTLE FELLER HOME IN A BUSHEL BASKET ONE NIGHT LAST MONTH AN I'VE SPENT ELEVEN EVENINS PUTTIN HIM TO GETHER",
  "THINKS I TO MYSELF I NEVER SEEN ANYTHING OSH POPHAM COULDN'T MEND IF HE TOOK TIME ENOUGH AND GLUE ENOUGH SO I CARRIED THIS LITTLE FELLER HOME IN A BUSHEL BASKET ONE NIGHT LAST MONTH AN I'VE SPENT ELEVEN EVENIN'S PUTTIN HIM TOGETHER"),
 ('IT IS YOU WHO ARE MISTAKEN RAUL I HAVE READ HIS DISTRESS IN HIS EYES IN HIS EVERY GESTURE AND ACTION THE WHOLE DAY',
  'IT IS YOU WHO ARE MISTAKEN RAOUL I HAVE READ HIS DISTRESS IN HIS EYES IN HIS EVERY GESTURE AND ACTION THE WHOLE DAY'),
 ('TELL US SAID THE OTHER THE WHOLE STORY AND WHERE SAWLYN HEARD THE STORY',
  'TELL US SAID THE OTHER THE WHOLE STORY AND WHERE SOLON HEARD THE STORY'),
 ('A GREAT SAINT SAINT FRANCIS ZAVIER', 'A GREAT SAINT S

Now, we will calculate the **Word Error Rate (WER)** to judge how well our model performed. We will use the `load_metric(...)` function from HuggingFace `datasets`.

In [15]:
!pip3 install -q datasets

from datasets import load_metric
wer = load_metric("wer")

Let's compute the metric value in the following cell:

In [16]:
wer.compute(references=labels, predictions=predictions)

0.06010330255125998
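
For intuition about what this number means, WER can be sketched as the word-level edit distance (substitutions, insertions, deletions) summed over all utterances and divided by the total number of reference words. The functions below are an illustrative pure-Python version, not the implementation behind `load_metric("wer")`, and the second sentence pair is a toy example.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two word lists."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]

def corpus_wer(references, predictions):
    """Total word errors divided by total reference words."""
    errors = sum(word_edit_distance(r.split(), p.split())
                 for r, p in zip(references, predictions))
    total_words = sum(len(r.split()) for r in references)
    return errors / total_words

refs = ["I UNDERSTAND BARTLEY I WAS WRONG", "A B C D"]
hyps = ["I UNDERSTAND BARTLEY I WAS WRONG", "A X C"]
print(corpus_wer(refs, hyps))  # 0.2
```

Here the second pair contributes one substitution and one deletion over ten reference words in total. Note also that newer releases of `datasets` no longer ship `load_metric`; as far as we know, the separate `evaluate` package offers the same metric via `evaluate.load("wer")`.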