<a href="https://colab.research.google.com/github/vasudevgupta7/gsoc-wav2vec2/blob/main/notebooks/librispeech_evaluation_WER_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wav2Vec2 evaluation on LibriSpeech dataset

In this notebook, we evaluate the `Wav2Vec2` model using a checkpoint fine-tuned on 960 hours of the LibriSpeech dataset.

Let's start by installing the `gsoc-wav2vec2` package from this [repository](https://github.com/vasudevgupta7/gsoc-wav2vec2).

In [33]:
!pip3 install -q git+https://github.com/vasudevgupta7/gsoc-wav2vec2@main

Now that we have installed the required packages, let's download the test dataset from the official LibriSpeech [website](https://www.openslr.org/12). The archive is about 331 MB, so this may take a while depending on your internet connection.

In [34]:
!wget https://www.openslr.org/resources/12/test-clean.tar.gz
!tar -xf test-clean.tar.gz

--2021-08-14 16:51:51--  https://www.openslr.org/resources/12/test-clean.tar.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 346663984 (331M) [application/x-gzip]
Saving to: ‘test-clean.tar.gz.1’


2021-08-14 16:52:09 (18.9 MB/s) - ‘test-clean.tar.gz.1’ saved [346663984/346663984]



In [35]:
ls LibriSpeech/

BOOKS.TXT  CHAPTERS.TXT  LICENSE.TXT  README.TXT  SPEAKERS.TXT  test-clean/


Now, we will instantiate the model with the fine-tuned weights. The convenient `.from_pretrained(...)` method downloads the fine-tuned weights from the HuggingFace Hub and then runs the `.load_weights` method.

In [36]:
import tensorflow as tf
from wav2vec2 import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("vasudevgupta/gsoc-wav2vec2-960h")

Loading weights locally from `vasudevgupta/gsoc-wav2vec2-960h`
Total number of loaded variables: 213


In the following cell, we define the forward pass.

In [37]:
def tf_forward(speech):
  tf_out = model(speech, training=False)
  return tf.squeeze(tf.argmax(tf_out, axis=-1))

Note: We intentionally do not wrap the forward pass in `tf.function(...)`, as we want the function to handle sequences of variable length.

Cons of restricting sequences to a constant length:

1.   While predicting, the model won't see the complete speech for very long sequences. This can lead to bad or truncated predictions, which in turn gives a poor metric value.
2.   The `Wav2Vec2` model doesn't accept `attention_mask/padding_mask` as an argument, so any padding of short sequences will hurt the metric.

In the following cell, we define functions to read the sound (`.flac`) and transcription (`.txt`) files.

As the `Wav2Vec2` model was trained on speech sampled at 16 kHz, we need to evaluate at the same sampling rate. Any change in sampling rate changes the input data distribution.

In [38]:
import soundfile as sf
import os

REQUIRED_SAMPLE_RATE = 16000
SPLIT = "test-clean"

def read_txt_file(path):
  # Each line looks like "<file_id> <TRANSCRIPT>". Blank lines and lines with
  # fewer than three tokens (i.e. one-word transcripts) are skipped.
  with open(path, "r") as f:
    samples = f.read().split("\n")
    samples = {s.split()[0]: " ".join(s.split()[1:]) for s in samples if len(s.split()) > 2}
  return samples

def read_flac_file(file_path):
  with open(file_path, "rb") as f:
    audio, sample_rate = sf.read(f)
  if sample_rate != REQUIRED_SAMPLE_RATE:
    raise ValueError(
        f"sample rate (={sample_rate}) of your files must be {REQUIRED_SAMPLE_RATE}"
    )
  # The file name without the ".flac" extension is the key shared with the transcript.
  file_id = os.path.split(file_path)[-1][:-len(".flac")]
  return {file_id: audio}
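
To make the ID matching concrete, here is a minimal sketch of how a transcript line and a `.flac` path reduce to the same key. The sample line and path are made up for illustration and only follow the `<file_id> <TRANSCRIPT>` layout that `read_txt_file` assumes. `parse_transcript_line` and `file_id_from_path` are hypothetical helpers, not part of the notebook.

```python
import os

def parse_transcript_line(line):
    """Split '<file_id> <TRANSCRIPT>' into (file_id, transcript)."""
    file_id, _, transcript = line.strip().partition(" ")
    return file_id, transcript.strip()

def file_id_from_path(path):
    """Drop the directory and the '.flac' extension to recover the file id."""
    return os.path.splitext(os.path.basename(path))[0]

# Hypothetical example values (not taken from the dataset).
line = "0000-000000-0001 HELLO WORLD AGAIN"
path = "LibriSpeech/test-clean/0000/000000/0000-000000-0001.flac"

file_id, transcript = parse_transcript_line(line)
# Both sources must yield the same key for the sample to be paired.
assert file_id == file_id_from_path(path)
print(file_id, "->", transcript)
```

Because the pairing relies on the set intersection of both key sets, any utterance missing from either side is silently left out of the evaluation.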

Let's now collect all the speech and text samples into a `List[Tuple]` for further processing of the dataset.

In [39]:
def fetch_sound_text_mapping():
  flac_files = tf.io.gfile.glob(f"LibriSpeech/{SPLIT}/*/*/*.flac")
  txt_files = tf.io.gfile.glob(f"LibriSpeech/{SPLIT}/*/*/*.txt")

  txt_samples = {}
  for f in txt_files:
    txt_samples.update(read_txt_file(f))

  speech_samples = {}
  for f in flac_files:
    speech_samples.update(read_flac_file(f))

  file_ids = set(speech_samples.keys()) & set(txt_samples.keys())
  print(f"{len(file_ids)} files are found in LibriSpeech/{SPLIT}")
  samples = [(speech_samples[file_id], txt_samples[file_id]) for file_id in file_ids]
  return samples

Note: The following cell loads the complete dataset into memory. This is fine because the test set is small.

In [40]:
samples = fetch_sound_text_mapping()

2618 files are found in LibriSpeech/test-clean


Let's take a look at a random sample:

In [41]:
from IPython.display import Audio
import soundfile as sf
import random

audio, text = random.choice(samples)
sf.write("sample.wav", audio, 16000)

print("Text Transcription:", text, "\nAudio:")
Audio(filename="sample.wav")

Text Transcription: THE QUESTION IS WHICH OF THE TWO METHODS WILL MOST EFFECTIVELY REACH THE PERSONS WHOSE CONVICTIONS IT IS DESIRED TO AFFECT 
Audio:


Now, we will perform the necessary processing on our test dataset.

`Wav2Vec2Processor(is_tokenizer=False)` normalizes the raw speech along the frames axis, while `Wav2Vec2Processor(is_tokenizer=True)` converts the model outputs into strings and takes care of removing special tokens (depending on your tokenizer configuration).

In [42]:
from wav2vec2 import Wav2Vec2Processor

tokenizer = Wav2Vec2Processor(is_tokenizer=True)
processor = Wav2Vec2Processor(is_tokenizer=False)
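
As a rough illustration of what normalizing along the frames axis means, the sketch below rescales a waveform to zero mean and unit variance. This is an assumption about the processor's behaviour (it is the usual Wav2Vec2 preprocessing), and `normalize_waveform` with its `eps` value is a hypothetical helper, not part of the `wav2vec2` package.

```python
import math

def normalize_waveform(samples, eps=1e-5):
    """Rescale a 1-D waveform to zero mean and (approximately) unit variance."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    # eps keeps the division stable for near-silent audio (assumed value).
    scale = math.sqrt(variance + eps)
    return [(x - mean) / scale for x in samples]

waveform = [0.1, -0.2, 0.4, 0.0, -0.3]  # toy values, not real audio
normalized = normalize_waveform(waveform)
print(sum(normalized) / len(normalized))  # close to 0
```

Normalizing each utterance separately keeps the input scale independent of recording volume.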

In the following cell, `DO_PADDING=True` would pad the speech sequences to `AUDIO_MAXLEN` and the labels to `LABEL_MAXLEN`. We set it to `False` here for the reasons discussed above.

In [43]:
AUDIO_MAXLEN, LABEL_MAXLEN = 246000, 256
DO_PADDING = False

def preprocess_text(text):
  label = tokenizer(text)
  label = tf.constant(label, dtype=tf.int32)[None]
  if DO_PADDING:
    label = label[:, :LABEL_MAXLEN]
    padding = tf.zeros((label.shape[0], LABEL_MAXLEN - label.shape[1]), dtype=label.dtype)
    label = tf.concat([label, padding], axis=-1)
  return label

def preprocess_speech(audio):
  audio = tf.constant(audio, dtype=tf.float32)
  audio = processor(audio)[None]
  if DO_PADDING:
    audio = audio[:, :AUDIO_MAXLEN]
    padding = tf.zeros((audio.shape[0], AUDIO_MAXLEN - audio.shape[1]), dtype=audio.dtype)
    audio = tf.concat([audio, padding], axis=-1)
  return audio
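
For a sense of scale, `AUDIO_MAXLEN = 246000` samples at 16 kHz is 246000 / 16000 = 15.375 seconds of audio. The sketch below mirrors the truncate-then-zero-pad logic of the cell above on a plain Python list. `pad_or_truncate` is a hypothetical helper used only for illustration.

```python
SAMPLE_RATE = 16000
AUDIO_MAXLEN = 246000

def pad_or_truncate(values, max_len, pad_value=0):
    """Clip to max_len items, then right-pad with pad_value up to max_len."""
    clipped = list(values[:max_len])
    return clipped + [pad_value] * (max_len - len(clipped))

print(AUDIO_MAXLEN / SAMPLE_RATE)              # 15.375
print(pad_or_truncate([1, 2, 3], 5))           # [1, 2, 3, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))  # [1, 2, 3, 4]
```

The second call shows the short-sequence case (zeros appended), the third the long-sequence case (tail dropped), which are the two failure modes listed earlier.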

Now we will wrap everything in a `tf.data.Dataset` with the help of its `.from_generator` method.

In [44]:
def inputs_generator():
  for speech, text in samples:
    yield preprocess_speech(speech), preprocess_text(text)

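# `shape=(None)` is the same as `shape=None`: the shape is left unspecified
# because every sample has a different length.
output_signature = (
    tf.TensorSpec(shape=(None), dtype=tf.float32),
    tf.TensorSpec(shape=(None), dtype=tf.int32),
)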
dataset = tf.data.Dataset.from_generator(inputs_generator, output_signature=output_signature)

In the following cell, we define a function that takes the dataset as an argument and returns the model's predictions along with the corresponding labels.

In [45]:
from tqdm.auto import tqdm

def infer_librispeech(dataset: tf.data.Dataset, num_batches: int = None):
  predictions, labels = [], []
  for batch in tqdm(dataset, total=num_batches, desc="LibriSpeech Inference ... "):
    speech, label = batch
    tf_out = tf_forward(speech)
    predictions.append(tokenizer.decode(tf_out.numpy().tolist(), group_tokens=True))
    labels.append(tokenizer.decode(label.numpy().squeeze().tolist(), group_tokens=False))
  return predictions, labels
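
The `group_tokens` flag reflects how greedy CTC decoding usually works: consecutive repeated ids are merged first, and the blank id is dropped afterwards. Labels are decoded with `group_tokens=False` because genuine double letters in the reference must not be merged. The sketch below shows this on a toy vocabulary. The vocabulary, the blank id of 0, and `ctc_greedy_decode` are assumptions for illustration, not the tokenizer's actual implementation.

```python
from itertools import groupby

# Toy vocabulary (assumption): id 0 plays the role of the CTC blank.
VOCAB = {0: "", 1: "H", 2: "E", 3: "L", 4: "O"}
BLANK_ID = 0

def ctc_greedy_decode(token_ids, group_tokens=True):
    """Collapse repeated ids (optional), then drop blanks and join characters."""
    if group_tokens:
        # groupby yields one key per run of identical consecutive ids.
        token_ids = [token_id for token_id, _ in groupby(token_ids)]
    return "".join(VOCAB[token_id] for token_id in token_ids if token_id != BLANK_ID)

# Per-frame argmax ids: the blank between the two runs of 3 keeps both L's.
frame_ids = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4]
print(ctc_greedy_decode(frame_ids))                           # HELLO

# A reference label has one id per character, so grouping would corrupt it.
label_ids = [1, 2, 3, 3, 4]
print(ctc_greedy_decode(label_ids, group_tokens=False))       # HELLO
print(ctc_greedy_decode(label_ids, group_tokens=True))        # HELO
```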

Let's run the above function:

In [46]:
predictions, labels = infer_librispeech(dataset, num_batches=2618)

LibriSpeech Inference ... :   0%|          | 0/2618 [00:00<?, ?it/s]

Now, let's look at a few samples and the model's predictions.

In [47]:
sample = random.choice(list(zip(samples, predictions)))
(speech, text_transcription), prediction = sample

print("ORIGINAL:", text_transcription, "\nPREDICTION:", prediction)
Audio(data=speech, rate=REQUIRED_SAMPLE_RATE)

ORIGINAL: MY FRIEND'S TEMPER HAD NOT IMPROVED SINCE HE HAD BEEN DEPRIVED OF THE CONGENIAL SURROUNDINGS OF BAKER STREET 
PREDICTION: MY FRIEND'S TEMPER HAD NOT IMPROVED SINCE HE HAD BEEN DEPRIVED OF THE CONGENIAL SURROUNDINGS OF BAKER STREET


In [48]:
sample = random.choice(list(zip(samples, predictions)))
(speech, text_transcription), prediction = sample

print("ORIGINAL:", text_transcription, "\nPREDICTION:", prediction)
Audio(data=speech, rate=REQUIRED_SAMPLE_RATE)

ORIGINAL: I WAS BOOKKEEPER SO IT WAS EASY TO GET A BLANK CHECK AND FORGE THE SIGNATURE 
PREDICTION: I WAS BIT KEEPER SO IT WAS EASY TO GET A BLANK CHECK AND FORGE THE SIGNATURE


Now, we will calculate the **Word Error Rate (WER)** to judge how well the model performs across all the samples. We will use the `load_metric(...)` function from HuggingFace `datasets`.

In [49]:
!pip3 install -q datasets

from datasets import load_metric
wer = load_metric("wer")

Let's compute the metric value in the following cell:

In [50]:
wer.compute(references=labels, predictions=predictions)

0.03385703960132385
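
A score of about 0.034 means roughly 3.4 word errors per 100 reference words. WER is commonly defined as (substitutions + deletions + insertions) divided by the number of words in the reference. The sketch below computes it for a single sentence pair with a word-level edit distance. `word_error_rate` is a hypothetical helper for illustration, not the code behind `load_metric("wer")`, which also aggregates over the whole test set.

```python
def word_error_rate(reference, prediction):
    """Word-level edit distance divided by the number of reference words."""
    ref_words, pred_words = reference.split(), prediction.split()
    # previous_row[j]: edits needed to turn the reference words seen so far
    # into the first j predicted words (starts with the empty reference).
    previous_row = list(range(len(pred_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, pred_word in enumerate(pred_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != pred_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)

reference = "I WAS BOOKKEEPER SO IT WAS EASY TO GET A BLANK CHECK AND FORGE THE SIGNATURE"
prediction = "I WAS BIT KEEPER SO IT WAS EASY TO GET A BLANK CHECK AND FORGE THE SIGNATURE"
print(word_error_rate(reference, prediction))  # 0.125
```

In this pair from the samples above, `BOOKKEEPER` became `BIT KEEPER`: one substitution plus one insertion against 16 reference words, giving 2 / 16 = 0.125.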