<a href="https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/multi_dataset_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multi-Dataset Evaluation with 🤗 Transformers and Datasets

by: [Sanchit Gandhi](https://huggingface.co/sanchit-gandhi)

Automatic Speech Recognition (ASR) models are measured by their performance on unseen audio data. In this Colab we'll measure the performance of OpenAI's [Whisper model](https://openai.com/blog/whisper/) on **8 ASR datasets** with one script. Using streaming mode, we'll require no more than 20GB of disk space to achieve this.

## Prepare Environment

Let's begin by installing the packages we'll need to process audio datasets. We require the Unix package `ffmpeg` version 4. We'll also need the Python package `datasets`, as well as some other popular Hugging Face libraries like `transformers` and `evaluate` for our ASR pipeline.

*Note*: Do make sure to select a GPU runtime if you haven't already!

In [2]:
!add-apt-repository -y ppa:jonathonf/ffmpeg-4 && apt update && apt install -y ffmpeg
!pip install --quiet datasets transformers evaluate huggingface_hub jiwer


We strongly advise linking the notebook to the [Hugging Face Hub](https://huggingface.co). This enables you to log in and access "gated" datasets on the Hub.

Linking the notebook to the Hub is straightforward - it simply requires entering your Hub authentication token when prompted. Find your Hub authentication token [here](https://huggingface.co/settings/tokens):

In [4]:
from huggingface_hub import login

login()


## Load Datasets

Audio datasets are very large. This causes two issues:
1. They require a significant amount of *storage* to download
2. They take a significant amount of *time* to download and process

The storage and time requirements present limitations to most speech researchers. However, both can be solved with 🤗 Datasets.

With streaming mode, we can load and prepare samples as we iterate over the dataset. Since the samples are loaded progressively, we can get started with a dataset without waiting for the entire dataset to download. This way, we only have the data when we need it, and not when we don't!
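
To make the idea of progressive loading concrete, here is a toy illustration in plain Python. It is not how 🤗 Datasets implements streaming; `stream_samples` is a hypothetical stand-in for a remote dataset, used only to show that samples are produced one at a time and only on demand.

```python
from itertools import islice


def stream_samples(num_samples):
    """Hypothetical stand-in for a remote dataset: yields one sample at a time."""
    for idx in range(num_samples):
        # In a real streamed dataset, this is roughly where a sample would be fetched and decoded.
        yield {"id": idx, "text": f"utterance {idx}"}


# Creating the generator does no work; nothing is produced until we iterate.
stream = stream_samples(1_000_000)

# Take only the first three samples; the remaining ones are never generated.
first_three = list(islice(stream, 3))
print([sample["id"] for sample in first_three])
```

Because the generator is lazy, asking for three samples out of a million costs about as much as asking for three samples out of ten.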

Using streaming mode, we'll evaluate the Whisper model on the nine test sets from the [End-to-end Speech Benchmark (ESB)](https://arxiv.org/abs/2210.13352). Typically, this would require **several hundred** gigabytes of storage space to download. With streaming mode, this is all possible in a single Google Colab notebook.

First, we'll load the nine test sets from the ESB benchmark in streaming mode:

In [8]:
from datasets import load_dataset

librispeech_clean = load_dataset("librispeech_asr", "all", split="test.clean", streaming=True)
librispeech_other = load_dataset("librispeech_asr", "all", split="test.other", streaming=True)

common_voice = load_dataset("mozilla-foundation/common_voice_11_0", "en", revision="streaming", split="test", streaming=True, use_auth_token=True)

voxpopuli = load_dataset("facebook/voxpopuli", "en", split="test", streaming=True)

tedlium = load_dataset("LIUM/tedlium", "release3", split="test", streaming=True)

gigaspeech = load_dataset("speechcolab/gigaspeech", "xs", split="test", streaming=True, use_auth_token=True)

spgispeech = load_dataset("kensho/spgispeech", "S", split="test", streaming=True, use_auth_token=True)

earnings22 = load_dataset("anton-l/earnings22_baseline_5_gram", split="test", streaming=True)

ami = load_dataset("edinburghcstr/ami", "ihm", split="test", streaming=True)


Next, we create a dictionary of dataset names and dataset objects. This gives us an easy lookup table in our evaluation loop.

In [9]:
esb_datasets = {
    "LibriSpeech Clean": librispeech_clean,
    "LibriSpeech Other": librispeech_other,
    "Common Voice": common_voice,
    "VoxPopuli": voxpopuli,
    "TEDLIUM": tedlium,
    "GigaSpeech": gigaspeech,
    "SPGISpeech": spgispeech,
    "Earnings-22": earnings22,
    "AMI": ami
}

We then define a 'helper function' that gets the correct transcription column from our dataset. We'll use this function to automatically get the right column names when we perform multi-dataset evaluation:

In [10]:
def get_text(sample):
    if "text" in sample:
        return sample["text"]
    elif "sentence" in sample:
        return sample["sentence"]
    elif "normalized_text" in sample:
        return sample["normalized_text"]
    elif "transcript" in sample:
        return sample["transcript"]
    else:
        raise ValueError(f"Sample: {sample.keys()} has no transcript.")
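
As a quick sanity check, the helper can be tried on a few hand-made samples. The sketch below uses a condensed rewrite of `get_text` with the same lookup order, and the `toy_samples` are hypothetical dictionaries that only mimic the different column names; they are not real dataset rows.

```python
def get_text(sample):
    # Same lookup order as the helper above, written as a loop.
    for column in ("text", "sentence", "normalized_text", "transcript"):
        if column in sample:
            return sample[column]
    raise ValueError(f"Sample: {sample.keys()} has no transcript.")


# Hypothetical samples, each using a different transcription column name.
toy_samples = [
    {"audio": None, "text": "hello world"},
    {"audio": None, "sentence": "Hello, world!"},
    {"audio": None, "transcript": "HELLO WORLD"},
]

transcriptions = [get_text(sample) for sample in toy_samples]
print(transcriptions)
```

A sample with none of the four column names raises a `ValueError`, which surfaces an unexpected dataset schema early rather than silently skipping rows.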

## Load Whisper Model

We'll create an ASR evaluation pipeline using the 🤗 Transformers [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines) method. `pipeline` will take care of the data pre-processing and the text generation. All we have to do is pass the audio inputs to `pipeline` and assess the returned predictions against the reference transcriptions.

We'll evaluate the official OpenAI [Whisper tiny.en](https://huggingface.co/openai/whisper-tiny.en) checkpoint. Let's load it with `pipeline`:

In [11]:
from transformers import pipeline

whisper_asr = pipeline(
    "automatic-speech-recognition", model="openai/whisper-tiny.en", device=0
)


## Load the Word Error Rate metric

We'll assess our system using the [Word Error Rate (WER)](https://huggingface.co/spaces/evaluate-metric/wer) metric, the de facto metric for assessing ASR systems. We'll load the WER metric from the 🤗 Evaluate library:

In [12]:
import evaluate

wer_metric = evaluate.load("wer")


Bonus: You can also try other evaluation methods like the [Character Error Rate (CER)](https://huggingface.co/spaces/evaluate-metric/cer). For the CER, update the above statement to `evaluate.load("cer")`.
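
For intuition, the WER of a single sentence is the number of word substitutions, deletions and insertions needed to turn the prediction into the reference, divided by the number of reference words. The sketch below is a minimal plain-Python version of that definition. It is not the 🤗 Evaluate implementation, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference, prediction):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    pred_words = prediction.split()

    # previous_row[j]: edit distance between the reference words seen so far
    # and the first j predicted words (starts with zero reference words seen).
    previous_row = list(range(len(pred_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, pred_word in enumerate(pred_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != pred_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(100 * wer, 2))
```

Note that the denominator is the number of reference words, so an empty reference leads to a division by zero in this sketch.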

## Normalisation

The [Whisper paper](https://cdn.openai.com/papers/whisper.pdf) demonstrates the drastic effect that normalising the text outputs has on WER. The normalisation step is important as it removes errors unrelated to the speech recognition task, such as casing and punctuation. It also makes the formatting consistent between references and predictions by converting spelled-out numbers to symbolic form (e.g. "two" -> "2") and British spellings to American (e.g. "grey" -> "gray").
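
To see why casing and punctuation matter, consider the toy example below. `toy_normalise` is a deliberately simplified, hypothetical stand-in that only lower-cases, strips punctuation and collapses whitespace. It is not the Whisper English normaliser, which additionally handles numbers, spellings and more.

```python
import re
import string


def toy_normalise(text):
    """Simplified stand-in: lower-case, drop punctuation, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


reference = "Hello, world! How are you?"
prediction = "hello world how are you"

# Without normalisation, the word sequences differ purely because of formatting.
raw_match = reference.split() == prediction.split()

# After normalisation, the two transcriptions are identical.
normalised_match = toy_normalise(reference) == toy_normalise(prediction)

print(raw_match, normalised_match)
```

Without normalisation, almost every word in this example would be counted as an error even though the words spoken are recognised correctly.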

We first write a function to normalise the reference of a single sample according to the Whisper English text normaliser:

In [13]:
whisper_norm = whisper_asr.tokenizer._normalize

def normalise(batch):
    batch["norm_text"] = whisper_norm(get_text(batch))
    return batch

We'll apply this function to our data using 🤗 Datasets' [`.map`](https://huggingface.co/docs/datasets/process#map) method in our evaluation pipeline.

We also need to remove any empty reference transcriptions from our dataset, as these will give a division-by-zero error in the WER calculation.

We write a function that indicates which samples to keep and which to discard. This function, `is_target_text_in_range`, returns a boolean: reference transcriptions that are not empty return True, and those that are empty (or that consist solely of the placeholder string `ignore time segment in scoring`) return False:

In [14]:
filter_sequences = ["ignore time segment in scoring", ""]

def is_target_text_in_range(ref):
    ref = ref.strip()
    return ref not in filter_sequences

We'll apply this function to our data using 🤗 Datasets' [`.filter`](https://huggingface.co/docs/datasets/process) method in our evaluation pipeline.
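
The behaviour of the filter can be checked on a handful of made-up references. The `toy_references` below are hypothetical strings chosen to cover each case: a normal transcription, an empty string, a whitespace-only string, and the placeholder from `filter_sequences`.

```python
filter_sequences = ["ignore time segment in scoring", ""]


def is_target_text_in_range(ref):
    # Strip surrounding whitespace so that whitespace-only references count as empty.
    ref = ref.strip()
    return ref not in filter_sequences


# Hypothetical references covering the keep and discard cases.
toy_references = ["hello world", "", "   ", "ignore time segment in scoring"]

kept = [ref for ref in toy_references if is_target_text_in_range(ref)]
print(kept)
```

Only the genuine transcription survives; the other three references would be dropped before the WER is computed.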

## Multi-Dataset Evaluation

In this final section, we combine everything together to form the multi-dataset evaluation loop for the Whisper model.

First, we define a generator that iterates over the dataset and yields the audio samples and reference text ready for our model:

In [15]:
def data(dataset):
    for item in dataset:
        yield {**item["audio"], "reference": item["norm_text"]}
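
To see what this generator hands to the model, it can be run on a single hand-made item. The `toy_dataset` below is a hypothetical list with made-up values, and the `speaker_id` column is an assumed extra column included only to show that it is dropped.

```python
def data(dataset):
    for item in dataset:
        # Unpack the audio dict and attach the normalised reference alongside it.
        yield {**item["audio"], "reference": item["norm_text"]}


# Hypothetical item with a tiny fake waveform and one unrelated column.
toy_dataset = [
    {
        "audio": {"array": [0.0, 0.1, -0.1], "sampling_rate": 16000},
        "norm_text": "hello world",
        "speaker_id": 7,
    },
]

first = next(data(toy_dataset))
print(sorted(first.keys()))
```

The yielded dictionary contains the flattened audio fields plus a `reference` key, while unrelated columns are left behind.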

We then loop over the ESB datasets and compute the individual WER scores, combining the single-dataset evaluation steps into one loop. We store the WER results in a separate list to display later.

We only evaluate on the first 128 samples of each dataset to demonstrate how this script can be used for multi-dataset evaluation with streaming mode. If you want to evaluate on the entire dataset, comment out or remove the line `dataset = dataset.take(128)`.

In [16]:
from datasets import Audio

# set the batch size according to your device
BATCH_SIZE = 16
wer_results = []

# loop over all the datasets in the ESB benchmark
for dataset_name, dataset in esb_datasets.items():
    # only for debugging: restricts each dataset to its first 128 samples
    dataset = dataset.take(128)

    # resample to 16kHz
    dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

    # normalise references
    dataset = dataset.map(normalise)

    # remove any empty references
    dataset = dataset.filter(is_target_text_in_range, input_columns=["norm_text"])

    # placeholders for predictions and references
    predictions = []
    references = []

    # run streamed inference
    for out in whisper_asr(data(dataset), batch_size=BATCH_SIZE):
        predictions.append(whisper_norm(out["text"]))
        references.append(out["reference"][0])

    # compute the WER
    wer = wer_metric.compute(references=references, predictions=predictions)
    wer = round(100 * wer, 2)

    wer_results.append(wer)

Alright! In one code cell we managed to evaluate over nine different test sets! Let's print the results in tabular form:

In [17]:
import pandas as pd

df = pd.DataFrame({"Dataset": esb_datasets.keys(), "WER": wer_results})
df

| **Dataset**       | **WER** |
|-------------------|---------|
| LibriSpeech Clean | 4.73    |
| LibriSpeech Other | 16.17   |
| Common Voice      | 63.27   |
| VoxPopuli         | 10.22   |
| TEDLIUM           | 5.16    |
| GigaSpeech        | 10.62   |
| SPGISpeech        | 6.67    |
| Earnings-22       | 22.05   |
| AMI               | 24.93   |


We ran the above evaluation script for the Whisper "tiny.en" and "small.en" checkpoints on the full datasets from the ESB benchmark. The resulting WERs (in %) were as follows:

| **Dataset name**  | **Whisper tiny.en** | **Whisper small.en** |
|-------------------|---------------------|----------------------|
| LibriSpeech Clean | 5.66                | 3.05                 |
| LibriSpeech Other | 15.38               | 7.53                 |
| Common Voice      | 31.17               | 15.20                |
| VoxPopuli         | 12.58               | 8.45                 |
| TEDLIUM           | 14.28               | 12.21                |
| GigaSpeech        | 14.07               | 11.36                |
| SPGISpeech        | 5.82                | 3.63                 |
| Earnings-22       | 13.79               | 16.40                |
| AMI               | 24.68               | 17.88                |