<a href="https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/audio_datasets_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Complete Guide to Audio Datasets [Colab Edition]

by: [Vaibhav (VB) Srivastav](https://twitter.com/reach_vb) and [Sanchit Gandhi](https://huggingface.co/sanchit-gandhi)

The objective of this Colab is to reinforce the 🤗 Datasets concepts covered in the accompanying [blog post](https://huggingface.co/blog/audio-datasets) through more 'hands-on' examples. We advise reading the blog post before running this Colab. Here, we'll extend the concepts from the blog post to build an end-to-end speech recognition pipeline.

Automatic Speech Recognition (ASR) models are measured by their performance on unseen audio data. In this Colab we'll measure the performance of OpenAI's [Whisper model](https://openai.com/blog/whisper/) on **8 ASR datasets** with one script. Using streaming mode, we'll require no more than 20GB of disk space to achieve this.

# Prepare Environment

Let's begin by installing the packages we'll need to process audio datasets. We require the Unix package `ffmpeg` version 4. We'll also need the Python package `datasets`, as well as some other popular Hugging Face libraries like `transformers` and `evaluate` for our ASR pipeline.

*Note*: Do make sure to select a GPU runtime if you haven't already!

In [None]:
!add-apt-repository -y ppa:jonathonf/ffmpeg-4 && apt update && apt install -y ffmpeg
!pip install --quiet datasets transformers evaluate huggingface_hub jiwer

We strongly advise you to link the notebook to the [Hugging Face Hub](https://huggingface.co). This enables you to log in and access "gated" datasets on the Hub.

Linking the notebook to the Hub is straightforward - it simply requires entering your Hub authentication token when prompted. Find your Hub authentication token [here](https://huggingface.co/settings/tokens):

In [None]:
from huggingface_hub import login

login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful


# Load & Prepare an Audio Dataset

With 🤗 Datasets, we can load and prepare an audio dataset with just one line of Python code. 

In this section we'll load the [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech) dataset from SpeechColab. Make sure you've accepted the dataset's terms of use if you haven't done so already: https://huggingface.co/datasets/speechcolab/gigaspeech

### Load the dataset

Audio datasets are very large. This causes two issues:
1. They require a significant amount of *storage* to download
2. They take a significant amount of *time* to download and process

The storage and time requirements present limitations to most speech researchers. However, both can be solved with 🤗 Datasets.

With streaming mode, we can load and prepare samples as we iterate over the dataset. Since the samples are loaded progressively, we can get started with a dataset without waiting for the entire dataset to download. This way, we only have the data when we need it, and not when we don't!
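
To build intuition for this lazy loading, here is a toy sketch that uses a plain Python generator. It is only an analogy: `stream_samples` is a hypothetical stand-in for a remote dataset and does not reflect how 🤗 Datasets is implemented internally.

```python
from itertools import islice


def stream_samples(num_samples):
    """Hypothetical stand-in for a remote dataset: yields one sample at a time."""
    for idx in range(num_samples):
        # In a real streaming dataset, this is roughly where a sample would be
        # downloaded and decoded. Here we just build a small dictionary.
        yield {"id": idx, "text": f"sample {idx}"}


# Creating the generator does no work yet: nothing has been "downloaded".
stream = stream_samples(1_000_000)

# Pull only the first three samples; the remaining ones are never produced.
first_three = list(islice(stream, 3))
print(first_three)
```

Only the samples we actually request are ever materialised, which is the property that keeps the disk and time requirements low.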

Let's load the test split of the GigaSpeech dataset with streaming mode:

In [None]:
from datasets import load_dataset

dataset = load_dataset(
    "speechcolab/gigaspeech", "xs", split="test", streaming=True, use_auth_token=True
)

Great! We have the dataset ready to download the first chunk. Let's stream the first sample:

In [None]:
print(next(iter(dataset)))

{'segment_id': 'YOU1000000134_S0000042', 'speaker': 'N/A', 'text': 'ONE OF THEIR STANFORD PROFESSORS USED TO SAY <COMMA> WELL <COMMA> THE DIFFERENCE BETWEEN THE TWO OF THEM WAS THAT SERGEI WOULD JUST BURST INTO MY OFFICE WITHOUT ASKING <PERIOD> LARRY WOULD KNOCK AND THEN BURST IN <PERIOD>', 'audio': {'path': 'test_chunks_0000/YOU1000000134_S0000042.wav', 'array': array([-0.00210571, -0.00164795, -0.00253296, ...,  0.00012207,
       -0.00064087, -0.0012207 ]), 'sampling_rate': 16000}, 'begin_time': 223.662, 'end_time': 233.533, 'audio_id': 'YOU1000000134', 'title': 'YOU1000000134', 'url': 'N/A', 'source': 2, 'category': 10, 'original_full_path': 'audio/youtube/P0000/YOU1000000134.opus'}


Great! Now we can listen to the audio and print the text:

In [None]:
import IPython.display as ipd

sample = next(iter(dataset))
audio = sample["audio"]

print(sample["text"])
ipd.Audio(data=audio["array"], autoplay=True, rate=audio["sampling_rate"])

ONE OF THEIR STANFORD PROFESSORS USED TO SAY <COMMA> WELL <COMMA> THE DIFFERENCE BETWEEN THE TWO OF THEM WAS THAT SERGEI WOULD JUST BURST INTO MY OFFICE WITHOUT ASKING <PERIOD> LARRY WOULD KNOCK AND THEN BURST IN <PERIOD>


Lovely! The audio matches the corresponding transcription, as expected.

### Pre-Process the Dataset

Most ASR systems expect the audio inputs to be sampled at 16kHz. We can set the sampling rate of our audio dataset through the [`cast_column`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=cast_column#datasets.DatasetDict.cast_column) method. This doesn't change the dataset in-place, but resamples each audio sample on the fly when it is loaded.

In [None]:
from datasets import Audio

dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
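
As a rough sanity check on what a sampling rate means, the sketch below computes how many audio samples a clip should contain at a given rate (samples = duration × sampling rate). The timestamps are copied from the GigaSpeech sample printed earlier; `expected_num_samples` is a hypothetical helper, and the length of the decoded array may differ slightly depending on how the audio is decoded.

```python
# Values copied from the GigaSpeech sample printed earlier in this notebook.
begin_time = 223.662  # seconds
end_time = 233.533  # seconds
duration = end_time - begin_time


def expected_num_samples(duration_s, sampling_rate):
    """Number of audio samples needed to cover `duration_s` seconds."""
    return round(duration_s * sampling_rate)


print(f"duration: {duration:.3f} s")
print("samples at 16 kHz:", expected_num_samples(duration, 16_000))
print("samples at 48 kHz:", expected_num_samples(duration, 48_000))
```

The same clip needs three times as many samples at 48kHz as at 16kHz, which is why a model trained on 16kHz audio needs its inputs resampled to that rate.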

We then define a 'helper function' that gets the correct transcription column from our dataset. We'll use this function to automatically get the right column names when we perform multi-dataset evaluation.

In [None]:
def get_text(sample):
    if "text" in sample:
        return sample["text"]
    elif "sentence" in sample:
        return sample["sentence"]
    elif "normalized_text" in sample:
        return sample["normalized_text"]
    elif "transcript" in sample:
        return sample["transcript"]
    else:
        raise ValueError(f"Sample: {sample.keys()} has no transcript.")
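
To see the helper in action, here is a small self-contained sketch. It re-implements `get_text` in an equivalent loop form so that it runs on its own, and the sample dictionaries are made up to mimic the column layouts of different datasets.

```python
# Column names used for transcriptions, in the same priority order as `get_text`.
TEXT_COLUMNS = ("text", "sentence", "normalized_text", "transcript")


def get_text(sample):
    """Return the transcription from whichever known column is present."""
    for column in TEXT_COLUMNS:
        if column in sample:
            return sample[column]
    raise ValueError(f"Sample: {sample.keys()} has no transcript.")


# Hypothetical samples that mimic the column layouts of different datasets.
samples = [
    {"text": "hello world", "audio": None},
    {"sentence": "good morning", "audio": None},
    {"transcript": "thank you", "audio": None},
]
print([get_text(sample) for sample in samples])
```

Keeping the column names in one tuple makes it easy to support a further dataset: add its transcription column name to `TEXT_COLUMNS`.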

# Evaluate Whisper

With the processed dataset ready, we can create an ASR evaluation pipeline using the 🤗 Transformers [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines) method. `pipeline` will take care of the data pre-processing and the text generation. All we have to do is pass the audio inputs to `pipeline` and assess the returned predictions against the reference transcriptions!

We'll evaluate the official OpenAI [Whisper tiny.en](https://huggingface.co/openai/whisper-tiny.en) checkpoint.

P.S. You can use `pipeline` with any ASR model on the [Hugging Face Hub](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads), including different Whisper checkpoints or even Wav2Vec2. Simply switch the model identifier for the model checkpoint you wish to evaluate.

In [None]:
from transformers import pipeline

whisper_asr = pipeline(
    "automatic-speech-recognition", model="openai/whisper-tiny.en", device=0
)

In [None]:
whisper_asr.model.config.suppress_tokens.remove(6)
whisper_asr.model.config.suppress_tokens.remove(12)

### Load the Word Error Rate metric

We'll assess our system using the [Word Error Rate (WER)](https://huggingface.co/spaces/evaluate-metric/wer) metric, the 'de-facto' metric for assessing ASR systems. We'll load the WER metric from the 🤗 Evaluate library:

In [None]:
import evaluate

wer_metric = evaluate.load("wer")

Bonus: You can also try other evaluation methods like the [Character Error Rate (CER)](https://huggingface.co/spaces/evaluate-metric/cer). For the CER, update the above statement to `evaluate.load("cer")`
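
For intuition about what the metric measures, here is a minimal pure-Python sketch of WER: the number of word substitutions, deletions and insertions needed to turn the prediction into the reference, divided by the number of reference words. `word_edit_distance` and `toy_wer` are illustrative names, not part of 🤗 Evaluate, and the sketch assumes errors and word counts are pooled over all samples before dividing.

```python
def word_edit_distance(ref_words, hyp_words):
    """Minimum number of word substitutions, deletions and insertions."""
    # previous[j] holds the distance between the reference words processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def toy_wer(references, predictions):
    """Total word errors divided by total reference words."""
    errors = sum(
        word_edit_distance(ref.split(), hyp.split())
        for ref, hyp in zip(references, predictions)
    )
    total_words = sum(len(ref.split()) for ref in references)
    return errors / total_words


references = ["the cat sat on the mat", "hello world"]
predictions = ["the cat sit on mat", "hello world"]
print(round(100 * toy_wer(references, predictions), 2))  # 2 errors / 8 words -> 25.0
```

Note that the denominator is the number of reference words, so a set of references with no words at all would trigger a division by zero in this sketch.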

### Normalisation

The [Whisper paper](https://cdn.openai.com/papers/whisper.pdf) demonstrates the drastic effect that normalising the text outputs has on WER. The normalisation step is important as it removes errors unrelated to the speech recognition task, such as casing and punctuation. It also makes the formatting consistent between references and predictions by converting spelled out numbers to symbolic form (e.g. "two" -> "2") and British spellings to American (e.g. "grey" -> "gray").
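
The toy sketch below illustrates why this matters. `toy_normalise` is a deliberately simplified stand-in, not Whisper's normaliser: it only lower-cases, drops tags such as `<COMMA>`, and strips punctuation. It does not convert numbers or spellings.

```python
import re
import string


def toy_normalise(text):
    """Simplified stand-in for a text normaliser (NOT Whisper's implementation)."""
    text = text.lower()
    # Drop tags written in angle brackets, such as <COMMA> or <PERIOD>.
    text = re.sub(r"<[^>]*>", " ", text)
    # Remove ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


reference = "WELL <COMMA> LARRY WOULD KNOCK AND THEN BURST IN <PERIOD>"
prediction = "Well, Larry would knock and then burst in."

print(toy_normalise(reference))
print(toy_normalise(reference) == toy_normalise(prediction))
```

Without normalisation, these two strings would be scored as almost entirely different, even though they contain the same spoken words.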

We first write a function to normalise the reference of a single sample according to the Whisper English text normaliser:

In [None]:
whisper_norm = whisper_asr.tokenizer._normalize

def normalise(batch):
    batch["norm_text"] = whisper_norm(get_text(batch))
    return batch

We then use 🤗 Datasets' [`map`](https://huggingface.co/docs/datasets/v2.6.1/en/process#map) method to apply our normalising function across the entire dataset:

In [None]:
dataset = dataset.map(normalise)

We need to remove any empty reference transcriptions from our dataset, as these will give a divide by 0 error in the WER calculation.

We write a function that indicates which samples to keep, and which to discard. This function, `is_target_text_in_range`, returns a boolean: reference transcriptions that are not empty return True, and those that are empty, or that consist only of the placeholder text `ignore time segment in scoring`, return False:

In [None]:
def is_target_text_in_range(ref):
    if ref.strip() == "ignore time segment in scoring":
        return False
    else:
        return ref.strip() != ""

We can apply this filtering function to all of our examples using 🤗 Datasets' [`filter`](https://huggingface.co/docs/datasets/process#select-and-filter) method, keeping all references that are not empty (True) and discarding those that are (False):

In [None]:
dataset = dataset.filter(is_target_text_in_range, input_columns=["norm_text"])
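
As a quick check of the predicate's behaviour, the sketch below applies it to a handful of made-up reference strings. A list comprehension stands in for the `filter` method here, and the function is repeated so that the snippet runs on its own.

```python
def is_target_text_in_range(ref):
    """Keep a reference unless it is empty or the scoring placeholder."""
    if ref.strip() == "ignore time segment in scoring":
        return False
    else:
        return ref.strip() != ""


# Hypothetical normalised references, including the cases we want to drop.
refs = ["hello world", "", "   ", "ignore time segment in scoring", "thank you"]

kept = [ref for ref in refs if is_target_text_in_range(ref)]
print(kept)
```

Empty strings, whitespace-only strings and the placeholder are all discarded, so every remaining reference has at least one word.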

## Single Dataset Evaluation

Since we're in streaming mode, we won't run inference in place, but rather signal to 🤗 Datasets to perform inference _on the fly_ when the dataset is iterated.

We first define a generator that iterates over the dataset and yields the audio samples and reference text:

In [None]:
def data(dataset):
    for item in dataset:
        yield {**item["audio"], "reference": item["norm_text"]}

We then set our batch size. We also restrict the number of samples for evaluation to 128 for the purpose of this Colab. If you want to run on the full dataset to get the official results, comment out or remove the `dataset.take(128)` line from the following code cell!

In [None]:
# set the batch size according to your device
BATCH_SIZE = 16

# only for debugging, restricts the number of rows to numeric value in brackets
dataset = dataset.take(128)

We pass the generator to the pipeline to run inference:

In [None]:
predictions = []
references = []

# run streamed inference
for out in whisper_asr(data(dataset), batch_size=BATCH_SIZE):
    predictions.append(whisper_norm(out["text"]))
    references.append(out["reference"][0])
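
To illustrate the general idea of consuming a generator in batches while keeping predictions and references aligned, here is a toy sketch. Everything in it is hypothetical: `batched`, `fake_data` and `fake_asr` are stand-ins written for this example, and the sketch makes no claim about how `pipeline` batches its inputs internally.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Group items from any iterable into lists of at most `batch_size`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_data(num_samples):
    """Hypothetical stand-in for the `data` generator defined above."""
    for idx in range(num_samples):
        yield {"array": [0.0] * 4, "reference": f"utterance {idx}"}


def fake_asr(batch):
    """Hypothetical stand-in for a model: 'transcribes' by echoing the reference."""
    return [{"text": item["reference"]} for item in batch]


predictions, references, batch_sizes = [], [], []
for batch in batched(fake_data(10), batch_size=4):
    batch_sizes.append(len(batch))
    for item, out in zip(batch, fake_asr(batch)):
        predictions.append(out["text"])
        references.append(item["reference"])

print(batch_sizes)
print(predictions == references)
```

Because samples are pulled from the generator only when a batch is assembled, at most one batch of audio needs to be held in memory at a time.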

We can now pass on our list of references and predictions to the WER evaluate function to compute the WER:

In [None]:
wer = wer_metric.compute(references=references, predictions=predictions)
wer = round(100 * wer, 2)

print("WER:", wer)

WER: 11.67


Pretty good! If we run the Whisper tiny.en model on the full test set we can expect to achieve a WER of 14.07%. State-of-the-art models achieve 10.5% WER on the same test set (_c.f._ [GigaSpeech Leaderboard](https://github.com/SpeechColab/GigaSpeech#leaderboard)).

We could certainly improve our zero-shot result with fine-tuning. The [ESB paper](https://arxiv.org/abs/2210.13352) achieves 10.5% WER fine-tuning the [medium.en](https://huggingface.co/openai/whisper-medium.en) checkpoint on GigaSpeech, equalling state-of-the-art. See the blog post ["Fine-Tune Whisper"](https://huggingface.co/blog/fine-tune-whisper) for a guide to fine-tuning Whisper with 🤗 Transformers.

# Evaluation on 8 Datasets

Compared to evaluating on a single dataset, multi-dataset evaluation gives a better metric for the generalisation abilities of a speech recognition system (_c.f._ [End-to-end Speech Benchmark (ESB)](https://arxiv.org/abs/2210.13352)). An ASR model should not only work well on one set of audio conditions (e.g. narrated audiobooks), but should be able to handle the full spectrum of background noise, speakers, accents and domains.

First, we'll load the nine test sets from the ESB benchmark in streaming mode:

In [None]:
librispeech_clean = load_dataset("librispeech_asr", "all", split="test.clean", streaming=True)
librispeech_other = load_dataset("librispeech_asr", "all", split="test.other", streaming=True)

common_voice = load_dataset("mozilla-foundation/common_voice_11_0", "en", revision="streaming", split="test", streaming=True, use_auth_token=True)

voxpopuli = load_dataset("facebook/voxpopuli", "en", split="test", streaming=True)

tedlium = load_dataset("LIUM/tedlium", "release3", split="test", streaming=True)

gigaspeech = load_dataset("speechcolab/gigaspeech", "xs", split="test", streaming=True, use_auth_token=True)

spgispeech = load_dataset("kensho/spgispeech", "S", split="test", streaming=True, use_auth_token=True)

earnings22 = load_dataset("anton-l/earnings22_baseline_5_gram", split="test", streaming=True)

ami = load_dataset("edinburghcstr/ami", "ihm", split="test", streaming=True)



Next, we create a dictionary of dataset names and dataset objects. This gives us an easy lookup table in our evaluation loop.

In [None]:
esb_datasets = {
    "LibriSpeech Clean": librispeech_clean,
    "LibriSpeech Other": librispeech_other,
    "Common Voice": common_voice,
    "VoxPopuli": voxpopuli,
    "TEDLIUM": tedlium,
    "GigaSpeech": gigaspeech,
    "SPGISpeech": spgispeech,
    "Earnings-22": earnings22,
    "AMI": ami
}

Finally, we loop over the ESB datasets and compute the individual WER scores, combining the single-dataset evaluation steps into one loop. We store the WER results in a separate list to display later.

Again, we only evaluate on the first 128 samples of each dataset. If you want to evaluate on the entire dataset, comment out or remove the `dataset.take(128)` line!

In [None]:
wer_results = []

# loop over all the datasets in the ESB benchmark
for dataset_name, dataset in esb_datasets.items():
    # only for debugging, restricts the number of rows to numeric value in brackets
    dataset = dataset.take(128)

    # resample to 16kHz
    dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

    # normalise references
    dataset = dataset.map(normalise)

    # remove any empty references
    dataset = dataset.filter(is_target_text_in_range, input_columns=["norm_text"])

    # placeholders for predictions and references
    predictions = []
    references = []

    # run streamed inference
    for out in whisper_asr(data(dataset), batch_size=BATCH_SIZE):
        predictions.append(whisper_norm(out["text"]))
        references.append(out["reference"][0])

    # compute the WER
    wer = wer_metric.compute(references=references, predictions=predictions)
    wer = round(100 * wer, 2)

    wer_results.append(wer)

Reading metadata...: 16354it [00:00, 25204.56it/s]


Alright! In one code cell we managed to evaluate over nine different test sets! Let's print the results in tabular form:

In [None]:
import pandas as pd

df = pd.DataFrame({"Dataset": esb_datasets.keys(), "WER": wer_results})
df

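|   | Dataset           | WER   |
|---|-------------------|-------|
| 0 | LibriSpeech Clean | 4.73  |
| 1 | LibriSpeech Other | 16.17 |
| 2 | Common Voice      | 63.27 |
| 3 | VoxPopuli         | 10.22 |
| 4 | TEDLIUM           | 5.16  |
| 5 | GigaSpeech        | 10.62 |
| 6 | SPGISpeech        | 6.67  |
| 7 | Earnings-22       | 48.45 |
| 8 | AMI               | 24.93 |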


We ran the above evaluation script for the Whisper "tiny.en" and "small.en" checkpoints on the full datasets from the ESB benchmark. The results of the run were as follows:

| **Dataset name**  | **Whisper tiny.en** | **Whisper small.en** |
|-------------------|---------------------|----------------------|
| LibriSpeech Clean | 5.66                | 3.05                 |
| LibriSpeech Other | 15.38               | 7.53                 |
| Common Voice      | 31.17               | 15.20                |
| VoxPopuli         | 12.58               | 8.45                 |
| TEDLIUM           | 14.28               | 12.21                |
| GigaSpeech        | 14.07               | 11.36                |
| SPGISpeech        | 5.82                | 3.63                 |
| Earnings-22       | 13.79               | 16.40                |
| AMI               | 24.68               | 17.88                |
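
To make the comparison between the two checkpoints easier to read, the sketch below computes the relative change in WER when moving from tiny.en to small.en. The numbers are copied from the table above; `relative_change` is a hypothetical helper written for this example.

```python
# (tiny.en WER, small.en WER) pairs copied from the table above.
results = {
    "LibriSpeech Clean": (5.66, 3.05),
    "LibriSpeech Other": (15.38, 7.53),
    "Common Voice": (31.17, 15.20),
    "VoxPopuli": (12.58, 8.45),
    "TEDLIUM": (14.28, 12.21),
    "GigaSpeech": (14.07, 11.36),
    "SPGISpeech": (5.82, 3.63),
    "Earnings-22": (13.79, 16.40),
    "AMI": (24.68, 17.88),
}


def relative_change(tiny_wer, small_wer):
    """Percentage change in WER going from tiny.en to small.en (negative is better)."""
    return 100 * (small_wer - tiny_wer) / tiny_wer


for name, (tiny_wer, small_wer) in results.items():
    print(f"{name:<18} {relative_change(tiny_wer, small_wer):+6.1f}%")

# Test sets where the larger checkpoint has a higher WER than the smaller one.
worse = [name for name, (tiny_wer, small_wer) in results.items() if small_wer > tiny_wer]
print("small.en worse on:", worse)
```

In this table, Earnings-22 is the only test set where small.en reports a higher WER than tiny.en.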

## Your chance now! 

Go ahead and repeat the above loop with a different ASR checkpoint and your choice of datasets. How does your model compare to Whisper?