# Audio Classification with SpeechBrain and Cleanlab

In this quickstart tutorial, we will use Cleanlab to find label issues in the Spoken Digit dataset (it's like MNIST for audio). The dataset contains 2,500 audio clips with English pronunciations of the digits 0 to 9 (these are the labels to predict from the audio).

**Overview of what we'll do in this tutorial:**

- Extract features from audio clips (.wav files) using a pre-trained PyTorch model (available via HuggingFace) that was previously fit to the [VoxCeleb](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/) speech dataset.

- Train a cross-validated linear model using the extracted features and generate out-of-sample predicted probabilities.

- Use cleanlab to identify a list of audio clips with potential label errors.

**Data:** https://www.tensorflow.org/datasets/catalog/spoken_digit

**Pre-Trained Model:** https://huggingface.co/speechbrain/spkrec-xvect-voxceleb


## 1. Install dependencies and import them


To get started, we first need to install the following packages with `pip install`:

1. cleanlab
2. speechbrain
3. tensorflow_io


In [None]:
dependencies = ["cleanlab", "speechbrain", "tensorflow_io"]

if "google.colab" in str(get_ipython()):  # Check if it's running in Google Colab
    %pip install git+https://github.com/weijinglok/cleanlab.git@808a62cdff2e08f075f6ba651b4681cbd6f52bcc
    cmd = ' '.join([dep for dep in dependencies if dep != "cleanlab"])
    %pip install $cmd
else:
    missing_dependencies = []
    for dependency in dependencies:
        try:
            __import__(dependency)
        except ImportError:
            missing_dependencies.append(dependency)

    if len(missing_dependencies) > 0:
        print("Missing required dependencies:")
        print(*missing_dependencies, sep=", ")
        print("\nPlease install them before running the rest of this notebook.")

Let's import some of the packages needed throughout this tutorial.


In [88]:
import os

# Suppress TF info, warnings and errors (must be set before TensorFlow is imported)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

import pandas as pd
import numpy as np
import random
import tensorflow as tf
import torch

SEED = 456


def set_seed(seed=0):
    """Ensure reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.cuda.manual_seed_all(seed)


set_seed(SEED)
tf.get_logger().setLevel('ERROR')  # suppress TF warnings
pd.options.display.max_colwidth = 500


## 2. Load the data


We must first fetch the dataset. To run the below command, you'll need to have `wget` installed; alternatively you can manually navigate to the link in your browser and download from there.


In [89]:
%%capture

!wget https://github.com/Jakobovski/free-spoken-digit-dataset/archive/v1.0.9.tar.gz
!mkdir -p spoken_digits
!tar -xf v1.0.9.tar.gz -C spoken_digits
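
If `wget` is not available, the same download-and-extract step can be sketched with the Python standard library. This is a minimal sketch, not part of the original tutorial: the helper names `download_archive` and `extract_archive` are hypothetical, and the URL is the one from the `wget` command above.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the wget command above
ARCHIVE_URL = "https://github.com/Jakobovski/free-spoken-digit-dataset/archive/v1.0.9.tar.gz"


def download_archive(url, archive_path):
    """Download url to archive_path unless the file already exists (hypothetical helper)."""
    archive_path = Path(archive_path)
    if not archive_path.exists():
        urllib.request.urlretrieve(url, archive_path)
    return archive_path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir, creating the folder if needed."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return dest


# Usage (requires network access), left commented out on purpose:
# archive = download_archive(ARCHIVE_URL, "v1.0.9.tar.gz")
# extract_archive(archive, "spoken_digits")
```

Skipping the download when the archive already exists avoids the numbered duplicate files that repeated `wget` calls produce.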



The audio data are .wav files in the **recordings/** folder. Note that the label for each audio clip (i.e. digit from 0 to 9) is indicated in the prefix of the file name (e.g. **6_nicolas_32.wav** has the label 6).
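
As a small illustration of this naming convention, the sketch below splits a file name into its parts. The helper `parse_recording_name` and its field names are hypothetical, and it assumes every file follows the `{digit}_{speaker}_{index}.wav` pattern seen in the example above.

```python
from pathlib import Path
from typing import NamedTuple


class RecordingInfo(NamedTuple):
    label: int    # spoken digit, 0 to 9
    speaker: str  # speaker name as written in the file name
    index: int    # recording number for that speaker and digit


def parse_recording_name(path: str) -> RecordingInfo:
    """Split '<digit>_<speaker>_<index>.wav' into typed fields (assumed pattern)."""
    digit, speaker, index = Path(path).stem.split("_")
    return RecordingInfo(int(digit), speaker, int(index))


info = parse_recording_name("recordings/6_nicolas_32.wav")
print(info.label, info.speaker, info.index)
```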


In [90]:
DATA_PATH = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/"

# Get list of .wav file names
file_paths = []

# The order returned by os.walk can be arbitrary, so we sort and then shuffle deterministically with SEED below
# See: https://stackoverflow.com/questions/18282370/in-what-order-does-os-walk-iterates-iterate
for (dirpath, dirnames, filenames) in os.walk(DATA_PATH):

    # sort before shuffling
    filenames = sorted(filenames)

    # Use deterministic shuffle with SEED on pre-sorted list
    # See: https://stackoverflow.com/questions/19306976/python-shuffling-with-a-parameter-to-get-the-same-result
    random.Random(SEED).shuffle(filenames)
    file_paths += [os.path.join(dirpath, file) for file in filenames if file.endswith(".wav")]

# Check out first 3 files
file_paths[:3]


['spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_george_26.wav',
 'spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_24.wav',
 'spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_6.wav']

Let's listen to some example audio clips from the dataset. The utility functions below help process the .wav file so we can listen to it in this notebook.


In [91]:
import tensorflow_io as tfio
from pathlib import Path
from IPython import display

# Utility function for loading audio files and making sure the sample rate is correct.
@tf.function
def load_wav_16k_mono(filename):
    """Load a WAV file, convert it to a float tensor, resample to 16 kHz single-channel audio."""
    file_contents = tf.io.read_file(filename)
    wav, sample_rate = tf.audio.decode_wav(file_contents, desired_channels=1)
    wav = tf.squeeze(wav, axis=-1)
    sample_rate = tf.cast(sample_rate, dtype=tf.int64)
    wav = tfio.audio.resample(wav, rate_in=sample_rate, rate_out=16000)
    return wav


def display_example(wav_file_name, audio_rate=16000):
    """Allows us to listen to any wav file and displays its given label in the dataset."""
    wav_file_example = load_wav_16k_mono(wav_file_name)
    label = Path(wav_file_name).parts[-1].split("_")[0]
    print(f"Given label for this example: {label}")
    display.display(display.Audio(wav_file_example, rate=audio_rate))


Click the play button below to listen to this example .wav file. Feel free to change the `wav_file_name_example` variable below to listen to other audio clips in the dataset.


In [92]:
wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_jackson_43.wav"  # change this to hear other examples
display_example(wav_file_name_example)


Given label for this example: 7


## 3. Use pre-trained SpeechBrain model to featurize audio

The [SpeechBrain](https://github.com/speechbrain/speechbrain) package offers many PyTorch neural networks that have been pretrained for speech processing tasks. Here we instantiate an audio feature extractor using SpeechBrain's `EncoderClassifier()`. We'll use the "spkrec-xvect-voxceleb" network, which has been pre-trained on the [VoxCeleb](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/) speech dataset.


In [93]:
%%capture

from speechbrain.pretrained import EncoderClassifier

feature_extractor = EncoderClassifier.from_hparams(
  "speechbrain/spkrec-xvect-voxceleb",
  # run_opts={"device":"cuda"}  # Uncomment this to run on GPU if you have one (optional)
)

Next, we run the audio clips through the pre-trained model to extract vector features (aka embeddings).


In [94]:
# Create dataframe with .wav file names
df = pd.DataFrame(file_paths, columns=["wav_audio_file_path"])
df["label"] = df.wav_audio_file_path.map(lambda x: int(Path(x).parts[-1].split("_")[0]))

df.head(3)


| | wav_audio_file_path | label |
|---|---|---|
| 0 | spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_george_26.wav | 7 |
| 1 | spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_24.wav | 0 |
| 2 | spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_6.wav | 0 |


In [95]:
import torchaudio


def extract_audio_embeddings(model, wav_audio_file_path: str) -> torch.Tensor:
    """Feature extractor that embeds audio into a vector."""
    # Load the audio signal as a tensor (the returned sample rate is not used here)
    signal, _sample_rate = torchaudio.load(wav_audio_file_path)
    # Pass tensor through pretrained neural net and extract representation
    embeddings = model.encode_batch(signal)
    return embeddings


In [96]:
# Extract audio embeddings
embeddings_list = []
for file_name in df.wav_audio_file_path:  # for each .wav file name
    embeddings = extract_audio_embeddings(feature_extractor, file_name)
    embeddings_list.append(embeddings.cpu().numpy())

embeddings_array = np.squeeze(np.array(embeddings_list))


Now we have our features in a 2D numpy array. Each row in the array corresponds to an audio clip. We're now able to represent each audio clip as a 512-dimensional feature vector!


In [97]:
print(embeddings_array)
print("Shape of array: ", embeddings_array.shape)


[[-14.196314    7.319463   12.478973  ...   2.289077    2.8170207
  -10.892647 ]
 [-24.898058    5.2561903  12.559641  ...  -3.5597146   9.620667
  -10.28525  ]
 [-21.709621    7.5033717   7.913801  ...  -6.81983     3.1831474
  -17.208763 ]
 ...
 [-16.084263    6.3210497  12.005459  ...   1.2161485   9.478238
  -10.682177 ]
 [-15.053809    5.242468    1.0914198 ...  -0.7833452   9.039536
  -23.569172 ]
 [-19.761091    1.1258258  16.753227  ...   3.3508904  11.598279
  -16.237118 ]]
Shape of array:  (2500, 512)
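
To see why `np.squeeze` yields a 2D array here, the sketch below mimics the stacking step with placeholder arrays. It assumes each per-clip embedding has shape `(1, 1, 512)`, which is consistent with the final `(2500, 512)` shape above but is not verified against SpeechBrain in this sketch. The zero-filled arrays stand in for real embeddings.

```python
import numpy as np

# Placeholder embeddings: one (1, 1, 512) array per clip (assumed shape)
embeddings_list = [np.zeros((1, 1, 512), dtype=np.float32) for _ in range(4)]

stacked = np.array(embeddings_list)
print(stacked.shape)  # one leading axis per clip, then the singleton axes

# np.squeeze drops every axis of length 1
features = np.squeeze(stacked)
print(features.shape)

# An explicit reshape keeps the clip axis even when there is a single clip
features_safe = stacked.reshape(len(embeddings_list), -1)
print(features_safe.shape)

# Caveat: with only one clip, np.squeeze would also drop the clip axis
single = np.squeeze(np.array(embeddings_list[:1]))
print(single.shape)
```

For a dataset with many clips both approaches give the same result; the reshape variant is only safer in the single-clip edge case.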


## 4. Fit linear model and compute out-of-sample predicted probabilities


A typical way to leverage pretrained networks for a particular classification task is to add a linear output layer and fine-tune the network parameters on the new data. However, this can be computationally intensive. Alternatively, we can freeze the pretrained weights of the network and only train the output layer without having to rely on GPU(s). Here we do this conveniently by fitting a `sklearn` linear model on top of the extracted network embeddings.

To identify label issues, cleanlab requires a probabilistic prediction from your model for every datapoint that should be considered. However, these predictions will be _overfit_ (and thus unreliable) for datapoints the model was previously trained on. Cleanlab is intended to only be used with **out-of-sample** predicted probabilities, i.e. on datapoints held out from the model during training.

K-fold cross-validation is a straightforward way to produce out-of-sample predicted probabilities for every datapoint in the dataset, by training K copies of our model on different data subsets and using each copy to predict on the subset of data it did not see during training. An additional benefit of cross-validation is that it provides more reliable evaluation of our model than a single training/validation split. We can obtain cross-validated out-of-sample predicted probabilities from any classifier via the [cross_val_predict](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html) wrapper provided in `sklearn`.
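
To make the mechanics concrete, here is a minimal sketch of K-fold out-of-sample prediction written as an explicit loop. It uses synthetic data from `make_classification` rather than the audio embeddings, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for (features, labels); not the Spoken Digit data
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

base_model = LogisticRegression(max_iter=1000)
n_classes = len(np.unique(y))
oos_pred_probs = np.zeros((len(y), n_classes))

# Train one fresh copy of the model per fold and predict only on the held-out part
for train_idx, holdout_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_model = clone(base_model).fit(X[train_idx], y[train_idx])
    oos_pred_probs[holdout_idx] = fold_model.predict_proba(X[holdout_idx])

print(oos_pred_probs.shape)
```

Each row of `oos_pred_probs` comes from the one fold model that never saw that row during training, which is the property cleanlab relies on.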


In [98]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

model = LogisticRegression(C=0.01, max_iter=1000, tol=1e-1, random_state=SEED)

# Generate cross-validated predicted probabilities for each datapoint
cv_pred_probs = cross_val_predict(
    estimator=model, X=embeddings_array, y=df.label.values, cv=5, method="predict_proba"
)


For each audio clip, the corresponding predicted probabilities in `cv_pred_probs` are produced by a copy of our LogisticRegression model that has never been trained on this audio clip. Hence we call these predictions _out-of-sample_.


In [99]:
from sklearn.metrics import accuracy_score

predicted_labels = cv_pred_probs.argmax(axis=1)
cv_accuracy = accuracy_score(df.label.values, predicted_labels)
print(f"Cross-validated estimate of accuracy on held-out data: {cv_accuracy}")


Cross-validated estimate of accuracy on held-out data: 0.9772


## 5. Use cleanlab to find label issues


Based on the given labels and out-of-sample predicted probabilities, `cleanlab` can quickly help us identify label issues. Here we request that the indices of the identified label issues be sorted by cleanlab's _self-confidence_ score, which measures the quality of each given label via the probability assigned to it in our model's prediction.
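
For intuition about this ranking, the self-confidence score can be sketched directly in NumPy on a toy example. The labels and probabilities below are made up for illustration, and the sketch only shows the score used for ordering; it does not reproduce how `find_label_issues` decides which datapoints count as issues.

```python
import numpy as np

# Toy given labels and out-of-sample predicted probabilities (made-up values)
labels = np.array([0, 1, 2, 1])
pred_probs = np.array(
    [
        [0.90, 0.05, 0.05],
        [0.10, 0.20, 0.70],
        [0.10, 0.10, 0.80],
        [0.30, 0.60, 0.10],
    ]
)

# Self-confidence: predicted probability of each datapoint's given label
self_confidence = pred_probs[np.arange(len(labels)), labels]

# Lowest self-confidence first, i.e. the most suspicious labels come first
ranked_indices = np.argsort(self_confidence)

print(self_confidence)
print(ranked_indices)
```

In this toy data, the second datapoint has given label 1 but most of its probability mass on class 2, so it ranks first.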


In [100]:
import cleanlab

label_issues_indices = cleanlab.filter.find_label_issues(
    labels=df.label.values,
    pred_probs=cv_pred_probs,
    return_indices_ranked_by="self_confidence",  # ranks the label issues
)

print(label_issues_indices)


[ 516 1946  469 1871 1955 2132]


The datapoints flagged by `cleanlab` are worth inspecting more closely.


In [101]:
df.iloc[label_issues_indices]


| | wav_audio_file_path | label |
|---|---|---|
| 516 | spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_36.wav | 6 |
| 1946 | spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_14.wav | 6 |
| 469 | spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_35.wav | 6 |
| 1871 | spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_theo_27.wav | 6 |
| 1955 | spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/4_george_31.wav | 4 |
| 2132 | spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_nicolas_8.wav | 6 |
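
When reviewing flagged rows, it can help to see the model's predicted label and the probability it gave to the given label side by side. The sketch below does this on toy data; the column names `predicted_label` and `given_label_prob` are hypothetical, and the arrays stand in for `df`, `cv_pred_probs` and `label_issues_indices` from this tutorial.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the tutorial's dataframe, predicted probabilities and flagged indices
toy_df = pd.DataFrame({"label": [0, 1, 2, 1]})
toy_pred_probs = np.array(
    [
        [0.90, 0.05, 0.05],
        [0.10, 0.20, 0.70],
        [0.10, 0.10, 0.80],
        [0.30, 0.60, 0.10],
    ]
)
toy_issue_indices = np.array([1, 3])

# Copy the flagged rows and attach the model's view of each one
review = toy_df.iloc[toy_issue_indices].copy()
review["predicted_label"] = toy_pred_probs[toy_issue_indices].argmax(axis=1)
review["given_label_prob"] = toy_pred_probs[
    toy_issue_indices, toy_df.label.values[toy_issue_indices]
]

print(review)
```

A large gap between the given label and a confidently predicted different label is a hint about which clips to listen to first.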


Let's listen to some audio clips below of label issues that were identified in this list.


In this example, the given label is **6** but it sounds like **8**.


In [102]:
wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_14.wav"
display_example(wav_file_name_example)


Given label for this example: 6


In the three examples below, the given label is **6** but they sound quite ambiguous.


In [103]:
wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_36.wav"
display_example(wav_file_name_example)


Given label for this example: 6


In [104]:
wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_35.wav"
display_example(wav_file_name_example)


Given label for this example: 6


In [105]:
wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_nicolas_8.wav"
display_example(wav_file_name_example)


Given label for this example: 6


You can see that even widely-used datasets like Spoken Digit contain problematic labels. Never blindly trust your data! You should always check it for potential issues, many of which can be easily identified by `cleanlab`.
