# Audio Classification with Cleanlab and LightGBM

In this tutorial, we use Cleanlab to find potential label errors in the Spoken Digit dataset (think of it as MNIST for audio). The dataset contains 2,500 audio clips of English pronunciations of the digits 0 to 9.

**High-level overview of what we'll do in this tutorial:**

- Extract features from audio clips (.wav files) using a pre-trained model from Hugging Face trained on the [VoxCeleb](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/) speech dataset.

- Train a cross-validated LightGBM model using the extracted features.

- Generate cross-validated predicted probabilities to use as input into Cleanlab.

- Generate a list of audio clips with potential label errors.

**Data:** https://www.tensorflow.org/datasets/catalog/spoken_digit

**Pre-Trained Model:** https://huggingface.co/speechbrain/spkrec-xvect-voxceleb

## **1. Install dependencies and import libraries**

To get started, run the cell below to install the additional packages needed in your Colab environment.

In [1]:
%%capture

%pip install cleanlab speechbrain tensorflow-io tensorflow lightgbm pandas matplotlib wget

Next, let's import all the packages you need for this tutorial. Note that we also set a random seed to make experiments reproducible.

In [2]:
import cleanlab
import tensorflow as tf
import tensorflow_io as tfio
import torchaudio
from speechbrain.pretrained import EncoderClassifier
import lightgbm as lgb
from sklearn.model_selection import cross_val_predict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from pathlib import Path
from IPython import display

np.random.seed(12345)



## **2. Download dataset**

Run the cell below to download the dataset directly into your Colab workspace. Once the download is complete, you should be able to see it under the "Files" tab in the left menu bar.

In [3]:
import wget
import tarfile

url = "https://github.com/Jakobovski/free-spoken-digit-dataset/archive/v1.0.9.tar.gz"
filename = wget.download(url)

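# Avoid the name `dir`, which would shadow the Python built-in
data_dir = "spoken_digits"

# exist_ok=True lets this cell be re-run without raising FileExistsError
os.makedirs(data_dir, exist_ok=True)

# Extract the archive; the context manager closes the file automatically
with tarfile.open(filename) as tar:
    tar.extractall(data_dir)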

The audio data (.wav files) is in the "recordings" folder. Run the cell below to get a list of all the file names.

Note that the label (a digit from 0 to 9) is given by the prefix of the file name (e.g. "6_nicolas_32.wav" has the label 6).

In [4]:
DATA_PATH = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/"

# Get list of .wav file names
file_paths = []
for (dirpath, dirnames, filenames) in os.walk(DATA_PATH):
    file_paths += [os.path.join(dirpath, file) for file in filenames if file.endswith(".wav")]

# Check out first 3 files
file_paths[:3]

['spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/5_nicolas_32.wav',
 'spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/4_yweweler_42.wav',
 'spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_nicolas_14.wav']
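
The file-name convention described above (digit label first, separated by underscores) can be sketched as a small parser. This is a minimal illustration rather than part of the original notebook: the helper name `parse_recording_name` is hypothetical, and it assumes every file follows the `{digit}_{speaker}_{index}.wav` pattern seen in the listing above.

```python
from pathlib import Path
from typing import Tuple


def parse_recording_name(path: str) -> Tuple[int, str, int]:
    """Split a name like '6_nicolas_32.wav' into (label, speaker, index)."""
    stem = Path(path).stem  # file name without directory or extension
    digit, speaker, index = stem.split("_")
    return int(digit), speaker, int(index)


example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/5_nicolas_32.wav"
label, speaker, index = parse_recording_name(example)
print(label, speaker, index)
```

Only the first field is needed as the classification label; the other two fields are parsed here just to show the full pattern.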

## **3. Use pre-trained audio classifier from Hugging Face**

Next, let's instantiate an audio feature extractor with the `EncoderClassifier.from_hparams()` class method, passing the name of the pre-trained model ("speechbrain/spkrec-xvect-voxceleb") as an argument. This model was pre-trained on the [VoxCeleb](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/) speech dataset.

In [5]:
# Instantiate audio classifier
feature_extractor = EncoderClassifier.from_hparams(
  "speechbrain/spkrec-xvect-voxceleb",
  # GPU is optional.
  # To enable a GPU in Colab, go to: Edit -> Notebook settings -> choose GPU accelerator,
  # then uncomment the line below.
  # run_opts={"device":"cuda"}
)


## **4. Explore the data by playing some audio clips**

Before training the model, let's listen to some of the audio clips. 

We can use the utility function below to process the .wav file so we can listen to it in this code notebook.

In [6]:
# Utility function for loading audio files and making sure the sample rate is correct.
@tf.function
def load_wav_16k_mono(filename):
    """ Load a WAV file, convert it to a float tensor, resample to 16 kHz single-channel audio. """
    file_contents = tf.io.read_file(filename)
    wav, sample_rate = tf.audio.decode_wav(
          file_contents,
          desired_channels=1)
    wav = tf.squeeze(wav, axis=-1)
    sample_rate = tf.cast(sample_rate, dtype=tf.int64)
    wav = tfio.audio.resample(wav, rate_in=sample_rate, rate_out=16000)
    return wav

Click the play button below to listen to the .wav file. 

Feel free to change the `wav_file_name_example` variable below to listen to other audio clips.

In [7]:
# Single .wav file example
wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_jackson_43.wav"

wav_file_example = load_wav_16k_mono(wav_file_name_example)

# Play the audio file.
display.Audio(wav_file_example, rate=16000)



## **5. Extract features from audio**

Next, run the cells below to use our pre-trained model to extract features (aka embeddings) from the audio clips.

In [8]:
# Create dataframe with .wav file names
df = pd.DataFrame(file_paths, columns=["wav_audio_file_path"])
df["label"] = df.wav_audio_file_path.map(lambda x: int(Path(x).parts[-1].split("_")[0]))

df.head(3)

Unnamed: 0,wav_audio_file_path,label
0,spoken_digits/free-spoken-digit-dataset-1.0.9/...,5
1,spoken_digits/free-spoken-digit-dataset-1.0.9/...,4
2,spoken_digits/free-spoken-digit-dataset-1.0.9/...,7


In [9]:
# Feature extractor
def extract_audio_embeddings(model, wav_audio_file_path: str):
    """Return the embedding tensor that `model` computes for one .wav file."""

    # Load the audio file as a tensor (the sample rate is also returned but not used here)
    signal, _sample_rate = torchaudio.load(wav_audio_file_path)

    # Extract features (aka embeddings)
    embeddings = model.encode_batch(signal)

    return embeddings

In [10]:
# Extract audio embeddings
embeddings_list = []
for file_name in df.wav_audio_file_path:  # for each .wav file name
    embeddings = extract_audio_embeddings(feature_extractor, file_name)
    embeddings_list.append(embeddings.cpu().numpy())

embeddings_array = np.squeeze(np.array(embeddings_list))

Now we have our features in an array! 

Each row in the array corresponds to an audio clip. We're now able to represent an audio clip as a 512-dimensional feature vector!

In [11]:
embeddings_array.shape

(2500, 512)

## **6. Train LightGBM model using cross-validation to generate predicted probabilities**

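Run the cells below to train a LightGBM model and generate cross-validated predicted probabilities.

Note that the LightGBM `Dataset` class is the input format for LightGBM's native training API. The scikit-learn-style `LGBMClassifier` used later in this section accepts NumPy arrays directly, so the `data` object created in the next cell is not passed to it.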

In [12]:
data = lgb.Dataset(embeddings_array, label=df.label.values)

For convenience, we can use scikit-learn's `cross_val_predict` function to run k-fold cross-validation and generate cross-validated predicted probabilities.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html

The k-fold procedure splits the data into partitions (folds). The predicted probabilities for each audio clip are produced by a model that was trained only on the other partitions, not on the one containing that clip. These are known as out-of-sample predictions.
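
To make the out-of-sample idea concrete, here is a minimal pure-Python sketch of what k-fold prediction does. It is not how `cross_val_predict` is implemented. The helper `out_of_sample_predict` and the toy `mean_model` callback (which predicts the mean of the training targets) are hypothetical stand-ins for the LightGBM model, and the folds are simple contiguous, unshuffled blocks.

```python
from typing import Callable, List, Sequence


def out_of_sample_predict(
    xs: Sequence[float],
    ys: Sequence[float],
    k: int,
    fit_predict: Callable[[List[float], List[float], List[float]], List[float]],
) -> List[float]:
    """Predict every point with a model fitted on the other k-1 folds only."""
    n = len(xs)
    predictions = [0.0] * n
    # Contiguous fold boundaries: fold f covers indices [bounds[f], bounds[f + 1])
    bounds = [round(f * n / k) for f in range(k + 1)]
    for f in range(k):
        held_out = range(bounds[f], bounds[f + 1])
        train_idx = [i for i in range(n) if i not in held_out]
        fold_preds = fit_predict(
            [xs[i] for i in train_idx],
            [ys[i] for i in train_idx],
            [xs[i] for i in held_out],
        )
        for i, pred in zip(held_out, fold_preds):
            predictions[i] = pred
    return predictions


def mean_model(train_x, train_y, test_x):
    """Toy model: ignore the features and predict the mean training target."""
    mean = sum(train_y) / len(train_y)
    return [mean for _ in test_x]


ys = [0.0, 0.0, 10.0, 10.0]
preds = out_of_sample_predict([1.0, 2.0, 3.0, 4.0], ys, k=2, fit_predict=mean_model)
print(preds)
```

Each prediction comes from a model that never saw that point's own target, which is why the first two predictions are 10.0 even though their targets are 0.0.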

In [13]:
%%capture

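model = lgb.LGBMClassifier(
    n_estimators=1000,
    # `eval_metric` is an argument of `fit()`, not of the constructor, so it is not set here;
    # the default multiclass objective already optimizes multi-class log loss.
    verbose=1,
)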

# Generate cross-validated predicted probabilities for each input data point
# Note: Cleanlab requires predictions to be out-of-sample (e.g. via cross-validation)
cv_pred_probs = cross_val_predict(
    estimator=model,
    X=embeddings_array,
    y=df.label.values,
    cv=5,
    method="predict_proba"
)

In [14]:
cv_pred_probs.shape

(2500, 10)

Run the cell below to compute the cross-validation accuracy.

In [15]:
cv_accuracy = (cv_pred_probs.argmax(axis=1) == df.label.values).mean()

print(f"Cross-Validation Accuracy: {cv_accuracy}")

Cross-Validation Accuracy: 0.9584


## **7. Run Cleanlab to find label errors**

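Now let's run Cleanlab to find label errors!

Use the `get_noise_indices()` function below to get an ordered list of indices corresponding to the audio clips with potential label errors. Note that `cleanlab.pruning.get_noise_indices()` and its `s` and `psx` arguments belong to the cleanlab 1.x API; cleanlab 2.x renamed the function to `cleanlab.filter.find_label_issues()`, with arguments `labels`, `pred_probs`, and `return_indices_ranked_by`.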

In [16]:
# Generate an ordered list of indices corresponding to the audio clips with potential label error
ordered_label_errors = cleanlab.pruning.get_noise_indices(
    s=df.label.values,
    psx=cv_pred_probs,
    sorted_index_method="normalized_margin",  # Order the flagged indices by normalized margin
)

In [17]:
ordered_label_errors

array([ 231,  756, 1993,  212, 1624, 2269,  846, 1262,  580, 1899,  578,
        516, 2053,  370, 1361])
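
As a rough illustration of the `normalized_margin` ordering, the sketch below scores each example by the predicted probability of its given label minus the largest predicted probability among the other classes, then sorts in ascending order so the most suspicious examples come first. It covers only the ranking step, not how Cleanlab decides which examples to flag. The function names and the toy probabilities are hypothetical.

```python
from typing import List, Sequence


def normalized_margin(label: int, probs: Sequence[float]) -> float:
    """p(given label) minus the highest probability among the other classes."""
    best_other = max(p for k, p in enumerate(probs) if k != label)
    return probs[label] - best_other


def rank_by_normalized_margin(
    labels: Sequence[int], pred_probs: Sequence[Sequence[float]]
) -> List[int]:
    """Return example indices sorted from lowest (most suspicious) margin to highest."""
    margins = [normalized_margin(label, probs) for label, probs in zip(labels, pred_probs)]
    return sorted(range(len(labels)), key=lambda i: margins[i])


labels = [0, 1, 2]
pred_probs = [
    [0.90, 0.05, 0.05],  # confidently agrees with the given label 0
    [0.70, 0.20, 0.10],  # model prefers class 0 over the given label 1
    [0.30, 0.30, 0.40],  # weakly agrees with the given label 2
]
ranking = rank_by_normalized_margin(labels, pred_probs)
print(ranking)
```

A negative margin means the model assigns more probability to some other class than to the given label, so those examples sort to the front of the list.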

Run the cell below to list the candidates to inspect for label errors.

In [18]:
print("Candidates to inspect for label error")
df.iloc[ordered_label_errors]

Candidates to inspect for label error


Unnamed: 0,wav_audio_file_path,label
231,spoken_digits/free-spoken-digit-dataset-1.0.9/...,6
756,spoken_digits/free-spoken-digit-dataset-1.0.9/...,3
1993,spoken_digits/free-spoken-digit-dataset-1.0.9/...,0
212,spoken_digits/free-spoken-digit-dataset-1.0.9/...,6
1624,spoken_digits/free-spoken-digit-dataset-1.0.9/...,2
2269,spoken_digits/free-spoken-digit-dataset-1.0.9/...,2
846,spoken_digits/free-spoken-digit-dataset-1.0.9/...,9
1262,spoken_digits/free-spoken-digit-dataset-1.0.9/...,7
580,spoken_digits/free-spoken-digit-dataset-1.0.9/...,6
1899,spoken_digits/free-spoken-digit-dataset-1.0.9/...,1


## **8. Listen to these examples of label errors that were found**

After inspecting the list of candidates, you should find that there are indeed label errors. 

Listen to the audio clips below of label errors that were found.

The given label is the digit **6**, but the clip sounds more like **8**.

In [19]:
# Single .wav file example
wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_14.wav"
wav_file_example = load_wav_16k_mono(wav_file_name_example)

# Play the audio file.
display.Audio(wav_file_example, rate=16000)



The given label is the digit **7**, but the clip sounds more like **0**.

In [20]:
# Single .wav file example
wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_nicolas_43.wav"
wav_file_example = load_wav_16k_mono(wav_file_name_example)

# Play the audio file.
display.Audio(wav_file_example, rate=16000)

