# Demo of WavLM fine-tuned model for speaker verification

This demo builds on the code and model published at [microsoft/wavlm-base-sv](https://huggingface.co/microsoft/wavlm-base-sv).

In [50]:
from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector
from datasets import load_dataset
import torch
import numpy as np
import IPython.display as ipd

## Load the dataset & handpick data samples for testing

We use the Hugging Face `datasets` library to download a small subset of the LibriSpeech dataset (the full dataset is available at [https://huggingface.co/datasets/librispeech_asr](https://huggingface.co/datasets/librispeech_asr)).

[The ultimate guide for using the library with audio datasets.](https://huggingface.co/blog/audio-datasets)

In [51]:
# When loading larger datasets, consider the streaming=True option (useful if you don't plan to reuse the data multiple times).
dataset = load_dataset("ahazeemi/librispeech10h")

In [52]:
dataset = dataset["validation"]
COLUMNS_TO_KEEP = {"audio", "speaker_id"}
dataset = dataset.remove_columns(set(dataset.column_names) - COLUMNS_TO_KEEP)
dataset

Dataset({
    features: ['audio', 'speaker_id'],
    num_rows: 2703
})

Locally, the audio is stored as a .flac file, but once loaded, the dictionary also contains the audio as a float array. That matters because a float array is exactly what the model expects as input.

Also note that the sampling rate is 16 kHz, which matches what the model was trained on. If we needed to resample the audio using the `datasets` library, the resampling would be done on the fly.
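To make the idea of resampling concrete, here is a minimal sketch in plain Python. It is not what the `datasets` library does internally (in practice you would use `dataset.cast_column("audio", Audio(sampling_rate=16_000))` and let the library resample during decoding). The helper `resample_linear` is a hypothetical name introduced for illustration, and it uses naive linear interpolation without an anti-aliasing filter, so it is only suitable for understanding the concept.

```python
def resample_linear(samples, orig_rate, target_rate):
    """Naively resample a mono signal by linear interpolation.

    Illustration only: there is no anti-aliasing filter, so downsampling
    real audio this way would introduce artifacts.
    """
    if orig_rate <= 0 or target_rate <= 0:
        raise ValueError("sampling rates must be positive")
    if orig_rate == target_rate or len(samples) < 2:
        return list(samples)

    # Number of output samples needed to cover the same duration.
    n_out = round(len(samples) * target_rate / orig_rate)
    # Distance between consecutive output samples, in input-sample units.
    step = orig_rate / target_rate

    out = []
    last = len(samples) - 1
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


# Upsample a short ramp from 8 kHz to 16 kHz: the sample count doubles.
ramp = [0.0, 1.0, 2.0, 3.0]
upsampled = resample_linear(ramp, orig_rate=8_000, target_rate=16_000)
print(upsampled)  # [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.0]
```

The key point is that the duration stays the same while the number of samples changes, which is why a model trained on 16 kHz audio needs its input at that rate.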

In [53]:
dp0 = dataset[0]
dp0

{'audio': {'path': None,
  'array': array([ 0.00186157,  0.0005188 ,  0.00024414, ..., -0.00097656,
         -0.00109863, -0.00146484]),
  'sampling_rate': 16000},
 'speaker_id': 2277}

In [54]:
ipd.Audio(data=np.asarray(dp0["audio"]["array"]), autoplay=True, rate=16000)

In [55]:
dp_same = dataset[2]
ipd.Audio(data=np.asarray(dp_same["audio"]["array"]), autoplay=True, rate=16000)
# dp_same

In [56]:
dp_diff = dataset[100]
ipd.Audio(data=np.asarray(dp_diff["audio"]["array"]), autoplay=True, rate=16000)
# dp_diff

## Download the model

The `wavlm-base-sv` model is fine-tuned on the VoxCeleb1 dataset for speaker verification. It uses an X-Vector head with an Additive Margin Softmax loss.

It builds on the WavLM-Base model, which was pre-trained on 960 hours of LibriSpeech.

The model works with audio sampled at 16 kHz.

In [57]:
model = WavLMForXVector.from_pretrained('microsoft/wavlm-base-sv')

Some weights of the model checkpoint at microsoft/wavlm-base-sv were not used when initializing WavLMForXVector: ['wavlm.encoder.pos_conv_embed.conv.weight_g', 'wavlm.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing WavLMForXVector from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing WavLMForXVector from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of WavLMForXVector were not initialized from the model checkpoint at microsoft/wavlm-base-sv and are newly initialized: ['wavlm.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wavlm.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream

In [58]:
# A feature extractor converts the raw speech signal into the model's input format.
# I'm not sure it is strictly necessary when using the `datasets` library with the same dataset the model was trained on.
# More info on what a feature extractor is: https://huggingface.co/blog/fine-tune-wav2vec2-english#create-wav2vec2-feature-extractor
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('microsoft/wavlm-base-sv')


def get_normalized_embeddings(inp):
    features = feature_extractor(inp, return_tensors="pt", sampling_rate=16000)
    embeddings = model(**features).embeddings
    embeddings = torch.nn.functional.normalize(embeddings, dim=-1).cpu()
    return embeddings

## Calculate embeddings for the samples

In [59]:
e0 = get_normalized_embeddings(dp0["audio"]["array"])
e_same = get_normalized_embeddings(dp_same["audio"]["array"])
e_diff = get_normalized_embeddings(dp_diff["audio"]["array"])

In [60]:
print("Embeddings shape (output of the network):", e0.shape)

Embeddings shape (output of the network): torch.Size([1, 512])
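The embeddings returned by `get_normalized_embeddings` are L2-normalized, which means each one has unit length. A useful consequence is that the dot product of two normalized embeddings equals the cosine similarity of the original vectors. The sketch below illustrates this with small toy vectors in plain Python. The helper names (`l2_normalize`, `dot`, `cosine_similarity`) are introduced here for illustration and are not part of the notebook or of any library.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector so that its Euclidean (L2) norm is 1."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 2-dimensional "embeddings" (the real ones have 512 dimensions).
a = [3.0, 4.0]
b = [4.0, 3.0]
a_hat, b_hat = l2_normalize(a), l2_normalize(b)

print(dot(a_hat, a_hat))        # squared norm after normalization, close to 1.0
print(dot(a_hat, b_hat))        # dot product of the normalized vectors
print(cosine_similarity(a, b))  # same value, computed from the raw vectors
```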


## Find out if the speakers are the same

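To check whether two recordings come from the same speaker, we compute the cosine similarity of their embeddings. The original demo uses a decision threshold of 0.86. Here the same-speaker pair scores about 0.98 and the different-speaker pair about 0.73, so both fall on the correct side of the threshold.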

In [61]:
cosine_sim = torch.nn.CosineSimilarity(dim=-1)
print(f"Similarity score between two recordings from the same speaker: {cosine_sim(e0, e_same)}")
print(f"Similarity score between two recordings from different speakers: {cosine_sim(e0, e_diff)}")

Similarity score between two recordings from the same speaker: tensor([0.9817], grad_fn=<SumBackward1>)
Similarity score between two recordings from different speakers: tensor([0.7273], grad_fn=<SumBackward1>)
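The verification decision itself is a single comparison against the threshold. Here is a minimal sketch that applies the 0.86 threshold from the original demo to the two scores printed above. The function name `is_same_speaker` is hypothetical, and treating a score exactly equal to the threshold as a match is an assumption. A threshold tuned on one dataset may not transfer well to another, so it is best treated as a parameter.

```python
def is_same_speaker(score, threshold=0.86):
    """Return True when the similarity score reaches the threshold.

    The 0.86 default is taken from the original demo; treating
    `score == threshold` as a match is an assumption of this sketch.
    """
    return score >= threshold


# Scores copied from the notebook output above.
score_same = 0.9817
score_diff = 0.7273

print(is_same_speaker(score_same))  # True
print(is_same_speaker(score_diff))  # False
```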
