## Automated Speaker Verification - A Study in Deep Neural Networks and Transfer Learning 

### Abstract ###
*"Identity theft is not a joke" - Dwight Schrute, The Office*

As the world shifts swiftly towards a digital landscape, protecting our online identity becomes imperative. Engineers have tackled this challenge through facial and fingerprint recognition, but the ability to authenticate a claimed identity by analysing a spoken sample of a person's voice could transform this space. Further, given the major advances in virtual reality, voice verification and, by extension, speech recognition are likely to become mainstream in virtual reality environments. Our team therefore sought to extend our understanding of deep learning and transfer learning by exploring earlier state-of-the-art TDNN models and how they compare with the more accurate ECAPA-TDNN model.

### Introduction ###
Until recently, x-vectors provided state-of-the-art results for speaker verification tasks. Once the network has converged, speaker embeddings can be extracted from the penultimate layer to characterise the speaker in a recording. Speaker verification can then be accomplished by comparing two embeddings with a simple cosine distance measurement. Our model expands on this by including enhancements to the TDNN architecture and the statistics pooling layer.

#### Requirements

To create and activate a virtual environment, run

```shell
python3 -m venv env
source env/bin/activate
```

To install the required Python packages, run

```shell
pip install -r requirements.txt
```

### Transfer Learning

The DNN is trained to classify speakers using a training set of speech recorded from a large number of training speakers. To leverage the feature representations learned by the model pretrained on this large dataset, speech recorded from each enrollment speaker is passed as input to the trained DNN. This enables the computation of deep hidden features for each utterance of a speaker in the enrollment set, which are then averaged to generate a compact deep embedding associated with that speaker.
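The averaging step can be illustrated with a minimal sketch. It assumes each enrollment utterance has already been turned into a fixed-length embedding; the function name `average_embeddings` and the toy 4-dimensional vectors are hypothetical and only stand in for real network outputs.

```python
def average_embeddings(embeddings):
    """Average a list of equal-length embedding vectors element-wise."""
    if not embeddings:
        raise ValueError("at least one embedding is required")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    count = len(embeddings)
    return [sum(e[i] for e in embeddings) / count for i in range(dim)]


# Three hypothetical 4-dimensional utterance embeddings for one speaker.
utterance_embeddings = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
speaker_embedding = average_embeddings(utterance_embeddings)
print(speaker_embedding)  # [2.0, 1.0, 2.0, 2.0]
```

Averaging several utterances tends to smooth out utterance-specific variation, which is why a single speaker-level vector is stored instead of every individual embedding.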

### Zero-Shot Learning



### Data Preparation

#### VoxCeleb
VoxCeleb1 is a large-scale audio-visual dataset for speaker identification with over 150,000 utterances from 1,251 speakers. It consists of short clips of human speech extracted from interview videos uploaded to YouTube. VoxCeleb2 follows the same approach on a much larger scale. Together, the two datasets contain over 1 million utterances from more than 7,000 speakers, totalling over 2,000 hours of audio and video. Each segment is at least 3 seconds long and is captured 'in the wild', with background chatter, laughter and overlapping speech. The speakers span a wide range of ethnicities, accents, professions and ages. ([SOURCE](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/))

Our model is trained on audio files collected from both the VoxCeleb1 and VoxCeleb2 datasets and it achieves an accuracy of approximately 98-99%.

#### Data Corruption
In realistic speech processing applications, the signal recorded by the microphone is corrupted by noise and reverberation. This is particularly harmful in distant-talking (far-field) scenarios, where the speaker is far from the microphone (think of popular devices such as Google Home, Amazon Echo and Kinect).

A common practice in neural speech processing is to start from clean speech recordings and artificially turn them into noisy ones. This process is called environmental corruption (sometimes also referred to as speech contamination). An advantage of this approach is that the audio can be corrupted in many different ways, which effectively increases the size of the training set. Two such ways are additive noise and reverberation.

__Additive Noise__

Samples from a noise collection are added to the clean speech signals at a random signal-to-noise ratio (SNR). The amount of noise can be tuned by adjusting the range from which the SNR is sampled.
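A minimal sketch of mixing noise at a chosen SNR is shown below. It is not the SpeechBrain implementation; the helper names `signal_power` and `add_noise` are hypothetical, the signals are synthetic, and the 0 to 15 dB range is taken from the `noise_snr_low` and `noise_snr_high` values in the configuration further down.

```python
import math
import random


def signal_power(samples):
    """Mean squared amplitude of a signal."""
    return sum(x * x for x in samples) / len(samples)


def add_noise(clean, noise, snr_db):
    """Mix noise into a clean signal at the requested SNR (in dB)."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    noise_power = signal_power(noise)
    if noise_power == 0:
        raise ValueError("noise signal must not be silent")
    # Scale the noise so that clean_power / scaled_noise_power == 10 ** (snr_db / 10).
    target_noise_power = signal_power(clean) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [c + gain * n for c, n in zip(clean, noise)]


rng = random.Random(0)
clean = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]  # toy "speech"
noise = [rng.gauss(0.0, 1.0) for _ in range(100)]                  # toy noise
snr_db = rng.uniform(0, 15)  # random SNR drawn from the configured range
noisy = add_noise(clean, noise, snr_db)
```

A lower SNR means louder noise relative to the speech, so widening the range towards 0 dB makes the training examples harder.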

__Reverberation__

When we speak in a room, our speech signal is reflected multiple times by the walls, floor and ceiling, and by the objects within the acoustic environment. Consequently, the final signal recorded by a distant microphone will contain multiple delayed replicas of the original signal. All these replicas interfere with each other and significantly affect the intelligibility of the speech signal.

Such a multi-path propagation is called reverberation. Within a given room enclosure, the reverberation between a source and a receiver is modeled by an impulse response. The reverberation is added by performing a convolution between a clean signal and an impulse response.
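The convolution described above can be sketched in a few lines. The impulse response here is a made-up toy (a direct path plus two attenuated echoes), not one taken from a real room or dataset, and `convolve` is a hypothetical helper written for clarity rather than speed.

```python
def convolve(signal, impulse_response):
    """Full discrete convolution of a signal with an impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


# Toy impulse response: direct path plus two delayed, attenuated reflections.
impulse_response = [1.0, 0.0, 0.5, 0.0, 0.25]
clean = [1.0, 2.0, 3.0]
reverberated = convolve(clean, impulse_response)
print(reverberated)  # [1.0, 2.0, 3.5, 1.0, 1.75, 0.5, 0.75]
```

Each non-zero tap of the impulse response adds one delayed, scaled copy of the clean signal, which is exactly the "multiple delayed replicas" picture of reverberation.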

__Environmental Corruption Lobe__

Noise and reverberation are often combined, each being activated with a certain probability. The corruption operations must be performed in the right order: reverberation is introduced first, and noise is added only afterwards. We use an open-source dataset of impulse responses and noise sequences called OpenRIR and perform environmental corruption by sampling from it.

If we call the corruption function another time, the signal is contaminated in a different way. This allows us to implement an on-the-fly speech contamination and apply different distortions to each different input. Environmental corruption is not computationally demanding and does not slow down the training loop even when doing it on-the-fly.


#### Data Augmentation
Another way we pre-process the data is through _speech augmentation_, which also increases the effective size of our training data. The idea is to artificially distort the original speech signals to give the network the _illusion_ that it is processing a new signal. This acts as a powerful _regularizer_ that normally helps neural networks generalise better and thus achieve better performance on test data. The augmentation techniques we use are Speed Perturbation, Time Dropout, Frequency Dropout and Clipping.

__Speed Perturbation__

With Speed perturbation, we resample the audio signal to a sampling rate that is a bit different from the original one. With this simple trick we can synthesize a speech signal that sounds a bit "faster" or "slower" than the original one. Note that not only the speaking rate is affected, but also the speaker characteristics such as pitch and formants.
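As a rough illustration, resampling can be approximated with linear interpolation. This is a simplified sketch, not the resampler used by the toolkit; `change_speed` is a hypothetical helper, and the percentages mirror the `speeds` list in the augmentation configuration.

```python
def change_speed(samples, speed_percent):
    """Resample with linear interpolation; 110 means 10% faster (shorter)."""
    if speed_percent <= 0:
        raise ValueError("speed_percent must be positive")
    step = speed_percent / 100.0
    new_length = int(len(samples) / step)
    output = []
    for n in range(new_length):
        position = n * step                 # fractional index into the input
        left = int(position)
        right = min(left + 1, len(samples) - 1)
        frac = position - left
        output.append((1.0 - frac) * samples[left] + frac * samples[right])
    return output


waveform = [float(i) for i in range(100)]
for speed in (90, 100, 110):
    print(speed, len(change_speed(waveform, speed)))
# 90 111
# 100 100
# 110 90
```

Played back at the original sampling rate, the shorter output sounds faster and higher-pitched, and the longer one slower and lower-pitched.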

__Time Dropout__

This replaces some random chunks of the original waveform with zeros. The intuition is that the neural network should provide good performance even when some pieces of the signal are missing. Conceptually, this is similar to dropout. One difference is that it is applied to the input waveform only. The other is that we drop consecutive samples rather than randomly selected elements as in dropout.
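A minimal sketch of time dropout follows. The function name `drop_chunks` and its parameters are hypothetical; the chunk count and length are fixed here for simplicity, whereas a real augmenter would typically sample them as well.

```python
import random


def drop_chunks(samples, num_chunks, chunk_length, rng):
    """Return a copy of the waveform with random consecutive chunks zeroed."""
    output = list(samples)
    if chunk_length > len(output):
        raise ValueError("chunk_length must not exceed the signal length")
    for _ in range(num_chunks):
        start = rng.randrange(0, len(output) - chunk_length + 1)
        for i in range(start, start + chunk_length):
            output[i] = 0.0
    return output


rng = random.Random(0)
waveform = [1.0] * 20
dropped = drop_chunks(waveform, num_chunks=2, chunk_length=3, rng=rng)
print(dropped)
```

Because the chunks are chosen independently they may overlap, so the number of zeroed samples lies between one and two chunk lengths in this example.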

__Frequency Dropout__

Instead of adding zeros in the time domain, frequency dropout adds zeros in the frequency domain. This can be achieved by filtering the original signal with randomly selected band-stop filters. As with time dropout, the intuition is that the neural network should work well even when some frequency channels are missing.

__Clipping__

Another way to remove some information from a speech signal is to add clipping. It is a form of non-linear distortion that clamps the maximum absolute amplitude of the signal (thus adding a saturation effect). In the frequency domain, clipping adds harmonics in the higher part of the spectrum.
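Clipping itself is a one-line operation, sketched here on a toy waveform. The helper name `clip` and the 0.5 ceiling are illustrative assumptions.

```python
def clip(samples, max_amplitude):
    """Clamp every sample to the range [-max_amplitude, max_amplitude]."""
    return [max(-max_amplitude, min(max_amplitude, x)) for x in samples]


waveform = [0.1, -0.9, 0.5, 1.2, -0.3]
print(clip(waveform, 0.5))  # [0.1, -0.5, 0.5, 0.5, -0.3]
```

Samples already inside the allowed range pass through unchanged; only the peaks are flattened, which is what produces the saturation effect.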

__Data Augmentation Lobe__

Similar to the environmental corruption lobe, the various data augmentation techniques are also applied and activated with a certain probability and this can similarly be adjusted on the fly.

### Network Architecture

- Network Architecture (ECAPA-TDNN architecture) (See [here](https://arxiv.org/pdf/2005.07143.pdf))
    - Channel- and context-dependent attention mechanism
    - Multi-layer Feature Aggregation (MFA)
    - AAMsoftmax loss

<center><img src="../images/model.jpg" style="width: 250px;"/></center>
<p style="text-align: center">
    <b>Block diagram of the ECAPA-TDNN model</b>
</p>

### Model Used: ECAPA-TDNN ###
#### Improvement 1: Statistical Pooling [channel-dependent attentive statistics pooling] ####
Neural networks are known to learn hierarchical structures, with each layer operating on a different level of complexity. In the ECAPA-TDNN model, features are aggregated and propagated at different hierarchical levels to produce better results. The statistics pooling module is improved with channel-dependent frame attention, enabling the network to focus on a different subset of frames when estimating the statistics of each channel. Which frames it focuses on depends on which frames it deems important, and this is determined by the following attention mechanism:


<center><img src="../images/attentionMechanism.jpg" style="width: 250px;"/></center>
<p style="text-align: center">
    <b>Add description</b>
</p>

The scalar score is then normalised [see Normalisation] over all frames by applying the softmax function channel-wise across time.

<center><img src="../images/softmax.png" style="width: 250px;"/></center>
<p style="text-align: center">
    <b>Add description</b>
</p>

The weighted mean vector and channel component are then constructed as follows:

<center><img src="../images/channel.png" style="width: 250px;"/></center>
<p style="text-align: center">
    <b>Add description</b>
</p>
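The pooling steps can be sketched for a single channel as follows. This is a simplified illustration that assumes the attention scores are already available (in the real model they are produced by a small attention network), and it assumes the pooled statistics are a weighted mean and a weighted standard deviation, as in the ECAPA-TDNN paper. The names `softmax` and `attentive_stats` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over the frames of one channel."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats(frame_values, scores):
    """Attention-weighted mean and standard deviation across time."""
    weights = softmax(scores)
    mean = sum(w * h for w, h in zip(weights, frame_values))
    second_moment = sum(w * h * h for w, h in zip(weights, frame_values))
    variance = max(second_moment - mean * mean, 0.0)  # guard against rounding
    return mean, math.sqrt(variance)


# One channel observed over four frames, with hypothetical attention scores.
frame_values = [1.0, 2.0, 3.0, 4.0]
uniform = attentive_stats(frame_values, [0.0, 0.0, 0.0, 0.0])
focused = attentive_stats(frame_values, [0.0, 0.0, 0.0, 10.0])
print(uniform)  # equal weights: mean 2.5, standard deviation about 1.118
print(focused)  # mean close to 4.0, the value of the highly scored frame
```

With equal scores the result reduces to ordinary statistics pooling; raising the score of one frame pulls the pooled mean towards that frame, which is how the network can emphasise informative frames per channel.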

#### Multi-layer Feature Aggregation (MFA) ####
In the original x-vector system on which our model is based, only the feature map of the final frame-level layer is used for calculating the pooled statistics. However, recent evidence shows that the shallower feature maps also contribute to more robust speaker embeddings. Hence, for each frame, the output feature maps of all frame-level blocks are aggregated before the pooled statistics are calculated (Multi-layer Feature Aggregation).

### Training a Speaker Verification model

```python
import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml
from user_data_prepare import prepare_user_data
from speaker_verif.custom_train import SpkIdBrain, dataio_prep
```

#### How the model works ####
Initially, the pipeline searches for all the .wav files in the specified data folder and randomly splits them into 80% for training, 10% for validation and 10% for testing. The network is then trained on the 80% of training data after data preprocessing and augmentation have been applied. A Voice Activity Detection (VAD) preprocessing step is used to detect irrelevant non-speech frames, and the augmentation varies the speed of the spoken sample.
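The random 80/10/10 split can be sketched as below. This is only an illustration of the idea and not the project's `prepare_user_data` implementation, which is not shown in this document; `split_files` and the file names are hypothetical.

```python
import random


def split_files(paths, ratios=(80, 10, 10), seed=0):
    """Shuffle the paths and split them into train/valid/test lists."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_valid = len(shuffled) * ratios[1] // total
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }


# Hypothetical file names standing in for the .wav files found on disk.
wav_paths = [f"speaker_a/utt_{i:03d}.wav" for i in range(50)]
splits = split_files(wav_paths)
print({name: len(items) for name, items in splits.items()})
# {'train': 40, 'valid': 5, 'test': 5}
```

Fixing the seed keeps the split reproducible between runs, so validation and test files never leak into the training set when training is resumed.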

```python
import torch
import torchaudio
import speechbrain as sb


# Method of the SpkIdBrain class (hence the `self` argument).
def prepare_features(self, wavs, stage):
    wavs, lens = wavs

    # Corruption and augmentation are applied during training only
    if stage == sb.Stage.TRAIN:
        if hasattr(self.modules, "env_corrupt"):
            # Keep the clean batch and append a corrupted copy of it
            wavs_noise = self.modules.env_corrupt(wavs, lens)
            wavs = torch.cat([wavs, wavs_noise], dim=0)
            lens = torch.cat([lens, lens])

        if hasattr(self.hparams, "augmentation"):
            wavs = self.hparams.augmentation(wavs, lens)

    # Feature extraction and normalization
    feats = self.modules.compute_features(wavs)
    feats = self.modules.mean_var_norm(feats, lens)

    return feats, lens
```

#### Model Hyperparameters

```yaml
# pretrain folders:
pretrained_path: speechbrain/spkrec-ecapa-voxceleb

# Training Parameters
lr: 0.001
lr_final: 0.0001
sample_rate: 16000
number_of_epochs: 35
batch_size: 32

# Feature parameters
n_mels: 80
left_frames: 0
right_frames: 0
deltas: False

out_n_neurons: 50 # maximum number of speakers
emb_dim: 512 # dimensionality of the embeddings
dataloader_options:
    batch_size: !ref <batch_size>
```
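To make the role of `lr` and `lr_final` concrete, the sketch below computes a per-epoch learning rate under the assumption that the rate is decayed linearly from the first to the last epoch. The helper `linear_lr` is hypothetical and is not the scheduler class used by the toolkit.

```python
def linear_lr(epoch, num_epochs, lr_start, lr_final):
    """Learning rate for a 1-based epoch, decayed linearly from start to final."""
    if num_epochs == 1:
        return lr_start
    fraction = (epoch - 1) / (num_epochs - 1)
    return (1 - fraction) * lr_start + fraction * lr_final


# Values taken from the hyperparameters above.
schedule = [linear_lr(e, 35, 0.001, 0.0001) for e in range(1, 36)]
print(schedule[0], schedule[-1])  # 0.001 0.0001
```

Starting with a larger rate lets the freshly initialised classifier layer adapt quickly, while the smaller final rate reduces the risk of disturbing the pretrained weights late in training.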

##### Environmental Corruption

```yaml
# Added noise and reverb come from OpenRIR dataset, automatically
# downloaded and prepared with this Environmental Corruption class.
env_corrupt: !new:speechbrain.lobes.augment.EnvCorrupt
    openrir_folder: !ref <data_folder>
    babble_prob: 0.0
    reverb_prob: 0.0
    noise_prob: 1.0
    noise_snr_low: 0
    noise_snr_high: 15
```

##### Data Augmentation

```yaml
# Adds speed change + time and frequency dropouts (time-domain implementation)
# A small speed change helps to improve the performance of speaker-id as well.
augmentation: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
    sample_rate: !ref <sample_rate>
    speeds: [90, 95, 100, 105, 110]
```

##### Normalisation

```yaml
# Mean normalization of the input features
# (std_norm: False leaves the variance unnormalized)
mean_var_norm: !new:speechbrain.processing.features.InputNormalization
    norm_type: sentence
    std_norm: False
```

##### Model parameters

Once the data is preprocessed and augmentation is applied, the model learns the new speaker, adjusting its weights and biases accordingly. For this, we use 1024 channels in the convolutional frame layers. The dimension of the bottleneck in the attention module is set to 128, and the number of nodes in the final fully connected layer is 192.

```yaml
# Feature extraction
compute_features: !new:speechbrain.lobes.features.Fbank
    n_mels: !ref <n_mels>
    left_frames: !ref <left_frames>
    right_frames: !ref <right_frames>
    deltas: !ref <deltas>

embedding_model: !new:speechbrain.lobes.models.ECAPA_TDNN.ECAPA_TDNN
    input_size: !ref <n_mels>
    channels: [1024, 1024, 1024, 1024, 3072]
    kernel_sizes: [5, 3, 3, 3, 1]
    dilations: [1, 2, 3, 4, 1]
    attention_channels: 128
    lin_neurons: 192

classifier: !new:speechbrain.lobes.models.ECAPA_TDNN.Classifier
    input_size: 192
    out_neurons: !ref <out_n_neurons>
```

### Speechbrain: A Pytorch-based Speech Toolkit

Add some text here...

```python
# Path to model hyperparameters file
hparams_file = "speaker_verif/custom_train.yaml"

# Parse run options and overrides. The original cell does not define
# `run_opts` or `overrides`; parsing them with SpeechBrain's argument
# parser is an assumption about how they were obtained.
hparams_file, run_opts, overrides = sb.parse_arguments([hparams_file])

# Initialize ddp (useful only for multi-GPU DDP training).
sb.utils.distributed.ddp_init_group(run_opts)

# Load hyperparameters file with command-line overrides.
with open(hparams_file) as fin:
    hparams = load_hyperpyyaml(fin, overrides)

# Create experiment directory
sb.create_experiment_directory(
    experiment_directory=hparams["output_folder"],
    hyperparams_to_save=hparams_file,
    overrides=overrides,
)

# Data preparation, to be run on only one process.
sb.utils.distributed.run_on_main(
    prepare_user_data,
    kwargs={
        "data_folder": hparams["data_folder"],
        "save_json_train": hparams["train_annotation"],
        "save_json_valid": hparams["valid_annotation"],
        "save_json_test": hparams["test_annotation"],
        "split_ratio": [80, 10, 10],
    },
)

# Load the pretrained model
if "pretrainer" in hparams:
    hparams["pretrainer"].collect_files()
    hparams["pretrainer"].load_collected(device=run_opts["device"])
else:
    print("No pretrained model found, training from scratch.")

# Create dataset objects "train", "valid", and "test".
datasets = dataio_prep(hparams)

# Initialize the Brain object to prepare for training.
spk_id_brain = SpkIdBrain(
    modules=hparams["modules"],
    opt_class=hparams["opt_class"],
    hparams=hparams,
    run_opts=run_opts,
    checkpointer=hparams["checkpointer"],
)

# The `fit()` method iterates the training loop, calling the methods
# necessary to update the parameters of the model. Since all objects
# with changing state are managed by the Checkpointer, training can be
# stopped at any point, and will be resumed on next call.
spk_id_brain.fit(
    epoch_counter=spk_id_brain.hparams.epoch_counter,
    train_set=datasets["train"],
    valid_set=datasets["valid"],
    train_loader_kwargs=hparams["dataloader_options"],
    valid_loader_kwargs=hparams["dataloader_options"],
)

# Load the best checkpoint for evaluation
test_stats = spk_id_brain.evaluate(
    test_set=datasets["test"],
    min_key="error",
    test_loader_kwargs=hparams["dataloader_options"],
)
```

### Verification through Inference

To verify the identity of an unknown speaker, a test utterance from that speaker is passed as input to the trained DNN. A compact deep embedding associated with the unknown speaker is generated and compared with the compact deep embeddings of the enrolled speakers by calculating their cosine similarity. The similarity between the compared embeddings indicates how likely it is that the unknown speaker belongs to the set of enrolled speakers.

```python
# Imports for inference

import os
import shutil
import glob
from random import shuffle
from torch.nn import CosineSimilarity
from torchaudio import load as load_signal
from speechbrain.pretrained import EncoderClassifier
```

To enable easy accessibility to the most recently trained model, we first move it to the "content/best_model" path along with associated hyperparameters and class labels. 

```python
src_path = "results/speaker_id/1986/save/"  # Path to trained network checkpoints
dest_path = "content/best_model/"           # Path to store most recently trained model information

# Start from an empty destination folder
if os.path.exists(dest_path):
    shutil.rmtree(dest_path)

os.makedirs(dest_path)  # also creates the parent "content" folder if missing
shutil.copy2("./hparams_inference.yaml", dest_path)
shutil.copy2(src_path + "label_encoder.txt", dest_path)

# Copy the files of the most recently created checkpoint
ckpt_files = glob.glob(src_path + "CKPT*")
if not ckpt_files:
    raise SystemExit("No trained checkpoints")
latest_ckpt_path = max(ckpt_files, key=os.path.getctime)
for file in glob.glob(latest_ckpt_path + "/*"):
    shutil.copy2(file, dest_path)
```

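Now, to begin inference, we identify the path to the recorded test signal and the *unique* user id that the test signal should be tested against. We then load the trained model with SpeechBrain's `EncoderClassifier`, a pretrained-model interface that bundles feature extraction, normalisation and the embedding model; its `encode_batch` method returns the embedding of an input waveform.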

```python
# Build Classifier
classifier = EncoderClassifier.from_hparams(
    source="content/best_model",
    hparams_file="hparams_inference.yaml",
    savedir="content/best_model",
)
```

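Cosine similarity measures the cosine of the angle between two embedding vectors: it is their dot product divided by the product of their norms. It ranges from -1 (opposite directions) to 1 (same direction) and does not depend on the magnitudes of the vectors, so two utterances from the same speaker should yield a score close to 1.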

<center><img src="../images/cos_sim.png" style="width: 500px;"/></center>
<p style="text-align: center">
    <b>Calculation of Cosine Similarity</b>
</p>
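The calculation can be written out by hand in a few lines. This sketch uses plain Python lists and a hypothetical `cosine_similarity` helper for illustration; the notebook itself relies on `torch.nn.CosineSimilarity`.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)  # eps avoids division by zero


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

A verification decision then reduces to comparing this score against a threshold: scores above it are accepted as the same speaker.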

```python
# Cosine Similarity
similarity = CosineSimilarity(dim=-1, eps=1e-8)  # dim=-1 refers to the last dimension (i.e. the embedding dimension)
```

The verification process is divided into two steps: extracting a vector embedding for each voice signal, and calculating the similarity between the test embedding and the embedding of a recorded sample from the enrolled speaker. To obtain a more reliable verification decision, we compare the test signal against up to 5 randomly selected voice samples from the enrolled speaker.

```python
def extract_audio_embeddings(model, wav_audio_file_path: str):
    """Feature extractor that embeds audio into a vector."""
    signal, _ = load_signal(wav_audio_file_path)  # Load the audio file as a tensor
    # Pass the tensor through the pretrained network and extract its embedding
    return model.encode_batch(signal)


def verify(s1, s2, threshold=0.25):
    """Return True if the cosine similarity of two embeddings exceeds the threshold."""
    score = similarity(s1, s2)  # one score per pair of embeddings in the batch
    return bool((score > threshold).any())


# `test_signal_path` and `spk_id` must be set beforehand to the recorded test
# signal and the unique id of the claimed speaker.
test_emb = extract_audio_embeddings(classifier, test_signal_path)

spk_samples = glob.glob(f"data/user_data/raw/{spk_id}/*/*.wav")
shuffle(spk_samples)
for sample_path in spk_samples[:5]:  # test on up to 5 random samples
    print(f"Testing sample against {sample_path}")
    sample_emb = extract_audio_embeddings(classifier, sample_path)
    if verify(test_emb, sample_emb):
        print("User Verified")
        break
else:
    # No sample exceeded the similarity threshold
    print("Suspicious User - Access Denied")
```

### Results

<center><img src="../images/ecapa-tdnn-7205.png" style="width: 400px;"/></center>
<p style="text-align: center">
    <b>Training Loss and Validation Loss for Pretrained Model</b>
</p>

##### System Performance

Some text here...

##### Strengths

Text here

##### Weaknesses and Limitations

<ins>Cosine Threshold Determination</ins> 

text here

<ins>Biases</ins> 

- Baseline pretrained model - 61% male, 39% female (skewed towards males) (Representation bias)
- Measurement bias
- Evaluation bias (in pretrained model and our model --> determination of output neurons is based on maximum allowed users in app)
- Preprocessing (environmental noise is not removed to create clean sample in our model) -- improve in future work

[Article](https://arxiv.org/pdf/2201.09486.pdf)

##### Possible Future Work

Text here


### Key Features to Discuss

- Network Architecture (ECAPA-TDNN architecture) (See [here](https://arxiv.org/pdf/2005.07143.pdf)) (Ahmet)
    - Channel- and context-dependent attention mechanism
    - Multi-layer Feature Aggregation (MFA)
    - AAMsoftmax loss
- Connectionist temporal classification loss (CTC loss)
- VAD
- Statistical pooling (Ahmet)
- Data Augmentation (Adding time/frequency dropouts, speed change, environmental corruption, noise addition) (Armaan)
- Dropout
- Normalisation
- Linear Learning Rate Decay and Adam Optimiser



### Conclusion ###
While speaker verification on its own cannot guarantee security, it adds strength and friction to the protection of our online identities and reduces the likelihood of incorrect authentication. In future experiments, we plan to use a similar framework for speech recognition and, eventually, to use neural networks to speak in the voice of another person.

### References

##### Datasets

@InProceedings{Nagrani17,
  author       = "Nagrani, A. and Chung, J.~S. and Zisserman, A.",
  title        = "VoxCeleb: a large-scale speaker identification dataset",
  booktitle    = "INTERSPEECH",
  year         = "2017",
}

@InProceedings{Chung18,
  author       = "Chung, J.~S. and Nagrani, A. and Zisserman, A.",
  title        = "VoxCeleb2: Deep Speaker Recognition",
  booktitle    = "INTERSPEECH",
  year         = "2018",
}

##### Other

@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}