## Automated Speaker Verification - A Study in Deep Neural Networks and Transfer Learning 

### Abstract ###
*"Identity theft is not a joke" - Dwight Schrute, The Office*

As the world rapidly shifts towards a technological landscape, the need to protect our online identities becomes increasingly imperative. Engineers have tackled this challenge through facial and fingerprint detection, but the ability to authenticate an individual from a small sample of their voice could transform this space. While security can never be guaranteed, speaker verification adds significantly more resistance against anyone attempting to impersonate someone else, and can increase the confidence of the general population in the safety of their identities. Further, considering the recent major advancements in virtual reality, speaker verification, and by extension speech recognition, will have a substantial influence on our everyday lives, whether it be 40, 20, or even 10 years from now. Hence, our team decided to explore this space, deepening our understanding of transfer learning by comparing past state-of-the-art TDNN models with the more accurate ECAPA-TDNN model.

### Introduction ###
Until recently, x-vectors provided state-of-the-art results for speaker verification tasks. Typically, the network is trained on a speaker identification task. After convergence, speaker embeddings can be extracted from the penultimate layer to characterise the speaker in a recording. Speaker verification can then be accomplished by comparing two embeddings with a simple cosine distance measurement (more on this later). The rising popularity of the x-vector system has led to major architectural improvements and optimised performance over the initial approach. Adding residual connections between the frame-level layers has been shown to enhance the embeddings, with the added benefit of helping the back-propagation algorithm converge faster and avoid the vanishing gradient problem. The statistics pooling layer in the x-vector system projects a variable-length input into a fixed-length representation by gathering statistics of the hidden node activations across time. A temporal self-attention system is often added to this pooling layer, which allows the network to focus on the frames it deems important; alternatively, it can be interpreted as a Voice Activity Detection (VAD) preprocessing step that detects irrelevant non-speech frames. Our model expands on this by including enhancements to the TDNN architecture and the statistics pooling layer. Namely, it implements channel- and context-dependent statistics pooling and adds multi-layer feature aggregation and summation, both of which are intended to increase accuracy on most speaker verification tasks. The final section concludes with a brief overview of our findings.

#### Requirements

To create and activate a virtual environment, run

```shell
python3 -m venv env
source env/bin/activate
```

To install the required Python packages, run

```shell
pip install -r requirements.txt
```

### Transfer Learning

The DNN is trained to classify speakers using a training set of speech recorded from a large number of training speakers. To leverage feature representations from the pretrained model on the large dataset, speech recorded from each set of enrollment speakers is passed as input to the trained DNN. This enables the computations of deeper hidden features for each speaker in the enrollment set, which are then averaged to generate a compact deep embedding associated with that speaker.
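The averaging step described above can be sketched in a few lines. This is a minimal illustration only: it assumes each enrollment utterance has already been turned into an embedding (represented here as a plain list of floats), and the helper name `average_embedding` is hypothetical rather than part of our code base.

```python
def average_embedding(utterance_embeddings):
    """Average per-utterance embeddings into one enrollment embedding.

    `utterance_embeddings` is a list of equal-length lists of floats,
    one per enrollment utterance (hypothetical input format).
    """
    if not utterance_embeddings:
        raise ValueError("need at least one utterance embedding")
    dim = len(utterance_embeddings[0])
    count = len(utterance_embeddings)
    mean = [0.0] * dim
    for emb in utterance_embeddings:
        for i, value in enumerate(emb):
            mean[i] += value / count
    return mean


# Two toy 2-dimensional "embeddings" from the same enrolled speaker
speaker_embedding = average_embedding([[1.0, 0.0], [0.0, 1.0]])
print(speaker_embedding)
```

In practice the embeddings would be 192-dimensional tensors produced by the trained network, but the arithmetic is the same.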

### Data Preparation

##### VoxCeleb1 and VoxCeleb2

VoxCeleb1 is a large-scale audio-visual dataset for speaker identification with approximately 150,000 samples from 1,251 speakers. It consists of short clips of human speech, extracted from interview videos uploaded to YouTube. VoxCeleb2 is similar in nature but considerably larger. It contains over 1 million utterances from 7000 different speakers, totalling over 2000 hours of both audio and video. Each segment is at least 3 seconds long and is captured 'in the wild' with background chatter, laughter and overlapping speech. The 7000 speakers span a wide range of different ethnicities, accents, professions and ages. ([VoxCeleb, 2022](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/))

Our model is trained on audio files collected from both the VoxCeleb1 and VoxCeleb2 datasets and it achieves an accuracy of approximately 98-99%.

### Network Architecture: [ECAPA-TDNN](https://arxiv.org/pdf/2005.07143.pdf)


<center><img src="../images/model.jpg" style="width: 250px;"/></center>
<p style="text-align: center">
    <b>Block diagram of the ECAPA-TDNN model</b>
</p>

#### How the model works ####
Initially, the training script searches for all the .wav files in the specified data folder and randomly splits them into 80% for training, 10% for validation and 10% for testing. The network is then trained on the training split after data preprocessing and augmentation are applied. A Voice Activity Detection (VAD) preprocessing step is used to detect irrelevant non-speech frames, and the augmentation varies the speed of the spoken sample [see Data Augmentation]. Once the data is preprocessed and augmented, the model starts to learn the new speaker, adjusting its weights and biases accordingly. For this, we apply 1024 channels in the convolutional frame layers. The dimension of the bottleneck in the attention module is set to 128, and the number of nodes in the final fully connected layer is 192. The final layer represents the number of classes the embedding could fall under, which, after applying the cosine similarity operation, can be used to estimate whether the two speakers are identical or not.
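The 80/10/10 split can be illustrated with a short sketch. The function name `split_files` and the fixed seed are assumptions made for this example; the project's own `prepare_user_data` function performs the real split and may differ in detail.

```python
import random


def split_files(paths, ratios=(80, 10, 10), seed=0):
    """Randomly split file paths into train/valid/test by the given ratios."""
    paths = sorted(paths)                 # deterministic starting order
    random.Random(seed).shuffle(paths)    # reproducible shuffle
    total = sum(ratios)
    n_train = len(paths) * ratios[0] // total
    n_valid = len(paths) * ratios[1] // total
    return {
        "train": paths[:n_train],
        "valid": paths[n_train:n_train + n_valid],
        "test": paths[n_train + n_valid:],  # remainder goes to the test set
    }


files = [f"data/speaker/{i}.wav" for i in range(50)]  # hypothetical paths
splits = split_files(files)
print({name: len(items) for name, items in splits.items()})
```

Because the split is random, the per-speaker balance of each subset depends on chance, which is discussed further under Weaknesses and Limitations.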

#### Model Improvement 1: Channel-dependent Attentive Statistics Pooling
Neural networks are known to learn hierarchical structures, with each layer operating on a different level of complexity. In the ECAPA-TDNN model, features are aggregated and propagated at different hierarchical levels to produce better results. The statistics pooling module is improved with channel-dependent frame attention, enabling the network to focus on a different subset of frames when estimating each channel's statistics. Which frames it focuses on depends on which frames it deems important, and this is determined through the following attention mechanism:


<center><img src="../images/attentionMechanism.png" style="width: 250px;"/></center>
<p style="text-align: center">
    <b>Description</b>
</p>

The scalar score is then normalised [see Normalisation] over all frames by applying the softmax function channel-wise across time.

<center><img src="../images/softmax.png" style="width: 250px;"/></center>
<p style="text-align: center">
    <b>Description</b>
</p>

The weighted mean vector and channel component are then constructed as follows:

<center><img src="../images/weightedMean.png" style="width: 250px;"/></center>
<p style="text-align: center">
    <b>Description</b>
</p>

<center><img src="../images/channel.png" style="width: 250px;"/></center>
<p style="text-align: center">
    <b>Description</b>
</p>
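The pooling steps above can be sketched in NumPy. This is our reading of the formulation in the ECAPA-TDNN paper linked in the section heading, not a copy of the SpeechBrain implementation: the parameter names `W`, `b`, `V`, `k` and the use of `tanh` as the non-linearity are assumptions for illustration.

```python
import numpy as np


def attentive_stats_pooling(h, W, b, V, k):
    """Channel-dependent attentive statistics pooling (sketch).

    h: (T, C) frame-level features, T frames and C channels
    W: (R, C), b: (R,)  bottleneck projection (R = bottleneck size)
    V: (C, R), k: (C,)  per-channel scoring layer
    Returns a (2 * C,) vector: weighted mean and weighted standard deviation.
    """
    hidden = np.tanh(h @ W.T + b)                  # (T, R)
    scores = hidden @ V.T + k                      # (T, C), one score per frame and channel
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=0, keepdims=True)      # softmax over time, per channel
    mean = (alpha * h).sum(axis=0)                 # weighted mean per channel
    var = (alpha * h ** 2).sum(axis=0) - mean ** 2
    std = np.sqrt(np.clip(var, 1e-12, None))       # weighted standard deviation
    return np.concatenate([mean, std])


rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 4))                  # 50 frames, 4 channels
pooled = attentive_stats_pooling(
    frames,
    W=rng.normal(size=(3, 4)), b=np.zeros(3),
    V=rng.normal(size=(4, 3)), k=np.zeros(4),
)
print(pooled.shape)
```

Note that the output length depends only on the number of channels, not on the number of frames, which is what lets the network handle variable-length recordings.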

#### Model Improvement 2: Multi-layer Feature Aggregation (MFA) ####
In the original x-vector system that our model is based on, only the final frame-level layer is used for calculating the pooled statistics, although recent evidence shows that the shallower feature maps also contribute to more robust speaker embeddings. Hence, we use all the frame-level layers to calculate the final pooled statistics (Multi-layer Feature Aggregation). For each frame, our system concatenates the output feature maps of all the SE-Res2Blocks. Once all the feature maps are concatenated, a dense layer processes the result and generates the features for attentive statistics pooling.

### Training a Speech Diarisation model

```python
import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml
from user_data_prepare import prepare_user_data
from speaker_verif.custom_train import SpkIdBrain, dataio_prep
```

### Hyperparameters

```yaml
# pretrain folders:
pretrained_path: speechbrain/spkrec-ecapa-voxceleb

# Training Parameters
lr: 0.001
lr_final: 0.0001
sample_rate: 16000
number_of_epochs: 30
batch_size: 8

out_n_neurons: 10 # number of enrolled speakers
dataloader_options:
    batch_size: !ref <batch_size>
```
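The `lr` and `lr_final` values above suggest a learning rate that decays over the 30 epochs. The sketch below assumes a simple linear decay from `lr` to `lr_final`; the exact scheduler configured in `custom_train.yaml` is not shown here, so treat the function `linear_lr` as a hypothetical illustration rather than the actual schedule.

```python
def linear_lr(epoch, n_epochs=30, lr_initial=0.001, lr_final=0.0001):
    """Linearly interpolate the learning rate; `epoch` is counted from 0."""
    if n_epochs <= 1:
        return lr_initial
    progress = epoch / (n_epochs - 1)   # 0.0 at the first epoch, 1.0 at the last
    return lr_initial + progress * (lr_final - lr_initial)


for epoch in (0, 10, 20, 29):
    print(epoch, round(linear_lr(epoch), 6))
```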

##### Environmental Corruption

In realistic speech processing applications, the signal recorded by the microphone is corrupted by noise and reverberation. This is particularly harmful in distant-talking (far-field) scenarios, where the speaker and the reference microphone are distant (think about popular devices such as Google Home, Amazon Echo, Kinect, and similar devices).

A common practice in neural speech processing is to start from clean speech recordings and artificially turn them into noisy ones. This process is called environmental corruption (sometimes also referred to as speech contamination). An advantage of this is that the audio can be corrupted in many different ways, which effectively increases the size of the training set. Two of those ways are Additive Noise and Reverberation.

<ins>Additive Noise</ins>

Noise samples from a data collection are added to the clean speech signals with a random signal-to-noise ratio (SNR). The amount of noise can be tuned by adjusting the range from which the SNR is sampled.
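As a rough illustration of mixing at a chosen SNR, the sketch below scales a noise signal so that the ratio of speech power to noise power matches the requested value in decibels. The function `add_noise` is hypothetical and works on plain Python lists; the 0 to 15 dB range in the usage lines mirrors the `noise_snr_low` and `noise_snr_high` settings used in our configuration.

```python
import math
import random


def add_noise(clean, noise, snr_db):
    """Mix `noise` into `clean` at the requested signal-to-noise ratio (dB).

    Both signals must have the same length and the noise must not be silent.
    """
    clean_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(x * x for x in noise) / len(noise)
    # Choose the scale so that clean_power / (scale^2 * noise_power) = 10^(snr_db / 10)
    scale = math.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]


rng = random.Random(0)
clean = [math.sin(0.1 * i) for i in range(1000)]     # stand-in for speech
noise = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
snr_db = rng.uniform(0, 15)                          # random SNR per call
noisy = add_noise(clean, noise, snr_db)
```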

<ins>Reverberation</ins>

When speaking into a room, our speech signal is reflected multiple times by the walls, floor, ceiling, and by the objects within the acoustic environment. Consequently, the final signal recorded by a distant microphone will contain multiple delayed replicas of the original signal. All these replicas interfere with each other and significantly affect the intelligibility of the speech signal.

Such a multi-path propagation is called reverberation. Within a given room enclosure, the reverberation between a source and a receiver is modeled by an impulse response. The reverberation is added by performing a convolution between a clean signal and an impulse response.
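The convolution with an impulse response can be written out directly. This is a deliberately simple sketch in pure Python (a real pipeline would use an FFT-based convolution and a measured impulse response); the function name `convolve` and the toy impulse response are assumptions for illustration.

```python
def convolve(signal, impulse_response):
    """Full discrete convolution of a signal with an impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h   # each tap adds a delayed, scaled replica
    return out


# A single click followed by silence, and a toy impulse response with
# the direct path plus one echo two samples later at half the amplitude.
click = [1.0, 0.0, 0.0, 0.0]
toy_ir = [1.0, 0.0, 0.5]
print(convolve(click, toy_ir))  # [1.0, 0.0, 0.5, 0.0, 0.0, 0.0]
```

The output shows the delayed replica described above: the original click at sample 0 and its echo at sample 2.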

<ins>Environmental Corruption Lobe</ins>

Noise and reverberation are often combined, and each is activated with a certain probability. The corruption operations are performed in a physically sensible order: we first introduce reverberation, and only afterwards add noise. We use an open-source dataset of impulse responses and noise sequences called OpenRIR and perform environmental corruption by sampling from it.

If we call the corruption function another time, the signal is contaminated in a different way. This allows us to implement an on-the-fly speech contamination and apply different distortions to each different input. Environmental corruption is not computationally demanding and does not slow down the training loop even when doing it on-the-fly.

```yaml
# Added noise and reverb come from OpenRIR dataset, automatically
# downloaded and prepared with this Environmental Corruption class.
env_corrupt: !new:speechbrain.lobes.augment.EnvCorrupt
    openrir_folder: !ref <data_folder>
    babble_prob: 0.0
    reverb_prob: 0.0
    noise_prob: 1.0
    noise_snr_low: 0
    noise_snr_high: 15
```

##### Data Augmentation

Another way we pre-process the data is through speech augmentation, which also increases the effective size of our training data. The idea is to artificially corrupt the original speech signals to give the network the illusion that it is processing a new signal. This acts as a powerful regularizer that normally helps neural networks generalize better and thus achieve better performance on test data. The augmentation techniques we use are Speed Perturbation, Time Dropout, Frequency Dropout and Clipping.

<ins>Speed Perturbation</ins>

With Speed perturbation, we resample the audio signal to a sampling rate that is a bit different from the original one. With this simple trick we can synthesize a speech signal that sounds a bit "faster" or "slower" than the original one. Note that not only the speaking rate is affected, but also the speaker characteristics such as pitch and formants.
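A minimal sketch of the idea, assuming linear interpolation as the resampling method (real toolkits use higher-quality resampling filters). The function `change_speed` is hypothetical; the percentage convention follows the `speeds` list in our configuration, where 100 leaves the signal unchanged.

```python
def change_speed(signal, speed_percent):
    """Resample by linear interpolation; 110 plays about 10% faster (shorter)."""
    factor = speed_percent / 100.0
    new_len = int(len(signal) / factor)
    out = []
    for n in range(new_len):
        pos = n * factor                      # position in the original signal
        i = int(pos)
        frac = pos - i
        nxt = signal[min(i + 1, len(signal) - 1)]
        out.append((1 - frac) * signal[i] + frac * nxt)
    return out


original = [float(i) for i in range(100)]
for speed in (90, 100, 110):
    print(speed, len(change_speed(original, speed)))
```

Played back at the original sampling rate, the shorter signal sounds faster and higher-pitched, and the longer one slower and lower-pitched.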

<ins>Time Dropout</ins>

This replaces some random chunks of the original waveform with zeros. The intuition is that the neural network should provide good performance even when some piece of the signal is missing. Conceptually, this is similar to dropout. The difference is that this is applied to the input waveform only. The other difference is that we drop consecutive samples rather than randomly selected elements like in dropout.
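A small sketch of time dropout on a waveform held in a Python list. The function name `drop_chunks` and its default chunk count and length are assumptions for illustration, not the settings used by the augmentation lobe.

```python
import random


def drop_chunks(signal, n_chunks=2, max_len=100, rng=None):
    """Zero out `n_chunks` random runs of consecutive samples."""
    rng = rng or random.Random()
    out = list(signal)                                  # leave the input untouched
    for _ in range(n_chunks):
        length = rng.randint(1, min(max_len, len(out)))
        start = rng.randint(0, len(out) - length)
        out[start:start + length] = [0.0] * length      # consecutive samples, not scattered ones
    return out


waveform = [1.0] * 1000
dropped = drop_chunks(waveform, rng=random.Random(0))
print(dropped.count(0.0), "samples zeroed")
```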

<ins>Frequency Dropout</ins>

Instead of adding zeros in the time domain, frequency dropout adds zeros in the frequency domain. This can be achieved by filtering the original signal with randomly selected band-stop filters. As with time dropout, the intuition is that the neural network should work well even when some frequency channels are missing.

<ins>Clipping</ins>

Another way to remove some piece of information from a speech signal is to add clipping. It is a form of non-linear distortion that clamps the maximum absolute amplitude of the signal (thus adding a saturation effect). In the frequency domain, clipping adds harmonics in the higher part of the spectrum.
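Clipping is simple enough to show in full. In this sketch the clamp level is expressed as a fraction of the signal's peak amplitude; both the function name `clip` and that parameterisation are assumptions for illustration.

```python
def clip(signal, fraction=0.5):
    """Clamp samples to +/- (fraction * peak absolute amplitude)."""
    peak = max(abs(x) for x in signal)
    limit = fraction * peak
    return [max(-limit, min(limit, x)) for x in signal]


print(clip([0.2, -1.0, 0.6, 0.4]))  # [0.2, -0.5, 0.5, 0.4]
```

Samples below the limit pass through unchanged, while louder samples saturate at the limit.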

<ins>Data Augmentation Lobe</ins>

Similar to the environmental corruption lobe, the various data augmentation techniques are also applied and activated with a certain probability and this can similarly be adjusted on the fly.

```yaml
# Adds speed change + time and frequency dropouts (time-domain implementation)
# A small speed change helps to improve the performance of speaker-id as well.
augmentation: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
    sample_rate: !ref <sample_rate>
    speeds: [90, 95, 100, 105, 110]
```

##### Normalisation

```yaml
# Mean and std normalization of the input features
mean_var_norm: !new:speechbrain.processing.features.InputNormalization
    norm_type: sentence
    std_norm: False
```
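With `norm_type: sentence` and `std_norm: False`, we understand the normalisation to subtract, for each utterance, the mean of every feature dimension computed over that utterance's frames, without dividing by the standard deviation. The sketch below illustrates that arithmetic on nested lists; `sentence_mean_norm` is a hypothetical helper, not the SpeechBrain implementation.

```python
def sentence_mean_norm(features):
    """Subtract the per-utterance mean of each feature dimension.

    `features` is a list of frames, each a list of filterbank values.
    """
    n_frames = len(features)
    n_bins = len(features[0])
    means = [sum(frame[j] for frame in features) / n_frames for j in range(n_bins)]
    return [[frame[j] - means[j] for j in range(n_bins)] for frame in features]


# Two frames with two feature dimensions each
print(sentence_mean_norm([[1.0, 10.0], [3.0, 14.0]]))  # [[-1.0, -2.0], [1.0, 2.0]]
```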

##### Model parameters

Once the data is preprocessed and augmented, the model is tasked with learning the new speaker by adjusting its weights and biases accordingly.


For this, the ECAPA-TDNN model applies 1024 channels in the convolutional frame layers. The dimension of the bottleneck in the attention module is set to 128, and the number of nodes in the final fully connected layer is 192.

```yaml
# Feature extraction
n_mels: 80   # 80-dimensional F-Bank Feature Maps
left_frames: 0
right_frames: 0
deltas: False

emb_dim: 192 # dimensionality of the embeddings

compute_features: !new:speechbrain.lobes.features.Fbank
    n_mels: !ref <n_mels>
    left_frames: !ref <left_frames>
    right_frames: !ref <right_frames>
    deltas: !ref <deltas>

embedding_model: !new:speechbrain.lobes.models.ECAPA_TDNN.ECAPA_TDNN
    input_size: !ref <n_mels>
    channels: [1024, 1024, 1024, 1024, 3072]
    kernel_sizes: [5, 3, 3, 3, 1]
    dilations: [1, 2, 3, 4, 1]
    attention_channels: 128
    lin_neurons: !ref <emb_dim>

classifier: !new:speechbrain.lobes.models.Xvector.Classifier
    input_shape: [null, null, !ref <emb_dim>]
    activation: !name:torch.nn.LeakyReLU
    lin_blocks: 1
    lin_neurons: !ref <emb_dim>
    out_neurons: !ref <out_n_neurons>
```


### [SpeechBrain](https://arxiv.org/pdf/2106.04624.pdf): A PyTorch-based Speech Toolkit

Add some text here...

<center><img src="../images/speechbrain_framework.jpg" style="width: 600px;"/></center>
<p style="text-align: center">
    <b>Basic Training Script using the SpeechBrain Framework (Ravanelli et al., 2021)</b>
</p>

```python
# Path to model hyperparameters file
hparams_file = "speaker_verif/custom_train.yaml"

# Parse run options and overrides. In a notebook there are no command-line
# arguments, so only the hyperparameters file is passed.
hparams_file, run_opts, overrides = sb.parse_arguments([hparams_file])

# Initialize ddp (useful only for multi-GPU DDP training).
sb.utils.distributed.ddp_init_group(run_opts)

# Load hyperparameters file with command-line overrides.
with open(hparams_file) as fin:
    hparams = load_hyperpyyaml(fin, overrides)

# Create experiment directory
sb.create_experiment_directory(
    experiment_directory=hparams["output_folder"],
    hyperparams_to_save=hparams_file,
    overrides=overrides,
)

# Data preparation, to be run on only one process.
sb.utils.distributed.run_on_main(
    prepare_user_data,
    kwargs={
        "data_folder": hparams["data_folder"],
        "save_json_train": hparams["train_annotation"],
        "save_json_valid": hparams["valid_annotation"],
        "save_json_test": hparams["test_annotation"],
        "split_ratio": [80, 10, 10],
    },
)

# Load the pretrained model
if "pretrainer" in hparams:
    hparams["pretrainer"].collect_files()
    hparams["pretrainer"].load_collected(device=run_opts["device"])
else:
    print("No pretrained model found, training from scratch.")

# Create dataset objects "train", "valid", and "test".
datasets = dataio_prep(hparams)

# Initialize the Brain object to prepare for speaker-ID training.
spk_id_brain = SpkIdBrain(
    modules=hparams["modules"],
    opt_class=hparams["opt_class"],
    hparams=hparams,
    run_opts=run_opts,
    checkpointer=hparams["checkpointer"],
)

# The `fit()` method iterates the training loop, calling the methods
# necessary to update the parameters of the model. Since all objects
# with changing state are managed by the Checkpointer, training can be
# stopped at any point, and will be resumed on next call.
spk_id_brain.fit(
    epoch_counter=spk_id_brain.hparams.epoch_counter,
    train_set=datasets["train"],
    valid_set=datasets["valid"],
    train_loader_kwargs=hparams["dataloader_options"],
    valid_loader_kwargs=hparams["dataloader_options"],
)

# Load the best checkpoint (lowest validation error) for evaluation
test_stats = spk_id_brain.evaluate(
    test_set=datasets["test"],
    min_key="error",
    test_loader_kwargs=hparams["dataloader_options"],
)
```


### Verification through Inference

To verify the identity of an unknown speaker, a test utterance of the unknown speaker is passed as input to the trained DNN. A compact deep embedding associated with the unknown speaker is generated and compared with the compact deep embeddings associated with each of the enrollment speakers by calculating their cosine similarity. The similarity between the compared embeddings indicates how likely it is that the unknown speaker belongs to the set of enrolled speakers.

```python
# Imports for inference

import os
import shutil
import glob
from random import shuffle
from torch.nn import CosineSimilarity
from torchaudio import load as load_signal
from speechbrain.pretrained import EncoderClassifier
```

To enable easy accessibility to the most recently trained model, we first move it to the "content/best_model" path along with associated hyperparameters and class labels. 

```python
src_path = "results/speaker_id/1986/save/"  # Path to trained network checkpoints
dest_path = "content/best_model/"           # Path to store most recently trained model information

if os.path.exists(dest_path):
    shutil.rmtree(dest_path)

os.mkdir(dest_path)
shutil.copy2("./hparams_inference.yaml", dest_path)
shutil.copy2(src_path + "label_encoder.txt", dest_path)
os.rename(dest_path + "label_encoder.txt", dest_path + "label_encoder.ckpt")
ckpt_files = glob.glob(src_path + "CKPT*")
if not ckpt_files:
    print("No trained checkpoints")
    exit(1)
latest_ckpt_path = max(ckpt_files, key=os.path.getctime)
for file in glob.glob(latest_ckpt_path + "/*"):
    shutil.copy2(file, dest_path)
```

Now, to begin inference, we identify the path to the recorded test signal and the *unique* user id that the test signal should be tested against. The `EncoderClassifier` class uses SpeechBrain's pretrained models to extract vector features (embeddings). This essentially allows our model to make sense of the uploaded audio and compare it with the model's known speakers, allowing for the classification of the uploaded audio clip.

```python
# Build Classifier
classifier = EncoderClassifier.from_hparams(
    source="content/best_model",
    hparams_file="hparams_inference.yaml",
    savedir="content/best_model",
)
```

Cosine similarity is used to compare the new audio clip's embeddings to those of existing speakers in the model. It is a useful metric for this comparison because it compares the direction of the embedding vectors irrespective of their magnitude. Using this metric, the similarity between the sample and each of the trained speakers can be evaluated, and based on whether it meets a similarity threshold, the model affirms or denies that the voice belongs to a particular speaker, and is able to determine which speaker it is.

<center><img src="../images/cos_sim.png" style="width: 500px;"/></center>
<p style="text-align: center">
    <b>Calculation of Cosine Similarity</b>
</p>

```python
# Cosine Similarity
similarity = CosineSimilarity(dim=-1, eps=1e-8) # dim=-1 refers to the last dimension (i.e. the embedding dimension)
```
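To make the metric concrete, here is a small pure-Python version of the same calculation, applied to toy two-dimensional vectors. The function `cosine_similarity` below is written for illustration only (our pipeline uses the PyTorch module above), and the 0.25 threshold is the value used in our verification code.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)   # eps guards against division by zero


THRESHOLD = 0.25
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0: same direction, different magnitude
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0: orthogonal
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]) > THRESHOLD)   # True: accepted as a match
```

The first example shows the scale invariance mentioned above: doubling the length of a vector does not change its similarity score.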

The verification process is divided into two steps: extracting vector embeddings for each voice signal, and calculating the similarity between the test signal and recorded samples from the enrolled speaker. To make verification more robust, we compare the test signal against up to 5 randomly selected voice samples from the enrolled speaker.

```python
THRESHOLD = 0.25  # minimum cosine similarity to accept a match


def extract_audio_embeddings(model, wav_audio_file_path: str) -> tuple:
    """Embed an audio file and return (embeddings, predicted labels)."""
    signal, _ = load_signal(wav_audio_file_path)  # Load audio signal as a tensor
    _, _, _, text_lab = model.classify_batch(signal)
    print("Possible user_ids", text_lab)
    # Pass tensor through pretrained neural net and extract representation
    embeddings = model.encode_batch(signal)
    return embeddings, text_lab


def verify(s1, s2) -> bool:
    """Return True if any pairwise cosine similarity exceeds THRESHOLD."""
    score = similarity(s1, s2)  # one score per compared embedding pair
    return bool((score > THRESHOLD).any())


def verify_speaker(model, test_signal_path: str, spk_id: str) -> bool:
    """Verify a test recording against the enrolled speaker `spk_id`."""
    test_emb, possible_ids = extract_audio_embeddings(model, test_signal_path)
    if spk_id in possible_ids:
        return True

    spk_samples = glob.glob(f"data/user_data/raw/{spk_id}/*/*.wav")
    shuffle(spk_samples)
    for sample_path in spk_samples[:5]:  # test on up to 5 random samples
        print(f"Testing sample against {sample_path}")
        sample_emb, _ = extract_audio_embeddings(model, sample_path)
        if verify(test_emb, sample_emb):
            return True
    return False


# `test_signal_path` and `spk_id` must be set beforehand to the recording
# under test and the claimed user id.
if verify_speaker(classifier, test_signal_path, spk_id):
    print("User Verified")
else:
    print("Suspicious User - Access Denied")
```

### Results

<center><img src="../images/pretrained_model.png" style="width: 400px;"/></center>
<p style="text-align: center">
    <b>Training Loss and Validation Loss for Pretrained Model</b>
    <br>
    <b1><i>VoxCeleb1 and VoxCeleb2</i></b1>
</p>

A performance overview of the pretrained model suggested a validation error rate of approximately 1.66% (see [train_log.txt](results/speaker_id/1986/train_log.txt)), reflecting relatively high precision of the training cycle. 

<center><img src="../images/no_transfer_learning.png" style="width: 400px;"/><img src="../images/transfer_learning.png" style="width: 400px;"/></center>
<p style="text-align: center">
    <b>Training Loss and Validation Loss for Trained Model</b>
    <br>
    <b1><i>10 speakers with 10 training samples each</i></b1>
</p>

Both models reached a validation error rate of approximately 16.7%, which may be due to only 10 training samples being provided for each enrolled speaker. The accuracy of the model would likely increase with the number of training samples, so this functionality was added to our application: enrolled users can supply more recordings to improve the accuracy of their model.

It was also surprising that, while the performance statistics show little difference in training and validation loss between the two models, qualitative observations revealed a larger difference in how accurately the speaker embeddings verified the correct speaker. Despite the classifier being trained on the same training samples for the same speakers, the absence of transfer learning resulted in poorer speaker feature representations and, in turn, inaccurate speaker classifications. In particular, when tested against a random voice sample that the model had not been trained on, the model without transfer learning verified the speaker despite classifying it incorrectly. In contrast, when trained with transfer learning, the speaker embeddings were correctly classified by the model. This supported our hypothesis that transfer learning improves the quality of speaker feature representations, and hence the accuracy of the speaker verification system. However, qualitative observations did not provide sufficient evidence that transfer learning decreased the training time of the network. This is potentially due to the small dataset that the model was trained on in this project.

##### Computational Cost

Several system constraints became apparent during the training stages of our model. In particular, with one speaker training sample of 10 inputs and a batch size of 160 after data augmentation, the number of trainable parameters in our model was estimated at approximately 20.8 million.

##### Strengths

Text here

##### Weaknesses and Limitations

A key determinant of the accuracy of speaker diarisation models is the quality of the speaker embeddings used to distinguish speaker samples by speaker id.  

<ins>Data Preprocessing</ins> 

A key weakness in the samples collected by enrolled speakers in the current version of our application is in its inability to remove background noise from training samples. While the pretrained model is trained on clean voice samples, the absence of noise reduction algorithms is considered to be a critical source of error in this trained model's ability to produce accurate feature representations. However, this is addressed as a source of future improvement in the model. In particular, the [Noisereduce](https://pypi.org/project/noisereduce/) Python module offers a way of estimating the noise threshold for each frequency band in a signal. This threshold is then used to compute a mask, which gates noise below the frequency-varying threshold (Sainburg, 2022).

<ins>Data Splitting</ins> 

When allocating data to the training, validation and testing sets, we use a classic 80-10-10 split. This makes the performance statistics less reliable, since the split is determined randomly. The problem is of greater concern in this multi-class setting with multiple speakers, since the random split affects how well each speaker is represented in the test set and can lead to unreliable conclusions about model performance. As an improvement for future models, we propose a simple k-fold cross-validation resampling procedure that evaluates K models on K data splits and averages the performance measures across the models. While this method is computationally expensive, it would theoretically be advantageous for smaller datasets ([SciKit, 2022](https://scikit-learn.org/stable/modules/cross_validation.html)).
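The proposed resampling procedure can be sketched as follows. This is a hand-written illustration of how k-fold indices are generated, with a hypothetical function name `k_fold_indices`; in practice a library routine such as scikit-learn's `KFold` would normally be used, and the samples would be shuffled (or stratified by speaker) before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


for fold, (train_idx, test_idx) in enumerate(k_fold_indices(10, 5)):
    print(fold, test_idx)
```

Each sample appears in exactly one test fold, so averaging the K error rates uses every recording for evaluation once.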

<ins>Model Complexity</ins> 

Several empirical observations made during parameter selection suggested that increasing the complexity of the model and introducing a greater number of speaker samples would lead to more accurate embeddings of speaker features, and hence a more accurate speaker verification system. Additionally, it is apparent that both models (with and without transfer learning) underfit the training dataset. Based on the validation loss alone, the results also motivate a discussion of negative transfer, in which transfer learning decreases the performance of a new model instead of increasing it. However, as mentioned in the qualitative observations of the accuracy of speaker feature embeddings

<ins>Cosine Threshold Determination</ins> 

text here

<ins>Size of Training Samples</ins> 

 ([Liu, et al., 2022](https://arxiv.org/pdf/2202.01624.pdf))

<ins>Biases</ins> 

Suresh and Guttag's *[Framework for Understanding Sources of Harm](https://dl.acm.org/doi/fullHtml/10.1145/3465416.3483305)* through the ML cycle draws attention to several sources of bias-related harm that could affect the reliability of automated speech recognition models. Examining VoxCeleb's database revealed that training samples in the pretrained model are skewed towards males, with 61% male speakers, and that 29% of speakers had a US nationality ([Hutiri & Ding, 2022](https://arxiv.org/pdf/2201.09486)). This form of representation bias may result in poor generalisation for the female subset in our model. Additionally, the design of speaker features and labels used in this prediction problem is also susceptible to measurement bias, since it affects the speaker's representation in the dataset.

Another key form of bias arises from the mismatch between the application context and usage environment, and the problem space that is conceptualised during the development of our model. From the initial problem motivation, speaker verification systems are conceptually required to minimise false positives since they serve as an authentication tool. For this project, however, VoxCeleb evaluation sets are inadequate for evaluating speaker verification performance on similar-sounding voices (for instance, speakers that share similar accents due to their geographical locale, or skilled voice impressionists).

##### Possible Future Work

Text here


### Key Features to Discuss

- Network Architecture (ECAPA-TDNN architecture) (See [here](https://arxiv.org/pdf/2005.07143.pdf)) (Ahmet)
    - Channel- and context-dependent attention mechanism
    - Multi-layer Feature Aggregation (MFA)
    - AAMsoftmax loss
- Connectionist temporal classification loss (CTC loss)
- VAD
- Statistical pooling (Ahmet)
- Data Augmentation (Adding time/frequency dropouts, speed change, environmental corruption, noise addition) (Armaan)
- Dropout
- Normalisation
- Linear Learning Rate Decay and Adam Optimiser



### Conclusion ###
While speaker verification on its own cannot guarantee security, it will add strength and friction to our online identities and reduce the likelihood of incorrect authentication. The improvements we lay out in our model are the aggregation and propagation of features at different hierarchical levels, and multi-layer feature aggregation, both of which are intended to increase the accuracy of our speaker verification system. A future experiment of ours is to use a similar framework for speech recognition, and eventually to use neural networks to speak in the voice of another person.

### References

*Suresh, H. & Guttag, J., 2021. [A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle.](https://dl.acm.org/doi/fullHtml/10.1145/3465416.3483305) EAAMO '21: Equity and Access in Algorithms, Mechanisms, and Optimization.*

*Hutiri, W. T. & Ding, A. Y., 2022. Bias in Automated Speaker Recognition. FAccT. [arXiv:2201.09486](https://arxiv.org/pdf/2201.09486)*

*Liu, T., Das, R. K., Lee, K. A. & Li, H., 2022. MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances. ICASSP 2022, Volume 3. [arXiv:2202.01624](https://arxiv.org/pdf/2202.01624.pdf)*

*Sainburg, T., 2022. [NoiseReduce 2.0.1, Noise reduction in python using spectral gating.](https://pypi.org/project/noisereduce/) [Online] [Accessed 15 November 2022].*

*SciKit, 2022. 3.1. [Cross-validation: evaluating estimator performance.](https://scikit-learn.org/stable/modules/cross_validation.html) [Online] [Accessed 11 November 2022].*


##### Datasets

*Nagrani, A., Chung, J. S. & Zisserman, A., 2017. VoxCeleb, A Large-Scale Speaker Identification Dataset. Interspeech. [arXiv:1706.08612](https://arxiv.org/pdf/1706.08612.pdf)*

*Nagrani, A., Chung, J. S. & Zisserman, A., 2018. VoxCeleb2: Deep Speaker Recognition. Interspeech. [arXiv:1806.05622](https://arxiv.org/pdf/1806.05622.pdf)*

##### Other

*Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., Subakan, C., Dawalatabad, N., Heba, A., Zhong, J., Chou, J.-C., Yeh, S.-L., Fu, S.-W., Liao, C.-F., Rastorgueva, E., Grondin, F., Aris, W., Na, H., Gao, Y., De Mori, R. & Bengio, Y., 2021. SpeechBrain: A General-Purpose Speech Toolkit. [arXiv:2106.04624](https://arxiv.org/pdf/2106.04624.pdf)*