## Automated Speaker Verification - A Study in Deep Neural Networks and Transfer Learning 

*"Identity theft is not a joke" - Dwight Schrute, The Office*

Automated speaker verification (ASV) is now widespread in smart devices and in call-centre services. It is a biometric method that authenticates a claimed identity by analysing a sample of the claimant's speech.
Our team sought to extend our understanding of neural style transfer by exploring how convolutional neural networks could be used to reproduce speech in another person's voice.

#### Requirements

To create and activate a virtual environment, run

```shell
python3 -m venv env
source env/bin/activate
```

To install the required Python packages, run

```shell
pip install -r requirements.txt
```

#### Resources

https://arxiv.org/pdf/2005.07143.pdf

https://www.deepmind.com/blog/wavenet-a-generative-model-for-raw-audio

https://becominghuman.ai/convoice-real-time-zero-shot-voice-style-transfer-with-convolutional-network-4c7b7fff66c9

https://arxiv.org/pdf/1703.10135.pdf

https://arxiv.org/pdf/1905.05879v2.pdf

https://medium.com/@ageitgey/machine-learning-is-fun-part-6-how-to-do-speech-recognition-with-deep-learning-28293c162f7a



#### Transfer Learning

The DNN is trained to classify speakers using speech recorded from a large number of training speakers. (Talk about VoxCeleb dataset). To leverage the feature representations learned by the model pretrained on this large dataset, speech recorded from each enrollment speaker is passed as input to the trained DNN. Deep hidden-layer features are computed for each enrollment utterance and then averaged per speaker to produce a compact deep embedding for that speaker.
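The averaging step can be sketched as below. This is a minimal illustration, not the project's implementation: the function name `enroll_speaker` and the toy 3-dimensional vectors are hypothetical, and the L2 normalisation before and after averaging is an assumption that the text above does not specify. In practice the per-utterance embeddings would come from the trained DNN.

```python
import numpy as np

def enroll_speaker(utterance_embeddings):
    """Average per-utterance embeddings into one speaker embedding.

    `utterance_embeddings` is a sequence of equal-length vectors,
    one per enrollment utterance of the same speaker.
    """
    stacked = np.stack([np.asarray(e, dtype=np.float64) for e in utterance_embeddings])
    # Assumption: L2-normalise each utterance embedding so that no single
    # utterance dominates the mean because of its magnitude.
    norms = np.linalg.norm(stacked, axis=1, keepdims=True)
    stacked = stacked / np.clip(norms, 1e-12, None)
    # The averaging described in the text: mean over the utterance axis.
    mean = stacked.mean(axis=0)
    # Assumption: return a unit-length speaker embedding.
    return mean / max(np.linalg.norm(mean), 1e-12)

# Toy stand-ins for embeddings produced by the trained DNN.
speaker_embedding = enroll_speaker([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```

Normalising before averaging is one common choice; a plain mean of the raw embeddings would also match the description above.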

#### Verification through Inference

To verify the identity of an unknown speaker, a test utterance from that speaker is passed as input to the trained DNN. This produces a compact deep embedding for the unknown speaker, which is compared with the embedding of each enrollment speaker using cosine similarity. (Talk about Cosine Distance - include brief background?). The higher the cosine similarity (equivalently, the smaller the cosine distance), the more likely it is that the unknown speaker matches that enrolled speaker.
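As a rough sketch of this comparison: cosine similarity between two embeddings `a` and `b` is `a·b / (‖a‖ ‖b‖)`, and cosine distance is one minus that value. The function names, the toy 2-dimensional embeddings, and the acceptance threshold of `0.5` below are all hypothetical; a real threshold would need to be tuned on held-out trials.

```python
import numpy as np

def cosine_similarity(a, b):
    """Return a.b / (|a| |b|), a value in [-1, 1]."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)

def verify(test_embedding, enrolled, threshold=0.5):
    """Score a test embedding against every enrolled speaker.

    `enrolled` maps a speaker name to that speaker's embedding.
    Returns (best_match, best_score, accepted). The threshold is an
    illustrative assumption, not a recommended value.
    """
    scores = {name: cosine_similarity(test_embedding, emb)
              for name, emb in enrolled.items()}
    best = max(scores, key=scores.get)
    return best, scores[best], scores[best] >= threshold

# Toy embeddings standing in for the DNN's output.
enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
best, score, accepted = verify([0.9, 0.1], enrolled)
```

Here the test embedding points almost in the same direction as the `alice` embedding, so `alice` is the best match and the score clears the assumed threshold.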

#### Key Features to Discuss

- Network Architecture (ECAPA-TDNN architecture) (See [here](https://arxiv.org/pdf/2005.07143.pdf))
    - Channel- and context-dependent attention mechanism
    - Multi-layer Feature Aggregation (MFA)
    - AAM-softmax loss (additive angular margin softmax)
- Connectionist temporal classification loss (CTC loss)
- Voice activity detection (VAD)
- Statistics pooling
- Data Augmentation (Adding time/frequency dropouts, speed change, environmental corruption, noise addition)
- Dropout
- Normalisation
- Linear Learning Rate Decay and Adam Optimiser
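As a starting point for the statistics pooling item above, here is a minimal sketch of the plain, unweighted form: the mean and standard deviation of frame-level features over time are concatenated into one fixed-length vector. The function name and array shapes are hypothetical, and this does not include the attention weighting that the ECAPA-TDNN paper applies on top.

```python
import numpy as np

def statistics_pooling(frames):
    """Pool a (num_frames, num_channels) array into a fixed-length vector.

    Returns the per-channel mean followed by the per-channel standard
    deviation, so the output length is 2 * num_channels regardless of
    how many frames the utterance has.
    """
    frames = np.asarray(frames, dtype=np.float64)
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])

# Two utterances of different lengths map to vectors of the same size.
short_utt = statistics_pooling(np.ones((50, 4)))
long_utt = statistics_pooling(np.ones((300, 4)))
```

This is what lets utterances of different durations be compared through embeddings of the same dimension.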

