## Automated Speaker Verification - A Study in Deep Neural Networks and Transfer Learning 

*"Identity theft is not a joke" - Dwight Schrute, The Office*

As the world shifts rapidly towards a digital landscape, protecting our online identity becomes imperative. Engineers have tackled this challenge through facial and fingerprint recognition, but the ability to authenticate a claimed identity by analysing a spoken sample of the claimant's voice could transform this space. Further, given the major advances in virtual reality, the ability to verify a person's voice, and by extension recognise their speech, is likely to become mainstream in virtual environments. Our team thus sought to extend our understanding of neural style transfer by exploring how an x-vector model could be used for speaker verification.

A future experiment of ours is to use a similar framework for speech recognition, and eventually to use neural networks to speak in the voice of another person.

#### Requirements

To create and activate a virtual environment, run

```shell
python3 -m venv env
source env/bin/activate
```

To install the required Python packages, run

```shell
pip install -r requirements.txt
```

#### Resources

https://arxiv.org/pdf/2005.07143.pdf

https://www.deepmind.com/blog/wavenet-a-generative-model-for-raw-audio

https://becominghuman.ai/convoice-real-time-zero-shot-voice-style-transfer-with-convolutional-network-4c7b7fff66c9

https://arxiv.org/pdf/1703.10135.pdf

https://arxiv.org/pdf/1905.05879v2.pdf

https://medium.com/@ageitgey/machine-learning-is-fun-part-6-how-to-do-speech-recognition-with-deep-learning-28293c162f7a



#### Transfer Learning

The DNN is trained to classify speakers using a training set of speech recorded from a large number of training speakers (here, the VoxCeleb datasets described under Data Preparation). To leverage the feature representations learned by the pretrained model on this large dataset, speech recorded from each enrollment speaker is passed as input to the trained DNN. This yields deep hidden features for each of that speaker's utterances, which are then averaged to generate a compact deep embedding associated with that speaker.
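The averaging step can be sketched as follows. This is a minimal illustration in plain Python, not the project's actual code: it assumes the utterance-level embeddings have already been extracted by the trained DNN, and the names `average_embeddings` and `enroll` and the toy three-dimensional vectors are hypothetical.

```python
from typing import Dict, List, Sequence

Embedding = List[float]


def average_embeddings(embeddings: Sequence[Sequence[float]]) -> Embedding:
    """Average utterance-level embeddings into one speaker-level embedding."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    count = len(embeddings)
    # Element-wise mean across all utterances of the same speaker.
    return [sum(e[i] for e in embeddings) / count for i in range(dim)]


def enroll(utterances: Dict[str, List[Embedding]]) -> Dict[str, Embedding]:
    """Build one compact embedding per enrollment speaker."""
    return {spk: average_embeddings(embs) for spk, embs in utterances.items()}


# Toy data standing in for DNN outputs (hypothetical values).
toy_utterances = {
    "alice": [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]],
    "bob": [[0.0, 4.0, 4.0]],
}
enrolled = enroll(toy_utterances)
print(enrolled["alice"])
```

In a real pipeline the embeddings would typically be much higher-dimensional and are often length-normalised before or after averaging; that choice is left open here.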

#### Zero-Shot Learning



#### Data Preparation

Our pretrained model is trained on audio files from the VoxCeleb1 and VoxCeleb2 datasets, which together contain speech samples from over 7000 speakers spanning a wide range of ethnicities, accents, professions and ages ([source](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/)). It achieves an accuracy of approximately 98-99%.

#### Verification through Inference

To verify the identity of an unknown speaker, a test utterance from that speaker is passed as input to the trained DNN. A compact deep embedding associated with the unknown speaker is generated and compared with the compact deep embedding of each enrollment speaker by calculating cosine similarity. Cosine similarity is the dot product of two embeddings divided by the product of their norms; it ranges from -1 to 1, and cosine distance is one minus this value. The closer two embeddings are, the more likely it is that the unknown speaker is the corresponding enrolled speaker.
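The comparison described above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names, the toy two-dimensional embeddings and the acceptance threshold of `0.5` are assumptions, and a real threshold would need to be tuned on held-out verification trials.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their norms."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def verify(
    test_embedding: Sequence[float],
    enrolled: Dict[str, Sequence[float]],
    threshold: float = 0.5,  # hypothetical value, not taken from the project
) -> Tuple[str, float, bool]:
    """Return the closest enrolled speaker, its score, and the accept decision."""
    if not enrolled:
        raise ValueError("no enrolled speakers")
    scores = {spk: cosine_similarity(test_embedding, emb)
              for spk, emb in enrolled.items()}
    best_speaker = max(scores, key=scores.get)
    best_score = scores[best_speaker]
    return best_speaker, best_score, best_score >= threshold


# Toy embeddings standing in for DNN outputs (hypothetical values).
enrolled_speakers = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
speaker, score, accepted = verify([0.9, 0.1], enrolled_speakers)
print(speaker, round(score, 3), accepted)
```

Raising the threshold makes false acceptances rarer at the cost of rejecting more genuine speakers, so the value is a trade-off rather than a fixed constant.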

#### Key Features to Discuss

- Network Architecture (ECAPA-TDNN architecture) (See [here](https://arxiv.org/pdf/2005.07143.pdf)) (Ahmet)
    - Channel- and context-dependent attention mechanism
    - Multi-layer Feature Aggregation (MFA)
    - AAM-softmax loss
- Connectionist temporal classification loss (CTC loss)
- Voice activity detection (VAD)
- Statistical pooling (Ahmet)
- Data Augmentation (Adding time/frequency dropouts, speed change, environmental corruption, noise addition) (Armaan)
- Dropout
- Normalisation
- Linear Learning Rate Decay and Adam Optimiser
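
As a reference point for the statistical pooling item above, here is a minimal sketch of plain (non-attentive) statistics pooling: variable-length frame-level features are collapsed into a fixed-length vector by concatenating the per-dimension mean and standard deviation over time. This is a simplification; ECAPA-TDNN uses an attentive variant that weights frames before computing these statistics. The function name `statistics_pooling`, the use of the population standard deviation, and the toy frames are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def statistics_pooling(frames: Sequence[Sequence[float]]) -> List[float]:
    """Pool T frames of D features into a 2*D vector of means then stds."""
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    if any(len(f) != dim for f in frames):
        raise ValueError("all frames must have the same dimension")
    # Per-dimension mean over the time axis.
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    # Per-dimension population standard deviation (divides by T, an assumption).
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds


# Two toy frames with two features each (hypothetical values).
pooled = statistics_pooling([[1.0, 2.0], [3.0, 2.0]])
print(pooled)
```

The output length depends only on the feature dimension, not on the number of frames, which is what lets utterances of different durations share one embedding size.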



#### Conclusion

While speaker verification on its own cannot guarantee security, it strengthens our online identities, adds friction for anyone attempting to impersonate us, and reduces the likelihood of incorrect authentication.

#### Citation

##### Datasets

@InProceedings{Nagrani17,
  author       = "Nagrani, A. and Chung, J.~S. and Zisserman, A.",
  title        = "VoxCeleb: a large-scale speaker identification dataset",
  booktitle    = "INTERSPEECH",
  year         = "2017",
}

@InProceedings{Chung18,
  author       = "Chung, J.~S. and Nagrani, A. and Zisserman, A.",
  title        = "VoxCeleb2: Deep Speaker Recognition",
  booktitle    = "INTERSPEECH",
  year         = "2018",
}

##### Other

@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}