Dumping ground for miscellaneous ML experiments, with a focus on few-shot learning (FSL).
- Python 3.8
- PyTorch Lightning
- PyTorch
- conda
Using conda to manage dependencies. A detailed list of dependencies is in `environment.yml` and `requirements.txt`.
EDUSKUNTA.md contains a guide for compiling a speaker recognition dataset of Finnish speech from The Plenary Sessions of the Parliament of Finland dataset.
snn/librispeech/
: Multiple speaker recognition networks for one-shot learning on the LibriSpeech dataset.
- Thin-ResNet34, fast-ResNet34, SAP and ASP implementations adapted from https://github.com/clovaai/voxceleb_trainer.
- NetVLAD and GhostVLAD pieced together from https://github.com/lyakaap/NetVLAD-pytorch/, https://github.com/Nanne/pytorch-NetVlad/, https://github.com/sitzikbs/netVLAD/.
- Using the learning rate finder from PyTorch Lightning.
- AdamW optimizer[2], with 1cycle learning rate policy[3, 4]; see the sketch after this list.
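Not from this repo's code, but a minimal Lightning sketch of how the AdamW + 1cycle combination above is typically wired up (the module and its network are placeholders; `estimated_stepping_batches` requires a recent PyTorch Lightning):

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self, learning_rate: float = 1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.net = torch.nn.Linear(40, 2)  # placeholder network

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self.net(x), y)

    def configure_optimizers(self):
        # AdamW: Adam with decoupled weight decay [2].
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.hparams.learning_rate)
        # 1cycle policy [3, 4], stepped once per batch rather than per epoch.
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer,
            max_lr=self.hparams.learning_rate,
            total_steps=self.trainer.estimated_stepping_batches,
        )
        return {
            "optimizer": optimizer,
            "lr_scheduler": {"scheduler": scheduler, "interval": "step"},
        }
```

The value picked by Lightning's learning rate finder can then be passed in as `learning_rate`.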
Options:

--model snn | snn-capsnet | snn-angularproto | snn-softmaxproto
: The model to train; each is described below.
--signal_transform melspectrogram | spectrogram | mfcc
: The signal representation to feed to the ResNet, defaults to `melspectrogram` (see the sketch after this list).
--n_mels n
: Number of Mels to use for the Mel spectrogram or MFCC, defaults to 40.
--n_fft n
: The value of `n_fft` to use when constructing the spectrogram.
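The exact transform code isn't reproduced here, but the three choices map naturally onto torchaudio; a sketch with hypothetical parameter values (LibriSpeech audio is 16 kHz FLAC):

```python
import torchaudio

# Hypothetical settings mirroring the defaults above (n_fft chosen arbitrarily).
melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=512, n_mels=40)
spec = torchaudio.transforms.Spectrogram(n_fft=512)
mfcc = torchaudio.transforms.MFCC(
    sample_rate=16000, n_mfcc=40, melkwargs={"n_fft": 512, "n_mels": 40}
)

waveform, sample_rate = torchaudio.load("sample.flac")  # (channels, samples)
features = melspec(waveform)  # (channels, n_mels, frames)
```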
--resnet_type thin | fast
: Choose either thin-ResNet34 or fast-ResNet34, defaults to `thin`.
--resnet_aggregation_type SAP | ASP | NetVLAD | GhostVLAD
: Choose the type of aggregation (or pooling) to use for the ResNet output, defaults to `SAP` (a NetVLAD sketch follows this list).
--resnet_n_out n
: Adjust the size of the ResNet output tensor, `512` by default.
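For orientation, NetVLAD soft-assigns each frame-level descriptor to a set of learned clusters via a 1×1 convolution, accumulates residuals against each cluster centroid, then intra-normalizes and flattens. A minimal sketch, not the repo's implementation (dimensions hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Minimal NetVLAD pooling over frame-level descriptors."""

    def __init__(self, num_clusters: int = 8, dim: int = 512):
        super().__init__()
        # 1x1 conv gives per-frame soft-assignment logits to each cluster.
        self.assign = nn.Conv1d(dim, num_clusters, kernel_size=1)
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, frames) descriptors from the ResNet trunk.
        a = F.softmax(self.assign(x), dim=1)                       # (B, K, T)
        resid = x.unsqueeze(1) - self.centroids[None, :, :, None]  # (B, K, D, T)
        vlad = (a.unsqueeze(2) * resid).sum(dim=-1)                # (B, K, D)
        vlad = F.normalize(vlad, p=2, dim=2)  # intra-normalization per cluster
        # GhostVLAD additionally trains "ghost" clusters that soak up noisy
        # frames and are dropped at this point, before flattening.
        return F.normalize(vlad.flatten(1), p=2, dim=1)            # (B, K*D)
```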
--augment
: Enable augmentation using audiomentations.
--torch_augment
: Enable augmentation using torch-audiomentations.
--specaugment
: Enable spectrogram frequency and time masking as per SpecAugment[6] (sketched below).
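torchaudio ships both maskings; a sketch with hypothetical mask widths (not the repo's settings):

```python
import torch
import torchaudio

# SpecAugment-style masking, applied to the spectrogram during training only.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=8)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=20)

spectrogram = torch.randn(1, 40, 300)  # (channel, n_mels, frames), dummy input
augmented = time_mask(freq_mask(spectrogram))
```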
--num_speakers n
: Number of speakers to include in the training set, defaults to 0, which selects all available speakers.
--num_train n
: Number of random samples to take from the training set, defaults to the training set size but can be set higher.
--train_batch_size n
: The batch size to use specifically for training.
snn
: Simple end-to-end Siamese neural network using binary cross-entropy loss and a basic learned distance measure (a minimal sketch follows).
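Roughly the classic setup[1]: a shared encoder embeds both inputs, the element-wise absolute difference of the embeddings passes through a learned linear layer, and binary cross-entropy is applied to the same/different-speaker label. A sketch (encoder and dimensions are placeholders):

```python
import torch
import torch.nn as nn

class SiameseHead(nn.Module):
    """Scores a pair of inputs; the distance measure is learned rather
    than fixed (e.g. plain cosine or Euclidean)."""

    def __init__(self, encoder: nn.Module, emb_dim: int):
        super().__init__()
        self.encoder = encoder              # shared between both inputs
        self.score = nn.Linear(emb_dim, 1)  # learned weighting of |e1 - e2|

    def forward(self, x1, x2):
        e1, e2 = self.encoder(x1), self.encoder(x2)
        return self.score(torch.abs(e1 - e2)).squeeze(-1)  # similarity logit

# Trained with BCE on labels: 1.0 = same speaker, 0.0 = different.
criterion = nn.BCEWithLogitsLoss()
```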
snn-angularproto
: Neural network using metric learning with the angular prototypical loss function[7] (see the loss sketch after the options).

Options:

--num_ways k
: Number of speakers (or classes) to include in each training step.
--num_shots n
: Number of samples to use per speaker.
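A sketch of the loss under the usual convention from [7] (one query per class, centroid from the remaining shots); the initial values of the learned scale and offset are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularProtoLoss(nn.Module):
    """Angular prototypical loss [7]: scaled cosine similarity between
    each query and the per-class centroids, then cross-entropy."""

    def __init__(self, init_w: float = 10.0, init_b: float = -5.0):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(init_w))  # learned scale
        self.b = nn.Parameter(torch.tensor(init_b))  # learned offset

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (num_ways, num_shots, dim); the last shot of each class is
        # the query, the remaining shots form the class centroid.
        query = emb[:, -1]                  # (K, D)
        centroid = emb[:, :-1].mean(dim=1)  # (K, D)
        sim = F.cosine_similarity(query.unsqueeze(1), centroid.unsqueeze(0), dim=2)
        logits = self.w.clamp(min=1e-6) * sim + self.b  # keep the scale positive
        # Each query's positive class is its own centroid.
        labels = torch.arange(emb.size(0), device=emb.device)
        return F.cross_entropy(logits, labels)
```

The softmax prototypical variant below swaps the scaled cosine for the negative squared Euclidean distance to the centroids.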
snn-softmaxproto
: Like snn-angularproto, but using the softmax prototypical loss[8].
snn-capsnet
: Experiments based on ideas from the paper by Hajavi et al.[5].
- CapsNet implementation copied from https://github.com/adambielski/CapsNet-pytorch.
snn/omniglot/
: Convolutional SNN for one-shot learning on the Omniglot dataset[1].
- Heavily based on reimplementations of the paper at https://github.com/kevinzakka/one-shot-siamese and https://github.com/fangpin/siamese-pytorch.
- Using the learning rate finder from PyTorch Lightning.
- AdamW optimizer[2], with 1cycle learning rate policy[3, 4].
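For context, evaluation in this setting is typically an N-way one-shot trial: the query is scored against one support image per candidate class and the best match wins. A hypothetical helper (assumes a pairwise model like the Siamese scorer above):

```python
import torch

@torch.no_grad()
def one_shot_predict(model, query: torch.Tensor, support: torch.Tensor) -> int:
    """N-way one-shot trial with a pairwise Siamese scorer.

    query:   (1, C, H, W) image tensor
    support: (N, C, H, W), one image per candidate class
    """
    # Pair the query with every support image and pick the highest score.
    scores = model(query.expand(support.size(0), -1, -1, -1), support)
    return scores.argmax().item()  # index of the predicted class
```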
See the full set of options for each model with:

python -m <model>.<dataset>.train --help
Example: train the model snn/omniglot/ using 1 GPU:

python -O -m snn.omniglot.train --gpus 1 --num_workers 4 --batch_size 128 --max_epochs 50
1. Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural networks for one-shot image recognition." In ICML Deep Learning Workshop, vol. 2. 2015.
2. Loshchilov, Ilya, and Frank Hutter. "Decoupled weight decay regularization." arXiv preprint arXiv:1711.05101 (2017). https://arxiv.org/abs/1711.05101.
3. Smith, Leslie N., and Nicholay Topin. "Super-convergence: Very fast training of neural networks using large learning rates." In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, vol. 11006. International Society for Optics and Photonics, 2019. https://arxiv.org/abs/1708.07120.
4. https://sgugger.github.io/the-1cycle-policy.html
5. Hajavi, Amirhossein, and Ali Etemad. "Siamese capsule network for end-to-end speaker recognition in the wild." arXiv preprint arXiv:2009.13480 (2020). https://arxiv.org/abs/2009.13480.
6. Park, Daniel S., Yu Zhang, Chung-Cheng Chiu, Youzheng Chen, Bo Li, William Chan, Quoc V. Le, and Yonghui Wu. "SpecAugment on large scale datasets." In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6879-6883. IEEE, 2020. https://arxiv.org/abs/1904.08779.
7. Chung, Joon Son, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, and Icksang Han. "In defence of metric learning for speaker recognition." Interspeech, 2020. https://arxiv.org/abs/2003.11982.
8. Heo, Hee Soo, Bong-Jin Lee, Jaesung Huh, and Joon Son Chung. "Clova baseline system for the VoxCeleb Speaker Recognition Challenge 2020." arXiv preprint arXiv:2009.14153 (2020). https://arxiv.org/abs/2009.14153.