This example demonstrates how to fine-tune a model pre-trained on AudioSet on a smaller sound classification dataset, namely ESC-50.
The tricks and tips are as follows:
- Use the pre-trained Cnn14 model from audioset_tagging_cnn [1]. The model is pre-trained on AudioSet, the largest weakly labelled sound event dataset (only class labels are given, without the exact time location of each event).
- No weight decay is used.
- In training, each spectrogram has 384 frames, obtained by random cropping.
- In test/eval, all 501 frames are used, so the result is deterministic.
- The same learning-rate decay scheduler as in the baseline is used.
- Large dropout is used.
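The cropping behaviour in the list above can be sketched as follows. This is a minimal illustration, not the code used in this repository: the function name `select_frames`, the time-major layout, and the 64-bin toy frames are assumptions made for the example. Only the frame counts (384 and 501) come from the list.

```python
import random

TRAIN_FRAMES = 384  # crop length used in training (from the list above)
EVAL_FRAMES = 501   # full clip length used in test/eval (from the list above)


def select_frames(spec, training, crop_len=TRAIN_FRAMES, rng=random):
    """Return a random crop_len-frame window in training, the full clip otherwise.

    `spec` is any time-major sequence (frames along the first axis), e.g. a
    list of frames or a NumPy array of shape (n_frames, n_bins).
    This helper is hypothetical and only illustrates the idea.
    """
    n_frames = len(spec)
    if not training or n_frames <= crop_len:
        return spec  # eval: no randomness, every frame is used
    start = rng.randint(0, n_frames - crop_len)  # randint is inclusive on both ends
    return spec[start:start + crop_len]


# Toy "spectrogram": 501 frames, each a 64-bin placeholder vector (assumed size).
spec = [[float(t)] * 64 for t in range(EVAL_FRAMES)]
train_crop = select_frames(spec, training=True)
eval_clip = select_frames(spec, training=False)
print(len(train_crop), len(eval_clip))
```

Because the eval path never draws a random offset, repeated evaluation of the same clip sees identical input, which is why the test result is deterministic.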
First, download the ESC-50 dataset by following this link. Then run the following command to test on ESC-50:
python test.py -a <audio_folder> -m <meta_file> -d gpu
Without any tricks, this example achieved an average accuracy of 0.937 across 5 folds, ranking No. 3 on the leaderboard (as of 2021-08).
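For readers unfamiliar with the 5-fold protocol, the sketch below shows one way to build the splits and average the per-fold accuracy. It assumes the meta file follows the `esc50.csv` layout with `filename` and `fold` columns (folds numbered 1 to 5). The helper names and the tiny inline CSV with made-up filenames are hypothetical, and this is not the logic of `test.py`.

```python
import csv
import io

# Tiny stand-in for the ESC-50 meta file (hypothetical filenames).
# The column names are assumed to match the esc50.csv layout.
META_CSV = """filename,fold,target,category
a.wav,1,0,dog
b.wav,2,0,dog
c.wav,3,1,rooster
d.wav,4,1,rooster
e.wav,5,2,pig
"""


def split_by_fold(meta_file, held_out_fold):
    """Split filenames into (train, test), holding out one predefined fold."""
    train, test = [], []
    for row in csv.DictReader(meta_file):
        target = test if int(row["fold"]) == held_out_fold else train
        target.append(row["filename"])
    return train, test


def mean_accuracy(fold_accuracies):
    """Average the per-fold accuracies into a single cross-validation score."""
    return sum(fold_accuracies) / len(fold_accuracies)


# Each fold is held out once; the model is trained on the other four.
for fold in range(1, 6):
    train_files, test_files = split_by_fold(io.StringIO(META_CSV), fold)
    print(fold, len(train_files), test_files)
```

The reported score would then be `mean_accuracy` applied to the five accuracies obtained by evaluating on each held-out fold.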
TBD
# install paddleaudio
git clone https://github.com/PaddlePaddle/models.git
cd models/PaddleAudio
pip install -e .
git clone https://github.com/ranchlai/sound_classification.git
cd sound_classification
pip install -r requirements.txt
[1] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, Mark D. Plumbley. "PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition." arXiv preprint arXiv:1912.10211 (2019).
[2] Urban Sound Tagging using Multi-Channel Audio Feature with Convolutional Neural Networks