EDA and modeling on ≈150 audio samples of spoken numbers from 1 to 5, recorded by 10 people.
Accuracy on large (>10k) dataset: 94%.
Accuracy on the given small dataset: 93%.
The task was somewhat challenging because of the small dataset.
Augmentation techniques:
- increase/decrease pitch
- increase/decrease speed
- stretching
- frequency and time masking
- white noise injection
- time shifting
- overlaying 2 samples (a quiet one and a louder one)
- pre/post noise padding
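
A few of the waveform-level techniques above can be sketched with plain NumPy. This is an illustrative sketch only, not the project's actual code: the function names, the SNR parameter, the gain value, and the mask widths are all assumptions. Pitch, speed, and stretching changes are left out because they usually rely on a resampling library.

```python
import numpy as np

def add_white_noise(wave, snr_db, rng):
    """Inject Gaussian white noise at a target signal-to-noise ratio (dB)."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise

def time_shift(wave, shift):
    """Shift by `shift` samples (positive = later), filling the gap with zeros."""
    out = np.zeros_like(wave)
    if shift > 0:
        out[shift:] = wave[:-shift]
    elif shift < 0:
        out[:shift] = wave[-shift:]
    else:
        out[:] = wave
    return out

def overlay(loud, quiet, quiet_gain=0.2):
    """Mix an attenuated sample under a louder one, cut to the shorter length."""
    n = min(len(loud), len(quiet))
    return loud[:n] + quiet_gain * quiet[:n]

def pad_with_noise(wave, pre, post, noise_std, rng):
    """Prepend `pre` and append `post` samples of low-level Gaussian noise."""
    pre_noise = rng.normal(0.0, noise_std, size=pre)
    post_noise = rng.normal(0.0, noise_std, size=post)
    return np.concatenate([pre_noise, wave, post_noise])

def mask_spectrogram(spec, freq_width, time_width, rng):
    """Zero one random frequency band and one random time band of a 2-D spectrogram."""
    spec = spec.copy()
    n_freq, n_time = spec.shape
    f0 = rng.integers(0, n_freq - freq_width + 1)
    t0 = rng.integers(0, n_time - time_width + 1)
    spec[f0:f0 + freq_width, :] = 0.0
    spec[:, t0:t0 + time_width] = 0.0
    return spec
```

Each function takes a float waveform (or a frequency-by-time spectrogram for the masking step) and returns a new array, so several augmentations can be chained on one sample. Passing an explicit `np.random.default_rng(seed)` keeps the augmented set reproducible.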
The data is split by speaker, 5-5, so each speaker's recordings fall entirely on one side of the split.
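
A speaker-based split can be sketched as below. It assumes that "5-5" means five speakers for training and five held out for evaluation, and that each sample is stored as a hypothetical `(speaker_id, item)` pair; neither detail is confirmed by the project itself.

```python
import random

def split_by_speaker(samples, n_train_speakers=5, seed=0):
    """Split (speaker_id, item) pairs so that no speaker appears in both sets."""
    speakers = sorted({speaker for speaker, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    train_speakers = set(speakers[:n_train_speakers])
    train = [s for s in samples if s[0] in train_speakers]
    test = [s for s in samples if s[0] not in train_speakers]
    return train, test
```

Splitting on speaker identity rather than on individual recordings means the evaluation measures how well the model handles voices it has never heard, which a random per-sample split would not show.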
Training graph: