GitHub - vinsis/speech-commands-recognition: Single word speech recognition using PyTorch

About

This is an attempt to learn features from raw audio directly without any feature extraction. I also tried not to use MaxPooling and relied solely on Global Average Pooling followed by a fully connected layer. After a few quick attempts I was able to achieve ~88% accuracy. Top scorers on Kaggle scored ~90% accuracy (although I suspect the test set used was different). The model I used is dead simple:

ModelCNN(
  (main): Sequential(
    (0): Conv1d(1, 32, kernel_size=(90,), stride=(6,))
    (1): ReLU(inplace)
    (2): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True)
    (3): Conv1d(32, 64, kernel_size=(31,), stride=(6,))
    (4): ReLU(inplace)
    (5): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True)
    (6): Conv1d(64, 128, kernel_size=(11,), stride=(3,))
    (7): ReLU(inplace)
    (8): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True)
    (9): Conv1d(128, 256, kernel_size=(7,), stride=(2,))
    (10): ReLU(inplace)
    (11): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True)
    (12): Conv1d(256, 512, kernel_size=(5,), stride=(2,))
    (13): ReLU(inplace)
    (14): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True)
    (15): AvgPool1d(kernel_size=(47,), stride=(47,), padding=(0,), ceil_mode=False, count_include_pad=True)
  )
  (fc): Linear(in_features=512, out_features=30, bias=True)
)
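As a sanity check on the layer sizes, the sketch below walks the temporal length through the five convolutions using the standard unpadded formula `floor((L - k) / s) + 1`. The input length of 22050 samples is an assumption (one second of audio at 22.05 kHz). The README does not state it, but it is a length for which the last feature map has exactly 47 frames, matching the `AvgPool1d` kernel so that the pooling is global.

```python
# (kernel_size, stride) for each Conv1d in the printed model; padding is 0, dilation is 1.
CONV_LAYERS = [(90, 6), (31, 6), (11, 3), (7, 2), (5, 2)]

def conv1d_out_len(length, kernel_size, stride):
    """Output length of an unpadded, undilated 1-D convolution."""
    return (length - kernel_size) // stride + 1

def trace_lengths(num_samples, layers=CONV_LAYERS):
    """Return the temporal length after each convolution, in order."""
    lengths = []
    for kernel_size, stride in layers:
        num_samples = conv1d_out_len(num_samples, kernel_size, stride)
        lengths.append(num_samples)
    return lengths

# ASSUMPTION: 22050 samples per clip (not stated in the README).
lengths = trace_lengths(22050)
print(lengths)  # [3661, 606, 199, 97, 47]
```

With 16000 samples the same arithmetic ends at 33 frames, which is shorter than the pooling kernel of 47, so the pooling size would need to change for that input length.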

Why Global Average Pooling?

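Global average pooling followed by a fully connected layer can be used to detect where in the clip the word is spoken. Averaging over time is a linear operation, so each class score decomposes into per-frame contributions, and the frames that contribute most indicate where the word occurs.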
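One way to make this concrete: because pooling and the linear layer commute, the class logit equals the bias plus the time-average of per-frame class scores, and the frame with the highest score is a rough guess at where the word sits. The sketch below illustrates this on toy numbers in plain Python. `features`, `weights`, and `bias` are made-up stand-ins for the last feature map and one row of the `fc` layer, not values from the trained model.

```python
def per_frame_scores(features, weights):
    """Score each time frame for one class.

    features: list of channels, each a list of activations over time.
    weights:  one fc weight per channel for the class of interest.
    """
    num_frames = len(features[0])
    return [
        sum(w * channel[t] for w, channel in zip(weights, features))
        for t in range(num_frames)
    ]

def logit_via_pooling(features, weights, bias):
    """Global average pooling over time, then the linear layer."""
    pooled = [sum(channel) / len(channel) for channel in features]
    return sum(w * p for w, p in zip(weights, pooled)) + bias

# Toy feature map: 2 channels x 4 time frames (hypothetical values).
features = [[0.0, 1.0, 3.0, 0.0],
            [1.0, 0.0, 2.0, 1.0]]
weights = [0.5, -0.25]
bias = 0.1

scores = per_frame_scores(features, weights)
logit = logit_via_pooling(features, weights, bias)

# The two routes agree: pooling then fc, or per-frame scores then averaging.
mean_score_plus_bias = sum(scores) / len(scores) + bias

# Index of the frame that contributes most to this class.
peak_frame = max(range(len(scores)), key=scores.__getitem__)
```

In the real model each of the 47 output frames corresponds to a window of input samples, so mapping `peak_frame` back to a time in the clip would also need the receptive field of the convolution stack.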

Timeline of network design

For my first approach, I did not want to:

Manually extract features (Fourier transform, mel spectrogram, etc.) from the audio clips
Use max-pooling

I decided to use a fully convolutional neural network for word identification.

1. First working model

My first model refused to converge. The loss would just linger where it started.

2. Working model that converged

I added batch normalization, but the error still refused to go down. Increasing the learning rate to 0.01 did the trick. Here is what the model looked like at this point.

3. Making the model easy to interpret

To make the model easier to interpret and reason about, I decided to apply global average pooling followed by a dense layer. In this setup, the pooled channels act as features that the final dense layer weights to produce the output. (I imagined these features as somehow representing the phonemes that make up a word.)

Model at this point

During training, the error would plateau at around 0.4. After 5 epochs, I was able to get an accuracy of ~80% on the validation set. Not great but promising!

4. Increasing the number of parameters

Since the error would plateau during training (in spite of tweaking the learning rate and applying a learning rate scheduler), I decided to increase the number of parameters. I increased the number of channels in each convolution operation.

Model with increased number of channels

This led to faster convergence (~85% accuracy after 3 epochs). However, the error would still plateau at around 0.35 - 0.4.

It was easy to get to this point, but making further improvements got challenging. I was running my model on my Mac without a GPU and training was getting slower and slower.

5. Increasing the number of parameters even more

To make sure my model wasn't constrained by the number of parameters, I decided to go ballistic and significantly increased the number of parameters. I was able to achieve accuracies of 88.5% and 87.5% on validation and test sets within six epochs.
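For a sense of scale, the parameter count of the model printed in the About section can be worked out by hand. The sketch below assumes that the printed model is the final one from this step, which the README does not confirm explicitly. It counts trainable parameters only: weights and biases for `Conv1d` and `Linear`, and the affine scale and shift for `BatchNorm1d`.

```python
# (in_channels, out_channels, kernel_size) for each Conv1d in the printed model.
CONVS = [(1, 32, 90), (32, 64, 31), (64, 128, 11), (128, 256, 7), (256, 512, 5)]
FC = (512, 30)  # (in_features, out_features)

def conv1d_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output channel."""
    return in_channels * out_channels * kernel_size + out_channels

def batchnorm_params(num_features):
    """Affine scale and shift; running statistics are buffers, not parameters."""
    return 2 * num_features

def linear_params(in_features, out_features):
    """Weight matrix plus one bias per output feature."""
    return in_features * out_features + out_features

def count_params(convs=CONVS, fc=FC):
    """Total trainable parameters: each conv with its batch norm, then the fc layer."""
    total = 0
    for in_ch, out_ch, k in convs:
        total += conv1d_params(in_ch, out_ch, k) + batchnorm_params(out_ch)
    return total + linear_params(*fc)

total_params = count_params()
print(total_params)  # 1059582
```

That is roughly 1.06 million parameters, with well over half of them in the last convolution (655,872).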

I trained the model further with learning rate 1e-6 but it didn't improve. I guess I am done experimenting.

Improving accuracy further

Here are some things one can try. Note that these steps assume I am still refraining from MaxPooling and from pre-processing the audio (Fourier transform, etc.):

Changing the model parameters (kernel size, stride, number of dense layers etc)
Tweaking hyperparameters like momentum, L2 weight decay
Adding noise to the training data to make the model more robust
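For the last item, a common recipe is to mix noise into each clip at a chosen signal-to-noise ratio. The sketch below is a minimal plain-Python illustration using white Gaussian noise. The function name `add_noise`, the 10 dB value, and the synthetic test tone are all assumptions for illustration and are not part of this repository.

```python
import math
import random

def mean_power(samples):
    """Average squared amplitude of a signal."""
    return sum(x * x for x in samples) / len(samples)

def add_noise(clip, snr_db, rng=random):
    """Return clip plus white Gaussian noise scaled to the requested SNR in dB."""
    noise = [rng.gauss(0.0, 1.0) for _ in clip]
    # Scale the noise so that signal_power / noise_power == 10 ** (snr_db / 10).
    target_noise_power = mean_power(clip) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [x + scale * n for x, n in zip(clip, noise)]

# Hypothetical example: a 440 Hz tone, one second at 16 kHz, mixed at 10 dB SNR.
rng = random.Random(0)
clip = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_noise(clip, snr_db=10.0, rng=rng)
```

Drawing a fresh noise sample (and possibly a random SNR) each time a clip is loaded means the model rarely sees exactly the same input twice.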

Speech Commands Recognition

Single word speech recognition using PyTorch

Data source

Warden P. Speech Commands: A public dataset for single-word speech recognition, 2017. Available from http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz

Note: Since the unzipped data is > 2GB in size, I have only uploaded a small sample of the entire dataset. In each folder, there are only ten files (instead of 1000+).

Speech Commands Data Set v0.01

This is a set of one-second .wav audio files, each containing a single spoken English word. These words are from a small set of commands, and are spoken by a variety of different speakers. The audio files are organized into folders based on the word they contain, and this data set is designed to help train simple machine learning models.
