## Preface
This notebooks aims to build a light-weight CNN.
It uses specgrams of resampled wav files(rate 8000) as inputs.
Due to Kaggle cloud hardware limitations, this script is a 'crippled' version of the original one.
In order to get LB 0.74, you need to set epoch to 5, set chop_audio(num=1000) and double all Conv layer parameters.
I haven't tuned the parameters for the CNN model here.

## File Structure
This script assumes data are stored in following strcuture:
speech
├── test            
│   └── audio #test wavfiles
├── train           
│   ├── audio #train wavfiles
└── model #store models
│
└── out #store sub.csv

## Possible Improvements
Since this is only a light-weight CNN, it's performance is limited.
Here are some ways to improve it's performance.
1. Use original wav files instead resampled ones.
2. Create more 'silence' wav files using chop_audio.
3. Build deeper CNN or use RNN.
4. Train for longer epochs

## After Words
It's still a long way to reach LB 0.88.
In fact, I doubt CNN would ever reach that high.
Feel free to share your ideas in the comment sections about using CNN to label wav files :)

## Appendix
Thanks __DavidS__ and __Alex Ozerin__ for their great notebooks!

In [13]:
import os
import numpy as np
from scipy.fftpack import fft
from scipy.io import wavfile
from scipy import signal
from glob import glob
import re
import pandas as pd
import gc
from scipy.io import wavfile

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from tqdm import tqdm_notebook as tqdm

The original sample rate is 16000, and we will resample it to 8000 to reduce data size.

In [2]:
L = 16000
legal_labels = 'yes no up down left right on off stop go silence unknown'.split()

#src folders
root_path = r'.'
out_path = r'.'
model_path = r'.'
train_data_path = os.path.join(root_path, 'train', 'audio')
test_data_path = os.path.join(root_path, 'test', 'audio')

Here are custom_fft and log_specgram functions written by __DavidS__.

In [3]:
def custom_fft(y, fs):
    T = 1.0 / fs
    N = y.shape[0]
    yf = fft(y)
    xf = np.linspace(0.0, 1.0/(2.0*T), N//2)
    # FFT is simmetrical, so we take just the first half
    # FFT is also complex, to we take just the real part (abs)
    vals = 2.0/N * np.abs(yf[0:N//2])
    return xf, vals

def log_specgram(audio, sample_rate, window_size=20,
                 step_size=10, eps=1e-10):
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = int(round(step_size * sample_rate / 1e3))
    freqs, times, spec = signal.spectrogram(audio,
                                    fs=sample_rate,
                                    window='hann',
                                    nperseg=nperseg,
                                    noverlap=noverlap,
                                    detrend=False)
    return freqs, times, np.log(spec.T.astype(np.float32) + eps)

Following is the utility function to grab all wav files inside train data folder.

In [4]:
def list_wavs_fname(dirpath, ext='wav'):
    print(dirpath)
    fpaths = glob(os.path.join(dirpath, r'*/*' + ext))
    pat = r'.+/(\w+)/\w+\.' + ext + '$'
    labels = []
    for fpath in fpaths:
        r = re.match(pat, fpath)
        if r:
            labels.append(r.group(1))
    pat = r'.+/(\w+\.' + ext + ')$'
    fnames = []
    for fpath in fpaths:
        r = re.match(pat, fpath)
        if r:
            fnames.append(r.group(1))
    return labels, fnames

__pad_audio__ will pad audios that are less than 16000(1 second) with 0s to make them all have the same length.

__chop_audio__ will chop audios that are larger than 16000(eg. wav files in background noises folder) to 16000 in length. In addition, it will create several chunks out of one large wav files given the parameter 'num'.

__label_transform__ transform labels into dummies values. It's used in combination with softmax to predict the label.

In [5]:
def pad_audio(samples):
    if len(samples) >= L: return samples
    else: return np.pad(samples, pad_width=(L - len(samples), 0), mode='constant', constant_values=(0, 0))

def chop_audio(samples, L=16000, num=20):
    for i in range(num):
        beg = np.random.randint(0, len(samples) - L)
        yield samples[beg: beg + L]

def label_transform(labels):
    nlabels = []
    for label in labels:
        if label == '_background_noise_':
            nlabels.append('silence')
        elif label not in legal_labels:
            nlabels.append('unknown')
        else:
            nlabels.append(label)
    return pd.get_dummies(pd.Series(nlabels))

Next, we use functions declared above to generate x_train and y_train.
label_index is the index used by pandas to create dummy values, we need to save it for later use.

In [6]:
labels, fnames = list_wavs_fname(train_data_path)

new_sample_rate = 8000
y_train = []
x_train = []

for label, fname in tqdm(list(zip(labels, fnames))):
    sample_rate, samples = wavfile.read(os.path.join(train_data_path, label, fname))
    samples = pad_audio(samples)
    if len(samples) > 16000:
        n_samples = chop_audio(samples)
    else: n_samples = [samples]
    for samples in n_samples:
        resampled = signal.resample(samples, int(new_sample_rate / sample_rate * samples.shape[0]))
        _, _, specgram = log_specgram(resampled, sample_rate=new_sample_rate)
        y_train.append(label)
        x_train.append(specgram)
x_train = np.array(x_train)
x_train = x_train.reshape(tuple(list(x_train.shape) + [1]))
y_train = label_transform(y_train)
label_index = y_train.columns.values
y_train = y_train.values
y_train = np.array(y_train)
del labels, fnames
gc.collect()

./train/audio


HBox(children=(IntProgress(value=0, max=64727), HTML(value='')))






7

In [7]:
class WavDataset(Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y
    
    def __len__(self):
        return self.x.shape[0]
    
    def __getitem__(self, idx):
        return {
            'x': torch.from_numpy(self.x[idx]).permute(2, 0, 1).type(torch.FloatTensor),
            'y': self.y[idx].argmax()
        }

CNN declared below.
The specgram created will be of shape (99, 81), but in order to fit into Conv2D layer, we need to reshape it.

In [8]:
class Model(nn.Module):
    def __init__(self, num_classes):
        super(Model, self).__init__()
        self.bn1 = nn.BatchNorm2d(1)
        self.conv1_1 = nn.Conv2d(1, 8, 2)
        self.conv1_2 = nn.Conv2d(8, 8, 2)
        self.drop1 = nn.Dropout(p=0.2)

        self.conv2_1 = nn.Conv2d(8, 16, 3)
        self.conv2_2 = nn.Conv2d(16, 16, 3)
        self.drop2 = nn.Dropout(p=0.2)
        
        self.conv3 = nn.Conv2d(16, 32, 3)
        self.drop3 = nn.Dropout(p=0.2)
        
        self.fc4 = nn.Linear(2240, 128)
        self.bn4 = nn.BatchNorm1d(128)
        
        self.fc5 = nn.Linear(128, 128)
        self.bn5 = nn.BatchNorm1d(128)
        
        self.fc6 = nn.Linear(128, nclass)
        
    def forward(self, x):
        x = self.bn1(x)
        x = F.relu(self.conv1_1(x))
        x = F.relu(self.conv1_2(x))
        x = F.max_pool2d(x, 2)
        x = self.drop1(x)
        
        x = F.relu(self.conv2_1(x))
        x = F.relu(self.conv2_2(x))
        x = F.max_pool2d(x, 2)
        x = self.drop2(x)
        
        x = F.relu(self.conv3(x))
        x = F.max_pool2d(x, 2)
        x = self.drop3(x)
        
        x = x.view(-1, self.num_flatten_features(x))
        
        x = F.relu(self.fc4(x))
        x = self.bn4(x)
        
        x = F.relu(self.fc5(x))
        x = self.bn5(x)
        
        x = F.softmax(self.fc6(x))        
        return x

    def num_flatten_features(self, x):
        size = x.size()[1:]
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

In [9]:
num_epochs = 1
learning_rate = 0.001
nclass = 12
batch_size = 16

In [10]:
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.1)

In [11]:
train = WavDataset(x_train, y_train)
val = WavDataset(x_val, y_val)

train = DataLoader(train, batch_size=batch_size, shuffle=True)
val = DataLoader(val, batch_size=batch_size, shuffle=True)

In [16]:

model = Model(nclass)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    
    for i, batch in tqdm(enumerate(train)):
        x_batch = Variable(batch['x'])
        y = Variable(batch['y'])
        
        optimizer.zero_grad()
        outputs = model(x_batch)
        
        loss = criterion(outputs, y)
        loss.backward()
        optimizer.step()
        
        if (i + 1) % 100 == 0:
            print(f'Epoch [{epoch + 1}/{num_epochs}], Iter [{i + 1}/{len(y_train) // batch_size}] Loss: {loss.data[0]:.3f}')

HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))



Epoch [1/1], Iter [100/3647] Loss: 2.073
Epoch [1/1], Iter [200/3647] Loss: 2.119
Epoch [1/1], Iter [300/3647] Loss: 2.121
Epoch [1/1], Iter [400/3647] Loss: 1.933
Epoch [1/1], Iter [500/3647] Loss: 1.870
Epoch [1/1], Iter [600/3647] Loss: 1.994
Epoch [1/1], Iter [700/3647] Loss: 1.994
Epoch [1/1], Iter [800/3647] Loss: 2.181
Epoch [1/1], Iter [900/3647] Loss: 1.807
Epoch [1/1], Iter [1000/3647] Loss: 1.744
Epoch [1/1], Iter [1100/3647] Loss: 1.994
Epoch [1/1], Iter [1200/3647] Loss: 1.806
Epoch [1/1], Iter [1300/3647] Loss: 2.244
Epoch [1/1], Iter [1400/3647] Loss: 2.056
Epoch [1/1], Iter [1500/3647] Loss: 1.994
Epoch [1/1], Iter [1600/3647] Loss: 1.994
Epoch [1/1], Iter [1700/3647] Loss: 2.181
Epoch [1/1], Iter [1800/3647] Loss: 1.806
Epoch [1/1], Iter [1900/3647] Loss: 2.056
Epoch [1/1], Iter [2000/3647] Loss: 2.244
Epoch [1/1], Iter [2100/3647] Loss: 2.056
Epoch [1/1], Iter [2200/3647] Loss: 2.181
Epoch [1/1], Iter [2300/3647] Loss: 1.931
Epoch [1/1], Iter [2400/3647] Loss: 2.119
E

In [18]:
model.eval()
correct = 0
total = 0

for batch in tqdm(val):
    x_batch = Variable(batch['x'])
    y_true = batch['y'].numpy()
    y_pred = model(x_batch).data.max(1)[1]
    
    correct += (y_true == y_pred).sum()
    total += len(y_true)

correct / total

HBox(children=(IntProgress(value=0, max=406), HTML(value='')))






0.6317656129529684