[[Youhan Lee] Kaggle Korea Kaggle Study Kernel Curriculum](https://kaggle-kr.tistory.com/32)

[TensorFlow Speech Recognition Challenge](https://www.kaggle.com/c/tensorflow-speech-recognition-challenge)

[Light-Weight CNN LB 0.74](https://www.kaggle.com/alphasis/light-weight-cnn-lb-0-74)

# Preface
This notebook aims to build a light-weight CNN.

It uses spectrograms of resampled wav files (sample rate 8000) as inputs.

Due to Kaggle cloud hardware limitations, this script is a 'crippled' version of the original one.

In order to get LB 0.74, you need to set epoch to 5, set chop_audio(num=1000) and double all Conv layer parameters.

Although this script is a slight improvement over Alex Ozerin's baseline, I believe one can achieve higher scores by using the original wav files (16000 sample rate).

# File Structure
This script assumes the data are stored in the following structure:

speech

├── test

│ └── audio #test wavfiles

├── train

│ ├── audio #train wavfiles

├── model #store models

└── out #store sub.csv

# Improve This Script
Since this is only a light-weight CNN, its performance is limited. Here are some ways to improve its performance.

1. Use the original wav files instead of the resampled ones.
2. Create more 'silence' wav files using chop_audio.
3. Build a deeper CNN or use an RNN.
4. Train for more epochs.

# Afterword
It's still a long way to reach LB 0.88.

In fact, I doubt a CNN would ever reach that high.

Feel free to share your ideas in the comment sections about using CNN to label wav files :)

# Appendix
Thanks to **DavidS** and **Alex Ozerin** for their great notebooks!

In [1]:
import warnings

warnings.filterwarnings("ignore")

In [2]:
import os
import numpy as np
from scipy.fftpack import fft
from scipy.io import wavfile
from scipy import signal
from glob import glob
import re
import pandas as pd
import gc
# from scipy.io import wavfile

from keras import optimizers, losses, activations, models
from keras.layers import Convolution2D, Dense, Input, Flatten, Dropout, MaxPooling2D, BatchNormalization
from sklearn.model_selection import train_test_split
import keras



The original sample rate is 16000, and we will resample it to 8000 to reduce data size.

In [3]:
L = 16000
legal_labels = "yes no up down left right on off stop go silence unknown".split()

# src folders
root_path = r".."
out_path = r"."
model_path = r"."
train_data_path = os.path.join(root_path, "input", "train", "audio")
test_data_path = os.path.join(root_path, "input", "test", "audio")
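To make the resampling step mentioned above concrete, here is a minimal sketch on a synthetic 1-second tone. The tone stands in for a real clip and is purely an assumption for illustration; the call itself mirrors the `signal.resample` usage later in this notebook.

```python
import numpy as np
from scipy import signal

sample_rate = 16000      # original rate of the wav files
new_sample_rate = 8000   # target rate used in this notebook

# Synthetic stand-in for one clip: a 1-second 440 Hz tone (not real data).
t = np.arange(sample_rate) / sample_rate
samples = np.sin(2 * np.pi * 440 * t)

# signal.resample is FFT-based; its second argument is the number of
# output samples, not the target sample rate.
n_out = int(new_sample_rate / sample_rate * samples.shape[0])
resampled = signal.resample(samples, n_out)

print(samples.shape, resampled.shape)
```

Halving the sample rate halves the data size, but it also lowers the Nyquist frequency from 8000 Hz to 4000 Hz, so any content above 4000 Hz is lost. That trade-off is likely part of why the original 16000 Hz files may score higher.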

Here are the custom_fft and log_specgram functions written by **DavidS**.

In [4]:
def custom_fft(y, fs):
    T = 1.0 / fs
    N = y.shape[0]
    yf = fft(y)
    xf = np.linspace(0.0, 1.0 / (2.0 * T), N // 2)
    # The FFT of a real signal is symmetrical, so we take just the first half
    # The FFT is also complex, so we take just the magnitude (abs)
    vals = 2.0 / N * np.abs(yf[0:N//2])

    return xf, vals

def log_specgram(audio, sample_rate, window_size=20,
                 step_size=10, eps=1e-10):
    # window_size is in milliseconds, so convert it to a number of samples
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = int(round(step_size * sample_rate / 1e3))
    freqs, times, spec = signal.spectrogram(audio,
                                            fs=sample_rate,
                                            window="hann",
                                            nperseg=nperseg,
                                            noverlap=noverlap,
                                            detrend=False)
    
    return freqs, times, np.log(spec.T.astype(np.float32) + eps)
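As a quick sanity check, the sketch below runs log_specgram on one second of random noise at 8000 Hz and prints the resulting shape. It assumes the window length is computed as `window_size * sample_rate / 1e3` (20 ms gives 160 samples, 10 ms overlap gives 80 samples); the noise input is made up for illustration.

```python
import numpy as np
from scipy import signal

def log_specgram(audio, sample_rate, window_size=20, step_size=10, eps=1e-10):
    # window_size and step_size are in milliseconds
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = int(round(step_size * sample_rate / 1e3))
    freqs, times, spec = signal.spectrogram(audio,
                                            fs=sample_rate,
                                            window="hann",
                                            nperseg=nperseg,
                                            noverlap=noverlap,
                                            detrend=False)
    # Transpose so that rows are time frames and columns are frequency bins
    return freqs, times, np.log(spec.T.astype(np.float32) + eps)

rate = 8000
audio = np.random.default_rng(0).normal(size=rate)  # 1 second of fake audio
freqs, times, spec = log_specgram(audio, rate)

print(spec.shape)   # (time frames, frequency bins)
print(freqs[-1])    # highest frequency bin, i.e. the Nyquist frequency
```

With these settings a 1-second clip yields 99 time frames of 81 frequency bins each, so after adding a channel axis every training sample has shape (99, 81, 1).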

The following utility function lists all wav files inside the train data folder.

In [5]:
def list_wavs_fname(dirpath, ext="wav"):
    print(dirpath)
    fpaths = glob(os.path.join(dirpath, r"*/*" + ext))
    pat = r".+/(\w+)/\w+\." + ext + "$"
    labels = []
    for fpath in fpaths:
        r = re.match(pat, fpath)
        if r:
            labels.append(r.group(1))
    pat = r".+/(\w+\." + ext + ")$"
    fnames = []
    for fpath in fpaths:
        r = re.match(pat, fpath)
        if r:
            fnames.append(r.group(1))
    
    return labels, fnames
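To see what the two regular expressions in list_wavs_fname extract, here is a small sketch that applies them to hand-written paths instead of real glob results. The file names are illustrative assumptions, not taken from the dataset listing.

```python
import re

ext = "wav"
label_pat = r".+/(\w+)/\w+\." + ext + "$"   # captures the parent folder name
fname_pat = r".+/(\w+\." + ext + ")$"       # captures the file name

# Hypothetical paths in the shape glob would return on Linux or macOS.
paths = [
    "../input/train/audio/yes/0a7c2a8d_nohash_0.wav",
    "../input/train/audio/_background_noise_/white_noise.wav",
]

parsed = []
for p in paths:
    label = re.match(label_pat, p).group(1)
    fname = re.match(fname_pat, p).group(1)
    parsed.append((label, fname))
print(parsed)

# Both patterns hard-code "/" as the separator, so a Windows-style path
# does not match and would be silently skipped.
windows_path = r"..\input\train\audio\yes\0a7c2a8d_nohash_0.wav"
print(re.match(label_pat, windows_path))
```

Because unmatched paths are skipped without any warning, an empty result from list_wavs_fname usually means either the data path is wrong or the path separator differs from the one the patterns expect.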

**pad_audio** pads audio clips shorter than 16000 samples (1 second) with zeros so that they all have the same length.

**chop_audio** chops audio clips longer than 16000 samples (e.g. the wav files in the background noise folder) down to 16000 samples. It creates several chunks out of one large wav file, as set by the parameter 'num'.

**label_transform** transforms labels into dummy (one-hot) values. It is used in combination with softmax to predict the label.

In [6]:
def pad_audio(samples):
    if len(samples) >= L:
        return samples
    else:
        return np.pad(samples, pad_width=(L - len(samples), 0), mode="constant", constant_values=(0, 0))

def chop_audio(samples, L=16000, num=20):
    for i in range(num):
        beg = np.random.randint(0, len(samples) - L)
        yield samples[beg: beg + L]

def label_transform(labels):
    nlabels = []
    for label in labels:
        if label == "_background_noise_":
            nlabels.append("silence")
        elif label not in legal_labels:
            nlabels.append("unknown")
        else:
            nlabels.append(label)

    return pd.get_dummies(pd.Series(nlabels))
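The three helpers can be tried out on synthetic arrays before touching real wav files. The sketch below repeats the definitions so it runs on its own; the input arrays and the label list are made up for illustration.

```python
import numpy as np
import pandas as pd

L = 16000
legal_labels = "yes no up down left right on off stop go silence unknown".split()

def pad_audio(samples):
    if len(samples) >= L:
        return samples
    # Zeros are added at the front of the clip
    return np.pad(samples, pad_width=(L - len(samples), 0),
                  mode="constant", constant_values=(0, 0))

def chop_audio(samples, L=16000, num=20):
    for _ in range(num):
        beg = np.random.randint(0, len(samples) - L)
        yield samples[beg: beg + L]

def label_transform(labels):
    nlabels = []
    for label in labels:
        if label == "_background_noise_":
            nlabels.append("silence")
        elif label not in legal_labels:
            nlabels.append("unknown")
        else:
            nlabels.append(label)
    return pd.get_dummies(pd.Series(nlabels))

short = np.ones(12000, dtype=np.int16)       # fake 0.75-second clip
padded = pad_audio(short)
print(padded.shape, padded[:4000].sum(), padded[4000:].sum())

noise = np.arange(60 * L)                    # fake 60-second noise file
chunks = list(chop_audio(noise, num=3))
print([c.shape for c in chunks])

dummies = label_transform(["yes", "bed", "_background_noise_"])
print(list(dummies.columns))
```

Note that pd.get_dummies only creates columns for labels that actually occur in its input, sorted alphabetically. The column order therefore depends on the data, which is why it needs to be saved alongside the model.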

Next, we use the functions declared above to generate x_train and y_train. label_index holds the column labels that pandas creates for the dummy values; we need to save it for later use.

In [12]:
labels, fnames = list_wavs_fname(train_data_path)

new_sample_rate = 8000
y_train = []
x_train = []

for label, fname in zip(labels, fnames):
    sample_rate, samples = wavfile.read(os.path.join(train_data_path, label, fname))
    samples = pad_audio(samples)
    if len(samples) > 16000:
        n_samples = chop_audio(samples)
    else:
        n_samples = [samples]
    for samples in n_samples:
        resampled = signal.resample(samples, int(new_sample_rate / sample_rate * samples.shape[0]))
        _, _, specgram = log_specgram(resampled, sample_rate=new_sample_rate)
        y_train.append(label)
        x_train.append(specgram)
x_train = np.array(x_train)
x_train = x_train.reshape(tuple(list(x_train.shape) + [1]))
y_train = label_transform(y_train)
label_index = y_train.columns.values
y_train = y_train.values
y_train = np.array(y_train)
del labels, fnames
gc.collect()

../input/train/audio


99
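The reason label_index is worth keeping is that it maps a softmax output row back to a label name. Here is a minimal sketch of that decoding step, using a tiny hand-made label list and made-up probabilities rather than real model predictions.

```python
import numpy as np
import pandas as pd

# Small stand-in for the output of label_transform (labels are illustrative).
y = pd.get_dummies(pd.Series(["yes", "no", "silence", "unknown", "no"]))
label_index = y.columns.values   # column order chosen by pandas
print(list(label_index))

# Made-up softmax outputs for two clips, one probability per column.
probs = np.array([
    [0.1, 0.2, 0.1, 0.6],
    [0.7, 0.1, 0.1, 0.1],
])

# argmax picks the most likely column; label_index turns it into a name.
predicted = label_index[np.argmax(probs, axis=1)]
print(list(predicted))
```

If label_index is lost or rebuilt from a different label set, the column order may change and predictions would be decoded to the wrong names.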

In [9]:
labels, fnames = list_wavs_fname(train_data_path)

new_sample_rate = 8000
y_train = []
x_train = []

for label, fname in zip(labels, fnames):
    sample_rate, samples = wavfile.read(os.path.join(train_data_path, label, fname))
    samples = pad_audio(samples)
    if len(samples) > 16000:
        n_samples = chop_audio(samples)
    else:
        n_samples = [samples]
    
    for samples in n_samples:
        resampled = signal.resample(samples, int(new_sample_rate / sample_rate * samples.shape[0]))
        _, _, specgram = log_specgram(resampled, sample_rate=new_sample_rate)
        y_train.append(label)
        x_train.append(specgram)
x_train = np.array(x_train)
# x_train = x_train.reshape(tuple(list(x_train.shape) + [1]))
x_train = x_train.reshape(tuple(x_train.shape) + (1,))
y_train = label_transform(y_train)
# label_index = y_train.columns.values
# y_train = y_train.values
# y_train = np.array(y_train)
# del labels, fnames
# gc.collect()

../input/train/audio


In [13]:
print(y_train)

[]

