* Importing necessary Libraries
* Data Source 
* 1. Audio Features
    * Features extraction
    * Visualization
* Data preprocessing
* Dataset Investigation
* 2. VAD
* 3. Anomaly Detection
* 4. Frequency components across the words

------------------------------------

Sources for this work:
- Speech representation and data exploration, DAVIDS: https://www.kaggle.com/davids1992/speech-representation-and-data-exploration?scriptVersionId=1924001
- Voice activity detection example, ANDRE HOLZNER: https://www.kaggle.com/holzner/voice-activity-detection-example
- Voice Activity Detection with webrtcVAD|7z archive, ATUL ANAND {JHA}: https://www.kaggle.com/atulanandjha/voice-activity-detection-with-webrtcvad-7z-archive

### Importing necessary Libraries

In [None]:
import os
from os.path import isdir, join
from pathlib import Path
from subprocess import check_output

import pandas as pd

# Math
import numpy as np
from scipy.fftpack import fft
from scipy import signal
from scipy.io import wavfile
import librosa

from sklearn.decomposition import PCA

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import IPython.display as ipd
import librosa.display

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

%matplotlib inline


### Install Python packages in an internet-enabled notebook

In [None]:
!pip install pyunpack
!pip install patool

# Data Source

Unpack the .7z file.


In [None]:
from pyunpack import Archive
import shutil
if not os.path.exists('/kaggle/working/train/'):
    os.makedirs('/kaggle/working/train/')
Archive('/kaggle/input/train.7z').extractall('/kaggle/working/train/')
# for dirname, _, filenames in os.walk('/kaggle/working/train/'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))


After you have finished working with the extracted files, you can delete them so that your commit will succeed (the maximum number of files in the working directory for a commit is 500).

In [None]:
shutil.make_archive('train/', 'zip', 'train')

In [None]:
# Delete the extracted files to stay under the 500-file limit of the working directory when committing.
!rm -rf kaggle/working/train/*

In [None]:
# Path to the training audio files.
train_audio_path = "/kaggle/working/train/train/audio"

In [None]:
"""
# It is just a checker code to validate the presence of file.

print(check_output(["ls", "../input/train/audio"]).decode("utf8"))
print(os.listdir("../input/train"))

"""


print(check_output(["ls", "/kaggle/working/train/train/audio"]).decode("utf8"))
print(os.listdir("/kaggle/working/train/train/audio/yes"))

In [None]:
# Example input file to be used here...
filename = '/yes/00f0204f_nohash_0.wav'

In [None]:
dirs = [f for f in os.listdir(train_audio_path) if isdir(join(train_audio_path, f))]
dirs.sort()
print('Number of labels: ' + str(len(dirs)))

# Audio Features


## Features extraction
A general feature-extraction pipeline for an audio sample looks like this:

1. Resampling
2. VAD
3. Optional zero-padding so that all signals have equal length
4. Log spectrogram (or MFCC, or PLP)
5. Feature normalization with mean and std
6. Stacking of a given number of frames to get temporal information
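
Steps 3, 5 and 6 of the list above can be sketched with NumPy alone. This is a minimal illustration on random data, not the pipeline used later in this notebook. The helper names `pad_or_trim`, `normalize` and `stack_frames`, the context size of 2 frames, and the array shapes are all assumptions made for the example.

```python
import numpy as np

def pad_or_trim(samples, target_len):
    """Zero-pad (or cut) a 1-D signal so it has exactly target_len samples."""
    if len(samples) >= target_len:
        return samples[:target_len]
    return np.pad(samples, (0, target_len - len(samples)))

def normalize(features, eps=1e-8):
    """Per-feature mean/std normalization; eps guards against zero variance."""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + eps)

def stack_frames(features, context=2):
    """Concatenate each frame with its `context` left and right neighbours.

    The edges are padded by repeating the first and last frame.
    """
    padded = np.pad(features, ((context, context), (0, 0)), mode='edge')
    n_frames = features.shape[0]
    return np.hstack([padded[i:i + n_frames] for i in range(2 * context + 1)])

rng = np.random.default_rng(0)
short_clip = rng.standard_normal(12000)      # hypothetical clip shorter than 1 s
fixed = pad_or_trim(short_clip, 16000)       # step 3: equal length

frames = rng.standard_normal((99, 161))      # stand-in for a (time, freq) spectrogram
normalized = normalize(frames)               # step 5: mean/std normalization
stacked = stack_frames(normalized, context=2)  # step 6: temporal context

print(fixed.shape, stacked.shape)
```

Each stacked row holds five consecutive frames, so a model that sees one row at a time still gets some temporal context.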


`sample_rate, samples = wavfile.read(str(train_audio_path) + filename)`

The line above works for everything except librosa's MFCC functions, which require floating-point input (otherwise librosa raises an error saying the data must be floating-point). So we read the wave file with librosa instead.

Note that `librosa.load` returns float samples and, by default, resamples the audio to 22050 Hz. Pass `sr=None` to keep the file's native sample rate.

In [None]:
samples, sample_rate = librosa.load(str(train_audio_path)+filename)

## Visualization

There are two theories of human hearing: [place](https://en.wikipedia.org/wiki/Place_theory_(hearing)) (frequency-based) and [temporal](https://en.wikipedia.org/wiki/Temporal_theory_(hearing)). In speech recognition, I see two main tendencies: feeding in spectrograms (frequencies), or using more sophisticated features such as MFCC (Mel-Frequency Cepstral Coefficients) or PLP. You rarely work with raw, temporal data.


### 1.1 Spectrogram

Define a function that calculates the spectrogram.

Note that we take the logarithm of the spectrogram values. This makes the plot much clearer, and it is closely connected to the way people hear. We need to make sure that no 0 values are passed to the logarithm, which is why a small `eps` is added.


In [None]:
def log_specgram(audio, sample_rate, window_size=20,
                 step_size=10, eps=1e-10):
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = int(round(step_size * sample_rate / 1e3))
    freqs, times, spec = signal.spectrogram(audio,
                                    fs=sample_rate,
                                    window='hann',
                                    nperseg=nperseg,
                                    noverlap=noverlap,
                                    detrend=False)
    return freqs, times, np.log(spec.T.astype(np.float32) + eps)

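Frequencies range from 0 to half the sample rate (the Nyquist frequency): 0 to 8000 Hz at the dataset's native 16 kHz rate, or 0 to 11025 Hz for audio loaded at librosa's default 22050 Hz.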

Let's plot it:

In [None]:
freqs, times, spectrogram = log_specgram(samples, sample_rate)

fig = plt.figure(figsize=(14, 8))
ax1 = fig.add_subplot(211)
ax1.set_title('Raw wave of ' + filename)
ax1.set_ylabel('Amplitude')
# Time axis in seconds: one point per sample, spanning the clip duration
ax1.plot(np.linspace(0, len(samples) / sample_rate, len(samples)), samples)

ax2 = fig.add_subplot(212)
ax2.imshow(spectrogram.T, aspect='auto', origin='lower', 
           extent=[times.min(), times.max(), freqs.min(), freqs.max()])
ax2.set_yticks(freqs[::16])
ax2.set_xticks(times[::16])
ax2.set_title('Spectrogram of ' + filename)
ax2.set_ylabel('Freqs in Hz')
ax2.set_xlabel('Seconds')

Normalize the spectrogram. This is always a good idea if we are going to feed it into a neural network.

In [None]:
mean = np.mean(spectrogram, axis=0)
std = np.std(spectrogram, axis=0)
spectrogram = (spectrogram - mean) / std

There is an interesting fact to point out. With a 20 ms window and a 16 kHz sample rate, we have ~160 features for each frame, with frequencies between 0 and 8000 Hz. It means that one feature corresponds to 50 Hz. However, the frequency resolution of the ear is 3.6 Hz within the octave of 1000 to 2000 Hz. It means that people are far more precise and can hear much smaller details than those represented by spectrograms like the one above.
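
The 50 Hz figure can be checked with a few lines of NumPy. This sketch assumes the dataset's native 16 kHz rate and the 20 ms window that `log_specgram` uses by default; it only computes the bin frequencies and does not touch any audio.

```python
import numpy as np

sample_rate = 16000   # assumed: native rate of the recordings
window_ms = 20        # same default window length as log_specgram

# Number of samples in one analysis window
nperseg = int(round(window_ms * sample_rate / 1e3))

# Centre frequencies of the one-sided FFT bins for a window of that size
bin_freqs = np.fft.rfftfreq(nperseg, d=1.0 / sample_rate)

print('bins per frame:', len(bin_freqs))
print('bin spacing in Hz:', bin_freqs[1] - bin_freqs[0])
print('highest bin in Hz:', bin_freqs[-1])
```

The bin spacing equals the inverse of the window duration, so a longer window gives finer frequency resolution at the cost of coarser time resolution.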

### MFCC

If you want to learn more about MFCC, take a look at the tutorial "MFCC explained". You will see that MFCC is designed to imitate the properties of human hearing.

You can calculate the Mel power spectrogram and MFCC using, for example, the librosa Python package.
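
To make the link with human hearing concrete, here is a small sketch of the mel scale itself. It uses the common HTK-style formula `mel = 2595 * log10(1 + f / 700)`, which is an assumption for illustration: librosa's default mel conversion uses a different (Slaney-style) variant. The function names are hypothetical helpers, not librosa calls.

```python
import numpy as np

def hz_to_mel(freq_hz):
    """HTK-style Hz -> mel conversion (assumed formula, see the text above)."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# Ten bands that are equally wide on the mel scale between 0 and 8000 Hz
mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 11)
hz_edges = mel_to_hz(mel_edges)
band_widths = np.diff(hz_edges)

print('band edges in Hz:', np.round(hz_edges))
print('band widths in Hz:', np.round(band_widths))
```

The bands are narrow at low frequencies and wide at high frequencies, which mirrors the ear's finer resolution for low tones.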


In [None]:
# From this tutorial
# https://github.com/librosa/librosa/blob/master/examples/LibROSA%20demo.ipynb
# Pass the signal as the keyword argument `y`; recent librosa versions reject it positionally.
S = librosa.feature.melspectrogram(y=samples, sr=sample_rate, n_mels=128)

# Convert to log scale (dB). We'll use the peak power (max) as reference.
log_S = librosa.power_to_db(S, ref=np.max)

plt.figure(figsize=(12, 4))
librosa.display.specshow(log_S, sr=sample_rate, x_axis='time', y_axis='mel')
plt.title('Mel power spectrogram ')
plt.colorbar(format='%+02.0f dB')
plt.tight_layout()

#### Now delta-MFCC

In [None]:
mfcc = librosa.feature.mfcc(S=log_S, n_mfcc=13)

# Compute the second-order deltas (delta-delta) of the MFCCs
delta2_mfcc = librosa.feature.delta(mfcc, order=2)

plt.figure(figsize=(12, 4))
librosa.display.specshow(delta2_mfcc)
plt.ylabel('MFCC coeffs')
plt.xlabel('Time')
plt.title('Delta-delta MFCC')
plt.colorbar()
plt.tight_layout()

### Spectrogram in 3d 

In [None]:
# data = [go.Surface(z=spectrogram.T)]
# layout = go.Layout(
#     title='Spectrogram of "yes" in 3d',
#     scene = dict(
#     yaxis = dict(title='Frequencies', range=freqs),
#     xaxis = dict(title='Time', range=times),
#     zaxis = dict(title='Log amplitude'),
#     ),
# )
# fig = go.Figure(data=data, layout=layout)
# py.iplot(fig)

In classical systems, MFCC or similar features are taken as the input to the system instead of spectrograms.

However, in end-to-end (often neural-network based) systems, the most common input features are probably raw spectrograms or mel power spectrograms. For example, MFCC decorrelates features, but neural networks deal with correlated features well.

## Data Preprocessing

### Silence Removal
Although the words are short, there is a lot of silence in them. A decent VAD can reduce the training size a lot, accelerating training significantly. Let's cut a bit of the file from the beginning and from the end, and listen to it again (based on the plot above, we take samples 4000 to 13000):

In [None]:
## without silence removal
ipd.Audio(samples, rate=sample_rate)

In [None]:
# With manual silence removal
samples_cut = samples[4000:13000]
ipd.Audio(samples_cut, rate=sample_rate)

We can agree that the entire word can be heard. It is impossible to cut all the files manually based on a simple plot, but you can use, for example, the webrtcvad package to get a good VAD.
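
As a rough automatic alternative to manual cutting, silence can also be trimmed with a simple energy threshold. The sketch below is not webrtcvad and is much cruder than a real VAD; the function name `trim_silence`, the 20 ms frame length and the -30 dB threshold are assumptions chosen for illustration, and the input is a synthetic tone surrounded by silence.

```python
import numpy as np

def trim_silence(samples, sample_rate, frame_ms=20, threshold_db=-30.0):
    """Drop leading/trailing frames whose RMS is far below the loudest frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    if n_frames == 0:
        return samples
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1))
    # Level of each frame relative to the loudest frame, in dB
    rms_db = 20.0 * np.log10(rms / (rms.max() + 1e-12) + 1e-12)
    active = np.flatnonzero(rms_db > threshold_db)
    if active.size == 0:
        return samples  # nothing above the threshold: leave the clip untouched
    start = active[0] * frame_len
    stop = (active[-1] + 1) * frame_len
    return samples[start:stop]

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clip = np.zeros(sample_rate)
# Synthetic "word": a 440 Hz tone occupying samples 4000..13000
clip[4000:13000] = 0.5 * np.sin(2 * np.pi * 440 * t[4000:13000])

trimmed = trim_silence(clip, sample_rate)
print(len(clip), len(trimmed))
```

The cut points snap to frame boundaries, so the trimmed clip can be up to one frame longer than the tone at each end.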

Let's plot it again, together with a guessed alignment of the 'y', 'e', 's' graphemes.

In [None]:
freqs, times, spectrogram_cut = log_specgram(samples_cut, sample_rate)

fig = plt.figure(figsize=(14, 8))
ax1 = fig.add_subplot(211)
ax1.set_title('Raw wave of ' + filename)
ax1.set_ylabel('Amplitude')
ax1.plot(samples_cut)

ax2 = fig.add_subplot(212)
ax2.set_title('Spectrogram of ' + filename)
ax2.set_ylabel('Freqs in Hz')
ax2.set_xlabel('Seconds')
ax2.imshow(spectrogram_cut.T, aspect='auto', origin='lower', 
           extent=[times.min(), times.max(), freqs.min(), freqs.max()])
ax2.set_yticks(freqs[::16])
ax2.set_xticks(times[::16])
ax2.text(0.06, 1000, 'Y', fontsize=18)
ax2.text(0.17, 1000, 'E', fontsize=18)
ax2.text(0.36, 1000, 'S', fontsize=18)

xcoords = [0.025, 0.11, 0.23, 0.49]
for xc in xcoords:
    # ax1 is indexed in samples, so convert seconds using the actual sample rate
    ax1.axvline(x=xc*sample_rate, c='r')
    ax2.axvline(x=xc, c='r')

### Resampling - dimensionality reduction

- Another way to reduce the dimensionality of our data is to resample the recordings.
- This gives a smaller training size.

The recordings don't sound very natural because they are sampled at 16 kHz, while we can usually hear much higher frequencies.
However, most speech-related frequencies lie in a narrower band. That's why you can still understand another person talking on the telephone, where the GSM signal is sampled at 8000 Hz.

To summarize, we could resample our dataset to 8 kHz. We will discard some information that shouldn't be important, and we'll reduce the size of the data.
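
What "discarding information" means can be shown on a synthetic signal. The sketch below is a simplified NumPy version of Fourier-domain resampling (the same general idea that `scipy.signal.resample` is based on), applied to two made-up tones: one inside the 0 to 4 kHz band and one above it. The helper names `fft_resample` and `tone_amplitude` are assumptions for this example.

```python
import numpy as np

def fft_resample(samples, num):
    """Downsample to `num` points by keeping only the lowest FFT bins."""
    spectrum = np.fft.rfft(samples)
    kept = spectrum[:num // 2 + 1]
    # Rescale so that amplitudes are preserved after the shorter inverse FFT
    return np.fft.irfft(kept, n=num) * (num / len(samples))

def tone_amplitude(samples, sample_rate, freq_hz):
    """Amplitude of the FFT bin closest to freq_hz."""
    spectrum = np.abs(np.fft.rfft(samples)) * 2.0 / len(samples)
    return spectrum[int(round(freq_hz * len(samples) / sample_rate))]

sr, new_sr = 16000, 8000
t = np.arange(sr) / sr
low_tone = np.sin(2 * np.pi * 1000 * t)    # 1 kHz: below the new 4 kHz Nyquist limit
high_tone = np.sin(2 * np.pi * 6000 * t)   # 6 kHz: above it, so it cannot survive

resampled = fft_resample(low_tone + high_tone, new_sr)

print('samples after resampling:', len(resampled))
print('1 kHz amplitude after resampling:', tone_amplitude(resampled, new_sr, 1000))
```

The 1 kHz tone comes through unchanged, while the 6 kHz tone is removed entirely. For speech this is usually acceptable because most of the energy sits in the lower band.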

**FFT (Fast Fourier Transform)** 

In [None]:
def custom_fft(y, fs):
    T = 1.0 / fs
    N = y.shape[0]
    yf = fft(y)
    xf = np.linspace(0.0, 1.0/(2.0*T), N//2)
    # The FFT of a real signal is symmetric, so we keep only the first half.
    # The FFT is also complex, so we take its magnitude (abs).
    vals = 2.0/N * np.abs(yf[0:N//2])
    return xf, vals

Let's read a recording, resample it, and listen. We can also compare the FFTs. Notice that there is almost no information above 4000 Hz in the original signal.

In [None]:
# filename = '/happy/0b09edd3_nohash_0.wav'
filename ='/yes/00f0204f_nohash_0.wav'
new_sample_rate = 8000

sample_rate, samples = wavfile.read(str(train_audio_path) + filename)
resampled = signal.resample(samples, int(new_sample_rate/sample_rate * samples.shape[0]))

In [None]:
# without resampling 
ipd.Audio(samples, rate=sample_rate)

In [None]:
# with resampling 
ipd.Audio(resampled, rate=new_sample_rate)

In [None]:
# At original Sampling 

xf, vals = custom_fft(samples, sample_rate)
plt.figure(figsize=(12, 4))
plt.title('FFT of recording sampled with ' + str(sample_rate) + ' Hz')
plt.plot(xf, vals)
plt.xlabel('Frequency')
plt.grid()
plt.show()

In [None]:
# After resampling to reduce the training data size

xf, vals = custom_fft(resampled, new_sample_rate)
plt.figure(figsize=(12, 4))
plt.title('FFT of recording sampled with ' + str(new_sample_rate) + ' Hz')
plt.plot(xf, vals)
plt.xlabel('Frequency')
plt.grid()
plt.show()

## Data Set Investigation

Number of recordings

In [None]:
dirs = [f for f in os.listdir(train_audio_path) if isdir(join(train_audio_path, f))]
dirs.sort()
print('Number of labels: ' + str(len(dirs)))

In [None]:
# Calculate
number_of_recordings = []
for direct in dirs:
    waves = [f for f in os.listdir(join(train_audio_path, direct)) if f.endswith('.wav')]
    number_of_recordings.append(len(waves))

# Plot
trace = go.Bar(
    x=dirs,
    y=number_of_recordings,
    marker=dict(color = number_of_recordings, colorscale='dense', showscale=True
    ),
)
layout = go.Layout(
    title='Number of recordings in given label',
    xaxis = dict(title='Words'),
    yaxis = dict(title='Number of recordings')
)
py.iplot(go.Figure(data=[trace], layout=layout))

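The speaker ID is the part of each file name before `_nohash_`. The dataset should be split so that no speaker occurs in both the train and test sets. Compare the FFTs of two different speakers: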
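
One way to get such a speaker-disjoint split is to derive the split from a hash of the speaker ID, so that every file of a speaker lands on the same side. This is only a sketch: the function names, the 20% test fraction and the choice of SHA-1 are assumptions, and the last file name in the list is a hypothetical example added to show two files from one speaker.

```python
import hashlib

def speaker_id(filename):
    """'yes/00f0204f_nohash_0.wav' -> '00f0204f' (the part before '_nohash_')."""
    return filename.split('/')[-1].split('_nohash_')[0]

def assign_split(filename, test_fraction=0.2):
    """Deterministically map all files of one speaker to the same split."""
    digest = hashlib.sha1(speaker_id(filename).encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % 100
    return 'test' if bucket < test_fraction * 100 else 'train'

files = [
    'yes/00f0204f_nohash_0.wav',
    'yes/8830e17f_nohash_2.wav',
    'on/004ae714_nohash_0.wav',
    'on/0137b3f4_nohash_0.wav',
    'no/00f0204f_nohash_1.wav',   # hypothetical second file of speaker 00f0204f
]

for name in files:
    print(speaker_id(name), assign_split(name), name)
```

Because the assignment depends only on the speaker ID, adding new recordings later never moves an existing speaker between splits.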

In [None]:
filenames = ['/yes/00f0204f_nohash_0.wav', '/yes/8830e17f_nohash_2.wav']
for filename in filenames:
    sample_rate, samples = wavfile.read(str(train_audio_path) + filename)
    xf, vals = custom_fft(samples, sample_rate)
    plt.figure(figsize=(12, 4))
    # The speaker ID is the file-name prefix before the first underscore
    plt.title('FFT of speaker ' + os.path.basename(filename).split('_')[0])
    plt.plot(xf, vals)
    plt.xlabel('Frequency')
    plt.grid()
    plt.show()

In [None]:
filenames = ['on/004ae714_nohash_0.wav', 'on/0137b3f4_nohash_0.wav']

print('Speaker ' + os.path.basename(filenames[0]).split('_')[0])
# Female Speaker
ipd.Audio(join(train_audio_path, filenames[0]))

In [None]:
print('Speaker ' + os.path.basename(filenames[1]).split('_')[0])
# Male Speaker
ipd.Audio(join(train_audio_path, filenames[1]))

In [None]:
filename = '/yes/01bb6a2a_nohash_1.wav'
sample_rate, samples = wavfile.read(str(train_audio_path) + filename)
freqs, times, spectrogram = log_specgram(samples, sample_rate)

plt.figure(figsize=(10, 7))
plt.title('Spectrogram of ' + filename)
plt.ylabel('Freqs')
plt.xlabel('Time')
plt.imshow(spectrogram.T, aspect='auto', origin='lower', 
           extent=[times.min(), times.max(), freqs.min(), freqs.max()])
plt.yticks(freqs[::16])
plt.xticks(times[::16])
plt.show()

Recordings length

In [None]:
os.listdir(join(train_audio_path, direct))

In [None]:
!ls -la

In [None]:
print(train_audio_path)
print(direct)

Check whether all the files have a duration of 1 second:

In [None]:
os.listdir(train_audio_path)

In [None]:
num_of_shorter = 0
for direct in dirs:
    waves = [f for f in os.listdir(join(train_audio_path, direct)) if f.endswith('.wav')]
    for wav in waves:
        sample_rate, samples = wavfile.read(train_audio_path + '/' + direct + '/' + wav)
        # A recording is shorter than 1 second if it has fewer samples than the sample rate
        if samples.shape[0] < sample_rate:
            num_of_shorter += 1

print('Number of recordings shorter than 1 second: ' + str(num_of_shorter))
# example file :'/kaggle/working/train/train/audio_/background_noise_/doing_the_dishes.wav'

###  Mean spectrograms and FFT

In [None]:
to_keep = 'yes no up down left right on off stop go'.split()
dirs = [d for d in dirs if d in to_keep]

print(dirs)

for direct in dirs:
    vals_all = []
    spec_all = []

    waves = [f for f in os.listdir(join(train_audio_path, direct)) if f.endswith('.wav')]
    for wav in waves:
        sample_rate, samples = wavfile.read(train_audio_path +'/' + direct + '/' + wav)
        if samples.shape[0] != 16000:
            continue
        xf, vals = custom_fft(samples, 16000)
        vals_all.append(vals)
        freqs, times, spec = log_specgram(samples, 16000)
        spec_all.append(spec)

    plt.figure(figsize=(14, 4))
    plt.subplot(121)
    plt.title('Mean fft of ' + direct)
    plt.plot(np.mean(np.array(vals_all), axis=0))
    plt.grid()
    plt.subplot(122)
    plt.title('Mean specgram of ' + direct)
    plt.imshow(np.mean(np.array(spec_all), axis=0).T, aspect='auto', origin='lower', 
               extent=[times.min(), times.max(), freqs.min(), freqs.max()])
    plt.yticks(freqs[::16])
    plt.xticks(times[::16])
    plt.show()

**Gaussian mixture modelling**

The Kaldi library can model words (or smaller parts of words) with Gaussian mixture models (GMMs) and model temporal dependencies with hidden Markov models (HMMs).

In [None]:
def violinplot_frequency(dirs, freq_ind):
    """ Plot violinplots for given words (waves in dirs) and frequency freq_ind
    from all frequencies freqs."""

    spec_all = []  # Contain spectrograms
    ind = 0
    # Use only the first 8 words to keep the plots readable.
    for direct in dirs[:8]:
        spec_all.append([])

        waves = [f for f in os.listdir(join(train_audio_path, direct)) if
                 f.endswith('.wav')]
        for wav in waves[:100]:
            sample_rate, samples = wavfile.read(
                train_audio_path + '/' + direct + '/' + wav)
            freqs, times, spec = log_specgram(samples, sample_rate)
            spec_all[ind].extend(spec[:, freq_ind])
        ind += 1

    # Different lengths = different num of frames. Make number equal
    minimum = min([len(spec) for spec in spec_all])
    spec_all = np.array([spec[:minimum] for spec in spec_all])

    plt.figure(figsize=(13,7))
    plt.title('Frequency ' + str(freqs[freq_ind]) + ' Hz')
    plt.ylabel('Amount of frequency in a word')
    plt.xlabel('Words')
    sns.violinplot(data=pd.DataFrame(spec_all.T, columns=dirs[:8]))
    plt.show()

In [None]:
violinplot_frequency(dirs, 20)

## 2. Voice Activity Detection (VAD)

### Use the webrtcvad library to identify segments as speech or not

In [None]:
!pip install webrtcvad

In [None]:
import webrtcvad

#### Read the samples and sample rate again with `wavfile.read` to make them compatible with the webrtcvad library. (It accepts sample rates of 8000, 16000, 32000 or 48000 Hz, but we had sample_rate = 22050 with librosa.)

In [None]:
sample_rate, samples = wavfile.read(str(train_audio_path) + filename)

In [None]:
vad = webrtcvad.Vad()
# set aggressiveness from 0 to 3
vad.set_mode(3)

Convert the samples to the raw 16-bit mono PCM byte stream that webrtcvad expects.

In [None]:
import struct
raw_samples = struct.pack("%dh" % len(samples), *samples)
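
If the samples are floating-point values in the range -1 to 1 (as returned by librosa) rather than 16-bit integers, they need to be scaled before packing. The sketch below shows one way to do this with NumPy; the helper name `float_to_pcm16` and the scaling by 32767 are assumptions for illustration. For an array that is already `int16`, `samples.astype('<i2').tobytes()` produces the byte stream directly.

```python
import numpy as np

def float_to_pcm16(samples):
    """Convert float samples in [-1, 1] to little-endian 16-bit PCM bytes."""
    clipped = np.clip(np.asarray(samples, dtype=np.float64), -1.0, 1.0)
    # Scale to the int16 range; astype truncates towards zero
    return (clipped * 32767.0).astype('<i2').tobytes()

float_samples = np.array([0.0, 0.5, -0.5, 1.0, -1.0])
raw = float_to_pcm16(float_samples)

# Two bytes per sample
print(len(raw))
# Decode again to inspect the integer values
print(np.frombuffer(raw, dtype='<i2'))
```

Clipping before scaling avoids integer wrap-around for samples that slightly exceed the nominal range.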

Run the detector on windows of 30 ms (from https://github.com/wiseman/py-webrtcvad/blob/master/example.py).

In [None]:
window_duration = 0.03 # duration in seconds
samples_per_window = int(window_duration * sample_rate + 0.5)
bytes_per_sample = 2
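
webrtcvad only accepts frames that are exactly 10, 20 or 30 ms long, so a trailing partial window can be a problem for recordings whose length is not a multiple of the window size. The generator below is a sketch of one way to yield only complete frames; the name `iter_frames` and the choice to drop the remainder (rather than pad it) are assumptions. It does not call webrtcvad itself.

```python
def iter_frames(raw_bytes, sample_rate, frame_ms=30, bytes_per_sample=2):
    """Yield (start_sample, frame_bytes) for every complete frame."""
    if frame_ms not in (10, 20, 30):
        raise ValueError('frame_ms must be 10, 20 or 30')
    frame_bytes = int(sample_rate * frame_ms / 1000) * bytes_per_sample
    # Stop before any trailing partial frame
    for offset in range(0, len(raw_bytes) - frame_bytes + 1, frame_bytes):
        yield offset // bytes_per_sample, raw_bytes[offset:offset + frame_bytes]

# One second of 16-bit silence at 16 kHz as stand-in data
raw = bytes(2 * 16000)
frames = list(iter_frames(raw, 16000))

print(len(frames), len(frames[0][1]))
```

Each yielded frame could then be passed to `vad.is_speech(frame, sample_rate)` together with its start position.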

Detect speech segments in the audio

In [None]:
segments = []

for start in np.arange(0, len(samples), samples_per_window):
    stop = min(start + samples_per_window, len(samples))
    
    is_speech = vad.is_speech(raw_samples[start * bytes_per_sample: stop * bytes_per_sample], 
                              sample_rate = sample_rate)

    segments.append(dict(
       start = start,
       stop = stop,
       is_speech = is_speech))

Plot the segments identified as speech

In [None]:
plt.figure(figsize = (10,7))
plt.plot(samples)

ymax = max(samples)


for segment in segments:
    if segment['is_speech']:
        plt.plot([ segment['start'], segment['stop'] - 1], [ymax * 1.1, ymax * 1.1], color = 'orange')

plt.xlabel('sample')
plt.grid()

Listen to the speech-only segments

In [None]:
speech_samples = np.concatenate([ samples[segment['start']:segment['stop']] for segment in segments if segment['is_speech']])

ipd.Audio(speech_samples, rate=sample_rate)

#### So far we have processed a single recording of one word: <span style="color : blue;">YES</span>.

#### Now it's time to get an overall view of the other words, so let's visualize their frequency components as well.

# 3. Anomaly detection

Lower the dimensionality of the dataset and interactively check for anomalies. We'll use PCA for dimensionality reduction:

In [None]:
fft_all = []
names = []
for direct in dirs:
    waves = [f for f in os.listdir(join(train_audio_path, direct)) if f.endswith('.wav')]
    for wav in waves:
        sample_rate, samples = wavfile.read(train_audio_path+ '/' + direct + '/' + wav)
        if samples.shape[0] != sample_rate:
            samples = np.append(samples, np.zeros((sample_rate - samples.shape[0], )))
        x, val = custom_fft(samples, sample_rate)
        fft_all.append(val)
        names.append(direct + '/' + wav)

fft_all = np.array(fft_all)

# Normalization
fft_all = (fft_all - np.mean(fft_all, axis=0)) / np.std(fft_all, axis=0)

In [None]:
# Dimensionality reduction
pca = PCA(n_components=3)
fft_all = pca.fit_transform(fft_all)

def interactive_3d_plot(data, names):
    scatt = go.Scatter3d(x=data[:, 0], y=data[:, 1], z=data[:, 2], mode='markers', text=names)
    layout = go.Layout(title="Anomaly detection")
    # Pass the traces as a plain list; go.Data is deprecated in recent plotly versions
    figure = go.Figure(data=[scatt], layout=layout)
    py.iplot(figure)
    
interactive_3d_plot(fft_all, names)
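
Besides inspecting the 3D plot by eye, candidate anomalies can be ranked numerically, for example by their distance from the centre of the projected data. The sketch below runs on synthetic data with one planted outlier. It implements PCA with a plain SVD instead of scikit-learn, and the helper names, the clip names and the distance-based ranking are assumptions for illustration, not the method used above.

```python
import numpy as np

def pca_project(data, n_components=3):
    """Project rows of `data` onto the top principal components (via SVD)."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def rank_outliers(projected, names, top=3):
    """Return the `top` points farthest from the centroid of the projection."""
    distances = np.linalg.norm(projected - projected.mean(axis=0), axis=1)
    order = np.argsort(distances)[::-1][:top]
    return [(names[i], float(distances[i])) for i in order]

rng = np.random.default_rng(0)
features = rng.standard_normal((200, 50))   # stand-in for the normalized FFT rows
features[17] += 25.0                        # plant one obvious outlier
names = ['clip_%03d' % i for i in range(len(features))]  # hypothetical names

projected = pca_project(features)
for name, distance in rank_outliers(projected, names):
    print(name, round(distance, 2))
```

The top-ranked names are good candidates to listen to first, in the same way as the recordings picked from the plot.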

Some anomalies are listed below

In [None]:
print('Recording go/0487ba9b_nohash_0.wav')
ipd.Audio(join(train_audio_path, 'go/0487ba9b_nohash_0.wav'))

In [None]:
print('Recording yes/e4b02540_nohash_0.wav')
ipd.Audio(join(train_audio_path, 'yes/e4b02540_nohash_0.wav'))

In [None]:
print('Recording seven/b1114e4f_nohash_0.wav')
ipd.Audio(join(train_audio_path, 'seven/b1114e4f_nohash_0.wav'))

# 4. Frequency components across the words

### Plotting the first 8 words only, to avoid cramped plots.

In [None]:
violinplot_frequency(dirs, 20)

In [None]:
violinplot_frequency(dirs, 50)

In [None]:
violinplot_frequency(dirs, 120)

# 5. Testing with WebRTC input and Recorded Wav files 

# 6. trigger word detection(transformer)

Copyright 2019 The TensorFlow Authors.
        #@title Licensed under the Apache License, Version 2.0 (the "License");
        # you may not use this file except in compliance with the License.
        # You may obtain a copy of the License at
        #
        # https://www.apache.org/licenses/LICENSE-2.0
        #
        # Unless required by applicable law or agreed to in writing, software
        # distributed under the License is distributed on an "AS IS" BASIS,
        # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        # See the License for the specific language governing permissions and
        # limitations under the License.

References for this work

- EABDSMD- https://www.kaggle.com/samadi10/trigger-word-detection-transformer 

--------------------------------------------------
**Reading material**

* Encoder-decoder: https://arxiv.org/abs/1508.01211
* RNNs with CTC loss: https://arxiv.org/abs/1412.5567
* For me, the first two (encoder-decoder and RNNs with CTC loss) are a sensible choice for this competition, especially if you do not have a background in the speech recognition field. They try to be end-to-end solutions. Speech recognition is a really big topic, and it would be hard to learn the important things in a short time.
* Classic speech recognition : http://www.ece.ucsb.edu/Faculty/Rabiner/ece259/Reprints/tutorial%20on%20hmm%20and%20applications.pdf

* Kaldi Tutorial for dummies, with a problem similar to this competition in some way.

* Very deep CNN - Large Vocabulary Continuous Speech Recognition Systems (LVCSR). 
