In [None]:
# !pip install librosa
# !pip install ipywidgets

This notebook covers exploratory data analysis as well as the different features that can be extracted with libraries like [librosa](https://librosa.org/doc/latest/index.html). There have already been several EDA notebooks for this challenge, but since this is my first competition based on audio recognition, I took the opportunity to start from scratch, and it has been a great learning experience so far. Hope you'll like it.

In [None]:
# Necessary imports
%matplotlib notebook

import os
import librosa
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import ipywidgets as widgets
import seaborn as sns
from sklearn import preprocessing

# plotting style
sns.set_style("whitegrid")
sns.set_palette("muted")

# format plot options
plt.rcParams["xtick.labelsize"] = 14
plt.rcParams["ytick.labelsize"] = 14
plt.rcParams["patch.force_edgecolor"] = True

## A little background
A wave, in general, can be thought of as some kind of disturbance propagating in space and time. For light waves, these disturbances are electric and magnetic fields that change with position as well as time. Similarly, for sound waves it is the air pressure that changes with time and position.

Mathematically, the simplest such wave (*at some fixed observation point*) can be represented as a function of time as $$\sin(2 \pi f t)$$ where $f$ is the frequency of the wave.

## What does adding two sine waves do?
One property of these waves is that they obey the principle of superposition: two or more sine waves can simply be added together. If the waves have the same frequency, the sum is again a sine wave of that frequency. If their frequencies differ, the sum is a more complicated wave that is no longer a single sine wave; it carries all of the individual frequencies at once. This is precisely what happens when several instruments play together: each instrument contributes its own frequencies, and the resulting sound wave is their sum.

Each of the audio files in this challenge contains sound waves that are a sum of many sine waves with different frequencies.
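
To see numerically what adding two sine waves does, here is a minimal sketch using only NumPy. The frequencies, amplitudes and the helper name `sine` are illustrative choices, not part of the competition data.

```python
import numpy as np

# Time axis: 1 second sampled at 1000 points (illustrative values).
t = np.linspace(0.0, 1.0, 1000, endpoint=False)

def sine(freq_hz, amplitude=1.0):
    """Return amplitude * sin(2*pi*f*t) on the shared time axis."""
    return amplitude * np.sin(2 * np.pi * freq_hz * t)

# Same frequency: the sum is again a sine of that frequency (amplitudes add).
same_freq_sum = sine(5, 1.0) + sine(5, 2.0)

# Different frequencies: the sum is a product of a fast and a slow oscillation,
# sin(a) + sin(b) = 2 sin((a + b) / 2) cos((a - b) / 2), i.e. a "beat" pattern.
f1, f2 = 5, 7
mixed_sum = sine(f1) + sine(f2)
beat_form = 2 * np.sin(np.pi * (f1 + f2) * t) * np.cos(np.pi * (f1 - f2) * t)

print(np.allclose(same_freq_sum, sine(5, 3.0)))
print(np.allclose(mixed_sum, beat_form))
```

Both checks hold, so the mixed signal is not a single sine wave but still contains the two original frequencies.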

In [None]:
# help from: https://kapernikov.com/ipywidgets-with-matplotlib/
# output = widgets.Output()

fig, ax = plt.subplots(3, 1, figsize=(6, 10), constrained_layout=True)

# generate X-values
x = np.linspace(0, 2 * np.pi, 100)

def sin_fn1(x, w1, a1):
    return a1 * np.sin(w1 * x)

def sin_fn2(x, w2, a2):
    return a2 * np.sin(w2 * x)

def update_fn1(w1=1.0, a1=1.0):
    for line in list(ax[0].lines):  # copy first: removing while iterating can skip lines
        line.remove()
    ax[0].set_ylim([-4, 4])
    ax[0].plot(x, (sin_fn1(x, w1, a1)), color='royalblue')
    ax[0].set_title(r'$a_{1} \sin(w_{1} x)$')

def update_fn2(w2=1.0, a2=1.0):
    for line in list(ax[1].lines):  # copy first: removing while iterating can skip lines
        line.remove()
    ax[1].set_ylim([-4, 4])
    ax[1].plot(x, sin_fn2(x, w2, a2), color='crimson')
    ax[1].set_title(r'$a_{2} \sin(w_{2} x)$')

@widgets.interact(w1=(0, 10, 1), w2=(0, 10, 1), a1=(0, 5, 1), a2=(0, 5, 1))
def update(w1=1.0, w2=1.0, a1=1.0, a2=1.0):
    for line in list(ax[2].lines):  # copy first: removing while iterating can skip lines
        line.remove()
    update_fn1(w1, a1)
    update_fn2(w2, a2)
    ax[2].set_ylim([-6, 6])
    ax[2].plot(x, (sin_fn1(x, w1, a1) + sin_fn2(x, w2, a2)), color='forestgreen')
    ax[2].set_title(r'$a_{1} \sin(w_{1} x) + a_{2} \sin (w_{2} x)$')

_Unfortunately, the rendered notebook does not show the ipywidgets interactive plot, so I had to hide the code as well as the plot._

In [None]:
# define various paths
root_dir = '../input/rfcx-species-audio-detection'
train_audio = os.path.join(root_dir, 'train')
test_audio = os.path.join(root_dir, 'test')

train_tp = os.path.join(root_dir, 'train_tp.csv')
train_fp = os.path.join(root_dir, 'train_fp.csv')

## Some EDA

In [None]:
# train true positive dataset
tp_df = pd.read_csv(train_tp)
tp_df.head()

In [None]:
tp_df.info()

In [None]:
%matplotlib inline
# explore the target column, i.e. species_id
fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(12, 10))
sns.countplot(ax=ax[0], x='species_id', data=tp_df, alpha=0.6, color='navy')
ax[0].set_title('Distribution of Species for true positive data', fontsize=15)

sns.countplot(ax=ax[1], x='species_id', hue='songtype_id', data=tp_df, alpha=0.7)
ax[1].set_title('Distribution of Species for true positive data w.r.t songtype', fontsize=15)
fig.tight_layout()
plt.show()

As evident from the above plots, species are almost uniformly distributed in the true positive dataset. Also, other than the species with `species_id` 16, 17, and 23, all the species have just one `songtype`, which is the type of sound produced by a given species.
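
The same two observations can also be checked numerically rather than read off the plots. The sketch below uses a small hypothetical frame, `toy_df`, as a stand-in for `tp_df`; its values are invented for illustration and are not taken from `train_tp.csv`.

```python
import pandas as pd

# Hypothetical toy stand-in for train_tp.csv with the two relevant columns.
toy_df = pd.DataFrame({
    "species_id":  [0, 0, 1, 1, 16, 16, 17, 17],
    "songtype_id": [1, 1, 1, 1, 1, 4, 1, 4],
})

# Number of annotations per species (what the first countplot shows).
counts = toy_df["species_id"].value_counts().sort_index()

# Number of distinct song types per species (what the second plot shows).
songtypes_per_species = toy_df.groupby("species_id")["songtype_id"].nunique()
multi_songtype = songtypes_per_species[songtypes_per_species > 1].index.tolist()

print(counts.to_dict())
print(multi_songtype)
```

On the real data, replacing `toy_df` with `tp_df` should list the species that have more than one song type.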

In [None]:
# train false positive dataset
fp_df = pd.read_csv(train_fp)
fp_df.head()

In [None]:
fp_df.info()

In [None]:
# explore the target column, i.e. species_id
fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(12, 10))
sns.countplot(ax=ax[0], x='species_id', data=fp_df, alpha=0.6, color='navy')
ax[0].set_title('Distribution of Species for false positive data', fontsize=15)

sns.countplot(ax=ax[1], x='species_id', hue='songtype_id', data=fp_df, alpha=0.7)
ax[1].set_title('Distribution of Species for false positive data w.r.t songtype', fontsize=15)
fig.tight_layout()
plt.show()

Although the labels in the false positive dataset are verified by experts to _not_ contain the [flagged species/songtype](https://www.kaggle.com/c/rfcx-species-audio-detection/discussion/197866), the overall trend stays more or less the same as in the true positive dataset.

In [None]:
# playing audio
from IPython.display import Audio

In [None]:
# select some random samples from train set
def get_random_samples(data_path, num_samples=1):
    data_list = os.listdir(data_path)
    # replace=False avoids picking the same file twice when num_samples > 1
    indices = np.random.choice(len(data_list), num_samples, replace=False)
    sample_audio = [os.path.join(data_path, data_list[idx]) for idx in indices]
    
    return sample_audio

In [None]:
# get random train samples
audio_samples_train = get_random_samples(train_audio)

In [None]:
Audio(audio_samples_train[0])

In [None]:
from librosa import display
train_amp, train_sr = librosa.load(audio_samples_train[0])
record_id = os.path.basename(audio_samples_train[0]).split('.')[0]

print(f'Recording ID: {record_id}')
print(f'Total number of samples in the recording: {len(train_amp)}')
print(f'Sampling rate of the recording: {train_sr}')

print('Checking the data corresponding to the recording ID')
record_df = tp_df[tp_df["recording_id"] == record_id]

if record_df.empty:
    record_df = fp_df[fp_df["recording_id"] == record_id]
print(record_df)

print('Now plotting the recording waveform...')
plt.figure(figsize=(14, 6))
# waveplot was removed in librosa 0.10; waveshow (librosa >= 0.9) is its replacement
librosa.display.waveshow(y=train_amp, sr=train_sr, color='navy', alpha=0.5)
plt.xlabel('Time (seconds)-->')
plt.ylabel('<-- Amplitude -->')
plt.show()

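`train_amp` is just an array of amplitudes whose length gives the total number of samples, whereas `train_sr` is the sampling rate. Note that `librosa.load` resamples the audio to 22,050 Hz by default; pass `sr=None` to keep the file's native sampling rate.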

__Sampling rate__ or __Sampling frequency__ is the rate at which we capture amplitudes. In other words, it is just the number of data points recorded per second. These amplitudes can be electric or magnetic fields in the case of light waves, current or voltage in the case of electrical signals, and pressure or displacement values in the case of sound waves.
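
As a quick illustration of how the sampling rate ties the number of samples to the duration, here is a small NumPy sketch. The 440 Hz tone and the 2-second length are arbitrary illustrative values.

```python
import numpy as np

sr = 22050          # samples per second (librosa.load's default target rate)
duration_s = 2.0    # illustrative clip length in seconds

# Sample a 440 Hz tone: one amplitude value every 1/sr seconds.
n_samples = int(sr * duration_s)
times = np.arange(n_samples) / sr
amplitudes = np.sin(2 * np.pi * 440 * times)

# Going the other way: the array length and sr give back the duration.
recovered_duration = len(amplitudes) / sr
print(n_samples, recovered_duration)
```

The same relation, `len(train_amp) / train_sr`, gives the length of a loaded recording in seconds.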

The above plot gives the time-domain representation of the signal. Higher amplitude corresponds to a louder signal, and stretches where the amplitude stays close to zero correspond to silence. Other than showing how the loudness of the audio varies with time, the time-domain view does not convey much other information about the signal.

We can instead try to understand the signal in the frequency domain, which reveals much more information. The reason is that the sound we hear results from the superposition of many different waves with different frequencies.

This is our motivation for working in the frequency domain, since there we can decompose the wave into its constituent frequencies. The mathematical tool used for this is called the __Fourier Transform__. Since our signal consists of discrete samples, we use its discrete counterpart, the Discrete Fourier Transform, which is computed efficiently by the Fast Fourier Transform (FFT) algorithm.
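
A minimal sketch of this decomposition, using NumPy's FFT on a synthetic signal rather than a competition recording. The two component frequencies (50 Hz and 120 Hz) and the sampling rate are illustrative assumptions.

```python
import numpy as np

sr = 1000                                   # samples per second (illustrative)
t = np.arange(sr) / sr                      # exactly 1 second of samples
signal = 1.0 * np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

# Real-input FFT: magnitudes for frequencies 0 .. sr/2.
magnitudes = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

# The two largest peaks sit at the frequencies of the two components.
top_two = np.sort(freqs[np.argsort(magnitudes)[-2:]])
print(top_two)
```

The time-domain samples look like one messy wave, but the spectrum separates it back into the two sine waves it was built from.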

Librosa has a method called `stft` (Short-Time Fourier Transform), which applies the Fourier transform to short, overlapping windows of the signal. Using `stft` we can generate a plot that shows the signal in frequency-time space; this plot is commonly referred to as a __Spectrogram__.
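
To make the idea of a short-time transform concrete, here is a simplified NumPy sketch. It is not librosa's implementation: `simple_stft` is a hypothetical helper that skips the centring and padding `librosa.stft` applies by default, and the window and hop sizes are illustrative.

```python
import numpy as np

def simple_stft(y, n_fft=256, hop_length=64):
    """Magnitude STFT: FFT of Hann-windowed, overlapping frames (no padding)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop_length
    frames = np.stack([y[i * hop_length: i * hop_length + n_fft] * window
                       for i in range(n_frames)], axis=1)
    return np.abs(np.fft.rfft(frames, axis=0))   # shape: (1 + n_fft // 2, n_frames)

sr = 8000
t = np.arange(sr) / sr                           # 1 second
# First half: 500 Hz tone, second half: 2000 Hz tone.
y = np.where(t < 0.5, np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 2000 * t))

spec = simple_stft(y)
freqs = np.fft.rfftfreq(256, d=1 / sr)
peak_hz = freqs[spec.argmax(axis=0)]             # loudest frequency per frame
print(spec.shape, peak_hz[0], peak_hz[-1])
```

Each column of `spec` is the spectrum of one short window, so the loudest frequency moves from the first tone to the second as time advances. A single FFT over the whole signal would show both tones but not when each occurs.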

## Spectrograms

Using Librosa, we can easily create spectrograms using the Short-Time Fourier Transform. The following two plots show spectrograms for a train audio sample on both the Hertz (linear) and the log frequency scale.
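
Before plotting, the STFT magnitudes are usually converted to decibels. The sketch below shows the idea behind such a conversion; `amplitude_to_db_sketch` is a hypothetical helper, and its default values are assumptions chosen to resemble librosa's documented defaults rather than a copy of its code.

```python
import numpy as np

def amplitude_to_db_sketch(magnitude, ref=1.0, amin=1e-5, top_db=80.0):
    """20 * log10(magnitude / ref), floored at amin and clipped to top_db below the peak."""
    magnitude = np.maximum(amin, np.asarray(magnitude, dtype=float))
    db = 20.0 * np.log10(magnitude / ref)
    return np.maximum(db, db.max() - top_db)

amps = np.array([1.0, 0.1, 0.01, 0.0])
db_values = amplitude_to_db_sketch(amps)
print(db_values)
```

Every factor of 10 in amplitude becomes a step of 20 dB, and very small values are clipped so that silence does not dominate the colour range of the plot.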

In [None]:
train_spec = librosa.stft(train_amp)
print(train_spec.shape)

# convert amplitude into decibel scale
train_db = librosa.amplitude_to_db(abs(train_spec))
print(train_db.shape)

fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(14, 12), sharex=True)
img = librosa.display.specshow(train_db, sr=train_sr, x_axis='time', 
                         y_axis='hz', ax=ax[0], cmap='twilight')
ax[0].set(title='Linear-frequency power spectrogram')
ax[0].label_outer()

librosa.display.specshow(train_db, sr=train_sr, x_axis='time', 
                         y_axis='log', ax=ax[1], cmap='twilight')
ax[1].set(title='Log-frequency power spectrogram')
ax[1].label_outer()
fig.colorbar(img, ax=ax, format='%+2.f dB')
plt.show()

Clearly, the `log` frequency scale shows much more detail than the linear scale. Moreover, to human ears the perceptual distance between 300 Hz and 500 Hz does not seem equal to the distance between 11400 Hz and 11600 Hz, even though the difference in Hz is the same. So it is quite common to describe the spectrogram on the __mel__ scale, which is similar to the `log` scale and is defined as
$$
f_{m} = 2595\: \mathrm{log}_{10}\left( 1 + \frac{f}{700}\right)
$$
The mel scale is just the result of the above non-linear transformation. It is constructed so that frequencies that are equally spaced in mels are also perceived by humans as equally spaced. This is needed because the human ear does not perceive frequencies on a linear scale.
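
The formula above is easy to evaluate directly. The following sketch implements it and its inverse with NumPy and compares the two 200 Hz gaps mentioned earlier. Note that librosa's own `hz_to_mel` uses a slightly different variant by default and only follows this formula when called with `htk=True`.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz, using the formula above."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

low_gap = hz_to_mel(500) - hz_to_mel(300)        # 200 Hz apart, low in the spectrum
high_gap = hz_to_mel(11600) - hz_to_mel(11400)   # 200 Hz apart, high in the spectrum
print(round(float(low_gap), 1), round(float(high_gap), 1))
```

The low-frequency gap comes out more than ten times larger in mels than the high-frequency one, which matches the perceptual argument.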

## Mel-Spectrograms

In [None]:
plt.figure(figsize=(14, 6))
librosa.display.specshow(train_db, sr=train_sr, x_axis='time', 
                         y_axis='mel', cmap='twilight')
plt.colorbar(format='%+2.0f dB')
plt.show()

## Spectral Centroids
To quote Wikipedia, the spectral centroid gives us the "impression of brightness of sound". Just like in physics, where the center of mass gives the point at which the whole mass of a body can be thought to be concentrated, the spectral centroid can be visualized as the center of mass of the frequencies in a sound spectrum (typically a spectrogram). Mathematically, it is the weighted mean of the frequencies present in the signal, with their magnitudes as the weights.
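
The weighted mean can be written in a few lines of NumPy. This is a sketch on a tiny hand-made spectrogram; `spectral_centroid_sketch` is a hypothetical helper, not the librosa function.

```python
import numpy as np

def spectral_centroid_sketch(magnitudes, freqs):
    """Magnitude-weighted mean frequency of each frame (column)."""
    weights = magnitudes / np.maximum(magnitudes.sum(axis=0, keepdims=True), 1e-10)
    return (freqs[:, None] * weights).sum(axis=0)

freqs = np.array([0.0, 1000.0, 2000.0, 3000.0])      # bin centre frequencies in Hz
magnitudes = np.array([[0.0, 0.0],
                       [1.0, 1.0],
                       [0.0, 1.0],
                       [0.0, 2.0]])                    # two frames (columns)
centroids = spectral_centroid_sketch(magnitudes, freqs)
print(centroids)
```

The first frame holds a single tone, so its centroid is that tone's frequency. The second frame has more energy in the upper bins, which pulls the centroid upwards.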

In [None]:
# recent librosa versions require the signal to be passed by keyword (y=...)
spectral_centroids = librosa.feature.spectral_centroid(y=train_amp, sr=train_sr)[0]
print(spectral_centroids.shape)

# extract the time and frame indices
frames = range(len(spectral_centroids))
t = librosa.frames_to_time(frames, sr=train_sr)


plt.figure(figsize=(14, 6))
librosa.display.specshow(train_db, sr=train_sr, x_axis='time', 
                         y_axis='log', cmap='twilight')

plt.plot(t, spectral_centroids, color='yellow', 
         label='Spectral Centroid')
plt.title("Spectral Centroids for Train sample")
plt.colorbar(format='%+2.0f dB')
plt.legend(loc='upper right', fontsize=16, facecolor='gray')
plt.show()

## Spectral Roll-off
Next we calculate the spectral roll-off. As described in the Librosa documentation, for each stft frame (2584 in the above spectrogram) the roll-off frequency is the centre frequency of the spectrogram bin such that at least a given percentage of the frame's spectral energy (called `roll_percent`, 0.85 by default) is contained in this bin and the bins below it. In simpler words, the roll-off frequency is the frequency under which some percentage (cutoff) of the total energy of the spectrum is contained. Setting `roll_percent` close to 1 (or 0) can be used to approximate the maximum (or minimum) frequency.
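
The definition can be sketched with a cumulative sum over the frequency bins of each frame. `spectral_rolloff_sketch` below is a hypothetical NumPy helper working on a hand-made single frame, meant only to illustrate the idea.

```python
import numpy as np

def spectral_rolloff_sketch(magnitudes, freqs, roll_percent=0.85):
    """Lowest frequency below which roll_percent of each frame's energy lies."""
    cumulative = np.cumsum(magnitudes, axis=0)
    threshold = roll_percent * cumulative[-1]            # per-frame total * percent
    first_bin = np.argmax(cumulative >= threshold, axis=0)
    return freqs[first_bin]

freqs = np.array([0.0, 1000.0, 2000.0, 3000.0, 4000.0])
frame = np.array([[1.0], [1.0], [1.0], [1.0], [6.0]])     # one frame, total = 10
rolloff_high = spectral_rolloff_sketch(frame, freqs, roll_percent=0.85)
rolloff_low = spectral_rolloff_sketch(frame, freqs, roll_percent=0.25)
print(rolloff_high, rolloff_low)
```

With most of the energy in the top bin, a high `roll_percent` lands on the highest frequency, while a low `roll_percent` stops at a much lower bin.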

In [None]:
plt.figure(figsize=(14, 6))
# recent librosa versions require the signal to be passed by keyword (y=...)
spectral_rolloff_full = librosa.feature.spectral_rolloff(y=train_amp, sr=train_sr,
                                                         roll_percent=0.95)[0]
spectral_rolloff_empty = librosa.feature.spectral_rolloff(y=train_amp, sr=train_sr,
                                                          roll_percent=0.01)[0]
librosa.display.specshow(train_db, sr=train_sr, x_axis='time', 
                         y_axis='log', cmap='twilight')
plt.plot(t, spectral_rolloff_full, color='white', 
         label='Roll-off frequency (0.95)')
plt.plot(t, spectral_rolloff_empty, color='yellow', 
         label='Roll-off frequency (0.01)')
plt.title("Spectral Roll-off for Train sample")
plt.colorbar(format='%+2.0f dB')
plt.legend(loc='lower right', fontsize=16, facecolor='gray')
plt.show()

## MFCC (Mel-Frequency Cepstral Coefficients)

MFCCs of a signal are a small set of features (usually 15-35) that describe the overall shape of the spectral envelope for each time frame. Think of each time frame's mel-scaled spectrum as a histogram showing the distribution of frequencies; the MFCCs are a compact summary of the shape of that histogram, obtained by taking a discrete cosine transform of its logarithm. It is explained in more detail [here](http://www.speech.cs.cmu.edu/15-492/slides/03_mfcc.pdf).
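
The last step of the MFCC computation, a discrete cosine transform of the log-mel energies, can be sketched in plain NumPy. The two input frames are hypothetical values invented for illustration, and `dct_ii` is a hand-written helper, not a library call.

```python
import numpy as np

def dct_ii(x, n_coeffs):
    """Orthonormal DCT-II of a 1-D array, keeping the first n_coeffs terms."""
    n = len(x)
    k = np.arange(n_coeffs)[:, None]                 # coefficient index
    i = np.arange(n)[None, :]                        # input (mel band) index
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    scale = np.full(n_coeffs, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return scale * (basis @ x)

# Hypothetical log-mel energies (in dB) for one frame with 8 mel bands.
flat_frame = np.full(8, -30.0)
tilted_frame = np.linspace(-10.0, -50.0, 8)

flat_coeffs = dct_ii(flat_frame, 4)
tilted_coeffs = dct_ii(tilted_frame, 4)
print(np.round(flat_coeffs, 3))
print(np.round(tilted_coeffs, 3))
```

A flat envelope puts everything into the first coefficient, while a tilted envelope also produces a clearly non-zero second coefficient. In this sense the low-order coefficients describe the overall level and slope of the spectral envelope.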

In [None]:
mfccs = librosa.feature.mfcc(y=train_amp, sr=train_sr)
print(mfccs.shape)

# display the mfccs
plt.figure(figsize=(14, 6))
librosa.display.specshow(mfccs, sr=train_sr, x_axis='time', 
                         cmap='twilight')
plt.title('MFCCs', fontsize=15)
plt.colorbar()
plt.show()

## Chroma STFT
Chroma features are 12-element feature vectors. Each of the 12 elements corresponds to the energy contained in one pitch class (7 natural notes + 5 sharps). They are mainly used to detect similarity between pieces of music and in ASR.
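
A rough sketch of how frequencies fold into the 12 pitch classes is given below. It assigns each frequency to its nearest equal-tempered note with A4 = 440 Hz, which is a simplifying assumption; real implementations typically spread the energy more smoothly across neighbouring classes. The helper names are hypothetical.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(freq_hz, a4_hz=440.0):
    """Index 0..11 of the nearest equal-tempered pitch class (C = 0)."""
    midi = 69 + 12 * np.log2(freq_hz / a4_hz)        # A4 is MIDI note 69
    return int(np.round(midi)) % 12

def fold_to_chroma(freqs_hz, energies):
    """Sum the energy of each frequency into its pitch class."""
    chroma = np.zeros(12)
    for f, e in zip(freqs_hz, energies):
        chroma[pitch_class(f)] += e
    return chroma

# A4 (440 Hz) and A5 (880 Hz) share a pitch class; C4 is about 261.63 Hz.
chroma = fold_to_chroma([440.0, 880.0, 261.63], [1.0, 0.5, 2.0])
print({NOTE_NAMES[i]: float(v) for i, v in enumerate(chroma) if v > 0})
```

Notes an octave apart end up in the same bin, which is why chroma features describe harmony independently of the octave in which it is played.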

In [None]:
def plot_feature(signal, y_axis='chroma', x_axis='time', 
                 title='Feature space', format=None):
    plt.figure(figsize=(14,6))
    librosa.display.specshow(signal, y_axis=y_axis, 
                             x_axis=x_axis, cmap='twilight')
    plt.colorbar(format=format)
    plt.title(title)
    return plt.show()

In [None]:
# chroma stft
# get the magnitude spectrum
signal = np.abs(librosa.stft(train_amp))
print(signal.shape)

chroma_stft = librosa.feature.chroma_stft(S=signal, sr=train_sr)
print(chroma_stft.shape)

# now plot the feature
plot_feature(chroma_stft, title='Chroma STFT')

## Chroma-CQT
Like the mel scale, the constant-Q transform uses a logarithmic scale for the frequencies.
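
The logarithmic spacing can be illustrated by computing the centre frequencies of the bins. In this sketch the starting frequency of 32.70 Hz (roughly the note C1), the number of bins and the helper name are illustrative assumptions.

```python
import numpy as np

def cqt_center_frequencies(f_min=32.70, n_bins=36, bins_per_octave=12):
    """Geometrically spaced centre frequencies: f_k = f_min * 2 ** (k / bins_per_octave)."""
    k = np.arange(n_bins)
    return f_min * 2.0 ** (k / bins_per_octave)

freqs = cqt_center_frequencies()
ratios = freqs[1:] / freqs[:-1]                # constant ratio between neighbours
print(np.round(freqs[[0, 12, 24]], 2))          # one octave apart each
print(np.round(ratios[:3], 4))
```

Neighbouring bins always differ by the same ratio, so with 12 bins per octave each bin lines up with one semitone. This is what makes the transform convenient for chroma features.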

In [None]:
# chroma cqt
chroma_cqt = librosa.feature.chroma_cqt(y=train_amp, sr=train_sr)
print(chroma_cqt.shape)

# now plot the feature
plot_feature(chroma_cqt, title='Chroma CQT')

## Chroma-CENS
__Chroma Energy Normalized Statistics__ is used to smooth out local deviations in tempo, pitch, etc. by taking statistics over large time windows. It is also used for audio matching and similarity.

In [None]:
# chroma cens
chroma_cens = librosa.feature.chroma_cens(y=train_amp, sr=train_sr)
print(chroma_cens.shape)

# now plot the feature
plot_feature(chroma_cens, title='Chroma CENS')

## Poly-Features
These features are just the polynomial coefficients obtained by fitting a polynomial to the spectral envelope in each time frame of the spectrogram.
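
The fitting step can be sketched with `np.polyfit`, which fits every column of a matrix at once. The two frames below are synthetic envelopes with known coefficients, and `poly_features_sketch` is a hypothetical helper used only for illustration.

```python
import numpy as np

def poly_features_sketch(magnitudes, freqs, order=2):
    """Fit a degree-`order` polynomial in frequency to every frame (column)."""
    # np.polyfit fits all columns at once; the result has shape (order + 1, n_frames),
    # highest-degree coefficient first.
    return np.polyfit(freqs, magnitudes, deg=order)

freqs = np.linspace(0.0, 4.0, 9)                     # illustrative frequency axis (kHz)
frame_a = 1.0 + 0.5 * freqs                           # straight-line envelope
frame_b = 2.0 - 0.25 * freqs ** 2                     # curved envelope
magnitudes = np.stack([frame_a, frame_b], axis=1)     # shape (n_freqs, n_frames)

coeffs = poly_features_sketch(magnitudes, freqs, order=2)
print(coeffs.shape)
print(np.round(coeffs, 3))
```

With `order=2` each frame is summarised by three numbers, which matches the first dimension of `signal_poly` printed by the notebook.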

In [None]:
# poly features
signal_poly = librosa.feature.poly_features(S=signal, sr=train_sr, 
                                            order=2)
print(signal_poly.shape)

## Tonnetz Features
This method of feature extraction projects the chroma features onto a 6-dimensional basis that represents the perfect fifth, the minor third, and the major third, each as a pair of two-dimensional coordinates. Like the spectral centroid, it computes centroids, here the tonal centroids expressed in this 6-dimensional basis.

In [None]:
# tonnetz features
signal_tonnetz = librosa.feature.tonnetz(y=train_amp, sr=train_sr, 
                                         chroma=chroma_cqt)
print(signal_tonnetz.shape)

plot_feature(signal_tonnetz, title='Tonal Centroids', y_axis='tonnetz')

All of these features can be aggregated over the frames of a recording (for example by taking their mean and standard deviation) and used in linear and tree-based models to provide a baseline for more sophisticated methods.
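
One simple way to do this aggregation is to take summary statistics over the frame axis. The sketch below uses random arrays as hypothetical stand-ins for real per-frame features; the choice of mean and standard deviation is an assumption, and other statistics would work as well.

```python
import numpy as np

def summarise_frames(feature_matrix):
    """Collapse a (n_features, n_frames) matrix into per-feature mean and std."""
    return np.concatenate([feature_matrix.mean(axis=1), feature_matrix.std(axis=1)])

# Hypothetical stand-ins for per-frame features such as MFCCs and chroma.
rng = np.random.default_rng(0)
mfcc_like = rng.normal(size=(20, 100))      # 20 coefficients x 100 frames
chroma_like = rng.random(size=(12, 100))    # 12 pitch classes x 100 frames

feature_vector = np.concatenate([summarise_frames(mfcc_like),
                                 summarise_frames(chroma_like)])
print(feature_vector.shape)
```

Each recording then becomes one fixed-length row, regardless of how many frames it has, which is the shape tabular models expect.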

### References:
1. Librosa [documentation](https://librosa.org/doc/main/feature.html)
2. https://musicinformationretrieval.com/index.html
3. https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
4. https://www.kdnuggets.com/2020/02/audio-data-analysis-deep-learning-python-part-1.html