<h1><center>Rainforest Connection Species Audio Detection</center></h1>
<h2><center>Automate the detection of bird and frog species in a tropical soundscape</center></h2>

<center><img src="https://storage.googleapis.com/kaggle-competitions/kaggle/21669/logos/header.png?t=2020-10-28-04-28-01"></center>

# About the Organizers

- **Rainforest Connection (RFCx)** created the world’s first scalable, real-time monitoring system for protecting and studying remote ecosystems. 
- Unlike visual-based tracking systems like drones or satellites, RFCx relies on acoustic sensors that monitor the ecosystem soundscape at selected locations year round. 
- RFCx technology has advanced to support a comprehensive biodiversity monitoring program that allows local partners to measure progress of wildlife restoration and recovery through principles of adaptive management. 
- The RFCx monitoring platform also has the capacity to create *convolutional neural network (CNN)* models for analysis.
- More about them can be found [here](https://www.rfcx.org/)

# 1. Brief Description

- The presence of rainforest species is a good indicator of the impact of climate change and habitat loss. As it's easier to hear these species than see them, it’s important to use acoustic technologies that can work on a global scale. Real-time information, such as provided through machine learning techniques, could enable **early-stage detection of human impacts** on the environment. This result could drive more effective conservation management decisions.
- Traditional methods of assessing the diversity and abundance of species are costly and limited in space and time. And while **automatic acoustic identification via deep learning** has been successful, models require a large number of training samples per species. This limits applicability to ***rarer species***, which are central to conservation efforts. Thus, methods to *automate high-accuracy species detection in noisy soundscapes with limited training data* are the solution.

# 1.1 About the Competition

- In this competition, you’ll automate the detection of bird and frog species in tropical soundscape recordings. You'll create your models with **limited, acoustically complex training data**. 
- <span style="color:red">Rich in more than bird and frog noises, expect to hear an insect or two</span>, which your model will need to filter out.
- The resulting real-time information could enable earlier detection of human environmental impacts, making environmental conservation more swift and effective.

Q: <span style="color:red">Oh boy, who reads all that? These are mentioned on the competition page too. Ohkay :( Let's talk in layman's terms.</span>  
A: <span style="color:orange">We need to detect audio of some number of species(rare, yeah) in the given audio recordings.</span>  
Q: <span style="color:red">What will happen if we will do so?</span>   
A: <span style="color:orange">It will help the organizers in some way to efficiently process conservation proceedings for such rare species.</span>  
Q: <span style="color:red">Well, tell us about the data then.</span>  
> <span style="color:blue">We need to predict the probability of all the species present in each test audio file. Some test audio files contain a single species while others contain multiple. The predictions are to be done at the audio file level, i.e., no start/end timestamps are required.</span>

Q: <span style="color:red">Oh man, why can't you speak English, always quoting organizers, huh?</span>  
A: <span style="color:orange">Well, speaking of that, I got you covered [here](https://www.kaggle.com/c/rfcx-species-audio-detection/discussion/200757). Too lazy to click on the link? I got it covered below :)</span>

# 1.2 Submission Formats

 We have got 24 classes and some 9000 training samples in total.  No, this much data is not sufficient to understand the **Label weighted LRAP (label ranking average precision)** metric. First, have a look at the below data chunk.  

|   | recording_id | species_id | songtype_id | t_min   | f_min     | t_max   | f_max     |
|---|--------------|------------|-------------|---------|-----------|---------|-----------|
| 0 | 00204008d    | 21         | 1           | 13.8400 | 3281.2500 | 14.9333 | 4125.0000 |
| 1 | 00204008d    | 8          | 1           | 24.4960 | 3750.0000 | 28.6187 | 5531.2500 |
| 2 | 00204008d    | 4          | 1           | 15.0027 | 2343.7500 | 6.8587  | 4218.7500 |


What did you see? Well, if it didn't catch your attention, let me explain as per my understanding. For the single `recording_id` ***00204008d***, we have got 3 species ids, i.e. 21, 8, and 4. What does that mean? The recording, i.e. the audio sample with this id, contains audio of species 21, species 8, and species 4. Since the species columns run from `s0` to `s23`, our ground truth for this data sample has ones at indices 4, 8, and 21:
>  [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]. 

But what do we need to predict? We need to predict the probability of the presence of each class in this audio sample. Hence our (hypothetical) predicted vector for this sample would be:
> [0.01, 0.01, 0.01, 0.01, 0.84, 0.01, 0.01, 0.01, 0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.78, 0.01, 0.01]. 

So, yeah, it's a **multilabel classification** problem. 
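
Here's a minimal sketch of how such a multi-hot target could be built from a list of species ids. The helper name `to_multi_hot` is my own (hypothetical), and it assumes species ids map directly to the column indices `s0` ... `s23`.

```python
import numpy as np

NUM_SPECIES = 24  # columns s0 ... s23 in the submission file

def to_multi_hot(species_ids, num_classes=NUM_SPECIES):
    """Return a 0/1 vector with a 1 at the index of every listed species id."""
    target = np.zeros(num_classes, dtype=np.int64)
    target[list(species_ids)] = 1
    return target

target = to_multi_hot([21, 8, 4])
print(target)
print(np.flatnonzero(target))  # indices of the species that are present
```

The model's job is then to output one probability per position of this vector.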

Now quoting the submission format from the competition page:
```
recording_id,s0,...,s23
000316da7,0.1,....,0.3
003bc2cb2,0.0,...,0.8
...
```

And we need to do this for what, 1992 samples? Okay, that being said, let's worry about the evaluation metric.
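
To make the format concrete, here's a small sketch that writes a file in that layout using only the standard library. The `predictions` dictionary and its probabilities are made up for illustration, and an in-memory buffer stands in for the real `submission.csv`.

```python
import csv
import io

NUM_SPECIES = 24
header = ["recording_id"] + [f"s{i}" for i in range(NUM_SPECIES)]

# Hypothetical predictions: recording_id -> 24 probabilities.
predictions = {
    "000316da7": [0.1] * NUM_SPECIES,
    "003bc2cb2": [0.0] * 23 + [0.8],
}

buffer = io.StringIO()  # stand-in for open("submission.csv", "w", newline="")
writer = csv.writer(buffer)
writer.writerow(header)
for recording_id, probs in predictions.items():
    writer.writerow([recording_id] + probs)

submission_text = buffer.getvalue()
print(submission_text.splitlines()[0])
```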

# 1.3 Evaluation Metric

- The competition metric is the label-weighted [label-ranking average precision](https://scikit-learn.org/stable/modules/model_evaluation.html#label-ranking-average-precision), which is a generalization of the mean reciprocal rank measure for the case where there can be multiple true labels per test item.

- The **label-weighted** part means that the overall score is the average over all the labels in the test set, where each label receives equal weight (by contrast, plain LRAP gives each test observation equal weight, thereby discounting the contribution of individual labels when an observation has multiple labels). 
- In other words, each test observation is weighted by the number of ground truth labels found in the observation.
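
A tiny sketch of the difference between the two averaging schemes, using made-up per-recording scores (both lists below are hypothetical): one recording with a single label scored perfectly, and one recording with three labels scored 0.5.

```python
# Hypothetical per-recording LRAP scores and number of true labels in each.
per_sample_lrap = [1.0, 0.5]
labels_per_sample = [1, 3]

# Plain LRAP: every recording counts equally.
plain = sum(per_sample_lrap) / len(per_sample_lrap)

# Label-weighted LRAP: every recording counts in proportion to its label count.
weighted = sum(s * n for s, n in zip(per_sample_lrap, labels_per_sample)) / sum(labels_per_sample)

print(plain, weighted)
```

The recording with three labels pulls the weighted score down more than the plain one, because it contributes three labels instead of one.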

# 1.3.1 More on the Evaluation Metrics

**Label ranking average precision (LRAP)** averages over the samples the answer to the following question: `for each ground truth label, what fraction of higher-ranked labels were true labels?`

- This performance measure will be higher if you are able to give a better rank to the labels associated with each sample. 

> The obtained score is always strictly greater than 0, and the best value is 1. 

- If there is exactly one relevant label per sample, **label ranking average precision is equivalent to the mean reciprocal rank.**

Formally, given a binary indicator matrix of the ground truth labels $y \in \left\{0, 1\right\}^{n_\text{samples} \times n_\text{labels}}$  and the score associated with each label $\hat{f} \in \mathbb{R}^{n_\text{samples} \times n_\text{labels}}$, the average precision is defined as   

$LRAP(y, \hat{f}) = \frac{1}{n_{\text{samples}}}
  \sum_{i=0}^{n_{\text{samples}} - 1} \frac{1}{||y_i||_0}
  \sum_{j:y_{ij} = 1} \frac{|\mathcal{L}_{ij}|}{\text{rank}_{ij}}$

where $\mathcal{L}_{ij} = \left\{k: y_{ik} = 1, \hat{f}_{ik} \geq \hat{f}_{ij} \right\}$, $\text{rank}_{ij} = \left|\left\{k: \hat{f}_{ik} \geq \hat{f}_{ij} \right\}\right|$, $|\cdot|$ computes the cardinality of the set (i.e., the number of elements in the set), and $||\cdot||_0$ is the $\ell_0$ “norm” (which computes the number of nonzero elements in a vector).
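
The formula can be sketched directly in NumPy. This is my own unoptimized reading of it, not the organizers' scoring code: the function name `lrap` and the `label_weighted` flag are hypothetical, and it assumes every row has at least one true label. For real use, scikit-learn's `label_ranking_average_precision_score` is the safer choice.

```python
import numpy as np

def lrap(y_true, y_score, label_weighted=False):
    """Label ranking average precision, optionally weighted by label count."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    per_sample, n_labels = [], []
    for y_i, f_i in zip(y_true, y_score):
        true_idx = np.flatnonzero(y_i)
        precisions = []
        for j in true_idx:
            at_least = f_i >= f_i[j]                  # labels scored at or above label j
            rank_ij = at_least.sum()                  # rank_ij in the formula
            hits_ij = (at_least & (y_i == 1)).sum()   # |L_ij| in the formula
            precisions.append(hits_ij / rank_ij)
        per_sample.append(np.mean(precisions))
        n_labels.append(len(true_idx))
    weights = np.asarray(n_labels, dtype=float) if label_weighted else None
    return float(np.average(per_sample, weights=weights))

y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]
print(lrap(y_true, y_score))
```

In this example the true label is ranked 2nd in the first row and 3rd in the second, so the score is the mean of 1/2 and 1/3.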


<h1 style="color:blue">Don't forget to upvote if you like this notebook. And let me know your thoughts and findings in the comments.</h1>

# 2. Deep Dive into Data

> In this competition, you are given audio files that include sounds from numerous species. Your task is, for each test audio file, to predict the probability that each of the given species is audible in the audio clip. While the training files contain both the species identification as well as the time the species was heard, the time localization is not part of the test predictions.

Again quoted the organizers, but that's what we have to follow :) So yeah, I hope the first part is clear, or you can refer to the section 1.2 and 1.3 of this notebook for better understanding about the format and evaluation metrics. Cool? Let's understand the last sentence. C'mon, read again.
<span style="color:brown">In the training files (i.e. the audio recordings), we do have the timestamp information for where a particular species is heard. Clear? But for test samples, we don't have that data.</span> Well, we will use it for better insights later ;)

***Note that the training data also includes false positive label occurrences to assist with training.***

Let's look at the files and their descriptions.

# 2.1 Files

- **train_tp.csv** - training data of true positive species labels, with corresponding time localization
- **train_fp.csv** - training data of false positive species labels, with corresponding time localization
- **sample_submission.csv** - a sample submission file in the correct format; note each species column has an <span style="color:brown">s</span> prefix.
- **train/** - the training audio files
- **test/** - the test audio files; the task is to predict the species found in each audio file
- **tfrecords/{train,test}** - competition data in the TFRecord format, which includes **recording_id**, **audio_wav** (encoded in 16-bit PCM format), and **label_info** (for train only), which provides a `,` -delimited string of the columns below (minus recording_id), where multiple labels for a recording_id are `;` -delimited.

# 2.1.1 Columns
- **recording_id** - unique identifier for recording
- **species_id** - unique identifier for species
- **songtype_id** - unique identifier for songtype
- **t_min** - start second of annotated signal
- **f_min** - lower frequency of annotated signal
- **t_max** - end second of annotated signal
- **f_max** - upper frequency of annotated signal
- **is_tp** - [tfrecords only] an indicator of whether the label is from the train_tp (1) or train_fp (0) file.
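
To picture the `label_info` string from the TFRecords, here's a plain-Python sketch that splits one into rows. The example string is made up, and the column order is an assumption on my part (the columns above minus `recording_id`, in the listed order).

```python
COLUMNS = ["species_id", "songtype_id", "t_min", "f_min", "t_max", "f_max", "is_tp"]
INT_COLUMNS = {"species_id", "songtype_id", "is_tp"}

def parse_label_info(label_info):
    """Split a label_info string into one dict per annotated label."""
    labels = []
    for chunk in label_info.replace('"', "").split(";"):   # labels are ';'-delimited
        row = {}
        for name, value in zip(COLUMNS, chunk.split(",")):  # columns are ','-delimited
            row[name] = int(value) if name in INT_COLUMNS else float(value)
        labels.append(row)
    return labels

# Hypothetical string with two labels for one recording.
example = "21,1,13.84,3281.25,14.9333,4125.0,0;8,1,24.496,3750.0,28.6187,5531.25,0"
parsed = parse_label_info(example)
print(parsed[0])
```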

# 2.2 Imports

In [None]:
import gc
import os
import random
import numpy as np
import pandas as pd
from tqdm import tqdm

import seaborn as sns
import IPython.display as ipD
import matplotlib.pyplot as plt
import matplotlib.patches as ptc

import tensorflow as tf

from sklearn.preprocessing import StandardScaler, MinMaxScaler

mm = MinMaxScaler()
ss = StandardScaler()

import librosa
import librosa.display as LD

%matplotlib inline

In [None]:
test_folder = "../input/rfcx-species-audio-detection/test"
tfrecords = "../input/rfcx-species-audio-detection/tfrecords"
train_folder = "../input/rfcx-species-audio-detection/train"
sample_submission = "../input/rfcx-species-audio-detection/sample_submission.csv"
train_tp = "../input/rfcx-species-audio-detection/train_tp.csv"
train_fp = "../input/rfcx-species-audio-detection/train_fp.csv"

In [None]:
def seedAll(seed):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ["PYTHONHASHSEED"]=str(seed)

seedAll(2021)

## Helpers Army :)

In [None]:
def species_id_dis(df, flag, gby=False):
    plt.figure(figsize=(18, 6))
    if not gby:
        sns.countplot(x="species_id", data=df)
        plt.title(f"Species ID distribution for {flag}")
    else:
        sns.countplot(x="species_id", hue="songtype_id", data=df)  # use the passed-in frame, not the global trtp
        plt.title(f"Species ID distribution for {flag} grouped by song_type")      
        
    plt.show()
    pass

def pie_st(df, flag, col="songtype_id"):
    plt.figure(figsize=(8, 8))
    wedges, texts, autotexts = plt.pie(df[col].value_counts(), 
            startangle=45, 
            wedgeprops={"linewidth":1, "edgecolor":"black"}, 
            autopct='%1.f%%', 
            shadow=True,
            textprops= dict(color="black"),
            explode=(0.2, 0.2))
    plt.legend(wedges, df[col].value_counts().index,
              title="song type",
              loc="center",
              bbox_to_anchor=(1, 0, 0, 0))
    plt.setp(autotexts, size=14, weight="bold")
    plt.title(f"Song types distribution for {flag}")
    plt.show()
    pass

def outlier_viz(df, col, flag):
    fig, ax = plt.subplots(1, 2, figsize=(20, 6))
    sns.boxplot(x=col, data=df, orient="h", ax=ax[0])
    sns.distplot(df[col], kde=True, ax=ax[1])
    
    if col=="clip_duration":
        fig.suptitle(f"Distribution of length of audio clips containing any species in {flag}")
    else:
        fig.suptitle(f"Distribution of Frequency Ranges containing any species in {flag}")
    fig.show()
    pass

def spidwise_viz(df, col, flag):
    plt.figure(figsize=(20, 20))
    sns.boxplot(x=col, y="species_id", data=df, orient="h")
    
    if col=="clip_duration":
        plt.title(f"species_id-wise Distribution of length of audio clips containing any species in {flag} Samples")
    else:
        plt.title(f"species_id-wise Distribution of Frequency Ranges containing any species in {flag} Samples")
    plt.show()
    pass

def soidwise_viz(col):
    fig, axs = plt.subplots(1, 2, figsize=(22, 8))
    
    if col=="clip_duration":
        flag = "length of audio clips"
    else:
        flag = "Frequency Ranges"
        
    sns.boxplot(y=col, x="songtype_id", data=trtp, ax=axs[0])
    axs[0].set_title(f"songtype_id-wise Distribution of {flag} containing any species in True positive Samples")

    sns.boxplot(y=col, x="songtype_id", data=trfp, ax=axs[1])
    axs[1].set_title(f"songtype_id-wise Distribution of {flag} containing any species in False positive Samples")

    fig.show()
    pass

def pivot_data(df, col, gby="species_id"):
    pct_25 = lambda x: np.percentile(x, 25)
    pct_75 = lambda x: np.percentile(x, 75)
    pct_75.__name__ = "75%"
    pct_25.__name__ = "25%"
    display(df.pivot_table(col, gby, aggfunc=["count", "min", pct_25, "mean", "median", pct_75, "max"]).style.background_gradient(cmap="plasma"))
    pass

# 3. Exploratory Data Analysis

Just a note for the readers: as I am an electronics major, I am much more inclined towards digital signal processing, and I have taken this up as one of my research projects. So I will be updating my resources and domain knowledge here or in discussion threads as the competition progresses.

# 3.1 Train True Positives

In [None]:
trtp = pd.read_csv(train_tp)
trtp

So our target col is `species_id`. Let's have a look at the class counts.

But what does `songtype_id` represent? 

In [None]:
species_id_dis(trtp, "True Positives")

In [None]:
species_id_dis(trtp, "True Positives", gby=True)

In [None]:
pie_st(trtp, "True Positives", col="songtype_id")

# 3.1.1 Target count distribution

- The class imbalance is visible for species_id 23, which has 100 counts.
- We do not have songtype 4 for any of these classes except species 16, 17 and 23.
- For species_id 16, only songtype 4 exists.
- For species_id 23, there seems to be a perfect balance for both of the song types.

But again, what is songtype_id? Let's have a look at them.

# 3.2 Train False Positives

In [None]:
trfp = pd.read_csv(train_fp)
trfp

In [None]:
species_id_dis(trfp, "False Positives")

In [None]:
species_id_dis(trfp, "False Positives", gby=True)

In [None]:
pie_st(trfp, "False Positives", col="songtype_id")

# <h1 style="color:gold">Curiosity #1</h1>

## Let's see how many recordings are present in both train_tp and train_fp

In [None]:
print("Total Train audio samples: ", len(os.listdir(train_folder)))
print("Total Test audio samples: ", len(os.listdir(test_folder)))
print("Number of samples present both in Train True Positives and Train False Positives: ", len(set(trfp.recording_id.tolist()).intersection(trtp.recording_id.tolist())))
print("Number of unique audio samples in Train True positives: ", trtp.recording_id.nunique())
print("Number of unique audio samples in Train False positives: ", trfp.recording_id.nunique())

# 3.2.1 Clip Duration Analysis

## Species ID-wise

In [None]:
trtp["clip_duration"] = trtp["t_max"] - trtp["t_min"]
trfp["clip_duration"] = trfp["t_max"] - trfp["t_min"]

In [None]:
outlier_viz(trtp, "clip_duration", "True Positives")

In [None]:
spidwise_viz(trtp, "clip_duration", "True Positives")

In [None]:
pivot_data(trtp, "clip_duration")

In [None]:
outlier_viz(trfp, "clip_duration", "False Positives")

In [None]:
spidwise_viz(trfp, "clip_duration", "False Positives")

In [None]:
pivot_data(trfp, "clip_duration")

# 3.2.2 Frequency Range Analysis

In [None]:
trtp["freq_range"] = trtp["f_max"] - trtp["f_min"]
trfp["freq_range"] = trfp["f_max"] - trfp["f_min"]

In [None]:
outlier_viz(trtp, "freq_range", "True Positives")

In [None]:
spidwise_viz(trtp, "freq_range", "True Positives")

In [None]:
pivot_data(trtp, "freq_range")

In [None]:
outlier_viz(trfp, "freq_range", "False Positives")

In [None]:
spidwise_viz(trfp, "freq_range", "False Positives")

In [None]:
pivot_data(trfp, "freq_range")

## Songtype_id-wise

In [None]:
soidwise_viz("clip_duration")

In [None]:
pivot_data(trtp, col="clip_duration", gby="songtype_id")

In [None]:
pivot_data(trfp, col="clip_duration", gby="songtype_id")

In [None]:
soidwise_viz("freq_range")

In [None]:
pivot_data(trtp, col="freq_range", gby="songtype_id")

In [None]:
pivot_data(trfp, col="freq_range", gby="songtype_id")

# 3.3 Analysing Audio data with [Librosa](https://librosa.org/doc/latest/index.html)

# 3.3.1 Load and Examine the audio sample

In [None]:
# choose a sample from train and test
tr = os.path.join(train_folder, os.listdir(train_folder)[np.random.randint(0, len(os.listdir(train_folder)))])
ts = os.path.join(test_folder, os.listdir(test_folder)[np.random.randint(0, len(os.listdir(test_folder)))])

# load the np array and the sampling rate
trx, trsr = librosa.load(tr)
tsx, tssr = librosa.load(ts)
recId_train = (tr.split("/")[-1]).split(".")[0]
recId_test = (ts.split("/")[-1]).split(".")[0]

print("="*10, "Training Sample", "="*10)
display(trfp[trfp["recording_id"]==recId_train] if recId_train in trfp["recording_id"].tolist() else trtp[trtp["recording_id"]==recId_train])
print("="*10, "Test Sample", "="*10)
print("Test Data: ", recId_test)

So in our picked sample (recId=efcc3bd16), we have a single species, i.e. species_20. It's heard between 49.232 s and 52.672 s, and the min and max frequency components are 2343.75 Hz and 5718.75 Hz respectively. Let's explore the preliminary features.

In [None]:
# print the shape and the sampling rate
print("=====Train sample=======")
print(trx.shape, trsr)
print("=====Test sample=======")
print(tsx.shape, tssr)

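So, yes, each data sample is a 1-D vector with shape (1323000,) and a sampling rate of 22050 Hz. That rate is not necessarily the file's own: `librosa.load` resamples to 22050 Hz by default unless you pass `sr=None`. At 22050 Hz, 1323000 samples correspond to a 60-second clip. Oh, and if you haven't noticed yet, the audio files are simply named `recording_id.flac`. So, let's have a walk around.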
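
A quick sanity check on those numbers. This is only arithmetic; the 48000 Hz native rate is an assumption taken from the `SR` constant used later in the TFRecords section.

```python
n_samples = 1323000   # length of the array returned by librosa.load above
default_sr = 22050    # librosa.load resamples to this rate unless sr is given
native_sr = 48000     # assumed native rate of the .flac files (SR in section 4.1)

duration_s = n_samples / default_sr
samples_at_native = int(duration_s * native_sr)
print(duration_s)          # clip length in seconds
print(samples_at_native)   # samples the same clip would have at the native rate

# To skip the resampling step, pass sr=None:
# trx, trsr = librosa.load(tr, sr=None)
```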

# 3.3.2 Play the Audio

### Train sample

In [None]:
ipD.Audio(tr)

### Test sample

In [None]:
ipD.Audio(ts)

# 3.3.3 Waveplot of the audio samples

### Train sample

In [None]:
plt.figure(figsize=(14, 5))
LD.waveplot(trx, sr=trsr)
plt.show()

### Test Sample

In [None]:
plt.figure(figsize=(14, 5))
LD.waveplot(tsx, sr=tssr)
plt.show()

# 3.3.4 Short Time Fourier Transform(STFT)

But what is a Short time fourier transform? What happened to General DFT? Let's understand.  

A signal may contain one or more frequency components. But the signal representation doesn't tell about the frequency components present in a signal. To do so, we need **Fourier Transform**. It tells about the frequency components present in a signal. 

Then what's the problem? Why do we even need STFT? 

Because there remains a fundamental trade-off between time and frequency. Here is a quick explanatory [video](https://www.youtube.com/watch?v=g1_wcbGUcDY). The time representation obfuscates frequency, and the frequency representation obfuscates time. No single representation gives a clear picture of both. Here comes the idea of the short-time Fourier transform.

<span style="color:blue">**Short-time Fourier transform (STFT)**</span> is a sequence of Fourier transforms of a <span style="color:blue">windowed signal</span>. STFT provides the <span style="color:red">time-localized frequency information for situations in which frequency components of a signal vary over time</span>, whereas the <span style="color:blue">standard Fourier transform</span> provides the <span style="color:orange">frequency information averaged over the entire signal time interval.</span>

In simpler words, we take **fixed-length windows** of the original signal, apply the Fourier transform to each window, and then stack the resulting spectra one after another along the time axis (they are not summed). Here I am attaching a few resources to follow:

1. YouTube Video: https://www.youtube.com/watch?v=g1_wcbGUcDY
2. Science Direct: https://www.sciencedirect.com/topics/engineering/short-time-fourier-transform

Still, there remain ambiguities over selecting the window length, the shape of the window, window filters, processing, etc. We will dig deeper as the competition progresses.
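
To make the windowing idea tangible, here's a bare-bones STFT in NumPy. It's a sketch, not what `librosa.stft` does internally: librosa defaults to `n_fft=2048`, `hop_length=512` and pads the signal so frames are centered, while the function name `naive_stft` and the small window sizes below are my own choices.

```python
import numpy as np

def naive_stft(signal, n_fft=256, hop_length=64):
    """Slice the signal into overlapping windows and Fourier-transform each one."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    frames = np.stack([
        signal[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # One spectrum per frame; transpose to (frequency bins, time frames).
    return np.fft.rfft(frames, axis=1).T

sr = 8000
t = np.arange(sr) / sr                    # one second of audio
tone = np.sin(2 * np.pi * 1000 * t)       # 1 kHz test tone
S = naive_stft(tone)
freqs = np.fft.rfftfreq(256, d=1 / sr)    # centre frequency of each bin
peak_hz = freqs[np.abs(S).mean(axis=1).argmax()]
print(S.shape, peak_hz)
```

A longer window gives finer frequency bins (`sr / n_fft` Hz each) but blurs events in time, which is exactly the trade-off mentioned above.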

And this STFT is best visualized by a spectrogram plot. But what is a spectrogram? It's a picture of how the frequency spectrum of a signal changes over time. Okay, let's worry about how to obtain it and what to do with this feature.
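
Spectrograms are usually drawn in decibels rather than raw magnitudes. The sketch below shows the conversion as I understand `librosa.amplitude_to_db` to work with its default arguments (20·log10 of the magnitude, a small floor, and an 80 dB dynamic-range clip); the function name is hypothetical and this is not librosa's actual code.

```python
import numpy as np

def amplitude_to_db_sketch(magnitude, ref=1.0, amin=1e-5, top_db=80.0):
    """Convert STFT magnitudes to decibels relative to `ref`."""
    magnitude = np.abs(magnitude)
    db = 20.0 * np.log10(np.maximum(amin, magnitude))   # floor avoids log(0)
    db -= 20.0 * np.log10(np.maximum(amin, ref))
    return np.maximum(db, db.max() - top_db)            # keep at most top_db of range

db_values = amplitude_to_db_sketch(np.array([0.1, 1.0, 10.0]))
print(db_values)
```

Every factor of 10 in amplitude becomes a step of 20 dB, which is why quiet calls become visible next to loud ones.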

### Train sample

In [None]:
TRX = librosa.stft(trx)
print("Shape of the stft: ", TRX.shape)
# convert into db
TRXdb = librosa.amplitude_to_db(abs(TRX))

plt.figure(figsize=(14, 5))
librosa.display.specshow(TRXdb, sr=trsr, x_axis='time', y_axis='hz')
plt.colorbar()
plt.figure(figsize=(14, 5))
librosa.display.specshow(TRXdb, sr=trsr, x_axis='time', y_axis='log')
plt.title("In log scale")
plt.colorbar()
plt.show()

### Test Sample

In [None]:
# convert into fourier transform
TSX = librosa.stft(tsx)
print("Shape of the stft: ", TSX.shape)

# convert into db
TSXdb = librosa.amplitude_to_db(abs(TSX))
plt.figure(figsize=(14, 5))
librosa.display.specshow(TSXdb, sr=tssr, x_axis='time', y_axis='hz')
plt.colorbar()
plt.figure(figsize=(14, 5))
librosa.display.specshow(TSXdb, sr=tssr, x_axis='time', y_axis='log')
plt.colorbar()
plt.show()

# 3.3.5 Spectral Centroids

Let's understand what a Spectral Centroid is. As the name suggests, it's the **centroid of Spectral components**. Well, that's obvious from the name, what else? It indicates where the “centre of mass” for a sound is located and is calculated as the <span style="color:red">weighted mean of the frequencies</span> present in the sound. Or in simple words: It gives the <span style="color:orange">measure of the brightness of a sound</span>. The individual centroid of a spectral frame is defined as the average frequency weighted by amplitudes, divided by the sum of the amplitudes.
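
For a single frame, that definition is a one-liner. Here's a sketch with a toy four-bin spectrum (the frequencies and magnitudes are made up, and the function name is hypothetical):

```python
import numpy as np

def spectral_centroid_frame(magnitudes, freqs):
    """Amplitude-weighted mean frequency of one spectral frame."""
    magnitudes = np.asarray(magnitudes, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    return float((freqs * magnitudes).sum() / magnitudes.sum())

freqs = np.array([100.0, 200.0, 300.0, 400.0])
balanced = spectral_centroid_frame([1, 0, 1, 0], freqs)  # energy split between 100 and 300 Hz
bright = spectral_centroid_frame([0, 0, 1, 3], freqs)    # energy skewed towards 400 Hz
print(balanced, bright)
```

The more energy sits in the high bins, the higher the centroid, hence the "brightness" interpretation.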

### Train sample

In [None]:
spectral_centroids = librosa.feature.spectral_centroid(y=trx, sr=trsr)[0]
print("Shape of the spectral centroids: ", spectral_centroids.shape)

# extract the time and frame indices
frames = range(len(spectral_centroids))
t = librosa.frames_to_time(frames)

plt.figure(figsize=(10, 4))
LD.waveplot(trx, sr=trsr, alpha=0.4)
plt.title("Spectral Centroids for Train sample")
plt.show()

plt.figure(figsize=(10, 4))
LD.waveplot(trx, sr=trsr, alpha=0.4, label="Audio Signal")
# label each curve explicitly so the legend cannot get out of order
plt.plot(t, ss.fit_transform(spectral_centroids.reshape(-1, 1)), label="sc_ss")
plt.plot(t, mm.fit_transform(spectral_centroids.reshape(-1, 1)), label="sc_mm")
plt.legend()
plt.title("Normalized Spectral Centroid Visualization for Train sample")
plt.show()

### Test sample

In [None]:
spectral_centroids = librosa.feature.spectral_centroid(y=tsx, sr=tssr)[0]
print("Shape of the spectral centroids: ", spectral_centroids.shape)

# extract the time and frame indices
frames = range(len(spectral_centroids))
t = librosa.frames_to_time(frames)

plt.figure(figsize=(10, 4))
LD.waveplot(tsx, sr=tssr, alpha=0.4)
plt.title("Spectral Centroids for Test sample")
plt.show()

plt.figure(figsize=(10, 4))
LD.waveplot(tsx, sr=tssr, alpha=0.4, label="Audio Signal")
# label each curve explicitly so the legend cannot get out of order
plt.plot(t, ss.fit_transform(spectral_centroids.reshape(-1, 1)), label="sc_ss")
plt.plot(t, mm.fit_transform(spectral_centroids.reshape(-1, 1)), label="sc_mm")
plt.legend()
plt.title("Normalized Spectral Centroid Visualization for Test sample")
plt.show()

# 3.3.6 Spectral Roll-off

In [None]:
# spectral_rolloff returns shape (1, n_frames); take row 0 so frames and values line up
spectral_rolloff = librosa.feature.spectral_rolloff(y=trx+0.01, sr=trsr)[0]
frames = range(len(spectral_rolloff))
t = librosa.frames_to_time(frames)

plt.figure(figsize=(12, 4))
LD.waveplot(trx, sr=trsr, alpha=0.4)
plt.plot(t, ss.fit_transform(spectral_rolloff.reshape(-1, 1)), color='r')
plt.show()

In [None]:
# spectral_rolloff returns shape (1, n_frames); take row 0 so frames and values line up
spectral_rolloff = librosa.feature.spectral_rolloff(y=tsx, sr=tssr)[0]
frames = range(len(spectral_rolloff))
t = librosa.frames_to_time(frames)

plt.figure(figsize=(10, 4))
LD.waveplot(tsx, sr=tssr, alpha=0.4)
plt.plot(t, ss.fit_transform(spectral_rolloff.reshape(-1, 1)), color="r")
plt.show()

# 3.3.7 Mel-Frequency Cepstral Coefficients(MFCCs)

Everyone is using MFCC, huh? This must be super cool. Well, let's dive deep together. `You want to conquer something, break it down to pieces`: I don't know who said this, or probably I quoted it randomly, but let's follow the traditional approach of understanding a new concept.

**MFCC**: <span style="color:orange">Mel</span> <span style="color:blue">Frequency</span> <span style="color:red">Cepstral</span> <span style="color:brown">Coefficients</span>

# <span style="color:orange">Mel</span>

- **Mel scale** is a scale that relates **the perceived frequency of a tone** to **the actual measured frequency**. 
- It scales the frequency in order to match more closely what the human ear can hear (humans are better at identifying small changes in speech at lower frequencies). This scale has been derived from sets of experiments on human subjects. 

Let me give you an intuitive explanation of what the mel scale captures:

The range of human hearing is 20 Hz to 20 kHz. Imagine a tone at 300 Hz. This would sound something like the standard dialer tone of a land-line phone. Now imagine a tone at 400 Hz (a little higher pitched dialer tone). Now compare the distance between these two, however it may be perceived by your brain. 

Now imagine a 900 Hz signal (similar to a microphone feedback sound) and a 1 kHz sound. The perceived distance between these two sounds may seem smaller than between the first two, although the actual difference is the same (100 Hz). 

The mel scale tries to capture such differences. A frequency measured in Hertz (f) can be converted to the Mel scale using the following formula :
$$Mel(f) = 2595\log_{10}\left(1 + \frac{f}{700}\right)$$
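
Here's the same formula as a tiny sketch, with a base-10 logarithm, applied to the two 100 Hz steps from the example. The function names are my own; also note that librosa's own mel helpers use a slightly different (Slaney-style) scale by default, so its numbers will not match these exactly.

```python
import math

def hz_to_mel(f_hz):
    """Hertz to mel, using the formula above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

low_step = hz_to_mel(400) - hz_to_mel(300)     # 100 Hz step at low frequency
high_step = hz_to_mel(1000) - hz_to_mel(900)   # 100 Hz step at higher frequency
print(round(low_step, 1), round(high_step, 1))
```

The same 100 Hz step covers roughly 107 mels around 300 Hz but only about 68 mels around 900 Hz, and 1000 Hz lands at almost exactly 1000 mels.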

# <span style="color:blue">Frequency</span>

Well, you know that I know that you know I know :) But, why Frequency domain analysis? Explained [here](https://www.kaggle.com/c/rfcx-species-audio-detection/discussion/201129).

# <span style="color:red">Cepstral</span>

On a lighter note, `"speC"[::-1] + "tral"` ;) Well, jokes apart, or was it a joke? Nope xD. Here's why and how?  
Let's understand what is **cepstrum?** For a very basic understanding, cepstrum is the **information of rate of change in spectral bands**. In the conventional analysis of time signals, any periodic component (for eg, echoes) shows up as sharp peaks in the corresponding frequency spectrum (ie, Fourier spectrum(obtained by fourier transform)).

On taking the log of the magnitude of this Fourier spectrum, and then again taking the spectrum of this log by a cosine transformation, we observe a peak wherever there is a periodic element in the original time signal. **TL;DR**,`cosine_transform(log(mag(fourier_transformation)))`.

Since we apply a transform on the frequency spectrum itself, the resulting spectrum is neither in the frequency domain nor in the time domain and hence [Bogert et al.](https://www.fceia.unr.edu.ar/prodivoz/Oppenheim_Schafer_2004.pdf) decided to call it the ***quefrency domain***. And this spectrum of the log of the spectrum of the time signal was named **cepstrum**.
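
A toy sketch of the idea: a click followed by a quieter echo, and a cepstrum that peaks at the echo delay. One caveat: I use the inverse FFT here rather than an explicit cosine transform. For a real, symmetric log-magnitude spectrum the two carry the same information, but MFCC pipelines typically apply a DCT to mel-filterbank log energies instead. The function name and the signal are made up for illustration.

```python
import numpy as np

def real_cepstrum(signal):
    """Spectrum of the log-magnitude spectrum of a signal."""
    spectrum = np.fft.fft(signal)
    log_magnitude = np.log(np.abs(spectrum) + 1e-12)  # small offset avoids log(0)
    return np.fft.ifft(log_magnitude).real

n, delay = 512, 40
x = np.zeros(n)
x[0] = 1.0        # a click ...
x[delay] = 0.5    # ... followed by a quieter echo 40 samples later

cepstrum = real_cepstrum(x)
# Skip quefrency 0 and look only at the first half (the second half mirrors it).
peak_quefrency = 1 + int(np.argmax(cepstrum[1 : n // 2]))
print(peak_quefrency)
```

The echo shows up as a ripple in the log spectrum, and the second transform turns that ripple into a single peak at a quefrency equal to the delay.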

# <span style="color:brown">Coefficients</span>

Coefficients are nothing but what makes up the cepstrum.

That being said, any sound generated by humans is determined by the shape of their vocal tract (including tongue, teeth, etc). If this shape can be determined correctly, any sound produced can be accurately represented. The envelope of the **time power spectrum of the speech signal** is representative of the vocal tract, and MFCCs accurately represent this envelope. Here is a diagram explaining the whole process:
![cepstralCoeff](https://miro.medium.com/max/788/1*dWnjn5LLS0j8St53ACwqSg.jpeg)

The Cepstral coeffs mentioned in the block diagram are nothing but the MFCCs.

But why **power spectrum**? I searched and found this.

What Andrew Ng said:
> phonemes, which are the smallest components of sound, didn’t matter. To a certain extent, in fact, what matters is that voice is mostly a (quasi) periodic signal if time intervals are small enough. 

This leads to the idea of ignoring the phase of the signal and only using its power spectrum as a source of information. The fact that sound can be approximately reconstructed from its power spectrum (with the Griffin-Lim algorithm or a neural vocoder, for example) supports this. I will come back to the discussion later.
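
A small sketch of what "ignoring the phase" buys us: delaying a frame (here circularly, on random made-up data) changes its phase but leaves the power spectrum untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
frame = rng.standard_normal(1024)     # made-up audio frame
shifted = np.roll(frame, 100)         # same content, circularly delayed by 100 samples

spec, spec_shifted = np.fft.rfft(frame), np.fft.rfft(shifted)
same_power = np.allclose(np.abs(spec) ** 2, np.abs(spec_shifted) ** 2)
phase_changed = not np.allclose(np.angle(spec), np.angle(spec_shifted))
print(same_power, phase_changed)
```

So a feature built on the power spectrum does not care exactly where inside the frame the call starts, which is convenient for detection.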

### Train sample

In [None]:
mfccs = librosa.feature.mfcc(y=trx, sr=trsr)
print(mfccs.shape)
#Displaying  the MFCCs:
plt.figure(figsize=(15, 7))
LD.specshow(mfccs, sr=trsr, x_axis='time')
plt.show()

### Test Sample

In [None]:
mfccs = librosa.feature.mfcc(y=tsx, sr=tssr)
print(mfccs.shape)
#Displaying  the MFCCs:
plt.figure(figsize=(15, 7))
LD.specshow(mfccs, sr=tssr, x_axis='time')
plt.show()

# Mel Spectrogram

In [None]:
melspec = librosa.feature.melspectrogram(y=trx, sr=trsr)
print(melspec.shape)
plt.figure(figsize=(10, 4))
librosa.display.specshow(librosa.power_to_db(melspec, ref=np.max),
                         y_axis='mel',
                         x_axis='time')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel spectrogram')
plt.tight_layout()

## Chroma feature

### Train sample

In [None]:
hop_length=12
chromagram = librosa.feature.chroma_stft(y=trx, sr=trsr, hop_length=hop_length)
plt.figure(figsize=(15, 5))
LD.specshow(chromagram, x_axis='time', y_axis='chroma', hop_length=hop_length, cmap='coolwarm')
plt.show()

### Test Sample

In [None]:
chromagram = librosa.feature.chroma_stft(y=tsx, sr=tssr, hop_length=hop_length)
plt.figure(figsize=(15, 5))
LD.specshow(chromagram, x_axis='time', y_axis='chroma', hop_length=hop_length, cmap='coolwarm')
plt.show()

# 3.4 Sample Submission

In [None]:
samSub = pd.read_csv(sample_submission)
samSub

# 4. Explore TFRecords

In [None]:
train_tfrec = "../input/rfcx-species-audio-detection/tfrecords/train"
test_tfrec = "../input/rfcx-species-audio-detection/tfrecords/test"

train_tfrecs = sorted(tf.io.gfile.glob(train_tfrec + "/*.tfrec"))
test_tfrecs = sorted(tf.io.gfile.glob(test_tfrec + "/*.tfrec"))

print("Number of train tfrecords: ", len(train_tfrecs))
print("Number of test tfrecords: ", len(test_tfrecs))

Read a tfrec sample to inspect. Following is the syntax for reading a tfrec data in tf.  
`tf.data.TFRecordDataset([filename])`

In [None]:
sample_proto_train_tfrec = tf.data.TFRecordDataset([train_tfrecs[0]])
sample_proto_train_tfrec

Now, let's go back to the data description and see what they have mentioned about the tfrecords.

> competition data in the TFRecord format, which includes **recording_id**, **audio_wav** (encoded in 16-bit PCM format), and **label_info** (for train only), which provides a `,` -delimited string of the columns (except recording_id), where multiple labels for a recording_id are `;` -delimited.

As we can see, there are 148 samples in each train tfrecord, except of course the last one, `../input/rfcx-species-audio-detection/tfrecords/train/31-139.tfrec`, which holds 139. For test, each tfrecord holds 63 samples, except the last one, `../input/rfcx-species-audio-detection/tfrecords/test/31-39.tfrec`, which holds 39. Okay, let's just have a look below, then we will proceed to decode the tfrecs.
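
The count seems to be encoded in the file name itself. Here's a sketch that reads it back, assuming the naming pattern is `<index>-<count>.tfrec` (my reading of the two names above, not something the organizers document) and that files 00 to 30 are all full.

```python
import os

def samples_in_tfrec(path):
    """Read the sample count encoded in a name such as '31-139.tfrec'."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return int(stem.split("-")[1])

last_train = samples_in_tfrec("../input/rfcx-species-audio-detection/tfrecords/train/31-139.tfrec")
last_test = samples_in_tfrec("../input/rfcx-species-audio-detection/tfrecords/test/31-39.tfrec")

# 31 full files plus the shorter last one (an assumption, see above).
total_train = 31 * 148 + last_train
total_test = 31 * 63 + last_test
print(total_train, total_test)
```

The test total matches the 1992 test recordings mentioned earlier, which suggests the pattern is read correctly.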

In [None]:
print("Number of samples in one single record: ", sample_proto_train_tfrec.reduce(np.int64(), lambda x, _: x+1).numpy())

Okay, now that we know each TFRecord sample contains three fields for training (and two for testing), let's create a feature description for each sample inside our TFRecords.

***For more info on tfrecords, head over to [tensorflow documentation.](https://www.tensorflow.org/tutorials/load_data/tfrecord#tfrecord_files_using_tfdata)***

Or ask in the comments, and I will write a detailed explanation if required.

# 4.1 Parse TFRecords

In [None]:
CUT_TIME = 10 # cutting window in seconds
SAMPLE_TIME = 6
SR = 48000
FMAX = 24000
FMIN = 40
# feature description for the tfrecords
# this is passed as an argument to tf.io.parse_single_example
feature_description = {
    'recording_id': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'audio_wav': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'label_info': tf.io.FixedLenFeature([], tf.string, default_value=''),
}
parse_dtype = {
    'audio_wav': tf.float32,
    'recording_id': tf.string,
    'species_id': tf.int32,
    'songtype_id': tf.int32,
    't_min': tf.float32,
    'f_min': tf.float32,
    't_max': tf.float32,
    'f_max':tf.float32,
    'is_tp': tf.int32
}

# define a function to parse one serialized tfrec example
@tf.function
def _parse_fun(sample):
    sample = tf.io.parse_single_example(sample, feature_description) # this returns a dictionary of the features for a single tfrec example
    audio, _ = tf.audio.decode_wav(sample["audio_wav"], desired_channels=1)
    label_info = tf.strings.split(sample["label_info"], ";")
    labels = tf.strings.regex_replace(label_info, '"', '')
    
    @tf.function
    def _cut_audio(label):
        items = tf.strings.split(label, sep=',')
        spid = tf.squeeze(tf.strings.to_number(items[0], tf.int32))
        soid = tf.squeeze(tf.strings.to_number(items[1], tf.int32))
        tmin = tf.squeeze(tf.strings.to_number(items[2]))
        fmin = tf.squeeze(tf.strings.to_number(items[3]))
        tmax = tf.squeeze(tf.strings.to_number(items[4]))
        fmax = tf.squeeze(tf.strings.to_number(items[5]))
        tp = tf.squeeze(tf.strings.to_number(items[6], tf.int32))

        tmax_s = tmax * tf.cast(SR, tf.float32)
        tmin_s = tmin * tf.cast(SR, tf.float32)
        cut_s = tf.cast(CUT_TIME * SR, tf.float32)
        all_s = tf.cast(60 * SR, tf.float32)
        tsize_s = tmax_s - tmin_s
        cut_min = tf.cast(
            tf.maximum(0.0, 
                tf.minimum(tmin_s - (cut_s - tsize_s) / 2,
                           tf.minimum(tmax_s + (cut_s - tsize_s) / 2,
                                      all_s) - cut_s)
            ), tf.int32
        )
        cut_max = cut_min + CUT_TIME * SR
        
        _sample = {
            'audio_wav': tf.reshape(audio[cut_min:cut_max], [CUT_TIME*SR]),
            'recording_id': sample['recording_id'],
            'species_id': spid,
            'songtype_id': soid,
            't_min': tmin - tf.cast(cut_min, tf.float32)/tf.cast(SR, tf.float32),
            'f_min': fmin,
            't_max': tmax - tf.cast(cut_min, tf.float32)/tf.cast(SR, tf.float32),
            'f_max': fmax,
            'is_tp': tp
        }
        return _sample
    
    # fn_output_signature replaces the deprecated dtype argument (TensorFlow 2.3+)
    samples = tf.map_fn(_cut_audio, labels, fn_output_signature=parse_dtype)
    return samples

In [None]:
parsed_tfrecs_sample = sample_proto_train_tfrec.map(_parse_fun).unbatch()

Now the samples of our TFRecord file are decoded and available in `parsed_tfrecs_sample`, a `tf.data.Dataset` that can be iterated over with `next(iter(...))` as follows:

In [None]:
sample = next(iter(parsed_tfrecs_sample))

for key, val in sample.items():
    print(key, ":", val)
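
The window arithmetic inside `_cut_audio` is easier to follow with concrete numbers. The sketch below repeats the same clamping logic in plain Python; the helper name `cut_window` and the constant `CLIP_SECONDS` are hypothetical, and the 60-second recording length is taken from the `60 * SR` term in the parser.

```python
SR = 48000         # sample rate used in this notebook
CUT_TIME = 10      # window length in seconds
CLIP_SECONDS = 60  # recording length assumed by the parser (60 * SR samples)


def cut_window(t_min, t_max):
    """Return (start, end) in samples of a CUT_TIME window centred on the label."""
    tmin_s, tmax_s = t_min * SR, t_max * SR
    cut_s, all_s = CUT_TIME * SR, CLIP_SECONDS * SR
    margin = (cut_s - (tmax_s - tmin_s)) / 2  # padding on each side of the label
    # Centre on the label, but never run past the end of the recording ...
    start = min(tmin_s - margin, min(tmax_s + margin, all_s) - cut_s)
    # ... and never start before the beginning.
    start = int(max(0.0, start))
    return start, start + cut_s


for t_min, t_max in [(30.0, 32.0), (1.0, 2.0), (58.0, 59.5)]:
    start, end = cut_window(t_min, t_max)
    print(f"label {t_min}-{t_max}s -> window {start / SR:.2f}-{end / SR:.2f}s")
```

A label in the middle of the recording gets a window centred on it (26 s to 36 s for a 30 s to 32 s label), while labels near the start or end are clamped to 0 s to 10 s and 50 s to 60 s respectively.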

# 4.2 Cut down audio to smaller clips

In [None]:
@tf.function
def cut_audio(sample, istrain=True):
    # random cutting for train samples
    if istrain:
        cut_min = tf.random.uniform([],
                                    maxval=(CUT_TIME-SAMPLE_TIME) * SR,
                                    dtype=tf.int32)
    else:
        # center cropping for validation data
        cut_min = (CUT_TIME - SAMPLE_TIME) * SR//2
    cut_max = cut_min + SAMPLE_TIME * SR
    cutaudio = tf.reshape(
        sample["audio_wav"][cut_min:cut_max], [SAMPLE_TIME * SR]
    )

    result = {}
    result.update(sample)
    result["audio_wav"] = cutaudio
    result["t_min"] = tf.maximum(0.0, sample["t_min"] - tf.cast(cut_min, tf.float32)/SR)
    result["t_max"] = tf.maximum(0.0, sample["t_max"] - tf.cast(cut_min, tf.float32)/SR)
    return result
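
As a rough illustration of how cropping shifts the label times, the sketch below applies the same offset and clamping as `cut_audio` in plain Python. The helper name `crop_and_shift` is hypothetical, and the crop start is passed in explicitly instead of being drawn at random so that the result is reproducible.

```python
SR = 48000
CUT_TIME = 10    # length of the clip produced in section 4.1, in seconds
SAMPLE_TIME = 6  # length of the final crop, in seconds


def crop_and_shift(t_min, t_max, cut_min=None):
    """Shift label times after cropping; cut_min is the crop start in samples."""
    if cut_min is None:
        # centre crop, as used for validation data
        cut_min = (CUT_TIME - SAMPLE_TIME) * SR // 2
    offset = cut_min / SR
    # Times before the crop start are clamped to 0, as in cut_audio.
    return max(0.0, t_min - offset), max(0.0, t_max - offset)


# Centre crop starts at 2 s, so a label at 4.0-6.0 s moves to 2.0-4.0 s.
print(crop_and_shift(4.0, 6.0))
# A crop starting at 3.5 s misses a label at 1.0-2.0 s, which collapses to zero length.
print(crop_and_shift(1.0, 2.0, cut_min=int(3.5 * SR)))
```

Note that only the lower bound is clamped: a label that extends past the end of the crop keeps a `t_max` larger than `SAMPLE_TIME`, which matters if the box is drawn on the cropped spectrogram.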

# 4.3 Create log mel-spectrogram features

In [None]:
@tf.function
def waveToSpec(sample):
    mel_power = 2
    stfts = tf.signal.stft(sample["audio_wav"],
                           frame_length=2048,
                           frame_step=512,
                           fft_length=2048)
    spectograms = tf.abs(stfts) ** mel_power

    # convert into mel scale
    mel_weight = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=224,  # number of mel bands (not the number of MFCCs); typical values are lower (around 30-50), 224 is used here to match the image height
        num_spectrogram_bins=stfts.shape[-1],
        sample_rate=SR,
        lower_edge_hertz=FMIN,
        upper_edge_hertz=FMAX
    )
    mel_spectrograms = tf.tensordot(
        spectograms, mel_weight, 1
    )
    mel_spectrograms.set_shape(spectograms.shape[:-1].concatenate(mel_weight.shape[-1:]))
    log_mel_spectograms = tf.math.log(mel_spectrograms + 1e-6)

    results = {
        "audio_spec": tf.transpose(log_mel_spectograms)  # of shape (num_mel_spec_bins, num_frames)
    }
    results.update(sample)
    return results
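
The shapes produced by `waveToSpec` can be worked out by hand. The sketch below does that arithmetic in plain Python, assuming `tf.signal.stft` runs with its default `pad_end=False` (only full frames are kept), as in the cell above. The constant names are hypothetical and simply repeat the values used in the function.

```python
SR = 48000
SAMPLE_TIME = 6
FRAME_LENGTH = 2048
FRAME_STEP = 512
FFT_LENGTH = 2048
NUM_MEL_BINS = 224

num_samples = SAMPLE_TIME * SR
# Without end padding, only complete frames are kept.
num_frames = 1 + (num_samples - FRAME_LENGTH) // FRAME_STEP
# A real-valued FFT keeps only the non-redundant half of the spectrum.
num_freq_bins = FFT_LENGTH // 2 + 1

stft_shape = (num_frames, num_freq_bins)
log_mel_shape = (NUM_MEL_BINS, num_frames)  # after the mel projection and the transpose
print("STFT shape:", stft_shape)
print("log-mel shape:", log_mel_shape)
```

For a 6-second crop this gives 559 frames of 1025 frequency bins, so `audio_spec` has shape `(224, 559)`.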

In [None]:
@tf.function
def filter_tp(sample):
    """
    Keep only true-positive samples.

    :param sample: parsed dictionary produced by _parse_fun
    :return: boolean tensor, True if the sample is a true positive
    """
    return sample["is_tp"] == 1

In [None]:
@tf.function
def create_annot(sample):
    target = tf.one_hot(sample["species_id"],
                        24,
                        on_value=sample["is_tp"],
                        off_value=0)

    return {
        "input": sample["audio_spec"],  # log mel-spectrogram computed in waveToSpec
        "target": tf.cast(target, tf.float32)
    }
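
To see what the target built by `create_annot` looks like, here is a small plain-Python equivalent of the one-hot encoding, with the `is_tp` flag used as the "on" value. The helper name `make_target` is hypothetical.

```python
NUM_SPECIES = 24  # number of classes used in create_annot


def make_target(species_id, is_tp):
    """One-hot vector of length NUM_SPECIES whose 'on' value is the is_tp flag."""
    return [float(is_tp) if i == species_id else 0.0 for i in range(NUM_SPECIES)]


tp_target = make_target(species_id=3, is_tp=1)
fp_target = make_target(species_id=3, is_tp=0)
print(tp_target)
print(fp_target)
```

A true positive yields a single 1.0 at the species index, while a false positive yields an all-zero vector. In this notebook `filter_tp` runs before `create_annot`, so only the first case occurs.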

In [None]:
@tf.function
def toImage(logmelSpec):
    # expand one dimension axis to be treated as image
    image = tf.expand_dims(logmelSpec, axis=-1)
    image = tf.image.resize(image, (224, 512))
    image = tf.image.per_image_standardization(image)

    # no augmentation at this stage
    # min-max rescale to [0, 255]
    image = (image - tf.reduce_min(image)) / (tf.reduce_max(image) - tf.reduce_min(image)) * 255.0
    image = tf.image.grayscale_to_rgb(image)
    # image = cfg.model_params["preprocess"](image)
    return image


@tf.function
def preprocess_img(sample):
    image = toImage(sample["input"])
    return image, sample["target"]
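
Assuming the intent of the rescaling line in `toImage` is a min-max mapping to the range [0, 255], the idea can be sketched in plain Python on a tiny 2x2 "image". The helper name `rescale_to_255` is hypothetical, and the guard for a constant image is an addition that the notebook code does not have.

```python
def rescale_to_255(pixels):
    """Min-max rescale a 2-D list of floats to the range [0, 255]."""
    flat = [v for row in pixels for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # constant image: avoid dividing by zero (not handled in the notebook code)
        return [[0.0 for _ in row] for row in pixels]
    return [[(v - lo) / span * 255.0 for v in row] for row in pixels]


# Toy values of both signs, as produced by per-image standardization.
toy = [[-2.0, -1.0], [0.0, 2.0]]
scaled = rescale_to_255(toy)
print(scaled)
```

The smallest value maps to 0 and the largest to 255, which is the range image models pretrained on 8-bit pictures usually expect before their own preprocessing.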

In [None]:
# let's extract the samples from the tfrecs and create our initial processed dataset
spec_dataset = parsed_tfrecs_sample.filter(filter_tp).map(cut_audio).map(waveToSpec)
img_dataset = spec_dataset.map(create_annot).map(preprocess_img)

In [None]:
fig, axs = plt.subplots(nrows=3, ncols=3, figsize=(20, 10))
for i, s in enumerate(spec_dataset.take(3)):
    axs[0, i].imshow(s['audio_spec'])
    axs[0, i].set_title(s['recording_id'].numpy().decode("UTF-8") + "->" + "s-" +str(s["species_id"].numpy()))
    LD.waveplot(s["audio_wav"].numpy(), sr=SR, ax=axs[1, i])
    LD.specshow(s['audio_spec'].numpy(), x_axis="time", y_axis="mel", sr=SR, fmax=FMAX, fmin=FMIN, ax=axs[2, i], cmap="magma")
    axs[2, i].add_patch(ptc.Rectangle(xy=(s["t_min"], s["f_min"]), height=s["f_max"]-s["f_min"], width=s["t_max"]-s["t_min"], fill=False))
    axs[2, i].text(s["t_min"], s["f_min"], "s-" +str(s["species_id"].numpy()), horizontalalignment='left', verticalalignment='bottom', fontsize=16)                   
plt.show()

<h1 style="color:red">Thank you for reading the notebook :)</h1>