# Table of Contents

* [Exploratory Data Analysis](#Header)
    - [Metadata](#Metadata)
        - [Missing Values](#MissingValues)
        - [Species](#Species)
        - [Date and Time](#DateTime)
        - [Recordists](#Recordists)
        - [Location](#Location)
* [Audio Feature Extraction](#AudioFeatureExtraction)
    - [Waveform](#Waveform)
    - [Autocorrelation](#Autocorrelation)
    - [Spectrogram](#Spectrogram)
    - [Chromagram](#Chromagram)
    - [Spectral](#Spectral)
        - [Centroid](#Centroid)
        - [Bandwidth](#Bandwidth)
        - [Contrast](#Contrast)
        - [Flatness](#Flatness)
        - [Rolloff](#Rolloff)
    - [MFCC](#MFCC)

* [Afterword](#Thanks)

<a id="Environment"></a>
## Environment

In [None]:
import os
import sys
import librosa
import librosa.display
import librosa.feature
import numpy as np
import pandas as pd
import plotly.express as xp
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import IPython.display as ipd
from sklearn.preprocessing import minmax_scale

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

<a id="Header"></a>
# Exploratory Data Analysis

<a id="Metadata"></a>
## Metadata

The train.csv file contains one row of metadata for each recording. From that metadata, we get these relevant features:

- ebird_code
- date/time
- location
- recordist
- filename

In [None]:
train = pd.read_csv('../input/birdsong-recognition/train.csv')
train.head()

In [None]:
train.info()

<a id="MissingValues"></a>
### Missing Values

We need to check to see if any of our relevant features are missing values and, if so, what to do about that.

In [None]:
missing = train.isna().sum().sort_values(ascending=False)

In [None]:
missing = missing[missing != 0]
xp.bar(x=missing.index, y=missing, text=missing, title='Missing Values by Feature', labels={'x':'Features', 'y':'Quantity'})

- Luckily, none of our relevant data is missing

<a id="Species"></a>
### Species
Let's explore the distribution of the bird species among samples. We'll use the ebird code instead of the full species name.

In [None]:
counts = train['ebird_code'].value_counts()

In [None]:
xp.bar(x=counts.index, y=counts, title='Species Distribution (by ebird code)', labels={'x':'Ebird Code', 'y':'Quantity'})

- Exactly 100 samples for about half of species in question
- Redhead is minimum at 9 samples

<a id="DateTime"></a>
### Date and Time
The date and time of the recording could have an impact on which bird is making the call. Some birds may usually call only at certain times, and some birds are only in certain locations during certain times of the year.

In [None]:
# split date and time into a separate dataframe (copy to avoid modifying a view of train)
datetime = train[['date', 'time']].copy()
# unparseable dates become NaT and are ignored by value_counts()
datetime['date'] = pd.to_datetime(datetime['date'], errors='coerce')
datetime['hour'] = pd.to_numeric(datetime['time'].str.split(':', expand=True)[0], errors='coerce')

In [None]:
# sort by date (the index) so the line is drawn in chronological order
ax1 = datetime.date.value_counts().sort_index().plot(figsize=(10,6), title='Recordings by Date')

ax1.set_xlabel('Date')
ax1.set_ylabel('Quantity')
plt.show()

- Majority of recordings taken in the past decade
- Interesting spike around 2003
- Cyclical spikes after 2013

In [None]:
ax2 = datetime['hour'].value_counts().sort_index().plot(figsize=(10,6), title='Recordings by Time', kind='bar', figure=plt.figure())

ax2.set_xlabel('Hour')
ax2.set_ylabel('Quantity')
plt.show()

- Most recordings taken between 6AM and 12PM
- Gradual decrease as the day moves on from 8AM

<a id="Recordists"></a>
### Recordists
Who recorded the samples? This could be important as certain recordists may have a particular interest in certain birds.

In [None]:
ax3 = train['recordist'].value_counts().sort_values(ascending=False).head(20).sort_values().plot(figsize=(10, 6), title='Recordings by Recordist', figure=plt.figure(), kind='barh', fontsize=9)

ax3.set_xlabel('Quantity')
ax3.set_ylabel('Recordist')
plt.tight_layout()
plt.show()

- Majority of recordings made by only two people

<a id="Location"></a>
### Location
Certain birds only inhabit certain areas. Therefore, we need to take location into account.

In [None]:
counts = train['country'].value_counts().sort_values(ascending=False).head(10).sort_values()

In [None]:
xp.bar(y=counts.index, x=counts, title='Number of Recordings by Country', labels={'y':'Country', 'x':'Quantity'}, orientation='h')

In [None]:
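coords = train.groupby(['latitude', 'longitude'], as_index=False)['ebird_code'].agg('count')
# drop recordings without coordinates
coords = coords[(coords.latitude != 'Not specified') & (coords.longitude != 'Not specified')]
# the remaining coordinates may still be stored as strings, so convert them to numbers
xp.scatter_geo(lat=pd.to_numeric(coords['latitude'], errors='coerce'), lon=pd.to_numeric(coords['longitude'], errors='coerce'), title='Recording Locations')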

- Vast majority of data comes from North America, specifically from USA
- Very little data from Africa and Asia

<a id="AudioFeatureExtraction"></a>
# Audio Feature Extraction

<a id="Sample"></a>
## Sample
First, let's take one audio sample from each of the first 5 birds. We'll look at the waveforms and listen to the songs.

In [None]:
bird_codes = train.ebird_code.unique()[:5]

for bird_code in bird_codes:
    # first recording listed for this species
    filename = train[train['ebird_code'] == bird_code]['filename'].iloc[0]
    path = os.path.join('../input/birdsong-recognition/train_audio/', bird_code, filename)

    # wave plot (one figure per recording)
    plt.figure(figsize=(15,3))
    data, srate = librosa.load(path)
    librosa.display.waveplot(data, sr=srate)
    plt.gca().set_title(bird_code)
    plt.xticks([],[])
    plt.xlabel('')
    plt.show()

    # audio display
    ipd.display(ipd.Audio(path))

<a id="Features"></a>
## Features

After doing some research on audio signal classification, I have come up with the following features to extract from the audio files:

- [Waveform](#Waveform)
- [Autocorrelation](#Autocorrelation)
- [Spectrogram](#Spectrogram)
- [Chromagram](#Chromagram)
- [Spectral](#Spectral)
    - [Centroid](#Centroid)
    - [Bandwidth](#Bandwidth)
    - [Contrast](#Contrast)
    - [Flatness](#Flatness)
    - [Rolloff](#Rolloff)
- [MFCC](#MFCC)

We'll do a sample feature extraction of bird code 'ameavo' as an example. (filename XC99571.mp3)

<a id="Waveform"></a>
### Waveform

In [None]:
data, srate = librosa.load('../input/birdsong-recognition/train_audio/ameavo/XC99571.mp3')

In [None]:
# plot waveform as refresher
plt.figure(figsize=(15,5))
librosa.display.waveplot(data, sr=srate)
plt.gca().set_title('ameavo')
plt.xticks([],[])
plt.xlabel('')
plt.show()

<a id="Autocorrelation"></a>
### Autocorrelation

Autocorrelation compares a signal with a lagged version of itself. Its main purpose is to find repeated patterns in a sample that might be hidden by noise.
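To make the idea concrete, here is a minimal NumPy sketch. It is not how librosa computes it; the function name, the synthetic signal, and the 50-sample period are all assumptions chosen for illustration. It buries a repeating pattern in noise and then recovers the period from the autocorrelation peak.

```python
import numpy as np

def autocorrelate(signal, max_lag):
    """Unnormalized autocorrelation for lags 0 .. max_lag - 1."""
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    # dot product of the signal with a copy of itself shifted by `lag` samples
    return np.array([np.dot(signal[:n - lag], signal[lag:]) for lag in range(max_lag)])

# hypothetical test signal: a sine that repeats every 50 samples, plus noise
rng = np.random.default_rng(0)
period = 50
t = np.arange(2000)
noisy = np.sin(2 * np.pi * t / period) + 0.5 * rng.standard_normal(t.size)

acf = autocorrelate(noisy, max_lag=200)

# lag 0 is always the largest value, so search for the strongest repeat
# in a window around one period (lags 25 to 74)
best_lag = 25 + int(np.argmax(acf[25:75]))
print('estimated period in samples:', best_lag)
```

The estimated lag should land close to the true period even though the pattern is hard to see in the raw noisy samples.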

In [None]:
autocorrelation = librosa.autocorrelate(data, max_size=5000)

In [None]:
plt.figure(figsize=(15,5))
plt.plot(autocorrelation)
plt.gca().set_title('Autocorrelation by Lag Time')
plt.xlabel('Lag')
plt.show()

- Autocorrelation falls off very quickly, reaching almost 0 after a lag of about 500 samples

<a id="Spectrogram"></a>
### Spectrogram

The spectrogram is a visual representation of a signal's spectrum of frequencies over time. 
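As a rough sketch of what happens under the hood, a spectrogram can be built by slicing the signal into overlapping frames, windowing each frame, and taking its Fourier transform. The function name, frame size, hop length, and the synthetic 1 kHz tone below are assumptions for illustration, not librosa's defaults.

```python
import numpy as np

def stft_magnitude(signal, n_fft=512, hop=128):
    """Magnitude spectrogram with shape (frequency bins, frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # one FFT per frame, then transpose so rows are frequencies
    return np.abs(np.fft.rfft(frames, axis=1)).T

sr = 8000                            # hypothetical sample rate in Hz
t = np.arange(sr) / sr               # one second of audio
tone = np.sin(2 * np.pi * 1000 * t)  # pure 1 kHz tone

spec = stft_magnitude(tone)
freqs = np.fft.rfftfreq(512, d=1 / sr)        # frequency of each row in Hz
peak_hz = freqs[spec.mean(axis=1).argmax()]   # row with the most energy
print(spec.shape, peak_hz)
```

Each column of `spec` is the spectrum of one short slice of time, which is what the spectrogram plot shows as color.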

In [None]:
spectrogram = librosa.stft(data)

In [None]:
plt.figure(figsize=(20,10))
librosa.display.specshow(librosa.amplitude_to_db(abs(spectrogram)), sr=srate, x_axis='time', y_axis='hz')
plt.xlabel('Time', fontsize=20)
plt.ylabel('Frequency (Hz)', fontsize=20)
plt.colorbar()
plt.title('Spectrogram', fontsize=20)
plt.show()

- Pitch seems to hover around 2000 to 3500 Hz most of the time
- Some spikes to 5500-7000 Hz

<a id="Chromagram"></a>
### Chromagram

The chromagram is a visual representation of a signal's chroma feature. The chroma feature at any point in time is the intensity of each pitch class in the set {C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, B}, with all octaves folded together. These pitch classes are the rows of the chromagram.
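The octave folding can be sketched as a mapping from frequency to pitch class. This is only an illustration of the idea, assuming standard tuning (A4 = 440 Hz) and the MIDI note numbering; the function name is hypothetical and this is not the filter bank that librosa applies.

```python
import math

PITCH_CLASSES = ['C', 'C♯', 'D', 'D♯', 'E', 'F', 'F♯', 'G', 'G♯', 'A', 'A♯', 'B']

def pitch_class(freq_hz, a4_hz=440.0):
    """Nearest pitch class for a frequency, ignoring the octave."""
    # MIDI note number: 69 is A4, and each semitone adds 1
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))
    # notes one octave (12 semitones) apart share a pitch class
    return PITCH_CLASSES[midi % 12]

# middle C, then A in three different octaves
notes = [pitch_class(f) for f in (261.63, 440.0, 880.0, 3520.0)]
print(notes)
```

Because the octave is discarded, a bird call at 3520 Hz and a tuning fork at 440 Hz would both contribute to the same row of the chromagram.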

In [None]:
chroma = librosa.feature.chroma_stft(y=data, sr=srate)

In [None]:
plt.figure(figsize=(20,10))
librosa.display.specshow(chroma, x_axis='time', y_axis='chroma')
plt.xlabel('Time', fontsize=20)
plt.ylabel('Chroma Value', fontsize=20)
plt.colorbar()
plt.clim(0,1)
plt.title('Chromagram', fontsize=20)
plt.show()

<a id="Spectral"></a>
### Spectral Features

<a id="Centroid"></a>
#### Spectral Centroid

Spectral centroid is a measurement of the "center of gravity" of the spectrum and is a common metric of timbre in a sound sample. It is the magnitude-weighted mean frequency at each point in time, which often, but not always, lies near the dominant frequency.
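For a single frame, the centroid is just a weighted average of the frequencies, using the spectral magnitudes as weights. The sketch below uses a made-up four-bin spectrum (all values are assumptions for illustration) and compares the centroid to the strongest bin.

```python
import numpy as np

def spectral_centroid(magnitudes, freqs):
    """Magnitude-weighted mean frequency of one spectrum frame."""
    return np.sum(freqs * magnitudes) / np.sum(magnitudes)

# hypothetical frame: bin frequencies in Hz and their magnitudes
freqs = np.array([1000.0, 2000.0, 3000.0, 4000.0])
mags = np.array([1.0, 4.0, 1.0, 2.0])

centroid = spectral_centroid(mags, freqs)
dominant = freqs[mags.argmax()]   # the single strongest bin
print(centroid, dominant)
```

Here the energy at 4000 Hz pulls the centroid above the strongest bin, so the two values can differ when the spectrum is not symmetric around its peak.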

In [None]:
centroid = librosa.feature.spectral_centroid(y=data, sr=srate)[0]

In [None]:
plt.figure(figsize=(15,5))
librosa.display.waveplot(data, sr=srate)
plt.plot(librosa.frames_to_time(range(len(centroid)), sr=srate), minmax_scale(centroid), color='g')
plt.gca().set_title('Spectral Centroid by Time')
plt.xlabel('Time')
plt.show()

<a id="Bandwidth"></a>
#### Spectral Bandwidth

Spectral bandwidth measures how widely the spectrum is spread around the spectral centroid at a certain time. A pure tone has a narrow bandwidth, while noise has a wide one.
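One common way to compute it, sketched below, is as a weighted deviation of the frequencies from the centroid. The function name and the three-bin spectra are assumptions for illustration; `p=2` gives a weighted standard deviation.

```python
import numpy as np

def spectral_bandwidth(magnitudes, freqs, p=2):
    """Weighted p-th order spread of the spectrum around its centroid."""
    weights = magnitudes / magnitudes.sum()
    centroid = np.sum(freqs * weights)
    return np.sum(weights * np.abs(freqs - centroid) ** p) ** (1 / p)

freqs = np.array([1000.0, 2000.0, 3000.0])

# all energy in one bin: nothing is spread around the centroid
narrow = spectral_bandwidth(np.array([0.0, 1.0, 0.0]), freqs)
# energy split between the two outer bins: same centroid, wide spread
wide = spectral_bandwidth(np.array([1.0, 0.0, 1.0]), freqs)
print(narrow, wide)
```

Both spectra have a centroid of 2000 Hz, so bandwidth captures something the centroid alone cannot.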

In [None]:
bandwidth = librosa.feature.spectral_bandwidth(y=data, sr=srate)[0]

In [None]:
plt.figure(figsize=(15,5))
librosa.display.waveplot(data, sr=srate)
plt.plot(librosa.frames_to_time(range(len(bandwidth))), minmax_scale(bandwidth))
plt.gca().set_title('Spectral Bandwidth by Time')
plt.xlabel('Time')
plt.show()

- Pure noise portions of the sample are higher in bandwidth

<a id="Contrast"></a>
#### Spectral Contrast

Spectral contrast splits the spectrum into frequency sub-bands and, for each sub-band, compares the peak energy with the valley energy at a point in time. High contrast generally corresponds to clear, narrow-band sound, while low contrast corresponds to broadband noise.
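The peak-versus-valley comparison can be sketched as follows. This is a simplified illustration rather than librosa's exact procedure: the band edges, the 20% quantile, the decibel scaling, and the ten-bin spectrum are all assumptions.

```python
import numpy as np

def band_contrast(magnitudes, band_edges, quantile=0.2):
    """Peak-to-valley ratio in dB for each sub-band of one spectrum frame."""
    contrasts = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = np.sort(magnitudes[lo:hi])
        # average the weakest and strongest bins rather than taking one extreme
        k = max(1, int(round(quantile * band.size)))
        valley = band[:k].mean()
        peak = band[-k:].mean()
        contrasts.append(20 * np.log10(peak / valley))
    return np.array(contrasts)

# band 0: one strong tone over a quiet floor; band 1: flat, noise-like
spectrum = np.array([0.01, 0.01, 1.0, 0.01, 0.01,
                     0.5, 0.5, 0.5, 0.5, 0.5])
contrast = band_contrast(spectrum, band_edges=[0, 5, 10])
print(contrast)
```

The tonal band yields a large contrast and the flat band yields none, which is the distinction the feature is meant to capture.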

In [None]:
contrast = librosa.feature.spectral_contrast(y=data, sr=srate)

In [None]:
plt.figure(figsize=(20,10))
librosa.display.specshow(contrast, x_axis='time')
plt.xlabel('Time', fontsize=20)
plt.colorbar()
plt.title('Spectral Contrast', fontsize=20)
plt.ylabel('Frequency Band', fontsize=20)
plt.show()

- Highest contrast occurs in edge frequency bands

<a id="Flatness"></a>
#### Spectral Flatness

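Spectral flatness is the ratio of the geometric mean to the arithmetic mean of the power spectrum. Values near 1 indicate a flat, noise-like spectrum and values near 0 indicate a tonal one, so it is most often used to identify and separate tones versus noise.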
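The ratio is short enough to write out directly. The sketch below assumes a strictly positive power spectrum (a zero bin would break the logarithm) and uses two made-up eight-bin spectra for illustration.

```python
import numpy as np

def spectral_flatness(power):
    """Geometric mean divided by arithmetic mean of a power spectrum."""
    power = np.asarray(power, dtype=float)
    # geometric mean computed in the log domain for numerical stability
    geometric_mean = np.exp(np.mean(np.log(power)))
    return geometric_mean / np.mean(power)

# equal power in every bin, like white noise
flat = spectral_flatness(np.ones(8))
# nearly all power in a single bin, like a pure tone
tonal = spectral_flatness([1e-6] * 7 + [1.0])
print(flat, tonal)
```

The flat spectrum scores 1, while the tonal spectrum scores very close to 0.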

In [None]:
flatness = librosa.feature.spectral_flatness(y=data)

In [None]:
plt.figure(figsize=(20,10))
librosa.display.specshow(flatness, x_axis='time')
plt.xlabel('Time', fontsize=20)
plt.colorbar()
plt.clim(0,1)
plt.title('Spectral Flatness', fontsize=20)
plt.ylabel('Frequency Band', fontsize=20)
plt.show()

- Maximum value of 0.2 at points
- Low noise in general

<a id="Rolloff"></a>
#### Spectral Rolloff

Spectral rolloff is the frequency under which a specified percentage of the energy lies (85% by default in librosa).
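For one frame, this amounts to accumulating the spectrum from the lowest bin upward and stopping once the chosen fraction of the total is reached. The function name and the five-bin spectrum below are assumptions for illustration.

```python
import numpy as np

def spectral_rolloff(magnitudes, freqs, roll_percent=0.85):
    """Lowest frequency below which roll_percent of the total energy lies."""
    cumulative = np.cumsum(magnitudes)
    threshold = roll_percent * cumulative[-1]
    # first bin where the running total reaches the threshold
    return freqs[np.searchsorted(cumulative, threshold)]

# hypothetical frame: most of the energy sits in the low bins
freqs = np.array([1000.0, 2000.0, 3000.0, 4000.0, 5000.0])
mags = np.array([4.0, 3.0, 2.0, 0.5, 0.5])

rolloff_85 = spectral_rolloff(mags, freqs)
rolloff_50 = spectral_rolloff(mags, freqs, roll_percent=0.5)
print(rolloff_85, rolloff_50)
```

Lowering the percentage moves the rolloff frequency down, since less of the spectrum needs to be covered.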

In [None]:
rolloff = librosa.feature.spectral_rolloff(y=data, sr=srate)[0]

In [None]:
plt.figure(figsize=(15,5))
librosa.display.waveplot(data, sr=srate)
plt.plot(librosa.frames_to_time(range(len(rolloff))), minmax_scale(rolloff))
plt.gca().set_title('Spectral Rolloff by Time')
plt.xlabel('Time')
plt.show()

<a id="MFCC"></a>
### MFCC

Mel-Frequency Cepstral Coefficients are a collection of coefficients that together give a compact representation of the overall spectral envelope of a signal. They are probably the most common and important feature in machine learning on audio signals.
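Two of the building blocks can be sketched in a few lines: the mel scale, which spaces frequencies the way pitch is perceived, and the discrete cosine transform that turns log mel-band energies into coefficients. This is a simplified illustration, not librosa's pipeline. It assumes the HTK-style mel formula and an unnormalized DCT-II, and it starts from made-up band energies instead of computing a mel filter bank.

```python
import numpy as np

def hz_to_mel(hz):
    """HTK-style mel scale (an assumption; other mel formulas exist)."""
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def dct_ii(x):
    """Unnormalized type-II discrete cosine transform."""
    n = len(x)
    k = np.arange(n)[:, None]   # output coefficient index
    i = np.arange(n)[None, :]   # input sample index
    return np.cos(np.pi * k * (2 * i + 1) / (2 * n)) @ x

# hypothetical mel-band energies for one frame: a perfectly flat envelope
log_mel = np.log(np.array([4.0, 4.0, 4.0, 4.0]))
coeffs = dct_ii(log_mel)
print(hz_to_mel(1000.0), coeffs)
```

With a flat envelope only the first coefficient is non-zero. The higher coefficients describe progressively finer detail in the shape of the envelope.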

In [None]:
mfcc = librosa.feature.mfcc(y=data, sr=srate, n_mfcc=30)

In [None]:
plt.figure(figsize=(20,10))
librosa.display.specshow(minmax_scale(mfcc, axis=1), x_axis='time')
plt.xlabel('Time', fontsize=20)
plt.colorbar()
plt.clim(0,1)
plt.title('Mel-Frequency Cepstral Coefficients', fontsize=20)
plt.show()

print()
print('MFCCs calculated: %d' % mfcc.shape[0])

<a id="Thanks"></a>
# Thank You for Reading!

I am still very much new to data science, and I'm jumping in head-first. This notebook is meant as a learning experience to help me pick up some signal processing and audio classification techniques, and as a simple introduction to exploratory data analysis (EDA) and feature extraction (FE) for those who aren't well-versed in audio processing. I invite any and all constructive feedback!

Thanks again! Hope this is helpful to you.

# Resources
Here are the major resources that I used while doing my research.

- [Autocorrelation Wiki](https://en.wikipedia.org/wiki/Autocorrelation)
- [Sanket Doshi - Music Feature Extraction in Python](https://towardsdatascience.com/extract-features-of-music-75a3f9bc265d)
- [Spectral Features (IPython Notebook)](https://musicinformationretrieval.com/spectral_features.html)