# Classification

In this notebook we walk through an example of classifying individual zebra finches using acoustic parameters extracted from their calls.

Material here is adapted in part from https://github.com/theunissenlab/BioSoundTutorial

In [None]:
import librosa
import numpy as np
import pandas as pd
import sklearn
import vocalpy as voc

In [None]:
wav_paths = voc.paths.from_dir(
    './data/Elie-Theunissen-2016-zebra-finch-song-library-subset/',
    'wav'
)

In [None]:
wav_paths[0]

We make a helper function to get the bird IDs from the filenames.  

We will use this below when we want to predict the bird ID from the extracted features.

In [None]:
def bird_id_from_path(wav_path):
    """Helper functoin that gets a bird ID from a path"""
    return wav_path.name.split('_')[0]

In [None]:
bird_id_from_path(wav_paths[0])
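
To see what the helper does without needing the dataset, here is a minimal sketch on a made-up filename. The filename below is hypothetical; the sketch only assumes, as the helper does, that the bird ID is the part of the filename before the first underscore.

```python
from pathlib import Path


def bird_id_from_path(wav_path):
    """Helper function that gets a bird ID from a path"""
    return wav_path.name.split('_')[0]


# Hypothetical filename, made up for illustration; not a file from the dataset
example_path = Path("./data/BirdA_000000-000000_01.wav")
example_id = bird_id_from_path(example_path)
print(example_id)
```

Because the helper uses the `name` attribute, it expects a `pathlib.Path` rather than a plain string.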

We use a list comprehension to get the ID from all 91 files.

In [None]:
bird_ids = [
    bird_id_from_path(wav_path)
    for wav_path in wav_paths
]

## Feature extraction

Now we extract the acoustic features we will use to classify.  

For this example we use the temporal and spectral features from `soundsig`, since those are relatively quick to extract. For an example that uses fundamental frequency estimation, see https://github.com/theunissenlab/BioSoundTutorial/blob/master/BioSound4.ipynb

In [None]:
callback = voc.feature.soundsig.predefined_acoustic_features
params = dict(ftr_groups=("temporal", "spectral"))
extractor = voc.FeatureExtractor(callback, params)

In [None]:
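sounds = []
for wav_path in wav_paths:
    # Note: by default librosa.load resamples to 22050 Hz and mixes down to mono;
    # pass sr=None to keep each file's native sampling rate instead
    data, samplerate = librosa.load(wav_path)
    # Has no effect on audio that is already mono; kept as a safeguard
    data = librosa.to_mono(data)
    sounds.append(
        voc.Sound(data, samplerate)
    )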

In [None]:
features_list = extractor.extract(sounds, parallelize=True)

## Data preparation

From the extracted features we want to build two NumPy arrays, `X` and `y`.

These represent the samples $X_i$ in our dataset with their features $x$, and the labels $y_i$ for those samples. In this case we have a total of $m = 91$ samples (where $i \in \{1, 2, \ldots, m\}$).

We get these arrays as follows (noting there are always multiple ways to do things when you're programming):
- Take the `data` attribute of each `Features` instance we got back from the `FeatureExtractor` and convert it to a `pandas.DataFrame` with one row: the set of scalar features for exactly one sound
- Use `pandas` to concatenate all those `DataFrame`s, so we end up with 91 rows
- Add a column to this `DataFrame` with the IDs of the birds -- we then have $X$ and $y$ in a single table we could save to a csv file, to do further analysis on later
- We get $X$ by using the `values` attribute of the `DataFrame`, which is a NumPy array, and dropping the last column (the IDs)
- We get $y$ using `pandas.factorize`, which converts the unique set of strings in the `"id"` column into integer class labels: i.e., since there are 4 birds, for every row we get a value from $\{0, 1, 2, 3\}$

In [None]:
df = pd.concat(
    [features.data.to_pandas()
    for features in features_list]
)

In [None]:
df.head()

In [None]:
df["id"] = pd.array(bird_ids, dtype="str")
y, _ = df["id"].factorize()
X = df.values[:, :-1]  # -1 because we don't want 'id' column
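
`factorize` returns two things: the integer codes and the unique values those codes stand for. The cell above discards the second return value; keeping it makes it possible to map integer labels back to bird IDs later. Here is a small sketch on a toy table, where the column names, IDs, and feature values are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy table standing in for the real features; all values are made up
toy_df = pd.DataFrame(
    {
        "feature_a": [0.1, 0.2, 0.3, 0.4],
        "feature_b": [1.0, 2.0, 3.0, 4.0],
        "id": ["BirdA", "BirdB", "BirdA", "BirdC"],
    }
)

# toy_y holds integer class labels; toy_ids holds the IDs in order of first appearance
toy_y, toy_ids = toy_df["id"].factorize()

# Alternative to slicing off the last column: drop the label column by name,
# so the resulting array holds only the numeric features
toy_X = toy_df.drop(columns="id").to_numpy()

# Indexing the unique IDs with the integer labels recovers the original strings
recovered_ids = list(toy_ids[toy_y])
print(toy_y, recovered_ids)
```

Dropping the column by name is one option among several; it avoids relying on `"id"` being the last column and keeps the feature array numeric.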

## Fitting a Random Forest classifier

Finally we will train a classifier from `scikit-learn` to classify these individuals.

In [None]:
import sklearn.model_selection

In [None]:
X_train, X_val, y_train, y_val = sklearn.model_selection.train_test_split(
    X, y, stratify=y, train_size=0.8
)
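
The `stratify=y` argument asks `train_test_split` to keep the proportion of each class about the same in the training and validation sets, which matters when there are only a few samples per bird. The sketch below checks this on synthetic data; the class counts, feature values, and `random_state` are assumptions chosen for illustration, not values from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 4 classes with 20 samples each, 3 random features per sample
rng = np.random.default_rng(0)
toy_y = np.repeat(np.arange(4), 20)
toy_X = rng.normal(size=(toy_y.size, 3))

toy_X_train, toy_X_val, toy_y_train, toy_y_val = train_test_split(
    toy_X, toy_y, stratify=toy_y, train_size=0.8, random_state=0
)

# Count how many samples of each class landed in each split
train_counts = np.bincount(toy_y_train, minlength=4)
val_counts = np.bincount(toy_y_val, minlength=4)
print(train_counts, val_counts)
```

Passing `random_state` makes the split repeatable, which can help when comparing accuracy across runs.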

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

In [None]:
print(
    f"Accuracy: {clf.score(X_val, y_val) * 100:0.2f}%"
)