# Classification

In this notebook we walk through an example of classifying individual zebra finches using acoustic parameters extracted from their calls.

Material here is adapted in part from https://github.com/theunissenlab/BioSoundTutorial

In [18]:
import librosa
import numpy as np
import pandas as pd
import sklearn
import vocalpy as voc

In [19]:
wav_paths = voc.paths.from_dir(
    './data/Elie-Theunissen-2016-zebra-finch-song-library-subset/',
    'wav'
)

In [20]:
wav_paths[0]

PosixPath('data/Elie-Theunissen-2016-zebra-finch-song-library-subset/WhiLbl0010_110411-DC-01.wav')

We make a helper function to get the bird IDs from the filenames.  

We will use this below when we want to predict the bird ID from the extracted features.

In [21]:
def bird_id_from_path(wav_path):
    """Helper function that gets a bird ID from a path"""
    # The bird ID is the part of the filename before the first underscore
    return wav_path.name.split('_')[0]

In [22]:
bird_id_from_path(wav_paths[0])

'WhiLbl0010'

We use a list comprehension to get the ID from all 91 files.

In [23]:
bird_ids = [
    bird_id_from_path(wav_path)
    for wav_path in wav_paths
]
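
As a quick sanity check, one could count how many recordings each bird contributes, since class balance matters for the classifier trained later. The sketch below is only an illustration: it redefines the helper so that it runs on its own, and the paths are hypothetical (only the first filename appears in the data above; `BirdA` is a made-up ID).

```python
from collections import Counter
from pathlib import Path


def bird_id_from_path(wav_path):
    """Helper function that gets a bird ID from a path"""
    return wav_path.name.split('_')[0]


# Hypothetical paths for illustration; only the first filename is from the dataset
example_paths = [
    Path('data/WhiLbl0010_110411-DC-01.wav'),
    Path('data/WhiLbl0010_110411-DC-02.wav'),
    Path('data/BirdA_110411-DC-01.wav'),
]

# Count the number of files per bird ID
counts = Counter(bird_id_from_path(p) for p in example_paths)
print(counts)
```

With the real data, the same `Counter` call on `bird_ids` would show how the 91 files are distributed across birds.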

## Feature extraction

Now we extract the acoustic features we will use to classify.  

For this example we use the temporal and spectral features from `soundsig`, since those are relatively quick to extract. For an example that uses fundamental frequency estimation, see https://github.com/theunissenlab/BioSoundTutorial/blob/master/BioSound4.ipynb

In [24]:
callback = voc.feature.soundsig.predefined_acoustic_features
params = dict(ftr_groups=("temporal", "spectral"))
extractor = voc.FeatureExtractor(callback, params)

In [25]:
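sounds = []
for wav_path in wav_paths:
    # By default, librosa.load resamples to 22050 Hz and mixes down to mono
    data, samplerate = librosa.load(wav_path)
    # Already mono after loading, so this call leaves the data unchanged
    data = librosa.to_mono(data)
    sounds.append(
        voc.Sound(data, samplerate)
    )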

In [26]:
features_list = extractor.extract(sounds, parallelize=True)

[########################################] | 100% Completed | 207.29 ms


## Data preparation

From the extracted features we now want to build two NumPy arrays, `X` and `y`.  

These represent the samples $X_i$ in our dataset with their features $x$, and the labels for those samples $y_i$. In this case we have a total of $m = 91$ samples (where $i \in \{1, 2, \ldots, m\}$).

We get these arrays as follows (there is always more than one way to do things when programming):
- Take the `data` attribute of each `Features` instance we got back from the `FeatureExtractor` and convert it to a `pandas.DataFrame` with one row: the set of scalar features for exactly one sound
- Use `pandas` to concatenate all those `DataFrame`s, so we end up with 91 rows
- Add a column to this `DataFrame` with the IDs of the birds. We then have $X$ and $y$ in a single table that we could save to a csv file for further analysis later
- Get $X$ from the `values` attribute of the `DataFrame`, which is a NumPy array, leaving out the last column because it holds the IDs
- Get $y$ using `factorize` from `pandas`, which converts the unique set of strings in the `"id"` column into integer class labels: i.e., since there are 4 birds, for every row we get a value from $\{0, 1, 2, 3\}$

In [27]:
df = pd.concat(
    [features.data.to_pandas()
    for features in features_list]
)

In [28]:
df.head()

| channel | mean_t | std_t | skew_t | kurtosis_t | entropy_t | max_amp | mean_s | std_s | skew_s | kurtosis_s | entropy_s | q1 | q2 | q3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.088417 | 0.046638 | -0.007942 | 1.701301 | 0.990027 | 2190.359668 | 3179.943541 | 1141.036374 | 0.444231 | 5.511221 | 0.681472 | 2045.654297 | 3660.644531 | 4048.242188 |
| 0 | 0.109411 | 0.051968 | -0.191355 | 1.904806 | 0.990254 | 2283.808144 | 3507.995317 | 944.136895 | 0.912424 | 11.054892 | 0.616933 | 3229.980469 | 3703.710938 | 3940.576172 |
| 0 | 0.117362 | 0.053734 | -0.280394 | 1.893142 | 0.987814 | 2056.591994 | 3267.020307 | 1061.595836 | 1.040562 | 10.302365 | 0.636147 | 2734.716797 | 3186.914062 | 4069.775391 |
| 0 | 0.105553 | 0.055808 | 0.068496 | 1.803407 | 0.99323 | 2380.506805 | 3763.726551 | 911.892495 | 0.474032 | 9.10311 | 0.715781 | 3337.646484 | 3509.912109 | 4478.90625 |
| 0 | 0.102657 | 0.057831 | 0.069879 | 1.748356 | 0.994899 | 2595.650055 | 3942.28023 | 898.710281 | -0.243041 | 5.894722 | 0.693535 | 3445.3125 | 3854.443359 | 4780.371094 |


In [29]:
df["id"] = pd.array(bird_ids, dtype="str")
y, _ = df["id"].factorize()
X = df.values[:, :-1]  # -1 because we don't want 'id' column
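
To make the `factorize` step concrete, here is a small sketch on made-up IDs (the strings `birdA`, `birdB`, `birdC` are hypothetical, not the real bird IDs). It also keeps the second return value, which the cell above discards as `_`; that array is what one would need to map integer predictions back to bird IDs.

```python
import pandas as pd

# Hypothetical IDs for illustration, not the real bird IDs
ids = pd.Series(["birdA", "birdB", "birdA", "birdC", "birdB"])

# codes: one integer label per row, numbered in order of first appearance
# uniques: the string that each integer label stands for
codes, uniques = ids.factorize()

print(codes)
print(list(uniques))

# Map the integer labels back to the original strings
recovered = uniques.take(codes)
print(list(recovered))
```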

## Fitting a Random Forest classifier

Finally we will train a classifier from `scikit-learn` to classify these individuals.

In [30]:
import sklearn.model_selection

In [31]:
X_train, X_val, y_train, y_val = sklearn.model_selection.train_test_split(
    X, y, stratify=y, train_size=0.8
)
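
The `stratify=y` argument keeps the proportion of each class the same in both splits. The sketch below illustrates this on synthetic stand-in data (4 classes with 20 samples each, random features), not on the finch features. It also passes `random_state=0`, which is an addition for reproducibility: the cell above does not set one, so its split, and therefore the accuracy reported further down, can change from run to run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 4 classes with 20 samples each, 3 random features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(80, 3))
y_toy = np.repeat(np.arange(4), 20)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, stratify=y_toy, train_size=0.8, random_state=0
)

# Number of samples per class in each split
print(np.bincount(y_tr))
print(np.bincount(y_va))
```

Each class ends up with 16 training and 4 validation samples, matching the 80/20 split within every class.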

In [32]:
from sklearn.ensemble import RandomForestClassifier

In [33]:
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

In [34]:
print(
    f"Accuracy: {clf.score(X_val, y_val) * 100:0.2f}%"
)

Accuracy: 73.68%
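
Since `train_size=0.8` leaves 19 of the 91 samples for validation, 73.68% corresponds to 14 correct predictions, and a single call more or less moves the score by about 5 percentage points. One possible way to get a steadier estimate, which is not part of the analysis above, is k-fold cross-validation. The following is a minimal sketch on synthetic stand-in data from `make_classification`, chosen only to have the same shape as the data here (91 samples, 14 features, 4 classes); the names `X_toy`, `y_toy`, and `cv` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X and y: 91 samples, 14 features, 4 classes
X_toy, y_toy = make_classification(
    n_samples=91, n_features=14, n_informative=6,
    n_classes=4, random_state=0,
)

clf = RandomForestClassifier(max_depth=2, random_state=0)

# 5 stratified folds: every sample is used for validation exactly once
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X_toy, y_toy, cv=cv)

print(f"Accuracy: {scores.mean() * 100:0.2f}% +/- {scores.std() * 100:0.2f}%")
```

To try this on the finch data, one would pass the `X` and `y` arrays built earlier in place of `X_toy` and `y_toy`.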
