DSC160 Data Science and the Arts - Twomey - Spring 2020 - [dsc160.roberttwomey.com](http://dsc160.roberttwomey.com)

# Audio Feature Extraction, Clustering, Dimensional Reduction

This notebook walks you through feature extraction from a small collection of audio clips representing different genres (pop, classical, jazz, metal, rock). 

It is a simplified version of the more comprehensive approach outlined in Tzanetakis and Cook's 2002 paper [Musical Genre Classification of Audio Signals](https://pdfs.semanticscholar.org/4ccb/0d37c69200dc63d1f757eafb36ef4853c178.pdf) from IEEE Transactions on Speech and Audio Processing. Many of the techniques described in that paper (timbral, beat, and pitch features) can be implemented using librosa and our numpy/scipy toolkits.

This notebook works with the `mini-genres` dataset, a smaller version of the GTZAN dataset used in Tzanetakis and Cook's paper (see references at the end of this notebook). You can download `mini-genres` here: http://opihi.cs.uvic.ca/sound/mini-genres.tar.bz2

## Setup

Import necessary modules:

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn

import librosa
import librosa.display

from IPython.display import Audio

import requests
import os

import sklearn
from sklearn.preprocessing import StandardScaler

import numpy as np

import pandas as pd

## Download the dataset and unzip

Create the data directory if it doesn't exist:

In [None]:
if not os.path.exists('../data/'):
    os.makedirs('../data/')

In [None]:
#!wget -O ../data/mini-genres.tar.bz2 http://opihi.cs.uvic.ca/sound/mini-genres.tar.bz2
#!tar -xvf ../data/mini-genres.tar.bz2 -C ../data/
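If the shell commands are not available (for example on Windows), the same download-and-unpack step can be sketched in Python. This is a minimal sketch, not part of the original notebook: the helper names `download_archive` and `extract_archive` are hypothetical, and it uses the standard library's `urllib.request` rather than `requests`.

```python
import os
import tarfile
import urllib.request

URL = "http://opihi.cs.uvic.ca/sound/mini-genres.tar.bz2"


def download_archive(url, archive_path):
    """Download url to archive_path unless the file is already there."""
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(url, archive_path)
    return archive_path


def extract_archive(archive_path, dest_dir):
    """Unpack a (possibly compressed) tar archive into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    # "r:*" lets tarfile detect the compression (bz2 here) on its own
    with tarfile.open(archive_path, "r:*") as archive:
        archive.extractall(dest_dir)


# Example usage (left commented out, like the shell commands):
# archive = download_archive(URL, "../data/mini-genres.tar.bz2")
# extract_archive(archive, "../data/")
```

Skipping the download when the archive already exists makes the cell safe to re-run.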

## Explore the Audio Data

This section lets you explore two different examples from that dataset to observe differences in their time and frequency domain representations.

Using `librosa.load`, read in one of the files from the mini-genres dataset: `classical/classical.00001.au`

In [None]:
filename = "../data/mini-genres/classical/classical.00001.au"
x, fs = librosa.load(filename, duration=30)

Plot the waveform for this file:

In [None]:
plt.figure(figsize=(12,6))
librosa.display.waveshow(x, sr=fs)  # waveshow replaces waveplot, which newer librosa releases removed
plt.title('Sample')
plt.tight_layout()
plt.show()

Use the `Audio` class to make a playable widget for this file:

In [None]:
Audio(data=x, rate=fs)

Calculate the mel spectrogram, using a log scale for magnitude:

In [None]:
hop_length = 256
S = librosa.feature.melspectrogram(y=x, sr=fs, n_fft=4096, hop_length=hop_length)
logS = librosa.power_to_db(abs(S))

Plot the log mel spectrogram:

In [None]:
plt.figure(figsize=(15, 5))
librosa.display.specshow(logS, sr=fs, hop_length=hop_length, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.show()
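To make the log scaling concrete, here is a simplified numpy sketch of the power-to-decibel conversion, 10 * log10(S / ref). The function name `power_to_db_sketch` is hypothetical, and the defaults (`ref=1.0`, `amin=1e-10`, `top_db=80.0`) are assumed to match the documented defaults of `librosa.power_to_db`; treat it as an illustration rather than a replacement for the library call.

```python
import numpy as np


def power_to_db_sketch(S, ref=1.0, amin=1e-10, top_db=80.0):
    """Convert a power spectrogram to decibels: 10 * log10(S / ref)."""
    S = np.asarray(S, dtype=float)
    # floor tiny values so the logarithm stays finite
    log_spec = 10.0 * np.log10(np.maximum(amin, S))
    log_spec -= 10.0 * np.log10(np.maximum(amin, ref))
    # keep at most top_db of dynamic range below the loudest bin
    return np.maximum(log_spec, log_spec.max() - top_db)


# made-up power values, each a factor of 10 (or more) apart
power = np.array([1.0, 0.1, 0.01, 1e-12])
print(power_to_db_sketch(power))
```

Each factor of 10 in power becomes a step of 10 dB, and the very small last value is clipped to 80 dB below the maximum. This is why quiet detail that is invisible on a linear scale shows up in the log mel spectrogram.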

Repeat with another file from `mini-genres`: `pop/pop.00004.au`

Load the file:

In [None]:
filename = "../data/mini-genres/pop/pop.00004.au"
x, fs = librosa.load(filename, duration=30)

Plot the waveform:

In [None]:
plt.figure(figsize=(12,6))
librosa.display.waveshow(x, sr=fs)  # waveshow replaces waveplot, which newer librosa releases removed
plt.title('Sample')
plt.tight_layout()
plt.show()

Make a playable widget for the file:

In [None]:
Audio(data=x, rate=fs)

Calculate and display the mel spectrogram, using a log scale for magnitude:

In [None]:
# calculate mel spectrogram:
hop_length = 256
S = librosa.feature.melspectrogram(y=x, sr=fs, n_fft=4096, hop_length=hop_length)
logS = librosa.power_to_db(abs(S))

# plot the mel spectrogram
plt.figure(figsize=(15, 5))
librosa.display.specshow(logS, sr=fs, hop_length=hop_length, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.show()

### Observations

What differences do you see between the wave plots for these two audio files? Do they differ in amplitude or envelope?

What differences do you see between the two audio files in the frequency domain (mel spectrogram)? For instance, can you see the difference between the pure tonal sound in the classical work (repeated intensities over time in the same frequency band) and the spectrogram information related to the vocals in the pop work (warbly lines in the spectrogram)?

## Audio Feature Extraction

In this section you will iterate over your downloaded audio files and calculate a number of audio statistics, saving the results in a pandas dataframe.

Here we have a function `extract_features()` that takes filename as an input and returns a list of audio stats calculated with librosa, including: 
  - MFCCs (5 coefficients)  
  - Chroma (5 features)
  - spectral centroid
  - spectral bandwidth
  - spectral roll-off
  - zero crossing rate
  
Each of these is averaged (using `np.mean`) across the complete audio file, to produce one set of features for the file. The resulting feature list contains 14 values.

In [None]:
def extract_features(filename):
    """Return 14 per-file averages: centroid, bandwidth, rolloff, ZCR, 5 chroma, 5 MFCCs."""
    y, sr = librosa.load(filename)

    # frame-wise spectral statistics and zero-crossing rate
    spec_cent = librosa.feature.spectral_centroid(y=y, sr=sr)
    spec_bw = librosa.feature.spectral_bandwidth(y=y, sr=sr)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)

    # average each statistic over all frames
    features = [np.mean(spec_cent), np.mean(spec_bw),
                np.mean(rolloff), np.mean(zcr)]

    # chroma with n_chroma bins
    chroma_stft = librosa.feature.chroma_stft(y=y, sr=sr, n_chroma=5)
    # MFCCs with n_mfcc coefficients
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=5)

    # one mean per chroma bin, then one mean per MFCC
    for c in chroma_stft:
        features.append(np.mean(c))
    for c in mfcc:
        features.append(np.mean(c))

    return features
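To build intuition for two of these features, the sketch below computes a spectral centroid and a zero-crossing rate by hand on a synthetic 440 Hz sine wave. It is an illustration under simplifying assumptions: the tone is made up, and the statistics are computed once over the whole signal, whereas librosa computes them frame by frame and `extract_features()` then averages the frames.

```python
import numpy as np

sr = 22050                              # librosa's default sampling rate
t = np.arange(sr) / sr                  # one second of sample times
tone = np.sin(2 * np.pi * 440.0 * t)    # synthetic 440 Hz sine, for illustration only

# Spectral centroid: magnitude-weighted mean frequency of the spectrum
magnitudes = np.abs(np.fft.rfft(tone))
freqs = np.fft.rfftfreq(len(tone), d=1.0 / sr)
centroid = np.sum(freqs * magnitudes) / np.sum(magnitudes)

# Zero-crossing rate: fraction of adjacent sample pairs whose sign differs
signs = np.signbit(tone)
zcr = np.mean(signs[1:] != signs[:-1])

print("centroid (Hz):", centroid)
print("zero-crossing rate:", zcr)
```

For a pure tone the centroid sits at the tone's frequency, and the zero-crossing rate is close to twice the frequency divided by the sampling rate. Noisier or brighter signals push both values up, which is why these features can help separate genres.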

Test this with single files from the mini-genres dataset:

In [None]:
feat1 = extract_features('../data/mini-genres/classical/classical.00001.au')
feat1

In [None]:
feat2 = extract_features('../data/mini-genres/pop/pop.00004.au')
feat2

### Extract Features from the dataset

For this part, you will loop over the 50 files in the `mini-genres` dataset, using the `extract_features()` function from above, and store the results in a pandas dataframe.

In [None]:
AUDIO_DIR = '../data/mini-genres/'
rows_list = []
for directory in os.listdir(AUDIO_DIR):
#     print(directory)
    dirpath = os.path.join(AUDIO_DIR, directory)
    if os.path.isdir(dirpath):
        for filename in os.listdir(dirpath):
#             print(filename)
            if filename.endswith(".au"):
                stats_dict = {}
                feats = extract_features(os.path.join(AUDIO_DIR, directory, filename))
                spec_cent, spec_bw, rolloff, zcr = feats[:4]
                chroma = feats[4:9]
                mfccs = feats[9:]
                
                stats_dict['c0'] = chroma[0]
                stats_dict['c1'] = chroma[1]
                stats_dict['c2'] = chroma[2]
                stats_dict['c3'] = chroma[3]
                stats_dict['c4'] = chroma[4]

                stats_dict['m0'] = mfccs[0]
                stats_dict['m1'] = mfccs[1]
                stats_dict['m2'] = mfccs[2]
                stats_dict['m3'] = mfccs[3]
                stats_dict['m4'] = mfccs[4]

                stats_dict['filename'] = filename
                stats_dict['genre'] = directory
                stats_dict['spec_cent'] = spec_cent
                stats_dict['spec_bw'] = spec_bw
                stats_dict['rolloff'] = rolloff
                stats_dict['zcr'] = zcr
                rows_list.append(stats_dict)

summary_stats = pd.DataFrame(rows_list)

In [None]:
summary_stats.head(15)

What do the summary stats look like for one of our files?

In [None]:
summary_stats[summary_stats['filename'] == 'pop.00006.au']

### Scale Features

We will use `StandardScaler` from `sklearn.preprocessing` to scale every feature to a mean of `0.0` and a standard deviation of `1.0` across the dataset.

In [None]:
scaler = StandardScaler()

Grab just the stats from our data frame:

In [None]:
just_stats = summary_stats[[ 'c0', 'c1', 'c2', 'c3', 'c4', \
                            'm0', 'm1', 'm2', 'm3', 'm4', \
                            'zcr', 'spec_bw', 'spec_cent', 'rolloff']]

Scale the stats using the scaler:

In [None]:
scaled_stats = scaler.fit_transform(just_stats)
scaled_stats[0]
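The scaling step is a per-column z-score: subtract the column mean and divide by the column standard deviation. The numpy sketch below shows the arithmetic on a small made-up feature matrix (the values are illustrative, not taken from the dataset).

```python
import numpy as np

# Toy feature matrix: 4 clips (rows) x 2 features (columns); values are made up
X = np.array([[1000.0, 0.05],
              [2000.0, 0.10],
              [3000.0, 0.15],
              [4000.0, 0.20]])

mean = X.mean(axis=0)
std = X.std(axis=0)             # population std (ddof=0), as StandardScaler uses
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))    # close to 0 for each column
print(X_scaled.std(axis=0))     # close to 1 for each column
```

Without this step, features measured in hertz (such as the spectral centroid) would dominate distance-based methods like UMAP and k-means simply because their raw values are larger.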

### Plot the Scaled Features

Let's plot the scaled features for the first file:

In [None]:
plt.bar(x=range(len(scaled_stats[0])), height=scaled_stats[0])
plt.show()

and see if they differ for another file (row 11 of the scaled features):

In [None]:
plt.bar(x=range(len(scaled_stats[11])), height=scaled_stats[11])
plt.show()

Note: features 0-4 are the chroma values, features 5-9 are the MFCCs, and features 10-13 are the zero-crossing rate, spectral bandwidth, spectral centroid, and spectral rolloff, in that order.

### Store features with Column names back in Data Frame

Let's add the column names back in and make a dataframe:

In [None]:
col_names = [ 'c0', 'c1', 'c2', 'c3', 'c4', \
              'm0', 'm1', 'm2', 'm3', 'm4', \
              'zcr', 'spec_bw', 'spec_cent', 'rolloff']

scaled_stats = pd.DataFrame(scaled_stats, columns = col_names)

In [None]:
scaled_stats['genre'] = summary_stats['genre']

In [None]:
scaled_stats.head()

## Display Features

In this section we will produce bivariate plots, colored by genre.

First we will group our stats by genre using `df.groupby()`:

In [None]:
groups = scaled_stats.groupby('genre')

Plot spectral bandwidth against spectral centroid, coloring by genre:

In [None]:
# create a plot 
fig, ax = plt.subplots(figsize=(10,10))
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling

# loop over groups
for name, group in groups:
    ax.plot(group.spec_cent, group.spec_bw, marker='o', linestyle='', label=name)
ax.legend()
plt.title('Spectral Centroid vs Spectral Bandwidth (by genre)', fontsize=16);
plt.show()

Plot spectral bandwidth against zero-crossing rate, coloring by genre:

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
for name, group in groups:
    ax.plot(group.zcr, group.spec_bw, marker='o', linestyle='', label=name)
ax.legend()
plt.title('Zero-Crossing Rate vs Spectral Bandwidth (by genre)', fontsize=16);
plt.show()

Plot spectral rolloff against zero-crossing rate, coloring by genre:

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
for name, group in groups:
    ax.plot(group.zcr, group.rolloff, marker='o', linestyle='', label=name)
ax.legend()
plt.title('Zero-Crossing Rate vs Spectral Rolloff (by genre)', fontsize=16);
plt.show()

Extension: Explore pairs of features across the dataset. Which features seem most useful for distinguishing between clips?
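As a starting point for this extension, the sketch below enumerates every pair of feature columns with `itertools.combinations`. The column list is repeated from the notebook so the snippet runs on its own.

```python
from itertools import combinations

col_names = ['c0', 'c1', 'c2', 'c3', 'c4',
             'm0', 'm1', 'm2', 'm3', 'm4',
             'zcr', 'spec_bw', 'spec_cent', 'rolloff']

# every unordered pair of distinct features
feature_pairs = list(combinations(col_names, 2))
print(len(feature_pairs))
print(feature_pairs[:3])
```

With 14 features there are 91 pairs. Each `(x_name, y_name)` pair could be fed to the same plotting loop used in this section, for example with `ax.plot(group[x_name], group[y_name], marker='o', linestyle='', label=name)`.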

## Dimensional reduction

This section uses the dimensional reduction technique UMAP to map the 14-dimensional feature data to a two-dimensional embedding. This produces two coordinates for each audio file, based on the feature data.

In [None]:
# !pip install umap-learn --user

In [None]:
import umap.umap_ as umap
import seaborn as sns

In [None]:
reducer = umap.UMAP()

In [None]:
embedding = reducer.fit_transform(scaled_stats.drop(['genre'], axis=1))
embedding.shape

In [None]:
scaled_stats['umap1'] = embedding[:,0]
scaled_stats['umap2'] = embedding[:,1]

In [None]:
scaled_stats.head()

## Clustering

In [None]:
from sklearn.cluster import KMeans
for_clustering = scaled_stats[['umap1', 'umap2']]
for_clustering.head()

In [None]:
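kmeans = KMeans(n_clusters=5, random_state=0).fit(for_clustering)
summary_stats['cluster'] = kmeans.labels_
# copy the UMAP coordinates over so later plots can group summary_stats and still read umap1/umap2
summary_stats['umap1'] = scaled_stats['umap1']
summary_stats['umap2'] = scaled_stats['umap2']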

In [None]:
summary_stats.head()

In [None]:
groups = summary_stats.groupby('cluster')

In [None]:
for name, group in groups:
    print("cluster num {}: {} items".format(name, len(group)))

### Plot by genre label

Using UMAP coordinates for x and y:

In [None]:
groups = summary_stats.groupby('genre')

fig, ax = plt.subplots(figsize=(10,10))
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
for name, group in groups:
    ax.plot(group.umap1, group.umap2, marker='o', linestyle='', label=name)
plt.title('UMAP projection of the Audio data (by Genre Label)', fontsize=16);
ax.legend()
plt.show()

### Plot by cluster label

Using UMAP coordinates for x and y:

In [None]:
groups = summary_stats.groupby('cluster')

fig, ax = plt.subplots(figsize=(10,10))
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
for name, group in groups:
    ax.plot(group.umap1, group.umap2, marker='o', linestyle='', label=name)
ax.legend()
plt.title('UMAP projection of the Audio data (by Cluster number)', fontsize=16);
plt.show()

Alternative way to plot (using cluster as color with seaborn):

In [None]:
# plt.figure(figsize=(10,10))
# plt.scatter(embedding[:, 0], embedding[:, 1], c=[sns.color_palette()[x] for x in summary_stats.cluster.to_list()])
# plt.gca().set_aspect('equal', 'datalim')
# plt.title('UMAP projection of the Audio data (by Cluster)', fontsize=16);
# plt.show()

## Extensions
- Evaluating:
  - Plot both the genre label and the k-means cluster on the same graph, evaluating how well the clustering based on our features detected the distinct genres.
  - Evaluate the accuracy of the k-means clustering, comparing cluster labels to genre labels.
- Features:
   - Try with a different number of MFCCs (20) and Chroma (12), repeating the exercise. How does this change your results?
   - Try with other features
- Clustering:
  - Try with a different clustering method (affinity propagation, HDBSCAN)
- How could we improve the results?
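
One possible way to approach the evaluation extension is cluster purity: assign each cluster the most common genre among its members, then measure the fraction of clips that match their cluster's majority genre. Purity is a choice made here for illustration, not a method from the notebook; the function name `cluster_purity` is hypothetical and the labels below are made up.

```python
from collections import Counter, defaultdict


def cluster_purity(genres, clusters):
    """Fraction of items whose genre matches the majority genre of their cluster."""
    members = defaultdict(list)
    for genre, cluster in zip(genres, clusters):
        members[cluster].append(genre)
    # for each cluster, count the items belonging to its most common genre
    matched = sum(Counter(g).most_common(1)[0][1] for g in members.values())
    return matched / len(genres)


# Made-up labels for illustration only (not results from the notebook)
genres = ['pop', 'pop', 'jazz', 'jazz', 'metal', 'metal']
clusters = [0, 0, 1, 1, 1, 2]
print(cluster_purity(genres, clusters))
```

In the notebook this could be called as `cluster_purity(summary_stats['genre'], summary_stats['cluster'])`. Purity rises as the number of clusters grows, so it is most meaningful when the cluster count matches the genre count, as it does with `n_clusters=5`.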

## References
* For a more comprehensive approach to genre recognition, see [Tzanetakis and Cook 'Musical Genre Classification of Audio Signals'](https://pdfs.semanticscholar.org/4ccb/0d37c69200dc63d1f757eafb36ef4853c178.pdf) from IEEE Transactions on Speech and Audio Processing 2002.
* GTZAN dataset (from the above paper): http://opihi.cs.uvic.ca/sound/genres.tar.gz
  * The dataset consists of 1000 audio tracks each 30 seconds long. It contains 10 genres namely, blues, classical, country, disco, hiphop, jazz, reggae, rock, metal and pop. Each genre consists of 100 sound clips.
* `mini-genres` is a smaller subset of the above audio tracks: http://opihi.cs.uvic.ca/sound/mini-genres.tar.bz2