DSC160 Data Science and the Arts - Twomey - Spring 2020 - [dsc160.roberttwomey.com](http://dsc160.roberttwomey.com)

# Exercise 2: Audio Classification with MFCCs

This exercise walks you through feature extraction and genre classification based on MFCCs, using audio frames extracted from two instances of distinct audio styles (classical and rap music).

It has two parts:
- [Part 1](#Part-1:-Genre-Recognition). In this part you will load two audio files as genre examples, extract MFCC features from each, and implement a simple SVM classifier.
  - [Step 1 - Load Files and Display Audio](#Step-1:-Load-Files-and-Display-Audio)
  - [Step 2 - Extract Features](#Step-2:-Extract-Features)
  - [Step 3 - Train a Classifier](#Step-3:-Train-a-Classifier)
  - [Step 4 - Run the Classifier](#Step-4:-Run-the-Classifier)
- [Part 2](#Part-2:-Extension). In this part you will extend the work from Part 1, either creating a new classifier using new genre examples and training examples from your own music collection, implementing a new classifier method, or experimenting with different features for the SVM and existing examples.
  - [Part 2A - Code for your Extension](#2A.-Code-for-Extension)
  - [Part 2B - Discussion of Results](#2B.-Discussion-of-Results)
  
Once you have completed both parts, submit your completed notebook as a PDF to Gradescope for grading.

Note: this is a simplified genre classification example. For a more comprehensive approach combining timbral, beat, and pitch features, see Tzanetakis and Cook, ['Musical Genre Classification of Audio Signals'](https://pdfs.semanticscholar.org/4ccb/0d37c69200dc63d1f757eafb36ef4853c178.pdf), IEEE Transactions on Speech and Audio Processing, 2002. Many of the techniques described in that paper can be implemented using librosa and our numpy/scipy toolkits for your own Project 1.

## Setup

Import necessary modules:

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn

import librosa
import librosa.display

from IPython.display import Audio

import requests
import os

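import sklearn
# import the submodules used below explicitly; `import sklearn` alone may not load them
import sklearn.preprocessing
import sklearn.svm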
import numpy as np

import pandas as pd

## Part 1: Genre Recognition
(50 points total)

This section walks you through four steps:
1. Loading and displaying audio files.
2. Extracting features from an audio signal.
3. Training a genre classifier.
4. Using the classifier to classify the genre of a song.

### Step 1: Load Files and Display Audio
(10 points)

We will use two audio pieces as exemplars of distinct audio genres:

- Johannes Brahms' ['Hungarian Dance #5 in G Minor'](https://www.youtube.com/watch?v=3X9LvC9WkkQ) (1885) 
- Busta Rhymes' ['Hits for Days feat. J Holiday'](https://www.youtube.com/watch?v=B6bt3gWLV5g) (2016) 
  
These have both been added to the course repository in an `audio` subdirectory under the current path:
- `audio/brahms_hungarian_dance_5.mp3`
- `audio/busta_rhymes_hits_for_days.mp3`

Using [`librosa.load`](https://librosa.github.io/librosa/generated/librosa.core.load.html), load the first 120 seconds of Brahms:

In [None]:
# your code here

Using `librosa.display.waveplot` (renamed `librosa.display.waveshow` in newer librosa releases), plot the time-domain waveform of the audio signal for Brahms:

In [None]:
# your code here

Using the IPython.display Audio class, play the audio file:

In [None]:
# your code here

Using [`librosa.feature.melspectrogram`](https://librosa.github.io/librosa/generated/librosa.feature.melspectrogram.html), [`librosa.power_to_db`](https://librosa.github.io/librosa/generated/librosa.core.power_to_db.html), and [`librosa.display.specshow`](https://librosa.github.io/librosa/generated/librosa.display.specshow.html), calculate and display the mel spectrogram with a logarithmic magnitude scale:

In [None]:
# calculate mel spectrogram
# convert spectrogram to log spectrogram with power_to_db

In [None]:
# show spectrogram

Repeat the above steps for the Busta Rhymes song. 

Load the file:

In [None]:
# your code here

Display the waveform:

In [None]:
# your code here

Play the audio file:

In [None]:
# your code here

Calculate and display the mel spectrogram with a logarithmic magnitude scale:

In [None]:
# calculate

# display

Do you notice any differences between the waveforms or spectrograms of the classical and rap songs? If so, what are they?

```YOUR RESPONSE HERE```

### Step 2: Extract Features

(10 points)

We are going to work with MFCCs. For each of your audio files (starting with Brahms), use [`librosa.feature.mfcc`](https://librosa.github.io/librosa/generated/librosa.feature.mfcc.html) to calculate the MFCCs.

(Note: you can experiment with `n_mfcc` to select a different number of coefficients, e.g. 12)

Start with Brahms, using 12 coefficients, inputting the Brahms time series and Brahms sample rate as the arguments to the mfcc function.

In [None]:
# your code here

Note: We transpose the result to accommodate scikit-learn which assumes that each row is one observation, and each column is one feature dimension:

In [None]:
mfcc_brahms = mfcc_brahms.T
mfcc_brahms.shape
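
As a rough sanity check on the shape printed above, the expected number of frames can be estimated by hand. The sketch below assumes librosa's default sample rate (22050 Hz), default hop length (512 samples), and centered framing, none of which are stated in this exercise; the placeholder array is hypothetical and stands in for real MFCC output.

```python
import numpy as np

# Assumed librosa defaults (not stated in the exercise): 22050 Hz sample rate,
# a hop length of 512 samples, and centered (padded) frames.
sr = 22050
hop_length = 512
duration_s = 120
n_mfcc = 12

n_samples = sr * duration_s
n_frames = 1 + n_samples // hop_length  # frame count under centered framing

# Hypothetical placeholder with the layout librosa returns: (n_mfcc, n_frames).
mfcc_placeholder = np.zeros((n_mfcc, n_frames))

# After transposing, each row is one frame (observation)
# and each column is one coefficient (feature).
mfcc_rows = mfcc_placeholder.T
print(mfcc_placeholder.shape, mfcc_rows.shape)
```

If your own shape differs, the file may be shorter than 120 seconds or a different sample rate or hop length may be in use.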

Using [`sklearn.preprocessing.StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), scale the features to have zero mean and unit variance:

In [None]:
# your code here

Verify that the scaling worked (e.g. do we have a mean close to zero and std deviation close to 1 for each feature?):

In [None]:
mfcc_brahms_scaled.mean(axis=0)

In [None]:
mfcc_brahms_scaled.std(axis=0)
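
The check above can be illustrated on synthetic data. This is a minimal sketch, assuming a hypothetical random table in place of real MFCCs; `fake_mfcc` and its column statistics are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for an (n_frames, n_features) MFCC table,
# with columns on very different scales.
fake_mfcc = rng.normal(loc=[-200.0, 50.0, 5.0],
                       scale=[40.0, 10.0, 2.0],
                       size=(1000, 3))

scaler = StandardScaler()
fake_scaled = scaler.fit_transform(fake_mfcc)  # learn mean/std, then scale

# Each column should now have mean close to 0 and standard deviation close to 1.
print(np.allclose(fake_scaled.mean(axis=0), 0.0))
print(np.allclose(fake_scaled.std(axis=0), 1.0))
```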

Repeat these calculations for Busta Rhymes. Use [`librosa.feature.mfcc`](https://librosa.github.io/librosa/generated/librosa.feature.mfcc.html) to calculate the MFCCs, inputting the Busta time series and Busta sample rate as the arguments to the mfcc function.

In [None]:
# your code here

Note: Transpose the result to accommodate scikit-learn which assumes that each row is one observation, and each column is one feature dimension:

In [None]:
mfcc_busta = mfcc_busta.T
mfcc_busta.shape

Scale the resulting MFCC features to have approximately zero mean and unit variance. Re-use the scaler from above.

In [None]:
# your code here

Check the mean and standard deviation of the scaled MFCCs for the second audio file. Because the scaler was fit on the Brahms features, these values may deviate somewhat from zero and one.

In [None]:
mfcc_busta_scaled.mean(axis=0)

In [None]:
mfcc_busta_scaled.std(axis=0)
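
When a fitted scaler is reused on a second data set, the scaled statistics reflect how that data differs from the data the scaler was fit on. The sketch below uses two hypothetical random tables (`table_a`, `table_b`) with deliberately different statistics to make the effect visible; real MFCC tables may differ by more or less.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical stand-ins for two (n_frames, n_features) tables.
table_a = rng.normal(loc=0.0, scale=1.0, size=(500, 4))
table_b = rng.normal(loc=3.0, scale=2.0, size=(500, 4))

scaler = StandardScaler().fit(table_a)  # statistics come from the first table only
b_scaled = scaler.transform(table_b)    # reuse the scaler; do not refit

# These are not 0 and 1, because table_b was scaled with table_a's statistics.
print(b_scaled.mean(axis=0))
print(b_scaled.std(axis=0))
```

Reusing one scaler keeps both classes in the same coordinate system, which is what the classifier needs; refitting per file would erase differences between the two recordings.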

### Step 3: Train a Classifier

(15 points)

Concatenate all of the scaled feature vectors into one feature table using [`np.vstack`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.vstack.html).

In [None]:
# your code here

In [None]:
features.shape

Construct a vector of ground-truth labels using [`np.concatenate`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html), where 0 refers to the first audio file, and 1 refers to the second audio file. (use `np.zeros` and `np.ones` for brahms and busta rhymes)

In [None]:
labels = np.concatenate((np.zeros(len(mfcc_brahms_scaled)), np.ones(len(mfcc_busta_scaled))))
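
The stacking and labeling pattern can be checked on a toy example. The arrays below are hypothetical placeholders, chosen small enough to inspect by eye.

```python
import numpy as np

# Hypothetical toy tables: 3 frames of class 0 and 2 frames of class 1,
# with 4 features each.
class0 = np.zeros((3, 4))
class1 = np.ones((2, 4))

# Rows are stacked in order, so the labels must follow the same order.
toy_features = np.vstack((class0, class1))
toy_labels = np.concatenate((np.zeros(len(class0)), np.ones(len(class1))))

print(toy_features.shape)
print(toy_labels)
```

The number of labels should always equal the number of rows in the feature table.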

Create a classifier model object using sklearn's Support Vector Machine [`sklearn.svm.SVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html):

In [None]:
# Support Vector Machine
model = sklearn.svm.SVC()

Train the classifier on your training features and labels using [`SVC.fit`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC.fit):

In [None]:
# your code here
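
For reference, the fit-then-predict pattern looks like this on synthetic data. This is only a sketch: the two well-separated 2D blobs and the names `toy_X`, `toy_y`, and `toy_model` are hypothetical and are not the exercise's MFCC features.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)

# Two hypothetical, well-separated clusters of 2D points.
blob0 = rng.normal(loc=-2.0, scale=0.5, size=(100, 2))
blob1 = rng.normal(loc=2.0, scale=0.5, size=(100, 2))

toy_X = np.vstack((blob0, blob1))
toy_y = np.concatenate((np.zeros(100), np.ones(100)))

toy_model = SVC()          # default settings (RBF kernel)
toy_model.fit(toy_X, toy_y)

# Predict the class of one point near each cluster center.
toy_pred = toy_model.predict(np.array([[-2.0, -2.0], [2.0, 2.0]]))
print(toy_pred)
```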

### Step 4: Run the Classifier

(15 points)

To test the classifier, we will extract an unused 10-second segment from each of the earlier audio files as test excerpts:

In [None]:
x_brahms_test, fs_brahms = librosa.load(filename_brahms, duration=10, offset=120)

In [None]:
x_busta_test, fs_busta = librosa.load(filename_busta, duration=10, offset=120)

Display the waveform and listen to the audio for both of the test excerpts using `librosa.display.waveplot` and `Audio`.

Start with Brahms test (wave plot and audio player):

In [None]:
# wave plot

# audio playback

Next Busta test (waveplot and audio player):

In [None]:
# wave plot

# audio playback

Compute MFCCs from both of the test audio excerpts, following the steps above, using `librosa.feature.mfcc`:

In [None]:
# calculate brahms test mfccs

In [None]:
# calculate busta rhymes test mfccs

Scale the test sample MFCCs using the previous scaler:

In [None]:
# your code here
mfcc_busta_test_scaled.shape

In [None]:
# your code here
mfcc_brahms_test_scaled.shape

Concatenate all test features together using `np.vstack`:

In [None]:
test_features = np.vstack((mfcc_brahms_test_scaled, mfcc_busta_test_scaled))

Concatenate all test labels together (using `np.concatenate`, with `np.zeros` for brahms and `np.ones` for busta rhymes):

In [None]:
test_labels = np.concatenate((np.zeros(len(mfcc_brahms_test_scaled)), np.ones(len(mfcc_busta_test_scaled))))

Compute the predicted labels using `model.predict`:

In [None]:
model.predict(test_features)

Finally, compute the accuracy score of the classifier on the test data using `model.score`, which takes the test features and test labels and compares the model's predicted labels against the test labels:

In [None]:
# your code here
score
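
It may help to see that the accuracy score is the fraction of rows (here, audio frames) whose predicted label matches the true label. The sketch below demonstrates this on hypothetical overlapping 2D blobs; all names and data are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

def make_blobs(n):
    """Return n points per class from two overlapping 2D Gaussians (hypothetical data)."""
    X = np.vstack((rng.normal(-1.0, 1.5, size=(n, 2)),
                   rng.normal(1.0, 1.5, size=(n, 2))))
    y = np.concatenate((np.zeros(n), np.ones(n)))
    return X, y

X_train, y_train = make_blobs(200)
X_test, y_test = make_blobs(50)

clf = SVC().fit(X_train, y_train)

acc_from_score = clf.score(X_test, y_test)             # built-in accuracy
acc_by_hand = np.mean(clf.predict(X_test) == y_test)   # fraction of matching labels
print(acc_from_score, acc_by_hand)
```

Because each row is scored separately, this is a frame-level accuracy, not a judgment about a whole song.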

Do you believe this classifier is performing well? If so, why? If not, why not?

```WRITE YOUR ANSWER HERE```

## Part 2: Extension
(50 points)

Extend this exercise in some aspect. Possible extensions include:
- Find a confounding example (e.g. a hip hop song that samples classical music), then calculate its MFCCs and classify them. What classification results do you find? Plot the label over time.
- Create a new genre classifier by repeating the steps above, but this time use training data and test data from your own audio collection representing two or more different genres. For what genres and audio data styles does the classifier work well, and for which (pairs of) genres does the classifier fail?
- Create a new genre classifier by repeating the steps above, but this time use a different machine learning classifier, e.g. random forest, Gaussian mixture model, Naive Bayes, k-nearest neighbor, etc. Adjust the parameters. How well do they perform?
- Create a new genre classifier by repeating the steps above, but this time use different features. Consult the [librosa documentation on feature extraction](http://librosa.github.io/librosa/feature.html) for different choices of features. Which features work well, and which do not?
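
For the "label over time" idea, per-frame predictions can be mapped to seconds and optionally smoothed. The following is a sketch under stated assumptions: it assumes librosa's default sample rate (22050 Hz) and hop length (512 samples), uses made-up frame labels in place of real `model.predict` output, and `majority_smooth` is a hypothetical helper, not a library function.

```python
import numpy as np

sr = 22050        # assumed librosa.load default sample rate
hop_length = 512  # assumed librosa.feature.mfcc default hop length

# Hypothetical per-frame predictions: mostly class 0, then mostly class 1, with noise.
frame_labels = np.array([0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1])

# Time in seconds at which each frame starts.
frame_times = np.arange(len(frame_labels)) * hop_length / sr

def majority_smooth(labels, width=3):
    """Replace each binary label with the majority vote in a centered window of odd width."""
    half = width // 2
    padded = np.pad(labels, half, mode="edge")  # repeat edge values at both ends
    return np.array([int(padded[i:i + width].mean() > 0.5)
                     for i in range(len(labels))])

smoothed = majority_smooth(frame_labels)
print(frame_times.round(3))
print(smoothed)
```

One option for plotting is `plt.step(frame_times, smoothed)`; in practice a much wider window (on the order of a second of frames) is likely to be more informative than the three-frame window used here.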

### 2A. Code for Extension

Write your code below with comments (25 points): 

### 2B. Discussion of Results

(25 points total)

Describe your goals for the extension (1 paragraph, 10 points):

```REPLACE THIS WITH YOUR DESCRIPTION OF YOUR EXTENSIONS GOALS``` 

Describe your results for the extension (1 paragraph, 10 points):

```REPLACE THIS WITH YOUR DESCRIPTION OF RESULTS```

Describe future directions and interesting research questions for this line of inquiry (1 paragraph, 5 points):

```REPLACE THIS WITH YOUR DESCRIPTION OF FUTURE DIRECTIONS```

## References
- Tzanetakis and Cook, ['Musical Genre Classification of Audio Signals'](https://pdfs.semanticscholar.org/4ccb/0d37c69200dc63d1f757eafb36ef4853c178.pdf), IEEE Transactions on Speech and Audio Processing, 2002.
- International Society for Music Information Retrieval (ISMIR) [https://ismir.net/](https://ismir.net/)
- LibROSA [https://librosa.github.io/librosa/](https://librosa.github.io/librosa/)
- SciPy 2015 Talk on Audio / MIR: https://www.youtube.com/watch?v=MhOdbtPhbLU
  - [website](https://bmcfee.github.io/) [paper](https://bmcfee.github.io/papers/scipy2015_librosa.pdf)
- Music Representation: https://musicinformationretrieval.com/audio_representation.html
