In [1]:
import numpy as np
from matplotlib import pyplot as plt
from scipy.io import wavfile

%matplotlib inline

In [2]:
import math
from scipy.fftpack import dct

### Exercise 0

1. Write a function that will compute a STFT of a single channel of audio.
2. Read in any one of the audio files provided and plot your results using the provided `plotSTFT()` function. (If you did this part correctly, you should get a plot very similar to what you obtain from `specgram()` in last week's lab.)

In [None]:
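def stft(x, N = 256, M = 128):
    ## this function should take in a signal x,
    ## and compute an N-point STFT with step size M
    ## with a Hamming window applied to each frame

    n = len(x)
    L = N - M                 # overlap between consecutive frames
    K = (n - N)//M + 1        # number of full frames (integer division)
    N2 = N//2
    S = np.zeros([N2+1, K+1]) # plotSTFT ignores the last column

    ## Your Code here
    window = np.hamming(N)    # the window has the same length as a frame
    start = 0
    while start + N <= n:     # only process full frames of N samples
        signal = x[start:start + N] * window

        start += M

    return S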

In [None]:
def plotSTFT(S):
    N = S.shape[0]
    tt = np.arange(S.shape[1])
    freq = np.hstack([np.arange(0., N)/N, 1])
    plt.pcolormesh(tt, freq, 10 * np.log10(S[:,:-1]))  # plot your spectrogram
    plt.axis('tight')
    plt.ylabel('frequency (normalized)')
    plt.xlabel('time (in frames)')
    

In [6]:
Fs, s = wavfile.read('data/chopin.wav')
S = stft(s[:,0])

plotSTFT(S) 
plt.title('My Spectrogram of Audio Signal')


## Mel Frequency Cepstrum

The Mel Frequency Cepstrum is an audio feature inspired by the human auditory response, and it is used successfully in many audio processing applications. We will need some math tools from scipy to implement this audio feature extractor.

The Mel Frequency Cepstral Coefficients (MFCC) are computed as follows:

1. Compute the Fourier Transform of the signal
2. Map power spectrum to the mel scale, using overlapping triangular windows (filter banks)
3. Take the log of the power in each mel band
4. Compute the Discrete Cosine Transform of the log mel spectrum computed above, as though it were a signal

The amplitudes obtained are the MFCC's.

### Converting from frequency to the mel scale

First, an important formula.

To convert a frequency of $f$ Hertz to $m$ Mel:

$$m = 2595 \log_{10}\left(1+\frac{f}{700}\right)$$
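
As a quick sanity check of this formula (a worked example added for illustration), a 1000 Hz tone should map to roughly 1000 Mel:

$$m = 2595 \log_{10}\left(1+\frac{1000}{700}\right) \approx 2595 \times 0.3854 \approx 1000$$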

### Exercise 1

1. Derive the formula to convert from Mel back to Hertz
2. Implement these 2 conversions as functions

In [None]:
def hz2mel(h):
    ## This function should output the Mel-scale value for a given input frequency
    ## h is a 1-d array of input frequencies in Hertz
    ## Your code here
    return 0

def mel2hz(m):
    ## This function should output the frequency in Hertz for a given Mel-scale value
    ## m is a 1-d array of input values in Mel
    ## Your code here
    return 0

## Mel-spaced filter banks

The Mel-spaced filter bank is a set of triangular filters whose spacing approximates the frequency resolution of human hearing. The following code generates **nfilt** filters using the end points supplied in the binPoints vector.

In [None]:
#filter bank generation
def filterBank(nfilt, NFFT, binPoints):
    # one row per filter, one column per FFT bin from 0 to NFFT/2
    fbank = np.zeros([nfilt, NFFT//2 + 1])
    for j in range(0, nfilt):
        for i in range(int(binPoints[j]), int(binPoints[j+1])):
            fbank[j,i] = (i - binPoints[j])/(binPoints[j+1]-binPoints[j]) #rising edge of the triangle
        for i in range(int(binPoints[j+1]), int(binPoints[j+2])):
            fbank[j,i] = (binPoints[j+2]-i)/(binPoints[j+2]-binPoints[j+1]) #falling edge of the triangle
    return fbank
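
To see the triangle shapes concretely, here is a small sketch that repeats the same construction on a toy case. The helper name `filter_bank_sketch` and the hand-picked bin edges are assumptions made for illustration; they are not derived from the Mel scale.

```python
import numpy as np

def filter_bank_sketch(nfilt, nfft, bin_points):
    # Same triangle construction as filterBank, restated so this example runs on its own.
    fbank = np.zeros((nfilt, nfft // 2 + 1))
    for j in range(nfilt):
        left, centre, right = bin_points[j], bin_points[j + 1], bin_points[j + 2]
        for i in range(int(left), int(centre)):
            fbank[j, i] = (i - left) / float(centre - left)     # rising edge
        for i in range(int(centre), int(right)):
            fbank[j, i] = (right - i) / float(right - centre)   # falling edge
    return fbank

# Hypothetical bin edges: filter 0 spans bins 0..4, filter 1 spans bins 2..6.
toy_bins = np.array([0., 2., 4., 6.])
toy_fbank = filter_bank_sketch(nfilt=2, nfft=10, bin_points=toy_bins)
print(toy_fbank)
```

Each row peaks at its centre bin, and neighbouring filters overlap by half their width, which is why nfilt filters need nfilt + 2 end points.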

Let's now look at how the Mel-spaced filter banks are applied to the power spectrum of an audio signal. The following code generates the filter banks given the sampling frequency, number of filters, and length of the FFT.

(You will need to finish the code for the previous section for this part to run.)

In [None]:
Fs, x = wavfile.read('data/WakeMeUp.wav')     # x stores the input audio signal, Fs is the sampling frequency
signal = x[:,0]

highfreq = Fs/2       # Nyquist frequency (half the sampling rate)
lowfreq = 0           # lower end of audio spectrum

lowmel = hz2mel(lowfreq)     # convert to Mel scales
highmel = hz2mel(highfreq)

NFFT = 512            # number of FFT points
nfilt = 20            # number of Mel spaced filter banks

# convert the frequency into Mel points
melPoints = np.linspace(lowmel, highmel, nfilt + 2)
binPoints = np.floor(mel2hz(melPoints)/Fs*(NFFT+1))

fbank = filterBank(nfilt, NFFT, binPoints)

Here's a visualization of the 20 filters in the frequency domain.

In [None]:
freq = np.arange(NFFT//2 + 1) * Fs / float(NFFT)    # centre frequency of each FFT bin, from 0 to Fs/2
plt.figure()
for i in np.arange(nfilt):
    plt.plot(freq, fbank[i,:])
plt.title('Filter Bank Responses')
ax = plt.gca()          # current axes (plt.axes() may create a new, empty one)
ax.set_xlabel('frequency')
ax.set_ylabel('amplitude')
ax.set_xlim([0, Fs/2.]);

Here we plot the power spectrum of one frame of the audio signal. This corresponds to one column of the spectrogram created in the previous lab.

In [None]:
T = 10000              # index of the first sample of the frame
N=300
w = np.hamming(N)      # generate a Hamming window of length 300 (not applied below)
s = signal[T:T+NFFT]   # frame of NFFT samples, used without windowing
f = np.square((1.0/NFFT) * np.absolute(np.fft.rfft(s)))   # power spectrum, NFFT/2+1 bins

plt.figure()
plt.plot(freq, f)
plt.title('Power Spectrum (PS) of Signal')
ax = plt.gca()
ax.set_xlabel('frequency')
ax.set_ylabel('magnitude')
ax.set_xlim([0, Fs/2.])

Finally, we see the result of applying the 12th filter bank to the window of the audio signal. To apply the filter, the filter and the PS of the signal are simply multiplied together. The plot below is the product of the filter and the PS of the window.

In [None]:
filtered_f = fbank*np.tile(f, (nfilt, 1))

plt.figure()
plt.plot(freq, filtered_f[11,:])
plt.title('PS of Signal Filtered by 12th Filter Bank')
ax = plt.gca()
ax.set_xlabel('frequency')
ax.set_ylabel('magnitude')
ax.set_xlim([0, Fs/2.])

However, what we actually need as features are the inner products of the PS with the filters, not the element-wise products. Each inner product sums one filtered spectrum into a single number, giving one *coefficient* per filter.

In [None]:
fbank_coefs = np.dot(fbank,f)
print(fbank_coefs)
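
The link between the element-wise product and the inner product can be checked on a toy case. The arrays below are made-up stand-ins for `fbank` and `f`, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Toy stand-ins (assumed values) for the filter bank and the power spectrum.
toy_fbank = np.array([[0.0, 0.5, 1.0, 0.5, 0.0, 0.0],
                      [0.0, 0.0, 0.0, 0.5, 1.0, 0.5]])
toy_ps = np.array([4.0, 2.0, 1.0, 2.0, 3.0, 6.0])

filtered = toy_fbank * toy_ps               # one filtered spectrum per row (broadcasting)
coefs_from_sum = filtered.sum(axis=1)       # sum each filtered spectrum over frequency
coefs_from_dot = np.dot(toy_fbank, toy_ps)  # the same numbers in a single step

print(coefs_from_sum)
print(coefs_from_dot)
```

Both lines print one value per filter, which is the energy of the spectrum weighted by that filter's triangle.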

### Generating the MFCC's

The MFCC's are the DCT of the log of the filter bank coefficients. For the window of the input audio signal, the coefficients are computed below.

In [None]:
melMapped = np.log(fbank_coefs) # take the log of the mel-spaced filter bank coefficients
mfcc = dct(melMapped.T, type = 2, axis = 0, norm = 'ortho') # take the Discrete Cosine Transform
print(mfcc)

plt.plot(mfcc)
plt.title('MFCC of window of "WakeMeUp.wav"')
ax = plt.gca()
ax.set_xlabel('coefficient index')
ax.set_ylabel('value')
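
To build some intuition for what the DCT does to the log energies, here is a minimal sketch of the orthonormal type-2 DCT written out with NumPy. The helper `dct2_ortho` is a hypothetical name introduced here; it is meant to agree with `dct(..., type = 2, norm = 'ortho')` for 1-d input, but treat that as an assumption to verify rather than a guarantee.

```python
import numpy as np

def dct2_ortho(x):
    # Orthonormal DCT-II of a 1-d array, written out explicitly.
    x = np.asarray(x, dtype=float)
    n_pts = len(x)
    n = np.arange(n_pts)
    k = n.reshape(-1, 1)
    basis = np.cos(np.pi * (n + 0.5) * k / n_pts)   # basis[k, n]
    scale = np.full(n_pts, np.sqrt(2.0 / n_pts))
    scale[0] = np.sqrt(1.0 / n_pts)
    return scale * basis.dot(x)

# A perfectly flat set of log energies (made-up values).
flat_log_energies = np.full(4, 3.0)
flat_coefs = dct2_ortho(flat_log_energies)
print(flat_coefs)
```

A flat input puts everything into coefficient 0 and leaves the rest at zero, so coefficient 0 mostly tracks overall level while the higher coefficients describe the shape of the spectrum.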

Compare the coefficients of a window of 'WakeMeUp.wav' to those of a window of a different file, 'chopin.wav', which is a different genre of music with different frequency content.

In [None]:
Fs2, x2 = wavfile.read('data/chopin.wav')     # x2 stores the input audio signal, Fs2 is the sampling frequency
signal2 = x2[:,0]

T = 10000
w2 = np.hamming(N)      # generate a Hamming window of length 300 (not applied below)
s2 = signal2[T:T+NFFT]
f2 = np.square((1.0/NFFT) * np.absolute(np.fft.rfft(s2)))   # power spectrum of the frame

fbank_coefs2 = np.dot(fbank,f2)   # reuses the filter bank built above

melMapped2 = np.log(fbank_coefs2) # take the log of the mel-spaced filter bank coefficients
mfcc2 = dct(melMapped2.T, type = 2, axis = 0, norm = 'ortho') # take the Discrete Cosine Transform

plt.plot(mfcc2)
plt.title('MFCC of window of "chopin.wav"')
ax = plt.gca()
ax.set_xlabel('coefficient index')
ax.set_ylabel('value')

### Exercise 2

With the information and code examples provided above, you should now be able to implement a function to compute the MFCC of a given signal. (Hint: you will need the functions that you have defined so far, and the other crucial code components are already provided above.)

In [None]:
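def calcMFCC(Fs, data, nfilt = 20):
    ## Fs is the sampling frequency and data is a single-channel signal
    ## nfilt is assumed to be the number of Mel-spaced filters (later cells pass 20 as a third argument)
    ## return a matrix with one row of MFCC's per frame
    #Your code here
    
    pass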

# Speaker Identification

In this exercise, we will use the MFCC feature to recognize the identity of a speaker. Put the two provided folders, train and test, in the same folder as your IPython notebook. There are 8 training data files and 8 testing data files, s1.wav to s8.wav. The training and testing data are already correctly labeled, i.e., s1.wav in the training folder matches s1.wav in the testing folder (same person speaking). A nearest-neighbor classifier will be used to label the testing data.

Each audio file in this dataset is mono-channel, so you do not need to split the files into two channels as in the previous part. However, the files have different lengths, which requires special attention when allocating the training dataset. For each audio file in the train folder, we create its MFCC matrix, where each row holds the 12 MFCC's of one frame. If your signal has 125 frames, for instance, the size of the MFCC matrix will be 125 by 12. (The training code below drops the first coefficient of each row, so the rows it stores have 11 entries.) The MFCC matrices of all audio files are stacked into one matrix called the **codebook**. At the same time, we also store the labels of the training data.

## Training

We first calculate the total number of frames of our whole dataset. This is used to initialize the codebook matrix and the label vector. Let's say you have only 2 audio files. The first file has 10 frames, and the second has 20 frames. Then the size of your **codebook** is 30 by 12. 

The length of your **labels** vector will be 30, in which the first 10 elements are *1* and the last 20 elements are *2*. For this exercise, we choose a frame length of 512 samples and a step size of one third of the frame length.
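
The bookkeeping described above can be sketched as follows. The helper `count_frames` is a hypothetical stand-in that reuses the frame-count formula from the `stft` skeleton; whether `mfcclab.get_num_frames` counts frames the same way is an assumption. The signal lengths are made up so that the two files yield 10 and 20 frames.

```python
import numpy as np

def count_frames(num_samples, frame_len, step):
    # Number of full frames, using the same formula as K in the stft skeleton.
    if num_samples < frame_len:
        return 0
    return (num_samples - frame_len) // step + 1

frame_len = 512
step = 512 // 3                   # 170 samples
toy_lengths = [2042, 3742]        # hypothetical signal lengths in samples
frames_per_file = [count_frames(n, frame_len, step) for n in toy_lengths]

# Label every frame with the 1-based index of the file it came from.
labels = np.repeat(np.arange(1, len(toy_lengths) + 1), frames_per_file)
print(frames_per_file, labels.shape)
```

Knowing the total number of frames up front is what allows the codebook and label arrays to be allocated once instead of being grown inside the loop.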

In [None]:
import mfcclab
# find the number of frames in advance
# this is to reduce dynamic array allocation 
# which slows down our code
N = 512
M = 512//3   # integer step size in samples
sumNumFrames = mfcclab.get_num_frames("train", N, M)

We define the range of frequencies for our filter banks in Hertz, *lowfreq* and *highfreq* (from 0 Hz to the Nyquist frequency). To create the end points for our filter banks, we transform this range into the Mel scale. The *for* loop loads all the training data in the **train** folder, computes the MFCC's, and appends them to the **codebook**. Since we know the labels of our data, this is an example of *supervised learning*, as opposed to an *unsupervised learning* paradigm such as K-means clustering, where the labels are unknown and must be learned.

In [None]:
highfreq = Fs/2 # Nyquist frequency (half the sampling rate)
lowfreq = 0 # lower end of audio spectrum

lowmel = hz2mel(lowfreq) # convert to Mel scales
highmel = hz2mel(highfreq)

NFFT = 512 # number of FFT points

nfilt = 20 # number of Mel spaced filter banks
# convert the frequency into Mel points
#melPoints = np.linspace(lowmel, highmel, nfilt + 2)
#binPoints = np.floor(mel2hz(melPoints)/Fs*(NFFT+1))

numKeptCoeff = 12 # keep MFCC indices 1 to 11 (index 0 is dropped by the slice in the loop below); the higher coefficients do not help speaker identification
#fbank = filterBank(nfilt, NFFT, binPoints) # create the Mel-spaced filter bank
codeBook = np.empty((sumNumFrames, numKeptCoeff-1)) # preallocate the training data
labels = np.zeros((sumNumFrames,)) # and the labels
N = 512 # length of a frame
M = 512//3 # step size (integer number of samples)
currentIdx = 0
for i in range(1,9): # we have 8 signals
    filename = "train/s%d.wav" % i
    fs, data = wavfile.read(filename)
    mfcc = calcMFCC(fs, data, 20)[:,1:numKeptCoeff]
    codeBook[currentIdx:currentIdx + mfcc.shape[0],:] = mfcc# assign this training data to codebook
    labels[currentIdx:currentIdx + mfcc.shape[0]] = i # and label them
    currentIdx = currentIdx + mfcc.shape[0] # move to next block in codeBook and labels

Since each codeword in the codebook has dimension 11, it is not possible to directly visualize the distribution of the codewords and their proximity to each other. However, tools exist that attempt to reduce the dimensionality of data while preserving the relative distances between data points. The following code uses a tool called Isomap (a non-linear dimension-reduction technique, unlike PCA) to plot the data in 2D to give some intuition of the distribution of the codewords. How well grouped are the codewords for each speaker?

In [None]:
mfcclab.plot_codebook_2D(codeBook, labels, ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"])

In [None]:
from sklearn import neighbors
n_neighbors = 5 # classify each frame using its 5 nearest neighbours
knn = neighbors.KNeighborsClassifier(n_neighbors, weights='distance')
knn.fit(codeBook, labels) 

In [None]:
filename = "test/s2.wav" # try loading one test wave file
fs, data = wavfile.read(filename)

mfcc = calcMFCC(fs, data, 20)

mfcc.shape

We will now predict the label of each frame you just computed. There will be some mislabeled data. 

In [None]:
l = np.round(knn.predict(mfcc[:,1:numKeptCoeff]))   # one predicted label per frame
print(l)

In [None]:
from collections import defaultdict # this is used to find the most frequent label
d = defaultdict(int)
for i in l:
    d[i] += 1    # count how many frames were assigned to each speaker
result = max(d.items(), key=lambda x: x[1])
print("Matching speaker: %d" % result[0]) # the first entry is the label, the second is the number of occurrences
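
The same majority vote can also be written with `collections.Counter` from the standard library. This is an alternative sketch, not the lab's method, and the per-frame labels below are made-up values standing in for the predictions `l`.

```python
from collections import Counter

# Hypothetical per-frame predictions, standing in for the array l above.
frame_labels = [2, 2, 5, 2, 7, 2, 5, 2]

# most_common(1) returns a list holding the (label, count) pair with the highest count.
label, votes = Counter(frame_labels).most_common(1)[0]
print("Matching speaker: %d (%d of %d frames)" % (label, votes, len(frame_labels)))
```

Reporting the share of frames that agree with the winner gives a rough sense of how confident the vote was.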

### Exercise 3 - Extra Credit

The current algorithm will always predict one of the known speakers, even when the speaker is not in the database. Propose and implement a modification that can detect when a speaker is not in the database. Record a short clip of your own voice, then briefly describe and comment on your results and observations.