In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import scipy.cluster

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

from sklearn.preprocessing import scale
from sklearn.linear_model import LogisticRegression

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.svm import SVC


import librosa
%matplotlib inline

import IPython.display as ipd
from scipy import signal, linalg   #use scipy.signal.hilbert to create an envelope of the audio signal
from scipy.signal import hilbert, chirp
from scipy.io import wavfile

import glob
import os

import wave

from keras import Input
from keras.models import Model   # `keras.engine` is not a public import path in recent Keras releases
from keras.utils import to_categorical
from keras.layers import Dense, TimeDistributed, Dropout, Bidirectional, GRU, BatchNormalization, Activation, LeakyReLU, \
    LSTM, Flatten, RepeatVector, Permute, Multiply, Conv2D, MaxPooling2D, UpSampling2D

# Encoding Audio with DFT - An Application for Speech Classification#
## EECS 16ML , Fall 2021 ##

Written by Tony Shara, Adi Ganapathi, Richard Shuai, Dohyun Cheon, Larry Yan

anthony.shara@berkeley.edu, avganapathi@berkeley.edu, 
richardshuai@berkeley.edu, dohyuncheon@berkeley.edu, yanlarry@berkeley.edu

### Table of contents ###
* [Introduction](#introduction)
    
    Fourier Features
* [Part 0: What are Fourier features?](#part0)
    
    Voice Recognition
* [Part 1: Collecting Data and preprocessing](#part1)
* [Part 2: Learning time features](#part2)
    - [PCA In the Time Domain](#part2a)
    - [Visualizing Clusters in the Time Domain](#part2b)
<!--     - [Testing Classifiers](#part2c) -->
* [Part 3: Learning Fourier features](#part3)
    - [Perform PCA on STFT representation of a Signal](#part3a)
    - [Visualizing Clusters in the STFT Domain](#part3b)
* [Part 4: Testing Classifiers](#part4)
    - [Logistic Regression](#part4a)
    - [Classification of Audio Signals with Nets](#part4b)
* [Part 5: Testing on your Own Voice](#part5)
    

<a id="introduction"></a>
# Introduction #
This lab will give you a new perspective on your data through the lens of the frequency domain by means of the Fourier Transform and variants of it. The Fourier Transform is a useful mathematical tool that takes a signal and breaks it down into its constituent frequencies.  This allows us to look at time-varying data (such as an audio signal) or spatial data (such as a 2D image) by looking at the strength of each frequency that is contributed to the data point rather than just its spatial or time dependent magnitudes (i.e., the magnitude of a waveform at some time t or the brightness of a pixel in an image). You will also learn the power of using the frequency domain as an intermediate representation of time-varying and spatial data for downstream applications such as classification of raw audio signals.

In this lab, we are particularly interested in audio signals. We will first cover some of the basics of the Fourier Transform and analyze it through a series of toy examples. We will then take this understanding and apply it to the much more unstructured problem of classifying spoken digits given only a raw audio signal as input. 


### Goals ###
The goals of this lab are to:
* Gain a practical understanding of the frequency domain representation of data through visualizations
* Gain an intuitive understanding of the benefits of using the frequency domain for certain types of data and applications
* Get hands on experience tackling a difficult real world problem using techniques learned in class

<a id="part0"></a>
# Part 0: Fourier Features in an Audio Signal #


A raw audio signal can look like a mess.  This is because an audio signal is a linear combination of many cosines, each of which can have its own amplitude, frequency, and phase.

A plot of raw audio data only tells us the amplitude of the signal at each sample in time; it does not show which frequencies are present.

### Encoding audio with DFT ###
Like all signals that computers can handle, an audio signal is a series of discrete audio samples, and because every audio signal has finite duration, it is aperiodic.  Because these audio signals are both discrete and aperiodic, we can look at the constituent frequencies of the signal by taking the Discrete Fourier Transform (DFT) of the signal.  At its core, the Discrete Fourier Transform is a linear combination of scaled complex exponentials.  Since the signal is aperiodic, we need to take into account every point in the signal to determine how it impacts each frequency.  Since the signal is discrete, we sum over each of the scaled complex exponentials.

##### Discrete Fourier Transform (DFT): $$X[k] = \sum_{n = 0}^{N-1} x(n) e^{-i\frac{2\pi}{N}nk}, \quad k = 0, 1, \dots, N-1$$

Notice that when taking the DFT of a signal, the $n$th complex exponential is scaled by the $n$th sample of the signal.  Since our signal can be expressed as a vector, this looks a lot like a matrix multiplication.  In fact it is!  Multiplying by a Vandermonde matrix is another way to compute the DFT of a signal in matrix-vector form.

##### Vandermonde matrix: 

$$W = 
\frac{1}{\sqrt{N}}
\begin{bmatrix}
1&1&1&1& ... &1\\
1&\omega&\omega^2&\omega^3& ... &\omega^{N-1} \\
1&\omega^2&\omega^4&\omega^6& ...& \omega^{2(N-1)}\\
1&\omega^3&\omega^6&\omega^9& ... &\omega^{3(N-1)}\\
.&.&.&.&.  & .\\
.&.&.&.&.  & .\\
.&.&.&.&. & .\\
1&\omega^{N-1}&\omega^{2(N-1)}&\omega^{3(N-1)}& ... &\omega^{(N-1)(N-1)}
\end{bmatrix}, \quad \omega = e^{-\frac{i2\pi}{N}}
$$


If $x$ is a discrete, time-varying signal of length $N$, then $Wx$ is the DFT of $x$ (scaled by $\frac{1}{\sqrt{N}}$).
The Fast Fourier Transform (FFT) is an algorithm that computes $Wx$ in $O(N \log N)$ time rather than the $O(N^2)$ of a direct matrix-vector product by leveraging symmetries in the matrix $W$.
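
To make the matrix view concrete, here is a minimal sketch that builds $W$ explicitly and compares it with NumPy's FFT. It assumes the sign convention $\omega = e^{-i2\pi/N}$ (the one `np.fft.fft` uses) together with the $\frac{1}{\sqrt{N}}$ scaling, which corresponds to `norm="ortho"`. The helper name `dft_matrix` is ours and is not part of NumPy.

```python
import numpy as np

def dft_matrix(N):
    """Return the N x N unitary DFT matrix (hypothetical helper, not a NumPy function)."""
    n = np.arange(N)
    # Entry (k, n) is omega^(k*n) with omega = exp(-2*pi*i / N), scaled by 1/sqrt(N).
    return np.exp(-2j * np.pi * np.outer(n, n) / N) / np.sqrt(N)

rng = np.random.default_rng(0)
x = rng.standard_normal(16)          # a short random test signal

W = dft_matrix(16)
X_matrix = W @ x                     # O(N^2) matrix-vector product
X_fft = np.fft.fft(x, norm="ortho")  # O(N log N) FFT with the same scaling

max_err = np.max(np.abs(X_matrix - X_fft))
is_unitary = np.allclose(W.conj().T @ W, np.eye(16))
print("largest difference between W @ x and the FFT:", max_err)
print("W is unitary:", is_unitary)
```

With this scaling $W$ is unitary, so the inverse DFT is simply its conjugate transpose. Building $W$ explicitly is only practical for small $N$; for real audio we rely on the FFT.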

### Working with audio files ###

Q1: 
Play the audio file below.  Does anything stand out about the pitch of the most pronounced sound as the file plays?  How might we be able to use the DFT to learn about these features?

# Fourier Transforms of waves #

When looking at waves, what are some important characteristics that make them stand out? To answer this question, let's break down the different components of a wave.

<br><br>
<img src="visuals/wave_visual.gif" alt="Drawing" style="width: 50em;"/>

### amplitude ###
* What is the amplitude of a wave?
* What does the amplitude at a given time tell us?

### wavelength / frequency ###
* What is the wavelength of a wave?
* What is the frequency of a wave?
* Why are the wavelength and frequency of a wave important?
* What new information can we gain by looking at the frequency of the wave?
* How can we look at our frequencies? FFT!

## Frequency Spectrum ##

When talking about signals in the frequency domain, we generally refer to the frequency spectrum of the signal. The frequency spectrum is a plot of the magnitude of the Fourier Transform of the signal over all frequencies from $-\infty$ to $\infty$, but in practice we restrict the domain to the relevant frequencies. Since the magnitude of the Fourier Transform of a real-valued signal is symmetric about the y-axis, we only need to look at non-negative frequencies.
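
The symmetry claim can be checked numerically. The sketch below uses a made-up two-tone signal (the sampling rate and tone frequencies are illustrative assumptions), verifies that the magnitude spectrum is mirrored, and shows that `np.fft.rfft` returns just the non-negative half, with `np.fft.rfftfreq` mapping bin indices to frequencies in Hz.

```python
import numpy as np

fs = 100.0                        # assumed sampling rate in Hz
t = np.arange(200) / fs           # 2 seconds of samples
x = np.cos(2 * np.pi * 5 * t) + 0.5 * np.cos(2 * np.pi * 12 * t)

mag = np.abs(np.fft.fft(x))       # full magnitude spectrum, N bins

# For a real signal |X[k]| == |X[N - k]|, so bins 1..N-1 read the same in reverse.
symmetric = np.allclose(mag[1:], mag[1:][::-1])

half = np.abs(np.fft.rfft(x))                 # only bins 0..N/2
freqs = np.fft.rfftfreq(len(x), d=1 / fs)     # frequency in Hz of each kept bin
peak_hz = freqs[np.argmax(half)]

print("magnitude spectrum is symmetric:", symmetric)
print("strongest frequency (Hz):", peak_hz)
```

Slicing the output of `np.fft.fft` in half, as the cells below do, keeps essentially the same information as `np.fft.rfft`.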

## Simple Signals ##

As we discussed above, it can be useful to talk about a signal in terms of the energy of its constituent frequencies rather than the amplitude over time.  Let's look at the simplest example of a signal: a cosine wave. The code below plots this wave. Fill in the section to compute the FFT of the signal and plot its magnitude to visualize the frequency content of the signal.

In [None]:
sig = np.cos(np.arange(0, 20, 0.2))
plt.plot(sig)
plt.show()

##BEGIN CODE##
# Remember, the frequency response is symmetric about the y axis, so we 
# really only care to look at positive frequencies (w > 0)
fft = np.fft.fft(sig)[:50]
fft = np.abs(fft)
plt.plot(fft)
plt.show()
##END CODE##



In [None]:
sig = 2*np.cos(np.arange(0, 20, 0.2)*2)
plt.plot(sig)
plt.show()

##BEGIN CODE##
fft = np.fft.fft(sig)[:50]
fft = np.abs(fft)
plt.plot(fft)
plt.show()
##END CODE##

## Complex Signals ##


In [None]:
cos1 = np.cos(np.arange(0, 20, 0.2))
cos2 = 2*np.cos(np.arange(0, 20, 0.2)*2)
cos3 = 8*np.cos(np.arange(0, 20, 0.2)*4)
sig = cos1 + cos2 + cos3
plt.plot(sig)
plt.show()

##BEGIN CODE##
fft = np.fft.fft(sig)[:50]
fft = np.abs(fft)
plt.plot(fft)
plt.show()
##END CODE##

Notice above that our signal is the sum of 3 distinct cosines. Is this evident when looking at the Fourier transform of the signal?  Explain.

<span style="color:blue">**A**: Yes; we see a spike which corresponds to each cosine.  Cosines with a larger amplitude contribute more energy to their respective frequencies. </span>
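
One way to check this answer numerically is sketched below. It assumes cosine frequencies chosen to land exactly on DFT bins (the cosines in the cell above do not, so their peaks are smeared across neighboring bins). Under that assumption, a cosine of amplitude $A$ produces a peak of height $AN/2$, so the amplitudes can be read back from the spectrum. The bin indices and amplitudes are illustrative choices.

```python
import numpy as np

N = 256
n = np.arange(N)

# Map from DFT bin index to cosine amplitude (illustrative values).
amps = {4: 1.0, 8: 2.0, 16: 8.0}
x = sum(a * np.cos(2 * np.pi * k * n / N) for k, a in amps.items())

spectrum = np.abs(np.fft.rfft(x))

# The three largest bins should be exactly the three cosine frequencies.
top_bins = sorted(np.argsort(spectrum)[-3:].tolist())

# A bin-centered cosine of amplitude A gives |X[k]| = A * N / 2.
recovered = {k: 2 * spectrum[k] / N for k in amps}

print("largest bins:", top_bins)
print("recovered amplitudes:", recovered)
```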

<!-- ## Fourier Series ##

While certain signals can be messy and hard to understand in their raw form, breaking down the signal into a combination of many simpler signals could help us better understand the seemingly complex signal. It turns out we can express any wave as a linear combination of sines and cosines.  This mathematical expression is called the <b>Fourier Series</b>.

$$ X[k] = \sum_{n = 0}^{N-1} x(n) e^{-i\frac{2\pi}{N}nk} $$

As we saw from above, each cosine contributed some energy to a frequency coresponding to the frequency of the wave!  This means that TODO: Tony

                # DO WE WANT THIS ^^^^ # -->

## Sound Waves ##



In [None]:
hz100 = "test-tones/100hz.wav"

fs, data = wavfile.read(hz100)

t = np.arange(data.shape[0]) / fs   # time axis in seconds, exactly one entry per sample

plt.plot(t, data)
plt.show()

ipd.Audio("test-tones/100hz.wav")


In [None]:
F = np.fft.fft(data)

#magnitude spectrum
plt.plot(np.abs(F[:11025]))
plt.show()

## Complex Sound Wave ##

In [None]:
ipd.Audio("SpiritInTheSky.wav")

Encode SpiritInTheSky.wav with DFT and plot the frequency response of the signal.  Then compare it to the plot of the original signal of SpiritInTheSky.wav without DFT encoding. (Hint: `np.fft.fft` might be useful.)

In [None]:
# fs is the sampling rate of the wav file in frames per second
# data is the audio sample
fs, data = wavfile.read("SpiritInTheSky.wav")
data = data[:,0]

# TODO: Encode SpiritInTheSky.wav with DFT and plot the frequency response of the signal
F = np.fft.fft(data)
F = F[:int(F.shape[0]/2)]      #only look at first half of data since it is symmetric
F = np.abs(F)
plt.plot(F)
# plt.magnitude_spectrum(data, Fs=fs)
plt.show()

plt.plot(data)
plt.show()


# Speech Classification #

In the previous part we looked at audio files with and without DFT encodings and saw that the DFT encoding allows you to learn about the energy of different frequencies throughout the sound.  Now that we have a way to look at the different frequencies in a controlled setting, let's try to learn about the frequencies in a person's voice! We can then train a machine learning model which uses these frequencies to classify what they are saying!

In this part, we will use the open source [free spoken digits dataset](https://github.com/Jakobovski/free-spoken-digit-dataset) to train a variety of classification algorithms to recognize digits in human speech.  This dataset contains 50 recordings of each spoken digit (0 - 9) from each of 6 different speakers.  Your job is to classify spoken digits in an audio signal.  

<a id="part1"></a>
# Part 1: Collecting Data and Preprocessing #


Collecting and preprocessing data is perhaps the most important and difficult part of being a machine learning or data science practitioner. Before we begin applying algorithms, we need to make sure our data is in a structured and interpretable format so that we can easily repurpose it down the road. Before we get started, let's look at a few of our labels.  Each label in our dataset represents a unique number between 0 and 9.  We will use this data to recognize a unique word, in this case, a number. Each row of our data matrix will consist of the time domain representation of a single spoken digit. Remember, we must make sure that the sampling rate of each data point is the same to ensure the correct dimensionality and consistency. It turns out that the sampling rate required to fully encapsulate a continuous signal is closely tied to the frequencies present in the raw signal. See the supporting note for more details on this. Luckily, the dataset provided has already done this, but when we collect our own test data later, we will have to do this manually.
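
The link between sampling rate and frequency content can be seen in a small experiment. A signal sampled at rate $f_s$ can only represent frequencies up to $f_s/2$ (the Nyquist limit); anything higher shows up at the wrong frequency. The sketch below is illustrative only: the tone frequency, sampling rates, and the helper `dominant_frequency` are assumptions made for this example.

```python
import numpy as np

def dominant_frequency(tone_hz, fs, duration=1.0):
    """Sample a cosine at rate fs and return the strongest frequency the DFT reports (Hz)."""
    t = np.arange(int(fs * duration)) / fs
    x = np.cos(2 * np.pi * tone_hz * t)
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    return freqs[np.argmax(mag)]

# 70 Hz is below the Nyquist limit of 100 Hz when sampling at 200 Hz.
well_sampled = dominant_frequency(70, fs=200)

# 70 Hz is above the Nyquist limit of 50 Hz when sampling at 100 Hz,
# so the tone aliases to 100 - 70 = 30 Hz.
aliased = dominant_frequency(70, fs=100)

print("sampled at 200 Hz, the tone appears at", well_sampled, "Hz")
print("sampled at 100 Hz, the tone appears at", aliased, "Hz")
```

This is also why every recording must share one sampling rate: the same DFT bin index corresponds to a different physical frequency when the rate changes.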

### Dataset Organization and Train/Test Split ###

In order to eventually classify our data, we need to properly organize our dataset so that we can simply plug and play with different classifiers. This includes creating a train/test split of our data so that we can train on a portion of the data and test on the remaining portion. In the most realistic setting, we won't have training data on the speaker we encounter at test time, so let's limit our training data to only 5 of the speakers and leave out one speaker for the test set. We will also take the time here to compute all the different representations of our data that we will end up using later in the notebook.
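
The leave-one-speaker-out idea can be sketched on filenames alone, before any audio is loaded. The example assumes the `{digit}_{speaker}_{index}.wav` naming that the loading code below relies on; the file list and the helper `split_by_speaker` are hypothetical and only illustrate the split.

```python
def split_by_speaker(filenames, test_speaker):
    """Split (filename, digit) pairs so that one speaker is held out entirely."""
    train, test = [], []
    for fname in filenames:
        if not fname.endswith(".wav"):
            continue                                   # skip anything that is not audio
        digit, speaker = fname[:-len(".wav")].split("_")[:2]
        target = test if speaker == test_speaker else train
        target.append((fname, int(digit)))
    return train, test

# Hypothetical directory listing used only for illustration.
example_files = [
    "0_george_0.wav",
    "3_theo_12.wav",
    "7_jackson_4.wav",
    "9_theo_49.wav",
    "notes.txt",
]

train_files, test_files = split_by_speaker(example_files, test_speaker="theo")
print("train:", train_files)
print("test: ", test_files)
```

Holding out a whole speaker, rather than a random subset of recordings, keeps the test voice out of the training data entirely.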

In [None]:
# Fill in the code to split the data into a training set and a test set

"""
train_X: training set
test_X: test set

train_y: training labels (1 hot encoding)
train_y_1d: training labels

test_y: test labels (1 hot encoding)
test_y_1d: test labels
"""

DATA_DIR = "recordings/"

#Feel free to change the test speaker for each round of training/testing so that we can perform k-fold validation
test_speaker = 'theo' 
train_X = []
train_spectrograms = []

train_y = []

test_X = []
test_spectrograms = []

test_y = []

pad1d = lambda a, i: a[0: i] if a.shape[0] > i else np.hstack((a, np.zeros(i - a.shape[0])))
pad2d = lambda a, i: a[:, 0: i] if a.shape[1] > i else np.hstack((a, np.zeros((a.shape[0],i - a.shape[1]))))

for fname in os.listdir(DATA_DIR):
    try:
        if '.wav' not in fname or 'dima' in fname:
            continue
        struct = fname.split('_')
        digit = struct[0]
        speaker = struct[1]
        wav, sr = librosa.load(DATA_DIR + fname)
        padded_x = pad1d(wav, 30000)
        spectrogram = np.abs(librosa.stft(wav))
        padded_spectogram = pad2d(spectrogram,40)
        # Separate your data into a train and test set.  
        # The test set should only include data samples collected from test_speaker.
        # The training set should include the rest of the data.
        ###BEGIN CODE
        if speaker == test_speaker:
            test_X.append(padded_x)
            test_spectrograms.append(padded_spectogram)
            test_y.append(digit)
        else:
            train_X.append(padded_x)
            train_spectrograms.append(padded_spectogram)
            train_y.append(digit)
        ###END CODE###
    except Exception as e:
        print(fname, e)
        raise

train_X = np.vstack(train_X)
#De-mean your train set
###BEGIN CODE###
train_X = (train_X - np.mean(train_X))/np.std(train_X)
###END CODE###
train_spectrograms = np.array(train_spectrograms)

train_y = to_categorical(np.array(train_y))
train_y_1d = np.zeros(train_y.shape[0])
for i in range(train_y.shape[0]):
    val = np.where(train_y[i] == 1)
    train_y_1d[i] = val[0][0]
    
    
test_X = np.vstack(test_X)
#De-mean your test set
###BEGIN CODE###
test_X = (test_X - np.mean(test_X))/np.std(test_X)
###END CODE###
test_spectrograms = np.array(test_spectrograms)

test_y = to_categorical(np.array(test_y))
test_y_1d = np.zeros(test_y.shape[0])
for i in range(test_y.shape[0]):
    val = np.where(test_y[i] == 1)
    test_y_1d[i] = val[0][0]

print('train_X:', train_X.shape)
print('train_spectrograms:', train_spectrograms.shape)
print('train_y:', train_y.shape)
print('train_y_1d: ', train_y_1d.shape)

print('test_X:', test_X.shape)
print('test_spectrograms:', test_spectrograms.shape)
print('test_y:', test_y.shape)
print('test_y_1d: ', test_y_1d.shape)


In [None]:
# Dataset of only 3 distinct digits.  Feel free to play around with these labels (digits)
digits = [2, 7, 9]
train_X_small = []
train_y_small = []
for i in range(train_y_1d.shape[0]):
    if train_y_1d[i] == digits[0] or train_y_1d[i] == digits[1] or train_y_1d[i] == digits[2] :
        train_X_small.append(train_X[i])
        train_y_small.append(train_y_1d[i])
train_X_small = np.vstack(train_X_small)
train_y_small = np.array(train_y_small)

test_X_small = []
test_y_small = []
for i in range(test_y_1d.shape[0]):
    if test_y_1d[i] == digits[0] or test_y_1d[i] == digits[1] or test_y_1d[i] == digits[2] :
        test_X_small.append(test_X[i])
        test_y_small.append(test_y_1d[i])
test_X_small = np.vstack(test_X_small)
test_y_small = np.array(test_y_small)

train_X = train_X_small
train_y_1d = train_y_small

test_X = test_X_small
test_y_1d = test_y_small

plt.plot(train_X.T)
plt.show()

print("train_X: ", train_X.shape)
print("train_y_1d: ", train_y_1d.shape)

# Enveloping data for time domain clustering
Since we want to learn about patterns in the amplitude of the audio signal over time, we can envelope our signal.  By enveloping, we are simply tracing the magnitudes of our wave and therefore focusing our attention on the general shape of the audio signal.

What is the motivation behind enveloping our time domain signal?  Why can't we do PCA on the unprocessed signal?
TODO: ANSWER

In [None]:
# here is an example of enveloping
duration = 1.0
fs = 400.0
samples = int(fs*duration)
t = np.arange(samples) / fs

# named `chirp_signal` so that it does not shadow the `scipy.signal` module imported above
chirp_signal = chirp(t, 20.0, t[-1], 100.0)
chirp_signal *= (1.0 + 0.5 * np.sin(2.0*np.pi*3.0*t))

analytic_signal = hilbert(chirp_signal)
amplitude_envelope = np.abs(analytic_signal)

plt.plot(t, chirp_signal, label='signal')
plt.plot(t, amplitude_envelope, label='envelope')
plt.legend()
plt.show()

In [None]:
# Compute the amplitude envelope of every row of the data matrix:
# the envelope is the magnitude of the analytic signal returned by scipy.signal.hilbert

# hilberts = []

# for i in range(X.shape[0]):
#     analytic_signal = hilbert(X[i])
#     amplitude_envelope = np.abs(analytic_signal)
#     hilberts.append(amplitude_envelope)
# hilberts = np.vstack(hilberts)
# processed_X = hilberts

def envelope(X):
    enveloped_X = []
    print(X.shape)
    for i in range(X.shape[0]):
        analytic_signal = hilbert(X[i])
        amplitude_envelope = np.abs(analytic_signal)
        enveloped_X.append(amplitude_envelope)
    enveloped_X = np.vstack(enveloped_X)
    print(enveloped_X.shape)
    return enveloped_X

processed_X = envelope(train_X)

plt.plot(processed_X.T)
plt.show()

print(processed_X.shape)

In [None]:
processed_X = processed_X[:,:23000]

<a id="part2"></a>
# Part 2: Visualizing Our Data in Lower Dimensions  #

<a id="part2a"></a>
### Part 2a: PCA in the Time Domain ###

We will start off by trying to visualize our data in the time domain format that it has been provided. To do this, it would be helpful to first perform PCA on the processed data matrix containing the enveloped signals. This will allow us to extract the most important dimensions of our data. We can then project our data onto these dimensions to visualize their differences.
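
For reference, here is a minimal PCA-via-SVD sketch on random toy data. Two choices in it are assumptions rather than requirements of this lab: it mean-centers the columns before the SVD (the textbook form of PCA), and it uses `full_matrices=False` so that only the needed rows of $V^T$ are computed. The helper name `pca_project` and the toy data are hypothetical.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto its top-k principal components."""
    mean = X.mean(axis=0)
    X_centered = X - mean                       # PCA is defined on mean-centered data
    # Economy SVD: Vt has min(n_samples, n_features) rows instead of n_features.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    basis = Vt[:k].T                            # columns are the principal directions
    explained = S[:k] ** 2 / np.sum(S ** 2)     # fraction of variance per component
    return X_centered @ basis, basis, mean, explained

rng = np.random.default_rng(0)
toy_X = rng.standard_normal((30, 500))          # 30 samples, 500 features

toy_proj, toy_basis, toy_mean, toy_explained = pca_project(toy_X, k=3)
print("projected shape:", toy_proj.shape)
print("variance explained by each component:", toy_explained)
```

When there are far fewer samples than features, as with our 23,000-sample signals, the economy SVD avoids building a very large square matrix while giving the same leading components.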

In [None]:
# Perform PCA on the processed data matrix (you may use numpy's SVD routine).
# Plot the singular values and principal components of the new basis
# Project the data onto the basis

###BEGIN CODE###
# full_matrices=False returns only the first min(m, n) rows of Vt, which avoids
# allocating a 23000 x 23000 matrix; the leading principal components are the same
U_t,S_t,Vt_t = np.linalg.svd(processed_X, full_matrices=False)

plt.stem(S_t)
plt.show()

# selecting principal components (any number can be chosen; we use 3 for now)
new_basis_t = np.array([Vt_t[0], Vt_t[1], Vt_t[2]]).T        
plt.plot(new_basis_t)

# Project the data onto the new basis
proj_X_t = np.dot(processed_X, new_basis_t)
###END CODE###



In [None]:
proj_X_t = proj_X_t[:,:15000]

<a id="part2b"></a>
### Part 2b: Visualizing Clusters in the Time Domain ###

Here, we will write a function to visualize our data after it has been projected onto the principal component basis. We will make use of this function later when visualizing our data under different representations. Since PCA is inherently linear, it will be difficult to cluster 10 distinct spoken digits well given the severe non-linearity of the data. Thus, it will likely be more insightful to only visualize a subset of the data (i.e. only 2 or 3 different digits rather than all 10).

In [None]:
# Here we visualize our clusters by color coding the labels


def get_PCA_clusters(X, y, labels):
    """
    inputs:
    X - a matrix which is the projection of your dataset onto your new basis
    y - np.array containing your labels
    labels - a list of three desired labels (same type as the entries of y)
    
    outputs:
    label0 - an array containing points in X corresponding to labels[0]
    label1 - an array containing points in X corresponding to labels[1]
    label2 - an array containing points in X corresponding to labels[2]
    """
    label0 = []
    label1 = []
    label2 = []

    ###BEGIN CODE###
    for i in range(y.shape[0]):
        if y[i] == labels[0]:
            label0.append(X[i])
        if y[i] == labels[1]:
            label1.append(X[i])
        if y[i] == labels[2]:
            label2.append(X[i])
    label0 = np.vstack(label0)
    label1 = np.vstack(label1)
    label2 = np.vstack(label2)
    ###END CODE###      
    return label0, label1, label2
    
label0, label1, label2 = get_PCA_clusters(proj_X_t, train_y_1d, digits)    
fig=plt.figure(figsize=(10,7))
plt.scatter(label0[:,0], label0[:,1], c=['blue'], edgecolor='none')
plt.scatter(label1[:,0], label1[:,1], c=['red'], edgecolor='none')
plt.scatter(label2[:,0], label2[:,1], c=['green'], edgecolor='none')
plt.show()

In [None]:
# A visualization of our clusters in 3D!

from mpl_toolkits import mplot3d

fig = plt.figure()
ax = plt.axes(projection='3d')

ax.scatter3D(label0[:,0], label0[:,1], label0[:,2], c=label0[:,2], cmap='Blues');
ax.scatter3D(label1[:,0], label1[:,1], label1[:,2], c=label1[:,2], cmap='Reds');
ax.scatter3D(label2[:,0], label2[:,1], label2[:,2], c=label2[:,2], cmap='Greens');


<a id="part3"></a>
# Part 3: Learning Fourier Features #

Now, we will do exactly what we did before except in the frequency domain! The idea is to see if using the frequency domain representation of our data matrix will change the way we cluster with PCA. 

## Computing the FFT of the Data Matrix ##

Let's start by creating a new data matrix which is the DFT of our original data matrix. Once we can do this, then the process of running PCA should be identical.
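
One reason the frequency representation may cluster differently is that the DFT magnitude ignores circular time shifts: two copies of the same sound that start at different times look far apart sample by sample but have identical magnitude spectra. The sketch below illustrates this on a random toy signal (the signal and the shift of 37 samples are arbitrary assumptions) and also shows that `np.fft.rfft` can transform every row of a data matrix at once.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(512)
shifted = np.roll(x, 37)          # the same signal, circularly delayed by 37 samples

# Sample-by-sample the two signals look very different ...
time_dist = np.linalg.norm(x - shifted)

# ... but their magnitude spectra agree up to floating point error.
freq_dist = np.linalg.norm(np.abs(np.fft.rfft(x)) - np.abs(np.fft.rfft(shifted)))

# The FFT acts along the last axis, so a whole data matrix is transformed row by row.
X = np.vstack([x, shifted])
row_spectra = np.abs(np.fft.rfft(X, axis=1))

print("distance in the time domain:     ", time_dist)
print("distance between magnitude DFTs: ", freq_dist)
print("shape of row-wise spectra:       ", row_spectra.shape)
```

Real recordings are zero-padded rather than circularly shifted, so the invariance is only approximate there, but the intuition carries over.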

In [None]:
# Perform FFT on the data matrix and plot a few of the data points in the frequency domain
# Remember, the magnitude of the DFT of a real signal is symmetric, so you only want the first half of the result of the FFT

###BEGIN CODE 1###
fft_X = np.fft.fft(train_X)                     #take fft of data matrix
fft_X = fft_X[:,:int(fft_X.shape[1]/2)]   # first half of frequency response since symmetric
fft_X = np.abs(fft_X)
plt.plot(fft_X[0])
plt.show()
###END CODE 1###

# Since frequencies above a certain threshold have zero or near-zero presence in the frequency response,
# we can simply cut them off. This will allow us to perform PCA on a smaller matrix, which will 
# save us a lot of time and prevent dead kernels. Cut off the frequencies appropriately (where the magnitudes are
# essentially zero) and then visualize the signals again.

###BEGIN CODE 2###
fft_X = fft_X[:,:6000]
plt.plot(fft_X[0])
plt.show()
###END CODE 2###



## Performing PCA in the Frequency Domain ##
You guessed it. Let's now perform PCA on our new data matrix.

In [None]:
# Now perform PCA in the frequency domain. Change the number of basis vectors, plot them, and comment
# on the differences.

###BEGIN CODE SVD###
U_f, S_f, Vt_f = np.linalg.svd(fft_X, full_matrices=False)   # economy SVD: skips the unused rows of Vt
###END CODE SVD###

###BEGIN CODE STEM PLOT PRINCIPAL COMPONENTS###
plt.stem(S_f)
plt.show()
###END CODE STEM PLOT PRINCIPAL COMPONENTS###

###BEGIN CODE CREATE PRINCIPAL COMPONENT BASIS###
new_basis_f = np.array([Vt_f[0], Vt_f[1], Vt_f[2]]).T        # This should be the basis containing your principal components
###END CODE CREATE PRINCIPAL COMPONENT BASIS###

###BEGIN CODE PLOT BASIS VECTORS###
plt.plot(new_basis_f)
plt.show()
###END CODE PLOT BASIS VECTORS###

###BEGIN CODE PROJECT TEST DATA ONTO NEW BASIS###
proj_fft_X = np.dot(fft_X, new_basis_f)
###END CODE PROJECT TEST DATA ONTO NEW BASIS###



### Visualizing Clusters in the Frequency Domain ###

Call the function you created earlier to visualize the new clusters!

In [None]:
#plot the color coded clusters from PCA

###BEGIN CODE###
label0, label1, label2 = get_PCA_clusters(proj_fft_X, train_y_1d, digits)    
        
fig=plt.figure(figsize=(10,7))
plt.scatter(label0[:,0], label0[:,1], c=['blue'], edgecolor='none')
plt.scatter(label1[:,0], label1[:,1], c=['red'], edgecolor='none')
plt.scatter(label2[:,0], label2[:,1], c=['green'], edgecolor='none')
plt.show()
###END CODE###

In [None]:
#run this cell to visualize the clusters in a 3D scatter plot
fig = plt.figure()
ax = plt.axes(projection='3d')

ax.scatter3D(label0[:,0], label0[:,1], label0[:,2], c=label0[:,2], cmap='Blues');
ax.scatter3D(label1[:,0], label1[:,1], label1[:,2], c=label1[:,2], cmap='Reds');
ax.scatter3D(label2[:,0], label2[:,1], label2[:,2], c=label2[:,2], cmap='Greens');


# Using STFT to Encode Signal with DFT

Human speech, unlike the toy audio signals we were dealing with earlier, does not have constant frequency content. In fact, individual words are dynamic and change rapidly over time, which motivates a combined time and frequency representation of the signal rather than a single Fourier Transform of the whole signal. The Short-Time Fourier Transform (STFT) addresses this problem by breaking a signal into short windows of time and computing the DFT within each window. You will now explore the STFT of our signals.
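
To see what the STFT adds, here is a bare-bones version written with NumPy only. It is a sketch, not a replacement for `librosa.stft`, whose defaults (window length, hop size, padding) differ. The helper `simple_stft`, the Hann window, the frame sizes, and the two-tone test signal are all assumptions made for illustration.

```python
import numpy as np

def simple_stft(x, frame_len=256, hop=128):
    """Magnitude STFT: window overlapping frames and take the DFT of each one."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Rows of `frames` are time windows; transpose so the result is (frequency, time).
    return np.abs(np.fft.rfft(frames, axis=1)).T

fs = 1024                                     # assumed sampling rate in Hz
t = np.arange(2 * fs) / fs                    # 2 seconds
# The tone is 64 Hz for the first second and 192 Hz for the second.
x = np.where(t < 1.0, np.cos(2 * np.pi * 64 * t), np.cos(2 * np.pi * 192 * t))

S = simple_stft(x)
freqs = np.fft.rfftfreq(256, d=1 / fs)

first_frame_hz = freqs[np.argmax(S[:, 0])]
last_frame_hz = freqs[np.argmax(S[:, -1])]
print("spectrogram shape (freq bins, frames):", S.shape)
print("dominant frequency in first frame:", first_frame_hz)
print("dominant frequency in last frame: ", last_frame_hz)
```

A single DFT of the whole signal would show both tones but not their order; the STFT keeps the information about when each frequency occurs.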

In [None]:
###########################
##         STFT
##########################

#Perform the STFT on our data matrix and plot the spectrogram.
# (Hint 1: Consider using the librosa library for both computing stft and plotting spectrogram)
# (Hint 2: Make sure to cut off the frequencies where there are close to 0 magnitude as we did earlier)
# (Hint 3: The STFT returns a 2D array because there is a time and frequency component. How can we make this work
#  with our data matrix?)

###BEGIN CODE###
stft = []
for i in range(train_X.shape[0]):
    temp = np.abs(librosa.stft(train_X[i]))
    temp = temp[:400,:]
    temp = temp.flatten()
    stft.append(temp)
    
stft = np.vstack(stft)

import librosa.display
D = librosa.amplitude_to_db(np.abs(librosa.stft(train_X[0])), ref=np.max)
librosa.display.specshow(D, y_axis='linear')
##END CODE###

<a id="part3a"></a>
### Part 3a: Perform PCA on the STFT Representation of the Signal ###

You know what's next! Let's perform PCA in the STFT domain and see if this gives us any better clustering. Since linearly separating 10 distinct spoken digits is quite difficult, try using a data matrix of only 2-3 digits as you did for both time domain and DFT. This will better help identify the differences between all the representations we have considered so far.


In [None]:
############################
##  PCA on STFT Signal
############################

# Perform PCA on the STFT representation of the signal and plot it. Then, as before, compute the new basis 
# (play around with different numbers of basis vectors) and plot the basis vectors. Finally, project the test 
# data onto the new basis.

###BEGIN CODE###
U_s,S_s,Vt_s = np.linalg.svd(stft, full_matrices=False)   # economy SVD: the flattened STFT has far more columns than rows
plt.stem(S_s)
plt.show()

new_basis_s = np.array([Vt_s[0], Vt_s[1], Vt_s[2]]).T # basis containing your principal components
plt.plot(new_basis_s)
plt.show()

proj_stft = np.dot(stft, new_basis_s)
###END CODE###

<a id="part3b"></a>
### Part 3b: Visualizing Clusters in the STFT Domain ###

Call the function you created earlier to visualize the clusters!

In [None]:
#plot and visualize 2D clusters from PCA

###BEGIN CODE###
label0, label1, label2 = get_PCA_clusters(proj_stft, train_y_1d, digits)    
        
fig=plt.figure(figsize=(10,7))
plt.scatter(label0[:,0], label0[:,1], c=['blue'], edgecolor='none')
plt.scatter(label1[:,0], label1[:,1], c=['red'], edgecolor='none')
plt.scatter(label2[:,0], label2[:,1], c=['green'], edgecolor='none')
plt.show()
###END CODE###

In [None]:
# run the cell to visualize 3D clusters from PCA

fig = plt.figure()
ax = plt.axes(projection='3d')

ax.scatter3D(label0[:,0], label0[:,1], label0[:,2], c=label0[:,2], cmap='Blues');
ax.scatter3D(label1[:,0], label1[:,1], label1[:,2], c=label1[:,2], cmap='Reds');
ax.scatter3D(label2[:,0], label2[:,1], label2[:,2], c=label2[:,2], cmap='Greens');


<a id="part4"></a>
# Part 4: Testing Classifiers

Now that we have visualized our data using three different representations, let's see if we can perform some classification of the spoken digits! As mentioned earlier, classifying all 10 spoken digits is a highly non-linear problem and will likely not be possible with the linear classifiers we will use at first, so make sure you are using the subset of the data matrix mentioned above.


<a id="part4a"></a>
## Part 4a: Logistic Regression ##

How do you expect logistic regression to perform given the PCA clusters from our dataset in (1) time domain (2) frequency domain and (3) STFT?

TODO: Answer the question

In [None]:
# print the shape of your processed data and then use 
# sklearn's logistic regression to classify that data

###BEGIN CODE###
print(processed_X.shape)
clf = LogisticRegression(random_state=0).fit(processed_X, train_y_1d)
###END CODE###

### Logistic regression with PCA preprocessing for dimensionality reduction
As you should have seen in the step above, logistic regression does not converge! Each sample has 23,000 features, far more than the number of training samples, and the solver runs out of its default iteration budget. This is a great place for dimensionality reduction algorithms such as PCA to shine! Let's perform PCA on our data matrix and then use logistic regression.

In [None]:
# print the shape of your new data after performing PCA and then use 
# sklearn's logistic regression to classify that data

###BEGIN CODE###
print(proj_X_t.shape)
clf = LogisticRegression(random_state=0).fit(proj_X_t, train_y_1d)
###END CODE###

Now, let's see how our classifier performs on the test set! Here, you must project the test set onto our new basis and then perform predictions. How does it do?
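
The key point is that the basis is learned from the training set only and then reused, unchanged, on the test set. The sketch below shows that flow on synthetic two-class data. To stay NumPy-only it swaps in a nearest-centroid rule in place of logistic regression, so it illustrates the projection and accuracy steps rather than the classifier used in this lab. All names and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 50
offset = np.zeros(n_features)
offset[0] = 4.0                    # the two classes differ along one direction

def make_toy_split(n_per_class):
    """Draw a synthetic two-class dataset (stand-in for real audio features)."""
    class0 = rng.standard_normal((n_per_class, n_features)) + offset
    class1 = rng.standard_normal((n_per_class, n_features)) - offset
    return np.vstack([class0, class1]), np.repeat([0, 1], n_per_class)

toy_train_X, toy_train_y = make_toy_split(40)
toy_test_X, toy_test_y = make_toy_split(20)

# Learn the basis from the TRAINING data only.
_, _, Vt = np.linalg.svd(toy_train_X, full_matrices=False)
toy_basis = Vt[:3].T

# Project both sets onto that same basis.
toy_proj_train = toy_train_X @ toy_basis
toy_proj_test = toy_test_X @ toy_basis

# Nearest-centroid classifier (a simple stand-in for logistic regression).
centroids = np.stack([toy_proj_train[toy_train_y == c].mean(axis=0) for c in (0, 1)])
distances = np.linalg.norm(toy_proj_test[:, None, :] - centroids[None, :, :], axis=2)
toy_predictions = np.argmin(distances, axis=1)

# Accuracy is the fraction of test points whose prediction matches the label.
toy_accuracy = float(np.mean(toy_predictions == toy_test_y))
print("test accuracy on toy data:", toy_accuracy)
```

Recomputing the SVD on the test set would produce a different basis, and the classifier's learned weights would no longer line up with the projected features.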

In [None]:
# First project the test set onto the new basis
# Then use your classifier to get a prediction vector on your projected test set
# Finally, compute the accuracy of your classifier on the test set

###BEGIN CODE###
processed_X_test = envelope(test_X)
processed_X_test = processed_X_test[:,:23000]
proj_test = np.dot(processed_X_test, new_basis_t)

predictions = clf.predict(proj_test)
error = test_y_1d - predictions
error[np.abs(error) > 0] = 1
print(1 - sum(error)/error.shape[0])
###END CODE###

### Logistic regression on the DFT of the data matrix

Now let's perform logistic regression on the frequency representation of our data matrix. As before, to ensure convergence, let's perform PCA on the DFT of the data matrix before applying logistic regression.

In [None]:
# use sklearn's logistic regression to classify your data after DFT encoding

###BEGIN CODE###
clf = LogisticRegression(random_state=0).fit(proj_fft_X, train_y_1d)
###END CODE###

Now we will learn a classifier on the frequency representation of our data matrix. Perform the same steps as you did for the time domain representation with the only difference being the representation of the data matrix. How does the accuracy of this classifier on the test set compare to the time domain classifier? What does this tell you about the frequency domain representation of raw audio signals?

In [None]:
# First project the test set onto the new basis
# Then use your classifier to get a prediction vector on your projected test set
# Finally, compute the accuracy of your classifier on the test set

###BEGIN CODE###
test_fft = np.fft.fft(test_X)
test_fft = test_fft[:,:int(test_fft.shape[1]/2)]   # first half of frequency response since symmetric
test_fft = np.abs(test_fft)
test_fft = test_fft[:,:6000]
proj_test = np.dot(test_fft, new_basis_f)

predictions = clf.predict(proj_test)
error = test_y_1d - predictions
error[np.abs(error) > 0] = 1
print(1 - sum(error)/error.shape[0])
###END CODE###

### Logistic regression on the STFT of the data matrix

Finally, let's perform logistic regression on the STFT representation of the data matrix!

In [None]:
# use sklearn's logistic regression to classify your data after STFT encoding

###BEGIN CODE###
clf = LogisticRegression(random_state=0).fit(proj_stft, train_y_1d)
###END CODE###

Let's now evaluate the classifier on the STFT representation of the test set using similar steps. How does the accuracy of this classifier compare to both the time-domain and frequency-domain classifiers? What does this tell you about the STFT representation for raw human speech?
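
To show what an STFT feature vector contains, here is a minimal NumPy sketch. `simple_stft_magnitude` is a hypothetical, simplified stand-in for `librosa.stft` (no centering or edge padding), and the frame length, hop length, and test signal are all assumptions. The signal switches from 500 Hz to 2000 Hz halfway through, which a single DFT of the whole signal would blur together but the STFT keeps apart in time.

```python
import numpy as np

def simple_stft_magnitude(x, frame_length=256, hop_length=128):
    """Magnitude STFT from Hann-windowed frames (simplified stand-in for librosa.stft)."""
    window = np.hanning(frame_length)
    n_frames = 1 + (len(x) - frame_length) // hop_length
    frames = np.stack([
        x[i * hop_length : i * hop_length + frame_length] * window
        for i in range(n_frames)
    ])
    # Transpose so rows are frequency bins and columns are time frames
    return np.abs(np.fft.rfft(frames, axis=1)).T

demo_rate = 8000                                  # assumed sampling rate in Hz
demo_time = np.arange(demo_rate) / demo_rate      # one second of samples
two_tone = np.where(demo_time < 0.5,
                    np.sin(2 * np.pi * 500 * demo_time),
                    np.sin(2 * np.pi * 2000 * demo_time))

demo_spec = simple_stft_magnitude(two_tone)
demo_features = demo_spec.flatten()               # one row of a data matrix

bin_hz = demo_rate / 256                          # width of one frequency bin in Hz
first_peak_hz = np.argmax(demo_spec[:, 0]) * bin_hz
last_peak_hz = np.argmax(demo_spec[:, -1]) * bin_hz
print(demo_spec.shape, demo_features.shape, first_peak_hz, last_peak_hz)
```

Flattening discards the 2D layout but keeps every time-frequency value, which is what lets a linear model such as logistic regression consume the spectrogram as an ordinary feature vector.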

In [None]:
# First project the test set onto the new basis
# Then use your classifier to get a prediction vector on your projected test set
# Finally, compute the accuracy of your classifier on the test set

###BEGIN CODE###
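test_stft = []
for i in range(test_X.shape[0]):
    temp = np.abs(librosa.stft(test_X[i]))   # magnitude spectrogram (frequency bins x time frames)
    temp = temp[:400, :]                     # keep the first 400 frequency bins
    temp = temp.flatten()                    # one feature vector per recording
    test_stft.append(temp)

test_stft = np.vstack(test_stft)

# Project onto the STFT basis new_basis_s
proj_test = np.dot(test_stft, new_basis_s)
predictions = clf.predict(proj_test)
# Accuracy = fraction of test samples whose predicted label matches the true label
accuracy = np.mean(predictions == test_y_1d)
print(accuracy)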
###END CODE###

<a id="part4b"></a>
## Part 4b: Classification of Audio Signals with Neural Nets

Now that we have attempted to classify audio signals using traditional clustering and linear classification methods, let's try a neural network approach! We will use [Keras](https://keras.io/) to classify the time-domain and STFT representations of the spoken digits dataset and compare performance against our other methods. The goal of this section is to show you not only that neural networks are powerful nonlinear classifiers for highly unstructured data, but also that careful attention to the representation used for training can improve performance. This is why understanding the relationship between the time and frequency domains remains valuable even when using deep learning methods.

### Neural Network Classification in the Time Domain

First, we will try to classify all 10 digits directly from the time-domain representation of the signal in an end-to-end manner. Here you will gain some experience using deep learning libraries such as Keras to build and train neural networks.
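
Before building the network in Keras, it can help to trace the shapes by hand. The sketch below is a plain NumPy forward pass through one hidden layer of 128 ReLU units and a 10-way softmax output. It is not Keras code, the weights are random rather than trained, and the input length of 6000 is a made-up number standing in for the length of one recording.

```python
import numpy as np

rng = np.random.default_rng(0)
# n_inputs is a made-up signal length; substitute the length of one training example
n_samples, n_inputs, n_hidden, n_classes = 4, 6000, 128, 10

batch = rng.normal(size=(n_samples, n_inputs))
digit_labels = np.array([3, 0, 7, 9])
one_hot = np.eye(n_classes)[digit_labels]         # shape (4, 10), one row per label

# Random (untrained) weights, only used to follow the dimensions
W1 = rng.normal(scale=0.01, size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.01, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

hidden_act = np.maximum(0.0, batch @ W1 + b1)     # ReLU, shape (4, 128)
logits = hidden_act @ W2 + b2                     # shape (4, 10)
logits -= logits.max(axis=1, keepdims=True)       # subtract row max for numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Categorical cross-entropy between the one-hot labels and predicted probabilities
loss = -np.mean(np.sum(one_hot * np.log(probs), axis=1))
n_params = W1.size + b1.size + W2.size + b2.size
print(probs.shape, one_hot.shape, n_params, loss)
```

The one-hot label matrix has the same shape as the softmax output, which is the layout a categorical cross-entropy loss compares against.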

In [None]:
# First, we will construct the neural network. Look up the Keras documentation for how to create a simple MLP
# with one hidden layer. The hardest part here is making sure that your dimensions match the shape of your data.
# For this part, use Dense layers for your hidden and output layers.

###BEGIN CODE###
ip = Input(shape=(train_X[0].shape))
hidden = Dense(128, activation='relu')(ip)
op = Dense(10, activation='softmax')(hidden)
#model = Model(input=ip, output=op)
model = Model(ip, op)
###END CODE###
model.summary() # This will help you visualize your model


In [None]:
# Compile your model here
###BEGIN CODE###
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
###END CODE###


# Start training your model here on the training split of the data

###BEGIN CODE###
history = model.fit(train_X,
          train_y,
          epochs=10,
          batch_size=32,
          validation_data=(test_X, test_y))
###END CODE###

In [None]:
# Plot your training and validation accuracy here

###BEGIN CODE###
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
###END CODE###

### Neural Network Classification in the STFT Domain

Now we will use the STFT representation to perform classification with a neural network. Unlike the time-domain representation, the STFT representation is 2D, so we can use a Convolutional Neural Network (CNN) to perform the classification. We have done some basic preprocessing below so that our 2D data can be used as input to the CNN, but you will have to build and train the network.
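
When choosing kernel and pooling sizes, it is useful to check that the feature map does not shrink to nothing. The sketch below does the shape arithmetic for a stack of stride-1 convolutions and non-overlapping max-pooling layers. The 1025 x 40 input size and the particular layer stack are assumptions for illustration; use the shape printed by the preprocessing cell for your own data.

```python
def conv_output(size, kernel, padding):
    """Output length of a stride-1 convolution along one axis."""
    return size if padding == "same" else size - kernel + 1

def pool_output(size, pool):
    """Output length of non-overlapping max pooling (stride equal to the pool size)."""
    return size // pool

# Assumed input: 1025 frequency bins x 40 time frames (hypothetical values)
demo_height, demo_width = 1025, 40
demo_shapes = [(demo_height, demo_width)]

# Each entry is ("conv", kernel, padding) or ("pool", pool_size)
layer_specs = [("conv", 4, "same"), ("pool", 4), ("conv", 4, "valid"), ("pool", 4)]
for spec in layer_specs:
    if spec[0] == "conv":
        _, kernel, padding = spec
        demo_height = conv_output(demo_height, kernel, padding)
        demo_width = conv_output(demo_width, kernel, padding)
    else:
        _, pool = spec
        demo_height = pool_output(demo_height, pool)
        demo_width = pool_output(demo_width, pool)
    demo_shapes.append((demo_height, demo_width))

# Units seen by the first Dense layer, assuming 64 filters in the last convolution
flattened_units = demo_height * demo_width * 64
print(demo_shapes, flattened_units)
```

With these assumed sizes the time axis is already down to a single column after the second pooling layer, so a deeper stack would need smaller pooling windows or `'same'` padding.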

In [None]:
train_X_ex = np.expand_dims(train_spectrograms, -1)
test_X_ex = np.expand_dims(test_spectrograms, -1)
print('train X shape:', train_X_ex.shape)
print('test X shape:', test_X_ex.shape)

In [None]:
# Build the CNN architecture using Keras
# You will have to play around with parameters to optimize performance!

###BEGIN CODE###
ip = Input(shape=train_X_ex[0].shape)
m = Conv2D(32, kernel_size=(4, 4), activation='relu', padding='same')(ip)
m = MaxPooling2D(pool_size=(4, 4))(m)
m = Dropout(0.2)(m)
m = Conv2D(64, kernel_size=(4, 4), activation='relu')(m)  # feed the previous block's output, not the raw input
m = MaxPooling2D(pool_size=(4, 4))(m)
m = Dropout(0.2)(m)
m = Flatten()(m)
m = Dense(32, activation='relu')(m)
op = Dense(10, activation='softmax')(m)

model = Model(ip, op)
###END CODE###
model.summary()


In [None]:
#Compile and train the model here

###BEGIN CODE###
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(train_X_ex,
          train_y,
          epochs=10,
          batch_size=32,
          verbose=1,
          validation_data=(test_X_ex, test_y))
###END CODE###

In [None]:
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

<a id="conclusion"></a>
# Part 5: Testing on Your Own Voice #

Now that we have compared different methods of classifying audio signals, let's try to classify numbers using our own voice!

The first step in classifying your own voice is collecting the data. To limit variability, we record samples at the same sampling rate as the dataset. In the note, you learned that sampling above the Nyquist rate (twice the highest frequency present in the signal) lets you perfectly reconstruct that signal. In the code below, we start with a sampling rate of 8000 Hz, which captures frequency content up to 4000 Hz and is enough for intelligible speech. Try recording yourself saying one of the digits that you selected earlier in the code at a sampling rate (`fs`) of 8000 Hz. Feel free to play around with this rate and listen to the resulting audio signal. Comment on what you notice.
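
To see what goes wrong when the sampling rate is too low, here is a small sketch that samples pure tones at 8000 Hz and reports the strongest frequency in the sampled signal. `dominant_frequency` is a hypothetical helper written for this illustration, and the tone frequencies are arbitrary choices on either side of the 4000 Hz limit.

```python
import numpy as np

def dominant_frequency(tone_hz, sample_rate, duration=1.0):
    """Sample a sine tone and return the frequency (Hz) of its largest DFT bin."""
    n = int(sample_rate * duration)
    t = np.arange(n) / sample_rate
    samples = np.sin(2 * np.pi * tone_hz * t)
    spectrum = np.abs(np.fft.rfft(samples))
    return np.argmax(spectrum) * sample_rate / n

below_limit = dominant_frequency(3000, 8000)   # 3000 Hz is below half the sampling rate
above_limit = dominant_frequency(5000, 8000)   # 5000 Hz is above half the sampling rate
print(below_limit, above_limit)
```

The 5000 Hz tone produces exactly the same sample magnitudes as a 3000 Hz tone, so it shows up at 3000 Hz: this is aliasing, and it is one of the effects you may hear when you lower `fs` in the recording cell.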

TODO: Change the sampling rate in the code below and comment on what you notice.

In [None]:
##################################
## Code to collect Audio sample
## from jupyter notebook
#################################

import pyaudio


chunk = 1024                     # Record in chunks of 1024 samples
sample_format = pyaudio.paInt16  # 16 bits per sample
channels = 1
fs = 8000                        # Record at 8000 Hz, the same sampling rate as the dataset
seconds = 2                      # Record 2 seconds of audio
filename = "test"

p = pyaudio.PyAudio()  # Create an interface to PortAudio

print('Recording')

stream = p.open(format=sample_format,
                channels=channels,
                rate=fs,
                frames_per_buffer=chunk,
                input=True)

frames = []  # Initialize array to store frames

for i in range(0, int(fs / chunk * seconds)):
    data = stream.read(chunk)
    frames.append(data)
    
stream.stop_stream()
stream.close()
p.terminate()

print('Finished recording')


wf = wave.open(filename +".wav", 'wb')
wf.setnchannels(channels)
wf.setsampwidth(p.get_sample_size(sample_format))
wf.setframerate(fs)
wf.writeframes(b''.join(frames))
wf.close()

In [None]:
ipd.Audio("test.wav")


# Predict on Your Own Voice

Before you run the following cell, how do you think the model will classify your speech? What characteristics of your voice do you think could influence the classification? Now run the cell. Was it able to classify your speech correctly? If not, what are some reasons why this model might have had a hard time?
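
The prediction step involves two pieces of bookkeeping that are easy to get wrong: forcing the spectrogram to a fixed number of time frames, and adding the batch and channel axes a CNN expects. The sketch below shows both on dummy arrays. `fix_num_frames` is a hypothetical helper, the 1025 x 40 target size is an assumption, and the probability vector is made up rather than produced by a model.

```python
import numpy as np

def fix_num_frames(spectrogram, n_frames):
    """Truncate or zero-pad the time axis (columns) to exactly n_frames."""
    current = spectrogram.shape[1]
    if current >= n_frames:
        return spectrogram[:, :n_frames]
    return np.pad(spectrogram, ((0, 0), (0, n_frames - current)))

short_clip = np.ones((1025, 25))     # dummy spectrogram with too few frames
long_clip = np.ones((1025, 90))      # dummy spectrogram with too many frames
fixed_short = fix_num_frames(short_clip, 40)
fixed_long = fix_num_frames(long_clip, 40)

# Add batch and channel axes: (freq, time) -> (1, freq, time, 1)
batch_input = fixed_short[np.newaxis, :, :, np.newaxis]

# Made-up softmax output for one recording; the largest entry is the predicted class
made_up_probs = np.array([[0.02, 0.01, 0.05, 0.70, 0.03, 0.04, 0.05, 0.03, 0.04, 0.03]])
predicted_class = int(np.argmax(made_up_probs, axis=1)[0])
print(fixed_short.shape, fixed_long.shape, batch_input.shape, predicted_class)
```

Note that truncation silently drops everything after the last kept frame, so a recording with a long silence before the spoken digit may lose the digit itself.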

In [None]:
# Load the recording (librosa resamples to its default rate of 22050 Hz unless sr is given)
wav, sr = librosa.load("test.wav")

# Truncate or zero-pad a 2D array to exactly i columns (time frames)
pad2d = lambda a, i: a[:, 0: i] if a.shape[1] > i else np.hstack((a, np.zeros((a.shape[0],i - a.shape[1]))))
spectrogram = np.abs(librosa.stft(wav))
padded_spectrogram = pad2d(spectrogram, 40)

# Add batch and channel axes: (freq, time) -> (1, freq, time, 1)
own_voice_input = padded_spectrogram[np.newaxis, :, :, np.newaxis]
prediction = model.predict(own_voice_input)
print(prediction)                                       # predicted probability for each class
print('Predicted class index:', np.argmax(prediction))

# Congratulations!

Great job! You have completed this lab.  Hopefully you now appreciate the power of the frequency domain and Fourier Transform and their many applications.

# References

Allen V. Oppenheim, Signals and Systems, Second Edition, 1997

Berkeley Microscopy, Capturing images
http://microscopy.berkeley.edu/courses/dib/sections/02Images/sampling.html

Jarno Seppänen, Audio Signal Processing basics, 1999
https://www.cs.tut.fi/sgn/arg/intro/basics.html

Dima Shulga, Speech Classification Using Neural Networks
https://towardsdatascience.com/speech-classification-using-neural-networks-the-basics-e5b08d6928b7