## Import basic libraries

In [1]:
import glob
import numpy as np
import csv
import pickle

import python_speech_features
from python_speech_features import mfcc
import soundfile as sf

## Features extraction
Running the `file` command in a terminal shows the format of the FLAC files:

FLAC `audio bitstream data, 16 bit, mono, 16 kHz, 33440 samples`
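
As a quick sanity check on those numbers, the clip length follows from dividing the sample count by the sample rate. The sketch below also estimates how many analysis frames such a clip would yield, assuming a 25 ms window, a 10 ms step, and zero-padding of the last partial frame. These framing values are assumptions for illustration; the `file` output does not report them.

```python
import math

num_samples = 33440      # from the `file` output above
sample_rate = 16000      # 16 kHz

duration_s = num_samples / sample_rate   # clip length in seconds

# Assumed framing parameters (illustrative only): 25 ms window, 10 ms step.
frame_len = round(0.025 * sample_rate)   # 400 samples
frame_step = round(0.010 * sample_rate)  # 160 samples

# Number of frames when the final partial frame is zero-padded.
num_frames = 1 + math.ceil((num_samples - frame_len) / frame_step)

print(f"duration: {duration_s:.2f} s, frames: {num_frames}")
```

Under these assumptions the example clip lasts about 2.09 seconds and produces 208 frames.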

The first step is to extract the MFCC features of every sound file for every speaker. Set the variable `generate` to `True` to rebuild the features yourself, or leave it as `False` to skip this step and load the precomputed data with `pickle`. More details are given below.
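
Utterances differ in length, so each one yields a different number of MFCC frames. Here is a minimal sketch of how averaging over the frame axis turns them into fixed-length vectors. It uses random arrays as stand-ins for real MFCC matrices; the frame counts and the 13-coefficient width are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for per-utterance MFCC matrices of shape (num_frames, 13).
fake_mfccs = [rng.normal(size=(n_frames, 13)) for n_frames in (120, 208, 75)]

# Averaging over axis 0 (time) collapses each matrix to one 13-dimensional vector.
features = np.array([m.mean(axis=0) for m in fake_mfccs])

print(features.shape)
```

Every utterance ends up as one row of 13 values regardless of its duration, which is what lets the rows be stacked into a single training array.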

In [2]:
generate = False  # set to True to regenerate the training data from the raw audio files

if generate:
    
    base_string = './LibriSpeech/dev-clean'
    speakers = []
    for filename in glob.glob(base_string + '/*'):
        new_string = base_string + '/'
        speakers.append(filename.replace(new_string,''))

    speakers_chapters = {}
    for speaker in speakers:
        speakers_chapters[speaker] = {}
        for filename in glob.glob(base_string + '/' + speaker + '/*'):
            new_string = base_string + '/' + speaker + '/'
            chapter = filename.replace(new_string,'')
            speakers_chapters[speaker][chapter] = []

    for speaker in speakers_chapters:
        for chapter in speakers_chapters[speaker]:
            path = base_string + '/' + speaker + '/' + chapter + '/' + '*.flac'
            file_list = glob.glob(path)
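            for flac_path in file_list:
                # sf.read accepts a path and returns (samples, sample rate)
                data, samplerate = sf.read(flac_path)
                mfcc_feat = mfcc(data, samplerate)  # one row of MFCCs per frame
                # Average over frames to get one fixed-length vector per file
                mfcc_f = np.mean(mfcc_feat, axis=0)
                speakers_chapters[speaker][chapter].append(mfcc_f)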
    '''
    At this point, all the .flac sound files have been loaded and their MFCC features stored in a dictionary.
    Each MFCC feature vector has 13 entries, one per cepstral coefficient.
    
    The sound files differ in length, so the number of MFCC frames differs from file to file. To get a
    fixed-size representation, I take the mean of the MFCC frames of each file.
    
    I think this is reasonable because a person speaks with different frequency content from sentence to
    sentence, and averaging reduces this variance, so the result represents each speaker more reliably.
    '''

    raw_label = []
    with open('./LibriSpeech/SPEAKERS.TXT', newline='') as inputfile:
        for row in csv.reader(inputfile):
            if '|' in row[0]:
                raw_label.append(str(row[0]))
    raw_label = raw_label[2:]

    labels = {}
    for i in range(0, len(raw_label)):
        raw_label[i] = raw_label[i][0:raw_label[i].index('|')+4]
        number, label = raw_label[i].replace(" ", "").split('|')
        if label == 'F':
            labels[number] = 0
        else:
            labels[number] = 1
    '''
    The speaker labels are generated from the SPEAKERS.TXT file provided with the dataset
    (0 = female, 1 = male).
    '''

    training_data = []
    for speaker in speakers:
        files_of_one_speaker = list(speakers_chapters[speaker].values())
        files_of_one_speaker = sum(files_of_one_speaker, [])
        training_data.append(files_of_one_speaker)

    training_label = []
    for speaker in speakers:
        for chapter in speakers_chapters[speaker]:
            for file in speakers_chapters[speaker][chapter]:
                training_label.append(labels[speaker])


    training_data = sum(training_data,[])
    training_data = np.array(training_data)
    training_label = np.array(training_label)
    '''
    I save the MFCC features and their corresponding labels into two separate arrays, which are used for
    training and testing later.
    '''
    with open("training_data", "wb") as f:
        pickle.dump(training_data, f)
    with open("training_label", "wb") as f:
        pickle.dump(training_label, f)
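
The label-parsing step in the cell above can be illustrated on a few made-up rows. The sample lines below only imitate the `ID | SEX | ...` layout that the code relies on; they are hypothetical and not copied from `SPEAKERS.TXT`, and `parse_gender_labels` is a helper name introduced here.

```python
# Hypothetical rows in an "ID | SEX | SUBSET | ..." layout (not real dataset entries).
sample_rows = [
    "1001 | F | dev-clean | 8.00 | Speaker A",
    "1002 | M | dev-clean | 8.00 | Speaker B",
]

def parse_gender_labels(rows):
    """Map speaker ID -> 0 for female, 1 otherwise, mirroring the encoding above."""
    labels = {}
    for row in rows:
        fields = [field.strip() for field in row.split("|")]
        speaker_id, sex = fields[0], fields[1]
        labels[speaker_id] = 0 if sex == "F" else 1
    return labels

print(parse_gender_labels(sample_rows))
```

Splitting on `|` and stripping whitespace avoids the fixed-width slicing used in the cell, at the cost of assuming that the first two fields are always the ID and the sex.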

## Load datasets by pickle

In [3]:
with open("training_data", "rb") as f:
    training_data = pickle.load(f)
with open("training_label", "rb") as f:
    training_label = pickle.load(f)

## Import specific libraries for naive classifiers

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn import svm

## Start playing with these classifiers!

In [5]:
# Split the training data into training set and testing set
X_train, X_test, y_train, y_test = train_test_split(training_data, training_label, test_size=0.1, random_state=0)

In [6]:
# SVM classifier (with kernel='linear', the `degree` argument is ignored)
clf_svm = svm.SVC(kernel='linear', degree = 1, C=1).fit(X_train, y_train)
scores = cross_val_score(clf_svm, training_data, training_label, cv=5)
print("SVM Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# Logistic regression classifier (the l1 penalty needs a solver that supports it, such as liblinear)
clf_log = LogisticRegression(C=1, penalty='l1', solver='liblinear', tol=0.1).fit(X_train, y_train)
scores = cross_val_score(clf_log, training_data, training_label, cv=5)
print("Logistic Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# Multilayer perceptron
clf_mlp = MLPClassifier(solver='lbfgs', alpha=1e-5,hidden_layer_sizes=(20, 10), random_state=1).fit(X_train, y_train)
scores = cross_val_score(clf_mlp, training_data, training_label, cv=5)
print("MLP Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

SVM Accuracy: 0.77 (+/- 0.21)
Logistic Accuracy: 0.77 (+/- 0.21)
MLP Accuracy: 0.82 (+/- 0.08)
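
The `+/-` figure printed above is twice the standard deviation of the five fold scores. A small sketch with made-up fold accuracies (hypothetical values, not the notebook's actual per-fold results) shows the arithmetic:

```python
import statistics

# Hypothetical per-fold accuracies, for illustration only.
fold_scores = [0.70, 0.80, 0.90, 0.75, 0.85]

mean_acc = statistics.mean(fold_scores)
# NumPy's .std() defaults to the population standard deviation, hence pstdev here.
spread = 2 * statistics.pstdev(fold_scores)

print("Accuracy: %0.2f (+/- %0.2f)" % (mean_acc, spread))
```

A wide interval such as the 0.21 reported for the SVM means that the score changes a lot from fold to fold, so the mean alone understates the uncertainty.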


## Import specific libraries for CNN 

In [7]:
import tensorflow as tf

## Start playing with CNN!

First, I define two functions that normalize and standardize the dataset. This matters because a CNN generally trains better when its input is normalized or standardized.

In [8]:
def normalize(list_):
    # Reversed min-max scaling of one feature vector: the maximum maps to 0 and the minimum to 1
    maximum = max(list_)
    minimum = min(list_)
    return (maximum - list_)/(maximum - minimum)

def standardize(list_):
    mean = list_.mean()
    standard = list_.std()
    return (list_ - mean)/standard

norm_training_data = np.zeros((training_data.shape[0],training_data.shape[1]))
for i in range(0, len(training_data)):
    norm_training_data[i] = normalize(training_data[i])
    
stand_training_data = np.zeros((training_data.shape[0],training_data.shape[1]))
for i in range(0, len(training_data)):
    stand_training_data[i] = standardize(training_data[i])
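
To see what the two helpers do to a single feature vector, here is a small sketch on a toy array. Note that `normalize` as written maps the largest value to 0 and the smallest to 1. The conventional min-max form, shown as `min_max` (a name introduced here for comparison only), is its mirror image.

```python
import numpy as np

def normalize(values):
    # Same formula as the notebook: reversed min-max scaling.
    return (values.max() - values) / (values.max() - values.min())

def min_max(values):
    # Conventional min-max scaling, for comparison (not used in the notebook).
    return (values - values.min()) / (values.max() - values.min())

def standardize(values):
    # Zero mean and unit (population) standard deviation.
    return (values - values.mean()) / values.std()

toy = np.array([2.0, 4.0, 6.0, 10.0])

print(normalize(toy))    # largest value -> 0, smallest -> 1
print(min_max(toy))      # smallest value -> 0, largest -> 1
print(standardize(toy))
```

Both scalings lie in [0, 1] and add up to 1 element by element, so they carry the same information with the order flipped.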

Then I reshape the training and testing data, because the TensorFlow graph defined later
expects a 4-dimensional tensor as input and a one-hot label array of shape (n, 2) as output.

In [9]:
# Split the training data into training set and testing set
X_train, X_test, y_train, y_test = train_test_split(norm_training_data, training_label, test_size=0.1, random_state=0)
# One-hot encode the labels: 0 -> [1, 0], 1 -> [0, 1]
y_train = np.eye(2)[y_train]
y_test = np.eye(2)[y_test]

# Reshape to (batch, height, width, channels) = (n, 1, 13, 1)
train_x = X_train.reshape(X_train.shape[0],1,13,1).astype(np.float32)
train_y = y_train

test_x = X_test.reshape(X_test.shape[0],1,13,1).astype(np.float32)
test_y = y_test
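
The reshaping in the cell above can be checked on a tiny synthetic batch. The sketch uses `np.eye(2)[labels]` for the one-hot step and a direct `reshape` for the 4-D input; the array contents are placeholders.

```python
import numpy as np

labels = np.array([0, 1, 1, 0])                                  # 0 = female, 1 = male
features = np.arange(4 * 13, dtype=np.float64).reshape(4, 13)    # placeholder feature rows

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(2)[labels]                                      # shape (4, 2)

# (batch, height, width, channels) layout expected by the 2-D convolution ops.
as_images = features.reshape(-1, 1, 13, 1).astype(np.float32)

print(one_hot)
print(as_images.shape)
```

Each 13-value feature vector becomes a 1 x 13 single-channel "image", so the convolution slides along the coefficient axis only.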

Here, I set the basic parameters of CNN and define weight, bias, convolution and pooling operations for later use.

The input has height 1 and width 13, and the output is 0 or 1, representing female or male. I set the number of channels to 1 because there are no features other than the MFCCs.

As for the kernel size, that is, the width of the convolutional filter, I set the first kernel to size 5 and the second to size 3. I chose 5 for the first because, for human voice, I think the low frequencies do not affect the high frequencies much, and 3 for the second because I want to concentrate more on single features.

As for the depth and the number of hidden units, I picked values based on what others have reported works. I set the learning rate and the number of training epochs slightly higher because I believe this helps the model converge faster. A potential problem is that the solution may oscillate around the optimum when the learning rate is too high, but I did not run into this in my case.

In [10]:
input_height = 1
input_width = 13
num_labels = 2
num_channels = 1

batch_size = 20
kernel_size = 5
depth = 30
num_hidden = 1000

learning_rate = 0.01
training_epochs = 50

total_batchs = X_train.shape[0] // batch_size

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev = 0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.0, shape = shape)
    return tf.Variable(initial)

def apply_depthwise_conv(x,kernel_size,num_channels,depth):
    weights = weight_variable([1, kernel_size, num_channels, depth])
    biases = bias_variable([depth * num_channels])
    return tf.nn.relu(tf.add(tf.nn.depthwise_conv2d(x,weights, [1, 1, 1, 1], padding='VALID'),biases))
    
def apply_max_pool(x,kernel_size,stride_size):
    return tf.nn.max_pool(x, ksize=[1, 1, kernel_size, 1], 
                          strides=[1, 1, stride_size, 1], padding='VALID')

Here I define the layers of my CNN architecture. The final layer is a softmax because the output is only 0 or 1; with two classes, softmax is equivalent to logistic regression, so it is a reasonable choice. I use gradient descent to minimize the loss because the training set is not too big, so plain gradient descent is enough to reach a good solution.
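
For two classes, softmax reduces to the logistic sigmoid of the difference between the two logits, which is the sense in which the output layer behaves like logistic regression. A quick numeric check, with arbitrary logit values chosen for illustration:

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = [z - max(logits) for z in logits]
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

logits = [0.3, 1.7]                      # arbitrary example values
probs = softmax(logits)

# P(class 1) from softmax equals sigmoid(logit_1 - logit_0).
print(probs[1], sigmoid(logits[1] - logits[0]))

# Cross-entropy for a one-hot target [0, 1], the same form as -sum(Y * log(y_)).
target = [0.0, 1.0]
loss = -sum(t * math.log(p) for t, p in zip(target, probs))
print(loss)
```

With a one-hot target, only the log-probability of the true class contributes to the loss.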

In [11]:
X = tf.placeholder(tf.float32, shape=[None,input_height,input_width,num_channels])
Y = tf.placeholder(tf.float32, shape=[None,num_labels])

c = apply_depthwise_conv(X,kernel_size,num_channels,depth)
p = apply_max_pool(c,2,1)
c = apply_depthwise_conv(p,3,depth*num_channels,depth//10)

shape = c.get_shape().as_list()
c_flat = tf.reshape(c, [-1, shape[1] * shape[2] * shape[3]])

f_weights_l1 = weight_variable([shape[1] * shape[2] * depth * num_channels * (depth//10), num_hidden])
f_biases_l1 = bias_variable([num_hidden])
f = tf.nn.tanh(tf.add(tf.matmul(c_flat, f_weights_l1),f_biases_l1))

out_weights = weight_variable([num_hidden, num_labels])
out_biases = bias_variable([num_labels])

y_ = tf.nn.softmax(tf.matmul(f, out_weights) + out_biases)
loss = -tf.reduce_sum(Y * tf.log(y_))
optimizer = tf.train.GradientDescentOptimizer(learning_rate = learning_rate).minimize(loss)

correct_prediction = tf.equal(tf.argmax(y_,1), tf.argmax(Y,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
cost_history = np.empty(shape=[0],dtype=float)  # start empty so only real loss values are recorded
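
The tensor widths produced by the graph above can be worked out by hand. With `VALID` padding, a window of size k and stride s turns a width w into floor((w - k) / s) + 1, and a depthwise convolution multiplies the channel count by its channel multiplier. The helper name `valid_out` is introduced here for illustration.

```python
def valid_out(width, kernel, stride=1):
    """Output width of a VALID-padded convolution or pooling window."""
    return (width - kernel) // stride + 1

width, channels = 13, 1

width = valid_out(width, 5)          # first depthwise conv, kernel 5
channels *= 30                       # channel multiplier = depth
width = valid_out(width, 2, 1)       # max pool, window 2, stride 1
width = valid_out(width, 3)          # second depthwise conv, kernel 3
channels *= 30 // 10                 # channel multiplier = depth // 10

flat_size = 1 * width * channels     # height stays 1 throughout
print(width, channels, flat_size)
```

The final width is 6 with 90 channels, so the flattened vector has 540 entries, which matches the first dimension of `f_weights_l1`.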

Here I train the CNN. Each epoch runs `total_batchs` mini-batch updates over the training set, and at the end I evaluate the model on the testing set.
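
The batch offset used in the training loop is taken modulo `n - batch_size`, so every slice is a full batch. A small sketch with hypothetical sizes (50 training rows, batches of 20) shows which rows an epoch would read and which it would skip:

```python
n_train = 50        # hypothetical number of training rows
batch_size = 20
total_batches = n_train // batch_size    # full batches per epoch

# Same offset formula as the training loop.
offsets = [(b * batch_size) % (n_train - batch_size) for b in range(total_batches)]

covered = set()
for offset in offsets:
    covered.update(range(offset, offset + batch_size))
skipped = sorted(set(range(n_train)) - covered)

print(offsets)   # start index of each batch
print(skipped)   # rows never visited within an epoch
```

With these sizes the trailing rows that do not fill a whole batch are never used for training. Shuffling the data each epoch is one common way to avoid that, although the notebook does not do so.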

In [12]:
with tf.Session() as session:
    tf.global_variables_initializer().run()
    for epoch in range(training_epochs):
        for b in range(total_batchs):    
            offset = (b * batch_size) % (train_y.shape[0] - batch_size)
            batch_x = train_x[offset:(offset + batch_size), :, :, :]
            batch_y = train_y[offset:(offset + batch_size), :]
            _, batch_loss = session.run([optimizer, loss],feed_dict={X: batch_x, Y : batch_y})
            cost_history = np.append(cost_history,batch_loss)
        if epoch%5 == 0:
            print ("Epoch: ",epoch," Training Loss: ",batch_loss," Training Accuracy: ",
              session.run(accuracy, feed_dict={X: train_x, Y: train_y}))
    print ("Testing Accuracy:", session.run(accuracy, feed_dict={X: test_x, Y: test_y}))

Epoch:  0  Training Loss:  8.24281  Training Accuracy:  0.785773
Epoch:  5  Training Loss:  6.63318  Training Accuracy:  0.842105
Epoch:  10  Training Loss:  6.37844  Training Accuracy:  0.874589
Epoch:  15  Training Loss:  6.64336  Training Accuracy:  0.85773
Epoch:  20  Training Loss:  7.81959  Training Accuracy:  0.901727
Epoch:  25  Training Loss:  6.9914  Training Accuracy:  0.898026
Epoch:  30  Training Loss:  7.30063  Training Accuracy:  0.915707
Epoch:  35  Training Loss:  7.89846  Training Accuracy:  0.889391
Epoch:  40  Training Loss:  5.10765  Training Accuracy:  0.916118
Epoch:  45  Training Loss:  6.46669  Training Accuracy:  0.92023
Testing Accuracy: 0.926199


## Conclusion

In conclusion, I tested four methods for audio classification: a linear SVM, logistic regression, a multilayer perceptron (MLP), and a convolutional neural network (CNN). Performance varies from model to model: I get 77% accuracy for the SVM, 77% for logistic regression, 82% for the MLP, and around 93% for the CNN.

It is a bit surprising that the SVM does no better than logistic regression. I think this is because I only use the linear kernel rather than a polynomial one. I tried increasing the degree of the linear SVM, but the results did not improve much, which is expected because scikit-learn only uses `degree` with the `poly` kernel. I also tried a polynomial SVM, but it took so long to train that I gave up on it. In theory, though, I would expect an SVM to work better than plain logistic regression in most cases.

The MLP performs better than the previous two because it is a neural network, a family that tends to do better on classification overall. However, since a CNN can learn features from unstructured data, I did not spend more time tuning the MLP parameters.

CNNs, among the most popular deep learning networks, achieve state-of-the-art performance not only in image classification and recognition but also in audio classification and feature extraction. In my experiment I keep the CNN architecture as simple as possible, with only two convolutional layers and one pooling layer. I train for only 50 epochs and already get a much better result, about 93% accuracy. The main difficulties with a CNN are choosing the model architecture, such as the number of layers, and tuning the parameters sensibly and efficiently. Sometimes the model converges to a local optimum, and I have to retrain it or change the model parameters. Still, the CNN performs best of the four methods, so it was worth the effort of writing it and searching for a good solution.

In short, the CNN clearly has the best accuracy, the MLP also works well, and the SVM and logistic regression perform similarly in my case, although I believe the SVM can do better.