
# Instrument Classification Using Neural Networks

*Sean Russell*

## Summary

Online radio services such as Pandora and Spotify try to tailor radio stations to their users. For instance, someone who likes listening to the Beatles will be more likely to listen to other classic rock. However, most of these services do things the old-fashioned way: with the blood, sweat, and tears of human helpers. Before a song is added to the database of songs to be played, a person must go through the song and add tags to it, such as heavy bass, up-tempo, or electronic. The service then algorithmically decides which songs to play based on these tags.

This project is really just a starting point for automating the entire process. Having people tag songs by hand is time consuming and fairly repetitive. A process like this, which relies on human pattern recognition, seems like the perfect application for machine learning. So, this document looks at the most basic of the basics: classifying instruments.

The goal is to classify an audio file as containing one of five instruments: saxophone, piano, violin, trumpet, or guitar. Using signal processing techniques to preprocess the data and machine learning to create a classifier, I built a model that outperforms random guessing and in some cases does quite well. Also, in the conclusion, I have a whole bunch of ideas for further expansion and refinement in future work.

## Introduction

### The Goal
As stated in the summary, the intention of this project is to be able to classify instruments as belonging to one of five categories: guitar, violin, saxophone, piano, or trumpet. Since an instrument can only belong to one of those five categories, a random guess has a 20% chance of being correct. So, that is the base level benchmark. If the model predicts accurately more than 20% of the time, then it is at least better than nothing. To be able to compare with humans, however, it would have to be quite good. On the data I have selected, I think the average person could easily predict with 90%-95% accuracy (that is just my intuition, with no particular data to back it up).

### The Data

My sample audio data was taken from the site https://www.freesound.org/. It consists of .wav files of instruments from all five categories. When I was working on this, I had 3 gigabytes of audio data for training and testing my models. Since training took a very long time with that much data, I've included a small subset of the data for running this notebook.

One additional note is that I used a package called pysoundfile (imported as soundfile) to read in the audio data, so in all likelihood this notebook will not run unless it is installed. If you want to read more, here is the link to the documentation: https://pysoundfile.readthedocs.io/en/0.9.0/.

### The Algorithms

Because audio files are stored as a series of samples like this:

![image of Samples](http://sce2.umkc.edu/BIT/burrise/it222/notes/sound/sampling.gif)

there is pretty much no way that a linear classifier would do the trick. Or a quadratic classifier. So I opted for a neural network. I figured that since the relationships were more akin to a wave than any other function, a neural network would do the best job of approximating them. So I'm using the neural network module from scikit-learn, a Python machine learning package.

## Methods

In this section I'll show all of the code I used for my results and explain some of the motivations for some of the things I did. First I'll import all of the libraries that I need.

In [1]:
import os
import random
import soundfile as sf
from sklearn.neural_network import MLPClassifier
from numpy.fft import rfft,rfftfreq
import numpy as np
import matplotlib.pyplot as plt

### Data creation and preprocessing

This is probably where most of my effort went. While only a small amount of preprocessing is strictly necessary, I found that if I put in a bit more work on my side, both the accuracy of the models and the speed with which they are generated went way up.

First, I defined a dictionary to translate class labels into numbers and back.

In [2]:
classes = {'trumpet':0,'violin':1,'guitar':2,'piano':3,'saxaphone':4,
           0:'trumpet',1:'violin',2:'guitar',3:'piano',4:'saxaphone',}

Next I defined some functions that turn an audio file into numpy arrays that can be used by the machine learning algorithms. Since the number of samples in an audio file depends on its length, I needed a way to produce inputs of a fixed size. SplitSamples() slices the samples into multiple fixed-length rows (2000 samples each by default) and discards any excess.

In [3]:
def SplitSamples(samples, widthOfSlice = 2000):
    return np.reshape(samples[:len(samples)-len(samples)%widthOfSlice],(-1,widthOfSlice))

As an example, we can split an example array like this:

In [4]:
a = np.array([1,2,3,4,5,6,7,8,9])
SplitSamples(a,4)

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

So this turns one audio file into a bunch of nice, easy to work with vectors. Next, I defined DeleteSamplesLowAmplitude(). This function deletes every row produced by the previous function that has a low average amplitude. The reason I'm doing this is that audio files tend to have a lot of silence: gaps in between notes, the start before the instrument plays, and the end after the instrument finishes. Since classifying silence doesn't work so well, I created this function to discard the pieces that are silent. This is probably an area that could be improved upon quite a bit. I also thought of other ways of dealing with silence, such as having an additional class for the model to fit, but this was the easiest solution and the one that seemed to work best.

Deleting quiet samples really worked wonders. After implementing this function, accuracy for my model went up at least 10% across the board. However, some issues remain. Later on, I'll discuss how the accuracy for piano lags behind the accuracy for other instruments. One of my hypotheses is that because piano tends to be a quieter instrument, it is being adversely impacted by this filtering process.

In [5]:
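def DeleteSamplesLowAmplitude(samples):
    # Threshold computed over the whole file: (mean of all samples) * (std of all samples)
    threshold = np.mean(samples) * np.std(samples)
    # Drop every row whose signed mean falls below the threshold
    return np.delete(samples,np.where(np.mean(samples,axis=1) < threshold),axis=0)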
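One thing to keep in mind about the filter above is that audio samples are signed and oscillate around zero, so the mean of a row can sit close to zero even when the row is loud. Below is a minimal sketch of an alternative that measures loudness by root-mean-square (RMS) energy instead. The function name delete_quiet_rows and the fraction parameter are hypothetical, the default value is not tuned, and none of the results reported later were produced with it.

```python
import numpy as np

def delete_quiet_rows(rows, fraction=0.1):
    """Drop rows whose RMS energy is below `fraction` of the loudest row's RMS.

    Hypothetical alternative to DeleteSamplesLowAmplitude(); `fraction` is an
    illustrative default, not a tuned value.
    """
    rows = np.asarray(rows, dtype=float)
    rms = np.sqrt(np.mean(rows ** 2, axis=1))   # one loudness value per row
    if rms.size == 0 or rms.max() == 0:
        return rows[:0]                          # nothing audible to keep
    return rows[rms >= fraction * rms.max()]

# Two loud rows (one period of a sine wave) and one nearly silent row.
t = np.linspace(0, 1, 8, endpoint=False)
loud = np.sin(2 * np.pi * t)
demo = np.vstack([loud, 0.001 * loud, -loud])
kept = delete_quiet_rows(demo)
print(kept.shape)  # (2, 8)
```

Because RMS squares each sample before averaging, positive and negative swings no longer cancel, which may be gentler on quiet instruments than a signed mean.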

The FourierTransform() function is the final step in the preprocessing of audio files. The Fourier transform is one of the most useful techniques in signal processing. For a really excellent guide on the basics of the Fourier transform, I recommend [this article](https://betterexplained.com/articles/an-interactive-guide-to-the-fourier-transform/). The basic idea is this: really complicated waves are just made up of a bunch of simple waves, like this:

![image of Composite Waves](http://dagsaw.sdsu.edu/images/fig3-4.gif)

The Fourier transform just decomposes a really complicated wave into a bunch of simple waves.

![image of Fourier Transform](https://i.stack.imgur.com/Y5EAf.png)

Once a wave has been decomposed like this, it is much easier to analyze. So this is the final part of the preprocessing. Technically speaking, a neural network can also learn a model like this, and this step is not strictly necessary. However, by doing the Fourier transform myself I found that I was able to use much smaller networks with similar accuracies, which allowed for much shorter training times.

In [6]:
def FourierTransform(samples):
    return np.abs(rfft(samples))
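To make the effect of this step concrete, here is a small sketch that applies the same np.abs(rfft(...)) operation to a synthetic 440 Hz sine wave and uses rfftfreq to look up which frequency each output bin corresponds to. The 44100 Hz sample rate and the test tone are assumptions chosen for illustration; the real files may use other rates.

```python
import numpy as np
from numpy.fft import rfft, rfftfreq

samplerate = 44100          # assumed sample rate in Hz
n = 2000                    # default slice width used by SplitSamples()
t = np.arange(n) / samplerate
tone = np.sin(2 * np.pi * 440.0 * t)     # synthetic 440 Hz sine wave

magnitudes = np.abs(rfft(tone))           # same operation as FourierTransform()
freqs = rfftfreq(n, d=1.0 / samplerate)   # frequency in Hz of each bin

peak = freqs[np.argmax(magnitudes)]
print(magnitudes.shape)          # (1001,)
print(round(float(peak), 1))     # 441.0
```

A 2000-sample slice therefore becomes a vector of 1001 magnitudes, and at this assumed sample rate the bins are 22.05 Hz apart, which is why the 440 Hz tone shows up in the 441 Hz bin.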

And finally, FileToData(), which uses the above methods to turn an audio file into something that can be used by the neural network.

In [7]:
def FileToData(filename):
    samples,samplerate = sf.read(filename)
    samples = SplitSamples(samples)
    samples = DeleteSamplesLowAmplitude(samples)
    samples = FourierTransform(samples)
    return samples

So this is how I convert individual files into data. The next step was to convert a whole bunch of files at once and to partition the result into training and testing data. The next two functions do this by iterating through every file in the samples directory and reading each one with the functions above. Each file is randomly assigned to either the training set (with probability 2/3) or the testing set, and the class label is stored in the first column so that data and targets can be separated afterwards.

In [8]:
def GenerateData():
    trumpetTrain,trumpetTest = GenerateDataFromLabel('trumpet')
    violinTrain,violinTest = GenerateDataFromLabel('violin')
    guitarTrain,guitarTest = GenerateDataFromLabel('guitar')
    pianoTrain,pianoTest = GenerateDataFromLabel('piano')
    saxaphoneTrain,saxaphoneTest = GenerateDataFromLabel('saxaphone')
    train = np.vstack((trumpetTrain,violinTrain,guitarTrain,pianoTrain,saxaphoneTrain))
    test = np.vstack((trumpetTest,violinTest,guitarTest,pianoTest,saxaphoneTest))
    traindata = train[:,1:]
    testdata = test[:,1:]
    traintargets = train[:,0].astype(int)
    testtargets = test[:,0].astype(int)
    return traindata,traintargets,testdata,testtargets

def GenerateDataFromLabel(label):
    train = []
    test = []
    for file in os.listdir('samples/' + label):
        if(file.endswith('.wav')):
            if(random.choice([True,True,False])):
               train += FileToData('samples/' + label + '/' + file).tolist()
            else:
               test += FileToData('samples/' + label + '/' + file).tolist()
    train = np.array(train)
    test = np.array(test)
    train = np.insert(train,0,classes[label],axis=1)
    test = np.insert(test,0,classes[label],axis=1)
    return train,test
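Because the split above uses an unseeded random.choice, every run sees a different partition, which makes runs harder to compare. A minimal sketch of a repeatable file-level split follows. The helper split_files and its seed and train_fraction parameters are hypothetical and were not part of the original experiments.

```python
import random

def split_files(filenames, train_fraction=2/3, seed=0):
    """Assign each file to train or test with a seeded, repeatable coin flip."""
    rng = random.Random(seed)            # private generator with a fixed seed
    train, test = [], []
    for name in sorted(filenames):       # sort so directory listing order cannot matter
        (train if rng.random() < train_fraction else test).append(name)
    return train, test

names = ['a1.wav', 'a2.wav', 'a3.wav', 'a4.wav', 'a5.wav', 'a6.wav']
train_a, test_a = split_files(names)
train_b, test_b = split_files(list(reversed(names)))
print(train_a == train_b and test_a == test_b)  # True
```

Like the original, this splits by file rather than by slice, so all slices of one recording stay on the same side of the split.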

### Algorithms

As I said above, I'm using a neural network classifier from the scikit-learn package. Since it works very well right out of the box, not a lot of effort had to go into the actual machine learning itself. However, figuring out the correct shape of the network is still an ongoing process.

Most of the work for this section went into producing useful diagnostic information. I wanted to see firstly how accurate the classifier was, but I also wanted to be able to see how it was misclassifying as well. I figured if I could tell how it wasn't working, that would offer me insights into how to make it work.

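So, I created one function to do everything. First, it generates the data using the functions above. Then, it fits the model using the training data. Finally, it compares the predicted values for the testing set against the actual values and prints out some useful information. It then returns the trained classifier. One caveat when reading the output: the per-class breakdown selects rows by predicted label (column 0 of predictedVsTargets), so each percentage is a share of the segments predicted as that class, and the overall figure is printed as a fraction rather than a percentage.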

In [9]:
def CreateModel():
    print('Generating data...')
    trainingData,trainingTargets,testingData,testingTargets = GenerateData()
    print('Finished generating data. Training model...')
    classifier = MLPClassifier(solver='lbfgs',hidden_layer_sizes=(200,50))
    classifier.fit(trainingData,trainingTargets)
    print('Finished training model. Gathering statistics...')
    predicted = classifier.predict(testingData)
    totalNumberCorrect = np.sum(predicted == testingTargets)
    print('total % correct on testing data:',totalNumberCorrect / len(testingTargets))
    predictedVsTargets = np.vstack((predicted,testingTargets)).T
    for c in range(5):
        subset = predictedVsTargets[predictedVsTargets[:,0] == c]
        print('\t' + classes[c].upper())
        print(classes[c],'% correct:',
              "%.2f" % (100*len(subset[subset[:,0]==subset[:,1]]) / len(subset)))
        print('% guessed piano:',
              "%.2f" % (100*len(subset[subset[:,1]==classes['piano']])/len(subset)))
        print('% guessed violin:',
              "%.2f" % (100*len(subset[subset[:,1]==classes['violin']])/len(subset)))
        print('% guessed trumpet:',
              "%.2f" % (100*len(subset[subset[:,1]==classes['trumpet']])/len(subset)))
        print('% guessed guitar:',
              "%.2f" % (100*len(subset[subset[:,1]==classes['guitar']])/len(subset)))
        print('% guessed saxaphone:',
              "%.2f" % (100*len(subset[subset[:,1]==classes['saxaphone']])/len(subset)))

    return classifier
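Since the breakdown in CreateModel() filters on the predicted column, its percentages describe what the segments predicted as each class really were. A confusion matrix gives both that view and the per-true-class view at once. The sketch below is a hand-written helper with hypothetical names (it is not the scikit-learn function), run on made-up labels purely for illustration.

```python
import numpy as np

def confusion_counts(targets, predicted, n_classes=5):
    """counts[i, j] = number of segments with true class i predicted as class j."""
    counts = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(counts, (targets, predicted), 1)   # accumulate repeated index pairs
    return counts

# Tiny made-up example with 3 classes, purely for illustration.
targets   = np.array([0, 0, 0, 1, 1, 2])
predicted = np.array([0, 0, 1, 1, 1, 1])
counts = confusion_counts(targets, predicted, n_classes=3)

row_sums = counts.sum(axis=1)   # segments that truly belong to each class
col_sums = counts.sum(axis=0)   # segments predicted as each class
recall = np.diag(counts) / np.maximum(row_sums, 1)      # share of each true class found
precision = np.diag(counts) / np.maximum(col_sums, 1)   # share of each prediction that is right
print(counts)
print(recall)
print(precision)
```

Reading along a row shows where a true instrument's segments end up, while reading down a column shows what a given prediction is made of.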

Once the classifier has been generated, it can be used to classify audio files. ClassifyAudioFile() does this by splitting the file into a bunch of parts, the same way I did when preprocessing the training data. It then classifies each of those parts individually, and the file is assigned the class that appears most frequently. It's sort of like the classifier is voting on what the instrument is based on the smaller individual parts. If no parts survive the silence filter, the function returns -1 instead of a class name.

In [10]:
def ClassifyAudioFile(classifier,filename):
    vectors = FileToData(filename)
    if len(vectors) < 1:
        return -1
    predicted = classifier.predict(vectors)
    mostPredicted = np.argmax(np.bincount(predicted))
    return classes[mostPredicted]
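The voting step can be seen in isolation with a handful of made-up per-slice predictions. This sketch uses the same np.bincount and np.argmax calls as ClassifyAudioFile(); the confidence value is an extra, hypothetical addition that the original function does not compute.

```python
import numpy as np

# Hypothetical per-slice predictions for one file (class indices from `classes`).
predicted = np.array([1, 1, 4, 1, 2, 1, 4])

votes = np.bincount(predicted, minlength=5)   # votes[i] = slices predicted as class i
winner = int(np.argmax(votes))                # ties resolve to the lowest class index
confidence = votes[winner] / votes.sum()      # share of slices that agree

print(votes)    # [0 4 1 0 2]
print(winner)   # 1
```

Here class 1 (violin) wins with four of the seven slices, even though three slices disagreed.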

## Results

So, to start off the results section, I'll just show an example run of the whole system.

In [None]:
classifier = CreateModel()

Generating data...
Finished generating data. Training model...


This seems to be a pretty typical run. Trumpet, violin, guitar, and saxophone are all classified fairly accurately, with accuracies from around 70 to 90 percent. The model has more difficulty with piano, which it seems to have trouble telling apart from guitar and saxophone. This has been true across almost all of the trials I have run. Piano ranges from 20 to 60 percent accuracy, occasionally dipping below the 20 percent benchmark that random guessing among five classes would achieve. The overall accuracy hovers fairly close to 70 percent, which is in my opinion very good.

So the low scores for piano are what vex me. I have multiple hypotheses as to why the accuracy for piano might be lower than the accuracies for the other instruments.

Hypothesis one is that the training data is simply not as good as it could be. Since I downloaded 3 gigabytes of audio data, I didn't have time to properly go through the whole dataset and verify its quality. Some lower quality recordings might have made their way into the mix. However, I believe this to be unlikely, as runs with fewer samples that I have verified myself still seem to show this gap.

Hypothesis two is that the component frequencies of pianos are very similar to those of guitars and saxophones. The problem with this hypothesis is that if this were the case, and it was difficult to tell the instruments apart, the error should go both ways. However, saxophone and guitar are both classified fairly accurately, which makes me think that this is also not the case.

Hypothesis three is based on the notion that piano is a quiet instrument. Since I am discarding audio segments that are too quiet because there might not be any instrument playing in them, I may be disproportionately discarding piano segments, which would lead to skewed results in the model.

Hypothesis four is that because I have less data for the piano, the learning is not happening in a proper manner. If this were the case, once I obtained more piano data, the issue would be resolved.

*NOTE: At the eleventh hour, I do believe I have found hypothesis four to be correct. I don't have time for further testing to make certain I fixed the problem, but my method has helped. Instead of finding more piano data, I simply reduced the amount of data across all of the other instruments. This seemingly decreased overall accuracy, but the gap between piano and the other instruments looks to have narrowed significantly. These are just preliminary findings, but I do find them compelling.*
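One way the "reduce the other instruments" idea could be automated is to randomly downsample every class to the size of the smallest one. The sketch below is only an illustration of that idea under assumed names (balance_classes, seed); it is not the exact procedure used for the preliminary findings above.

```python
import numpy as np

def balance_classes(data, targets, seed=0):
    """Randomly keep the same number of rows for every class (the smallest count)."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(targets, return_counts=True)
    keep = np.concatenate([
        rng.choice(np.flatnonzero(targets == label), size=counts.min(), replace=False)
        for label in labels
    ])
    keep.sort()                       # preserve the original row order
    return data[keep], targets[keep]

# Made-up data: 6 rows of class 0, 2 rows of class 1.
demo_targets = np.array([0, 0, 0, 0, 0, 0, 1, 1])
demo_data = np.arange(16).reshape(8, 2)
balanced_data, balanced_targets = balance_classes(demo_data, demo_targets)
print(np.bincount(balanced_targets))  # [2 2]
```

The trade-off is the one noted above: throwing away data from the well-represented instruments can lower overall accuracy while narrowing the gap for the under-represented one.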

Also, here are some cherry-picked examples of the model correctly classifying audio files:

In [27]:
print('Violin?',ClassifyAudioFile(classifier,'samples/violin/a2.wav'))
print('Guitar?',ClassifyAudioFile(classifier,'samples/guitar/353492__matteshaus__guitchord1.wav'))
print('Trumpet?',ClassifyAudioFile(classifier,'samples/trumpet/357326__mtg__trumpet-b3-bad-richness.wav'))

Violin? violin
Guitar? guitar
Trumpet? trumpet


Of course, it doesn't always work...

In [40]:
print('Piano?',ClassifyAudioFile(classifier,'samples/piano/39206__jobro__piano-ff-058.wav'))

Piano? trumpet


### Lots of Data

I did one more run using all 3 gigabytes of my data. This did not fix the piano gap; however, the accuracies for all of the other instruments are now hovering around 90%. Currently, the data is not balanced between the instruments: I have 10 times more violin and saxophone data than piano data. This leads me to believe that if I were to have a more complete set of piano data, the accuracy might improve dramatically.

In [20]:
classifier = CreateModel()

Generating data...
Finished generating data. Training model...
Finished training model. Gathering statistics...
total % correct on testing data: 0.945274576593
	TRUMPET
trumpet % correct: 96.09
% guessed piano: 0.25
% guessed violin: 2.24
% guessed trumpet: 96.09
% guessed guitar: 0.30
% guessed saxaphone: 1.12
	VIOLIN
violin % correct: 97.49
% guessed piano: 0.07
% guessed violin: 97.49
% guessed trumpet: 0.90
% guessed guitar: 0.48
% guessed saxaphone: 1.06
	GUITAR
guitar % correct: 88.99
% guessed piano: 6.19
% guessed violin: 2.20
% guessed trumpet: 1.62
% guessed guitar: 88.99
% guessed saxaphone: 1.00
	PIANO
piano % correct: 19.72
% guessed piano: 19.72
% guessed violin: 0.66
% guessed trumpet: 1.60
% guessed guitar: 66.10
% guessed saxaphone: 11.92
	SAXAPHONE
saxaphone % correct: 91.07
% guessed piano: 1.45
% guessed violin: 1.55
% guessed trumpet: 1.18
% guessed guitar: 4.75
% guessed saxaphone: 91.07


## Conclusion

So, the classifier works! More or less. With around 70% accuracy, the neural network was able to distinguish between violin, piano, trumpet, saxophone, and guitar. Preprocessing really helped speed things up and improve the accuracy of the model, at least for this problem. However, there is still much that can be improved upon. 70% is pretty good, and is much better than random guessing, but I think it is totally feasible for people to achieve accuracies of 90%, and so that is what the model should be going for. In addition, the issue where piano is classified much less accurately than the others remains an area for further research. In fact, once this area is cleared up, we might be seeing accuracies pretty close to the 90% range.

### Ideas for Future Work

I really enjoyed working on this project, so I think it is likely that I'll expand on it in the future. These here notes are as much for me as they are for you.

First, I think more work could be done on the collection of useful data. When I ran some trials with much larger but somewhat skewed datasets, I could get accuracies up into the 90% range. Perhaps the issue with the piano would completely go away if only I had more piano data. Going hand in hand with that, some improvements on performance when reading in data would make the development process go much smoother.

Along the same performance lines, I think that experimenting with machine learning packages such as tensorflow that utilize GPU optimizations for their algorithms could also help tremendously in development.

Currently, the process only really works on very specific sorts of audio files. They have to be in the .wav format, they have to possess only a single channel, they have to be at a certain sample rate, etc. I thought that perhaps including metadata in the algorithm as several additional data points could improve the generality of these methods. Or, perhaps, I could take the coward's way out and just use some sort of utility to convert other audio files into the correct format before applying my functions.
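As a rough idea of what the "convert first" route might look like, the sketch below averages the channels of a multi-channel array (soundfile returns shape (frames, channels) for such files) and resamples by linear interpolation. Both helpers are hypothetical, and the interpolation is deliberately crude: it applies no anti-aliasing filter, so a dedicated resampling tool would likely do better.

```python
import numpy as np

def to_mono(samples):
    """Average the channels of a (frames, channels) array; pass 1-D audio through."""
    samples = np.asarray(samples, dtype=float)
    return samples if samples.ndim == 1 else samples.mean(axis=1)

def resample_linear(samples, source_rate, target_rate):
    """Crude resampling by linear interpolation (no anti-aliasing filter)."""
    duration = len(samples) / source_rate
    n_out = int(round(duration * target_rate))
    old_times = np.arange(len(samples)) / source_rate
    new_times = np.arange(n_out) / target_rate
    return np.interp(new_times, old_times, samples)

stereo = np.array([[1.0, 3.0], [2.0, 4.0], [3.0, 5.0], [4.0, 6.0]])
mono = to_mono(stereo)
half_rate = resample_linear(mono, source_rate=4, target_rate=2)
print(mono)       # [2. 3. 4. 5.]
print(half_rate)  # [2. 4.]
```

Running every file through steps like these before SplitSamples() would at least guarantee that each 2000-sample slice covers the same span of time.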

I could probably waste an endless amount of time trying to optimize the shape of the networks and the preprocessing methods to come up with a method that increases accuracy. This is pretty low on the list of things to do in my opinion, but these changes are also quite easy to make.

The diagnostic information that the CreateModel() method provides was instrumental in figuring out everything that I did. I think it would be nice if it provided even more information, perhaps some more things about the metadata of the data sets themselves. I believe that would prove useful in the creation of more advanced methods of classification.
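One cheap piece of extra information would be how many segments each class contributes to a data set. The sketch below assumes integer targets in the same 0 to 4 encoding as the classes dictionary; the helper summarize_targets and the example targets are made up for illustration.

```python
import numpy as np

# Index-to-name half of the `classes` dictionary defined earlier.
names = ['trumpet', 'violin', 'guitar', 'piano', 'saxaphone']

def summarize_targets(targets, names=names):
    """Return {class name: (segment count, share of the data set)}."""
    counts = np.bincount(targets, minlength=len(names))
    total = max(counts.sum(), 1)
    return {name: (int(c), c / total) for name, c in zip(names, counts)}

# Made-up targets for illustration only.
summary = summarize_targets(np.array([0, 1, 1, 1, 3, 4, 4, 1]))
for name, (count, share) in summary.items():
    print(f'{name}: {count} segments ({100 * share:.1f}%)')
```

Printing this for both the training and testing targets would make class imbalance visible before any model is trained.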

Finally, my next big ambition is to classify an audio file where multiple instruments are playing at once, and identify each one. Once I get the accuracy of the current method of classifying lone instruments up high enough, that is the next big step.