$\newcommand{\xv}{\mathbf{x}}
\newcommand{\Xv}{\mathbf{X}}
\newcommand{\yv}{\mathbf{y}}
\newcommand{\zv}{\mathbf{z}}
\newcommand{\av}{\mathbf{a}}
\newcommand{\Wv}{\mathbf{W}}
\newcommand{\wv}{\mathbf{w}}
\newcommand{\tv}{\mathbf{t}}
\newcommand{\Tv}{\mathbf{T}}
\newcommand{\muv}{\boldsymbol{\mu}}
\newcommand{\sigmav}{\boldsymbol{\sigma}}
\newcommand{\phiv}{\boldsymbol{\phi}}
\newcommand{\Phiv}{\boldsymbol{\Phi}}
\newcommand{\Sigmav}{\boldsymbol{\Sigma}}
\newcommand{\Lambdav}{\boldsymbol{\Lambda}}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$

# Instrument Classification Using Neural Networks

*Sean Russell*

## Abstract

## Introduction

## Methods

In [12]:
import os
import random
import soundfile as sf
from sklearn.neural_network import MLPClassifier
from numpy.fft import rfft,rfftfreq
import numpy as np
import matplotlib.pyplot as plt

### Data creation and preprocessing

In [15]:
classes = {'trumpet':0,'violin':1,'guitar':2,'piano':3,'saxaphone':4,
           0:'trumpet',1:'violin',2:'guitar',3:'piano',4:'saxaphone',}

In [2]:
def GenerateData():
    trumpetTrain,trumpetTest = GenerateDataWithLabels('trumpet')
    violinTrain,violinTest = GenerateDataWithLabels('violin')
    guitarTrain,guitarTest = GenerateDataWithLabels('guitar')
    pianoTrain,pianoTest = GenerateDataWithLabels('piano')
    saxaphoneTrain,saxaphoneTest = GenerateDataWithLabels('saxaphone')
    train = np.vstack((trumpetTrain,violinTrain,guitarTrain,pianoTrain,saxaphoneTrain))
    test = np.vstack((trumpetTest,violinTest,guitarTest,pianoTest,saxaphoneTest))
    traindata = train[:,1:]
    testdata = test[:,1:]
    traintargets = train[:,0].astype(int)
    testtargets = test[:,0].astype(int)
    return traindata,traintargets,testdata,testtargets

def GenerateDataWithLabels(label):
    train,test = GenerateDataFromDirectory('samples-train/'+label)
    train = np.insert(train,0,classes[label],axis=1)
    test = np.insert(test,0,classes[label],axis=1)
    return train,test

def GenerateDataFromDirectory(directory):
    train = []
    test = []
    for file in os.listdir(directory):
        if(file.endswith('.wav')):
            if(random.choice([True,False])):
               train += FileToVectors(directory + '/' + file).tolist()
            else:
               test += FileToVectors(directory + '/' + file).tolist()
    return np.array(train),np.array(test)

def FileToVectors(filename):
    samples,samplerate = sf.read(filename)
    split = ProcessSamples(samples)
    return split

def ProcessSamples(samples):
    samples = SplitSamples(samples)
    samples = DeleteSamplesLowAmplitude(samples)
    samples = FourierTransform(samples)
    return samples

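def DeleteSamplesLowAmplitude(samples):
    # Drop frames (rows) whose mean sample value is below the threshold.
    # Note: the mean is over signed samples; np.abs would be needed to measure amplitude.
    mean = 0.00001
    return np.delete(samples,np.where(np.mean(samples,axis=1) < mean),axis=0)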

def FourierTransform(samples):
    return np.abs(rfft(samples))

def SplitSamples(samples):
    widthOfVector = 2000
    return np.reshape(samples[:len(samples)-len(samples)%widthOfVector],(-1,widthOfVector))


### Algorithms

In [13]:
def Train(data, targets):
    classifier = MLPClassifier(solver='lbfgs',hidden_layer_sizes=(100,25))
    classifier.fit(data,targets)
    return classifier

def Use(classifier,filename):
    vectors = FileToVectors(filename)
    if len(vectors) < 1:
        return -1
    predicted = classifier.predict(vectors)
    mostPredicted = np.argmax(np.bincount(predicted))
    return mostPredicted

def CreateModel():
    print('Generating data...')
    trainingData,trainingTargets,testingData,testingTargets = GenerateData()
    print('Finished generating data. Training model...')
    classifier = Train(trainingData,trainingTargets)
    print('Finished training model. Gathering statistics...')
    predicted = classifier.predict(testingData)
    totalNumberCorrect = np.sum(predicted == testingTargets)
    print('total % correct on testing data:',totalNumberCorrect / len(testingTargets))
    predictedVsTargets = np.vstack((predicted,testingTargets)).T
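    for c in range(5):
        # Column 0 holds predictions and column 1 holds targets, so this selects
        # the frames predicted as class c; the percentages below are over that subset.
        subset = predictedVsTargets[predictedVsTargets[:,0] == c]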
        print('\t' + classes[c].upper())
        print(classes[c],'% correct:',
              "%.2f" % (100*len(subset[subset[:,0]==subset[:,1]]) / len(subset)))
        print('% guessed piano:',
              "%.2f" % (100*len(subset[subset[:,1]==classes['piano']])/len(subset)))
        print('% guessed violin:',
              "%.2f" % (100*len(subset[subset[:,1]==classes['violin']])/len(subset)))
        print('% guessed trumpet:',
              "%.2f" % (100*len(subset[subset[:,1]==classes['trumpet']])/len(subset)))
        print('% guessed guitar:',
              "%.2f" % (100*len(subset[subset[:,1]==classes['guitar']])/len(subset)))
        print('% guessed saxaphone:',
              "%.2f" % (100*len(subset[subset[:,1]==classes['saxaphone']])/len(subset)))

    return classifier
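
`Use` classifies each frame of a file separately and then reports the label that occurs most often. The sketch below shows that majority vote in isolation on made-up frame predictions, using `collections.Counter` in place of `np.bincount`. The helper name `majority_label` and the sample predictions are hypothetical.

```python
from collections import Counter

# Same index-to-name mapping as the notebook's `classes` dictionary.
classes = {0: 'trumpet', 1: 'violin', 2: 'guitar', 3: 'piano', 4: 'saxaphone'}

def majority_label(frame_predictions):
    """Return the most common predicted class index, or -1 if there are no frames."""
    if len(frame_predictions) < 1:
        return -1
    return Counter(frame_predictions).most_common(1)[0][0]

# Hypothetical per-frame predictions for one file: mostly violin, a few mistakes.
frame_predictions = [1, 1, 4, 1, 0, 1, 1]
winner = majority_label(frame_predictions)
print(winner, classes[winner])
```

This prints `1 violin`. One difference from the notebook version: on a tie, `np.argmax(np.bincount(...))` returns the lowest class index, while `Counter.most_common` returns whichever tied label appeared first.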


## Results

## Conclusion

In [16]:
c = CreateModel()

Generating data...
Finished generating data. Training model...
Finished training model. Gathering statistics...
total % correct on testing data: 0.940694662337
	TRUMPET
trumpet % correct: 94.90
% guessed piano: 0.22
% guessed violin: 2.37
% guessed trumpet: 94.90
% guessed guitar: 0.56
% guessed saxaphone: 1.96
% unknown: 0.00
	VIOLIN
violin % correct: 97.29
% guessed piano: 0.03
% guessed violin: 97.29
% guessed trumpet: 1.02
% guessed guitar: 0.43
% guessed saxaphone: 1.24
% unknown: 0.00
	GUITAR
guitar % correct: 91.02
% guessed piano: 6.08
% guessed violin: 0.85
% guessed trumpet: 1.30
% guessed guitar: 91.02
% guessed saxaphone: 0.75
% unknown: 0.00
	PIANO
piano % correct: 26.09
% guessed piano: 26.09
% guessed violin: 1.48
% guessed trumpet: 11.56
% guessed guitar: 47.93
% guessed saxaphone: 12.94
% unknown: 0.00
	SAXAPHONE
saxaphone % correct: 88.87
% guessed piano: 1.53
% guessed violin: 1.69
% guessed trumpet: 2.50
% guessed guitar: 5.41
% guessed saxaphone: 88.87
% unknown: 0
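
A per-class breakdown like the one above is essentially a confusion matrix printed one class at a time. The following sketch builds one from made-up targets and predictions; the labels and the helper `confusion_counts` are illustrative only and are not the data behind the numbers above. Here each row is normalized over the samples that truly belong to a class, so the diagonal entry is that class's recall. Normalizing over the samples predicted as a class would give precision instead.

```python
def confusion_counts(targets, predictions, n_classes):
    """counts[t][p] = number of samples with true class t predicted as class p."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(targets, predictions):
        counts[t][p] += 1
    return counts

# Made-up labels for three classes, purely for illustration.
targets     = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
predictions = [0, 0, 0, 1, 1, 1, 2, 2, 2, 0]

counts = confusion_counts(targets, predictions, n_classes=3)
for true_class, row in enumerate(counts):
    total = sum(row)
    # Percentage of this true class assigned to each predicted class.
    percents = ["%.2f" % (100 * c / total) for c in row]
    print(true_class, percents)
```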

## Introduction

Internet radio services such as Spotify and Pandora are very popular right now. Their big selling point is that they can personalize stations, so that each user listens to a custom channel. To do this, the services have experts poring over music, looking for particular elements, such as an upbeat tempo or heavy bass, that can be used to connect songs. The reasoning is that if you like one song with heavy bass, you will likely enjoy another.

However, this process is extremely time-intensive. Analyzing a 4-minute song can take up to half an hour, and at the rate that content is being generated it is difficult to keep up. The process is also incomplete. Pandora, for instance, looks at 400 different elements to categorize music. While this is impressive, there are innumerable ways to classify songs, and using only 400 can lead to oversights.

So, the motivation for applying machine learning to music is to reduce the human workload of analyzing music and to increase the accuracy of the categorization. My goal for this project is to use signal processing and machine learning techniques to begin this process by determining the instrument present in a sample of sound.

## Methods

I expect three steps will be necessary: first, reading in sound from a file; second, converting that sound into something usable by classification algorithms; and third, classifying the sound using those algorithms.

Reading in sound files requires only a bit of knowledge about the structure of the file. Audio is stored as a series of amplitudes, as in this image:

![image of Samples](http://sce2.umkc.edu/BIT/burrise/it222/notes/sound/sampling.gif)

with a little bit of additional header information. So to read in the sound file, just read in the samples. Unfortunately, this data is not particularly easy to work with. A single second of audio contains about 44,100 datapoints at the CD-quality sample rate of 44.1 kHz, which would make machine learning fairly slow. The data is also inflexible, since the input size would depend on how long each sample was. Another problem is that an individual datapoint is essentially useless on its own: the way something sounds is determined by the relative positioning of all of the datapoints, or more precisely, by the sound waves that compose it.
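
As a rough illustration of the size problem, the sketch below builds one second of a synthetic 440 Hz tone at an assumed 44,100 Hz sample rate and cuts it into fixed-width frames, so every input vector has the same length no matter how long the recording is. The helper name `split_into_frames` is hypothetical, the tone stands in for a real recording, and the width of 2000 is the value used by `SplitSamples` in the notebook code.

```python
import math

def split_into_frames(samples, frame_width):
    """Cut a list of samples into equal-length frames, dropping the remainder."""
    usable = len(samples) - len(samples) % frame_width
    return [samples[i:i + frame_width] for i in range(0, usable, frame_width)]

sample_rate = 44100  # assumed CD-quality rate, in samples per second
# One second of a synthetic 440 Hz sine tone stands in for a real recording.
samples = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

frames = split_into_frames(samples, frame_width=2000)
print(len(samples), len(frames), len(frames[0]))
```

Running this prints `44100 22 2000`: the final 100 samples do not fill a frame and are discarded.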

### About the Fourier Transform

The Fourier transform is a way to turn a complicated wave into a set of simple waves. In the picture below,

![image of Composite Waves](http://dagsaw.sdsu.edu/images/fig3-4.gif)

the top three waves are all very simple. However, when they are added together, a much more complicated and interesting wave emerges (the fourth one). This is what Fourier analysis is all about: breaking a complicated wave up into its simplest components so that they can be analyzed.

![image of Fourier Transform](https://i.stack.imgur.com/Y5EAf.png)
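
To make the decomposition concrete, here is a minimal sketch that adds three sine waves together and then recovers their frequencies from the summed signal. It uses a naive discrete Fourier transform written with the standard library so that it runs without dependencies; `numpy.fft.rfft` computes the same coefficients far more efficiently. The frequencies, amplitudes, and the helper name `dft_magnitudes` are illustrative assumptions.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Naive O(n^2) DFT; returns magnitudes for frequency bins 0..n//2."""
    n = len(signal)
    magnitudes = []
    for k in range(n // 2 + 1):
        coeff = sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        magnitudes.append(abs(coeff))
    return magnitudes

n = 200  # samples covering exactly one second, so bin k corresponds to k Hz
components = {3: 1.0, 7: 0.5, 12: 0.25}  # frequency in Hz -> amplitude

# The complicated wave is just the sum of the simple ones.
signal = [
    sum(amp * math.sin(2 * math.pi * freq * t / n) for freq, amp in components.items())
    for t in range(n)
]

magnitudes = dft_magnitudes(signal)
# The three largest bins are the frequencies that were mixed in.
strongest = sorted(range(len(magnitudes)), key=lambda k: magnitudes[k], reverse=True)[:3]
print(sorted(strongest))
```

The script prints `[3, 7, 12]`, the three frequencies that went into the mix, and each peak's height is proportional to the amplitude of its wave.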

So, my idea is to convert complicated sound samples into their simplest component parts, the building-block waves, and use those as input to the classification algorithms. That way, the inputs can be dramatically smaller, are independent of the length of the sound sample, and still accurately carry most of the information encoded in a sound. Since there will be a lot of work in turning raw data into usable inputs, I will rely on algorithms that others have created, both from class and from the internet. I will assemble the data myself, which will be a somewhat involved process. Fortunately, there are a lot of sound samples available for free online, so it will just be a matter of collecting and tagging them.

## Possible Results

Hopefully, sounds can be classified based only on their component waves. If that works, I would like to try more complicated examples where more than one instrument is playing at a time. I expect that multiple simultaneous instruments will be much harder for the algorithm than a single one, and that classification will not really work without further adjustments to the process.

## Timeline

April 14- Implement methods for reading in sound samples and doing Fourier analysis

April 21- Assemble dataset of sound samples on which to do learning and classification

April 28- Use algorithms to classify sound samples, run experiments

May 5- Finish writing report