# 1 Author

**Student Name**:  Yashika

**Student ID**:  220802299



# 2 Problem formulation

Using the MLEnd London Sounds dataset, we want to train a machine learning model that takes an audio segment as input and predicts whether it was recorded indoors or outdoors. We will then evaluate the accuracy of the model on the training and test data.

# 3 Machine Learning pipeline

The following steps will be carried out:

1. First, we extract a sample of 1000 audio files into a location that the program can access.
2. Next, we load the CSV file containing the information about the Sounds dataset (the attributes of each recording) into a pandas dataframe.
3. We then identify which features are relevant for classifying an audio file as indoor or outdoor. Feature selection is an important step because the predictions depend on it: if the features are not selected carefully, the model may underfit or overfit. The features selected for this problem are:
  1. Flatness
  2. Signal to noise ratio
  3. Energy
  4. Chroma
4. Next, we define functions to calculate the values of these features.
5. Once the functions are defined, we iterate over the audio files and compute the feature values for each one. After this step, we have 2 arrays, X and Y: X contains the feature values of each audio file and Y contains its actual label, i.e. whether the audio was recorded indoors (1) or outdoors (0).
6. Once we have the dataset, we need to select the model to train. Different types of models can be accurate for different types of datasets, so we will analyse the accuracy on the training and test data using different models and choose the final model accordingly.
7. We split the dataset into 2 parts, train and test, using 70% of the dataset for training and the remaining 30% for testing.
8. With the datasets ready, we train the selected model on the training dataset.
9. We then predict the training and validation labels using the trained model and calculate the accuracy for each.
10. We can check how the accuracy changes by running steps 7-9 for a number of iterations.
11. Next, we can normalise the dataset and repeat steps 7-10 to see whether the accuracy improves.
12. Finally, we analyse the accuracy of the different models and select the final model based on the results.

# 4 Transformation stage


Because our data is in the form of audio files, we cannot use it directly to make predictions. We therefore transform the dataset to extract relevant features from it. The input to this step is the set of audio files and the output is a 2D array that contains 4 features for each audio file. Each row in X corresponds to an audio file and each column is one of the features described above. The resulting dataset after transformation will be of the form:

X(features): (1000,4)

Y(labels): (1000,)

The reason for selecting each of the features is:

1. Flatness - It indicates whether a sound is more like a tone or more like noise. This can help in identifying the label because an audio recorded outdoors may contain more noise than one recorded indoors.

2. Signal to noise ratio - It compares the strength of the audio signal to the level of noise in the audio, so it describes the quality of the recording. Again, we expect outdoor recordings to be more prone to noise than indoor ones.

3. Energy - The energy of a signal describes its total magnitude. For audio signals, it tells us how loud the signal is. We expect outdoor recordings to have higher energy than indoor recordings, so this feature can be useful for classifying the label.

4. Chroma - This feature relates to the pitch content of the sound. It can be useful in identifying the label if the pitch content of outdoor recordings differs from that of indoor recordings; our assumption is that it will be higher outdoors in most cases.
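
To make the flatness and energy features more concrete, the sketch below computes a single-frame spectral flatness and an RMS energy with NumPy for a pure tone and for white noise. The helper names `spectral_flatness` and `rms_energy` and the synthetic signals are assumptions made for illustration; the notebook itself uses librosa, which works frame by frame.

```python
import numpy as np

def spectral_flatness(x, eps=1e-10):
    """Geometric mean divided by arithmetic mean of the power spectrum."""
    power = np.abs(np.fft.rfft(x)) ** 2 + eps  # eps avoids log(0)
    return np.exp(np.mean(np.log(power))) / np.mean(power)

def rms_energy(x):
    """Root-mean-square amplitude of the whole signal."""
    return np.sqrt(np.mean(x ** 2))

fs = 8000                                # hypothetical sampling rate
t = np.arange(fs) / fs                   # one second of samples
tone = np.sin(2 * np.pi * 440 * t)       # tonal signal
noise = np.random.default_rng(0).normal(0, 1, fs)  # noise-like signal

flat_tone, flat_noise = spectral_flatness(tone), spectral_flatness(noise)
print("flatness tone :", flat_tone)
print("flatness noise:", flat_noise)
print("RMS tone      :", rms_energy(tone))
```

The tone gives a flatness close to 0 and the noise a much larger value, which is the contrast the flatness feature is meant to capture.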

# 5 Modelling

Here, we will train multiple models: SVM, KNN, Random Forest Classifier and Logistic Regression. We will analyse the accuracy on our training and test data with each of these models and choose the most appropriate model at the end based on the results.

# 6 Methodology


We will divide the X (features) and Y (actual labels) arrays into 2 parts, train and test. 70% of the data will be used to train the model and the remaining 30% will be used to test it. We will then measure the training and validation accuracy over a number of iterations. We need to make sure that the model neither overfits nor underfits. We will assess the model mainly by 2 factors:
  1. Accuracy on train and test data (both before and after normalisation)
  2. Whether the model is underfitting/overfitting
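
One possible way to run "a number of iterations" is to repeat the 70/30 split with different seeds and look at the spread of the accuracies. The sketch below is an illustration only: it uses synthetic data from `make_classification` in place of the real feature matrix, and the helper name `repeated_split_accuracy` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the (n_samples, 4) feature matrix and binary labels
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

def repeated_split_accuracy(X, y, n_repeats=10, test_size=0.3):
    """Train an SVM on several random 70/30 splits and collect accuracies."""
    train_acc, val_acc = [], []
    for seed in range(n_repeats):
        X_tr, X_va, y_tr, y_va = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed
        )
        model = SVC(C=1).fit(X_tr, y_tr)
        train_acc.append(model.score(X_tr, y_tr))
        val_acc.append(model.score(X_va, y_va))
    return np.array(train_acc), np.array(val_acc)

train_acc, val_acc = repeated_split_accuracy(X_demo, y_demo)
print("train: mean %.3f, std %.3f" % (train_acc.mean(), train_acc.std()))
print("val  : mean %.3f, std %.3f" % (val_acc.mean(), val_acc.std()))
```

A large gap between the mean training and validation accuracy points to overfitting, while low values for both point to underfitting.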



# 7 Dataset

We have the following data to train and test our model as per the problem statement:
  1. Sample of 1000 audio files
  2. A CSV file that contains the following information about each audio:
    * **file_id** - Name of the audio file, which matches the name of the corresponding file that we have already extracted.
    * **area** - The area where the audio file has been recorded. It can contain the following 6 values:
      * british
      * kensington
      * campus
      * westend
      * Euston
      * southbank
    * **spot** - The particular spot in that area where the audio has been recorded. Each of the 6 areas can have 6 different values for the spot. These are described as below:

          british - street, forecourt, greatcourt, room12, square, room13

          campus - canal, curve, ground, library, reception, square

          Euston - forecourt, gardens, library, piazza, ritblat, upper

          kensington -  albert, cromwell, dinosaur, hintze, marine, pond

          southbank - book, bridge, food, royal, skate, waterloo

          westend - charing, leicester, market, national, piazza, trafalgar
    * **in_out** - This column can contain 2 different values:
      * indoor - If the audio file has been recorded indoors
      * outdoor - If the audio file has been recorded outdoors
    * **Participant** - The task of recording audios was carried out by various participants. This column contains the participant id of the person who has recorded the audio.


We cannot use the audio files directly to train our model, so we need to extract features from them that can be passed as input to the model. Functions have been defined to extract the features, and the code for feature extraction is shown below.

We also need to load our CSV file into a pandas dataframe so that we can perform operations on it.

The output of the transformation step will be 2 arrays:

X (array of attributes) - A 2-dimensional array of shape (1000,4), where each row represents an audio file and each column corresponds to a feature.

Y (array of labels) - A 1-dimensional array of shape (1000,) that contains the actual labels of the corresponding rows in X.
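
As a small illustration of how the `in_out` strings map to the label array, the sketch below converts five made-up example values into booleans and then into 0/1 integers. The array `in_out` here is a hypothetical stand-in for the dataframe column.

```python
import numpy as np

# Hypothetical example values of the in_out column
in_out = np.array(["outdoor", "indoor", "outdoor", "indoor", "outdoor"])

y_bool = in_out == "indoor"   # True for indoor, False for outdoor
y_int = y_bool.astype(int)    # 1 for indoor, 0 for outdoor

print(y_bool)
print(y_int)
print("indoor count:", np.count_nonzero(y_int))
```

Boolean and 0/1 labels carry the same information, so either form can be passed to the classifiers.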





In [6]:
from google.colab import drive

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os, sys, re, pickle, glob
import urllib.request
import zipfile

import IPython.display as ipd
from tqdm import tqdm
import librosa
import scipy.io.wavfile as wavfile
import os.path
import scipy.stats as stats
import scipy

drive.mount('/content/drive')

Mounted at /content/drive


In [19]:
sample_path = '/content/drive/MyDrive/Data/MLEndLS/sample/*.wav'
files = glob.glob(sample_path) # Storing all the files matching the pattern in a variable
print("Total number of audio files is: ", len(files))

Total number of audio files is:  1000


In [20]:
MLENDLS_df = pd.read_csv('./MLEndLS.csv').set_index('file_id') # Loading csv to dataframe
MLENDLS_df

file_id,area,spot,in_out,Participant
0001.wav,british,street,outdoor,S151
0002.wav,kensington,dinosaur,indoor,S127
0003.wav,campus,square,outdoor,S18
0004.wav,kensington,hintze,indoor,S179
0005.wav,campus,square,outdoor,S176
...,...,...,...,...
2496.wav,westend,trafalgar,outdoor,S151
2497.wav,campus,square,outdoor,S6
2498.wav,westend,national,indoor,S96
2499.wav,british,room12,indoor,S73


In [21]:
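def signaltonoise(a, axis=0, ddof=0):
    # Ratio of the mean to the standard deviation along the given axis
    # (0 is returned where the standard deviation is 0)
    a = np.asanyarray(a)
    m = a.mean(axis)
    sd = a.std(axis=axis, ddof=ddof)
    return np.where(sd == 0, 0, m/sd)

def snr(file):
  data = wavfile.read(file)[1]
  # Sum the channels of multi-channel audio; mono audio is already 1-D
  singleChannel = np.sum(data, axis=1) if data.ndim > 1 else data

  # Scale by the peak absolute amplitude before computing the ratio
  norm = singleChannel / (max(np.amax(singleChannel), -1 * np.amin(singleChannel)))
  return signaltonoise(norm)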
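
The `signaltonoise` helper above returns the mean of the samples divided by their standard deviation. The sketch below, which uses a hypothetical stand-alone helper `mean_over_std`, works through that ratio on two small inputs to show what kind of values it produces.

```python
import numpy as np

def mean_over_std(a):
    """Mean divided by population standard deviation (0 if the std is 0)."""
    a = np.asarray(a, dtype=float)
    sd = a.std()
    return 0.0 if sd == 0 else a.mean() / sd

ratio_offset = mean_over_std([1, 2, 3, 4])   # mean 2.5, std about 1.118
centred = np.sin(2 * np.pi * np.arange(1000) / 100)  # 10 full sine cycles
ratio_centred = mean_over_std(centred)       # mean is practically 0

print(ratio_offset)
print(ratio_centred)
```

For a waveform centred on zero the ratio is close to 0, so this feature mainly reflects any offset of the samples relative to their spread. This is worth keeping in mind when interpreting it as a signal to noise ratio.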

In [22]:
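def getEnergy(x,fs,winLen=0.02):
  p = winLen*fs # window length in samples
  # Round the window length up to a power of two
  frame_length = int(2**int(p-1).bit_length())
  hop_length = frame_length//2
  # The signal is passed by keyword, as newer librosa versions require
  rmse = librosa.feature.rms(y=x, frame_length=frame_length, hop_length=hop_length, center=True)
  return rmse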
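
The frame-length arithmetic in `getEnergy` can be checked on its own. The sketch below repeats the same calculation in a hypothetical helper `frame_and_hop` for two common sampling rates; no audio or librosa call is involved.

```python
def frame_and_hop(fs, win_len=0.02):
    """Frame and hop length (in samples) as computed in getEnergy."""
    p = win_len * fs                              # window length in samples
    frame_length = 2 ** int(p - 1).bit_length()   # round up to a power of two
    return frame_length, frame_length // 2

for fs in (22050, 44100):
    print(fs, frame_and_hop(fs))
# 22050 (512, 256)
# 44100 (1024, 512)
```

A 20 ms window is 441 samples at 22050 Hz and 882 samples at 44100 Hz, which round up to 512 and 1024 samples respectively.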

In [23]:
''' This function will be used to extract the features from the audio files'''

def get_features_labels(files,labels_file, scale_audio=False, onlySingleDigit=False):
  X,y =[],[]
  for file in tqdm(files):
    try:
      fileID = file.split('/')[-1]
      yi = labels_file.loc[fileID]['in_out']=='indoor' # True for indoor, False for outdoor

      fs = None # sr=None makes librosa keep the file's native sampling rate
      x, fs = librosa.load(file,sr=fs)
      energy = np.mean(getEnergy(x,fs,winLen=0.02))
      tempo, beat_frames = librosa.beat.beat_track(y=x, sr=fs) # computed but not used as a feature
      flatness = np.mean(librosa.feature.spectral_flatness(y=x))
      chroma = np.mean(librosa.feature.chroma_stft(y=x)) # signal passed by keyword, as newer librosa versions require
      snratio = snr(file)

      xi = [flatness, snratio, energy, chroma]
      X.append(xi)
      y.append(yi)
    except Exception as e:
      print("Broken file: ", file, e) # report the error instead of discarding it

  return np.array(X),np.array(y)

In [24]:
X,y = get_features_labels(files, labels_file=MLENDLS_df, scale_audio=True, onlySingleDigit=True)

100%|██████████| 1000/1000 [07:27<00:00,  2.24it/s]


In [25]:
print('The shape of X is', X.shape) 
print('The shape of y is', y.shape)
print('The labels vector is', y)

The shape of X is (1000, 4)
The shape of y is (1000,)
The labels vector is [False False False  True False False False  True False  True False False
  True False False False  True False  True  True False False  True False
  True False  True False  True False  True False  True False  True False
 False False False False  True  True False False False False  True False
 False False  True False  True False  True False  True  True  True False
 False False  True  True False  True False False False  True  True False
 False False False False  True False  True  True  True False False False
  True False  True False  True False  True  True False False  True False
 False False  True  True  True False False False False False  True False
 False False  True False False  True  True  True False  True  True  True
  True  True  True  True  True  True False False False  True  True False
  True  True  True  True False False False False False  True False False
 False  True  True  True False  True  True False 

In [26]:
print(' The number of indoor recordings is ', np.count_nonzero(y))
print(' The number of outdoor recordings is ', y.size - np.count_nonzero(y))

 The number of indoor recordings is  457
 The number of outdoor recordings is  543


# 8 Results

Now that the arrays of attributes and labels are ready, we carry out the main task of training the models and analysing the training and validation accuracy to select the right model.

We will train 4 different supervised learning models:

  **Support Vector Machine** - This model finds a hyperplane in an N-dimensional space (where N is the number of features, 4 in our case) to classify the data points. The objective is to find the best hyperplane, i.e. the one with the maximum margin, which is the maximum distance between the hyperplane and the nearest data points of each class. 

  **k nearest neighbour** - This is based on the notion that the label of a data point can be predicted from the labels of its nearest neighbours in feature space.

  **Logistic Regression Model** - The binary logistic regression model is used when there are 2 classes. It models the probability of one class as a logistic function of a linear combination of the features.

  **Random Forest Classifier** - This model is efficient at classifying large datasets. Its accuracy is usually better than that of a single decision tree because it uses a collection of decision trees, each trained on a data sample drawn from the training set with replacement.
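
To illustrate the nearest-neighbour idea, the sketch below implements a minimal majority-vote k-NN with NumPy on a made-up 2-D dataset. The function `knn_predict` and the toy points are assumptions for illustration; the notebook itself uses scikit-learn's `KNeighborsClassifier`.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Predict each query label by majority vote among its k nearest points."""
    preds = []
    for q in X_query:
        dists = np.linalg.norm(X_train - q, axis=1)   # Euclidean distances
        nearest = np.argsort(dists)[:k]               # indices of k closest
        preds.append(np.bincount(y_train[nearest]).argmax())
    return np.array(preds)

# Two small, well separated clusters with labels 0 and 1
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_preds = knn_predict(X_toy, y_toy, np.array([[0.05, 0.05], [0.95, 1.0]]))
print(toy_preds)  # [0 1]
```

Because the prediction depends directly on distances, k-NN is sensitive to the scale of each feature, which is one reason to compare results before and after normalisation.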

In [92]:
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=0.3)
print("The shape of training set(X) is : {}\nThe shape of validation set(X) is : {}\nThe shape of training set(Y) is : {}\nThe shape of validation set(Y) is : {}".format(X_train.shape, X_val.shape, y_train.shape, y_val.shape))

The shape of training set(X) is : (700, 4)
The shape of validation set(X) is : (300, 4)
The shape of training set(Y) is : (700,)
The shape of validation set(Y) is : (300,)


In [87]:
print(' The number of indoor recordings in training dataset is ', np.count_nonzero(y_train))
print(' The number of outdoor recordings in training dataset is ', y_train.size - np.count_nonzero(y_train))
print(' The number of indoor recordings in validation dataset is ', np.count_nonzero(y_val))
print(' The number of outdoor recordings in validation dataset is ', y_val.size - np.count_nonzero(y_val))

 The number of indoor recordings in training dataset is  319
 The number of outdoor recordings in training dataset is  381
 The number of indoor recordings in validation dataset is  138
 The number of outdoor recordings in validation dataset is  162


In [84]:
model  = svm.SVC(C=1)
model.fit(X_train,y_train)
yt_p = model.predict(X_train)
yv_p = model.predict(X_val)

train_labels = np.unique(y_train)
val_labels = np.unique(y_val)

print("--------FOR SVM MODEL-------")
print('Training Accuracy', np.mean(yt_p==y_train))
print('Validation  Accuracy', np.mean(yv_p==y_val))
conf_matrix_train = metrics.confusion_matrix(y_train, yt_p, labels=train_labels)
conf_matrix_val = metrics.confusion_matrix(y_val, yv_p, labels=val_labels)

print('Training confusion matrix:\n {}\n'.format(pd.DataFrame(conf_matrix_train, index=train_labels, columns=train_labels)))
print('Validation confusion matrix:\n {}\n'.format(pd.DataFrame(conf_matrix_val, index=val_labels, columns=val_labels)))

model  = KNeighborsClassifier()
model.fit(X_train,y_train)
yt_p = model.predict(X_train)
yv_p = model.predict(X_val)

print("\n\n--------FOR KNN MODEL-------")
print('Training Accuracy', np.mean(yt_p==y_train))
print('Validation  Accuracy', np.mean(yv_p==y_val))
conf_matrix_train = metrics.confusion_matrix(y_train, yt_p, labels=train_labels)
conf_matrix_val = metrics.confusion_matrix(y_val, yv_p, labels=val_labels)

print('Training confusion matrix:\n {}\n'.format(pd.DataFrame(conf_matrix_train, index=train_labels, columns=train_labels)))
print('Validation confusion matrix:\n {}\n'.format(pd.DataFrame(conf_matrix_val, index=val_labels, columns=val_labels)))

model = LogisticRegression()
model.fit(X_train,y_train)
yt_p = model.predict(X_train)
yv_p = model.predict(X_val)

print("\n\n--------FOR LOGISTIC REGRESSION MODEL-------")
print('Training Accuracy', np.mean(yt_p==y_train))
print('Validation  Accuracy', np.mean(yv_p==y_val))
conf_matrix_train = metrics.confusion_matrix(y_train, yt_p, labels=train_labels)
conf_matrix_val = metrics.confusion_matrix(y_val, yv_p, labels=val_labels)

print('Training confusion matrix:\n {}\n'.format(pd.DataFrame(conf_matrix_train, index=train_labels, columns=train_labels)))
print('Validation confusion matrix:\n {}\n'.format(pd.DataFrame(conf_matrix_val, index=val_labels, columns=val_labels)))

model = RandomForestClassifier()
model.fit(X_train,y_train)
yt_p = model.predict(X_train)
yv_p = model.predict(X_val)

print("\n\n--------FOR RANDOM FOREST CLASSIFIER MODEL-------")
print('Training Accuracy', np.mean(yt_p==y_train))
print('Validation  Accuracy', np.mean(yv_p==y_val))
conf_matrix_train = metrics.confusion_matrix(y_train, yt_p, labels=train_labels)
conf_matrix_val = metrics.confusion_matrix(y_val, yv_p, labels=val_labels)

print('Training confusion matrix:\n {}\n'.format(pd.DataFrame(conf_matrix_train, index=train_labels, columns=train_labels)))
print('Validation confusion matrix:\n {}\n'.format(pd.DataFrame(conf_matrix_val, index=val_labels, columns=val_labels)))

--------FOR SVM MODEL-------
Training Accuracy 0.7042857142857143
Validation  Accuracy 0.6833333333333333
Training confusion matrix:
        False  True
False    214   158
True      49   279

Validation confusion matrix:
        False  True
False     98    73
True      22   107



--------FOR KNN MODEL-------
Training Accuracy 0.7742857142857142
Validation  Accuracy 0.6433333333333333
Training confusion matrix:
        False  True
False    286    86
True      72   256

Validation confusion matrix:
        False  True
False    112    59
True      48    81



--------FOR LOGISTIC REGRESSION MODEL-------
Training Accuracy 0.6085714285714285
Validation  Accuracy 0.6366666666666667
Training confusion matrix:
        False  True
False    277    95
True     179   149

Validation confusion matrix:
        False  True
False    131    40
True      69    60



--------FOR RANDOM FOREST CLASSIFIER MODEL-------
Training Accuracy 1.0
Validation  Accuracy 0.67
Training confusion matrix:
        False

The confusion matrix shows how often each actual class was predicted correctly or incorrectly. Rows correspond to the actual labels and columns to the predicted labels, so the diagonal holds the correct predictions (true negatives and true positives) and the off-diagonal entries hold the errors.

For example, in the training confusion matrix of the SVM model, out of the 328 recordings that were actually indoor (True), 279 were predicted correctly, and out of the 372 recordings that were actually outdoor (False), 214 were predicted correctly. An outdoor recording was predicted as indoor 158 times, while the opposite occurred 49 times. These class totals differ from the counts printed in the previous cell because the split is random and the cells were run on different splits.
The validation matrix can be read in the same way.
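
The accuracy and the per-class recall can be read directly off a confusion matrix. The sketch below does this with NumPy for the SVM training confusion matrix printed above (rows are actual labels, columns are predicted labels); the variable names are illustrative.

```python
import numpy as np

# SVM training confusion matrix from the output above:
# rows = actual (False=outdoor, True=indoor), columns = predicted
cm = np.array([[214, 158],
               [49, 279]])

accuracy = np.trace(cm) / cm.sum()                # correct / total
recall_per_class = np.diag(cm) / cm.sum(axis=1)   # correct / actual, per row

print("accuracy        :", accuracy)
print("recall (outdoor):", recall_per_class[0])
print("recall (indoor) :", recall_per_class[1])
```

The accuracy matches the printed training accuracy of about 0.704, and the per-class recall shows that this model recognises indoor recordings more reliably than outdoor ones on the training split.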

Normalising the dataset and calculating the accuracy again
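
Normalisation here means standardising each feature column with the mean and standard deviation of the training set, and applying those same training statistics to the validation set. The sketch below shows the arithmetic on a tiny made-up example; the array names are hypothetical.

```python
import numpy as np

# Made-up training and validation features with very different scales
X_tr = np.array([[1.0, 100.0],
                 [2.0, 200.0],
                 [3.0, 300.0]])
X_va = np.array([[2.0, 400.0]])

mean = X_tr.mean(axis=0)   # statistics come from the training set only
sd = X_tr.std(axis=0)

X_tr_std = (X_tr - mean) / sd
X_va_std = (X_va - mean) / sd   # validation data reuses the training statistics

print(X_tr_std)
print(X_va_std)
```

After this step every training column has mean 0 and standard deviation 1, so no single feature dominates distance-based or margin-based models purely because of its scale.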

In [69]:
mean = X_train.mean(0)
sd =  X_train.std(0)

X_train = (X_train-mean)/sd
X_val  = (X_val-mean)/sd

model  = svm.SVC(C=1)
model.fit(X_train,y_train)
yt_p = model.predict(X_train)
yv_p = model.predict(X_val)

train_labels = np.unique(y_train)
val_labels = np.unique(y_val)


print("--------FOR SVM MODEL-------")
print('Training Accuracy', np.mean(yt_p==y_train))
print('Validation  Accuracy', np.mean(yv_p==y_val))
conf_matrix_train = metrics.confusion_matrix(y_train, yt_p, labels=train_labels)
conf_matrix_val = metrics.confusion_matrix(y_val, yv_p, labels=val_labels)

print('Training confusion matrix:\n {}\n'.format(pd.DataFrame(conf_matrix_train, index=train_labels, columns=train_labels)))
print('Validation confusion matrix:\n {}\n'.format(pd.DataFrame(conf_matrix_val, index=val_labels, columns=val_labels)))

model  = KNeighborsClassifier()
model.fit(X_train,y_train)
yt_p = model.predict(X_train)
yv_p = model.predict(X_val)

print("\n\n--------FOR KNN MODEL-------")
print('Training Accuracy', np.mean(yt_p==y_train))
print('Validation  Accuracy', np.mean(yv_p==y_val))
conf_matrix_train = metrics.confusion_matrix(y_train, yt_p, labels=train_labels)
conf_matrix_val = metrics.confusion_matrix(y_val, yv_p, labels=val_labels)

print('Training confusion matrix:\n {}\n'.format(pd.DataFrame(conf_matrix_train, index=train_labels, columns=train_labels)))
print('Validation confusion matrix:\n {}\n'.format(pd.DataFrame(conf_matrix_val, index=val_labels, columns=val_labels)))


model = LogisticRegression()
model.fit(X_train,y_train)
yt_p = model.predict(X_train)
yv_p = model.predict(X_val)

print("\n\n--------FOR LOGISTIC REGRESSION MODEL-------")
print('Training Accuracy', np.mean(yt_p==y_train))
print('Validation  Accuracy', np.mean(yv_p==y_val))
conf_matrix_train = metrics.confusion_matrix(y_train, yt_p, labels=train_labels)
conf_matrix_val = metrics.confusion_matrix(y_val, yv_p, labels=val_labels)

print('Training confusion matrix:\n {}\n'.format(pd.DataFrame(conf_matrix_train, index=train_labels, columns=train_labels)))
print('Validation confusion matrix:\n {}\n'.format(pd.DataFrame(conf_matrix_val, index=val_labels, columns=val_labels)))

model = RandomForestClassifier()
model.fit(X_train,y_train)
yt_p = model.predict(X_train)
yv_p = model.predict(X_val)

print("\n\n--------FOR RANDOM FOREST CLASSIFIER MODEL-------")
print('Training Accuracy', np.mean(yt_p==y_train))
print('Validation  Accuracy', np.mean(yv_p==y_val))
conf_matrix_train = metrics.confusion_matrix(y_train, yt_p, labels=train_labels)
conf_matrix_val = metrics.confusion_matrix(y_val, yv_p, labels=val_labels)

print('Training confusion matrix:\n {}\n'.format(pd.DataFrame(conf_matrix_train, index=train_labels, columns=train_labels)))
print('Validation confusion matrix:\n {}\n'.format(pd.DataFrame(conf_matrix_val, index=val_labels, columns=val_labels)))

--------FOR SVM MODEL-------
Training Accuracy 0.7171428571428572
Validation  Accuracy 0.74
Training confusion matrix:
        False  True
False    250   132
True      66   252

Validation confusion matrix:
        False  True
False    111    50
True      28   111



--------FOR KNN MODEL-------
Training Accuracy 0.7728571428571429
Validation  Accuracy 0.68
Training confusion matrix:
        False  True
False    304    78
True      81   237

Validation confusion matrix:
        False  True
False    117    44
True      52    87



--------FOR LOGISTIC REGRESSION MODEL-------
Training Accuracy 0.6442857142857142
Validation  Accuracy 0.6833333333333333
Training confusion matrix:
        False  True
False    267   115
True     134   184

Validation confusion matrix:
        False  True
False    120    41
True      54    85



--------FOR RANDOM FOREST CLASSIFIER MODEL-------
Training Accuracy 1.0
Validation  Accuracy 0.69
Training confusion matrix:
        False  True
False    382     0
Tr

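The results show that SVM achieves the highest validation accuracy both before normalisation (0.68) and after it (0.74), with a training accuracy (0.70 and 0.72) close to the validation accuracy. The Random Forest classifier always reaches a training accuracy of 100% while its validation accuracy stays at 0.67 to 0.69, so this model is overfitting. KNN also shows a clear gap between training accuracy (0.77) and validation accuracy (0.64 and 0.68), which suggests that it is overfitting as well. Note that the two runs used different random splits (the class totals in the confusion matrices differ), so the comparison before and after normalisation is only indicative. Based on these results, we will choose SVM.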

# 9 Conclusions

We select the SVM model. The Support Vector Machine is a good choice when we have 2 classes, and here the label is binary: we need to classify each audio sample as 0 or 1 (outdoor or indoor). An SVM classifies data by finding the best hyperplane that separates the data points of one class from those of the other class. The test results also show that the SVM gives the best validation accuracy and a small difference between training and validation accuracy compared with KNN and Random Forest, i.e. it shows little overfitting.

Improvements:

The difference in accuracy between the SVM and Logistic Regression models was small, and it may change if we increase the sample size. We can therefore train the models on larger samples and then compare the respective accuracy of both models.
Although we have tried to take the most relevant features of the audio files, we can also try other features that may improve the accuracy of the model.
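
One way to compare SVM and Logistic Regression more reliably than with a single split is k-fold cross-validation with the scaling step placed inside a pipeline. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification` in place of the real features, and the dictionary `candidates` is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the (n_samples, 4) feature matrix and binary labels
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Scaling inside the pipeline is refitted on the training part of every fold
candidates = {
    "SVM": make_pipeline(StandardScaler(), SVC(C=1)),
    "LogReg": make_pipeline(StandardScaler(), LogisticRegression()),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv)
    for name, model in candidates.items()
}

for name, fold_scores in scores.items():
    print("%-6s mean %.3f, std %.3f" % (name, fold_scores.mean(), fold_scores.std()))
```

If the difference between the mean scores is smaller than their spread across folds, the two models cannot be separated with confidence on that amount of data.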