## Test Data Preprocessing

In this notebook, I preprocess the test data. The procedure is the same for both the train and test subsets:
1. Select the correct number of files (for the test subset, 10% of the entire dataset for each class)
2. Generate binary class labels (1 = speech, 0 = no speech) and store them in a NumPy ndarray (test labels)
3. Generate a Mel spectrogram for each input audio file
4. Store the Mel spectrograms in a NumPy ndarray (test samples)
5. Pickle the test samples and test labels


In [1]:
import numpy as np
from glob import glob
import librosa 
import matplotlib.pyplot as plt
import pandas as pd
import pylab
import librosa.display

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Data preprocessing

In [3]:
# directories containing the .wav files for each class
speech_directory = 'path/speech'
no_speech_directory = 'path/no_speech'

In [4]:
# create a list of all of the files in the folder using glob 
speech_subset = glob(speech_directory + '/*.wav')
no_speech_subset = glob(no_speech_directory + '/*.wav')

In [5]:
print(len(speech_subset))
print(len(no_speech_subset))

333
2933


In [9]:
# testing subset 
test_num_speech = len(speech_subset) - round(.90*len(speech_subset))
test_num_no_speech = len(no_speech_subset) - round(.90*len(no_speech_subset))
print(test_num_speech)
print(test_num_no_speech)

33
293


In [10]:
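# take the last 10% of each class as the test subset (start indices 300 and 2640)
speech_test = speech_subset[len(speech_subset) - test_num_speech:]
no_speech_test = no_speech_subset[len(no_speech_subset) - test_num_no_speech:]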
print(len(speech_test))
print(len(no_speech_test))

33
293


Both counts match the expected test subset sizes (33 speech files, 293 no-speech files).
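The slice boundaries follow directly from the counts printed above: the start index is the number of files minus the number of test files. A minimal sketch of that arithmetic is below. The helper name `tail_split` and the synthetic file names are hypothetical, and sorting the paths first is an added assumption (the notebook uses the order returned by `glob`, which is not guaranteed to be the same across machines).

```python
def tail_split(paths, test_fraction=0.10):
    """Return the last `test_fraction` of the sorted paths as the test subset."""
    ordered = sorted(paths)  # assumption: fix the order so the split is reproducible
    n_test = len(ordered) - round((1 - test_fraction) * len(ordered))
    return ordered[len(ordered) - n_test:]


# Synthetic stand-ins for the globbed file lists, with the same counts as above.
speech_demo = [f"speech_{i:04d}.wav" for i in range(333)]
no_speech_demo = [f"no_speech_{i:04d}.wav" for i in range(2933)]

speech_test_demo = tail_split(speech_demo)
no_speech_test_demo = tail_split(no_speech_demo)
print(len(speech_test_demo), len(no_speech_test_demo))
```

Deriving the indices from the list lengths avoids having to update hard-coded numbers if files are added to or removed from either folder.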

In [12]:
# one label per file: 1 = speech, 0 = no speech
temp_speech_labels_test = [1] * len(speech_test)  # 1 = speech

temp_no_speech_labels_test = [0] * len(no_speech_test)  # 0 = no speech

print(len(temp_speech_labels_test))
print(len(temp_no_speech_labels_test))

33
293


In [13]:
test_labels = temp_speech_labels_test + temp_no_speech_labels_test
test_labels = np.array(test_labels)

In [14]:
test_data_points_raw = speech_test + no_speech_test

In [15]:
print(len(test_labels))
print(len(test_data_points_raw))

326
326
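The labels built above are integer class ids (one value per file), with the speech entries first so that they line up with the order of the file list. A minimal sketch of the same construction done directly in NumPy follows. The one-hot conversion at the end is optional and is an assumption about what a downstream model might expect; the notebook itself pickles the integer labels. The `_demo` names are hypothetical.

```python
import numpy as np

n_speech, n_no_speech = 33, 293  # test subset sizes printed above

# 1 = speech, 0 = no speech; speech labels come first, matching the file order
labels_demo = np.concatenate([
    np.ones(n_speech, dtype=np.int64),
    np.zeros(n_no_speech, dtype=np.int64),
])

# Optional: one-hot rows, where column 0 = no speech and column 1 = speech
one_hot_demo = np.eye(2, dtype=np.int64)[labels_demo]

print(labels_demo.shape, one_hot_demo.shape)
```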


### Pickle Labels

In [16]:
import pickle

# the context manager closes the file even if pickling fails
with open("path/test_labels.pkl", "wb") as out_file:
    pickle.dump(test_labels, out_file)
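It can be worth confirming that a pickled array loads back unchanged before relying on it in a later notebook. The sketch below does a round trip in a temporary directory so that it does not touch the real output path; the array `labels_small` and the variable names are hypothetical stand-ins.

```python
import os
import pickle
import tempfile

import numpy as np

labels_small = np.array([1, 1, 0, 0, 0])  # stand-in for test_labels

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "test_labels.pkl")

    # write the array, then read it back from the same file
    with open(demo_path, "wb") as out_file:
        pickle.dump(labels_small, out_file)
    with open(demo_path, "rb") as in_file:
        restored = pickle.load(in_file)

round_trip_ok = np.array_equal(labels_small, restored)
print(round_trip_ok)
```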

### Generate Mel Spectrograms and Pickle

In [17]:
test = []

In [19]:
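for elem in test_data_points_raw:
    # load the audio as a mono float array (librosa resamples to 22050 Hz by default)
    y, sr = librosa.load(elem)
    # power Mel spectrogram with shape (n_mels, n_frames)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128,
                                       fmax=8000)
    test.append(S)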

In [20]:
test = np.array(test)
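`np.array(test)` produces a regular 3-D array of shape `(n_files, n_mels, n_frames)` only if every spectrogram has the same number of frames, which is the case when all clips have the same duration. The notebook does not state the clip lengths, so the sketch below is an assumption-based fallback: it zero-pads shorter spectrograms along the time axis before stacking. The helper name `pad_to_common_width` and the random stand-in data are hypothetical.

```python
import numpy as np


def pad_to_common_width(spectrograms, pad_value=0.0):
    """Right-pad 2-D arrays along the time axis and stack them into one 3-D array."""
    max_frames = max(s.shape[1] for s in spectrograms)
    padded = [
        np.pad(s, ((0, 0), (0, max_frames - s.shape[1])), constant_values=pad_value)
        for s in spectrograms
    ]
    return np.stack(padded)


# Random stand-ins for Mel spectrograms: 128 Mel bands, differing frame counts.
rng = np.random.default_rng(0)
specs_demo = [rng.random((128, n_frames)) for n_frames in (40, 44, 37)]

batch_demo = pad_to_common_width(specs_demo)
print(batch_demo.shape)  # (3, 128, 44)
```

If all clips do share one length, the padding is a no-op and the result equals what `np.array(test)` would return.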

In [22]:
# the context manager closes the file even if pickling fails
with open("path/test_samples.pkl", "wb") as out_file:
    pickle.dump(test, out_file)