## Train Data Preprocessing

In this notebook, I preprocess the training data. The procedure is the same for the train and test subsets:
1. Select the correct number of files (for train, 90% of the entire dataset for each class)
2. Generate class labels (1 = speech, 0 = no speech) and store them in a NumPy ndarray (train labels)
3. Generate a Mel spectrogram for each input audio file
4. Store the Mel spectrograms in a NumPy ndarray (train samples)
5. Pickle the train samples and train labels


In [1]:
import numpy as np
from glob import glob
import librosa 
import matplotlib.pyplot as plt
import pandas as pd
import pylab
import librosa.display

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Data preprocessing

In [3]:
# get paths to files 
speech_directory = 'path/speech'
no_speech_directory ='path/no_speech'

In [4]:
# create a list of all of the files in the folder using glob 
speech_subset = glob(speech_directory + '/*.wav')
no_speech_subset = glob(no_speech_directory + '/*.wav')

In [5]:
print(len(speech_subset))
print(len(no_speech_subset))

333
2933


In [6]:
train_num_speech = round(.90*len(speech_subset))
train_num_no_speech = round(.90*len(no_speech_subset))
print(train_num_speech)
print(train_num_no_speech)

300
2640


In [7]:
# take the first 90% of each class, using the counts computed above
speech_train = speech_subset[:train_num_speech]
no_speech_train = no_speech_subset[:train_num_no_speech]
print(len(speech_train))
print(len(no_speech_train))

300
2640
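
Because `glob` returns files in an arbitrary, filesystem-dependent order, slicing off the first 90% may not produce the same split on every run or machine. A minimal sketch of a reproducible split follows. The helper `split_train_test`, the seed, and the placeholder file names are assumptions for illustration and are not part of the original notebook.

```python
import random

def split_train_test(paths, train_fraction=0.9, seed=0):
    """Hypothetical helper: deterministically split file paths into train/test lists."""
    ordered = sorted(paths)                 # fixed starting order, independent of glob
    random.Random(seed).shuffle(ordered)    # seeded shuffle, so the split is repeatable
    n_train = round(train_fraction * len(ordered))
    return ordered[:n_train], ordered[n_train:]

# Placeholder file names standing in for the real glob results
example_paths = [f"clip_{i:04d}.wav" for i in range(333)]
example_train, example_test = split_train_test(example_paths)
print(len(example_train), len(example_test))
```

Running the same helper in the test notebook with the same seed would return the complementary 10%, so the two subsets cannot overlap.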


In [8]:
temp_speech_labels = []
for i in range(len(speech_train)):
    temp_speech_labels.append(1) # 1 = speech

In [9]:
temp_no_speech_labels = []
for i in range(len(no_speech_train)):
    temp_no_speech_labels.append(0) # 0 = no speech

In [10]:
train_data_points_raw = speech_train + no_speech_train

In [11]:
train_labels = temp_speech_labels + temp_no_speech_labels
train_labels = np.array(train_labels)

In [12]:
len(train_labels)

2940
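
The labels built above are plain integers (1 = speech, 0 = no speech). The same array can be produced without loops, and converted to a one-hot form if a later model expects it. This is a sketch only: the variable names are illustrative, and whether one-hot labels are actually needed downstream is an assumption.

```python
import numpy as np

n_speech, n_no_speech = 300, 2640  # training counts printed above

# Integer class labels, in the same order as the concatenated file list
labels = np.concatenate([np.ones(n_speech, dtype=int),
                         np.zeros(n_no_speech, dtype=int)])

# Optional one-hot form: row i has a 1 in column labels[i]
one_hot = np.eye(2, dtype=int)[labels]

print(labels.shape, one_hot.shape)
```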

### Pickle Training Labels

In [None]:
import pickle
out_file = open("path/train_labels.pkl", "wb")
pickle.dump(train_labels, out_file)
out_file.close()
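
An equivalent way to write the pickle is with a `with` block, which closes the file even if `pickle.dump` raises. The sketch below also reads the file back to confirm the round trip. It uses a temporary directory and a placeholder list, both of which are assumptions for illustration rather than the notebook's real paths or data.

```python
import os
import pickle
import tempfile

example_labels = [1, 1, 0, 0, 0]  # placeholder standing in for train_labels

# A temporary directory is used so the sketch does not touch real data
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "example_labels.pkl")  # hypothetical file name

    with open(pkl_path, "wb") as out_file:   # closed automatically on exit
        pickle.dump(example_labels, out_file)

    with open(pkl_path, "rb") as in_file:
        restored = pickle.load(in_file)

print(restored == example_labels)
```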

### Generate Mel Spectrograms

In [None]:
train = []

In [None]:
for elem in train_data_points_raw:
    # load the audio (librosa resamples to 22050 Hz by default)
    y, sr = librosa.load(elem)
    # 128-band Mel spectrogram, limited to frequencies up to 8 kHz
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128,
                                    fmax=8000)
    train.append(S)
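
Converting a list of spectrograms to a single NumPy array only gives a regular 3-D array when every spectrogram has the same number of frames, which in turn requires every clip to have the same duration. If clip lengths vary (an assumption, since the notebook does not say), one option is to pad or truncate each spectrogram to a fixed width first. The helper `pad_or_truncate`, the target width, and the synthetic arrays below are hypothetical.

```python
import numpy as np

def pad_or_truncate(spec, n_frames):
    """Hypothetical helper: force a (n_mels, frames) array to exactly n_frames columns."""
    if spec.shape[1] >= n_frames:
        return spec[:, :n_frames]
    pad_width = n_frames - spec.shape[1]
    # zero-pad on the right of the time axis only
    return np.pad(spec, ((0, 0), (0, pad_width)), mode="constant")

# Synthetic stand-ins for spectrograms of clips with different durations
specs = [np.ones((128, 40)), np.ones((128, 55)), np.ones((128, 44))]
target_frames = 44  # assumed target length, not taken from the notebook

batch = np.stack([pad_or_truncate(s, target_frames) for s in specs])
print(batch.shape)
```

`np.stack` raises an error if any shape differs, which makes a mismatch visible immediately.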

In [None]:
train = np.array(train)

In [None]:
out_file = open("path/train_samples.pkl", "wb")
pickle.dump(train, out_file)
out_file.close()