## Train Data Preprocessing

In this notebook, I preprocess the train data. The procedure is the same for both the train and test subsets:
1. Select the correct number of files (for train, 90% of the files in each class)
2. Generate the class labels (1 = speech, 0 = no speech) and store them in a NumPy ndarray (train labels)
3. Generate a Mel spectrogram for each input audio file
4. Store the Mel spectrograms in a NumPy ndarray (train samples)
5. Pickle the train samples and train labels


In [1]:
import numpy as np
from glob import glob
import librosa 
import matplotlib.pyplot as plt
import pandas as pd
import pylab
import librosa.display

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Data preprocessing

In [7]:
# get paths to files 
speech_directory = '/content/drive/MyDrive/Mignot Lab Research/Experiments/one_sample/one-second-splits/speech'
no_speech_directory ='/content/drive/MyDrive/Mignot Lab Research/Experiments/one_sample/one-second-splits/no_speech'

In [9]:
# create a list of all of the files in the folder using glob 
speech_subset = glob(speech_directory + '/*.wav')
no_speech_subset = glob(no_speech_directory + '/*.wav')
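`glob` returns paths in an arbitrary, filesystem-dependent order, so a slice taken from its result may not select the same files on every run or machine. A minimal sketch of one way to make the listing reproducible by sorting it; the helper name `list_wav_files` and the throwaway demo directory are assumptions for illustration, not part of the original notebook:

```python
import os
import tempfile
from glob import glob


def list_wav_files(directory):
    """Return the .wav paths in `directory` in sorted (reproducible) order."""
    return sorted(glob(os.path.join(directory, "*.wav")))


# Demonstration on a temporary directory holding empty placeholder files.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["b.wav", "a.wav", "notes.txt"]:
        open(os.path.join(tmp_dir, name), "w").close()
    demo_files = [os.path.basename(p) for p in list_wav_files(tmp_dir)]

print(demo_files)
```

Only the `.wav` files are returned, and their order no longer depends on the filesystem.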

In [10]:
print(len(speech_subset))
print(len(no_speech_subset))

2487
30219


In [11]:
train_num_speech = round(.90*len(speech_subset))
train_num_no_speech = round(.90*len(no_speech_subset))
print(train_num_speech)
print(train_num_no_speech)

2238
27197


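To balance the classes, both are capped at 2238 files (the size of the 90% speech split), so the no-speech class is undersampled to match the speech class.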

In [12]:
speech_train = speech_subset[0:train_num_speech]
no_speech_train = no_speech_subset[0:train_num_speech]
print(len(speech_train))
print(len(no_speech_train))

2238
2238
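The cell above balances the classes by taking the first 2238 paths from each list. As an alternative (not the method used in this notebook), the larger class can be undersampled at random with a fixed seed, which avoids depending on file order. A minimal sketch; the helper `balance_by_undersampling`, the seed value, and the demo file names are all hypothetical:

```python
import random


def balance_by_undersampling(minority, majority, seed=0):
    """Randomly pick len(minority) items from `majority` without replacement.

    Assumes `majority` has at least as many items as `minority`;
    otherwise `random.Random.sample` raises ValueError.
    """
    rng = random.Random(seed)  # fixed seed keeps the selection repeatable
    sampled_majority = rng.sample(majority, k=len(minority))
    return list(minority), sampled_majority


# Made-up file names standing in for the real path lists.
demo_speech = [f"speech_{i}.wav" for i in range(3)]
demo_no_speech = [f"no_speech_{i}.wav" for i in range(10)]

balanced_speech, balanced_no_speech = balance_by_undersampling(
    demo_speech, demo_no_speech
)
print(len(balanced_speech), len(balanced_no_speech))
```

Whichever selection is used for train, the test subset has to be drawn from the files that were not selected.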


In [13]:
temp_speech_labels = [1] * len(speech_train)  # 1 = speech

In [14]:
temp_no_speech_labels = [0] * len(no_speech_train)  # 0 = no speech

In [18]:
train_data_points_raw = speech_train + no_speech_train

In [19]:
train_labels = temp_speech_labels + temp_no_speech_labels
train_labels = np.array(train_labels)

In [20]:
len(train_data_points_raw)

4476

In [21]:
len(train_labels)

4476
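The labels built above are integer class ids (1 = speech, 0 = no speech). If a downstream model expects one-hot vectors instead, the conversion can be sketched as follows; the helper `to_one_hot` and the demo labels are assumptions for illustration:

```python
import numpy as np


def to_one_hot(labels, num_classes=2):
    """Map integer class ids to one-hot rows, e.g. 1 -> [0, 1] for two classes."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=np.float32)[labels]


demo_labels = np.array([1, 1, 0])  # two speech clips, one no-speech clip
demo_one_hot = to_one_hot(demo_labels)
print(demo_one_hot.shape)
```

For a binary task with a single sigmoid output, the integer labels can usually be used as they are.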

### Pickle Training Labels

In [22]:
import pickle

# The context manager closes the file even if pickling raises an error.
with open("/content/drive/MyDrive/Mignot Lab Research/Experiments/one_sample/one-second-splits/train_labels.pkl", "wb") as out_file:
    pickle.dump(train_labels, out_file)

### Generate Mel Spectrograms

In [23]:
train = []

In [24]:
for elem in train_data_points_raw:
    # librosa.load converts to mono and resamples to 22050 Hz by default
    y, sr = librosa.load(elem)
    # Mel power spectrogram with shape (n_mels, n_frames)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128,
                                    fmax=8000)
    train.append(S)

In [25]:
train = np.array(train)
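Converting the list to a single ndarray only gives a regular `(n_samples, n_mels, n_frames)` array when every spectrogram has the same number of frames. A minimal sketch of padding or trimming along the time axis before stacking, in case some clips are slightly shorter or longer; the helper `pad_or_trim`, the target of 44 frames, and the demo shapes are assumptions for illustration, not values taken from this dataset:

```python
import numpy as np


def pad_or_trim(spec, n_frames):
    """Zero-pad or trim a (n_mels, frames) array to exactly `n_frames` columns."""
    spec = spec[:, :n_frames]          # trim if too long
    pad = n_frames - spec.shape[1]     # zero when already long enough
    return np.pad(spec, ((0, 0), (0, pad)))  # pad the time axis with zeros


# Made-up spectrograms with slightly different frame counts.
demo_specs = [np.ones((128, 44)), np.ones((128, 40)), np.ones((128, 47))]
target_frames = 44

stacked = np.stack([pad_or_trim(s, target_frames) for s in demo_specs])
print(stacked.shape)  # (3, 128, 44)
```

If all clips are exactly one second long at the same sample rate, this step is a no-op.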

In [26]:
# The context manager closes the file even if pickling raises an error.
with open("/content/drive/MyDrive/Mignot Lab Research/Experiments/one_sample/one-second-splits/train_samples.pkl", "wb") as out_file:
    pickle.dump(train, out_file)
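Before training, it can be worth loading the pickled files back and checking that samples and labels still line up. A minimal sketch of that round trip, using a temporary directory and small placeholder arrays rather than the real Drive paths; the file names and array shapes here are assumptions for illustration:

```python
import os
import pickle
import tempfile

import numpy as np

# Placeholder arrays standing in for the real train samples and labels.
demo_samples = np.zeros((4, 128, 44), dtype=np.float32)
demo_labels = np.array([1, 1, 0, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    samples_path = os.path.join(tmp_dir, "train_samples.pkl")
    labels_path = os.path.join(tmp_dir, "train_labels.pkl")

    # Write both arrays the same way the notebook does.
    with open(samples_path, "wb") as f:
        pickle.dump(demo_samples, f)
    with open(labels_path, "wb") as f:
        pickle.dump(demo_labels, f)

    # Read them back.
    with open(samples_path, "rb") as f:
        loaded_samples = pickle.load(f)
    with open(labels_path, "rb") as f:
        loaded_labels = pickle.load(f)

# One label per sample is the invariant the training code relies on.
counts_match = loaded_samples.shape[0] == loaded_labels.shape[0]
print(loaded_samples.shape, loaded_labels.shape, counts_match)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.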