## 2.0 Further Pre-Processing
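This notebook applies the remaining pre-processing steps (min-max scaling of the spectrograms, one-hot encoding of the genre labels, and shuffling) so that the data is ready for input into machine learning models.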

### Table of Contents
[2.1. Setup](#1.)<br>
[2.1.1 Loading libraries](#1.1)<br>
[2.1.2 Setting data directories](#1.2)<br>
[2.1.3 Defining functions](#1.3)<br>

[2.2. Further Pre-processing](#2.)<br>
[2.2.1 Reading in train, validation, and test data sets](#2.1)<br>
[2.2.2 Min-max scaling the spectrograms](#2.2)<br>
[2.2.3 Setting genre classes](#2.3)<br>

[2.3. Saving Pre-Processed Data](#3.)<br>
[2.3.1 Shuffling the data and saving as .npy files](#3.1)<br>

### 2.1. Setup <a class="anchor" id="1."></a>

#### 2.1.1 Loading libraries <a class="anchor" id="1.1"></a>

In [2]:
!pip install "numpy"
!pip install "pandas"
!pip install "librosa"
!pip install "matplotlib"
# timeit is part of the Python standard library, so it is not installed with pip


Collecting numpy
  Downloading numpy-1.19.2-cp38-cp38-win_amd64.whl (13.0 MB)
Installing collected packages: numpy
Successfully installed numpy-1.19.2
Collecting pandas
  Downloading pandas-1.1.2-cp38-cp38-win_amd64.whl (9.6 MB)
Collecting pytz>=2017.2
  Downloading pytz-2020.1-py2.py3-none-any.whl (510 kB)
Installing collected packages: pytz, pandas
Successfully installed pandas-1.1.2 pytz-2020.1
Collecting librosa
  Downloading librosa-0.8.0.tar.gz (183 kB)
Collecting audioread>=2.0.0
  Downloading audioread-2.1.8.tar.gz (21 kB)
Collecting scipy>=1.0.0
  Downloading scipy-1.5.2-cp38-cp38-win_amd64.whl (31.4 MB)
Collecting scikit-learn!=0.19.0,>=0.14.0
  Downloading scikit_learn-0.23.2-cp38-cp38-win_amd64.whl (6.8 MB)
Collecting joblib>=0.14
  Downloading joblib-0.16.0-py3-none-any.whl (300 kB)
Collecting resampy>=0.2.2
  Downloading resampy-0.2.2.tar.gz (323 kB)
Collecting numba>=0.43.0
  Downloading numba-0.51.2-cp38-cp38-win_amd64.whl (2.2 MB)
Collecting soundfile>=0.9.0
  Download


In [2]:
import os
import numpy as np
import pandas as pd
import random
import librosa
import librosa.display
import matplotlib.pyplot as plt
%matplotlib inline

import timeit
import datetime

from sklearn import preprocessing

#### 2.1.2 Setting data directories <a class="anchor" id="1.2"></a>

In [4]:
ds_description = '5x10s'
# Set the directory for the spectrograms
data_dir = f'./data/spect_subsample_{ds_description}_np'
print("Directory of spectrograms: {}".format(data_dir))

Directory of spectrograms: ./data/spect_subsample_5x10s_np


#### 2.1.3 Defining functions <a class="anchor" id="1.3"></a>

In [5]:
def load_data(data_dir, ds_description, str_X, str_Y):
    '''
    Loads the .npy data files generated by the earlier pre-processing notebook.
    Note: files must be named {split}_{name}_{ds_description}_np.npy,
        e.g. train_spect_5x10s_np.npy
    
    Inputs
    ------
    data_dir: directory of the .npy files
    ds_description: e.g. '5x10s' (5 subsamples of 10s length)
    str_X: str name of the 'X' data, either: 'spect' or 'X'
    str_Y: str name of the 'Y' data, either: 'labels' or 'Y'
    
    Returns
    -------
    6 numpy arrays of:
        train_{str_X}, train_{str_Y}, val_{str_X}, val_{str_Y}, test_{str_X}, test_{str_Y}
    '''
    assert str_X in ['spect', 'X'], "str_X must be either 'spect' or 'X'."
    assert str_Y in ['labels', 'Y'], "str_Y must be either 'labels' or 'Y'."
    
    print("Loading .npy data files...")
    # Start timer
    start_time = timeit.default_timer()

    train_str_X = np.load(f'{data_dir}/train_{str_X}_{ds_description}_np.npy')
    val_str_X = np.load(f'{data_dir}/val_{str_X}_{ds_description}_np.npy')
    test_str_X = np.load(f'{data_dir}/test_{str_X}_{ds_description}_np.npy')

    train_str_Y = np.load(f'{data_dir}/train_{str_Y}_{ds_description}_np.npy')
    val_str_Y = np.load(f'{data_dir}/val_{str_Y}_{ds_description}_np.npy')
    test_str_Y = np.load(f'{data_dir}/test_{str_Y}_{ds_description}_np.npy')
    
    # Round to whole seconds so the timedelta always prints as h:mm:ss
    elapsed = datetime.timedelta(seconds=round(timeit.default_timer() - start_time))
    print()
    print("Total processing time (h:mm:ss): {}".format(elapsed))
    print("\nLoaded .npy data files, verifying shape of saved data...")
    print(f"Shape of 'train_{str_X}':", train_str_X.shape)
    print(f"Shape of 'train_{str_Y}':", train_str_Y.shape)

    print(f"Shape of 'val_{str_X}':", val_str_X.shape)
    print(f"Shape of 'val_{str_Y}':", val_str_Y.shape)

    print(f"Shape of 'test_{str_X}':", test_str_X.shape)
    print(f"Shape of 'test_{str_Y}':", test_str_Y.shape)
    
    return train_str_X, train_str_Y, val_str_X, val_str_Y, test_str_X, test_str_Y

In [6]:
def min_max_scaler_3d(array_3d):
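    '''
    Takes in a 3D numpy array, reshapes it to 2D to apply scikit-learn's 
    preprocessing.MinMaxScaler(), and then reshapes the result back to 3D.
    Note: MinMaxScaler scales each column independently, so every index along
    the last axis gets its own min and max, fitted on the array passed in.
    
    Returns
    -------
    3D numpy array with values in [0,1] (scaled with MinMaxScaler)    
    '''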
    (s0, s1, s2) = array_3d.shape
    array_2d = np.reshape(array_3d, (s0 * s1, s2))
    min_max_scaler = preprocessing.MinMaxScaler()
    array_2d = min_max_scaler.fit_transform(array_2d)
    array_3d = np.reshape(array_2d, (s0, s1, s2))
    
    return array_3d
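
Because `min_max_scaler_3d` fits a new scaler on whatever array it receives, each call uses the per-column min and max of its own input. A common alternative is to compute those statistics on the training set only and reuse them for the validation and test sets. The sketch below illustrates that idea with plain NumPy as a stand-in for `MinMaxScaler`; it is not the method used in this notebook, and the helper names `fit_min_max_3d` and `apply_min_max_3d` as well as the toy arrays are hypothetical.

```python
import numpy as np

def fit_min_max_3d(train_3d):
    """Hypothetical helper: per-column (last-axis) min and range of the training set."""
    flat = train_3d.reshape(-1, train_3d.shape[-1])
    col_min = flat.min(axis=0)
    col_range = flat.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid dividing by zero for constant columns
    return col_min, col_range

def apply_min_max_3d(array_3d, col_min, col_range):
    """Hypothetical helper: scale with statistics fitted elsewhere (broadcast on last axis)."""
    return (array_3d - col_min) / col_range

# Toy data standing in for the spectrogram arrays (values are made up)
rng = np.random.default_rng(0)
toy_train = rng.uniform(-80.0, 0.0, size=(6, 4, 3))
toy_val = rng.uniform(-80.0, 0.0, size=(2, 4, 3))

col_min, col_range = fit_min_max_3d(toy_train)
toy_train_scaled = apply_min_max_3d(toy_train, col_min, col_range)
toy_val_scaled = apply_min_max_3d(toy_val, col_min, col_range)
```

With this approach the training set spans [0, 1] in every column, while validation or test values may fall slightly outside that range if they exceed the training extremes.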

In [7]:
def map_classes(data_labels, class_dict):
    '''
    Takes in a 1D numpy array of str labels and converts it to a 2D one-hot
    array of classes based on the class_dict
    
    Inputs
    ------
    data_labels: 1D numpy array of str labels, each a value in class_dict
    class_dict: dictionary of int keys starting from 0, and str label values
        e.g. genre_dict = {0 : 'Hip-Hop', 1 : 'Pop', 2 : 'Folk'}
    
    Returns
    -------
    data_classified: 2D numpy int array of shape (n_obs, n_cls) with values in {0,1};
        row i has a 1 in the column of the class of data_labels[i].
    '''
    # Reverse the dict to have str as the keys
    class_dict_reverse = {v:k for k,v in class_dict.items()}
    n_obs = len(data_labels)
    n_cls = len(class_dict)
    data_classified = np.zeros((n_obs, n_cls), dtype=int)
    
    for i in range(n_obs):
        data_classified[i][class_dict_reverse[data_labels[i]]] = 1
    
    return data_classified
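
The loop in `map_classes` can also be written without explicit per-row assignment. The following is a minimal sketch of the same one-hot mapping, assuming every label appears as a value in the dictionary; `toy_dict` and `toy_labels` are made-up inputs for illustration.

```python
import numpy as np

# Made-up inputs for illustration
toy_dict = {0: 'Hip-Hop', 1: 'Pop', 2: 'Folk'}
toy_labels = np.array(['Pop', 'Folk', 'Hip-Hop', 'Pop'])

# Reverse the dict so that the str labels become the keys
label_to_index = {name: idx for idx, name in toy_dict.items()}
indices = np.array([label_to_index[name] for name in toy_labels])

# Row k of the identity matrix is the one-hot vector for class k
one_hot = np.eye(len(toy_dict), dtype=int)[indices]
print(one_hot)
```

Taking `one_hot.argmax(axis=1)` recovers the integer class index of each row, which is handy for checking the mapping.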

In [8]:
def unison_shuffled_copies(a, b):
    '''
    Shuffles two arrays in unison along the first axis using one shared permutation.
    
    Returns
    -------
    copies of the a and b numpy arrays, shuffled in unison
    '''
    assert len(a) == len(b)
    p = np.random.permutation(len(a))
    return a[p], b[p]
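
`unison_shuffled_copies` draws from NumPy's global random state, so the order changes on every run. If a repeatable shuffle is wanted, one option is to pass a seed to a local generator. This is a sketch under that assumption; the function name `unison_shuffled_copies_seeded` and its `seed` parameter are hypothetical and not part of the notebook.

```python
import numpy as np

def unison_shuffled_copies_seeded(a, b, seed=0):
    """Hypothetical variant: shuffle a and b in unison with a reproducible permutation."""
    assert len(a) == len(b)
    rng = np.random.default_rng(seed)
    p = rng.permutation(len(a))
    return a[p], b[p]

# Toy arrays in which row i of `a` is [2*i, 2*i + 1] and b[i] is i
a = np.arange(10).reshape(5, 2)
b = np.arange(5)

a1, b1 = unison_shuffled_copies_seeded(a, b, seed=42)
a2, b2 = unison_shuffled_copies_seeded(a, b, seed=42)
```

Calling the function twice with the same seed returns the same order, and each row of `a1` still lines up with its label in `b1`.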

### 2.2. Further Pre-processing <a class="anchor" id="2."></a>

#### 2.2.1 Reading in train, validation, and test data sets <a class="anchor" id="2.1"></a>

In [9]:
# Read in the spectrogram and labels data from the .npy files
train_spect, train_labels, val_spect, val_labels, test_spect, test_labels = load_data(
    data_dir, ds_description, 'spect', 'labels')

Loading .npy data files...

Total processing time (h:mm:ss): 0:00:15

Loaded .npy data files, verifying shape of saved data...
Shape of 'train_spect': (31970, 431, 128)
Shape of 'train_labels': (31970,)
Shape of 'val_spect': (4000, 431, 128)
Shape of 'val_labels': (4000,)
Shape of 'test_spect': (4000, 431, 128)
Shape of 'test_labels': (4000,)


In [10]:
# Check that the train, val, and test sets all have the same min and max values
print("Train [min, max]:", [train_spect.min(), train_spect.max()])
print("Val [min, max]:", [val_spect.min(), val_spect.max()])
print("Test [min, max]:", [test_spect.min(), test_spect.max()])

assert(train_spect.min() == val_spect.min() == test_spect.min()), 'minimum values do not match'
assert(train_spect.max() == val_spect.max() == test_spect.max()), 'maximum values do not match'

Train [min, max]: [-80.0, 3.814697265625e-06]
Val [min, max]: [-80.0, 3.814697265625e-06]
Test [min, max]: [-80.0, 3.814697265625e-06]


#### 2.2.2 Min-max scaling the spectrograms <a class="anchor" id="2.2"></a>

In [11]:
val_spect_minmax = min_max_scaler_3d(val_spect)
test_spect_minmax = min_max_scaler_3d(test_spect)
train_spect_minmax = min_max_scaler_3d(train_spect)

In [12]:
# Checking the shapes and that the min and max values lie between 0 and 1
print("Shape of 'train_spect':", train_spect_minmax.shape)
print("Train [min, max]:", [train_spect_minmax.min(), train_spect_minmax.max()])
print()
print("Shape of 'val_spect':", val_spect_minmax.shape)
print("Val [min, max]:", [val_spect_minmax.min(), val_spect_minmax.max()])
print()
print("Shape of 'test_spect':", test_spect_minmax.shape)
print("Test [min, max]:", [test_spect_minmax.min(), test_spect_minmax.max()])

Shape of 'train_spect': (31970, 431, 128)
Train [min, max]: [0.0, 1.0]

Shape of 'val_spect': (4000, 431, 128)
Val [min, max]: [0.0, 1.0000000000000002]

Shape of 'test_spect': (4000, 431, 128)
Test [min, max]: [0.0, 1.0000000000000002]
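
The validation and test maxima of 1.0000000000000002 sit one floating-point step above 1.0, which is consistent with rounding in the scaling arithmetic rather than a scaling error. If a strict [0, 1] range matters downstream, the values can be clipped. The sketch below shows this on a made-up array (`scaled` is hypothetical, not the notebook's data).

```python
import numpy as np

# Made-up scaled values; the last one is the next representable float above 1.0
scaled = np.array([0.0, 0.5, np.nextafter(1.0, 2.0)])
print(scaled.max() > 1.0)

# Clip into the closed interval [0, 1]
clipped = np.clip(scaled, 0.0, 1.0)
print(clipped.max())
```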


#### 2.2.3 Setting genre classes <a class="anchor" id="2.3"></a>

In [13]:
genre_dict = {0 : 'Hip-Hop',
              1 : 'Pop',
              2 : 'Folk',
              3 : 'Experimental',
              4 : 'Rock',
              5 : 'International',
              6 : 'Electronic',
              7 : 'Instrumental'}

In [14]:
# map labels to classes
train_classes = map_classes(train_labels, genre_dict)
val_classes = map_classes(val_labels, genre_dict)
test_classes = map_classes(test_labels, genre_dict)

### 2.3. Saving Pre-Processed Data <a class="anchor" id="3."></a>

#### 2.3.1 Shuffling the data and saving as .npy files<a class="anchor" id="3.1"></a>

In [15]:
# shuffle the data and save the pre-processed 'X' and 'Y' data files

train_X, train_Y = unison_shuffled_copies(train_spect_minmax, train_classes)
val_X, val_Y = unison_shuffled_copies(val_spect_minmax, val_classes)
test_X, test_Y = unison_shuffled_copies(test_spect_minmax, test_classes)

np.save(f'{data_dir}/train_X_{ds_description}_np', train_X)
np.save(f'{data_dir}/val_X_{ds_description}_np', val_X)
np.save(f'{data_dir}/test_X_{ds_description}_np', test_X)

np.save(f'{data_dir}/train_Y_{ds_description}_np', train_Y)
np.save(f'{data_dir}/val_Y_{ds_description}_np', val_Y)
np.save(f'{data_dir}/test_Y_{ds_description}_np', test_Y)
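
The paths passed to `np.save` above have no extension; `np.save` appends `.npy` when it is missing, which is why `load_data` can later read the files with `str_X='X'` and `str_Y='Y'`. A quick round trip can confirm that a saved array reloads unchanged. The sketch below does this in a temporary directory with a made-up array, so it does not touch the real data files.

```python
import os
import tempfile

import numpy as np

# Made-up array standing in for train_X
toy_X = np.random.default_rng(0).random((4, 3, 2))

with tempfile.TemporaryDirectory() as tmp_dir:
    path_stem = os.path.join(tmp_dir, 'train_X_toy_np')  # no extension, as in the cell above
    np.save(path_stem, toy_X)                            # written as train_X_toy_np.npy
    saved_files = os.listdir(tmp_dir)
    reloaded = np.load(path_stem + '.npy')

print(saved_files)
print(np.array_equal(reloaded, toy_X))
```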
