# Vocal Pitch Modulator NN Training 
This is the notebook used to train the Vocal Pitch Modulator.

This notebook is split into two sections. The first walks through, with plots and printed output, how the data is organized. The second uses that data to train our timbre encoder.

## Global variables/Imports
Run these cells before running either of the following sections.

In [None]:
%load_ext autoreload
%autoreload 1

import os
import csv

import scipy.io as sio
from scipy.io import wavfile
from scipy.io.wavfile import write

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import subplots

%aimport VPM
from VPM import *
%aimport Utils
from Utils import *

In [None]:
# Constants that should not change without the dataset being changed
n_pitches = 16
n_vowels = 12
n_people = 3

# These dictionaries are more for reference than anything
label_to_vowel = { 0: "bed",  1: "bird",   2: "boat",  3: "book", 
                   4: "cat",  5: "dog",    6: "feet",  7: "law",  
                   8: "moo",  9: "nut",   10: "pig",  11: "say" }

vowel_to_label = { "bed": 0,  "bird": 1,  "boat":  2, "book":  3,
                   "cat": 4,  "dog":  5,  "feet":  6, "law":   7,
                   "moo": 8,  "nut":  9,  "pig":  10, "say":  11}

noteidx_to_pitch = {  0: "A2",   1: "Bb2",  2: "B2",   3: "C3",
                      4: "Db3",  5: "D3",   6: "Eb3",  7: "E3", 
                      8: "F3",   9: "Gb3", 10: "G3",  11: "Ab3",
                     12: "A3",  13: "Bb3", 14: "B3",  15: "C4" }
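
For reference, the note indices above can be mapped to frequencies. The sketch below assumes standard 12-tone equal temperament with A4 = 440 Hz (so A2 = 110 Hz at index 0) and one semitone per index; `note_frequency` is an illustrative helper, not part of `Utils`.

```python
# Illustrative helper (not part of Utils): frequency of each note index,
# assuming 12-tone equal temperament with A2 = 110 Hz at index 0.
A2_HZ = 110.0

def note_frequency(note_idx):
    """Return the frequency in Hz of the note `note_idx` semitones above A2."""
    return A2_HZ * 2.0 ** (note_idx / 12.0)

# A2, A3 and C4 (indices 0, 12 and 15 in noteidx_to_pitch)
for idx in (0, 12, 15):
    print(idx, round(note_frequency(idx), 2))
```

Under these assumptions the dataset spans roughly 110 Hz (A2) to 262 Hz (C4).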

## Data Walkthrough
This first section demonstrates how the data is loaded into the data structures used for training. You do not need to execute the cells in this section: you can skip to the second section and run the Imports, Constants and Data Generation cells there to build the same data structures.

### Getting data references
Read the reference CSV into the relevant data structures.

`data_ref_list` is the list of filenames in the dataset in a 3d array format.
A specific file is accessed with `data_ref_list[vowel_idx][pitch_idx][person_idx]`.

`flat_data_ref_list` is the list of filenames in the dataset as a 1d array. To access a specific file, use `flat_data_ref_list[flat_idx(vowel, pitch, person)]`.

In [None]:
# e.g. data_ref_list[vowel_to_label["dog"]][5][1]
data_ref_list = create_data_ref_list(os.path.join("Data", 'dataset_files.csv'),
                            n_pitches, n_vowels, n_people)
# print(data_ref_list)
# e.g. flat_data_ref_list[flat_idx(3, 1, 2)]
flat_data_ref_list = flatten_3d_array(data_ref_list, 
                                      n_vowels, n_pitches, n_people)

The following are the accessor functions used to compute indices from flat to 3d and vice versa.

`flat_idx` returns a flat index given `(vowel, pitch, person)`, while `nd_idx` returns `(vowel, pitch, person)` given a flat index.

In [None]:
# Returns a flat_idx, given a vowel, pitch, person
flat_idx = lambda vowel, pitch, person: flat_array_idx(
    vowel, pitch, person, n_vowels, n_pitches, n_people)
# Returns vowel, pitch, person, given a flat_idx
nd_idx = lambda idx: nd_array_idx(idx, n_vowels, n_pitches, n_people)
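
As a rough illustration of what these accessors compute, the sketch below assumes a row-major layout over `(vowel, pitch, person)`, with `person` varying fastest. The functions `flat_idx_sketch` and `nd_idx_sketch` are hypothetical stand-ins; the actual `flat_array_idx` and `nd_array_idx` in `Utils` may be implemented differently.

```python
# Hypothetical stand-ins for the Utils accessors, assuming row-major
# ordering over (vowel, pitch, person).
n_vowels, n_pitches, n_people = 12, 16, 3

def flat_idx_sketch(vowel, pitch, person):
    """Map a (vowel, pitch, person) triple to a single flat index."""
    return (vowel * n_pitches + pitch) * n_people + person

def nd_idx_sketch(idx):
    """Invert flat_idx_sketch: recover (vowel, pitch, person)."""
    vowel, rest = divmod(idx, n_pitches * n_people)
    pitch, person = divmod(rest, n_people)
    return vowel, pitch, person

idx = flat_idx_sketch(4, 3, 2)
print(idx)                 # 203
print(nd_idx_sketch(idx))  # (4, 3, 2)
```

The two functions are inverses of each other, so every flat index in `range(n_vowels * n_pitches * n_people)` maps to exactly one triple.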

In [None]:
print("Data ref list ({}):".format(len(flat_data_ref_list)), 
      flat_data_ref_list)

### Data-label Pitch Index pairs
Generate the data-label pitch index pairs. This is an array where each element is a triple `[shift_amt, input_pitch_idx, label_pitch_idx]`.


In [None]:
data_label_pairs, _ = create_data_label_pairs(n_pitches)

In [None]:
print("Total data-label pairs ({}):".format(len(data_label_pairs)), 
      data_label_pairs)
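
One plausible way to build such triples is to enumerate every ordered (input, label) pitch pair and record the signed shift between them. This is only an assumption about `create_data_label_pairs`: the real function also returns a second value (discarded above) and may restrict or order the pairs differently. `pitch_pairs_sketch` is a hypothetical name.

```python
def pitch_pairs_sketch(n_pitches):
    """Hypothetical stand-in: list of [shift_amt, input_pitch_idx, label_pitch_idx]."""
    return [[label - inp, inp, label]
            for inp in range(n_pitches)
            for label in range(n_pitches)]

pairs = pitch_pairs_sketch(16)
print(len(pairs))  # 256 ordered pairs, including the zero-shift ones
print(pairs[:3])   # [[0, 0, 0], [1, 0, 1], [2, 0, 2]]
```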

### Get All .wav Data
Get the wav file data into a single matrix, where each element `all_wav_data[idx]` is the wavfile content of the file at `flat_data_ref_list[idx]`. To retrieve the 3d indices of a specific index, use `vowel, pitch, person = nd_idx(idx)`.


In [None]:
all_wav_data = load_wav_files(os.path.join("Data", "dataset"), 
                              flat_data_ref_list)

In [None]:
print("All wav data shape: {}\nTrack shape: {}".format(
      all_wav_data.shape, all_wav_data[0].shape))

### Create all spectrograms
Get the spectrograms for each wav in `all_wav_data`. The spectrogram at `all_spectrograms[idx]` is the spectrogram of the wav at `all_wav_data[idx]`.

In [None]:
all_spectrograms = np.array([ stft(waveform, plot=False) 
                              for waveform in all_wav_data ])

In [None]:
print("All spectrograms has shape: {} (n_wavs, n_freq_bins, n_windows)\n"
      .format(all_spectrograms.shape))

print("FFT Spectrogram of vowel 4, pitch 3, person 2 ({}):"
      .format(flat_data_ref_list[flat_idx(4, 3, 2)]))
plot_ffts_spectrogram(all_spectrograms[flat_idx(4, 3, 2)], sample_rate,
                      flat_data_ref_list[flat_idx(4, 3, 2)])
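
To make the spectrogram shape concrete, here is a minimal NumPy sketch of a short-time Fourier transform. It assumes an FFT size of 1024 (which yields the 513 frequency bins noted later in this notebook, since 1024 // 2 + 1 = 513), a hop of 512 samples and a Hann window; the sample rate of 16000 is purely hypothetical. The `stft` function in `Utils` may pad, window or scale differently.

```python
import numpy as np

def stft_sketch(waveform, n_fft=1024, hop=512):
    """Magnitude spectrogram of shape (n_fft // 2 + 1, n_frames), no padding."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop
    frames = np.stack([waveform[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies of each real-valued frame
    return np.abs(np.fft.rfft(frames, axis=1)).T

sr = 16000                            # hypothetical sample rate
t = np.arange(sr) / sr                # one second of samples
tone = np.sin(2 * np.pi * 1000 * t)   # 1 kHz test tone
spec = stft_sketch(tone)
print(spec.shape)                     # (513, 30)
print(np.argmax(spec.mean(axis=1)))   # 64, i.e. 1000 Hz * 1024 / 16000
```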

### Create Mel Spectrograms and MFCC
Get the mel spectrogram and MFCC for each FFT spectrogram (`ffts`) in `all_spectrograms`, using the same indexing as above.

In [None]:
all_mels, all_mfcc = map(np.array, map(list, zip(*
                         [ ffts_to_mel(ffts, n_mels = 128) 
                           for ffts in all_spectrograms ])))

In [None]:
print("All mels has shape: {} (n_wavs, n_mels, n_windows)"
      .format(all_mels.shape))
print("All mfccs has shape: {} (n_wavs, n_mfcc, n_windows)\n"
      .format(all_mfcc.shape))

print("Mel Spectrogram of vowel 4, pitch 3, person 2 ({}):"
      .format(flat_data_ref_list[flat_idx(4, 3, 2)]))
plot_mel_spectrogram(all_mels[flat_idx(4, 3, 2)], sample_rate,
                     flat_data_ref_list[flat_idx(4, 3, 2)])
print("MFCC of vowel 4, pitch 3, person 2 ({}):"
      .format(flat_data_ref_list[flat_idx(4, 3, 2)]))
plot_mfcc(all_mfcc[flat_idx(4, 3, 2)], sample_rate,
          flat_data_ref_list[flat_idx(4, 3, 2)])
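
The general idea behind these two features can be sketched in NumPy: a bank of triangular filters spaced evenly on the mel scale turns a power spectrogram into a mel spectrogram, and a DCT-II of the log mel energies gives the MFCCs. Everything below (the mel formula variant, the unnormalized DCT, the sample rate, and the name `ffts_to_mel_sketch`) is an assumption for illustration; the actual `ffts_to_mel` in `Utils` may differ in scaling and normalization.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_freq_bins, sr):
    """Triangular filters evenly spaced in mel, shape (n_mels, n_freq_bins)."""
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    freqs = np.linspace(0.0, sr / 2, n_freq_bins)
    fb = np.zeros((n_mels, n_freq_bins))
    for m in range(n_mels):
        lo, ctr, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def ffts_to_mel_sketch(ffts, sr, n_mels=128, n_mfcc=20):
    """Return (mel, mfcc) with shapes (n_mels, n_windows), (n_mfcc, n_windows)."""
    power = np.abs(ffts) ** 2
    mel = mel_filterbank(n_mels, ffts.shape[0], sr) @ power
    log_mel = np.log(mel + 1e-10)   # small offset avoids log(0)
    # Unnormalized DCT-II basis of shape (n_mfcc, n_mels)
    n = np.arange(n_mels)
    k = np.arange(n_mfcc)[:, None]
    dct = np.cos(np.pi / n_mels * (n + 0.5) * k)
    return mel, dct @ log_mel

rng = np.random.default_rng(0)
fake_ffts = rng.random((513, 58))   # stand-in for one spectrogram
mel, mfcc = ffts_to_mel_sketch(fake_ffts, sr=16000)
print(mel.shape, mfcc.shape)        # (128, 58) (20, 58)
```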

## ANN Training - Timbre Encoder
This section trains the timbre encoder.

### Imports
Run these cells before running the following section.

In [None]:
import time
import math

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error, log_loss

from tqdm.notebook import trange, tqdm

from IPython.display import HTML
import warnings
from sklearn.preprocessing import OneHotEncoder
from sklearn.datasets import make_blobs

import torch
warnings.filterwarnings('ignore')
import torch.nn as nn
import torch.nn.functional as F
from torch import optim

%aimport ANN
from ANN import *

### Constants
Used to tune the ANN.

In [None]:
n_mels = 128
n_mfcc = 20
n_hid1 = 12
n_timb = 4
lr     = 0.2

### Data Generation
This is all the code that was explained in the Data Walkthrough above. It generates data structures to hold all wav file data, spectrograms, mel spectra and MFCC data for all wav files.

In [None]:
# File reference lists
data_ref_list = create_data_ref_list(os.path.join("Data", 'dataset_files.csv'),
                            n_pitches, n_vowels, n_people)
# flat_data_ref_list[flat_idx(vowel, pitch, person)]
flat_data_ref_list = flatten_3d_array(data_ref_list, 
                                      n_vowels, n_pitches, n_people)

# File reference list accessors
# Returns a flat_idx, given a vowel, pitch, person
flat_idx = lambda vowel, pitch, person: flat_array_idx(
    vowel, pitch, person, n_vowels, n_pitches, n_people)
# Returns vowel, pitch, person, given a flat_idx
nd_idx = lambda idx: nd_array_idx(idx, n_vowels, n_pitches, n_people)

# Data-label pairs for pitch-shift training - not used here
# data_label_pairs, _ = create_data_label_pairs(n_pitches)

# wav, spectrogram, mels, mfcc for each file in flat_data_ref_list
# wav_data:     (576, ~29400)  (n_wavs, n_samples)
# spectrograms: (576, 513, 58) (n_wavs, n_freq_bins, n_windows)
# mels:         (576, 128, 58) (n_wavs, n_mels, n_windows)
# mfccs:        (576, 20, 58)  (n_wavs, n_mfcc, n_windows)
all_wav_data = load_wav_files(os.path.join("Data", "dataset"), 
                              flat_data_ref_list)
all_spectrograms = np.array([ stft(waveform, plot=False) 
                              for waveform in all_wav_data ])
all_mels, all_mfcc = map(np.array, map(list, zip(*
                         [ ffts_to_mel(ffts, n_mels = n_mels, n_mfcc = n_mfcc) 
                           for ffts in all_spectrograms ])))

### Data-Label Structuring
This puts together the actual data-label pairs to be fed into the ANN.

Generate `data` and `labels` from `all_mfcc`, using `nd_idx` to recover the vowel label of each file.

In [None]:
n_files, n_mfcc_dummy, n_windows = all_mfcc.shape

# vowel_labels: (576) (n_wavs)
all_vowel_labels, _, _ = map(np.array, map(list, zip(*
                         [ nd_idx(idx) 
                           for idx in range(len(flat_data_ref_list)) ])))
# data:   (33408, 20) (n_wavs * n_windows, n_mfcc)
# labels: (33408) (n_wavs * n_windows)
data = np.array([ all_mfcc[wav_file_idx][:, window_idx] 
                  for wav_file_idx in range(n_files) 
                  for window_idx in range(n_windows) ])
labels = np.array([ all_vowel_labels[wav_file_idx]
                    for wav_file_idx in range(n_files)
                    for window_idx in range(n_windows) ])
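
The nested comprehensions above can also be written as array operations, which is typically faster on large arrays. The sketch below uses a small random stand-in for `all_mfcc` and checks that a transpose plus reshape gives the same rows in the same order; the toy sizes and label values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_files, n_mfcc, n_windows = 6, 20, 58   # small stand-in for (576, 20, 58)
all_mfcc = rng.standard_normal((n_files, n_mfcc, n_windows))
all_vowel_labels = np.arange(n_files) % 3

# (n_files, n_mfcc, n_windows) -> (n_files, n_windows, n_mfcc) -> (rows, n_mfcc)
data = all_mfcc.transpose(0, 2, 1).reshape(-1, n_mfcc)
# Each file's label is repeated once per window
labels = np.repeat(all_vowel_labels, n_windows)

# Same result as the nested comprehension over files and windows
data_loop = np.array([all_mfcc[f][:, w]
                      for f in range(n_files)
                      for w in range(n_windows)])
assert np.array_equal(data, data_loop)
print(data.shape, labels.shape)          # (348, 20) (348,)
```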

In [None]:
# For testing purposes - verify that the MFCCs have been arranged in order of
# wav_idx, win_idx, mfcc_idx
for wav_idx in range(n_files):
    for win_idx in range(n_windows):
        for m in range(n_mfcc_dummy):
            assert data[wav_idx * n_windows + win_idx][m] == \
                   all_mfcc[wav_idx][m][win_idx]
for wav_idx in range(n_files):
    for win_idx in range(n_windows):
        assert labels[wav_idx * n_windows + win_idx] == \
               all_vowel_labels[wav_idx]

Split the data into training and validation sets, and convert them to Torch tensors of the correct types.

In [None]:
# X_train, Y_train: (25056, 20) (25056) 
# X_val, Y_val:     (8352, 20) (8352)
X_train, X_val, Y_train, Y_val = train_test_split(data, labels, stratify=labels, random_state=0)
X_train, Y_train, X_val, Y_val = map(torch.tensor, (X_train, Y_train, X_val, Y_val))
# Cast features to float32, the default dtype of the model's parameters
X_train = X_train.float(); X_val = X_val.float()
# Class labels are used as indices by cross_entropy, so they must be long
Y_train = Y_train.long(); Y_val = Y_val.long()
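
The sizes in the comments above follow from the dataset constants. `train_test_split` holds out 25% of the rows when neither `test_size` nor `train_size` is given, so the arithmetic can be checked as below, taking the 58 windows per file from the shapes listed in the Data Generation cell.

```python
import math

n_vowels, n_pitches, n_people = 12, 16, 3
n_windows = 58                               # windows per file

n_files = n_vowels * n_pitches * n_people    # 576 wav files
n_rows = n_files * n_windows                 # 33408 MFCC frames in total
n_val = math.ceil(0.25 * n_rows)             # 8352 held-out frames
n_train = n_rows - n_val                     # 25056 training frames
print(n_files, n_rows, n_train, n_val)
```

Because the split is made per frame rather than per file, frames from the same recording can land in both sets.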

### Timbre-Encoder
This takes MFCC features (and possibly mel spectrograms in the future) and tries to identify the vowel spoken.

In [None]:
# Training model 
# model = TimbreEncoder(n_mfcc=n_mfcc, n_hid1=n_hid1, n_timb=n_timb, n_vowels=n_vowels)
model = TimbreEncoder()
# Define loss 
loss_fn = F.cross_entropy
# Define optimizer 
opt = optim.SGD(model.parameters(), lr=lr)
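
The definition of `TimbreEncoder` lives in `ANN` and is not shown in this notebook. Based only on the constructor arguments in the commented-out line above, one possible layout is a small fully connected network that narrows from `n_mfcc` inputs through `n_hid1` to an `n_timb`-dimensional timbre code and then classifies into `n_vowels`. The class below is a hypothetical sketch; the layer count and ReLU activations are assumptions.

```python
import torch
import torch.nn as nn

class TimbreEncoderSketch(nn.Module):
    """Hypothetical layout: n_mfcc -> n_hid1 -> n_timb -> n_vowels."""

    def __init__(self, n_mfcc=20, n_hid1=12, n_timb=4, n_vowels=12):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_mfcc, n_hid1), nn.ReLU(),
            nn.Linear(n_hid1, n_timb), nn.ReLU(),
        )
        self.classifier = nn.Linear(n_timb, n_vowels)

    def forward(self, x):
        # Return raw logits; F.cross_entropy applies log-softmax internally
        return self.classifier(self.encoder(x))

logits = TimbreEncoderSketch()(torch.randn(8, 20))
print(logits.shape)  # torch.Size([8, 12])
```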

In [None]:
print("GPU Available" if torch.cuda.is_available() else "GPU Not available")

GPU Version (Only run if GPU is available)

In [None]:
# Use GPU if possible
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# Move inputs to GPU (if possible)
X_train_gpu = X_train.to(device)
Y_train_gpu = Y_train.to(device)
X_val_gpu = X_val.to(device)
Y_val_gpu = Y_val.to(device)

# Move the network to GPU (if possible)
model.to(device) 
# model = model.to(device) 
# Define optimizer 
opt = optim.SGD(model.parameters(), lr=lr)

# Fit the model
tic = time.time()
loss = fit(X_train_gpu, Y_train_gpu, X_val_gpu, Y_val_gpu, 
           model, opt, loss_fn, epochs=5000, print_graph = True)
toc = time.time()
print('Final loss: {}\nTime taken: {}'.format(loss, toc - tic))

Non-GPU Version (Run if GPU is not available)

In [None]:
# Fit the model
tic = time.time()
loss = fit(X_train, Y_train, X_val, Y_val, model, opt, loss_fn, epochs=500, print_graph = True)
toc = time.time()
print('Final loss: {}\nTime taken: {}'.format(loss, toc - tic))
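
The `fit` function comes from `ANN` and its body is not shown here. A minimal full-batch training loop with a similar call shape might look like the sketch below. The name `fit_sketch`, the absence of mini-batching and plotting, and the choice to return the validation loss are all assumptions; the toy data and linear model exist only to make the example runnable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim

def fit_sketch(x, y, x_val, y_val, model, opt, loss_fn, epochs=100):
    """Full-batch training; returns the final validation loss as a float."""
    for _ in range(epochs):
        model.train()
        opt.zero_grad()                # clear gradients from the previous step
        loss = loss_fn(model(x), y)
        loss.backward()                # backpropagate
        opt.step()                     # update the parameters
    model.eval()
    with torch.no_grad():
        return loss_fn(model(x_val), y_val).item()

torch.manual_seed(0)
x = torch.randn(64, 20)                # toy features shaped like MFCC rows
y = torch.randint(0, 12, (64,))        # toy vowel labels
model = nn.Linear(20, 12)
opt = optim.SGD(model.parameters(), lr=0.2)
before = F.cross_entropy(model(x), y).item()
after = fit_sketch(x, y, x, y, model, opt, F.cross_entropy)
print(before, after)
```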