# Data Processing
## Notebook
This notebook splits the raw recordings, each of which contains multiple pitches of a single vowel, into files that each contain a single pitch of a single vowel.

The directory structure should be as follows:
```
<Root directory>
-- DataProcessing.ipynb
-- file_list.txt
-- raw_data
---- 0_0-bed.wav
---- 0_1-bird.wav
---- <...>
---- 3_11-say.wav
-- processed_data
---- <Empty folder>
```
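
Before running the notebook, the layout above could be checked with a small sketch like the one below. `check_layout` is a hypothetical helper, not part of the notebook; it assumes the three names shown in the tree and creates `processed_data` if it is missing.

```python
import os

def check_layout(root="."):
    """Return a list of problems found with the expected directory layout."""
    problems = []
    if not os.path.isfile(os.path.join(root, "file_list.txt")):
        problems.append("missing file_list.txt")
    if not os.path.isdir(os.path.join(root, "raw_data")):
        problems.append("missing raw_data directory")
    # Create the output folder if absent so that writing results cannot fail on it
    os.makedirs(os.path.join(root, "processed_data"), exist_ok=True)
    return problems
```

An empty list means the inputs are in place and the output folder exists.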

Each raw file contains 16 pitches of a single vowel and is approximately 48 s long. Raw files are named `<personNum>_<wordIdx>-<wordVowel>.wav`, e.g. `0_4-cat.wav`.

A list of filenames is provided in `file_list.txt`. Each row has the format `<filename> <wordIdx>`, where `<wordIdx>` is used as the vowel label for that file, e.g. `1_8-moo.wav 8`.
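
As an illustration of the row format, the contents of `file_list.txt` could be parsed as sketched below. `parse_file_list` is a hypothetical helper, and skipping blank rows is an assumption rather than something the notebook does.

```python
def parse_file_list(text):
    """Parse the contents of file_list.txt into (filename, label) pairs."""
    entries = []
    for row in text.splitlines():
        if not row.strip():
            continue  # ignore blank rows (assumption: they carry no data)
        filename, word_idx = row.split()
        entries.append((filename, int(word_idx)))
    return entries

sample = "0_4-cat.wav 4\n0_5-dog.wav 5\n"
print(parse_file_list(sample))  # [('0_4-cat.wav', 4), ('0_5-dog.wav', 5)]
```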

Processed files are named `<personNum>_<wordIdx>-<wordVowel>_<pitchIdx>-<pitch>.wav`, e.g. `2_3-book_13-Bb3.wav`.
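
The naming scheme can be sketched as a small function. `processed_name` and `PITCHES` are hypothetical names used for illustration; the pitch names are copied from the notebook's `noteidx_to_pitch` table.

```python
import os

# Pitch names for the 16 notes, in recording order (copied from the
# notebook's noteidx_to_pitch table)
PITCHES = ["A2", "Bb2", "B2", "C3", "Db3", "D3", "Eb3", "E3",
           "F3", "Gb3", "G3", "Ab3", "A3", "Bb3", "B3", "C4"]

def processed_name(raw_filename, pitch_idx):
    """Build the processed filename for one note of a raw file."""
    stem, ext = os.path.splitext(raw_filename)
    return f"{stem}_{pitch_idx}-{PITCHES[pitch_idx]}{ext}"

print(processed_name("0_4-cat.wav", 0))  # 0_4-cat_0-A2.wav
```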

## Person Numbers
Person numbers are in alphabetical order:
- 0: Louiz
- 1: Rachel
- 2: Shaun
- 3: Zachary  

## Vowels
The vowels, listed by index and example word, are:
- 0-bed
- 1-bird
- 2-boat
- 3-book
- 4-cat
- 5-dog
- 6-feet
- 7-law
- 8-moo
- 9-nut
- 10-pig
- 11-say

## Output Notes
For reference, a few output files sound strange when played in Windows Media Player (WMP):
- 3_11-say_0-A2.wav
- 3_6-feet_1-Bb2.wav
- 3_6-feet_0-A2.wav
- 2_9-nut_3-C3.wav
- 2_9-nut_2-B2.wav
- 2_9-nut_1-Bb2.wav
- 2_8-moo_2-B2.wav
- 2_8-moo_1-Bb2.wav
- 2_8-moo_0-A2.wav
- 2_6-feet_0-A2.wav
- 2_5-dog_3-C3.wav
- 2_1-bird_3-C3.wav
- 2_0-bed_3-C3.wav

In [1]:
# Import statements
import scipy.io as sio
from scipy.io import wavfile
from scipy.io.wavfile import write
import scipy.signal as sis
import scipy.fftpack as fftpack

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import subplots
import os
import csv

# Duration (s) of an optimally recorded note: 4/3, i.e. about 1.33
note_duration = 4/3
# Duration (s) kept per note, to allow for late starts and early cutoffs
expected_duration = 2/3

In [6]:
# References
label_to_vowel = { 0: "bed",  1: "bird",   2: "boat",  3: "book", 
                   4: "cat",  5: "dog",    6: "feet",  7: "law",  
                   8: "moo",  9: "nut",   10: "pig",  11: "say" }
vowel_to_label = { "bed": 0,  "bird": 1,  "boat":  2, "book":  3,
                   "cat": 4,  "dog":  5,  "feet":  6, "law":   7,
                   "moo": 8,  "nut":  9,  "pig":  10, "say":  11}
noteidx_to_pitch = {  0: "A2",   1: "Bb2",  2: "B2",   3: "C3",
                      4: "Db3",  5: "D3",   6: "Eb3",  7: "E3", 
                      8: "F3",   9: "Gb3", 10: "G3",  11: "Ab3",
                     12: "A3",  13: "Bb3", 14: "B3",  15: "C4" }
# Start time (s) of each note within a raw file
noteidx_to_timestamp = np.array([  12/3,  20/3,  28/3,  36/3, 
                                   44/3,  52/3,  60/3,  68/3,
                                   76/3,  84/3,  92/3, 100/3,
                                  108/3, 116/3, 124/3, 132/3 ])

# Input/Output Files
file_list = "file_list.txt"
data_dir = "raw_data"
output_dir = "processed_data"
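
The timestamp table above places the first note at 4 s and each later note 8/3 s after the previous one. As a rough sketch of the trimming arithmetic, assuming the kept slice is centred within each 4/3 s note as `process_files` does, the window for note `idx` could be computed as follows. `note_window` and the constant names are hypothetical; exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

NOTE_DURATION = Fraction(4, 3)      # length of one recorded note (s), from the notebook
EXPECTED_DURATION = Fraction(2, 3)  # length kept per note (s), from the notebook
FIRST_NOTE_START = Fraction(12, 3)  # first entry of noteidx_to_timestamp (s)
NOTE_SPACING = Fraction(8, 3)       # gap between consecutive table entries (s)

def note_window(idx):
    """Return (start, end) in seconds of the slice kept for note `idx`."""
    trim = (NOTE_DURATION - EXPECTED_DURATION) / 2  # trimmed from each end
    onset = FIRST_NOTE_START + idx * NOTE_SPACING
    return onset + trim, onset + NOTE_DURATION - trim

for idx in (0, 15):
    start, end = note_window(idx)
    print(idx, start, end)
```

For the first note this gives the window from 13/3 s to 5 s, and for the last note from 133/3 s to 45 s, so every kept slice is 2/3 s long and ends well inside a 48 s recording.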

In [71]:
def timestamp_to_sample(timestamp, samplerate):
    return int(timestamp * samplerate)

def process_files():
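    # Trim equally from both ends of each note so that only the central
    # expected_duration seconds are kept
    trim_amount = (note_duration - expected_duration) / 2
    
    # Open and read the file list, closing the file afterwards
    with open(file_list) as fl:
        lines = fl.readlines()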
    
    # Process each file
    for line in lines:
        # Skip blank lines, e.g. a trailing newline at the end of the list
        if not line.strip():
            continue
        filename, label = line.split()
        filepath = os.path.join(data_dir, filename)
        print("Analyzing file:", filepath)
        
        # Read the raw wav file
        samplerate, short_data = sio.wavfile.read(filepath)
        
        # Write 16 wav files, one per note
        for idx, timestamp in enumerate(noteidx_to_timestamp):
            # Sample range of the trimmed note
            start = timestamp_to_sample(timestamp + trim_amount, samplerate)
            end = timestamp_to_sample(timestamp + note_duration - trim_amount, samplerate)
            curr_wav = short_data[start:end]
            # Append "_<pitchIdx>-<pitch>" to the raw filename (without extension)
            curr_filename = os.path.splitext(filename)[0] + "_" + str(idx) + "-" + noteidx_to_pitch[idx] + ".wav"
            write(os.path.join(output_dir, curr_filename), samplerate, curr_wav)
    
process_files()

Analyzing file: raw_data/0_0-bed.wav
Analyzing file: raw_data/0_1-bird.wav
Analyzing file: raw_data/0_2-boat.wav
Analyzing file: raw_data/0_3-book.wav
Analyzing file: raw_data/0_4-cat.wav
Analyzing file: raw_data/0_5-dog.wav
Analyzing file: raw_data/0_6-feet.wav
Analyzing file: raw_data/0_7-law.wav
Analyzing file: raw_data/0_8-moo.wav
Analyzing file: raw_data/0_9-nut.wav
Analyzing file: raw_data/0_10-pig.wav
Analyzing file: raw_data/0_11-say.wav
Analyzing file: raw_data/1_0-bed.wav
Analyzing file: raw_data/1_1-bird.wav
Analyzing file: raw_data/1_2-boat.wav
Analyzing file: raw_data/1_3-book.wav
Analyzing file: raw_data/1_4-cat.wav
Analyzing file: raw_data/1_5-dog.wav
Analyzing file: raw_data/1_6-feet.wav
Analyzing file: raw_data/1_7-law.wav
Analyzing file: raw_data/1_8-moo.wav
Analyzing file: raw_data/1_9-nut.wav
Analyzing file: raw_data/1_10-pig.wav
Analyzing file: raw_data/1_11-say.wav
Analyzing file: raw_data/2_0-bed.wav
Analyzing file: raw_data/2_1-bird.wav
Analyzing file: raw_data/