# Data Preprocessing

- resampling & normalization
- data augmentation (noise addition, pitch shift, time stretching)
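As a rough illustration of two of the steps listed above, the sketch below peak-normalizes a waveform and adds white noise at a chosen signal-to-noise ratio using only NumPy. The helper names `peak_normalize` and `add_noise` are hypothetical and are not taken from `utility_data`; the actual implementation there may differ. Pitch shifting and time stretching are not shown here; they are commonly done with `librosa.effects.pitch_shift` and `librosa.effects.time_stretch`.

```python
import numpy as np


def peak_normalize(waveform: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Scale the waveform so that its largest absolute sample is 1."""
    peak = np.max(np.abs(waveform))
    return waveform / max(peak, eps)  # eps guards against all-zero input


def add_noise(waveform: np.ndarray, snr_db: float, rng: np.random.Generator) -> np.ndarray:
    """Add white Gaussian noise at the requested signal-to-noise ratio (in dB)."""
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise


# One second of a 440 Hz tone at 32 kHz, standing in for a real recording.
sample_rate = 32000
t = np.arange(sample_rate) / sample_rate
tone = 0.3 * np.sin(2 * np.pi * 440 * t)

normalized = peak_normalize(tone)
noisy = add_noise(normalized, snr_db=20.0, rng=np.random.default_rng(0))
```

Specifying the noise level as an SNR rather than a fixed amplitude keeps the augmentation strength comparable across recordings of different loudness.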

In [1]:
from utility_data import *

First, the original data is loaded together with its metadata, which is needed to preprocess the data. The parameter `feature_mode=''` ensures that the audio is only split into clips, without feature extraction.

In [2]:
audio_params = {'sample_rate': 32000, 'n_fft': 1024, 'hop_length': 501, 'n_mfcc': 128, 'n_mels': 128, 'feature_size': 2048}

dataset = AudioDataset(
    datafolder="data",
    metadata_csv="train.csv",
    audio_dir="train_audio",
    feature_mode='',
    audio_params=audio_params,
    metadata=True
)

`AudioDataset` can export the preprocessed data as a new metadata `.csv` file and a matching audio directory. The exported data can then be loaded directly onto the GPU without being reprocessed on every run, which would otherwise be a major bottleneck in training.

Only a fraction of the data is preprocessed here: just enough to check that the code works. Preprocessing the whole dataset would take around 50 GB of storage.

In [3]:
# Adjust the number of rows as needed to export more data.
# Note that the new files will have the same size
# as the original data.
dataset.data = dataset.data.head(100)
dataset.preprocess(output='train_proc')

The preprocessed data is read back into memory to ensure that the operation was successful.

In [4]:
audio_params = {'sample_rate': 32000, 'n_fft': 1024, 'hop_length': 501, 'n_mfcc': 128, 'n_mels': 128, 'feature_size': 2048}

dataset = AudioDataset(
    datafolder="data",
    metadata_csv="train_proc.csv",
    audio_dir="train_proc",
    feature_mode='mel',
    audio_params=audio_params,
    metadata=False
)
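The arithmetic below suggests one possible reason for the unusual `hop_length` of 501. It assumes 5-second clips and a librosa-style centered STFT, where the number of frames is `1 + n_samples // hop_length`; whether `AudioDataset` frames the audio this way is not shown in this notebook.

```python
sample_rate = 32000
hop_length = 501
clip_seconds = 5  # assumption: clip length taken from the 5 s clips mentioned in this notebook

n_samples = sample_rate * clip_seconds   # samples in one clip
n_frames = 1 + n_samples // hop_length   # centered STFT: one frame per full hop, plus one

print(n_samples, n_frames)  # 160000 320
```

Under these assumptions, a mel spectrogram with `n_mels=128` would have shape `(128, 320)` per clip, whereas a hop length of 500 would give 321 frames.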

In [None]:
# Quantify the audio lost by discarding leftovers shorter than 2.5 s
# when each recording is split into 5 s clips.
(dataset.data['duration'] % 5).hist(bins=200)
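To put a number on that loss rather than only plotting it, the sketch below computes the seconds discarded per recording. It assumes the rule implied by the comment above: a leftover shorter than 2.5 s is dropped, while a longer one is kept as a clip. The function name `discarded_seconds` and the example durations are made up for illustration.

```python
import numpy as np

CLIP_SECONDS = 5.0
MIN_LEFTOVER = 2.5  # assumed threshold: shorter leftovers are dropped


def discarded_seconds(durations: np.ndarray) -> np.ndarray:
    """Seconds of audio dropped per recording under the assumed rule."""
    leftover = durations % CLIP_SECONDS
    return np.where(leftover < MIN_LEFTOVER, leftover, 0.0)


# Made-up recording durations in seconds.
durations = np.array([4.0, 12.0, 17.5, 31.0])
lost = discarded_seconds(durations)
lost_fraction = lost.sum() / durations.sum()
```

With real data, `dataset.data['duration'].to_numpy()` could be passed in place of the example array, provided the `duration` column is present.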