# Data Preprocessing

We use this notebook to manually guide the splitting of the recordings into 5-second chunks.
The process consists of three steps:
1. Reading the original dataset.
2. Splitting the files into fixed-size audio clips.
3. Reading back the new dataset.

Note that the `Dataset` object does not read the audio files into memory; it only loads the metadata _csv_. Reading the whole dataset at once would be both unnecessary and infeasible.
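
The lazy-loading idea can be illustrated with a minimal sketch. It is not the `AudioDataset` implementation from `utility_data`, which is not shown in this notebook: the class `LazyAudioIndex` and the column name `filename` are hypothetical, and the standard-library `csv` and `wave` modules stand in for whatever readers the real class uses.

```python
import csv
import wave
from pathlib import Path


class LazyAudioIndex:
    """Hypothetical stand-in: keeps only the metadata rows in memory."""

    def __init__(self, metadata_csv, audio_dir):
        self.audio_dir = Path(audio_dir)
        # Only the small CSV index is read here; no audio is touched.
        with open(metadata_csv, newline="") as handle:
            self.rows = list(csv.DictReader(handle))

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        # The audio file is opened only when an item is requested.
        row = self.rows[index]
        with wave.open(str(self.audio_dir / row["filename"]), "rb") as wav:
            frames = wav.readframes(wav.getnframes())
            sample_rate = wav.getframerate()
        return frames, sample_rate, row
```

Building the index costs one pass over the CSV, and the memory footprint grows with the number of rows rather than with the total audio length.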

In [None]:
from utility_data import *

As a first step, we load the dataframe of the dataset with `metadata=True`, which allows us to split the recordings based on audio length. Setting `feature_mode=''` ensures that the audio is simply split while preserving its format. We first considered saving the Mel spectrograms directly, to avoid computing them at every training step, but saving the whole dataset would have taken 50 GB, surpassing our memory constraints. We therefore opted to compute them with CUDA to speed up computation.

In [None]:
audio_params = {'sample_rate': 32000, 'n_fft': 1024, 'hop_length': 501, 'n_mfcc': 128, 'n_mels': 128, 'feature_size': 2048}

df = AudioDataset(
    datafolder="data",
    metadata_csv="train.csv",
    audio_dir="train_audio",
    feature_mode='',
    audio_params=audio_params,
    metadata=True
)
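
As a rough check of what these parameters imply, the sketch below computes the shape and raw size of one Mel spectrogram for a 5 s chunk. It assumes a centered STFT, for which the frame count is `1 + n_samples // hop_length` (the default convention in librosa and torchaudio), and `float32` storage. Neither assumption is confirmed by `utility_data`.

```python
audio_params = {'sample_rate': 32000, 'n_fft': 1024, 'hop_length': 501, 'n_mels': 128}
chunk_seconds = 5  # chunk length used in this notebook

n_samples = audio_params['sample_rate'] * chunk_seconds
# Centered STFT (assumption): one frame per hop, plus one.
n_frames = 1 + n_samples // audio_params['hop_length']
# float32 storage (assumption): 4 bytes per value.
bytes_per_chunk = audio_params['n_mels'] * n_frames * 4

print(n_samples, n_frames, bytes_per_chunk)  # 160000 320 163840
```

Under these assumptions, a hop length of 501 yields 320 frames per chunk, whereas a hop length of 500 would yield 321.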

The advantage of splitting the AudioDataset into shorter files, with a new '.csv' index and a new audio directory (replacing 'train_audio'), is that it allows us to access a relevant section of a recording directly, without reading the complete file into memory. We perform the split once, and the result can be reused by loading the data directly to the GPU, without reprocessing every time, which is a major bottleneck to training.
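
A minimal sketch of the splitting arithmetic follows. It assumes consecutive, non-overlapping 5 s windows and a trailing remainder that is kept only if it is at least 2.5 s long. The function `chunk_bounds` is hypothetical: the actual logic lives in `AudioDataset.preprocess`, which is not shown here, and how a kept remainder is padded is left out.

```python
def chunk_bounds(n_samples, sample_rate=32000, chunk_seconds=5.0, min_tail_seconds=2.5):
    """Return (start, end) sample offsets of the chunks to keep."""
    chunk = int(chunk_seconds * sample_rate)
    min_tail = int(min_tail_seconds * sample_rate)
    bounds = []
    for start in range(0, n_samples, chunk):
        end = min(start + chunk, n_samples)
        # Full chunks are always kept; a shorter final piece is kept only if long enough.
        if end - start >= min_tail:
            bounds.append((start, end))
    return bounds


# A 12 s recording: two full chunks, and the 2 s tail is dropped.
print(chunk_bounds(12 * 32000))  # [(0, 160000), (160000, 320000)]
# A 13 s recording: the 3 s tail is kept.
print(chunk_bounds(13 * 32000))  # [(0, 160000), (160000, 320000), (320000, 416000)]
```

Each kept pair can then be written to its own file, so a training step only needs to open the short file it is interested in.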

To test the code before submitting it to the cluster queue, we also preprocess a fraction of the data, on which we run our scripts as a preliminary step. Because we work both locally and on the cluster, we also maintain two different sets of conda environments, which we track using 'yml' files: one for our local devices and the other for our partition in the cluster.

In [None]:
# adjust as needed to export more data
df.data = df.data.head(1000)
df.preprocess(output='train_proc')

The preprocessed data is read back into memory to ensure that the operation was successful. We use this snippet to examine the data frame.

In [None]:
audio_params = {'sample_rate': 32000, 'n_fft': 1024, 'hop_length': 501, 'n_mfcc': 128, 'n_mels': 128, 'feature_size': 2048}

dataset = AudioDataset(
    datafolder="data",
    metadata_csv="train_proc.csv",
    audio_dir="train_proc",
    feature_mode='mel',
    audio_params=audio_params,
    metadata=False
)

Finally, we quantify the data loss caused by dropping audio segments shorter than 2.5 s.

In [None]:
# histogram of durations modulo the 5 s chunk length
(dataset.data['duration'] % 5).hist(bins=200)
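
To turn the histogram into a single figure, one could sum the remainders that fall below the threshold. This is a sketch under assumptions: it treats `duration` as the original recording length in seconds and assumes that exactly the remainders shorter than 2.5 s are discarded. The helper `dropped_fraction` is hypothetical.

```python
def dropped_fraction(durations, chunk_seconds=5.0, min_tail_seconds=2.5):
    """Fraction of total audio time lost by discarding short remainders."""
    total = sum(durations)
    dropped = sum(d % chunk_seconds for d in durations
                  if d % chunk_seconds < min_tail_seconds)
    return dropped / total if total else 0.0


# Toy durations in seconds, not real data: the remainders are 2, 3, 0 and 1,
# so 3 s out of 36 s are dropped.
print(dropped_fraction([12.0, 13.0, 5.0, 6.0]))
```

With the notebook's data, this could be called as `dropped_fraction(dataset.data['duration'].tolist())`.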