# Training and test sets creation
The first step of the project consists of creating the training and test sets.

## Feature extraction
The dataset is created with the `Features` class.
Each audio file is loaded into memory and the following features are extracted:
- MFCC (mel-frequency cepstral coefficients)
- chroma
- RMS energy

Every feature array is then reduced with the following functions: 
- min
- max
- median
- mean

The results are concatenated, giving a total of 132 features per audio file.
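
The reduction step can be sketched as follows. This is only an illustration on random arrays, not the code of the `Features` class, whose internals are not shown here. It assumes 20 MFCC coefficients, 12 chroma bins and 1 RMS row (common defaults that happen to give 33 rows × 4 statistics = 132 values); the helper name `reduce_feature` and the frame count are hypothetical.

```python
import numpy as np


def reduce_feature(frames: np.ndarray) -> np.ndarray:
    """Collapse a (n_rows, n_frames) array to [min, max, median, mean] per row."""
    stats = np.stack(
        [
            frames.min(axis=1),
            frames.max(axis=1),
            np.median(frames, axis=1),
            frames.mean(axis=1),
        ],
        axis=1,
    )
    # Flatten row by row: row 0 statistics first, then row 1, and so on
    return stats.ravel()


# Random stand-ins for the per-frame features of one audio file (assumed shapes)
rng = np.random.default_rng(0)
n_frames = 173
mfcc = rng.normal(size=(20, n_frames))
chroma = rng.uniform(size=(12, n_frames))
rms = rng.uniform(size=(1, n_frames))

feature_vector = np.concatenate([reduce_feature(x) for x in (mfcc, chroma, rms)])
print(feature_vector.shape)  # (132,)
```

With this layout the four statistics of each row sit next to each other, which appears consistent with the values in the training table (for example `f_128` to `f_131` look like the minimum, maximum, median and mean of the RMS row).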

## Structure
The dataset is organized with the following structure, one row per audio file:
$$\mathit{class}, \; \mathit{feature}_1, \; \dots, \; \mathit{feature}_n$$
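
As a small illustration of this layout, the sketch below builds an empty table with the same column scheme and splits it into features and labels. The column names `class` and `f_0` to `f_131` follow the tables shown in the outputs (the columns are numbered from 0 there, while the formula above counts from 1); the two rows and their values are made up.

```python
import numpy as np
import pandas as pd

n_features = 132
columns = ["class"] + [f"f_{i}" for i in range(n_features)]

# Two dummy rows: the label goes in the first column, the features after it
rows = np.zeros((2, n_features + 1))
rows[:, 0] = [3, 9]
df = pd.DataFrame(rows, columns=columns)

# Typical split into feature matrix and label vector
X = df.drop(columns="class")
y = df["class"].astype(int)
print(X.shape, y.tolist())
```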

## Scaling
A standard scaler is fitted on the training set and the fitted scaler is saved to disk.
When the test folds are processed, the same scaler is applied to their data.
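
The idea of fitting on the training data only and reusing the saved statistics on the test data can be sketched as below. This is a minimal NumPy stand-in for a standard scaler (zero mean, unit variance per column), not the implementation used by `Features`; the function names, the dictionary format and the temporary file are assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def fit_scaler(X: np.ndarray) -> dict:
    """Compute per-column mean and standard deviation on the training data."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero on constant columns
    return {"mean": mean, "std": std}


def apply_scaler(X: np.ndarray, scaler: dict) -> np.ndarray:
    """Standardize X with previously fitted statistics."""
    return (X - scaler["mean"]) / scaler["std"]


rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, size=(100, 3))
X_test = rng.normal(5.0, 2.0, size=(20, 3))

scaler = fit_scaler(X_train)

# Save the fitted statistics and load them back, as one would for the test folds
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "scaler.pkl"  # hypothetical file name
    path.write_bytes(pickle.dumps(scaler))
    restored = pickle.loads(path.read_bytes())

X_train_scaled = apply_scaler(X_train, scaler)
X_test_scaled = apply_scaler(X_test, restored)  # test data never influences the fit
```

Because the statistics come from the training set alone, the scaled test columns will generally not have exactly zero mean and unit variance, which is expected.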

## Dask speed up
Dask is used to speed up the computation.
Four workers extract features in parallel, reducing the time for a single fold from about 70 seconds to just under 30.
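
How `Features` wires Dask up is not shown in this notebook, so the following is only one possible pattern: wrapping a per-file extraction function in `dask.delayed` and computing all tasks with four workers. The function `extract_features` and the file paths are hypothetical placeholders, and the threaded scheduler is used here just to keep the example small.

```python
import dask
from dask import delayed


def extract_features(path: str) -> list:
    # Hypothetical placeholder: a real version would load the audio file
    # at `path` and return its 132-value feature vector
    return [float(len(path))] * 132


# Hypothetical file list for one fold
paths = [f"fold1/clip_{i}.wav" for i in range(8)]

# Build one lazy task per file, then run them with four workers
tasks = [delayed(extract_features)(p) for p in paths]
rows = dask.compute(*tasks, scheduler="threads", num_workers=4)

print(len(rows), len(rows[0]))
```

For CPU-bound extraction a process-based scheduler or a distributed cluster may be a better fit than threads; which one applies here is an open assumption.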

## Training dataset
The first step is to build the training dataset from folds 1, 2, 3, 4 and 6. The resulting dataset contains 4499 samples.


In [1]:
import sys
# Add the parent directory to the import path so that the `src` package can be found
sys.path.append("..")
from src.data import Features
import pandas as pd
import numpy as np

## Unscaled training set 
The following cell extracts the unscaled training set.

In [2]:
f = Features(save_path="../data/processed/initial",
             save_name="train_unscaled",
             folds=[1,2,3,4,6])

training_dataframe = f.get_dataframe()
f.save_dataframe(training_dataframe)

In [3]:
training_dataframe

Unnamed: 0,class,f_0,f_1,f_2,f_3,f_4,f_5,f_6,f_7,f_8,...,f_122,f_123,f_124,f_125,f_126,f_127,f_128,f_129,f_130,f_131
0,3.0,-592.183899,-162.788498,-344.231689,-329.880310,0.000000,229.762238,139.087418,140.868393,-148.199005,...,0.220979,0.297263,0.009965,1.0,0.303738,0.336505,0.000042,0.289709,0.015952,0.047359
1,3.0,-453.945038,-161.955887,-364.052826,-346.856628,82.226860,226.205078,131.974197,141.269028,-144.834259,...,0.203320,0.261760,0.008947,1.0,0.290658,0.342790,0.002271,0.289867,0.011760,0.047650
2,3.0,-448.205292,-176.065674,-373.497986,-350.400360,73.434738,221.718552,126.712044,135.250168,-133.286484,...,0.284472,0.369487,0.006988,1.0,0.268131,0.302224,0.002280,0.272627,0.009693,0.042375
3,3.0,-444.168701,-173.939423,-369.273560,-346.233246,70.035751,220.699463,126.784042,132.986328,-134.645599,...,0.247804,0.303319,0.003098,1.0,0.252041,0.303084,0.002420,0.341308,0.012339,0.052789
4,3.0,-713.720947,-98.949768,-242.909470,-255.045990,0.000000,191.614868,110.675262,111.643570,-165.848145,...,0.400219,0.447586,0.000000,1.0,0.505863,0.526202,0.000000,0.128447,0.027009,0.036163
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4494,9.0,-262.955444,-101.359467,-184.823181,-190.716583,134.389908,200.986298,174.159607,173.429230,-67.813835,...,1.000000,0.782015,0.003078,1.0,0.319475,0.338576,0.030073,0.423040,0.296085,0.263723
4495,9.0,-326.015259,-142.138199,-221.734146,-219.478958,121.905540,205.910110,179.783905,176.296677,-89.379761,...,1.000000,0.823710,0.000525,1.0,0.356601,0.377291,0.025899,0.379670,0.165430,0.161492
4496,9.0,-272.998810,-89.848122,-164.126770,-175.900131,158.483871,230.669327,187.851624,189.519730,-71.425461,...,0.292984,0.395287,0.005395,1.0,0.257566,0.344914,0.025470,0.357341,0.167323,0.160282
4497,9.0,-263.229767,-62.980148,-155.296738,-167.399094,145.786682,220.202484,183.713654,185.057480,-68.326485,...,0.170528,0.275586,0.002127,1.0,0.214822,0.298109,0.029145,0.383787,0.194611,0.175971


## Scaling the dataset
A standard scaler is fitted on the training dataset and saved so that the test sets can be scaled with it later.

In [4]:
scaled_df = f.scale_dataframe(training_dataframe, 
                              save_scaler=True)
scaled_df

Unnamed: 0,class,f_0,f_1,f_2,f_3,f_4,f_5,f_6,f_7,f_8,...,f_122,f_123,f_124,f_125,f_126,f_127,f_128,f_129,f_130,f_131
0,3.0,-1.645322,0.279495,-0.437740,-0.358281,-1.854037,0.821304,-0.274404,-0.247118,-2.167997,...,-1.103669,-1.074277,-0.879697,0.237888,-0.773028,-0.891234,-0.855815,1.013690,-0.772954,-0.340650
1,3.0,-0.674059,0.285865,-0.590028,-0.493791,-0.484286,0.748931,-0.404350,-0.239558,-2.082986,...,-1.178026,-1.254609,-0.888964,0.237888,-0.829431,-0.858691,-0.795068,1.014815,-0.843106,-0.335789
2,3.0,-0.633732,0.177919,-0.662596,-0.522078,-0.630747,0.657650,-0.500481,-0.353143,-1.791230,...,-0.836328,-0.707427,-0.906801,0.237888,-0.926564,-1.068733,-0.794834,0.891752,-0.877688,-0.423794
3,3.0,-0.605371,0.194186,-0.630139,-0.488815,-0.687368,0.636916,-0.499165,-0.395865,-1.825568,...,-0.990720,-1.043517,-0.942208,0.237888,-0.995945,-1.064282,-0.791013,1.382038,-0.833415,-0.250052
4,3.0,-2.499238,0.767887,0.340726,0.239070,-1.854037,0.045170,-0.793445,-0.798635,-2.613906,...,-0.348968,-0.310731,-0.970406,0.237888,0.098524,0.090998,-0.856946,-0.137492,-0.587931,-0.527438
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4494,9.0,0.667829,0.749452,0.787007,0.752567,0.384656,0.235838,0.366305,0.367353,-0.137052,...,2.176462,1.387950,-0.942391,0.237888,-0.705172,-0.880511,-0.037587,1.965490,3.914811,3.269116
4495,9.0,0.224772,0.437478,0.503417,0.522977,0.176689,0.336016,0.469051,0.421466,-0.681919,...,2.176462,1.599736,-0.965626,0.237888,-0.545086,-0.680045,-0.151320,1.655889,1.728429,1.563522
4496,9.0,0.597264,0.837518,0.946019,0.870837,0.786017,0.839759,0.616434,0.671005,-0.228300,...,-0.800488,-0.576377,-0.921297,0.237888,-0.972120,-0.847692,-0.162988,1.496491,1.760107,1.543340
4497,9.0,0.665901,1.043069,1.013861,0.938695,0.574505,0.626804,0.540841,0.586795,-0.150004,...,-1.316099,-1.184379,-0.951048,0.237888,-1.156433,-1.090040,-0.062875,1.685274,2.216738,1.805084


In [5]:

f.save_dataframe(scaled_df, save_name="train_scaled")

## Test datasets

After the training set is built, one test set is obtained from each of the remaining folds (5, 7, 8, 9 and 10).
Each test set is saved in both unscaled and scaled form, so that the effect of scaling can be evaluated.

In [6]:
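# Each fold not used for training becomes its own test set
for fold in [5, 7, 8, 9, 10]:
    print(f"Processing fold {fold}")

    f = Features(save_path="../data/processed/initial",
                 save_name=f"test_{fold}_unscaled",
                 folds=[fold])

    # Extract and save the unscaled features for this fold
    df = f.get_dataframe()
    f.save_dataframe(df)

    # Reuse the scaler saved during training, then save the scaled copy
    scaled = f.apply_scaling(df, "../models/scalers/scaler_training.pkl")
    f.save_dataframe(scaled, save_name=f"test_{fold}_scaled")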

Processing fold 5
Processing fold 7
Processing fold 8
Processing fold 9
Processing fold 10
