#### Why do we need to preprocess the audio files?

Preprocessing steps are generally necessary before feature extraction to ensure consistency, remove noise, and improve the quality of the features extracted.

1. **Normalize Audio**
    - **Why?**
      - Normalization scales the amplitudes of the audio signal to a consistent range, which can:
         - Avoid biasing features toward audio files with higher or lower volume.
         - Ensure numerical stability during feature extraction (especially for RMS energy).
    - **How?**
      - Scale each file so that its samples fall in the range [-1, 1] (peak normalization), or to another consistent amplitude level. This removes volume differences between clips.

2. **Resample Audio**
    - **Why?**
      - Resampling to a common sample rate (e.g., 22.05 kHz) ensures:
         - Consistency in time and frequency features across the audio files.
         - Comparability of extracted features, since the sample rate affects the time and frequency resolution.
         - Lower computational cost when the source files use a higher sample rate (e.g., 44.1 kHz), since downsampling leaves fewer samples to process.
    - **How?**
      - If the dataset contains clips recorded at different sample rates, resample all of them to a single target rate.
      - Every audio file in our dataset is already sampled at 22.05 kHz, so no resampling is needed here.

3. **Trim Silence**
    - **Why?**
      - Trimming silence removes irrelevant portions of audio, which can:
         - Prevent silence from skewing features like zero-crossing rate, RMS energy, and MFCCs.
         - Focus the analysis on the actual audio content (e.g., music or speech).
    - **How?**
      - Use a threshold-based method to detect and remove leading and trailing silence.
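
The three steps above can be sketched with NumPy alone. This is a minimal illustration, not the implementation inside `DataPreprocessor` (its source is not shown here). The function names, the 40 dB threshold, and the frame sizes are all assumptions chosen for the example. The resampler uses plain linear interpolation with no anti-aliasing filter, so it is a naive stand-in for a proper band-limited resampler.

```python
import numpy as np


def peak_normalize(signal, peak=1.0):
    """Scale the signal so that its largest absolute sample equals `peak`."""
    signal = np.asarray(signal, dtype=np.float64)
    max_abs = np.max(np.abs(signal)) if signal.size else 0.0
    if max_abs == 0.0:
        return signal  # empty or all-silent clip: nothing to scale
    return signal * (peak / max_abs)


def resample_linear(signal, orig_sr, target_sr):
    """Naive resampling by linear interpolation (assumption: no anti-aliasing)."""
    signal = np.asarray(signal, dtype=np.float64)
    if orig_sr == target_sr:
        return signal
    n_out = int(round(signal.size * target_sr / orig_sr))
    t_in = np.arange(signal.size) / orig_sr   # timestamps of the input samples
    t_out = np.arange(n_out) / target_sr      # timestamps of the output samples
    return np.interp(t_out, t_in, signal)


def trim_silence(signal, top_db=40.0, frame_length=1024, hop_length=256):
    """Remove leading and trailing frames whose RMS is more than `top_db`
    decibels below the loudest frame (threshold values are illustrative)."""
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_length:
        return signal
    n_frames = 1 + (signal.size - frame_length) // hop_length
    rms = np.array([
        np.sqrt(np.mean(signal[i * hop_length:i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])
    loudest = rms.max()
    if loudest == 0.0:
        return signal[:0]  # the whole clip is silent
    # Convert the decibel threshold to a linear amplitude ratio.
    keep = np.nonzero(rms > loudest * 10 ** (-top_db / 20))[0]
    start = keep[0] * hop_length
    end = min(signal.size, keep[-1] * hop_length + frame_length)
    return signal[start:end]
```

Applied in the order listed, a clip would go through `peak_normalize`, then `resample_linear` (a no-op when the rates already match), then `trim_silence`. Interior pauses are kept; only the silent head and tail are removed.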

In [8]:
import sys
import os

repository_root_directory = os.path.dirname(os.getcwd())
rrd = "repository_root_directory:\t"
print(rrd, repository_root_directory)

if repository_root_directory not in sys.path:
    sys.path.append(repository_root_directory)
    print(rrd, "added to path")
else:
    print(rrd, "already in path")

from data_preprocessor import DataPreprocessor
from data_preprocessor_parallel_proc import DataPreprocessorParallelProc
from utils import get_directory_size, compare_directories

repository_root_directory:	 /teamspace/studios/this_studio/csc_461_fp
repository_root_directory:	 already in path


In [9]:
dataset_path = os.path.join(repository_root_directory, "_01_data", "genres")
output_path = os.path.join(repository_root_directory, "_02_data_preprocessed")
sample_rate = 22050                     # default sample rate of the dataset
preprocessor = DataPreprocessor()
preprocessor.process_dataset(dataset_path, output_path, sample_rate)


File already exists: /teamspace/studios/this_studio/csc_461_fp/_02_data_preprocessed/blues/blues.00000.wav
File already exists: /teamspace/studios/this_studio/csc_461_fp/_02_data_preprocessed/blues/blues.00001.wav
File already exists: /teamspace/studios/this_studio/csc_461_fp/_02_data_preprocessed/blues/blues.00002.wav
File already exists: /teamspace/studios/this_studio/csc_461_fp/_02_data_preprocessed/blues/blues.00003.wav
File already exists: /teamspace/studios/this_studio/csc_461_fp/_02_data_preprocessed/blues/blues.00004.wav
File already exists: /teamspace/studios/this_studio/csc_461_fp/_02_data_preprocessed/blues/blues.00005.wav
File already exists: /teamspace/studios/this_studio/csc_461_fp/_02_data_preprocessed/blues/blues.00006.wav
File already exists: /teamspace/studios/this_studio/csc_461_fp/_02_data_preprocessed/blues/blues.00007.wav
File already exists: /teamspace/studios/this_studio/csc_461_fp/_02_data_preprocessed/blues/blues.00008.wav
File already exists: /teamspace/studi

In [10]:
output_path_parallel_proc = os.path.join(repository_root_directory, "_02_data_preprocessed_parallel_proc")
preprocessorParallelProc = DataPreprocessorParallelProc()
preprocessorParallelProc.process_dataset(dataset_path, output_path_parallel_proc, sample_rate)

File already exists: /teamspace/studios/this_studio/csc_461_fp/_02_data_preprocessed_parallel_proc/blues/blues.00000.wav
File already exists: /teamspace/studios/this_studio/csc_461_fp/_02_data_preprocessed_parallel_proc/blues/blues.00002.wav
File already exists: /teamspace/studios/this_studio/csc_461_fp/_02_data_preprocessed_parallel_proc/blues/blues.00001.wav
File already exists: /teamspace/studios/this_studio/csc_461_fp/_02_data_preprocessed_parallel_proc/blues/blues.00003.wav
File already exists: /teamspace/studios/this_studio/csc_461_fp/_02_data_preprocessed_parallel_proc/blues/blues.00007.wav
File already exists: /teamspace/studios/this_studio/csc_461_fp/_02_data_preprocessed_parallel_proc/blues/blues.00006.wav
File already exists: /teamspace/studios/this_studio/csc_461_fp/_02_data_preprocessed_parallel_proc/blues/blues.00004.wav
File already exists: /teamspace/studios/this_studio/csc_461_fp/_02_data_preprocessed_parallel_proc/blues/blues.00005.wav

File already exists: /teamspace/

In [11]:
print("Original Dataset Size:    \t\t", get_directory_size(dataset_path))
print("Preprocessed Dataset Size:\t\t", get_directory_size(output_path))
print(compare_directories(output_path, output_path_parallel_proc))

Original Dataset Size:    		 1324105955
Preprocessed Dataset Size:		 1323967400
True
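
The helpers `get_directory_size` and `compare_directories` come from the project's `utils` module, whose source is not shown here. As a rough sketch of what such checks might look like, the functions below sum file sizes in bytes and compare two directory trees file by file. The names `directory_size_bytes`, `relative_files`, and `directories_match` are hypothetical, and the real helpers may behave differently.

```python
import filecmp
import os


def directory_size_bytes(root):
    """Return the total size in bytes of all files under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def relative_files(root):
    """Return the set of file paths under `root`, relative to `root`."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def directories_match(dir_a, dir_b):
    """Return True if both trees hold the same files with identical contents."""
    files_a = relative_files(dir_a)
    if files_a != relative_files(dir_b):
        return False
    # shallow=False forces a byte-by-byte comparison instead of relying on
    # file metadata alone.
    return all(
        filecmp.cmp(os.path.join(dir_a, rel), os.path.join(dir_b, rel), shallow=False)
        for rel in files_a
    )
```

A check of this kind is one way to confirm that a serial and a parallel preprocessing run wrote the same files.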
