# GTZAN Stems
## Overview
This notebook presents an example usage of the **gtzan-stems** dataset. It also compares the performance of three CNN models:
1. **Vanilla** model trained using extracted audio features from original song files
2. **Enhanced** model trained using extracted audio features from stems that make up a song
3. **Merged** model trained using extracted audio features from original song files _and_ stems 

This notebook also introduces a compact class for extracting audio features from labeled songs or from the multiple stems that make up a song. The class is easy to extend with custom extractions. This notebook only uses **MFCC** vector extraction over multiple segments of an audio file. The CNN models interpret the extracted vectors as images, with each stem representing a channel.
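
As a rough illustration of how a segment turns into a CNN input, the sketch below works out the input shape from the settings used in this notebook (30-second tracks, 10 segments, `hop_length` of 512, 13 MFCCs). The stem count `n_stems = 4` is only an assumed value for illustration; the actual number depends on the dataset.

```python
import math

# Settings taken from this notebook
duration, sample_rate, num_segments = 30, 22050, 10
hop_length, n_mfcc = 512, 13

# Assumption for illustration only: the number of stems per song
n_stems = 4

samples_per_segment = (duration * sample_rate) // num_segments
n_frames = math.ceil(samples_per_segment / hop_length)

# Each segment becomes an "image" of MFCC vectors; stems act as channels
song_input_shape = (n_frames, n_mfcc, 1)
stem_input_shape = (n_frames, n_mfcc, n_stems)

print(samples_per_segment, n_frames)
print(song_input_shape, stem_input_shape)
```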

## What to take from this notebook
It is advised to simply look at the results of the last cell without going into too much detail about how everything works. The main purpose of this notebook is to show how stems can contribute to a model's performance. In the experiments I ran with different seeds and input structures, there was no big difference between training a model on original songs and training it on stems; sometimes one approach beats the other. However, the merged model almost always performs better than the first two models.

More training data gives more stable results: the model trained on stems then tends to score a few percentage points higher in accuracy than the one trained on original songs. Also note that some stems may be empty. For example, if a song has no lyrics, the corresponding audio file is almost silent and may not be useful for the model. However, the model can still learn genre patterns from the other stems.
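
One way to check whether a stem is effectively empty is to look at its root-mean-square (RMS) amplitude. The following is a minimal sketch and not part of the pipeline in this notebook; the helper name `is_near_silent` and the threshold value are hypothetical and would need tuning on real data.

```python
import numpy as np

def is_near_silent(signal, threshold=1e-3):
    """Returns True if the RMS amplitude of the waveform is below the threshold."""
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size == 0:
        return True
    rms = np.sqrt(np.mean(signal ** 2))
    return bool(rms < threshold)

# Toy waveforms: one second of silence and one second of a 440 Hz tone
sr = 22050
t = np.arange(sr) / sr
silence = np.zeros(sr)
tone = 0.5 * np.sin(2 * np.pi * 440 * t)

print(is_near_silent(silence), is_near_silent(tone))
```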

In [None]:
import os          # traversing directories
import json        # saving data in json format
import math        # allows to perform ceiling
import librosa     # extracting audio features
import numpy as np # shaping multidimensional data
from tqdm.notebook import tqdm # progress bars


class AudioFeatureExtractor():
    """Class to preprocess raw audio files.
    
    This class allows to extract custom audio features (e.g., amplitude envelope, bandwidth,
    Mel-spectrograms) from labeled audio files. Each file can further be segmented for data
    augmentation. This class supports reading directories of stems instead of audio files too.
    
    Attributes:
        duration (int): The duration of each audio file to be processed (in seconds)
        sample_rate (int): The sample rate used (`22050` by default)
        num_segments (int): The number of segments to divide each audio file (`5` by default)
        samples_per_label (int): The number of samples to process per label (`-1` by default)
        
        samples_per_track (int): The total number of samples per track (calculated)
        samples_per_segment (int): The total number of samples per segment (calculated)
        
        _DEFAULT_CONSTANTS (dict): The default parameters the extraction functions will use
            All the constants that any extraction may use are kept in this dictionary:
            * `n_mfcc`: The number of Mel-frequency cepstral coefficients to extract at each frame
            * `n_fft`: The frame size, i.e., the length of the FFT window (in samples)
            * `hop_length`: The step size when shifting to a new frame
    """
    
    def __init__(self, duration, sample_rate=22050, num_segments=5, samples_per_label=-1):
        """Initializes the Audio Feature Extractor.
        
        Note:
            Keeping `samples_per_label` initialized to `-1` will result in processing every
            available track in the dataset path.
        """
        # Initialize the passed parameters
        self.duration = duration
        self.sample_rate = sample_rate
        self.num_segments = num_segments
        self.samples_per_label = samples_per_label
        
        # Calculate dependent attributes
        self.samples_per_track = duration * sample_rate
        self.samples_per_segment = self.samples_per_track // num_segments
        
        # Define default constants
        self._DEFAULT_CONSTANTS = {
            "n_mfcc": 13,
            "n_fft": 2048,
            "hop_length": 512,
        }
    
    
    def _validate_requests(self, requests):
        """Validates the parameters for feature extraction request.
        
        This method checks if the requested feature extraction is available, parses the parameters
        (or uses the default ones) that should be applied to each extraction and returns a map with
        keys as feature extraction functions and values as parameters for them.
        
        Args:
            requests (list): The list of strings/tuples corresponding to the desired extraction
        
        Returns:
            valid_requests (dict): The dictionary which maps extraction function to its parameters
        
        Raises:
            AttributeError: If no extraction is specified or if it is not supported
        """
        # Map available extractions to functions 
        mapping = {
            "mfcc": self._extract_mfcc,
        }
        
        # Define requests
        valid_requests = {}
        requests = [(req, {}) if isinstance(req, str) else req for req in requests]
        
        if len(requests) == 0:
            # If no request is specified, raise an error
            raise AttributeError("At least 1 feature extraction request must be specified")
            
        # Loop through every request
        for request_name, params in requests:
            if request_name not in mapping.keys():
                # If request name is not valid, raise an error
                raise AttributeError(f"'{request_name}' is not a valid feature extraction request")
            
            # Define items for `valid_requests`
            func = mapping[request_name]
            args = dict(self._DEFAULT_CONSTANTS)
            args.update(params)
            
            # Add valid request to dictionary
            valid_requests[func] = args
        
        return valid_requests
    
    
    def _extract_mfcc(self, signal, params):
        """Extracts MFCC vectors for the given signal.

        This method uses ``librosa`` library to extract Mel-frequency cepstral coefficients for the
        given waveform. It also ensures that the number of vectors is of the expected length, i.e.,
        it must be equal to `samples_per_segment` divided by the `hop_length`.
        
        Args:
            signal (ndarray): The waveform of the signal to preprocess
            params (dict): Parameters to use for MFCC feature extraction
            
        Returns:
            mfcc (list): extracted MFCC vectors
        """
        # Extract MFCCs using librosa
        mfcc = librosa.feature.mfcc(y=signal,
                                    sr=self.sample_rate,
                                    n_mfcc=params["n_mfcc"],
                                    n_fft=params["n_fft"],
                                    hop_length=params["hop_length"])       
        
        # The expected number of generated mfcc vectors (required to ensure consistency)
        expected_n_vectors = math.ceil(self.samples_per_segment / params["hop_length"])
        
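        if mfcc.shape[1] < expected_n_vectors:
            # Pad by repeating the last column if vectors are missing (may occur occasionally)
            pad_width = expected_n_vectors - mfcc.shape[1]
            mfcc = np.pad(mfcc, ((0, 0), (0, pad_width)), mode="edge")
        
        # Drop any surplus columns so that every segment has the same number of vectors
        mfcc = mfcc[:, :expected_n_vectors]
        
        return mfcc.T.tolist() # more comfortable representation and parsable by JSON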
    
    
    def _segmented_extraction(self, filepath, extractions):
        """Performs feature extraction for every track segment.
        
        Args:
            filepath (str): The path to audio file
            extractions (dict): The dictionary which maps extraction function to its parameters
            
        Returns:
            feature_segments (list): The list of preprocessed track segments with features
        """
        # Load the sound file and initialize segment list
        signal, sr = librosa.load(filepath, sr=self.sample_rate)
        feature_segments = []
        
        # Loop through the specified number of segments
        for segment in range(self.num_segments):
            # Take a segment from signal to preprocess it
            start_sample = self.samples_per_segment * segment
            end_sample = start_sample + self.samples_per_segment
            signal_segment = signal[start_sample:end_sample]
            features_group = []
            
            # Loop through every extraction request
            for func, params in extractions.items():
                # Generate features for the given segment and append to the group
                features_group.append(func(signal_segment, params))
            
            # Append the group of features to the segment
            feature_segments.append([features_group])
            
        return feature_segments
    
    
    def _walk_extract(self, dataset_path, extractions):
        """Traverses directories and performs segmented extraction for each audio file.
        
        This method walks through labeled directories and for each audio file (or a folder of stems)
        it performs segmented feature extraction. The number of segments and the extractions to be
        performed are known in advance.
        
        Note:
            The file structure is important. For example, if "path/to/labels" is passed as an argument
            for `dataset_path`, then `labels` directory should contain labeled subdirectories for each
            song (or a folder of stems).
        
        Args:
            dataset_path (str): The path to a dataset directory
            extractions (dict): The dictionary which maps extraction function to its parameters
            
        Returns:
            data (dict): The dictionary containing labels and inputs. The entries are as follows:
                * `semantic_labels`: The name for each label
                * `targets`: Numeric representation of semantic labels
                * `features`: The list of extracted features of the shape (L, N, M, S, F, *F.shape)
                    * `L`: The number of labels
                    * `N`: The number of tracks
                    * `M`: The number of segments
                    * `S`: The number of stems
                    * `F`: The number of requested feature extractions
                    * `F.shape`: The shape of the feature tensor (may not exist if it's a single value)
        """
        # Define the data dictionary containing labels and inputs
        data = {"semantic_labels": [], "targets": [], "features": []}
        
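        # Loop through every labeled directory (sorted by name so that targets are assigned
        # consistently across datasets, since `os.scandir` yields entries in arbitrary order)
        for i, f in enumerate(sorted(os.scandir(dataset_path), key=lambda entry: entry.name)):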
            print(f"Processing target {i} ({f.name})")
            
            # Save semantic and target labels
            data["semantic_labels"].append(f.name)
            data["targets"].append(i)
            features = []
            
            # Sort the file paths in case the songs are sequence-dependent data per label
            samples = [sample for sample in os.scandir(f.path)]
            samples.sort(key=lambda x: x.name)
            
            # Loop through the generated paths (up to `samples_per_label`; negative means all)
            limit = None if self.samples_per_label < 0 else self.samples_per_label
            for sample in tqdm(samples[:limit]):
                if sample.is_file():
                    # If it is a file, generate segmented features for a single stem
                    feature_segments = self._segmented_extraction(sample.path, extractions)
                
                if sample.is_dir():
                    # If it is a folder with stems, generate segmented features for each stem
                    feature_segments = [self._segmented_extraction(os.path.join(sample.path, stem),
                                        extractions) for stem in sorted(os.listdir(sample.path))]
                    feature_segments = np.concatenate(feature_segments, axis=1).tolist()
                
                # Append segment-based features
                features.append(feature_segments)
            
            # Append song-based features
            data["features"].append(features)
        
        return data
    
    
    def extract_features(self, dataset_path, *args):
        """Extracts requested features from a dataset.
        
        This method validates the requests for feature extractions and calls ``self._walk_extract()``
        method to traverse through directories in `dataset_path` and perform segmented extractions.
        
        Args:
            dataset_path (str): The path to the root directory of the labeled dataset files
            *args: The requested feature extractions.
                Each argument is a tuple taking 2 values - the name of the feature extraction and the
                parameters to use for that extraction (e.g., "hop_length", "n_fft") or 1 value - just
                the name. Extraction parameters that are not provided are set to default values.
        """
        extractions = self._validate_requests(args)
        return self._walk_extract(dataset_path, extractions)
    
    
    def restructure(self, inputs, targets, meanings, label_level=2, feature_level=None, squeeze=True, tolist=False):
        """Restructures the axes of the data.

        Consider the feature leveling:
            - Level 0: genres
            - Level 1: tracks
            - Level 2: segments
            - Level 3: stems
            - Level 4: features
            
        This method performs 2 transformations on the inputs and labels based on the feature leveling:
        
            1. **Labeling** - by default, only genres are labeled; this transformation raises
               the labeling level, meaning that labels will be present for all the dimensions
               up to that level. The feature dimensions belonging to that level will in turn be
               concatenated and the number of samples will increase. For instance, if _label_ level is
               `1`, every track will have a label and there will be as many samples as there are tracks.
               
            2. **Refeaturing** - by default, features are positioned at the last level; this transformation
               moves the feature axis to a lower level (not lower than _label_ level), meaning
               that a label at a certain level will correspond to all the features at that level (unless
               _label_ level is set to `4`). For instance, if _feature_ level is `3` > _label_ level, then
               each segment will have multiple sets of stems, each represented by a certain feature.
        
        Note:
            * If `feature_level` is `None`, it is automatically set to `label_level`, meaning that for every
              feature, the same set of labels will apply.
            * The lowest _feature_ level is actually the _label_ level. E.g., if we have labels for segments,
              the lowest level for features is F extractions each covering a set of segments.
            * If there is only `1` feature and `squeeze` is set to true, the feature dimension is removed so
              that the samples do not contain a redundant dimension of length one.
        
        Args:
            inputs (ndarray): The extracted features for each track
            targets (ndarray): Numeric representation of semantic labels
            meanings (ndarray): The name for each label
            label_level (int): The level at which each input dimension should have a label (`2` by default)
            feature_level (int): The level at which an input dimension is split to features (`None` by default)
            squeeze (bool): Whether to squeeze the feature dimension if it is of length one (`True` by default)
            tolist (bool): Whether to return the outputs as python list objects (`False` by default)
            
        Returns:
            tuple: A tuple of 3 entries as described in ``self._walk_extract``:
                * `inputs`: The restructured extracted features for each track
                * `targets`: The restructured numeric representation of semantic labels
                * `meanings`: The restructured list of names for each label
        """
        if feature_level is None:
            # Default feature level
            feature_level = label_level
        
        # Feature level cannot be lower than label level
        assert feature_level >= label_level 
        
        # Apply labeling transformation
        targets = np.resize(targets, inputs.shape[label_level::-1]).T.reshape(-1)
        meanings = np.resize(meanings, inputs.shape[label_level::-1]).T.reshape(-1)
        inputs = inputs.reshape(-1, *inputs.shape[label_level+1:])
        
        # Apply "refeaturing" transformation
        inputs = np.moveaxis(inputs, 4 - label_level, feature_level - label_level)
        
        if inputs.shape[feature_level - label_level] == 1 and squeeze:
            # If the feature dimension is unnecessary, remove it
            inputs = inputs.squeeze(axis=feature_level-label_level)
        
        if tolist:
            # Convert to list object
            inputs = inputs.tolist()
            targets = targets.tolist()
            meanings = meanings.tolist()
        
        return inputs, targets, meanings
        
        
    def save_data(self, data, json_path):
        """Saves the data as a JSON file.
        
        Note:
            The `data` must be JSON-serializable, i.e., built from plain python objects such as lists.
            It should contain 3 entries as described in ``self._walk_extract``, of any shape.
        
        Args:
            data (dict): The dictionary containg labels and inputs
            json_path (str): The path to the JSON file. Must end with a file name, e.g., "data.json"
        """
        # Create output dir if it doesn't exist (skip when the path has no directory part)
        output_dir = os.path.dirname(json_path)
        if output_dir:
            os.makedirs(output_dir, exist_ok=True)
        
        # Save the data to a json file
        with open(json_path, 'w') as fp:
            json.dump(data, fp, indent=4)
            
        print("Data successfully saved as a JSON file.")
            
    
    def load_data(self, json_path):
        """Loads the data from a JSON file.
        
        Note:
            The JSON file must contain 3 entries described in ``self._walk_extract``, of the exact shape.
        
        Returns:
            tuple: A tuple of 3 entries as described in ``self._walk_extract``:
                * `inputs`: The extracted features for each track
                * `targets`: Numeric representation of semantic labels
                * `meanings`: The name for each label
        """
        # Load the file into memory
        with open(json_path, 'r') as fp:
            data = json.load(fp)
        
        # Each entry should be a numpy array
        inputs = np.array(data["features"])
        targets = np.array(data["targets"])
        meanings = np.array(data["semantic_labels"])
        
        return inputs, targets, meanings

    
    def extract_save(self, dataset_path, json_path, *args):
        """Generic method to extract and save the features.
        
        Args:
            dataset_path (str): The path to the root directory of the labeled dataset files
            json_path (str): The path to the JSON file. Must end with a file name, e.g., "data.json"
            *args: The requested feature extractions.
                Each argument is a tuple taking 2 values - the name of the feature extraction and the
                parameters to use for that extraction (e.g., "hop_length", "n_fft") or 1 value - just
                the name. Extraction parameters that are not provided are set to default values.
        """
        extracted_features = self.extract_features(dataset_path, *args)
        self.save_data(extracted_features, json_path)

        
    def load_restructure(self, json_path, label_level=2, feature_level=None, squeeze=True, tolist=False):
        """Generic method to load and restructure the features.
            
        Args:
            json_path (str): The path to the JSON file. Must end with a file name, e.g., "data.json"
            label_level (int): The level at which each input dimension should have a label (`2` by default)
            feature_level (int): The level at which an input dimension is split to features (`None` by default)
            squeeze (bool): Whether to squeeze the feature dimension if it is of length one (`True` by default)
            tolist (bool): Whether to return the outputs as python list objects (`False` by default)
                
        Returns:
            tuple: A tuple of 3 restructured entries, in the same order as in ``self.load_data``
        """
        extracted_features = self.load_data(json_path)
        return self.restructure(*extracted_features, label_level, feature_level, squeeze, tolist)
            

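To make the labeling transformation in `restructure` more concrete, here is a small sketch on toy data. It repeats the same `np.resize` and `np.moveaxis` steps on a random array; the toy sizes (2 labels, 3 tracks, 4 segments, 1 stem, 1 feature of shape 5 x 6) are made up for illustration.

```python
import numpy as np

# Toy data of shape (L, N, M, S, F, *F.shape)
inputs = np.random.rand(2, 3, 4, 1, 1, 5, 6)
targets = np.array([0, 1])
label_level = 2    # one label per segment
feature_level = 2  # the default: same as the label level

# Labeling: repeat each target for every track and segment it covers
targets = np.resize(targets, inputs.shape[label_level::-1]).T.reshape(-1)
inputs = inputs.reshape(-1, *inputs.shape[label_level + 1:])

# Refeaturing: move the feature axis, then drop it since there is only 1 feature
inputs = np.moveaxis(inputs, 4 - label_level, feature_level - label_level)
inputs = inputs.squeeze(axis=feature_level - label_level)

print(targets.shape, inputs.shape)
```

Each of the 24 segments ends up with its own label, and the remaining input axes are the stems followed by the feature shape.
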
In [None]:
# File path constants
SONGS_DATA_PATH = "../input/gtzan-stems/Data/genres_original"
STEMS_DATA_PATH = "../input/gtzan-stems/Data/genres_stems"
SONGS_JSON_PATH = "data_songs.json"
STEMS_JSON_PATH = "data_stems.json"

# Our extractor object
extractor = AudioFeatureExtractor(30, num_segments=10, samples_per_label=20)

# Feature extraction for songs ~40 sec
print("Processing original songs\n")
extractor.extract_save(SONGS_DATA_PATH, SONGS_JSON_PATH, "mfcc")

# Feature extraction for stems ~20 min
print("\nProcessing song stems\n")
extractor.extract_save(STEMS_DATA_PATH, STEMS_JSON_PATH, "mfcc")

# Targets and labels will be the same for each segment
inputs_songs, targets, meanings = extractor.load_restructure(SONGS_JSON_PATH)
inputs_stems, _, _ = extractor.load_restructure(STEMS_JSON_PATH)

# Confirm the shapes are what we expect them to be
print("\nSong features shape:", inputs_songs.shape)
print("Stem features shape:", inputs_stems.shape)

In [None]:
from sklearn.model_selection import train_test_split
import tensorflow.keras as keras
import tensorflow as tf
import random

SEED = 54321 # global seed

def reset_random_seeds():
    """Resets the random seeds for reproducible results."""
    os.environ['PYTHONHASHSEED'] = str(SEED)
    tf.random.set_seed(SEED)
    np.random.seed(SEED)
    random.seed(SEED)

    
def split_data(X, y, test_size=.2, validation_size=.2):
    """Splits the data into train, validation and test sets.
    
    Args:
        X (ndarray): The input data
        y (ndarray): The target labels
        test_size (float): The proportion of the test set
        validation_size (float): The proportion of the remaining (non-test) data used for validation
    
    Returns:
        tuple: A tuple of train, validation and test inputs and labels
    """
    # Channel should be moved to the last dimension
    X = np.moveaxis(X, 1, 3)
    
    # Perform the splits
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=SEED)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=validation_size, random_state=SEED)
    
    return X_train, X_val, X_test, y_train, y_val, y_test


def build_model(input_shape):
    """Creates a CNN model.
    
    Args:
        input_shape (tuple): The shape of the input dimensions of the form (H, W, C)
    
    Returns:
        tuple: the created model and an optimizer (Adam) for the model
    """
    # Create a model
    model = keras.Sequential()
    
    # Add first conv layer
    model.add(keras.layers.Conv2D(32, 3, activation="relu", input_shape=input_shape))
    model.add(keras.layers.MaxPool2D(3, 2, "same"))
    model.add(keras.layers.BatchNormalization())
    
    # Add second conv layer
    model.add(keras.layers.Conv2D(32, 3, activation="relu"))
    model.add(keras.layers.MaxPool2D(3, 2, "same"))
    model.add(keras.layers.BatchNormalization())
    
    # Add third conv layer
    model.add(keras.layers.Conv2D(32, 2, activation="relu"))
    model.add(keras.layers.MaxPool2D(2, 2, "same"))
    model.add(keras.layers.BatchNormalization())
    
    # Flatten the output and add a dense layer
    model.add(keras.layers.Flatten())
    model.add(keras.layers.Dense(64, activation="relu"))
    model.add(keras.layers.Dropout(.3))
    
    # Output layer
    model.add(keras.layers.Dense(10, activation="softmax"))
    
    # Create an optimizer for the model
    optimizer = keras.optimizers.Adam(learning_rate=2e-3)
    
    return model, optimizer

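Note that `split_data` applies `validation_size` to what is left after the test split, so the default arguments give roughly a 64/16/20 split. The sketch below checks this arithmetic and shows the channels-last conversion on a dummy array; the array sizes are arbitrary.

```python
import numpy as np

test_size, validation_size = 0.2, 0.2

# The validation split is taken from the data remaining after the test split
test_frac = test_size
val_frac = (1 - test_size) * validation_size
train_frac = 1 - test_frac - val_frac
print(round(train_frac, 2), round(val_frac, 2), round(test_frac, 2))

# Dummy batch of shape (samples, channels, frames, coefficients)
X = np.zeros((8, 4, 130, 13))

# Move the channel axis to the end, as Keras Conv2D expects by default
X = np.moveaxis(X, 1, 3)
print(X.shape)
```
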
In [None]:
# Ensure deterministic results
reset_random_seeds()

# Create a merged representation of inputs
inputs_merge = np.concatenate((inputs_songs, inputs_stems), axis=1)

# Generate train, validation and test data
X_train_songs, X_val_songs, X_test_songs, y_train_songs, y_val_songs, y_test_songs = split_data(inputs_songs, targets)
X_train_stems, X_val_stems, X_test_stems, y_train_stems, y_val_stems, y_test_stems = split_data(inputs_stems, targets)
X_train_merge, X_val_merge, X_test_merge, y_train_merge, y_val_merge, y_test_merge = split_data(inputs_merge, targets)

# Build the models
model_songs, opt_songs = build_model(X_train_songs.shape[1:])
model_stems, opt_stems = build_model(X_train_stems.shape[1:])
model_merge, opt_merge = build_model(X_train_merge.shape[1:])

# Compile the models
model_songs.compile(optimizer=opt_songs, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model_stems.compile(optimizer=opt_stems, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model_merge.compile(optimizer=opt_merge, loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Train the models
print("Training the model using original tracks as inputs...")
model_songs.fit(X_train_songs, y_train_songs, validation_data=(X_val_songs, y_val_songs), batch_size=32, epochs=20)
print("\nTraining the model using stems as inputs...")
model_stems.fit(X_train_stems, y_train_stems, validation_data=(X_val_stems, y_val_stems), batch_size=32, epochs=20)
print("\nTraining the model using merged inputs...")
model_merge.fit(X_train_merge, y_train_merge, validation_data=(X_val_merge, y_val_merge), batch_size=32, epochs=20)

# Get the accuracies
print("\nThe final accuracies for original, stems, merged respectively:")
test_error_songs, test_accuracy_songs = model_songs.evaluate(X_test_songs, y_test_songs)
test_error_stems, test_accuracy_stems = model_stems.evaluate(X_test_stems, y_test_stems)
test_error_merge, test_accuracy_merge = model_merge.evaluate(X_test_merge, y_test_merge)