# Introduction

This Jupyter notebook prepares data for a binary classification task: deciding whether a song is R&B or not. The workflow loads the genre labels, extracts audio features, and merges the two into one table. Below is an overview of the key steps:

1. **Data Loading**:
    - The dataset is loaded from a `.tsv` file located at `../../data/acousticbrainz-mediaeval_labels...`.
    - The dataset contains metadata about songs, including genres and identifiers.

2. **Label Creation**:
    - A new column, `is_rnb`, is created to label songs as R&B (1) or not (0) based on genre information.

3. **Feature Extraction**:
    - Features are extracted from JSON files located in the folder `../../data/acousticbrainz-mediaeval-train`.
    - These features include timbre, tonal, rhythm, and spectral properties of the songs.

4. **Data Merging**:
    - The extracted features are merged with the labeled dataset using the `recordingmbid` column as the key.

5. **Output**:
    - The final merged dataset is saved as a CSV file (`tmp/rnb_features_labeled.csv`) for further analysis or modeling.

This notebook provides a structured approach to preprocess raw data into a format suitable for machine learning, ensuring that the features and labels are aligned and ready for training.

In [2]:
import pandas as pd
import numpy as np
import os
import json

### Reading Data from File

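The labels are stored in tab-separated (`.tsv`) files at `../../data/acousticbrainz-mediaeval_labels...`. The cell below reads two parts (the suffixes `a` and `b` are appended to `datapath`) and concatenates them into a single pandas DataFrame. The data contains metadata about songs, including genres and identifiers. `datapath` is relative to the notebook's location, so adjust it if the data lives elsewhere.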

> takes about 30s


In [3]:
datapath = '../data/acousticbrainz-mediaeval_labels_part_a'

data_a = pd.read_csv(datapath + 'a', delimiter='\t')
data_b = pd.read_csv(datapath + 'b', delimiter='\t')

data = pd.concat([data_a, data_b], ignore_index=True)

# Uncomment below if you want to load a different data file
# datapath = '../../data/[filename]'
# data = pd.read_csv(datapath, delimiter='\t')

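# Label a song as R&B (1) if any of its genre columns matches the pattern, else 0
rnb_pattern = r"R&B|rnb|r&b_soul|r'n'b"
genre_columns = data.filter(like='genre')
data['is_rnb'] = (
    genre_columns
    .apply(lambda col: col.astype(str).str.contains(rnb_pattern, case=False, na=False))
    .any(axis=1)
    .astype(int)
)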

data.keys()


Index(['recordingmbid', 'releasegroupmbid', 'genre1', 'genre2', 'genre3',
       'genre4', 'genre5', 'genre6', 'genre7', 'genre8', 'genre9', 'genre10',
       'genre11', 'genre12', 'genre13', 'genre14', 'genre15', 'genre16',
       'genre17', 'genre18', 'genre19', 'genre20', 'genre21', 'genre22',
       'genre23', 'genre24', 'genre25', 'genre26', 'genre27', 'genre28',
       'genre29', 'genre30', 'is_rnb'],
      dtype='object')
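
The labeling rule used in the cell above can be checked on a small frame before it is run on the full table. The sketch below is illustrative only: the IDs and genre strings are made up, and only the regular expression and the `filter`/`apply`/`any` chain come from the notebook.

```python
import pandas as pd

# Toy stand-in for the label table; IDs and genre strings are invented for illustration.
toy = pd.DataFrame({
    "recordingmbid": ["id-1", "id-2", "id-3", "id-4"],
    "genre1": ["rock", "R&B", "jazz", "pop"],
    "genre2": [None, "soul", "r'n'b", "electronic"],
})

# Same pattern as the notebook; matching is case-insensitive.
rnb_pattern = r"R&B|rnb|r&b_soul|r'n'b"

# Test every genre column, then flag a row if any column matched.
matches = toy.filter(like="genre").apply(
    lambda col: col.astype(str).str.contains(rnb_pattern, case=False, na=False)
)
toy["is_rnb"] = matches.any(axis=1).astype(int)

print(toy[["recordingmbid", "is_rnb"]])
```

Because matching is case-insensitive, the `r&b_soul` alternative is already covered by `R&B`; it is kept here only to mirror the original pattern.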

### Dropping Unnecessary Columns

Only the song identifier (`recordingmbid`) and the label (`is_rnb`) are needed downstream. The remaining columns (`releasegroupmbid` and `genre1` through `genre30`) are left out by selecting just those two columns into a new DataFrame, `data_labeled`. The value counts printed below show how the two classes are distributed.


In [4]:
# Only the song identifier and the R&B label are needed
data_labeled = data[['recordingmbid', 'is_rnb']]
print(data_labeled['is_rnb'].value_counts())

is_rnb
0    896896
1      8048
Name: count, dtype: int64
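
The counts above imply a strongly imbalanced label. A short calculation, using only the two numbers printed by the cell, makes the proportions explicit:

```python
# Label counts copied from the value_counts() output above.
counts = {0: 896896, 1: 8048}

total = sum(counts.values())
positive_rate = counts[1] / total         # share of songs labeled R&B
negatives_per_positive = counts[0] / counts[1]
majority_baseline = counts[0] / total     # accuracy of always predicting 0

print(f"total songs:            {total}")
print(f"R&B share:              {positive_rate:.2%}")
print(f"non-R&B per R&B song:   {negatives_per_positive:.1f}")
print(f"always-0 accuracy:      {majority_baseline:.2%}")
```

With under 1% positives, a model that always predicts 0 already scores about 99% accuracy, so plain accuracy is a weak yardstick for this label.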


### Helper Functions Description

The following helper functions are used to process and extract features from JSON files containing song data:

1. **`extract_features_from_json(data)`**:
    - **Purpose**: Extracts various audio features from a JSON object.
    - **Details**:
        - Extracts timbre features such as MFCC and GFCC means.
        - Includes tonal features like chords change rate, key scale, pitch salience, and dissonance.
        - Captures rhythm features like BPM and onset rate.
        - Extracts spectral features such as centroid, complexity, rolloff, flux, zero-crossing rate, and spectral contrast coefficients.
        - Includes dynamics, rhythm extensions, energy band shapes, harmonic structure, and tonal energy balance.
    - **Error Handling**: Prints a message and returns `None` if a key is missing in the JSON data, so `process_dataset` skips that file.

2. **`build_feature_labels(data_sample)`**:
    - **Purpose**: Generates a list of feature labels based on the structure of the JSON data.
    - **Details**:
        - Creates labels for MFCC and GFCC coefficients.
        - Adds labels for other extracted features such as chords change rate, key scale, pitch salience, spectral features, and dynamics.
        - Ensures the labels align with the extracted features for consistency in the DataFrame.

3. **`process_dataset(root_folder)`**:
    - **Purpose**: Processes all JSON files in a given folder to extract features and compile them into a DataFrame.
    - **Details**:
        - Iterates through all JSON files in the specified folder.
        - Extracts features using `extract_features_from_json`.
        - Initializes feature labels using `build_feature_labels` for the first valid JSON file.
        - Compiles all extracted features into a pandas DataFrame with appropriate labels.
        - Adds a `recordingmbid` column to associate features with song identifiers.
    - **Error Handling**: Prints a message if a file fails to process due to an exception.
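
The helpers described above keep values and labels in two parallel lists, which must be built in the same order. As a point of comparison, here is a minimal sketch of a different design that returns a `{label: value}` dict so the two cannot drift apart. It is not the notebook's implementation: `extract_named_features` is a hypothetical name, it covers only a handful of the features, and the sample values are made up.

```python
def extract_named_features(record):
    """Return a {label: value} dict for a small subset of the features."""
    low, tonal, rhythm = record["lowlevel"], record["tonal"], record["rhythm"]
    features = {}

    # Vector-valued feature: one label per coefficient.
    for i, value in enumerate(low["mfcc"]["mean"]):
        features[f"mfcc_{i}"] = value

    # Scalar features, using the same keys the notebook reads.
    features["chords_changes_rate"] = tonal["chords_changes_rate"]
    features["key_scale"] = 1 if tonal.get("key_scale", "major") == "minor" else 0
    features["bpm"] = rhythm["bpm"]
    return features


# Made-up record with the same nesting as the JSON files.
sample = {
    "lowlevel": {"mfcc": {"mean": [-650.0, 120.5, 3.2]}},
    "tonal": {"chords_changes_rate": 0.05, "key_scale": "minor"},
    "rhythm": {"bpm": 96.0},
}

named = extract_named_features(sample)
print(named)
```

A list of such dicts can be passed straight to `pd.DataFrame`, which takes the column names from the keys.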



In [5]:
# Helper functions to pull features from the JSON files

def extract_features_from_json(data):
    try:
        features = []

        # Timbre
        features += data['lowlevel']['mfcc']['mean']
        features += data['lowlevel']['gfcc']['mean']
        features.append(data['lowlevel']['hfc']['mean'])

        # Tonal - Harmony
        features.append(data['tonal']['chords_changes_rate'])        

        # Tonal - Scale (major=0, minor=1)
        scale = data['tonal'].get('key_scale', 'major')
        features.append(1 if scale == 'minor' else 0)

        # Tonal - Pitch salience & dissonance
        features.append(data['lowlevel']['pitch_salience']['mean'])
        features.append(data['lowlevel']['dissonance']['mean'])

        # Rhythm
        features.append(data['rhythm']['bpm'])
        features.append(data['rhythm']['onset_rate'])

        # Spectrum
        features.append(data['lowlevel']['spectral_centroid']['mean'])
        features.append(data['lowlevel']['spectral_complexity']['mean'])
        features.append(data['lowlevel']['spectral_rolloff']['mean'])
        features.append(data['lowlevel']['spectral_flux']['mean'])
        features.append(data['lowlevel']['zerocrossingrate']['mean'])

        # Spectral contrast coefficients (vector-valued, one entry per band)
        features += data['lowlevel']['spectral_contrast_coeffs']['mean']

        # Dynamics
        features.append(data['lowlevel']['average_loudness'])
        features.append(data['lowlevel']['dynamic_complexity'])

        # Rhythm extension
        features.append(data['rhythm']['beats_loudness']['mean'])

        # Energy band shape
        features.append(data['lowlevel']['spectral_energyband_low']['mean'])
        features.append(data['lowlevel']['spectral_energyband_high']['mean'])

        # Harmonic structure
        features.append(data['tonal']['hpcp_entropy']['mean'])
        features.append(data['tonal']['key_strength'])

        # Tonal energy balance
        features.append(data['lowlevel']['spectral_entropy']['mean'])
        features.append(data['lowlevel']['spectral_strongpeak']['mean'])

        return features
    except KeyError as e:
        print(f"Missing key: {e}")
        return None
    
def build_feature_labels(data_sample):
    labels = []

    labels += [f"mfcc_{i}" for i in range(len(data_sample['lowlevel']['mfcc']['mean']))]
    labels += [f"gfcc_{i}" for i in range(len(data_sample['lowlevel']['gfcc']['mean']))]
    labels += ["hfc"]
    labels += ["chords_changes_rate"]
    
    labels += ["key_scale"]
    labels += ["pitch_salience"]
    labels += ["dissonance"]
    labels += ["bpm", "onset_rate"]
    labels += ["spectral_centroid", "spectral_complexity", "spectral_rolloff", "spectral_flux", "zerocrossingrate"]
    labels += [f"spectral_contrast_{i}" for i in range(len(data_sample['lowlevel']['spectral_contrast_coeffs']['mean']))]
    labels += ["average_loudness", "dynamic_complexity"]

    labels += ["beats_loudness"]
    labels += ["spectral_energyband_low", "spectral_energyband_high"]
    labels += ["hpcp_entropy", "key_strength"]
    labels += ["spectral_entropy", "spectral_strongpeak"]

    return labels

def process_dataset(root_folder):
    all_features = []
    file_ids = []
    labels_initialized = False
    feature_labels = []

    for subdir, _, files in os.walk(root_folder):
        for file in files:
            if file.endswith(".json"):
                file_path = os.path.join(subdir, file)
                try:
                    with open(file_path, "r") as f:
                        data = json.load(f)

                    features = extract_features_from_json(data)
                    if features is None:
                        continue

                    if not labels_initialized:
                        feature_labels = build_feature_labels(data)
                        labels_initialized = True

                    all_features.append(features)
                    # Use the file name without its extension as the recording ID
                    file_ids.append(os.path.splitext(file)[0])

                except Exception as e:
                    print(f"Failed on {file}: {e}")

    df = pd.DataFrame(all_features, columns=feature_labels)
    df['recordingmbid'] = file_ids
    return df

### Feature Extraction Using Helper Functions

Feature extraction applies the helper functions defined above to every JSON file under the training folder. `process_dataset` walks the folder, extracts the timbre, tonal, rhythm, and spectral features from each file, and returns them as a DataFrame (`data_features`) with one row per song.
The resulting DataFrame is then merged with the labeled dataset (`data_labeled`) to create the final dataset (`merged_df`) for further analysis or modeling.
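
Since the extraction takes a few minutes, it can help to confirm first that the folder resolves and contains JSON files. `os.walk` yields nothing for a missing directory, so a wrong path produces an empty `data_features` without any error, and the merge that follows then has no rows. The sketch below is a hypothetical pre-flight check, not part of the notebook; `count_json_files` is an illustrative name.

```python
import os


def count_json_files(root_folder):
    """Return the number of .json files under root_folder, or None if it is not a directory."""
    if not os.path.isdir(root_folder):
        return None
    return sum(
        1
        for _, _, files in os.walk(root_folder)
        for name in files
        if name.endswith(".json")
    )


# Same relative path as the extraction cell; adjust to your layout.
folder_path = "../data/acousticbrainz-mediaeval-train"
n_files = count_json_files(folder_path)

if n_files is None:
    print(f"Folder not found: {os.path.abspath(folder_path)}")
elif n_files == 0:
    print(f"No JSON files under: {os.path.abspath(folder_path)}")
else:
    print(f"Found {n_files} JSON files")
```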

> takes about 4 mins at most


In [None]:
# Extract JSON features 
folder_path = '../data/acousticbrainz-mediaeval-train'
data_features = process_dataset(folder_path)
data_features.head()

### Creating a Comprehensive DataFrame with Features and Labels

To combine the extracted features with their labels, we merge `data_features` with `data_labeled` on the `recordingmbid` column. The merge is an inner join, so only songs that have both features and a label are kept, and each row pairs a song's features with its `is_rnb` label.

The resulting DataFrame, `merged_df`, contains both the features and the labels, making it ready for further analysis or machine learning tasks.
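
An inner join silently drops songs that appear in only one of the two tables. One way to see how many rows are affected is an outer merge with pandas' `indicator=True` option, sketched here on made-up IDs and values; the frames are stand-ins for `data_features` and `data_labeled`.

```python
import pandas as pd

# Toy stand-ins; the IDs and numbers are invented for illustration.
features = pd.DataFrame({"recordingmbid": ["a", "b", "c"], "bpm": [120.0, 90.0, 100.0]})
labels = pd.DataFrame({"recordingmbid": ["b", "c", "d"], "is_rnb": [0, 1, 0]})

# The outer merge keeps every ID; the _merge column records where each row came from.
check = pd.merge(features, labels, on="recordingmbid", how="outer", indicator=True)
print(check["_merge"].value_counts())

# The inner merge used in the notebook keeps only IDs present in both tables.
merged = pd.merge(features, labels, on="recordingmbid", how="inner")
print(merged)
```

Rows marked `left_only` have features but no label, and rows marked `right_only` have a label but no features.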

> Finally, the merged DataFrame is saved to a CSV file (`tmp/rnb_features_labeled.csv`) for later use in modeling.

In [None]:
merged_df = pd.merge(data_features, data_labeled, on="recordingmbid", how="inner")

# Check results
print(merged_df.head())
print("Shape:", merged_df.shape)
print("Label value counts:\n", merged_df['is_rnb'].value_counts())

# Make sure the output folder exists before writing
os.makedirs("tmp", exist_ok=True)
merged_df.to_csv("tmp/rnb_features_labeled.csv", index=False)

Empty DataFrame
Columns: [recordingmbid, is_rnb]
Index: []
Shape: (0, 2)
Label value counts:
 Series([], Name: count, dtype: int64)


### Using the Processed Dataset for Modeling

To use the processed dataset for your own machine learning model, follow these steps:

1. **Read the CSV File**:
    - The processed dataset has been saved as `tmp/rnb_features_labeled.csv`.
    - Use `pandas` to read the CSV file into a DataFrame.

2. **Separate Features and Labels**:
    - The dataset contains both features and labels (`is_rnb`).
    - Separate the features (`X`) and labels (`y`) for training your model.

3. **Use the DataFrame for Modeling**:
    - The `X` DataFrame contains the features, and `y` contains the labels.
    - Use these to train your machine learning model.

    Example:
    - Split the data into training and testing sets.
    - Train a model.
    - Make predictions and evaluate the model's accuracy.

By following these steps, you can easily load the processed dataset and use it to train and evaluate your own machine learning models.


In [None]:
# Load the processed dataset
df = pd.read_csv("tmp/rnb_features_labeled.csv")

# Separate features (X) and labels (y) for the dataset
X = df.drop(columns=['recordingmbid', 'is_rnb'])
y = df['is_rnb']
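
The split, train, and evaluate steps listed above could look like the sketch below. Several things in it are assumptions rather than the notebook's method: it uses scikit-learn, it picks logistic regression purely as an example, and it runs on synthetic data so the block works without the CSV. The two column names are borrowed from the feature list; the values are random.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (replace with the frames loaded from the CSV).
rng = np.random.default_rng(0)
n = 1000
y = pd.Series((rng.random(n) < 0.1).astype(int), name="is_rnb")
X = pd.DataFrame({
    "bpm": rng.normal(120, 20, n) - 10 * y.to_numpy(),
    "average_loudness": rng.normal(0.8, 0.1, n),
})

# Stratify so the rare positive class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" reweights the classes to offset the imbalance.
model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
bal_acc = balanced_accuracy_score(y_test, y_pred)
print(f"accuracy: {acc:.3f}")
print(f"balanced accuracy: {bal_acc:.3f}")
```

Given the label counts shown earlier in the notebook (8048 R&B songs against 896896 others), plain accuracy can look high even for a model that never predicts R&B, so balanced accuracy is reported alongside it here.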