# A look into the BiMMuDa Dataset: Importance and Analysis

## First we will acquire and read in the datasets

- Import pandas and load the two CSV files with pd.read_csv.
- Both files are read from the local directory; the origins of the data are covered further below.

In [14]:
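import pandas as pd

# One row per song: chart metadata, key, tempo, and lyric statistics
df_songs = pd.read_csv('bimmuda_per_song_full.csv')
# One row per transcribed melody section, with computed melodic features
df_melody = pd.read_csv('bimmuda_per_melody_full.csv')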

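The Unnamed: 0 column that appears in the output further below is what pandas produces when the first header field of a CSV is empty, which usually means a row index was saved along with the data. The following is a minimal sketch of how index_col=0 avoids that extra column. It uses a small inline sample instead of the real files, and it assumes the BiMMuDa CSVs begin with such an empty header field; sample_csv and df_sample are illustrative names.

```python
import io

import pandas as pd

# Two rows shaped like bimmuda_per_melody_full.csv, trimmed to a few columns.
# Assumption: the real file starts with an empty header field (a saved index),
# which is what pandas otherwise labels "Unnamed: 0".
sample_csv = io.StringIO(
    ",ID,Year,Position,Label,BPM\n"
    "0,1950_01_1,1950,1,chorus,94\n"
    "1,1950_01_2,1950,1,verse,94\n"
)

# index_col=0 tells pandas to use the first column as the row index
df_sample = pd.read_csv(sample_csv, index_col=0)

print(df_sample.columns.tolist())
print(df_sample.shape)
```

The rest of this notebook keeps the default read, so its outputs still show the Unnamed: 0 column.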

## Now let's look at the data using the display function

This gives us a first look at the raw data analyzed in this project.

For now, we display only the first five rows of each dataset, which is enough to see the structure without printing the full tables.

In [15]:

print("\nSong Metadata (First 5 Rows) ")
display(df_songs.head())

print("\nMelody Analysis (First 5 Rows) ")
display(df_melody.head())



Song Metadata (First 5 Rows) 


| Unnamed: 0 | Title | Artist | Year | Position | Link to Audio | Tonic 1 | Tonic 2 | Tonic 3 | Mode 1 | Mode 2 | Mode 3 | BPM 1 | BPM 2 | BPM 3 | Number of Parts | Number of Words | Number of Unique Words | Unique Word Ratio | Number of Syllables |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | Goodnight Irene | Gordon Jenkins & The Weavers | 1950 | 1 | https://open.spotify.com/track/3GtfeLBXe15nyEM... | F |  |  | Major |  |  | 94 | 141.0 |  | 2.0 | 134.0 | 58.0 | 0.43 | 207.0 |
| 1 | Mona Lisa | Nat King Cole | 1950 | 2 | https://open.spotify.com/track/5dae01pKNjRQtgO... | Db |  |  | Major |  |  | 65 |  |  | 2.0 | 145.0 | 55.0 | 0.38 | 192.0 |
| 2 | Third Man Theme | Anton Karas | 1950 | 3 | https://open.spotify.com/track/7rRGujA12UJcRUz... | G |  |  | Major |  |  | 120 | 152.0 |  | 7.0 |  |  |  |  |
| 3 | Sam's Song | Gary & Bing Crosby | 1950 | 4 | https://open.spotify.com/track/1Wnlagmoyo7M7In... | F |  |  | Minor |  |  | 118 |  |  | 2.0 | 317.0 | 134.0 | 0.42 | 381.0 |
| 4 | Simple Melody | Gary & Bing Crosby | 1950 | 5 | https://open.spotify.com/track/75lpxrV9sLZIRvz... | Bb |  |  | Major |  |  | 151 |  |  | 2.0 | 199.0 | 55.0 | 0.28 | 265.0 |



Melody Analysis (First 5 Rows) 


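| Unnamed: 0 | ID | Year | Position | Label | BPM | Mode | Tonic | Length | Number of Note Events | Tonality | MIC | Pitch STD | MIS | Onset Density | nPVI | RIC |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 1950_01_1 | 1950 | 1 | chorus | 94 | Major | F | 19.49 | 25 | 0.77 | 4.2806 | 2.67 | 2.17 | 1.28 | 71.54 | 2.937204 |
| 1 | 1950_01_2 | 1950 | 1 | verse | 94 | Major | F | 19.49 | 35 | 0.82 | 4.863496 | 3.55 | 2.06 | 1.8 | 43.33 | 2.316326 |
| 2 | 1950_02_1 | 1950 | 2 |  | 65 | Major | Db | 30.9 | 47 | 0.79 | 5.462931 | 4.1 | 2.5 | 1.52 | 14.2 | 1.105103 |
| 3 | 1950_02_2 | 1950 | 2 |  | 65 | Major | Db | 46.9 | 63 | 0.68 | 4.455394 | 4.13 | 2.5 | 1.34 | 30.54 | 1.283377 |
| 4 | 1950_03_1 | 1950 | 3 |  | 120 | Major | G | 17.9 | 36 | 0.58 | 2.801082 | 1.24 | 0.97 | 2.01 | 82.79 | 2.066878 |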

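In the five melody rows shown, the ID column looks like year, chart position, and a melody number within the song, joined by underscores (1950_01_2 has Year 1950 and Position 1). The following is a minimal sketch of splitting an ID into those parts, assuming this pattern holds for the whole file, which has not been checked here. MelodyId and parse_melody_id are hypothetical names.

```python
from typing import NamedTuple


class MelodyId(NamedTuple):
    year: int
    position: str  # kept as text, since Position is not always a plain number
    melody: int


def parse_melody_id(melody_id: str) -> MelodyId:
    """Split an ID such as '1950_01_2' into its three assumed parts."""
    year, position, melody = melody_id.split("_")
    # Drop zero padding so '01' compares equal to a Position of '1'
    return MelodyId(int(year), position.lstrip("0") or "0", int(melody))


parsed = parse_melody_id("1950_01_2")
print(parsed)
```

Parsed this way, the year and position could serve as keys for matching each melody back to its row in the song table.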

# Data Acquisition & Description

## 1. Data Source
**Dataset Name:** The Billboard Melodic Music Dataset (BiMMuDa)

**Authors:** Madeline Hamilton, Marcus Pearce (Queen Mary University of London)

**Publication:** Transactions of the International Society for Music Information Retrieval (2024)

**Links:** [Official Paper](https://transactions.ismir.net/articles/168/files/66b30cacae2f2.pdf)

**Dataset:** [BiMMuDa Dataset Download Link](https://doi.org/10.5334/tismir.168.s1)

## 2. How the Data Was Produced
This dataset was constructed using a multi-step process involving both archival research and manual musical analysis.

#### **1. The Song Metadata (bimmuda_per_song_full.csv)**
This file establishes the "corpus" of the project.
* **Selection Criteria:** The authors selected the **Top 5 Singles** from the Billboard Year-End charts for every year from 1950 to 2022.
* **Audio Sourcing:** The corresponding audio for each track was identified on Spotify to serve as the reference recording.
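* **Lyrical Analysis:** Lyrics were retrieved (likely via Genius or Musixmatch) and processed to count the Number of Words and Number of Unique Words. Their quotient, the Unique Word Ratio, measures lyrical repetition: lower values mean more repetition.
* **Global Musical Features:** Tonic, Mode (Major/Minor), and BPM (tempo) were determined at the song level to provide a high-level summary. Each has up to three columns (e.g. BPM 1 to BPM 3), presumably so that songs that change key or tempo can record more than one value.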

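The first rows of the song table are consistent with Unique Word Ratio being Number of Unique Words divided by Number of Words, rounded to two decimals (58 / 134 ≈ 0.43 for the first song). The following is a minimal sketch of that calculation. How the authors split lyrics into words is not stated here, so the tokenizing regex is an assumption, and lyric_stats is a hypothetical helper.

```python
import re


def lyric_stats(lyrics: str) -> dict:
    """Count total and unique words in a lyric string (illustrative only)."""
    # Assumption: a word is a run of letters or apostrophes, ignoring case
    words = re.findall(r"[a-z']+", lyrics.lower())
    total = len(words)
    unique = len(set(words))
    return {
        "Number of Words": total,
        "Number of Unique Words": unique,
        "Unique Word Ratio": round(unique / total, 2) if total else None,
    }


# Made-up lyric line, not taken from the dataset
stats = lyric_stats("La la la, hey hey now")
print(stats)

# Consistency check against the first row of the song table
first_row_ratio = round(58 / 134, 2)
print(first_row_ratio)
```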
#### **2. The Melody Analysis (bimmuda_per_melody_full.csv)**
This file represents the "deep dive" analysis.
* **Manual Transcription:** Musicians listened to the reference recordings and manually transcribed the lead vocal melodies into MIDI format. This avoids the errors common in automated AI transcription.
* **Segmentation:** The transcribed melodies were manually sliced into structural sections (e.g., Verse, Chorus, Bridge) to allow for section-specific comparisons.
* **Feature Extraction:** Algorithms were run on these symbolic MIDI files to mathematically calculate features like:
    * Pitch STD (Standard Deviation): A measure of how widely the melody's pitches spread, used here as a proxy for melodic range and complexity.
    * Onset Density: A measure of rhythmic speed (notes per second).

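Both features are simple to compute once a melody is in symbolic form. In the rows shown earlier, Onset Density matches Number of Note Events divided by Length (for example 25 / 19.49 ≈ 1.28). The sketch below reproduces that, and computes a pitch standard deviation under two assumptions that are not confirmed here: pitches are MIDI note numbers, and the population (not sample) standard deviation is used. The function names are illustrative.

```python
import statistics


def onset_density(num_note_events: int, length_seconds: float) -> float:
    """Notes per second over the section."""
    return num_note_events / length_seconds


def pitch_std(midi_pitches: list) -> float:
    """Population standard deviation of MIDI pitch numbers (assumed definition)."""
    return statistics.pstdev(midi_pitches)


# First melody row shown earlier: 25 note events over a Length of 19.49
density = round(onset_density(25, 19.49), 2)
print(density)

# Hypothetical pitches, not taken from the dataset
print(pitch_std([60, 60, 64, 64]))  # alternates between two pitches
print(pitch_std([60, 60, 60]))      # a single repeated pitch
```

If the published values were computed with the sample standard deviation instead, statistics.stdev would be the matching call.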
## 3. COLS Table (Selected features for analysis)

#### Table A: Song Metadata (bimmuda_per_song_full.csv)
| Column Name | Data Type | Level of Measurement | Description |
| :--- | :--- | :--- | :--- |
| Title | String | **Nominal** | The name of the song (identifier). |
| Artist | String | **Nominal** | The recording artist(s) (categorical). |
| Year | Integer | **Interval** | The calendar year the song charted (no "true zero" in calendar years). |
| Position | String | **Ordinal** | The rank (1-5) on the Year-End chart, which implies an order. Stored as a string because values such as "2a" appear further down in the dataset. |
| Mode 1 | String | **Nominal** | The musical scale (Major or Minor). Binary category. |
| BPM 1 | Integer | **Ratio** | Beats per minute (tempo). Has a true zero and ratios are meaningful (120 BPM is twice as fast as 60 BPM). |
| Unique Word Ratio | Float | **Ratio** | Ratio of unique words to total words (0.0 to 1.0). |
| Number of Words | Float | **Ratio** | Total count of words. Has a true zero. Read as a float because the column contains missing values. |

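Because Position is stored as a string, sorting or comparing ranks needs a numeric version first. The following is a minimal sketch, assuming the rank is the leading run of digits and that a suffix such as "a" can be dropped for ordering purposes. What the suffix means is not covered here, and position_rank is a hypothetical helper.

```python
import re


def position_rank(position) -> int:
    """Return the leading integer of a Position value such as '2a'."""
    match = re.match(r"\s*(\d+)", str(position))
    if match is None:
        raise ValueError(f"no numeric rank in Position value: {position!r}")
    return int(match.group(1))


ranks = [position_rank(p) for p in ["1", "2a", "5"]]
print(ranks)
```

Raising on unparseable values, instead of returning a default, keeps unexpected Position formats from slipping into an analysis unnoticed.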
#### Table B: Melody Analysis (bimmuda_per_melody_full.csv)
| Column Name | Data Type | Level of Measurement | Description |
| :--- | :--- | :--- | :--- |
| Label | String | **Nominal** | The structural section (e.g., 'verse', 'chorus'). Categorical. |
| Pitch STD | Float | **Ratio** | Standard deviation of pitch. 0 means no variation (every note on the same pitch). |
| Onset Density | Float | **Ratio** | Notes per second. 0 means no notes played. |
| Tonality | Float | **Ratio** | A score (0-1) indicating strength of key fit. |
| Length | Float | **Ratio** | Duration of the section in seconds. Has a true zero. |