# **Dataset Exploration, Cleaning, and Subset Construction**
This notebook performs the initial end-to-end preparation of the Spotify multimodal dataset used throughout the Spectral Neighbor Plus recommender prototype. It focuses on inspecting the raw data, identifying usable fields, and producing a clean, well-structured subset suitable for downstream lyric–audio similarity modeling.

This block creates the authentication file required for accessing Kaggle datasets programmatically. It builds a kaggle.json file containing your API token, places it in the default Kaggle configuration directory (/root/.kaggle), and sets secure file permissions so the Kaggle client can read it safely. After writing the credentials, the script confirms successful setup, enabling authenticated dataset downloads in later steps.

In [None]:
import json, os

# Replace YOUR_TOKEN_HERE with your actual KGAT token (keep real tokens out of shared notebooks)
token = "YOUR_TOKEN_HERE"

kaggle_data = {
    "username": "",
    "key": token
}

os.makedirs("/root/.kaggle", exist_ok=True)

with open("/root/.kaggle/kaggle.json", "w") as f:
    json.dump(kaggle_data, f)

os.chmod("/root/.kaggle/kaggle.json", 0o600)  # octal mode: owner read/write only

print("kaggle.json created successfully!")

kaggle.json created successfully!
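
A note on the permission bits: `os.chmod` takes the mode as an integer, and POSIX modes are conventionally written in octal, so owner-only read/write is `0o600`. Decimal `600` equals octal `0o1130`, which sets the sticky bit and leaves the owner without read access. The sketch below writes a credentials file into a temporary directory (a stand-in for `/root/.kaggle`, so nothing real is overwritten) and reads the mode back. The helper name `write_credentials` and the placeholder values are illustrative and not part of the notebook.

```python
import json
import os
import stat
import tempfile


def write_credentials(config_dir, username, key):
    """Write a kaggle.json-style file readable and writable by its owner only."""
    os.makedirs(config_dir, exist_ok=True)
    path = os.path.join(config_dir, "kaggle.json")
    with open(path, "w") as f:
        json.dump({"username": username, "key": key}, f)
    os.chmod(path, 0o600)  # octal: rw for owner, nothing for group/other
    return path


# Temporary directory used as a stand-in for /root/.kaggle
demo_dir = tempfile.mkdtemp()
cred_path = write_credentials(demo_dir, "YOUR_USERNAME", "YOUR_TOKEN_HERE")

# Read the permission bits back from the file system
mode = stat.S_IMODE(os.stat(cred_path).st_mode)
print(oct(mode))
```

On a POSIX system the mode should read back as `0o600`; on other platforms the permission bits may be reported differently.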


The cell installs the Kaggle client and configures the environment variables needed for authenticated downloads. Once the credentials are set, it fetches the 900k Spotify dataset directly into a raw data directory and unzips the archive for inspection. After extraction, the script scans the folder for any CSV, JSON, or Parquet files present in the dataset, returning a list so you can quickly identify which files are available for loading and exploration.

In [None]:
!pip install kaggle

import os, zipfile, pandas as pd

os.environ["KAGGLE_USERNAME"] = "YOUR_USERNAME"
os.environ["KAGGLE_KEY"] = "YOUR_TOKEN_HERE"  # placeholder; supply your own KGAT token

!kaggle datasets download -d devdope/900k-spotify -p data/raw/spotify_900k

zip_path = "/content/data/raw/spotify_900k/900k-spotify.zip"
with zipfile.ZipFile(zip_path, "r") as z:
    z.extractall("data/raw/spotify_900k")

# List the CSV/JSON/Parquet files that were extracted
files = [f for f in os.listdir("data/raw/spotify_900k") if f.endswith((".csv", ".json", ".parquet"))]
files


Dataset URL: https://www.kaggle.com/datasets/devdope/900k-spotify
License(s): Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
900k-spotify.zip: Skipping, found more recently modified local copy (use --force to force download)


['900k Definitive Spotify Dataset.json',
 'spotify_dataset.csv',
 'final_milliondataset_BERT_500K_revised.json']
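
To decide which of the listed files to load first, it can help to rank them by size. The following is a minimal sketch using only the standard library; `list_data_files` is a hypothetical helper, and the demo runs on a throwaway folder with made-up files in place of `data/raw/spotify_900k`.

```python
import tempfile
from pathlib import Path

DATA_SUFFIXES = {".csv", ".json", ".parquet"}


def list_data_files(folder):
    """Return (name, size_in_bytes) pairs for data files, largest first."""
    entries = [
        (p.name, p.stat().st_size)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in DATA_SUFFIXES
    ]
    return sorted(entries, key=lambda item: item[1], reverse=True)


# Demo on a temporary folder with made-up files
demo = Path(tempfile.mkdtemp())
(demo / "small.csv").write_text("a,b\n1,2\n")
(demo / "big.json").write_text("[" + ",".join(["0"] * 100) + "]")
(demo / "notes.txt").write_text("ignored")  # wrong suffix, so it is skipped

ranked = list_data_files(demo)
print(ranked)
```

Pointing the helper at the raw data directory would return the three files above in size order, which makes it easy to pick the one to open first.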

The dataset is loaded into a pandas DataFrame. Because a notebook cell displays only its last expression, the `df.head()` preview is not rendered here; the output shows the column list, which gives a compact overview of all available features.

In [15]:
import pandas as pd

df = pd.read_csv("/content/data/raw/spotify_900k/spotify_dataset.csv")
df.head()
df.columns

Index(['Artist(s)', 'song', 'text', 'Length', 'emotion', 'Genre', 'Album',
       'Release Date', 'Key', 'Tempo', 'Loudness (db)', 'Time signature',
       'Explicit', 'Popularity', 'Energy', 'Danceability', 'Positiveness',
       'Speechiness', 'Liveness', 'Acousticness', 'Instrumentalness',
       'Good for Party', 'Good for Work/Study',
       'Good for Relaxation/Meditation', 'Good for Exercise',
       'Good for Running', 'Good for Yoga/Stretching', 'Good for Driving',
       'Good for Social Gatherings', 'Good for Morning Routine',
       'Similar Artist 1', 'Similar Song 1', 'Similarity Score 1',
       'Similar Artist 2', 'Similar Song 2', 'Similarity Score 2',
       'Similar Artist 3', 'Similar Song 3', 'Similarity Score 3'],
      dtype='object')
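
Before choosing which fields to keep, a per-column count of missing values and a look at the dtypes is a quick way to see which columns are usable. The sketch below assumes pandas is available; `missing_report` is a hypothetical helper, and the toy frame stands in for the real `df`.

```python
import pandas as pd


def missing_report(frame, columns):
    """Count missing values and report the dtype for each listed column."""
    # Ignore requested columns that the frame does not contain
    present = [c for c in columns if c in frame.columns]
    return pd.DataFrame({
        "dtype": frame[present].dtypes.astype(str),
        "n_missing": frame[present].isna().sum(),
    })


# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "text": ["la la la", None, "hey"],
    "Energy": [0.8, 0.5, None],
    "Tempo": [120, 95, 140],
})
report = missing_report(toy, ["text", "Energy", "Tempo", "NotAColumn"])
print(report)
```

Calling `missing_report(df, ["text", "Energy", "Danceability", "Positiveness"])` on the loaded dataset would show how many rows a later `dropna` on those fields would remove.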

A filtered subset of the dataset is created by keeping only rows with non-missing lyrics and three core audio features (Energy, Danceability, Positiveness), and removing tracks whose lyrics are 50 characters or shorter. From the cleaned data, a 20k-track random sample (fixed seed 42) containing only the relevant columns is drawn for more manageable downstream processing. The resulting curated sample is then saved to the processed data directory.

In [26]:
import os

cols_needed = [
    "Artist(s)", "song", "text",
    "Genre", "Album", "emotion",
    "Energy", "Danceability", "Positiveness",
    "Acousticness", "Instrumentalness", "Speechiness",
    "Liveness", "Tempo", "Loudness (db)",
    "Popularity"
]

df_small = df.dropna(subset=["text", "Energy", "Danceability", "Positiveness"]).copy()
df_small = df_small[df_small["text"].str.len() > 50]  # filter out super-short lyrics

df_sample = df_small[cols_needed].sample(n=20000, random_state=42)

os.makedirs("data/processed", exist_ok=True)
df_sample.to_csv("data/processed/spotify_900k_sample.csv", index=False)
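
The filtering and sampling steps can also be wrapped in a function, which makes the thresholds explicit and guards against asking for more rows than survive the filter. This is a sketch under assumptions, not the notebook's own method: `build_subset` and its parameter names are hypothetical, and the toy frame replaces the real data.

```python
import pandas as pd


def build_subset(frame, required, text_col="text", min_chars=50, n=20000, seed=42):
    """Drop rows missing required fields or with short text, then sample up to n rows."""
    kept = frame.dropna(subset=required)
    kept = kept[kept[text_col].str.len() > min_chars]
    n = min(n, len(kept))  # DataFrame.sample raises if n exceeds the row count
    return kept.sample(n=n, random_state=seed)


# Toy data: only the first row has long enough text and no missing values
toy = pd.DataFrame({
    "text": ["x" * 60, "short", None, "y" * 80],
    "Energy": [0.7, 0.2, 0.9, None],
})
subset = build_subset(toy, required=["text", "Energy"], n=3)
print(len(subset))
```

Capping `n` at the number of remaining rows keeps the cell from failing when the same code is run on a smaller dataset.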

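This cell defines the column groups used throughout the project: audio features, lyrics, and contextual metadata. It also provides a small helper function that loads the processed Spotify sample and returns both the DataFrame and the predefined column lists. This keeps later notebooks cleaner by centralizing column references in one place. Note that the lists use lowercase names (for example `valence` and `loudness`) rather than the headers written to the sample CSV (`Positiveness`, `Loudness (db)`), so the sample's columns must be renamed to match before the lists can be used for indexing.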

In [23]:
import pandas as pd
from pathlib import Path

AUDIO_FEATURE_COLS = [
    "energy", "danceability", "valence",
    "acousticness", "instrumentalness",
    "speechiness", "liveness", "tempo", "loudness", "popularity"
]

LYRICS_COL = "text"
CONTEXT_COLS = ["artist", "title", "genre", "album", "emotion"]

def load_spotify_sample(path: str = "data/processed/spotify_900k_sample.csv"):
    path = Path(path)
    df = pd.read_csv(path)
    return df, AUDIO_FEATURE_COLS, LYRICS_COL, CONTEXT_COLS
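
The column lists above use lowercase names, while the sample CSV keeps the raw headers, so indexing with the lists needs a renaming step. The sketch below shows one way to bridge the two. `RENAME_MAP` and `normalize_columns` are hypothetical, and the pairings of `Positiveness` with `valence` and `song` with `title` are assumptions inferred from the order of the lists, not confirmed by the notebook.

```python
import pandas as pd

AUDIO_FEATURE_COLS = [
    "energy", "danceability", "valence",
    "acousticness", "instrumentalness",
    "speechiness", "liveness", "tempo", "loudness", "popularity",
]
LYRICS_COL = "text"
CONTEXT_COLS = ["artist", "title", "genre", "album", "emotion"]

# Hypothetical mapping from raw CSV headers to the lowercase names above.
# "Positiveness" -> "valence" and "song" -> "title" are assumptions.
RENAME_MAP = {
    "Artist(s)": "artist", "song": "title", "Genre": "genre", "Album": "album",
    "Energy": "energy", "Danceability": "danceability", "Positiveness": "valence",
    "Acousticness": "acousticness", "Instrumentalness": "instrumentalness",
    "Speechiness": "speechiness", "Liveness": "liveness", "Tempo": "tempo",
    "Loudness (db)": "loudness", "Popularity": "popularity",
}


def normalize_columns(frame):
    """Return a copy of the frame with raw headers renamed to the lowercase scheme."""
    return frame.rename(columns=RENAME_MAP)


# One-row toy frame carrying the 16 headers written to the sample CSV
raw_cols = list(RENAME_MAP) + ["text", "emotion"]
toy = pd.DataFrame([[0] * len(raw_cols)], columns=raw_cols)

normalized = normalize_columns(toy)
expected = AUDIO_FEATURE_COLS + [LYRICS_COL] + CONTEXT_COLS
missing = [c for c in expected if c not in normalized.columns]
print(missing)
```

An empty `missing` list indicates that every name in the three column groups resolves after renaming. The same call could be applied to the DataFrame returned by `load_spotify_sample` before selecting columns.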