### Embedding Generation Pipeline

To support similarity search over the Spotify dataset, we construct a compact 14-dimensional embedding that incorporates acoustic features, lightweight metadata, and temporal information. This embedding is derived from the cleaned and normalized dataset produced during preprocessing.

We begin by selecting ten core features (nine audio descriptors such as danceability, energy, valence, tempo, and acousticness, plus track popularity) and standardize them to put them on a consistent numerical scale. To better reflect musical intuition, we then apply domain-inspired feature weights that emphasize danceability and energy while down-weighting features we consider less influential for similarity (e.g., liveness, speechiness, loudness).

Next, we integrate a small set of categorical attributes (key, mode, explicit flag), providing additional structure related to harmony and lyrical content without significantly expanding dimensionality. We also include a standardized release-year feature to capture temporal trends in music production.


In [1]:

import pandas as pd
import numpy as np
import json
from sklearn.preprocessing import StandardScaler

# Load cleaned dataset
spotify = pd.read_parquet("data/spotify_clean.parquet")
print("Loaded dataset:", spotify.shape)



Loaded dataset: (169776, 19)


In [2]:
# Selecting core features for embedding construction
# These 10 numeric features form the foundation of the similarity embedding.
# Nine are acoustic descriptors (danceability, energy, valence, tempo, ...)
# that capture the rhythm, mood, and overall sound of each track; popularity
# is a non-acoustic signal included alongside them.

audio_features = [
    'danceability', 'energy', 'valence', 'tempo', 'acousticness',
    'instrumentalness', 'liveness', 'speechiness', 'loudness', 'popularity'
]


In [3]:
# Standardize all selected features with `StandardScaler` so they share a
# comparable numerical scale. This prevents high-variance features from
# dominating the embedding and improves stability of similarity computations.

scaler = StandardScaler()
Z = scaler.fit_transform(spotify[audio_features]).astype("float32")
print("Numeric matrix:", Z.shape)


Numeric matrix: (169776, 10)
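
To make the effect of standardization concrete, here is a minimal sketch on made-up values (not rows from the Spotify dataset). It computes the column-wise z-score directly in NumPy using the population standard deviation, which is the convention `StandardScaler` follows; the names `X` and `Z_toy` are illustrative only.

```python
import numpy as np

# Toy feature matrix (made-up values): a 0-1 feature next to tempo in BPM,
# so the two columns live on very different raw scales.
X = np.array([
    [0.2,  90.0],
    [0.5, 120.0],
    [0.8, 150.0],
], dtype="float64")

# Column-wise z-score with the population standard deviation (ddof=0).
mean = X.mean(axis=0)
std = X.std(axis=0)
Z_toy = (X - mean) / std

print(Z_toy.mean(axis=0))  # approximately 0 for each column
print(Z_toy.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns contribute on the same footing even though tempo started out hundreds of times larger than the 0-1 feature.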


### Applying Domain-Inspired Feature Weights
We introduce manual feature weights to emphasize certain musical characteristics (e.g., danceability and energy) while down-weighting others (e.g., liveness, loudness). This produces a weighted audio vector that better aligns with intuitive music similarity.


In [4]:
# Manually chosen feature weights for music similarity
# (one weight per column, in the same order as `audio_features`)
weights = np.array([
    1.2,  # danceability
    1.2,  # energy
    1.0,  # valence
    1.0,  # tempo
    0.8,  # acousticness
    0.8,  # instrumentalness
    0.7,  # liveness
    0.6,  # speechiness
    0.5,  # loudness
    1.0   # popularity
], dtype="float32")

Z_weighted = Z * weights
print("Weighted audio embeddings:", Z_weighted.shape)


Weighted audio embeddings: (169776, 10)
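
As a rough illustration of what the weights do, the sketch below assumes similarity is measured with Euclidean distance (the notebook does not state the metric, so this is an assumption). Under that assumption, a one-standard-deviation difference in a feature contributes a distance equal to its weight. The two-column vectors are hypothetical; only the weights 1.2 and 0.5 come from the cell above.

```python
import numpy as np

# Hypothetical standardized tracks: one differs from the base by one unit in
# danceability (column 0), the other by one unit in loudness (column 1).
base = np.zeros(2, dtype="float32")
diff_dance = np.array([1.0, 0.0], dtype="float32")
diff_loud = np.array([0.0, 1.0], dtype="float32")

# Weights for danceability and loudness, taken from the weights array above.
w = np.array([1.2, 0.5], dtype="float32")

d_dance = np.linalg.norm((diff_dance - base) * w)
d_loud = np.linalg.norm((diff_loud - base) * w)
print(d_dance, d_loud)  # about 1.2 and 0.5
```

In squared-distance terms the gap is wider: the danceability difference counts 1.44 versus 0.25 for loudness, a ratio of about 5.8.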


In [5]:
# Lightweight categorical metadata included in embedding
cat_cols = ["key", "mode", "explicit"]
cat_emb = spotify[cat_cols].astype("float32").values

print("Categorical matrix:", cat_emb.shape)


Categorical matrix: (169776, 3)
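
One thing worth checking here is scale. The cell above casts the categorical columns to floats without rescaling, so `key` (an integer pitch class, assumed here to range from 0 to 11) can have a much larger spread than the standardized audio features. It is also treated as a magnitude, although pitch classes are circular (B and C are a semitone apart but 11 units apart numerically). The sketch below uses a hypothetical key column with one track per pitch class; the z-scored variant is one possible adjustment, not part of the pipeline above.

```python
import numpy as np

# Hypothetical key column: one track per pitch class, 0 (C) through 11 (B).
key = np.arange(12, dtype="float32")

# Spread of the raw column; standardized features have a std of 1.
print(key.std())  # about 3.45

# One possible variant (an assumption, not the notebook's method): z-score
# the key column so its spread matches the audio features.
key_scaled = (key - key.mean()) / key.std()
print(key_scaled.std())  # about 1.0
```

Whether rescaling is desirable depends on how much influence key should have on similarity; leaving it raw gives it roughly three times the spread of an unweighted audio feature under these toy assumptions.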


In [6]:
# Standardize the year column
year_scaled = StandardScaler().fit_transform(spotify[['year']]).astype("float32")
print("Year matrix:", year_scaled.shape)


Year matrix: (169776, 1)


In [7]:
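# Concatenate the blocks column-wise:
# 10 weighted audio features + 3 categorical columns + 1 year column = 14 dimensions
embeddings = np.concatenate([Z_weighted, cat_emb, year_scaled], axis=1).astype("float32")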
print("Final embedding matrix shape:", embeddings.shape)


Final embedding matrix shape: (169776, 14)
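
Because the final matrix is a plain NumPy array, the column names are lost after concatenation. A small lookup table, sketched below, can make later inspection easier. The feature lists are copied from the cells above; `columns` and `col_index` are illustrative names, not part of the original notebook.

```python
# Feature lists as defined in the earlier cells.
audio_features = [
    'danceability', 'energy', 'valence', 'tempo', 'acousticness',
    'instrumentalness', 'liveness', 'speechiness', 'loudness', 'popularity'
]
cat_cols = ["key", "mode", "explicit"]

# Column order follows the concatenation order: audio, categorical, year.
columns = audio_features + cat_cols + ["year"]
col_index = {name: i for i, name in enumerate(columns)}

print(len(columns))       # 14
print(col_index["key"])   # 10
print(col_index["year"])  # 13
```

With this mapping, `embeddings[:, col_index["energy"]]` would select the weighted energy column.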


In [8]:
np.save("data/spotify_vectors_14d.npy", embeddings)
print("Saved embeddings to data/spotify_vectors_14d.npy")

# Track ID → row index mapping (row order matches the embedding matrix)
id_to_index = {tid: i for i, tid in enumerate(spotify["id"])}
with open("data/id_to_index.json", "w") as f:
    json.dump(id_to_index, f)

print("Saved ID-to-index mapping.")


Saved embeddings to data/spotify_vectors_14d.npy
Saved ID-to-index mapping.
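
To show how the two saved artifacts might be used together, here is a minimal brute-force lookup sketch. The function `nearest_neighbors` is a hypothetical helper, Euclidean distance is an assumed metric, and the toy IDs and vectors stand in for the real files (which could be loaded with `np.load` and `json.load`).

```python
import numpy as np

def nearest_neighbors(embeddings, id_to_index, track_id, k=5):
    """Return the IDs of the k rows closest to track_id (Euclidean, brute force)."""
    index_to_id = {i: tid for tid, i in id_to_index.items()}
    q = id_to_index[track_id]
    dists = np.linalg.norm(embeddings - embeddings[q], axis=1)
    dists[q] = np.inf  # exclude the query track itself
    order = np.argsort(dists)[:k]
    return [index_to_id[int(i)] for i in order]

# Toy stand-ins for the saved embedding matrix and ID mapping (made-up data).
toy_embeddings = np.array(
    [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.3]], dtype="float32"
)
toy_ids = {"a": 0, "b": 1, "c": 2, "d": 3}

print(nearest_neighbors(toy_embeddings, toy_ids, "a", k=2))  # ['b', 'd']
```

A full scan like this is linear in the number of tracks per query; for the roughly 170k rows here that may be acceptable, though an approximate nearest-neighbor index is a common alternative for larger collections.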
