## **Demonstration of Lyrics, Audio, and Hybrid Nearest Neighbors**
This notebook provides an interactive walkthrough of how the system retrieves similar tracks using three different similarity spaces: lyrical semantics, audio features, and their hybrid combination. It showcases how shifting the α parameter changes recommendation behavior, allowing comparisons between narrative-driven matches, sonically aligned tracks, and balanced multimodal neighbors. The goal is to illustrate how the hybrid model blends meaning and sound to produce more contextually aware music recommendations.

Loads the same filtered dataset used during embedding generation, then loads the corresponding lyric embeddings from disk. If the row counts differ, the DataFrame is truncated to the number of embedding rows, which assumes the rows were embedded in file order. The printed shapes confirm that the metadata and the embedding matrix are aligned before any similarity search runs.

In [36]:
import pandas as pd
import numpy as np

df_subset = pd.read_csv("/content/spotify_900k_sample.csv")

X_lyrics = np.load("/content/lyrics_embeddings_mpnet.npy")

if df_subset.shape[0] != X_lyrics.shape[0]:
    print(f"Warning: df_subset has {df_subset.shape[0]} rows, but X_lyrics has {X_lyrics.shape[0]} rows. Truncating df_subset to match X_lyrics.")
    df_subset = df_subset.head(X_lyrics.shape[0])

print("df_subset shape:", df_subset.shape)
print("X_lyrics shape:", X_lyrics.shape)

df_subset shape: (1000, 16)
X_lyrics shape: (1000, 768)


Installs FAISS and prepares the lyric embeddings for similarity search by converting them to float32 and applying L2 normalization, enabling cosine similarity through inner-product indexing. A flat FAISS index is then created and populated with all embedding vectors, and the finished index is saved to disk for fast reuse in later queries.

In [37]:
!pip install faiss-cpu
import faiss

X_lyrics = X_lyrics.astype("float32")

# Normalize for cosine similarity via inner product
faiss.normalize_L2(X_lyrics)

d = X_lyrics.shape[1]  # embedding dimension, e.g. 768
index_lyrics = faiss.IndexFlatIP(d)
index_lyrics.add(X_lyrics)

faiss.write_index(index_lyrics, "/content/lyrics_index_mpnet.faiss")
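
The normalization step above is what lets an inner-product index behave like a cosine-similarity index. The sketch below checks that equivalence with plain NumPy on random toy vectors. The `l2_normalize` helper and the `toy` matrix are illustrative stand-ins, not part of the notebook; unlike this helper, `faiss.normalize_L2` modifies its argument in place.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Return a row-wise L2-normalized float32 copy of x."""
    x = np.asarray(x, dtype="float32")
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)

rng = np.random.default_rng(0)
toy = rng.normal(size=(5, 8)).astype("float32")  # hypothetical stand-in for X_lyrics

# Cosine similarity computed directly from the raw vectors
raw_norms = np.linalg.norm(toy, axis=1)
cosine = (toy @ toy.T) / np.outer(raw_norms, raw_norms)

# Inner product of the unit-length vectors
unit = l2_normalize(toy)
inner = unit @ unit.T

# The two matrices should agree up to float32 rounding
print(np.allclose(cosine, inner, atol=1e-5))
```

If the embeddings were not normalized, `IndexFlatIP` would rank by raw dot product, which favors vectors with large norms rather than vectors pointing in a similar direction.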



Reloads the previously saved FAISS index from disk so lyric-based nearest-neighbor searches can be performed without rebuilding the index.

In [38]:
import faiss
index_lyrics = faiss.read_index("/content/lyrics_index_mpnet.faiss")

Defines a helper function that retrieves the top lyric-based nearest neighbors for a given track index. It normalizes the query embedding, performs a FAISS similarity search, and returns the indices and scores of the closest matches while excluding the track itself. This function serves as the core lookup for lyric-only recommendations.

In [39]:
def rec_lyrics_ctx(query_idx, k=10):
    # query_idx is the row index in df_subset / X_lyrics
    q = X_lyrics[query_idx:query_idx+1].copy()
    faiss.normalize_L2(q)  # normalize query
    D, I = index_lyrics.search(q, k+1)  # +1 because first is usually itself
    neighbor_indices = I[0][1:]  # drop self
    neighbor_scores  = D[0][1:]
    return neighbor_indices, neighbor_scores

Resets the DataFrame index to ensure clean, consecutive row numbers and adds a simple integer track_id column. This provides a stable identifier for each track when referencing results from similarity searches.

In [40]:
df_subset = df_subset.reset_index(drop=True)
df_subset["track_id"] = df_subset.index  # simple integer id

Creates a reusable function for retrieving lyric-based nearest neighbors using FAISS. It extracts the query embedding, performs an inner-product similarity search, and returns the top neighbor indices and their scores while filtering out the seed track itself. This function generalizes the lookup process for use across different demos or evaluation steps.


In [41]:
import numpy as np
import faiss

def get_lyrics_neighbors(
    seed_idx: int,
    embeddings: np.ndarray,
    index: faiss.Index,
    k: int = 10
):
    # embeddings should already be L2-normalized for cosine via inner product
    query = embeddings[seed_idx : seed_idx + 1]  # shape (1, D)
    scores, idxs = index.search(query, k + 1)    # include self
    idxs = idxs[0]
    scores = scores[0]

    # drop self if present
    mask = idxs != seed_idx
    return idxs[mask][:k], scores[mask][:k]
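
The mask-based self-removal above differs from the positional `I[0][1:]` drop in `rec_lyrics_ctx` when the seed is not ranked first, for example when two rows carry identical embeddings. The sketch below illustrates that case with a hypothetical NumPy stand-in for `index.search` (`brute_force_search`) and a made-up four-row matrix; it is not meant to reproduce FAISS tie-breaking.

```python
import numpy as np

def brute_force_search(query: np.ndarray, matrix: np.ndarray, k: int):
    """Hypothetical NumPy stand-in for index.search: top-k by inner product."""
    scores = matrix @ query.ravel()
    order = np.argsort(-scores, kind="stable")[:k]
    return scores[order], order

# Rows 0 and 2 are identical, so both score 1.0 against either of them
emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.6, 0.8]], dtype="float32")
seed_idx = 2
k = 2

scores, idxs = brute_force_search(emb[seed_idx], emb, k + 1)

# Positional drop: assumes the seed is always ranked first
positional = idxs[1:]

# Mask-based drop: removes the seed wherever it landed
mask = idxs != seed_idx
masked = idxs[mask][:k]

print("positional:", positional, "masked:", masked)
```

Here the duplicate row 0 is ranked ahead of the seed, so the positional drop keeps the seed in the results while the mask removes it.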

Runs a lyric-based similarity query for a chosen track, retrieving the top nearest neighbors and their scores. It then displays the seed track followed by the matching songs, allowing you to inspect how well the lyric-only embedding space groups related music.

In [42]:
seed_idx = 123
neighbor_idxs, neighbor_scores = get_lyrics_neighbors(seed_idx, X_lyrics, index_lyrics, k=10)
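# A notebook cell only renders its last bare expression, so display both tables explicitly
display(df_subset.iloc[[seed_idx]][["Artist(s)", "song"]])       # seed track
display(df_subset.iloc[neighbor_idxs][["Artist(s)", "song"]])   # lyric-only neighbors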

| index | Artist(s) | song |
|---|---|---|
| 667 | Blacklistt | Pinworm |
| 475 | Immortal | Unholy Forces of Evil |
| 353 | Dog Fashion Disco | Siddhis |
| 134 | The Gits | Bob Cousin O. |
| 502 | Soen | Modesty |
| 514 | Kold-Blooded | Hidden Character |
| 216 | Northlane | Talking Heads |
| 888 | Bullet For My Valentine | Welcome Home Sanitarium |
| 948 | Coma Cinema | Posthumous Release |
| 33 | Emperor | With Strength I Burn |


Builds the audio-feature matrix used for audio-based similarity by selecting the chosen feature columns, cleaning the loudness field if it contains a “db” suffix, and converting everything to numeric form. The features are then standardized with StandardScaler and L2-normalized so cosine similarity can be computed via simple dot products. The function returns both the normalized matrix and the fitted scaler for later reuse.

In [44]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

AUDIO_FEATURE_COLS = [
    "Energy",
    "Danceability",
    "Positiveness",
    "Acousticness",
    "Instrumentalness",
    "Speechiness",
    "Liveness",
    "Tempo",
    "Loudness (db)",
]

def build_audio_feature_matrix(df: pd.DataFrame):
    # Make a copy to avoid modifying the original DataFrame indirectly
    df_processed = df[AUDIO_FEATURE_COLS].copy()

    # Preprocess 'Loudness (db)' column if it exists and contains 'db' suffix
    if "Loudness (db)" in df_processed.columns:
        # Check if any values actually contain 'db' to avoid unnecessary string operations on purely numeric strings
        if df_processed["Loudness (db)"].astype(str).str.contains("db").any():
            df_processed["Loudness (db)"] = (
                df_processed["Loudness (db)"]
                .astype(str)
                .str.replace("db", "", regex=False)
                .astype(float)
            )

    X = df_processed.astype(float).values
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # L2 normalize for cosine via dot product
    norms = np.linalg.norm(X_scaled, axis=1, keepdims=True) + 1e-9
    # FAISS indexes expect float32 input
    X_norm = (X_scaled / norms).astype("float32")
    return X_norm, scaler
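
To illustrate why the features are standardized before L2 normalization, the sketch below compares cosine similarities on raw versus standardized values. The three toy rows and the reduced feature set are invented for illustration, and the z-score is computed with NumPy instead of `StandardScaler`; both use the population standard deviation.

```python
import numpy as np

# Hypothetical toy rows: [Energy (0-100), Tempo (BPM), Loudness (dB)]
X = np.array([
    [90.0, 170.0, -4.0],
    [20.0, 172.0, -15.0],
    [85.0, 90.0, -5.0],
])

def l2_rows(a: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm."""
    return a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-9)

# Column-wise z-score; np.std defaults to the population std, as StandardScaler does
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Cosine similarity of every row against row 0
raw_sim = l2_rows(X) @ l2_rows(X)[0]
scaled_sim = l2_rows(X_scaled) @ l2_rows(X_scaled)[0]

print("raw:", raw_sim.round(3))
print("standardized:", scaled_sim.round(3))
```

On the raw values every cosine is close to 1 because the large positive Tempo and Energy columns dominate the direction of each vector. After standardization the similarities spread out, so a quiet, low-energy track no longer looks nearly identical to a loud, high-energy one.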

Generates the normalized audio-feature embedding matrix for all tracks using the preprocessing function defined earlier. The returned scaler is kept for consistency if additional audio data needs to be transformed later.

In [45]:
audio_embeddings, audio_scaler = build_audio_feature_matrix(df_subset)
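
A minimal sketch of what reusing the fitted scaler on additional tracks might look like, assuming the new tracks arrive as a numeric array with the same column order as the fitted features. The helper name `embed_new_audio` and the numeric statistics are hypothetical; in the notebook the statistics would come from `audio_scaler.mean_` and `audio_scaler.scale_`, and `audio_scaler.transform(...)` performs the same standardization step.

```python
import numpy as np

def embed_new_audio(features, mean, scale) -> np.ndarray:
    """Standardize with previously fitted statistics, then L2-normalize each row."""
    z = (np.asarray(features, dtype="float64") - mean) / scale
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-9)
    return z.astype("float32")  # float32 so the result can be searched with FAISS

# Hypothetical fitted statistics for three features (Energy, Tempo, Loudness)
fitted_mean = np.array([65.0, 144.0, -8.0])
fitted_scale = np.array([31.9, 38.2, 5.0])

# One hypothetical new track, in the same column order
new_tracks = np.array([[70.0, 120.0, -6.0]])
new_vec = embed_new_audio(new_tracks, fitted_mean, fitted_scale)

print(new_vec.shape, float(np.linalg.norm(new_vec)))
```

The important design point is that new data is transformed with the statistics fitted on the original dataset and is not refitted, so new vectors live in the same space as the indexed ones.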

Creates a FAISS inner-product index for the audio-feature embeddings and adds all vectors to it, enabling fast audio-based similarity lookup. The index and the audio embedding matrix are then saved to the processed data directory so they can be reused without recomputing in later notebooks.

In [46]:
import faiss
import os

audio_dim = audio_embeddings.shape[1]

audio_index = faiss.IndexFlatIP(audio_dim)  # IP = inner product
audio_index.add(audio_embeddings)

# Create the directory if it doesn't exist
os.makedirs('data/processed/', exist_ok=True)

faiss.write_index(audio_index, "data/processed/audio_index.faiss")
np.save("data/processed/audio_embeddings.npy", audio_embeddings)

Defines a function for retrieving the top audio-based nearest neighbors for a given track. It queries the FAISS audio index, filters out the seed track, and returns the closest matches along with their similarity scores.

In [47]:
def get_audio_neighbors(seed_idx: int, audio_embeddings, audio_index, k: int = 10):
    query = audio_embeddings[seed_idx : seed_idx + 1]
    scores, idxs = audio_index.search(query, k + 1)
    idxs = idxs[0]
    scores = scores[0]
    mask = idxs != seed_idx
    return idxs[mask][:k], scores[mask][:k]

## HYBRID SPACE

Provides a simple helper that computes cosine-based similarity scores between a chosen track and all others by taking the dot product between the query embedding and the full embedding matrix. Since embeddings are already L2-normalized, this dot product directly corresponds to cosine similarity.

In [48]:
import numpy as np

def compute_cosine_scores(seed_idx, emb_matrix):
    query = emb_matrix[seed_idx : seed_idx + 1]        # (1, D)
    scores = emb_matrix @ query.T                      # (N, 1)
    return scores.ravel()

Defines the hybrid neighbor retrieval function by combining lyric-based and audio-based cosine similarity scores. Each score vector is normalized to the [0, 1] range for comparability, and the final hybrid score is computed as a weighted mixture controlled by α. The function then ranks tracks by descending hybrid similarity, removes the seed track, and returns the top-k neighbors with their blended scores.


In [49]:
def get_hybrid_neighbors(
    seed_idx: int,
    lyrics_emb: np.ndarray,
    audio_emb: np.ndarray,
    alpha: float = 0.5,
    k: int = 10
):
    # cosine via dot product because both emb matrices are L2-normalized
    lyric_scores = compute_cosine_scores(seed_idx, lyrics_emb)
    audio_scores = compute_cosine_scores(seed_idx, audio_emb)

    # min-max normalize each score vector to [0,1] so both modalities share a comparable scale
    def norm01(x):
        x_min, x_max = x.min(), x.max()
        return (x - x_min) / (x_max - x_min + 1e-9)

    lyric_scores_n = norm01(lyric_scores)
    audio_scores_n = norm01(audio_scores)

    # alpha weights the audio score; (1 - alpha) weights the lyric score
    hybrid_scores = alpha * audio_scores_n + (1 - alpha) * lyric_scores_n

    # sort descending, drop self
    idxs = np.argsort(-hybrid_scores)
    idxs = idxs[idxs != seed_idx]
    top_idxs = idxs[:k]
    top_scores = hybrid_scores[top_idxs]
    return top_idxs, top_scores
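
To make the effect of α concrete, the sketch below applies the same min-max normalization and weighted blend to two invented score vectors and prints the resulting ranking for several α values. The scores are hypothetical and the `rank` helper is illustrative only; as in `get_hybrid_neighbors`, α weights the audio side.

```python
import numpy as np

def norm01(x: np.ndarray) -> np.ndarray:
    """Min-max scale a score vector to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

# Hypothetical cosine scores of four candidate tracks against one seed
lyric_scores = np.array([0.80, 0.30, 0.55, 0.10])
audio_scores = np.array([0.20, 0.95, 0.60, 0.40])

def rank(alpha: float) -> np.ndarray:
    """Return candidate indices ordered by descending hybrid score."""
    hybrid = alpha * norm01(audio_scores) + (1 - alpha) * norm01(lyric_scores)
    return np.argsort(-hybrid)

for alpha in (0.0, 0.5, 1.0):
    print(alpha, rank(alpha))
```

With α = 0 the ranking follows the lyric scores alone, with α = 1 it follows the audio scores alone, and intermediate values trade one against the other. Track 2, which is moderately strong in both spaces, stays near the top throughout.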

Runs a hybrid similarity query for the chosen seed track using a 40/60 audio–lyrics weighting. The cell displays the seed song followed by its top hybrid neighbors, letting you compare how the blended scoring pulls in tracks that share both semantic and sonic characteristics.

In [50]:
seed_idx = 123
neighbors_hybrid, scores_hybrid = get_hybrid_neighbors(
    seed_idx,
    lyrics_emb=X_lyrics,
    audio_emb=audio_embeddings,
    alpha=0.4,   # 40% audio, 60% lyrics
    k=10
)

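# A notebook cell only renders its last bare expression, so display both tables explicitly
display(df_subset.iloc[[seed_idx]][["Artist(s)", "song", "text"]])
display(df_subset.iloc[neighbors_hybrid][["Artist(s)", "song", "text", "Genre", "Energy", "Positiveness"]])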

| index | Artist(s) | song | text | Genre | Energy | Positiveness |
|---|---|---|---|---|---|---|
| 888 | Bullet For My Valentine | Welcome Home Sanitarium | [Verse 1] Welcome to where time stands still N... | alternative rock,heavy metal,thrash metal | 73 | 15 |
| 33 | Emperor | With Strength I Burn | Deep Green Dark Chaos Blinded I run down these... | progressive metal,metal,black metal | 94 | 4 |
| 216 | Northlane | Talking Heads | [Verse 1] Tiptoe through the ruins of my mind ... | industrial,progressive metal,metal | 96 | 13 |
| 475 | Immortal | Unholy Forces of Evil | [Verse 1] Slowly crossing as red rivers run be... | metal,black metal | 84 | 17 |
| 267 | Suicide Silence | Two Steps | [Intro] I'm two steps away So why don't you ki... | nu metal,deathcore | 98 | 7 |
| 865 | Slipknot | Skin Ticket | [Verse 1] Zero and zero is nothing but zero Ca... | alternative rock,heavy metal,metal | 96 | 20 |
| 667 | Blacklistt | Pinworm | [Verse 1: Ghostemane] Mission perfect, react, ... | hip hop | 88 | 20 |
| 175 | Killswitch Engage | This Is Goodbye | [Intro] (This is my goodbye) (This is my goodb... | metal,metalcore,emo | 99 | 15 |
| 31 | Anthony Lo Re | Ultimate Battle - Ka Ka Kachi Daze From Dragon... | [Chorus] (Ka Ka Ka Ka) Time to end this (Gun G... | hip hop | 93 | 12 |
| 574 | Beach Vacation,Cathedral Bells | Coping | [Verse 1] Primal, evil, what am I? Tongue-tied... | hip hop | 88 | 40 |


**FILTER HYBRID NEIGHBORS BY SAME EMOTION**

Adds a filtered version of the hybrid neighbor search that returns only tracks sharing the same emotion label as the seed song. It runs the standard hybrid retrieval first, then keeps only those neighbors whose emotion category matches the seed track’s, making it easy to explore emotion-consistent recommendations.

In [51]:
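def get_hybrid_neighbors_same_emotion(seed_idx, df, *args, **kwargs):
    neighbors, scores = get_hybrid_neighbors(seed_idx, *args, **kwargs)
    # Positional lookup, consistent with the row positions used by the embeddings
    seed_emotion = df.iloc[seed_idx]["emotion"]
    # Convert to a plain boolean array so it can index the NumPy outputs directly
    mask = (df.iloc[neighbors]["emotion"] == seed_emotion).to_numpy()
    return neighbors[mask], scores[mask]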
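
Because the filter above runs after the top-k cut, it can return fewer than k tracks, and possibly none. One possible alternative, sketched below under the assumption that the full hybrid score vector is available, is to filter the whole ranking first and only then take k. The `top_k_with_label` helper, the scores, and the emotion labels are all hypothetical.

```python
import numpy as np

def top_k_with_label(scores: np.ndarray, labels: np.ndarray, seed_idx: int, k: int):
    """Rank all tracks, keep those sharing the seed's label, then take the first k."""
    order = np.argsort(-scores)
    order = order[order != seed_idx]            # drop the seed itself
    same = labels[order] == labels[seed_idx]    # boolean mask in ranked order
    kept = order[same][:k]
    return kept, scores[kept]

# Hypothetical hybrid scores and emotion labels for six tracks (seed is row 0)
scores = np.array([1.00, 0.90, 0.85, 0.70, 0.60, 0.50])
labels = np.array(["anger", "joy", "anger", "joy", "anger", "anger"])

kept, kept_scores = top_k_with_label(scores, labels, seed_idx=0, k=3)
print(kept, kept_scores)
```

In this toy case, filtering the top 3 neighbors afterwards would leave a single match, whereas filtering before the cut still yields three emotion-consistent tracks.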

Defines a unified recommendation wrapper that lets you request lyric-only, audio-only, or hybrid neighbors through a single function call. Depending on the selected mode, it dispatches to the appropriate retrieval function and returns the seed track, its recommended neighbors, and their similarity scores. This makes it easy to compare different similarity spaces with one consistent interface.

In [52]:
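def recommend(
    seed_idx,
    mode="hybrid",
    alpha=0.5,
    k=10
):
    # Uses the notebook-level objects built above: X_lyrics, index_lyrics,
    # audio_embeddings, audio_index, and df_subset
    if mode == "lyrics":
        idxs, scores = get_lyrics_neighbors(seed_idx, X_lyrics, index_lyrics, k)
    elif mode == "audio":
        idxs, scores = get_audio_neighbors(seed_idx, audio_embeddings, audio_index, k)
    elif mode == "hybrid":
        idxs, scores = get_hybrid_neighbors(seed_idx, X_lyrics, audio_embeddings, alpha, k)
    else:
        raise ValueError(f"Unknown mode: {mode!r}")
    return df_subset.iloc[[seed_idx]], df_subset.iloc[idxs], scores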
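
One simple way to compare the neighbor lists produced by different modes is to measure how much they overlap. The sketch below computes a Jaccard overlap between two index lists; the `neighbor_overlap` helper is hypothetical, and the index values are copied from the lyric-only and hybrid (α = 0.4) outputs shown in this notebook for seed 123.

```python
import numpy as np

def neighbor_overlap(a, b) -> float:
    """Jaccard overlap between two collections of neighbor indices."""
    a, b = set(map(int, a)), set(map(int, b))
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Row indices taken from the notebook outputs for seed 123
lyrics_top = np.array([667, 475, 353, 134, 502, 514, 216, 888, 948, 33])
hybrid_top = np.array([888, 33, 216, 475, 267, 865, 667, 175, 31, 574])

print(neighbor_overlap(lyrics_top, hybrid_top))
```

Five of the ten lyric-only neighbors also appear in the hybrid list, so adding a 40% audio weight replaces half of the recommendations for this seed.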

Provides a convenience wrapper for running hybrid recommendations with a given α value and returning a clean, formatted result. It retrieves the top hybrid neighbors, then assembles a small table showing each recommendation along with key metadata and its hybrid similarity score, making the output easier to inspect.

In [58]:
def recommend_hybrid(seed_idx, k=10, alpha=0.4):
    neighbors, scores = get_hybrid_neighbors(
        seed_idx,
        lyrics_emb=X_lyrics,
        audio_emb=audio_embeddings,
        alpha=alpha,
        k=k
    )
    seed = df_subset.iloc[[seed_idx]][["Artist(s)", "song", "text"]]
    recs = df_subset.iloc[neighbors][["Artist(s)", "song", "text", "Genre", "Energy", "Positiveness"]]
    recs = recs.assign(hybrid_score=scores)
    return seed, recs

Runs the hybrid recommender for the chosen seed track and displays both the seed entry and its top recommended neighbors. The resulting tables let you quickly inspect how well the hybrid scoring surfaces songs that balance lyrical and audio similarity.

In [56]:
seed, recs = recommend_hybrid(123, k=10, alpha=0.4)
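# A notebook cell only renders its last bare expression, so display both tables explicitly
display(seed)
display(recs)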

| index | Artist(s) | song | text | Genre | Energy | Positiveness | hybrid_score |
|---|---|---|---|---|---|---|---|
| 888 | Bullet For My Valentine | Welcome Home Sanitarium | [Verse 1] Welcome to where time stands still N... | alternative rock,heavy metal,thrash metal | 73 | 15 | 0.650318 |
| 33 | Emperor | With Strength I Burn | Deep Green Dark Chaos Blinded I run down these... | progressive metal,metal,black metal | 94 | 4 | 0.645183 |
| 216 | Northlane | Talking Heads | [Verse 1] Tiptoe through the ruins of my mind ... | industrial,progressive metal,metal | 96 | 13 | 0.640158 |
| 475 | Immortal | Unholy Forces of Evil | [Verse 1] Slowly crossing as red rivers run be... | metal,black metal | 84 | 17 | 0.634365 |
| 267 | Suicide Silence | Two Steps | [Intro] I'm two steps away So why don't you ki... | nu metal,deathcore | 98 | 7 | 0.627295 |
| 865 | Slipknot | Skin Ticket | [Verse 1] Zero and zero is nothing but zero Ca... | alternative rock,heavy metal,metal | 96 | 20 | 0.623653 |
| 667 | Blacklistt | Pinworm | [Verse 1: Ghostemane] Mission perfect, react, ... | hip hop | 88 | 20 | 0.621837 |
| 175 | Killswitch Engage | This Is Goodbye | [Intro] (This is my goodbye) (This is my goodb... | metal,metalcore,emo | 99 | 15 | 0.620122 |
| 31 | Anthony Lo Re | Ultimate Battle - Ka Ka Kachi Daze From Dragon... | [Chorus] (Ka Ka Ka Ka) Time to end this (Gun G... | hip hop | 93 | 12 | 0.619527 |
| 574 | Beach Vacation,Cathedral Bells | Coping | [Verse 1] Primal, evil, what am I? Tongue-tied... | hip hop | 88 | 40 | 0.618809 |
