## Lyrics & Audio Embeddings + FAISS Index Construction
This notebook generates the core embedding spaces that power the Spectral Neighbor Plus recommender. It encodes lyrical text into semantic embeddings, standardizes audio feature vectors, and builds FAISS indices to enable fast nearest-neighbor retrieval in both modalities. The resulting matrices and indices form the foundation for multimodal similarity scoring in later stages of the system.

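This cell installs the sentence-transformers library, which provides the pretrained models used to generate semantic embeddings for song lyrics. It also installs faiss-cpu, since the index-building cell imports `faiss` and the package is not installed anywhere else in this notebook.

In [None]:
!pip install sentence-transformers faiss-cpu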



This cell defines the utilities for generating lyric embeddings. It loads a sentence-transformer model, encodes all lyrics in batches to produce L2-normalized semantic vectors, and provides helpers to save the resulting embeddings as a NumPy file and reload them. Keeping these steps in functions lets the embedding workflow be reused cleanly throughout the project.

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path
from sentence_transformers import SentenceTransformer

def load_model(model_name: str = "sentence-transformers/all-mpnet-base-v2"):
    """Load the sentence-transformers model."""
    return SentenceTransformer(model_name)

def compute_lyrics_embeddings(
    df: pd.DataFrame,
    lyrics_col: str,
    model,
    batch_size: int = 32,
    normalize: bool = True
):
    """Compute contextual embeddings for lyrics."""
    lyrics = df[lyrics_col].fillna("").astype(str).tolist()

    embeddings = model.encode(
        lyrics,
        batch_size=batch_size,
        show_progress_bar=True,
        normalize_embeddings=normalize
    )

    return embeddings

def save_embeddings(embeddings, path: str):
    """Save embeddings to a .npy file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, embeddings)

def load_embeddings(path: str):
    """Load embeddings from a .npy file."""
    path = Path(path)
    return np.load(path)
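
Because `compute_lyrics_embeddings` defaults to `normalize=True`, each returned row should have unit L2 norm, which is what lets an inner-product index behave like a cosine-similarity index. The sketch below illustrates that equivalence with plain NumPy on made-up vectors. The toy array and the `l2_normalize` helper are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm (hypothetical helper for illustration)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

# Toy stand-ins for lyric embeddings; real ones would come from model.encode(...)
raw = np.array([[3.0, 4.0], [1.0, 0.0], [6.0, 8.0]], dtype=np.float32)
unit = l2_normalize(raw)

# Cosine similarity computed directly from the raw vectors
raw_norms = np.linalg.norm(raw, axis=1)
cosine = (raw @ raw.T) / (raw_norms[:, None] * raw_norms[None, :])

# Inner product of the normalized vectors yields the same matrix
inner = unit @ unit.T
print(np.allclose(cosine, inner, atol=1e-6))
```

Rows 0 and 2 point in the same direction, so their similarity is 1 even though their raw lengths differ. This is the behavior one usually wants when comparing lyrics of different lengths.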

This cell loads the sampled Spotify dataset and assigns the key column groups used later: the lyric text column, a collection of audio-related feature columns, and a placeholder for any contextual fields. It finishes by displaying the first few rows so you can confirm the structure before generating embeddings.


In [None]:
import pandas as pd

df = pd.read_csv('spotify_900k_sample.csv')
lyrics_col = 'text'
# Audio feature columns, spelled as they appear in this dataset's header (capitalized)
audio_cols = [col for col in df.columns if col.startswith('audio_') or col in ['Energy', 'Danceability', 'Positiveness', 'Acousticness', 'Instrumentalness', 'Speechiness', 'Liveness', 'Tempo', 'Loudness (db)']]
context_cols = []
df.head()

| Unnamed: 0 | Artist(s) | song | text | Genre | Album | emotion | Energy | Danceability | Positiveness | Acousticness | Instrumentalness | Speechiness | Liveness | Tempo | Loudness (db) | Popularity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Shirley Temple | Come And Get Your Happiness | Lyrics/Music Yellen/Pokrass Why are grown up p... | hip hop | Little Miss Shirley Temple | joy | 20 | 66 | 68 | 100 | 1 | 7 | 30 | 64 | -11.91db | 14 |
| 1 | Supertramp | Ever Open Door | Sharing's good, sharing's fine But no one w... | rock,progressive rock,classic rock | Brother Where You Bound (Remastered) | joy | 25 | 47 | 10 | 85 | 0 | 3 | 23 | 77 | -11.62db | 36 |
| 2 | Demi Lovato | Let It Go | [Intro] Let it go, let it go Can’t hold it bac... | synthpop,pop,electropop | Demi (Deluxe) | joy | 66 | 50 | 25 | 3 | 0 | 4 | 26 | 140 | -5.87db | 50 |
| 3 | Chase & Status,Mozey,Sav'o,Horrid1 | Action | [Intro: Sav'O] (Madara) J'S, J'S [Verse 1: Sa... | hip hop | 2 RUFF, Vol. 1 | anger | 96 | 64 | 33 | 1 | 8 | 17 | 7 | 175 | -1.88db | 52 |
| 4 | Mick Jenkins | 40 Below | [Produced by: THEMPeople] [Intro: Sample] "Lo... | country | Wave[s] | anger | 83 | 38 | 22 | 4 | 1 | 26 | 11 | 96 | -1.75db | 17 |


Loads the sentence-transformer model into memory, preparing it for generating lyric embeddings in the next steps.

In [None]:
model = load_model()




Generates semantic embeddings for all lyric texts using the loaded model, processing them in batches for efficiency. The final line shows the shape of the embedding matrix, confirming how many tracks were encoded and the dimensionality of each vector.

In [None]:
embeddings = compute_lyrics_embeddings(
    df=df,
    lyrics_col=lyrics_col,
    model=model,
    batch_size=32
)
embeddings.shape
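
Beyond reading the shape, a few cheap checks can catch problems before the matrix is indexed. The sketch below is one possible set of checks, run here on a synthetic matrix. `check_embeddings` is a hypothetical helper, not part of the notebook, and the tolerance is an assumption.

```python
import numpy as np

def check_embeddings(emb: np.ndarray, n_rows: int, atol: float = 1e-3) -> dict:
    """Return basic sanity checks for an embedding matrix (hypothetical helper)."""
    norms = np.linalg.norm(emb, axis=1)
    return {
        "two_dimensional": emb.ndim == 2,
        "row_count_matches": emb.shape[0] == n_rows,
        "all_finite": bool(np.isfinite(emb).all()),
        "unit_norm": bool(np.allclose(norms, 1.0, atol=atol)),
    }

# Synthetic stand-in for the real matrix: 5 "tracks", 8 dimensions, unit rows
rng = np.random.default_rng(0)
fake = rng.normal(size=(5, 8)).astype(np.float32)
fake /= np.linalg.norm(fake, axis=1, keepdims=True)

report = check_embeddings(fake, n_rows=5)
print(report)
```

In this notebook the call would look like `check_embeddings(embeddings, n_rows=len(df))`. With the default `all-mpnet-base-v2` model, the second dimension of `embeddings.shape` should be 768.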


This cell saves the lyric embeddings computed above as a NumPy file and builds a FAISS inner-product index for fast similarity search. Because the embeddings are L2-normalized, inner-product scores equal cosine similarities. After adding all embeddings to the index, it writes the index to disk so later notebooks can run nearest-neighbor queries without recomputing anything.

In [None]:
import faiss

# Reuse the normalized embeddings from the previous cell; FAISS expects
# a contiguous float32 matrix.
embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
save_embeddings(embeddings, "data/processed/lyrics_embeddings.npy")  # creates the folder if needed

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product search
index.add(embeddings)
faiss.write_index(index, "data/processed/lyrics_index.faiss")
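
A flat inner-product index performs exact search: it scores the query against every stored vector and returns the `k` highest-scoring rows. The sketch below reproduces that logic in plain NumPy on toy vectors so the behavior can be inspected without FAISS. `top_k_inner_product` and the sample data are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def top_k_inner_product(queries: np.ndarray, database: np.ndarray, k: int):
    """Exact top-k search by inner product (NumPy stand-in for a flat IP index)."""
    scores = queries @ database.T                  # shape: (n_queries, n_database)
    # Sort each row by descending score and keep the first k column indices
    order = np.argsort(-scores, axis=1)[:, :k]
    top_scores = np.take_along_axis(scores, order, axis=1)
    return top_scores, order

# Three unit-length toy "track" vectors
database = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32)
query = database[[2]]                              # use track 2 as the query

scores, ids = top_k_inner_product(query, database, k=2)
print(ids, scores)
```

With the saved index, the corresponding call would be `scores, ids = index.search(query, k)`, where `query` is a 2-D float32 array. When the query vector is itself in the index, its own row normally comes back as the first hit, so a recommender would typically request `k + 1` results and drop the query track.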

This cell selects a fixed set of audio features required for the hybrid similarity model and filters them to include only the columns present in the dataset. It then previews the resulting feature subset to confirm the values before scaling and embedding.

In [None]:
import pandas as pd

df = pd.read_csv("spotify_900k_sample.csv")

AUDIO_FEATURE_COLS = [
    "Energy", "Danceability", "Positiveness",
    "Acousticness", "Instrumentalness", "Speechiness",
    "Liveness", "Tempo", "Loudness (db)"
]
# Keep only the audio columns that actually exist in the dataset
audio_cols = [col for col in AUDIO_FEATURE_COLS if col in df.columns]

print(audio_cols)
df[audio_cols].head()
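
Note that the `Loudness (db)` column is stored as text such as `-11.91db` in the `df.head()` preview earlier in this notebook, so it has to be converted to a number before scaling. The sketch below shows one way to parse it and z-score the features, using three columns and the five rows copied from that preview. The variable names are illustrative, and using the population standard deviation (`ddof=0`) is an assumption chosen to match the convention of scikit-learn's `StandardScaler`.

```python
import numpy as np
import pandas as pd

# Three columns and five rows copied from the df.head() preview
sample = pd.DataFrame({
    "Energy": [20, 25, 66, 96, 83],
    "Tempo": [64, 77, 140, 175, 96],
    "Loudness (db)": ["-11.91db", "-11.62db", "-5.87db", "-1.88db", "-1.75db"],
})

# Strip the "db" suffix so loudness becomes numeric
sample["Loudness (db)"] = (
    sample["Loudness (db)"].str.replace("db", "", regex=False).astype(float)
)

X = sample.to_numpy(dtype=np.float32)
mean = X.mean(axis=0)
std = X.std(axis=0)        # population standard deviation (ddof=0)
std[std == 0] = 1.0        # guard against constant columns
X_scaled = (X - mean) / std

print(X_scaled.round(2))
```

After scaling, every column has mean 0 and standard deviation 1, so features on large scales such as tempo do not dominate distance computations over features with smaller ranges.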