
Project: Song Recommendation System Based on User Mood
This project builds a system that suggests songs based on a user's mood. It uses the Spotify API to fetch the user's listening data and the Genius API to fetch lyrics, converts the lyrics into embeddings with a pre-trained transformer model, stores those embeddings in a FAISS index, and uses LangChain and MLflow to manage the retrieval and generation steps.
 Step-by-Step Guide
 
 1. Set Up the Environment and Install Dependencies
**Why:** To ensure all necessary packages and tools are available for the project.
**Action:** Install the required libraries such as `lyricsgenius`, `spotipy`, `transformers`, `scikit-learn`, `faiss-cpu`, `tqdm`, and `mlflow`.
**Commands:**


In [2]:
%pip install lyricsgenius
%pip install spotipy
%pip install spotipy lyricsgenius transformers scikit-learn gtts pydub librosa
%pip install faiss-cpu
%pip install tqdm
%pip install torch

Note: you may need to restart the kernel to use updated packages.
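
After installing, it can help to confirm that the packages actually import before running the later cells. The sketch below is a convenience check and not part of the original notebook: `missing_packages` is a hypothetical helper, and the import names (`sklearn` for `scikit-learn`, `faiss` for `faiss-cpu`) are assumed to be the usual ones for these distributions.

```python
import importlib.util

# Import names can differ from pip names (assumed mapping, adjust as needed).
REQUIRED_MODULES = [
    "lyricsgenius", "spotipy", "transformers", "sklearn",
    "faiss", "tqdm", "mlflow", "torch",
]

def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported missing, rerun the matching `%pip install` line and restart the kernel.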


In [3]:
# import pandas as pd


# data = pd.read_csv("spotify/data/data.csv")
# genre_data = pd.read_csv('spotify/data/data_by_genres.csv')
# year_data = pd.read_csv('spotify/data/data_by_year.csv')


In [4]:
# import os
# import pandas as pd
# import tqdm 

# # show the state of the progress bar
# tqdm.tqdm.pandas()

# # Setting the base directory using list of directory names
# base_dir = "data"

# # Building paths by further extending the base directory
# data_path = os.path.join(base_dir, "data.csv")
# genre_data_path = os.path.join(base_dir, "data_by_genres.csv")
# year_data_path = os.path.join(base_dir, "data_by_year.csv")

# # Reading the data using pandas
# data = pd.read_csv(data_path)
# genre_data = pd.read_csv(genre_data_path)
# year_data = pd.read_csv(year_data_path)


In [5]:
# # spotify 


# import spotipy
# client_id = 'YOUR_SPOTIFY_CLIENT_ID'  # placeholder; never commit real credentials
# client_secret = 'YOUR_SPOTIFY_CLIENT_SECRET'  # placeholder

# tqdm.tqdm.pandas()

# from spotipy.oauth2 import SpotifyClientCredentials
# client_credentials_manager = SpotifyClientCredentials(client_id, client_secret)
# sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

# # check health 
# sp.trace = False
# track = sp.track('7qiZfU4dY1lWllzX7mPBI3')
# print(track['name'])
# # Now you have the access token to make requests to the Spotify API
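
Hard-coding API keys in a notebook makes them easy to leak. One common alternative is to read them from environment variables. The sketch below assumes the variable names `SPOTIFY_CLIENT_ID`, `SPOTIFY_CLIENT_SECRET`, and `GENIUS_API_TOKEN` (chosen here to match the constants used later in this notebook), and `load_credentials` is a hypothetical helper.

```python
import os

def load_credentials(env=os.environ):
    """Read API credentials from environment variables (variable names are assumptions)."""
    names = ["SPOTIFY_CLIENT_ID", "SPOTIFY_CLIENT_SECRET", "GENIUS_API_TOKEN"]
    missing = [name for name in names if not env.get(name)]
    if missing:
        # Fail early with a clear message instead of sending empty credentials to the APIs.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}

# Usage (requires the variables to be set in the shell that launched Jupyter):
# creds = load_credentials()
# client_id = creds["SPOTIFY_CLIENT_ID"]
```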


In [6]:
# # log in to the Spotify account as the user and get their playlists
# import spotipy.util as util

# username = 'eqanbww3jh63cgf4ot5zyyr5d'
# scope = 'playlist-read-private'
# token = util.prompt_for_user_token(username, scope, client_id, client_secret, redirect_uri='http://localhost:8888/callback')
# if token:
#     sp = spotipy.Spotify(auth=token)
#     playlists = sp.user_playlists(username)
#     for playlist in playlists['items']:
#         print(playlist['name'])

# else:
#     print("Can't get token for", username)

# # list the songs in every playlist


In [7]:
# Install required libraries for the project
# This ensures all necessary packages are available for audio processing, text embedding, API interactions, and data management
%pip install lyricsgenius spotipy transformers scikit-learn gtts pydub librosa faiss-cpu tqdm mlflow
%pip install torch  --index-url https://download.pytorch.org/whl/cu118

# Import essential libraries for the project
import spotipy
from spotipy.oauth2 import SpotifyOAuth
import lyricsgenius
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
import faiss
import logging
import psutil  # For monitoring system memory
import gc  # For managing memory through garbage collection

# Set up logging to monitor and log the flow of execution and potential issues
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

SPOTIFY_CLIENT_ID = '***REMOVED***'
SPOTIFY_CLIENT_SECRET = '***REMOVED***'
SPOTIFY_REDIRECT_URI = 'http://localhost:8888/callback'
GENIUS_API_TOKEN = '***REMOVED***'


# Initialize the Spotify API with user credentials for accessing music-related data
logger.info("Setting up Spotify API...")
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id=SPOTIFY_CLIENT_ID,
                                               client_secret=SPOTIFY_CLIENT_SECRET,
                                               redirect_uri=SPOTIFY_REDIRECT_URI,
                                               scope="user-top-read user-library-read playlist-read-private"))

# Initialize the Genius API with your credentials to fetch song lyrics
logger.info("Setting up Genius API...")
genius = lyricsgenius.Genius(GENIUS_API_TOKEN)

# Load a pre-trained transformer model and tokenizer for processing lyrics into embeddings
logger.info("Loading pre-trained transformer model...")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

# Run the model on CPU to avoid GPU memory overflow issues
device = torch.device("cpu")
model.to(device)

# Define a function to embed textual data using the transformer model to get fixed-size numerical vectors
def embed_text(text):
    # Inputs longer than the model's maximum length are truncated
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True).to(device)
    with torch.no_grad():  # inference only, so gradients are not needed
        outputs = model(**inputs)
    # Mean-pool the token embeddings into one vector of shape (1, hidden_size)
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings.cpu().numpy()

# Function to monitor and log the memory usage to manage resources efficiently
def log_memory_usage():
    process = psutil.Process()
    mem_info = process.memory_info()
    logger.info(f"Memory usage: {mem_info.rss / 1024 ** 2:.2f} MB")

# Retrieve and log the user's most listened tracks from Spotify
def get_spotify_top_tracks(sp, limit=5, time_range='medium_term'):
    logger.info(f"Fetching top {limit} tracks from Spotify...")
    results = sp.current_user_top_tracks(limit=limit, time_range=time_range)
    tracks = results['items']
    logger.info(f"Fetched {len(tracks)} tracks.")
    return tracks

# Fetch and log playlists created by the user on Spotify
def get_spotify_playlists(sp):
    logger.info("Fetching user playlists from Spotify...")
    results = sp.current_user_playlists()
    playlists = results['items']
    logger.info(f"Fetched {len(playlists)} playlists.")
    return playlists

# Fetch and log the audio features of tracks from Spotify which includes metrics like tempo, energy, etc.
def get_audio_features(sp, track_ids):
    logger.info("Fetching audio features from Spotify...")
    audio_features = sp.audio_features(track_ids)
    logger.info(f"Fetched audio features for {len(audio_features)} tracks.")
    return audio_features

# Retrieve and log lyrics for specified songs using the Genius API
def get_lyrics(artist, title):
    logger.info(f"Fetching lyrics for {title} by {artist} from Genius...")
    song = genius.search_song(title, artist)
    if song:
        logger.info(f"Fetched lyrics for {title}.")
        return song.lyrics
    logger.warning(f"Lyrics for {title} by {artist} not found.")
    return None

# Convert audio features into a numerical vector for processing and comparison
def audio_features_to_vector(audio_features):
    vector = np.array([
        audio_features['danceability'],
        audio_features['energy'],
        audio_features['speechiness'],
        audio_features['acousticness'],
        audio_features['instrumentalness'],
        audio_features['liveness'],
        audio_features['valence'],
        audio_features['tempo']
    ], dtype=np.float32)  # FAISS stores float32 vectors
    return vector

# Create and log a FAISS index for efficient similarity searches among large datasets
def create_faiss_index(data, dimension):
    logger.info(f"Creating FAISS index with dimension {dimension}...")
    index = faiss.IndexFlatL2(dimension)
    index.add(data)
    logger.info("FAISS index created.")
    log_memory_usage()
    return index

# Batch processing to manage memory usage while fetching and processing data from Spotify
def fetch_and_process_data(sp, limit=5, batch_size=2):
    tracks = get_spotify_top_tracks(sp, limit=limit)
    track_ids = [track['id'] for track in tracks]
    audio_features = get_audio_features(sp, track_ids)
    
    playlists = get_spotify_playlists(sp)
    playlist_names = [playlist['name'] for playlist in playlists]
    
    lyrics_data = []
    audio_vectors = []
    
    for i in range(0, len(tracks), batch_size):
        batch_tracks = tracks[i:i+batch_size]
        batch_audio_features = audio_features[i:i+batch_size]
        
        for track, audio_feature in zip(batch_tracks, batch_audio_features):
            artist = track['artists'][0]['name']
            title = track['name']
            lyrics = get_lyrics(artist, title)
            if lyrics:
                lyrics_embedding = embed_text(lyrics)
                audio_vector = audio_features_to_vector(audio_feature)
                lyrics_data.append({'id': track['id'], 'embedding': lyrics_embedding, 'artist': artist, 'title': title, 'lyrics': lyrics})
                audio_vectors.append(audio_vector)
        
        # Release memory after processing each batch
        del batch_tracks, batch_audio_features
        gc.collect()
        log_memory_usage()
    
    lyrics_embeddings = np.vstack([song['embedding'] for song in lyrics_data])
    audio_vectors = np.vstack(audio_vectors)
    
    return lyrics_data, lyrics_embeddings, audio_vectors, playlist_names

# Execute the data fetching and processing
logger.info("Fetching and processing data...")
lyrics_data, lyrics_embeddings, audio_vectors, playlist_names = fetch_and_process_data(sp, limit=5)

# Create indices for the embeddings and vectors to facilitate efficient similarity searches
lyrics_index = create_faiss_index(lyrics_embeddings, 768)
audio_index = create_faiss_index(audio_vectors, 8)



Note: you may need to restart the kernel to use updated packages.
Looking in indexes: https://download.pytorch.org/whl/cu118


INFO:__main__:Setting up Spotify API...
INFO:__main__:Setting up Genius API...
INFO:__main__:Loading pre-trained transformer model...
INFO:__main__:Fetching and processing data...
INFO:__main__:Fetching top 5 tracks from Spotify...
INFO:__main__:Fetched 5 tracks.
INFO:__main__:Fetching audio features from Spotify...
INFO:__main__:Fetched audio features for 5 tracks.
INFO:__main__:Fetching user playlists from Spotify...
INFO:__main__:Fetched 49 playlists.
INFO:__main__:Fetching lyrics for Lucky Strike by Maroon 5 from Genius...


Searching for "Lucky Strike" by Maroon 5...


INFO:__main__:Fetched lyrics for Lucky Strike.


Done.


INFO:__main__:Fetching lyrics for Love Me by Lil Wayne from Genius...


Searching for "Love Me" by Lil Wayne...


INFO:__main__:Fetched lyrics for Love Me.
INFO:__main__:Memory usage: 918.41 MB
INFO:__main__:Fetching lyrics for DO IT AGAIN (feat. 2Rare) by NLE Choppa from Genius...


Done.
Searching for "DO IT AGAIN (feat. 2Rare)" by NLE Choppa...


INFO:__main__:Fetched lyrics for DO IT AGAIN (feat. 2Rare).
INFO:__main__:Fetching lyrics for Tell Em by Cochise from Genius...


Done.
Searching for "Tell Em" by Cochise...


INFO:__main__:Fetched lyrics for Tell Em.
INFO:__main__:Memory usage: 963.02 MB
INFO:__main__:Fetching lyrics for Koi Si by Afsana Khan from Genius...


Done.
Searching for "Koi Si" by Afsana Khan...


INFO:__main__:Fetched lyrics for Koi Si.


Done.


INFO:__main__:Memory usage: 978.39 MB
INFO:__main__:Creating FAISS index with dimension 768...
INFO:__main__:FAISS index created.
INFO:__main__:Memory usage: 978.47 MB
INFO:__main__:Creating FAISS index with dimension 8...
INFO:__main__:FAISS index created.
INFO:__main__:Memory usage: 978.47 MB
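
One caveat about the 8-dimensional audio index: Spotify reports most of these features on a 0 to 1 scale, while tempo is in beats per minute, so an unscaled L2 distance is dominated by tempo. A minimal sketch of one possible remedy, z-scoring each column before indexing, is shown below. `standardize_columns` is a hypothetical helper, the two sample rows are made-up values, and the same means and standard deviations would also have to be applied to any query vector.

```python
from statistics import mean, pstdev

def standardize_columns(rows):
    """Z-score each column so that no single feature dominates the L2 distance."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Two illustrative tracks with [valence, tempo]; the numbers are invented.
raw = [[0.2, 90.0], [0.8, 150.0]]
scaled = standardize_columns(raw)
print(scaled)
```

After scaling, both columns contribute on a comparable scale to the distance computed by `IndexFlatL2`.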


In [8]:
%pip install langchain langchain-community
%pip install sentence-transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.




In [15]:
# step 4

from langchain_community.embeddings import HuggingFaceEmbeddings  # langchain.embeddings is the older, deprecated import path

embeddings = HuggingFaceEmbeddings(model_name="distilbert-base-uncased")

def embed_user_query(user_input):
    logger.info("Embedding user query using LangChain...")
    user_embedding = embeddings.embed_query(user_input)
    logger.info("User query embedded.")
    user_embedding = np.array(user_embedding)  # 1-D array of shape (768,); a raw FAISS search needs it reshaped to (1, 768)
    print("User embedding shape:", user_embedding.shape)  # Debugging: print shape
    return user_embedding


INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: mps
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: distilbert-base-uncased


In [31]:
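# step 5
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore


def retrieve_lyrics_with_langchain(query_embedding):
    logger.info("Performing lyrics retrieval using LangChain...")
    # NOTE: the wrapper resolves each FAISS row through index_to_docstore_id and then the docstore.
    # Here the mapping is empty and lyrics_data is a list rather than an id -> Document dict,
    # so the lookup raises KeyError on the first hit.
    retriever = FAISS(embedding_function=embeddings.embed_query, index=lyrics_index, docstore=InMemoryDocstore(lyrics_data), index_to_docstore_id={})
    docs = retriever.similarity_search_by_vector(query_embedding, k=5)
    logger.info("Retrieved top 5 lyrics using LangChain.")
    return docs

def retrieve_audio_features_with_langchain(query_embedding):
    logger.info("Performing audio feature retrieval using LangChain...")
    # NOTE: this call omits the docstore and index_to_docstore_id arguments that the wrapper requires,
    # and similarity_search() expects a text query rather than a vector.
    # A 768-dimensional text embedding also cannot query the 8-dimensional audio index.
    retriever = FAISS(embedding_function=embeddings.embed_query, index=audio_index)
    docs = retriever.similarity_search(query_embedding, k=5)
    logger.info("Retrieved top 5 audio features using LangChain.")
    return docs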
    
def combine_retrieval_results(lyrics_docs, audio_docs):
    logger.info("Combining retrieval results...")
    combined_results = lyrics_docs + audio_docs  # This could be a simple concatenation or more sophisticated merging
    logger.info(f"Combined {len(combined_results)} results.")
    return combined_results
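
The comment in `combine_retrieval_results` leaves room for "more sophisticated merging". One option, swapped in here purely as an illustration and not something the notebook implements, is reciprocal rank fusion (RRF): each item scores `1 / (k + rank)` in every list it appears in, and the scores are summed. The function name, the song ids, and the constant `k=60` below are assumptions for the sketch.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of item ids into one list ordered by fused score."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Items that rank well in more than one list rise to the top.
    return sorted(scores, key=lambda item_id: scores[item_id], reverse=True)

lyrics_ranking = ["song_a", "song_b", "song_c"]  # hypothetical ids, best match first
audio_ranking = ["song_b", "song_d", "song_a"]
fused = reciprocal_rank_fusion([lyrics_ranking, audio_ranking])
print(fused)
```

Unlike plain concatenation, this removes duplicates and rewards songs that both retrievers agree on.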


In [32]:


# Use the retrieval functions above in the cell that calls retrieve_lyrics_with_langchain


In [12]:
# step 6
def format_recommendations(retrieved_docs):
    logger.info("Formatting recommendations...")
    formatted_response = "\n".join([f"Song: {doc.metadata['title']} by {doc.metadata['artist']}\n{doc.page_content[:100]}..." for doc in retrieved_docs])
    logger.info("Recommendations formatted.")
    return formatted_response

from transformers import pipeline

# Initialize the generation pipeline using an open-source model
generator = pipeline('text-generation', model='gpt2')

def generate_personalized_response(formatted_recommendations, user_query):
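    logger.info("Generating personalized response using the GPT-2 pipeline...")
    # max_new_tokens limits only the generated continuation; max_length would also count the prompt tokens
    response = generator(f"Context: {formatted_recommendations}\n\nQuestion: {user_query}\nAnswer:", max_new_tokens=100, num_return_sequences=1)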
    return response[0]['generated_text']
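
GPT-2 has a 1,024-token context window, so a prompt that packs in several lyric snippets can crowd out the answer. A rough way to keep the prompt short is to build the context under a size budget. The sketch below uses a character budget as a crude stand-in for a token count; `build_prompt`, `max_chars`, and the dictionary layout of each recommendation are assumptions for illustration.

```python
def build_prompt(recommendations, user_query, max_chars=600):
    """Assemble a generation prompt, trimming the context to a character budget."""
    lines = []
    used = 0
    for rec in recommendations:
        line = f"Song: {rec['title']} by {rec['artist']}"
        if used + len(line) + 1 > max_chars:
            break  # stop before the context exceeds the budget
        lines.append(line)
        used += len(line) + 1  # +1 for the newline that joins the lines
    context = "\n".join(lines)
    return f"Context: {context}\n\nQuestion: {user_query}\nAnswer:"

recs = [
    {"title": "No Surprises", "artist": "Radiohead"},
    {"title": "Candy", "artist": "Paolo Nutini"},
]
print(build_prompt(recs, "sad song about heartbreak"))
```

For a tighter bound, the budget could be measured in tokens with the model's own tokenizer instead of characters.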




In [24]:
#testing


INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: mps
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: distilbert-base-uncased
INFO:__main__:Embedding user query using LangChain...
INFO:__main__:User query embedded.


User embedding shape: (768,)
Query embedding shape: (768,)
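
The printed shape `(768,)` is a 1-D vector, whereas a direct `index.search` call on a FAISS index takes a 2-D `float32` array with one row per query. A small conversion helper, sketched below under the assumption that NumPy is available (the name `to_faiss_query` is hypothetical), makes the layout explicit.

```python
import numpy as np

def to_faiss_query(vector):
    """Return a C-contiguous float32 array of shape (n_queries, d)."""
    arr = np.asarray(vector, dtype=np.float32)
    if arr.ndim == 1:
        arr = arr.reshape(1, -1)  # a single query becomes one row
    return np.ascontiguousarray(arr)

query = to_faiss_query([0.1] * 768)
print(query.shape, query.dtype)
```

LangChain's `similarity_search_by_vector` accepts the flat vector directly, so this reshaping is only needed when calling the FAISS index without the wrapper.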


In [33]:
# Retrieve lyrics using the query embedding

# Recreate the embedding model and the query helper from step 4
embeddings = HuggingFaceEmbeddings(model_name="distilbert-base-uncased")

def embed_user_query(user_input):
    logger.info("Embedding user query using LangChain...")
    user_embedding = embeddings.embed_query(user_input)
    logger.info("User query embedded.")
    user_embedding = np.array(user_embedding)  # 1-D array of shape (768,)
    print("User embedding shape:", user_embedding.shape)  # Debugging: print shape
    return user_embedding


# Example user query
user_query = "summer happy vibes"

# Create an embedding for the user query
query_embedding = embed_user_query(user_query)

# A raw FAISS search expects a 2-D array (one row per query); this embedding is still 1-D
print(f"Query embedding shape: {query_embedding.shape}")

def retrieve_songs(query, k=5):
    # Embed the query and reshape it to (1, d) float32, the layout a raw FAISS search expects
    query_embedding = embed_user_query(query).astype(np.float32).reshape(1, -1)

    # Search the lyrics index; cap k at the number of indexed songs
    k = min(k, lyrics_index.ntotal)
    lyrics_distances, lyrics_indices = lyrics_index.search(query_embedding, k)
    lyrics_results = [lyrics_data[idx] for idx in lyrics_indices[0]]

    # The 8-dimensional audio index cannot be searched with a 768-dimensional text embedding,
    # so audio retrieval needs its own query vector and is left out here.
    return lyrics_results



def retrieve_lyrics_with_langchain(query_embedding):
    logger.info("Performing lyrics retrieval using LangChain...")
    # NOTE: index_to_docstore_id is empty and lyrics_data is a list rather than an id -> Document dict,
    # so resolving the first search hit raises KeyError.
    retriever = FAISS(embedding_function=embeddings.embed_query, index=lyrics_index, docstore=InMemoryDocstore(lyrics_data), index_to_docstore_id={})
    docs = retriever.similarity_search_by_vector(query_embedding, k=5)
    logger.info("Retrieved top 5 lyrics using LangChain.")
    return docs


lyrics_docs = retrieve_lyrics_with_langchain(query_embedding)

# Check the results
print("Lyrics retrieval results:")
for doc in lyrics_docs:
    print(doc.metadata['title'], doc.metadata['artist'])



INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: mps
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: distilbert-base-uncased
INFO:__main__:Embedding user query using LangChain...
INFO:__main__:User query embedded.
INFO:__main__:Performing lyrics retrieval using LangChain...


User embedding shape: (768,)
Query embedding shape: (768,)


KeyError: 2
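
The `KeyError: 2` is consistent with the wrapper looking up FAISS row 2 in the empty `index_to_docstore_id` mapping. The plain-Python sketch below imitates that lookup chain (row number, then docstore id, then document) with ordinary dictionaries. It is a simplified model and not LangChain's actual code; `lookup_documents` and the sample songs are hypothetical.

```python
def lookup_documents(row_indices, index_to_docstore_id, docstore):
    """Resolve FAISS row numbers to stored documents via two dictionary lookups."""
    return [docstore[index_to_docstore_id[row]] for row in row_indices]

songs = [
    {"title": "Song A", "artist": "Artist A"},
    {"title": "Song B", "artist": "Artist B"},
    {"title": "Song C", "artist": "Artist C"},
]

# One docstore entry and one mapping entry per row added to the index.
docstore = {str(i): song for i, song in enumerate(songs)}
index_to_docstore_id = {i: str(i) for i in range(len(songs))}

print(lookup_documents([2, 0], index_to_docstore_id, docstore))

# An empty mapping fails on the first row number it is asked about.
try:
    lookup_documents([2, 0], {}, docstore)
except KeyError as error:
    print("KeyError:", error)
```

In the real vector store the docstore values would be `Document` objects carrying `page_content` and `metadata`, with `title` and `artist` stored in the metadata so that the result loop above can read them.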

In [40]:

# working version 

# Import essential libraries for the project
import spotipy
from spotipy.oauth2 import SpotifyOAuth
import lyricsgenius
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
import faiss
import logging
import psutil  # For monitoring system memory
import gc  # For managing memory through garbage collection

# Set up logging to monitor and log the flow of execution and potential issues
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

SPOTIFY_CLIENT_ID = '***REMOVED***'
SPOTIFY_CLIENT_SECRET = '***REMOVED***'
SPOTIFY_REDIRECT_URI = 'http://localhost:8888/callback'
GENIUS_API_TOKEN = '***REMOVED***'


# Initialize the Spotify API with user credentials for accessing music-related data
logger.info("Setting up Spotify API...")
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id=SPOTIFY_CLIENT_ID,
                                               client_secret=SPOTIFY_CLIENT_SECRET,
                                               redirect_uri=SPOTIFY_REDIRECT_URI,
                                               scope="user-top-read user-library-read playlist-read-private"))

# Initialize the Genius API with your credentials to fetch song lyrics
logger.info("Setting up Genius API...")
genius = lyricsgenius.Genius(GENIUS_API_TOKEN)

# Load a pre-trained transformer model and tokenizer for processing lyrics into embeddings
logger.info("Loading pre-trained transformer model...")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

# Run the model on CPU to avoid GPU memory overflow issues
device = torch.device("cpu")
model.to(device)

# Define a function to embed textual data using the transformer model to get fixed-size numerical vectors
def embed_text(text):
    # Inputs longer than the model's maximum length are truncated
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True).to(device)
    with torch.no_grad():  # inference only, so gradients are not needed
        outputs = model(**inputs)
    # Mean-pool the token embeddings into one vector of shape (1, hidden_size)
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings.cpu().numpy()

# Function to monitor and log the memory usage to manage resources efficiently
def log_memory_usage():
    process = psutil.Process()
    mem_info = process.memory_info()
    logger.info(f"Memory usage: {mem_info.rss / 1024 ** 2:.2f} MB")

# Retrieve and log the user's most listened tracks from Spotify
def get_spotify_top_tracks(sp, limit=50, time_range='medium_term'):  # Spotify caps limit at 50 per request
    logger.info(f"Fetching top {limit} tracks from Spotify...")
    results = sp.current_user_top_tracks(limit=limit, time_range=time_range)
    tracks = results['items']
    logger.info(f"Fetched {len(tracks)} tracks.")
    return tracks

# Fetch and log playlists created by the user on Spotify
def get_spotify_playlists(sp):
    logger.info("Fetching user playlists from Spotify...")
    results = sp.current_user_playlists()
    playlists = results['items']
    logger.info(f"Fetched {len(playlists)} playlists.")
    return playlists

# Fetch and log the audio features of tracks from Spotify which includes metrics like tempo, energy, etc.
def get_audio_features(sp, track_ids):
    logger.info("Fetching audio features from Spotify...")
    audio_features = sp.audio_features(track_ids)
    logger.info(f"Fetched audio features for {len(audio_features)} tracks.")
    return audio_features

# Retrieve and log lyrics for specified songs using the Genius API
def get_lyrics(artist, title):
    logger.info(f"Fetching lyrics for {title} by {artist} from Genius...")
    song = genius.search_song(title, artist)
    if song:
        logger.info(f"Fetched lyrics for {title}.")
        return song.lyrics
    logger.warning(f"Lyrics for {title} by {artist} not found.")
    return None

# Convert audio features into a numerical vector for processing and comparison
def audio_features_to_vector(audio_features):
    vector = np.array([
        audio_features['danceability'],
        audio_features['energy'],
        audio_features['speechiness'],
        audio_features['acousticness'],
        audio_features['instrumentalness'],
        audio_features['liveness'],
        audio_features['valence'],
        audio_features['tempo']
    ], dtype=np.float32)  # FAISS stores float32 vectors
    return vector

# Batch processing to manage memory usage while fetching and processing data from Spotify
def fetch_and_process_data(sp, limit=50, batch_size=20):  # limit is passed to the top-tracks call (max 50 per request)
    tracks = get_spotify_top_tracks(sp, limit=limit)
    track_ids = [track['id'] for track in tracks]
    audio_features = get_audio_features(sp, track_ids)
    
    lyrics_data = []
    audio_vectors = []
    metadata = []  # Store metadata for each track
    
    for i in range(0, len(tracks), batch_size):
        batch_tracks = tracks[i:i+batch_size]
        batch_audio_features = audio_features[i:i+batch_size]
        
        for track, audio_feature in zip(batch_tracks, batch_audio_features):
            artist = track['artists'][0]['name']
            title = track['name']
            lyrics = get_lyrics(artist, title)
            if lyrics:
                lyrics_embedding = embed_text(lyrics)
                audio_vector = audio_features_to_vector(audio_feature)
                lyrics_data.append(lyrics_embedding)
                audio_vectors.append(audio_vector)
                metadata.append({'artist': artist, 'title': title, 'lyrics': lyrics})
        
        # Release memory after processing each batch
        del batch_tracks, batch_audio_features
        gc.collect()
        log_memory_usage()
    
    lyrics_embeddings = np.vstack(lyrics_data)
    audio_vectors = np.vstack(audio_vectors)
    
    return lyrics_embeddings, audio_vectors, metadata

# Execute the data fetching and processing
logger.info("Fetching and processing data...")
lyrics_embeddings, audio_vectors, metadata = fetch_and_process_data(sp, limit=5)

# Create indices for the embeddings and vectors to facilitate efficient similarity searches
lyrics_index = faiss.IndexFlatL2(lyrics_embeddings.shape[1])
lyrics_index.add(lyrics_embeddings)

audio_index = faiss.IndexFlatL2(audio_vectors.shape[1])
audio_index.add(audio_vectors)

# Function to retrieve similar lyrics using FAISS
def retrieve_similar_lyrics(query_embedding, k=3):
    k = min(k, lyrics_index.ntotal)  # FAISS pads missing results with index -1
    _, indices = lyrics_index.search(query_embedding, k)
    return [metadata[i] for i in indices[0]]

# Example usage
user_query = "sad song about heartbreak"
query_embedding = embed_text(user_query)
lyrics_results = retrieve_similar_lyrics(query_embedding)

# Display results
for result in lyrics_results:
    print(f"Title: {result['title']}, Artist: {result['artist']}")
    print(f"Lyrics snippet: {result['lyrics'][:100]}...\n")


INFO:__main__:Setting up Spotify API...
INFO:__main__:Setting up Genius API...
INFO:__main__:Loading pre-trained transformer model...
INFO:__main__:Fetching and processing data...
INFO:__main__:Fetching top 5 tracks from Spotify...
INFO:__main__:Fetched 5 tracks.
INFO:__main__:Fetching audio features from Spotify...
INFO:__main__:Fetched audio features for 5 tracks.
INFO:__main__:Fetching lyrics for THE GREATEST by Billie Eilish from Genius...


Searching for "THE GREATEST" by Billie Eilish...


INFO:__main__:Fetched lyrics for THE GREATEST.
INFO:__main__:Fetching lyrics for No Surprises by Radiohead from Genius...


Done.
Searching for "No Surprises" by Radiohead...


INFO:__main__:Fetched lyrics for No Surprises.
INFO:__main__:Fetching lyrics for Bunker by Balthazar from Genius...


Done.
Searching for "Bunker" by Balthazar...


INFO:__main__:Fetched lyrics for Bunker.
INFO:__main__:Fetching lyrics for Candy by Paolo Nutini from Genius...


Done.
Searching for "Candy" by Paolo Nutini...


INFO:__main__:Fetched lyrics for Candy.
INFO:__main__:Fetching lyrics for Fake Plastic Trees by Radiohead from Genius...


Done.
Searching for "Fake Plastic Trees" by Radiohead...


INFO:__main__:Fetched lyrics for Fake Plastic Trees.


Done.


INFO:__main__:Memory usage: 1009.61 MB


Title: No Surprises, Artist: Radiohead
Lyrics snippet: 145 ContributorsTranslationsРусскийDeutschEspañolFrançaisNo Surprises Lyrics[Verse 1]
A heart that's...

Title: THE GREATEST, Artist: Billie Eilish
Lyrics snippet: 70 ContributorsTranslationsHebrewPolskiالعربيةDeutschPortuguêsEspañolItalianoTürkçeΕλληνικάFrançaisР...

Title: Fake Plastic Trees, Artist: Radiohead
Lyrics snippet: 129 ContributorsTranslationsEspañolPortuguêsFrançaisFake Plastic Trees Lyrics[Verse 1]
A green plast...
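
The snippets above show that the text returned by Genius starts with page residue (contributor counts, translation language names, and a "... Lyrics" title) followed by section tags such as `[Verse 1]`. That residue ends up in the embeddings. The sketch below is one heuristic way to strip it before embedding; the regular expressions are assumptions based only on the three snippets printed here and may not cover every page layout, and `clean_lyrics` is a hypothetical helper.

```python
import re

# Header residue: "<number> Contributors ... <title> Lyrics" at the very start of the text.
HEADER_PATTERN = re.compile(r"^\d+\s+Contributors.*?Lyrics", re.DOTALL)
# Section tags such as [Verse 1] or [Chorus].
SECTION_TAG_PATTERN = re.compile(r"\[[^\]\n]*\]")

def clean_lyrics(raw):
    """Remove the leading page header and bracketed section tags from scraped lyrics."""
    text = HEADER_PATTERN.sub("", raw, count=1)
    text = SECTION_TAG_PATTERN.sub("", text)
    return text.strip()

sample = "145 ContributorsTranslationsDeutschNo Surprises Lyrics[Verse 1]\nA heart that's..."
print(clean_lyrics(sample))
```

Calling `clean_lyrics(lyrics)` before `embed_text(lyrics)` would keep the header text out of the lyric embeddings; text without the header pattern passes through unchanged apart from tag removal.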

