## Load and explore the dataset
First, we download the dataset and load it into a variable.
For this, we use:
- kagglehub to download the dataset from Kaggle.
- pandas to read the CSV file.

In [1]:
import kagglehub
import pandas as pd
import warnings

path = kagglehub.dataset_download("notshrirang/spotify-million-song-dataset")
print(path)
warnings.filterwarnings("ignore")

songs = pd.read_csv(f"{path}/spotify_millsongdata.csv")

songs

C:\Users\Martin Caballero\.cache\kagglehub\datasets\notshrirang\spotify-million-song-dataset\versions\1


Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \r\nA..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \r\nTouch me gen..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \r\nWhy I had...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...
...,...,...,...,...
57645,Ziggy Marley,Good Old Days,/z/ziggy+marley/good+old+days_10198588.html,Irie days come on play \r\nLet the angels fly...
57646,Ziggy Marley,Hand To Mouth,/z/ziggy+marley/hand+to+mouth_20531167.html,Power to the workers \r\nMore power \r\nPowe...
57647,Zwan,Come With Me,/z/zwan/come+with+me_20148981.html,all you need \r\nis something i'll believe \...
57648,Zwan,Desire,/z/zwan/desire_20148986.html,northern star \r\nam i frightened \r\nwhere ...


## Pre-processing
Pre-processing is sometimes needed because of irregularities in the dataset.
In this case, we:
- Remove all songs without lyrics.
- Replace the escaped line breaks (\r\n) in the lyrics with spaces.
- Drop songs whose lyrics have 10 words or fewer.
- Add an id for each song.

In [2]:
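# Drop songs that have no lyrics.
songs = songs.dropna(subset=["text"]).copy()
# Replace escaped line breaks with spaces and trim surrounding whitespace.
songs["text_cleaned"] = songs["text"].str.replace(r"\r\n", " ", regex=True).str.strip()
songs = songs[["artist", "song", "link", "text_cleaned"]]
# Keep only songs with more than 10 words; copy so the next assignment targets a real frame.
songs = songs[songs["text_cleaned"].str.split().str.len() > 10].copy()
# Use the original row index as the id of each song.
songs["id"] = songs.index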

songs

Unnamed: 0,artist,song,link,text_cleaned,id
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face And ...",0
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please Touch me gently...",1
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go Why I had to...,2
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...,3
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...,4
...,...,...,...,...,...
57645,Ziggy Marley,Good Old Days,/z/ziggy+marley/good+old+days_10198588.html,Irie days come on play Let the angels fly le...,57645
57646,Ziggy Marley,Hand To Mouth,/z/ziggy+marley/hand+to+mouth_20531167.html,Power to the workers More power Power to t...,57646
57647,Zwan,Come With Me,/z/zwan/come+with+me_20148981.html,all you need is something i'll believe fla...,57647
57648,Zwan,Desire,/z/zwan/desire_20148986.html,northern star am i frightened where can i ...,57648
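
The cleaned lyrics above still contain runs of several spaces, because each line in the raw text ends with a space before the line break. The sketch below reproduces that on a made-up string (not taken from the dataset) and shows one optional extra step, collapsing whitespace with a regular expression. That extra step is an assumption and is not part of the original notebook. On the DataFrame, the equivalent would be a second `.str.replace(r"\s+", " ", regex=True)`.

```python
import re

# Hypothetical raw lyric with trailing spaces before each CRLF, like the dataset rows.
raw = "first line  \r\nsecond line  \r\n  third line"

# Same replacement as the notebook cell: every CRLF becomes one space.
step1 = raw.replace("\r\n", " ").strip()

# Optional extra step (assumption): collapse any run of whitespace into one space.
step2 = re.sub(r"\s+", " ", step1)

print(repr(step1))  # still contains runs of spaces
print(repr(step2))  # single spaces only
```

Since the embedding model works on tokens rather than raw characters, the extra spaces are unlikely to matter much for the search, but the collapsed text is easier to read when printed.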


## Embedding creation
Next, we create an embedding for each song.
An embedding is a semantic representation of an entity as a vector in an n-dimensional space, so that songs with similar meaning end up close to each other.

For this, we have to:
- Create an embedding_text column with the text to embed.
- Create a vector database that stores the embeddings generated from embedding_text.
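
To make "close in an n-dimensional space" concrete, here is a minimal sketch that compares toy 3-dimensional vectors with cosine similarity. The vectors and their names are invented for illustration; real embeddings from all-MiniLM-L6-v2 have 384 dimensions. Cosine similarity is only one common measure, and the vector store may be configured with a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: the first two point in a similar direction.
sad_song = [0.9, 0.1, 0.0]
lonely_song = [0.8, 0.2, 0.1]
party_song = [0.0, 0.2, 0.9]

sim_close = cosine_similarity(sad_song, lonely_song)
sim_far = cosine_similarity(sad_song, party_song)

# The semantically closer pair gets the higher score.
print(sim_close > sim_far)
```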

In [3]:
import os
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.schema import Document

songs["embedding_text"] = songs["text_cleaned"]  # text to embed; here just the cleaned lyrics

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

persist_directory = "chroma_db"

if os.path.exists(persist_directory):
    db_songs = Chroma(persist_directory=persist_directory, embedding_function=embedding_model)
else:
    documents = [Document(page_content=row["embedding_text"], metadata={"id": row["id"]}) for _, row in songs.iterrows()]
    db_songs = Chroma.from_documents(documents, embedding=embedding_model, persist_directory=persist_directory)
    db_songs.persist()




## Spotify data
This step retrieves song data that is not in the dataset, such as the album name, the cover image, and the actual Spotify URL.

In [4]:
import os
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

client_id = os.getenv("CLIENT_ID")
client_secret = os.getenv("CLIENT_SECRET")
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

def get_spotify_info(song, artist):
    """Fetch album name, image, and Spotify track link."""
    results = sp.search(q=f"track:{song} artist:{artist}", type="track", limit=1)
    if results["tracks"]["items"]:
        track = results["tracks"]["items"][0]
        album_name = track["album"]["name"]
        images = track["album"]["images"]
        album_image_url = images[0]["url"] if images else None  # guard against an empty image list
        track_url = track["external_urls"]["spotify"]
        return album_name, album_image_url, track_url
    return "Unknown Album", None, None
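
The cell above reads the credentials with `os.getenv`, which returns `None` when a variable is not set, so a missing credential only shows up later, when the client is built or the first request is made. A small check like the following can fail earlier with a clearer message. `require_env` is a hypothetical helper, not part of spotipy, and the variable name in the demo is a placeholder.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demo with a placeholder variable so the sketch runs anywhere.
os.environ["DEMO_CLIENT_ID"] = "demo-id"
print(require_env("DEMO_CLIENT_ID"))
```

In the notebook, this would be used as `client_id = require_env("CLIENT_ID")` and `client_secret = require_env("CLIENT_SECRET")`.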


## Query

Now we can run a query. The vector database lets us search by similarity using the song embeddings.

In [None]:
query = "A song about a guy who is depressed"
docs = db_songs.similarity_search(query, k=10)

ids = [doc.metadata['id'] for doc in docs]

similar_songs = songs[songs['id'].isin(ids)]

for _, row in similar_songs.iterrows():
    album_name, album_image, link = get_spotify_info(row["song"], row["artist"])
    
    print(f"🎵 Song: {row['song']} by {row['artist']} ID = {row['id']}")
    print(f"📀 Album: {album_name}")
    print(f"📝 Lyrics: {row['text_cleaned']}")
    print(f"🔗 Link: {link}")
    print(f"🖼 Album Cover: {album_image}\n")


🎵 Song: Mannequin by Britney Spears ID = 1988
📀 Album: Circus (Deluxe Version)
📝 Lyrics: Always talking around this,   He wants me,   I get things, everything I wanted,   My own way, your time, goldmines,   Loose guys, on my backless, dresses, exes.      I cannot help myself, I'm just doing what I do,   Got my heart set, do anything that I want so thank you,   I like it and I do what I like,   And if you do what I like, then you'll like it,      If you wanna just   Scream,   Scream your lungs out,   If you wanna just   Cry,   Cry your eyes out,   I'm not doing that      That's what I'm about.      [Chorus]   You can cry your eyes out of your head,   Baby, baby,   I don't care, I don't care,   I don't care, I don't care,   You can cry-cry-cry again-gain-gain,   My face like a mannequin,   (Scream)   Mannequin, yea I did,   It again and again,   You can cry-cry-cry again-gain-gain,   My face like a mannequin.      You told me more than he did,   And then you were frozen, imposin',   Ther
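
One detail of the query cell is worth noting: filtering with `isin(ids)` returns rows in the DataFrame's own order, not in the order of the search results, so the first song printed is not necessarily the best match. The sketch below shows one way to keep the ranking, using a toy DataFrame with made-up rows. It assumes the ids are unique and that every returned id exists in `songs`.

```python
import pandas as pd

# Toy stand-in for the `songs` DataFrame (hypothetical rows).
songs = pd.DataFrame(
    {
        "id": [10, 11, 12, 13],
        "song": ["A", "B", "C", "D"],
        "artist": ["w", "x", "y", "z"],
    }
)

# Pretend the vector store returned these ids, best match first.
ids = [12, 10, 13]

# isin() keeps the DataFrame's row order, so the ranking is lost.
unordered = songs[songs["id"].isin(ids)]

# Selecting by label with .loc[ids] returns rows in the order of `ids`.
ranked = songs.set_index("id").loc[ids].reset_index()

print(unordered["id"].tolist())
print(ranked["id"].tolist())
```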
