## Test script for app

This notebook includes all of the test code I ran to try out different query methods, including:
- Sentiment analysis for lyrics: uses VADER and a HuggingFace pipeline. Results may be usable for English lyrics but were not optimal for other languages.
- TF-IDF to query for top keywords: uses the [text2text library](https://github.com/artitw/text2text#tf-idf)
- Summarizing lyrics and mood to use as a query: uses NLTK

### Spotify and Genius Authentication

In [42]:
import os
from dotenv import load_dotenv
load_dotenv()

import spotipy
from spotipy.oauth2 import SpotifyOAuth, SpotifyClientCredentials

# Get Spotify app credentials
client_id = os.getenv("SPOTIFY_CLIENT_ID")
client_secret = os.getenv("SPOTIFY_CLIENT_SECRET")
redirect_uri = os.getenv("SPOTIFY_REDIRECT_URI")
scope = "user-library-read playlist-read-private user-top-read"

# Authenticate 
client_credentials_manager = SpotifyClientCredentials(
    client_id=client_id,
    client_secret=client_secret
)

oauth_manager = SpotifyOAuth(
    client_id=client_id,
    client_secret=client_secret,
    redirect_uri=redirect_uri,
    scope=scope
)

sp = spotipy.Spotify(oauth_manager=oauth_manager)
sp.current_user()

{'display_name': 'rile',
 'external_urls': {'spotify': 'https://open.spotify.com/user/4ssmxat4kvsarq280tilxd9cw'},
 'followers': {'href': None, 'total': 15},
 'href': 'https://api.spotify.com/v1/users/4ssmxat4kvsarq280tilxd9cw',
 'id': '4ssmxat4kvsarq280tilxd9cw',
 'images': [{'height': 300,
   'url': 'https://i.scdn.co/image/ab6775700000ee85b075d0a3febdc0142cda438f',
   'width': 300},
  {'height': 64,
   'url': 'https://i.scdn.co/image/ab67757000003b82b075d0a3febdc0142cda438f',
   'width': 64}],
 'type': 'user',
 'uri': 'spotify:user:4ssmxat4kvsarq280tilxd9cw'}

### Fetch Spotify songs

5YkfW67PCvWAOO63TjYwLl

In [79]:
user_list = sp.playlist_items('1GfuMQxvw8BcgUx0ozH3fl')

In [81]:
user_songs = []
for item in user_list['items']:
    track = item['track']
    user_songs.append({
        'name': track['name'],
        'artist': track['artists'][0]['name'],
        'album': track['album']['name'],
        'release_date': track['album']['release_date'],
        'popularity': track['popularity']
    })

In [82]:
import pandas as pd
user_songs = pd.DataFrame(user_songs)
user_songs

Unnamed: 0,name,artist,album,release_date,popularity
0,exes,Tate McRae,THINK LATER,2023-12-08,77
1,Big Energy,Latto,Big Energy,2021-09-24,60
2,It's ok I'm ok,Tate McRae,It's ok I'm ok (remixes),2024-10-11,52
3,Guess featuring billie eilish,Charli xcx,Brat and it’s completely different but also st...,2024-10-11,80
4,Woman,Doja Cat,Planet Her,2021-06-24,50
...,...,...,...,...,...
95,J CHRIST,Lil Nas X,J CHRIST,2024-01-12,53
96,NO MIENTEN,Becky G,NO MIENTEN,2022-04-20,43
97,Sorry,Justin Bieber,Nacido para jugar,2023-12-01,38
98,Blow,Kesha,Cannibal (Expanded Edition),2010-11-19,67


### Match songs with lyrics

In [83]:
import lyricsgenius
gen_client_access_token = os.getenv("GENIUS_CLIENT_TOKEN")
genius = lyricsgenius.Genius(gen_client_access_token, timeout=10)

list_lyrics = []

for i, song in user_songs.iterrows():
    title = song['name']
    artist = song['artist']
    retries = 0
    while retries < 3:
        try:
            lyrics = genius.search_song(title, artist)
        except Exception:  # retry on request errors (e.g. a timeout) without swallowing KeyboardInterrupt
            retries += 1
            continue
        if lyrics:
            list_lyrics.append({
                'title': title,
                'lyrics': lyrics.lyrics
            })
        break

Searching for "exes" by Tate McRae...
Done.
Searching for "Big Energy" by Latto...
Done.
Searching for "It's ok I'm ok" by Tate McRae...
Done.
Searching for "Guess featuring billie eilish" by Charli xcx...
Done.
Searching for "Woman" by Doja Cat...
Done.
Searching for "Dear god" by Tate McRae...
Done.
Searching for "7 rings" by Ariana Grande...
Done.
Searching for "greedy" by Tate McRae...
Done.
Searching for "1, 2 Step (feat. Missy Elliott)" by Ciara...
Done.
Searching for "2 hands" by Tate McRae...
Done.
Searching for "Flowers" by Miley Cyrus...
Done.
Searching for "Breakin' Dishes" by Rihanna...
Done.
Searching for "Your Love Is My Drug" by Kesha...
Done.
Searching for "Toxic" by Britney Spears...
Done.
Searching for "Get Into It (Yuh)" by Doja Cat...
Done.
Searching for "Rockstar" by LISA...
Done.
Searching for "Diva" by Beyoncé...
Done.
Searching for "hurt my feelings" by Tate McRae...
Done.
Searching for "Kill Bill" by SZA...
Done.
Searching for "Tia Tamera (feat. Rico Nasty)" by
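
The retry loop above retries immediately after a failure. A minimal sketch of the same idea with a pause between attempts is shown here; `call_with_retries`, `base_delay`, and the injectable `sleep` parameter are hypothetical names of my own, not part of `lyricsgenius`.

```python
import time


def call_with_retries(func, *args, retries=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs); wait and retry on an exception.

    Returns None if every attempt fails (hypothetical helper, not a library API).
    """
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Wait longer after each failure: base_delay, 2 * base_delay, 4 * base_delay, ...
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    return None


# Toy demonstration with a function that fails twice before succeeding.
attempts = {"count": 0}


def flaky_lookup():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "lyrics"


# A no-op sleep keeps the demonstration instant.
result = call_with_retries(flaky_lookup, sleep=lambda seconds: None)
print(result, attempts["count"])
```

In the loop above, this could be used as `call_with_retries(genius.search_song, title, artist)`, assuming a `None` result is treated the same way as a song that was not found.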

### Clean and tokenize lyrics

In [47]:
import random
import pandas as pd

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist

# One-time downloads (uncomment on first run)
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

def clean_lyrics(df, column):
    """
    Remove section markers and special characters from a lyrics column and normalize its format.
    Args:
        df (DataFrame): DataFrame containing song information
        column (str): name of the column to clean
    Returns:
        df (DataFrame): the same DataFrame with the cleaned lyrics column
    """
    df[column] = df[column].str.lower()
    # remove section markers and instrument notes (matches anywhere, including inside longer words)
    df[column] = df[column].str.replace(r"(verse\s?\d*|chorus|bridge|outro|intro)", "", regex=True)
    df[column] = df[column].str.replace(r"(instrumental|guitar|solo)", "", regex=True)
    # remove anything left in square brackets
    df[column] = df[column].str.replace(r"\[.*?\]", "", regex=True)
    # replace newlines with sentence breaks
    df[column] = df[column].str.replace(r"\n", ". ", regex=True)
    # remove special characters
    df[column] = df[column].str.replace(r"[^\w\d'\s.]+", "", regex=True)
    df[column] = df[column].str.strip()
    # drop the first two characters, assuming they are the ". " left by the newline replacement
    df[column] = df[column].str[2:]

    return df
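
The final `str[2:]` slice in `clean_lyrics` removes two characters whether or not they are a period and a space, so a lyric that does not start with a section marker loses real text (the `oken. this is goodbye` row in a later table looks like an example). A minimal sketch of a safer alternative is shown here, assuming the goal is only to drop leading periods and whitespace; `strip_leading_breaks` is a hypothetical helper name.

```python
import re


def strip_leading_breaks(text):
    """Remove any run of periods and whitespace at the start of the text.

    Hypothetical helper: unlike a fixed two-character slice, it leaves
    text that starts with a real word untouched.
    """
    return re.sub(r"^[.\s]+", "", text)


print(strip_leading_breaks(". . uhwuh. . . puedes mentir de nuevo"))
print(strip_leading_breaks("spoken. this is goodbye"))
```

On a DataFrame column the same pattern could be applied with `df[column].str.replace(r"^[.\s]+", "", regex=True)`.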

In [84]:
# Convert list_lyrics to DataFrame
list_lyrics_df = pd.DataFrame(list_lyrics)

In [85]:
# Clean and tokenize lyrics
list_lyrics_df = clean_lyrics(list_lyrics_df, 'lyrics')
list_lyrics_df

Unnamed: 0,title,lyrics
0,exes,oh i'm sorry sorry that you love me hahahahaha...
1,Big Energy,got that real big energy uhhuh. got that big b...
2,It's ok I'm ok,mm. yeah uh. . . see you so excited mm. you go...
3,Guess featuring billie eilish,hey billie you there. uhuh. . . you wanna gues...
4,Woman,hey woman. hey woman. . . ayy woman. let me be...
...,...,...
94,J CHRIST,hey look look look but look. we going all the ...
95,NO MIENTEN,. . uhwuh. . . puedes mentir de nuevo. pero yo...
96,Sorry,. . you gotta go and get angry at all of my ho...
97,Blow,haha. dance. . . back door cracked we don't ne...


In [86]:
df = pd.read_csv('songs_lyrics.csv', index_col=0)
lyrics = pd.concat([df, list_lyrics_df], ignore_index=True).drop_duplicates(subset=['title'], keep='first')

In [87]:
lyrics.to_csv('songs_lyrics.csv')

### Add songs and lyrics to db

In [78]:
import sqlite3

with open('spotify_db/playlist_schema.sql', 'r') as file:
    create_table_query = file.read()
    
connection = sqlite3.connect('spotify_db/spotify.db')
connection.execute(create_table_query)

list_lyrics_df.to_sql('songs_lyrics', connection, if_exists='append', index=False)

connection.commit()
connection.close()

### Sentiment analysis 

Using NLTK VADER

In [12]:
from nltk.sentiment import vader
# nltk.download('vader_lexicon')

negative = []
neutral = []
positive = []
compound = []

analyzer = vader.SentimentIntensityAnalyzer()

for text in list_lyrics_df['lyrics']:
    scores = analyzer.polarity_scores(text)
    negative.append(scores['neg'])
    neutral.append(scores['neu'])
    positive.append(scores['pos'])
    compound.append(scores['compound'])

list_lyrics_df['negative'] = negative
list_lyrics_df['neutral'] = neutral
list_lyrics_df['positive'] = positive
list_lyrics_df['compound'] = compound

In [13]:
# average the four VADER scores (neg, neu, pos, compound) per song, then average across the dataset
list_sentiment = list_lyrics_df[['negative', 'neutral', 'positive', 'compound']].mean(axis=1)
list_sentiment.mean()

np.float64(0.3719018041237113)

Sentiment analysis using the [tabularisai/multilingual-sentiment-analysis](https://huggingface.co/tabularisai/multilingual-sentiment-analysis) model from HuggingFace:
- Advantages over VADER:
  - Fewer neutral results
  - Multilingual support
- VADER:
  - Works well for English lyrics
  - Seems to lean toward neutral (not verified)
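
One way to check the "leans toward neutral" impression is to bucket the VADER compound scores into labels and count them. The sketch below assumes the commonly cited cutoffs of +0.05 and -0.05 on the compound score; `label_from_compound` and `label_counts` are hypothetical helper names, and the sample values are the first five compound scores from the summary table later in this notebook.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def label_counts(compounds, threshold=0.05):
    """Count how many scores fall into each label."""
    counts = {"negative": 0, "neutral": 0, "positive": 0}
    for compound in compounds:
        counts[label_from_compound(compound, threshold)] += 1
    return counts


# Compound scores copied from the first five rows of the summary table.
sample = [0.9851, 0.9983, 0.9953, -0.9688, 0.9915]
print(label_counts(sample))  # {'negative': 1, 'neutral': 0, 'positive': 4}
```

On this small sample the compound score is rarely neutral even though the `neutral` proportion is high for every song, so it may be worth deciding which of the two columns the comparison with the HuggingFace model should use.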

In [23]:
# Example
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

model_name = "tabularisai/multilingual-sentiment-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# def predict_sentiment(text):
#         inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
#         with torch.no_grad():
#             outputs = model(**inputs)
#         probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
#         sentiment_map = {
#             0: "sombre",
#             1: "sad",
#             2: "neutral",
#             3: "happy",
#             4: "ecstatic"
#         }
#         return [sentiment_map[p] for p in torch.argmax(probabilities, dim=-1).tolist()] 

# Build the pipeline once and reuse the model and tokenizer loaded above;
# creating it inside the function reloads the model on every call
pipe = pipeline(task="sentiment-analysis", model=model, tokenizer=tokenizer)

def predict_sentiment(texts):
    # sentiment_map = {0: "Somber", 1: "Sad", 2: "Neutral", 3: "Happy", 4: "Ecstatic"}
    sentiments = pipe(texts)
    labels = [sentiment['label'] for sentiment in sentiments]
    # return [sentiment_map[int(p)] for p in labels]
    return labels
    
# texts = [
#     # English
#     "I absolutely love the new design of this app!", "The customer service was disappointing.", "The weather is fine, nothing special.",
#     # Chinese
#     "这家餐厅的菜味道非常棒！", "我对他的回答很失望。", "天气今天一般。",
#     # Spanish
#     "¡Me encanta cómo quedó la decoración!", "El servicio fue terrible y muy lento.", "El libro estuvo más o menos.",
#     # Arabic
#     "الخدمة في هذا الفندق رائعة جدًا!", "لم يعجبني الطعام في هذا المطعم.", "كانت الرحلة عادية。",
#     # Ukrainian
#     "Мені дуже сподобалася ця вистава!", "Обслуговування було жахливим.", "Книга була посередньою。",
#     # Hindi
#     "यह जगह सच में अद्भुत है!", "यह अनुभव बहुत खराब था।", "फिल्म ठीक-ठाक थी।",
#     # Bengali
#     "এখানকার পরিবেশ অসাধারণ!", "সেবার মান একেবারেই খারাপ।", "খাবারটা মোটামুটি ছিল।",
#     # Portuguese
#     "Este livro é fantástico! Eu aprendi muitas coisas novas e inspiradoras.", 
#     "Não gostei do produto, veio quebrado.", "O filme foi ok, nada de especial.",
#     # Japanese
#     "このレストランの料理は本当に美味しいです！", "このホテルのサービスはがっかりしました。", "天気はまあまあです。",
#     # Russian
#     "Я в восторге от этого нового гаджета!", "Этот сервис оставил у меня только разочарование.", "Встреча была обычной, ничего особенного.",
#     # French
#     "J'adore ce restaurant, c'est excellent !", "L'attente était trop longue et frustrante.", "Le film était moyen, sans plus.",
#     # Turkish
#     "Bu otelin manzarasına bayıldım!", "Ürün tam bir hayal kırıklığıydı.", "Konser fena değildi, ortalamaydı.",
#     # Italian
#     "Adoro questo posto, è fantastico!", "Il servizio clienti è stato pessimo.", "La cena era nella media.",
#     # Polish
#     "Uwielbiam tę restaurację, jedzenie jest świetne!", "Obsługa klienta była rozczarowująca.", "Pogoda jest w porządku, nic szczególnego.",
#     # Tagalog
#     "Ang ganda ng lugar na ito, sobrang aliwalas!", "Hindi maganda ang serbisyo nila dito.", "Maayos lang ang palabas, walang espesyal.",
#     # Dutch
#     "Ik ben echt blij met mijn nieuwe aankoop!", "De klantenservice was echt slecht.", "De presentatie was gewoon oké, niet bijzonder.",
#     # Malay
#     "Saya suka makanan di sini, sangat sedap!", "Pengalaman ini sangat mengecewakan.", "Hari ini cuacanya biasa sahaja.",
#     # Korean
#     "이 가게의 케이크는 정말 맛있어요!", "서비스가 너무 별로였어요.", "날씨가 그저 그렇네요.",
#     # Swiss German
#     "Ich find dä Service i de Beiz mega guet!", "Däs Esä het mir nöd gfalle.", "D Wätter hüt isch so naja.",
#     # Vietnamese
#     "Tôi thích cách trang trí mới của quán!", "Dịch vụ khách hàng thì thất vọng.", "Bộ phim này tạm ổn, không gì đặc biệt."
# ]

# for text, sentiment in zip(texts, predict_sentiment(texts)):
#     print(f"Text: {text}\nSentiment: {sentiment}\n")



In [24]:
sentiments = []
for text in list_lyrics_df['lyrics']:
    sentiment = (predict_sentiment([text[:512]])[0])
    sentiments.append(sentiment)

list_lyrics_df['sentiment'] = sentiments

Device set to use cpu

### Get embedding for similarity search
Embedding model: [sentence-transformers/distiluse-base-multilingual-cased](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased)

In [14]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["This is an example", "Đây là một ví dụ", "C'est un exemple", "这是一个例子"]

model = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased')
embeddings = model.encode(sentences)

for i, sentence in enumerate(sentences):
    print(f"Sentences: {sentence}")
    # compare sentence i against every sentence, itself included
    for j in range(len(sentences)):
        print(f"{j}: {cosine_similarity([embeddings[i]], [embeddings[j]])[0][0]:.4f}")



Sentences: This is an example
0: 1.0000
1: 0.9409
2: 0.9583
3: 0.9228
Sentences: Đây là một ví dụ
0: 0.9409
1: 1.0000
2: 0.9728
3: 0.9859
Sentences: C'est un exemple
0: 0.9583
1: 0.9728
2: 1.0000
3: 0.9804
Sentences: 这是一个例子
0: 0.9228
1: 0.9859
2: 0.9804
3: 1.0000


In [26]:
print(embeddings[0].shape)

(512,)


In [15]:
lyrics_embed = []
for text in list_lyrics_df['lyrics']:
    lyrics_embed.append(model.encode(text))

list_lyrics_df['lyrics_embed'] = lyrics_embed
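
The notebook stores the embeddings but does not show the search step. A minimal sketch of ranking songs by cosine similarity to a query vector follows, written in plain Python so it runs without the model; `cosine`, `top_k_similar`, and the 3-dimensional toy vectors are illustrative assumptions, not the notebook's actual data.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(query_vec, titled_vecs, k=3):
    """Return the k (title, similarity) pairs most similar to query_vec."""
    scored = [(title, cosine(query_vec, vec)) for title, vec in titled_vecs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional vectors standing in for the 512-dimensional embeddings.
toy_songs = [
    ("song_a", [1.0, 0.0, 0.0]),
    ("song_b", [0.0, 1.0, 0.0]),
    ("song_c", [1.0, 1.0, 0.0]),
]
for title, score in top_k_similar([1.0, 0.1, 0.0], toy_songs, k=2):
    print(f"{title}: {score:.4f}")
```

With the real data, the pairs could be built from `zip(list_lyrics_df['title'], list_lyrics_df['lyrics_embed'])` and the query vector from `model.encode(...)` on a mood string.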

### TF-IDF to get top keywords

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features=50)
all_lyrics = " ".join(list_lyrics_df['lyrics'])  # join with a space so words at song boundaries do not merge
lyrics_tfidf = vectorizer.fit_transform([all_lyrics])
query = vectorizer.get_feature_names_out()
query.tolist()

['ah',
 'aheh',
 'away',
 'cause',
 'come',
 'day',
 'deserves',
 'disappear',
 'don',
 'feel',
 'gay',
 'gonna',
 'good',
 'goodbye',
 'got',
 'hear',
 'heart',
 'home',
 'just',
 'know',
 'learn',
 'let',
 'life',
 'like',
 'little',
 'll',
 'look',
 'love',
 'make',
 'man',
 'need',
 'new',
 'oh',
 'said',
 'say',
 'shot',
 'story',
 'summer',
 'tell',
 'things',
 'think',
 'time',
 've',
 'wanna',
 'want',
 'way',
 'woah',
 'won',
 'world',
 'yeah']
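
Because the vectorizer above is fitted on a single concatenated string, every term appears in exactly one document, so the IDF part is the same for all terms and the result is effectively the 50 most frequent non-stop-word tokens. A minimal from-scratch sketch that treats each song as its own document is shown here; `tokenize`, `top_keywords`, the smoothed IDF formula, and the toy lyrics are my own illustrative choices, and the ASCII-only tokenizer would need replacing for non-English lyrics.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase ASCII word tokenizer (toy version, not suitable for all languages)."""
    return re.findall(r"[a-z']+", text.lower())


def top_keywords(docs, k=5, stop_words=frozenset()):
    """Rank words by their TF-IDF score summed over all documents."""
    tokenized = [[w for w in tokenize(doc) if w not in stop_words] for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents each word appears in.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = Counter()
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        for word, count in counts.items():
            # Smoothed IDF: words found in every document get the lowest weight.
            idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
            scores[word] += (count / total) * idf
    return [word for word, _ in scores.most_common(k)]


toy_lyrics = [
    "love love love oh yeah",
    "oh yeah summer summer",
    "oh yeah goodbye goodbye goodbye",
]
print(top_keywords(toy_lyrics, k=3))
```

In the toy input, filler words such as "oh" and "yeah" occur in every song and therefore rank below words that are specific to one song.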

The [text2text](https://github.com/artitw/text2text#tf-idf) library supports multiple languages, including Vietnamese and Korean.

In [None]:
# example: https://colab.research.google.com/drive/1RaWj5SqWvyC2SsCTGg8IAVcl9G5hOB50?usp=sharing
import text2text as t2t

top_kw = t2t.Tfidfer().transform(all_lyrics)

kw_query = []
for kw in top_kw:
    kw_query += list(kw.keys())

kw_query

### Get recommendations using the Spotify API search function

In [None]:
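# take up to 50 random top keywords
kw_query = random.sample(kw_query, min(50, len(kw_query)))
# join with spaces so the keywords stay separate words in the search query
search_recs = sp.search(q=" ".join(kw_query), type="track", limit=30)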
recs = []
for track in search_recs['tracks']['items']:
    recs.append({
        'name': track['name'],
        'artist': track['artists'][0]['name']
    })
pd.DataFrame(recs)

### Summarization 

Instead of top keywords, use a brief query that summarizes the playlist.

In [21]:
def text_summarizer(text, num_sen = 1):
    languages = stopwords.fileids() # list of supported languages
    stopWords = set(stopwords.words(languages))
    
    sentences = text.split('.')
        
    words = word_tokenize(text)
    words = [word for word in words if word not in stopWords]
    
    fdict = FreqDist(words) # frequency distribution
    
    # assign scores to sentences based on word frequencies
    sentence_scores = [sum(fdict[word] for word in word_tokenize(sentence) if word in fdict) for sentence in sentences]
    sentence_scores = list(enumerate(sentence_scores))
    
    # sort descending
    sorted_sentences = sorted(sentence_scores, key = lambda x: x[1], reverse = True)
    
    # Randomly pick `num_sen` of the 10 highest-scoring sentences for the summary
    random_sentences = random.sample(sorted_sentences[:10], num_sen)

    # Sort the randomly selected sentences based on their original order in the text
    summary_sentences = sorted(random_sentences, key=lambda x: x[0])

    # Create the summary
    summary = ' '.join([sentences[i] for i, _ in summary_sentences])

    return summary

In [22]:
summary = []
for lyrics in list_lyrics_df['lyrics']:
    summary.append(text_summarizer(lyrics))
list_lyrics_df['summary'] = summary
list_lyrics_df

Unnamed: 0.1,Unnamed: 0,title,lyrics,negative,neutral,positive,compound,lyrics_embed,summary
0,0,In a Crowd of Thousands,it was june. i was ten. i still think of that ...,0.022,0.843,0.135,0.9851,"[0.0065606576, 0.051238872, -0.07534626, -0.01...",in that crowd of thousands
1,1,A Rumor in St. Petersburg,the neva flows a new wind blows. and soon it w...,0.074,0.752,0.174,0.9983,"[-0.0106800245, -0.011425582, 0.06578234, -0.0...",it's genuine romanov i could never part with it
2,2,Wake Up,i don't wanna wake up. i want you spread out o...,0.065,0.699,0.236,0.9953,"[0.045773048, -0.02060557, -0.02354185, -0.011...",i don't wanna wake up
3,3,No,my mind is invaded. my gates are ignored. my t...,0.192,0.756,0.052,-0.9688,"[-0.008551443, -0.043120474, -0.023232775, 0.0...",do you not understand
4,4,Perfect,sometimes is never quite enough. if you're fla...,0.052,0.679,0.269,0.9915,"[-0.04474713, -0.0011016198, -0.06103466, -0.0...",that wasn't fast enough
...,...,...,...,...,...,...,...,...,...
92,92,All Of You,look at this home we need a new foundation. it...,0.037,0.774,0.189,0.9985,"[-0.06417248, 0.039549667, 0.011905943, 0.0079...",that's what i'm always saying bro
93,93,"Colombia, Mi Encanto",. . colombia. . . noche de fiesta todos vienen...,0.008,0.956,0.035,0.7906,"[0.018363135, 0.023797853, -0.0005636969, -0.0...",aheh aheh aheh aheh aheh aheh aheh aheh encanto
94,94,Two Oruguitas,two oruguitas in love and yearning. spend ever...,0.011,0.900,0.089,0.9726,"[0.0007829615, -0.12832017, 0.0033471142, -0.0...",ay mariposas don't you hold on too tight
95,95,Goodbye (Live),oken. this is goodbye. . it's my happy ending....,0.088,0.782,0.130,0.9202,"[-0.010454756, -0.0073421397, -0.013250745, -0...",there's only one word left to sing


In [26]:
mood = "The princess to your princehood"
lyrics_sample = list_lyrics_df.sample(6)['summary']
mood = mood + "." + ".".join(lyrics_sample)
mood

"The princess to your princehood. at night all alone in my dreams. straightforward straight girl. you will always be my little brothers. feels like we could go on for forever this way this way. but it's beautiful and it's mine. tommy spoken"

In [27]:
mood = mood[:250] if len(mood) > 250 else mood
len(mood)

239
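
The slice `mood[:250]` can cut the query in the middle of a word or line when the text is longer than the cap. A minimal sketch that cuts at the last sentence separator inside the limit is shown here; `truncate_at_sentence` is a hypothetical helper, and the 250-character default only mirrors the cap used above.

```python
def truncate_at_sentence(text, limit=250, sep="."):
    """Shorten text to at most `limit` characters, cutting at the last separator if possible."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    last = cut.rfind(sep)
    # Fall back to a hard cut when no separator appears inside the limit.
    return cut[:last] if last > 0 else cut


query = "first line. second line. third line that is long"
print(truncate_at_sentence(query, limit=30))  # first line. second line
```

This keeps whole summary lines in the search query at the cost of sometimes using fewer characters than the cap allows.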

In [28]:
search_recs = sp.search(q = mood, type="track", limit=30)

recs = []
for track in search_recs['tracks']['items']:
    recs.append({
        'name': track['name'],
        'artist': track['artists'][0]['name']
    })
pd.DataFrame(recs)

Unnamed: 0,name,artist
0,Puppy Princess,Hot Freaks
1,Hope Shines Eternal,Sunset Shimmer
2,Princess,フォーエイト48
3,Right There in Front of Me,Twilight Sparkle
4,Legend You Are Meant to Be,Sunset Shimmer
5,난 널 사랑해 너만 사랑해 2,Stay
6,Legend of Everfree,Sunset Shimmer
7,Chs Rally Song,Rainbow Dash
8,追随光 (电视剧 《你微笑时很美》 主题曲),Krystal Chan
9,望歸人（網劇《少女大人》片尾曲）,葉炫清
