## Setup Environment

In [1]:
# Required packages for this project
# !pip install pandas \
#             nltk \
#             gensim \
#             scikit-learn \
#             numpy

In [2]:
import pandas as pd
import nltk
# nltk.download('punkt')  # run once before the first use of word_tokenize
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from sklearn.preprocessing import StandardScaler
import numpy as np

## Load dataset of songs

Dataset: https://www.kaggle.com/datasets/rodolfofigueroa/spotify-12m-songs, an open-source dataset on Kaggle. It contains about 1.2 million Spotify songs, retrieved through the Spotify API.

In [3]:
file_path = '../tracks_features.csv'
songs_df = pd.read_csv(file_path)
print(songs_df.head())

                       id                   name                      album  \
0  7lmeHLHBe4nmXzuXc0HDjk                Testify  The Battle Of Los Angeles   
1  1wsRitfRRtWyEapl0q22o8        Guerrilla Radio  The Battle Of Los Angeles   
2  1hR0fIFK2qRG3f3RF70pb7       Calm Like a Bomb  The Battle Of Los Angeles   
3  2lbASgTSoDO7MTuLAXlTW0              Mic Check  The Battle Of Los Angeles   
4  1MQTmpYOZ6fcMQc56Hdo7T  Sleep Now In the Fire  The Battle Of Los Angeles   

                 album_id                       artists  \
0  2eia0myWFgoHuttJytCxgX  ['Rage Against The Machine']   
1  2eia0myWFgoHuttJytCxgX  ['Rage Against The Machine']   
2  2eia0myWFgoHuttJytCxgX  ['Rage Against The Machine']   
3  2eia0myWFgoHuttJytCxgX  ['Rage Against The Machine']   
4  2eia0myWFgoHuttJytCxgX  ['Rage Against The Machine']   

                   artist_ids  track_number  disc_number  explicit  \
0  ['2d0hyoQ5ynDBnkvAbJKORj']             1            1     False   
1  ['2d0hyoQ5ynDBnkvAbJKORj'] 

## Preprocessing data

We select the columns we need and convert the categorical values into numeric ones so that every song can be represented as a vector embedding.
The selected columns are the `id`, 2 categorical features (name and artists), and 14 numeric audio features:
- id (unique index)
- name
- artists
- danceability
- energy
- key
- loudness
- mode
- speechiness
- acousticness
- instrumentalness
- liveness
- valence
- tempo
- duration_ms
- time_signature
- year

In [4]:
selected_features_df = songs_df.drop(columns=["album", "album_id", "artist_ids", "track_number", "disc_number", "explicit", "release_date"])
print(selected_features_df.head())

                       id                   name  \
0  7lmeHLHBe4nmXzuXc0HDjk                Testify   
1  1wsRitfRRtWyEapl0q22o8        Guerrilla Radio   
2  1hR0fIFK2qRG3f3RF70pb7       Calm Like a Bomb   
3  2lbASgTSoDO7MTuLAXlTW0              Mic Check   
4  1MQTmpYOZ6fcMQc56Hdo7T  Sleep Now In the Fire   

                        artists  danceability  energy  key  loudness  mode  \
0  ['Rage Against The Machine']         0.470   0.978    7    -5.399     1   
1  ['Rage Against The Machine']         0.599   0.957   11    -5.764     1   
2  ['Rage Against The Machine']         0.315   0.970    7    -5.424     1   
3  ['Rage Against The Machine']         0.440   0.967   11    -5.830     0   
4  ['Rage Against The Machine']         0.426   0.929    2    -6.729     1   

   speechiness  acousticness  instrumentalness  liveness  valence    tempo  \
0       0.0727       0.02610          0.000011    0.3560    0.503  117.906   
1       0.1880       0.01290          0.000071    0.1550    0.

In [5]:
# check if our filtered features contain any missing value
selected_features_df.isna().any()

id                  False
name                 True
artists             False
danceability        False
energy              False
key                 False
loudness            False
mode                False
speechiness         False
acousticness        False
instrumentalness    False
liveness            False
valence             False
tempo               False
duration_ms         False
time_signature      False
year                False
dtype: bool

In [6]:
# drop the rows that contain missing values
print("Shape before drop missing value: ", selected_features_df.shape)
selected_features_df = selected_features_df.dropna()
print("Shape after drop missing value: ", selected_features_df.shape)

Shape before drop missing value:  (1204025, 17)
Shape after drop missing value:  (1204022, 17)


In [7]:
# some rows have 0 as the year; filter those rows out as well
selected_features_df = selected_features_df[selected_features_df['year'] != 0] 
print("Shape after drop invalid year: ", selected_features_df.shape)

Shape after drop invalid year:  (1204012, 17)


Some songs have multiple artists, stored as a stringified list. We convert each list into a plain comma-separated string.
Example: ['Pietro Locatelli', 'Capella Istropolitana', 'Jaroslav Krcek'] becomes 'Pietro Locatelli, Capella Istropolitana, Jaroslav Krcek'

In [8]:
def convert_artists_name(artists_list):
    items_list = artists_list.strip("[]").replace("'", "").split(", ")
    return ", ".join(items_list)

selected_features_df["artists"] = selected_features_df["artists"].apply(convert_artists_name)
selected_features_df.iloc[1184]["artists"]

'Pietro Locatelli, Capella Istropolitana, Jaroslav Krcek'
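
Note that `replace("'", "")` also removes apostrophes that belong to an artist's name. A minimal alternative sketch, assuming every `artists` cell is a valid Python list literal as in the sample rows; `artists_to_string` and the second example name are hypothetical and not part of the notebook:

```python
import ast

def artists_to_string(artists_literal):
    """Parse a stringified Python list of artist names and join them with ', '."""
    names = ast.literal_eval(artists_literal)
    return ", ".join(names)

print(artists_to_string("['Pietro Locatelli', 'Capella Istropolitana', 'Jaroslav Krcek']"))
# An illustrative name containing an apostrophe is preserved as written
print(artists_to_string('["O\'Example"]'))
```

Swapping this in would change the resulting strings for such names, so the duplicate count further down could differ slightly.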

In [9]:
# remove duplicated rows by song name and artists name
selected_features_df = selected_features_df.drop_duplicates(subset=['name', 'artists'])
print("Shape after duplicated removal: ", selected_features_df.shape)

Shape after duplicated removal:  (1141542, 17)


## Create vector embeddings model

### Create categorical feature vector embeddings

We first convert the song and artist names into vectors. Each vector has length 14, matching the 14 numeric columns it will later be concatenated with. The song name and the artist names are combined into one column before tokenizing.

In [10]:
# perform tokenization operation on the song name and artist columns
def create_tokenized_summary(df, name_col, artist_col):
    # Combine song name and artists columns into a new 'string_summary' column
    df['string_summary'] = df[name_col] + ' - ' + df[artist_col]
    df['string_summary'] = df['string_summary'].astype(str)

    # Drop the original 'name' and 'artists' columns
    df.drop([name_col, artist_col], axis=1, inplace=True)

    # Convert string summaries to lowercase and then tokenize
    df['tokenized_summary'] = df['string_summary'].apply(lambda x: word_tokenize(x.lower()))

In [11]:
# Convert a tokenized summary to a vector by averaging the vectors of its tokens
def get_summary_vector(summary, model):
    summary_vector = [model.wv[word] for word in summary if word in model.wv]
    # Fall back to a zero vector when none of the tokens is in the vocabulary
    return sum(summary_vector) / len(summary_vector) if summary_vector else np.zeros(model.wv.vector_size)

In [12]:
def clean_tokenized_summary(df):
    df.drop(['string_summary', 'tokenized_summary'], axis=1, inplace=True)

In [13]:
create_tokenized_summary(selected_features_df, 'name', 'artists')

In [14]:
# Define Word2Vec model parameters (may adjust later)
vector_size = 14
window_size = 5
min_count = 1

# Train Word2Vec model
word2vec_model = Word2Vec(selected_features_df['tokenized_summary'], vector_size=vector_size, window=window_size, min_count=min_count)

In [15]:
summary_vector = selected_features_df['tokenized_summary'].apply(lambda x: get_summary_vector(x, word2vec_model))
clean_tokenized_summary(selected_features_df)
print(summary_vector[0])

[ 2.746404    0.58580625 -1.5455157   1.8050575   0.5681629  -0.7359647
 -0.6501735   2.0985696   0.1496685  -0.36264658  0.01724947  0.70348424
 -2.1843083   0.6678187 ]
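
The summary vector is the element-wise mean of the vectors of all known tokens. The sketch below illustrates that averaging with a plain dictionary standing in for the trained model; `mean_pool` and `toy_vectors` are hypothetical names and the 3-dimensional values are made up for illustration:

```python
import numpy as np

def mean_pool(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical 3-dimensional vectors, standing in for a trained Word2Vec vocabulary
toy_vectors = {
    "guerrilla": np.array([1.0, 0.0, 2.0]),
    "radio": np.array([3.0, 2.0, 0.0]),
}

# The "-" token is not in the toy vocabulary, so it is ignored
pooled = mean_pool(["guerrilla", "radio", "-"], toy_vectors, dim=3)
print(pooled)  # [2. 1. 1.]
```

Because the mean ignores word order, two titles with the same tokens in a different order get the same summary vector.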


### Create numerical features vector embeddings

The numerical columns are audio characteristics of the song. We standardize them (zero mean and unit variance per column) so they can serve as the numeric part of the embedding.

In [16]:
def scaled_numeric_columns(df):
    # Standardize the numeric columns
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(df)
    return scaled_data

In [17]:
# Extract the numeric columns (everything except 'id')
numeric_columns = selected_features_df.drop(['id'], axis=1)
scaled_columns = scaled_numeric_columns(numeric_columns)
# Display the scaled values of the first song
print(scaled_columns[0])

[-0.11474546  1.59347817  0.5103876   0.92309668  0.70122776 -0.1040544
 -1.09743536 -0.76135497  0.86473578  0.28529403  0.01103821 -0.23857829
  0.30032326 -0.80939027]


### Merge vector embeddings to create the final one

Finally, we concatenate the summary vector (name & artists) with the scaled vector (audio characteristics) to form the embedding for each song.

In [18]:
def merged_embeddings(summary_vector, scaled_columns):
    song_embeddings = [
        np.concatenate([summary_row, scaled_row])
        for summary_row, scaled_row in zip(summary_vector, scaled_columns)
    ]
    print("First song's embedding: ", song_embeddings[0])
    print("Size for entire dataset: ", len(song_embeddings), ", ", len(song_embeddings[0]))
    return song_embeddings

In [19]:
song_embeddings = merged_embeddings(summary_vector, scaled_columns)

First song's embedding:  [ 2.74640393  0.58580625 -1.54551566  1.80505753  0.56816292 -0.73596472
 -0.65017349  2.09856963  0.1496685  -0.36264658  0.01724947  0.70348424
 -2.18430829  0.66781873 -0.11474546  1.59347817  0.5103876   0.92309668
  0.70122776 -0.1040544  -1.09743536 -0.76135497  0.86473578  0.28529403
  0.01103821 -0.23857829  0.30032326 -0.80939027]
Size for entire dataset:  1141542 ,  28


In [20]:
# Combine everything into the final table to upload to Pinecone. The table has two columns: the song id and its embedding.
embedded_features = selected_features_df[["id"]].copy()
embedded_features.loc[:, "values"] = song_embeddings
print(embedded_features.head())
print(embedded_features.shape)

                       id                                             values
0  7lmeHLHBe4nmXzuXc0HDjk  [2.746403932571411, 0.5858062505722046, -1.545...
1  1wsRitfRRtWyEapl0q22o8  [2.894444704055786, 0.8064025044441223, -1.256...
2  1hR0fIFK2qRG3f3RF70pb7  [3.7159016132354736, 0.9189034700393677, -1.91...
3  2lbASgTSoDO7MTuLAXlTW0  [2.688814878463745, 0.6879237294197083, -1.393...
4  1MQTmpYOZ6fcMQc56Hdo7T  [3.273085832595825, 0.5289910435676575, -1.764...
(1141542, 2)


## Prepare dataset for searching similar songs

Two different search strategies:
1. Combine all history songs into one embedding for the query and get the top 10 recommendations
2. Convert each individual song into its own embedding and run 10 queries, taking the top 1 recommendation from each

Two query sources:
1. Personal favorite song & listening history
2. Spotify 2023 top hit 100 songs

Pinecone search metrics:
1. Cosine
2. Euclidean
3. Dotproduct
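
For reference, the three metrics can be written out with NumPy as below. This is only a sketch of the underlying definitions; the function names are hypothetical, and the exact score Pinecone reports for each metric may be scaled or transformed differently:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Straight-line distance between a and b; smaller means more similar."""
    return float(np.linalg.norm(a - b))

def dot_product(a, b):
    """Sum of element-wise products; grows with both alignment and length."""
    return float(np.dot(a, b))

query = np.array([1.0, 0.0])
same_direction = np.array([3.0, 0.0])

print(cosine_similarity(query, same_direction))   # 1.0
print(euclidean_distance(query, same_direction))  # 2.0
print(dot_product(query, same_direction))         # 3.0
```

The two vectors point the same way, so cosine calls them identical, while Euclidean distance and dot product both react to the difference in length.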

### Prepare Spotify top 100 song data

Get the most streamed songs in 2023 (datasets: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023/data, https://www.kaggle.com/datasets/amitanshjoshi/spotify-1million-tracks)

In [21]:
# This dataset has no loudness column, so the audio features are taken from a second dataset later
file_path_top_songs = '../spotify-2023.csv'
top_songs = pd.read_csv(file_path_top_songs, encoding='latin-1')
list(top_songs.columns)

['track_name',
 'artist(s)_name',
 'artist_count',
 'released_year',
 'released_month',
 'released_day',
 'in_spotify_playlists',
 'in_spotify_charts',
 'streams',
 'in_apple_playlists',
 'in_apple_charts',
 'in_deezer_playlists',
 'in_deezer_charts',
 'in_shazam_charts',
 'bpm',
 'key',
 'mode',
 'danceability_%',
 'valence_%',
 'energy_%',
 'acousticness_%',
 'instrumentalness_%',
 'liveness_%',
 'speechiness_%']

In [22]:
# get the top 10 hit songs of 2023 among those released between 2015 and 2022
filtered_songs = top_songs[(top_songs['released_year'] > 2014) & (top_songs['released_year'] < 2023)]
top_10_songs = filtered_songs.sort_values(by = "streams", ascending = False).iloc[:10,:]

# Remove the last song from top_10_songs because it is not in the all-songs dataset
top_10_songs = top_10_songs.iloc[:-1, :]

# Get the next song in the sorted order
next_song = filtered_songs.sort_values(by="streams", ascending=False).iloc[10:11, :]

# Concatenate top_10_songs and next_song
top_10_songs = pd.concat([top_10_songs, next_song], ignore_index=True)
top_10_songs

Unnamed: 0,track_name,artist(s)_name,artist_count,released_year,released_month,released_day,in_spotify_playlists,in_spotify_charts,streams,in_apple_playlists,...,bpm,key,mode,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%
0,Anti-Hero,Taylor Swift,1,2022,10,21,9082,56,999748277,242,...,97,E,Major,64,51,63,12,0,19,5
1,Arcade,Duncan Laurence,1,2019,3,7,6646,0,991336132,107,...,72,A,Minor,45,27,33,82,0,14,4
2,Glimpse of Us,Joji,1,2022,6,10,6330,6,988515741,109,...,170,G#,Major,44,27,32,89,0,14,5
3,Seek & Destroy,SZA,1,2022,12,9,1007,0,98709329,5,...,152,C#,Major,65,35,65,44,18,21,7
4,"Come Back Home - From ""Purple Hearts""",Sofia Carson,1,2022,7,12,367,0,97610446,28,...,145,G,Major,56,43,53,24,0,12,4
5,Where Are You Now,"Lost Frequencies, Calum Scott",2,2021,7,30,10565,44,972509632,238,...,121,F#,Minor,67,26,64,52,0,17,10
6,Alone,Burna Boy,1,2022,11,4,782,2,96007391,27,...,90,E,Minor,61,32,67,15,0,11,5
7,No Lie,"Sean Paul, Dua Lipa",2,2016,11,18,7370,0,956865266,92,...,102,G,Major,74,45,89,5,0,26,13
8,HEARTBREAK ANNIVERSARY,Giveon,1,2020,2,21,5398,4,951637566,111,...,129,,Major,61,59,46,56,0,13,5
9,Used (feat. Don Toliver),"SZA, Don Toliver",2,2022,12,8,1042,0,94005786,7,...,150,A#,Minor,73,71,69,53,0,32,9
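
In the table above, `98709329` ranks above `972509632`, which is the order a text sort produces. This suggests that `streams` was read as strings rather than numbers. A minimal sketch of a numeric sort, assuming that is the cause; the toy frame reuses three values from the table, and `streams_num` is a hypothetical column name:

```python
import pandas as pd

# Toy frame with stream counts stored as text, values copied from the table above
toy = pd.DataFrame({
    "track_name": ["Seek & Destroy", "Where Are You Now", "Anti-Hero"],
    "streams": ["98709329", "972509632", "999748277"],
})

# Sorting the text column compares character by character
as_text = toy.sort_values(by="streams", ascending=False)["track_name"].tolist()

# Convert to numbers first; errors="coerce" turns unparseable entries into NaN
toy["streams_num"] = pd.to_numeric(toy["streams"], errors="coerce")
as_number = toy.sort_values(by="streams_num", ascending=False)["track_name"].tolist()

print(as_text)    # ['Anti-Hero', 'Seek & Destroy', 'Where Are You Now']
print(as_number)  # ['Anti-Hero', 'Where Are You Now', 'Seek & Destroy']
```

Sorting numerically would select a different set of ten songs than the one shown here.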


In [23]:
# extract the names of the top 10 songs to look them up in the all-songs dataset
top_10_songs_to_search = top_10_songs[['track_name', 'artist(s)_name']]
top_10_songs_to_search = top_10_songs_to_search.rename(columns={'artist(s)_name': 'artist_name'})

# Keep only the first listed artist, because the other dataset stores a single artist per track
top_10_songs_to_search['artist_name'] = top_10_songs_to_search['artist_name'].str.split(',').str[0]

top_10_songs_to_search

Unnamed: 0,track_name,artist_name
0,Anti-Hero,Taylor Swift
1,Arcade,Duncan Laurence
2,Glimpse of Us,Joji
3,Seek & Destroy,SZA
4,"Come Back Home - From ""Purple Hearts""",Sofia Carson
5,Where Are You Now,Lost Frequencies
6,Alone,Burna Boy
7,No Lie,Sean Paul
8,HEARTBREAK ANNIVERSARY,Giveon
9,Used (feat. Don Toliver),SZA


In [24]:
# manually adjust the track names that are written differently in the two datasets
top_10_songs_to_search.loc[4, "track_name"] = "Come Back Home"
top_10_songs_to_search.loc[8, "track_name"] = "Heartbreak Anniversary"
top_10_songs_to_search

Unnamed: 0,track_name,artist_name
0,Anti-Hero,Taylor Swift
1,Arcade,Duncan Laurence
2,Glimpse of Us,Joji
3,Seek & Destroy,SZA
4,Come Back Home,Sofia Carson
5,Where Are You Now,Lost Frequencies
6,Alone,Burna Boy
7,No Lie,Sean Paul
8,Heartbreak Anniversary,Giveon
9,Used (feat. Don Toliver),SZA


In [25]:
file_path_all_songs = '../spotify_data.csv'
all_songs = pd.read_csv(file_path_all_songs, index_col = 0)
print(all_songs.head())

     artist_name        track_name                track_id  popularity  year  \
0     Jason Mraz   I Won't Give Up  53QF56cjZA9RTuuMZDrSA6          68  2012   
1     Jason Mraz  93 Million Miles  1s8tP3jP4GZcyHDsjvw218          50  2012   
2  Joshua Hyslop  Do Not Let Me Go  7BRCa8MPiyuvr2VU3O9W0F          57  2012   
3   Boyce Avenue          Fast Car  63wsZUhUZLlh1OsyrZq7sz          58  2012   
4   Andrew Belle  Sky's Still Blue  6nXIYClvJAfi6ujLiKqEq8          54  2012   

      genre  danceability  energy  key  loudness  mode  speechiness  \
0  acoustic         0.483   0.303    4   -10.058     1       0.0429   
1  acoustic         0.572   0.454    3   -10.286     1       0.0258   
2  acoustic         0.409   0.234    3   -13.711     1       0.0323   
3  acoustic         0.392   0.251   10    -9.845     1       0.0363   
4  acoustic         0.430   0.791    6    -5.419     0       0.0302   

   acousticness  instrumentalness  liveness  valence    tempo  duration_ms  \
0        0.694

In [26]:
# Look up the complete audio features for the top songs
selected_10_songs = pd.merge(all_songs, top_10_songs_to_search, on=['track_name', 'artist_name'], how='inner')
selected_10_songs

Unnamed: 0,artist_name,track_name,track_id,popularity,year,genre,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,Sean Paul,No Lie,1Vb4HQnN2kZ5Y2KgYF5TDV,57,2016,dance,0.742,0.882,7,-2.862,1,0.117,0.0466,0.0,0.206,0.463,102.04,221176,4
1,Duncan Laurence,Arcade,1Xi84slp6FryDSCbzq4UCD,77,2019,pop,0.45,0.329,9,-12.603,0,0.0441,0.818,0.00109,0.135,0.266,71.884,183624,3
2,Giveon,Heartbreak Anniversary,3FAJ6O0NOHQV8Mc5Ri6ENp,79,2020,pop,0.449,0.465,0,-8.964,1,0.0791,0.524,1e-06,0.303,0.543,89.087,198371,3
3,Lost Frequencies,Where Are You Now,3uUuGVFu1V7jTQL60S1r8z,84,2021,dance,0.671,0.636,6,-8.117,0,0.103,0.515,0.000411,0.172,0.262,120.966,148197,4
4,Burna Boy,Alone,0AoBY2Y3qs6dtGgOD6c91N,77,2022,dance,0.6,0.659,4,-7.264,0,0.0542,0.176,0.0,0.111,0.307,89.955,221747,4
5,Taylor Swift,Anti-Hero,0V3wPSX9ygBnCm8psDIegu,92,2022,pop,0.637,0.643,4,-6.571,1,0.0519,0.13,2e-06,0.142,0.533,97.008,200690,4
6,SZA,Seek & Destroy,6eT2V7nKXyMf47TwPbtgAD,79,2022,pop,0.651,0.647,1,-5.415,1,0.0654,0.437,0.175,0.205,0.345,152.069,203733,4
7,Joji,Glimpse of Us,6xGruZOHLs39ZbVccQTuPZ,85,2022,pop,0.44,0.317,8,-9.258,1,0.0531,0.891,5e-06,0.141,0.268,169.914,233456,3
8,SZA,Used (feat. Don Toliver),1TweDM3JC49LNeelLVg3yX,76,2022,pop,0.734,0.689,10,-6.454,0,0.0871,0.532,8.5e-05,0.322,0.705,149.579,70160,4
9,Sofia Carson,Come Back Home,1I4dwH7C0jBAEtz5DjlJgQ,73,2022,pop,0.552,0.531,7,-7.732,1,0.0421,0.241,1.2e-05,0.122,0.438,144.946,176859,4
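
An inner merge silently drops any requested song that has no exact name and artist match. One way to spot such songs is a left merge with `indicator=True`, sketched below on toy data; the frames and the unmatched entry are hypothetical:

```python
import pandas as pd

# Songs we want to look up (the second one is made up and has no match)
wanted = pd.DataFrame({
    "track_name": ["Anti-Hero", "Not In Catalogue"],
    "artist_name": ["Taylor Swift", "Nobody"],
})

# Small stand-in for the all-songs dataset
catalogue = pd.DataFrame({
    "track_name": ["Anti-Hero", "Arcade"],
    "artist_name": ["Taylor Swift", "Duncan Laurence"],
    "tempo": [97.008, 71.884],
})

# indicator=True adds a `_merge` column saying where each row was found
checked = pd.merge(wanted, catalogue, on=["track_name", "artist_name"],
                   how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", ["track_name", "artist_name"]]
print(missing)
```

Rows listed in `missing` are the ones that need a manual name fix or a replacement song.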


In [27]:
top_10_songs_to_search = selected_10_songs[['artist_name', 'track_name']]
top_10_songs_to_search

Unnamed: 0,artist_name,track_name
0,Sean Paul,No Lie
1,Duncan Laurence,Arcade
2,Giveon,Heartbreak Anniversary
3,Lost Frequencies,Where Are You Now
4,Burna Boy,Alone
5,Taylor Swift,Anti-Hero
6,SZA,Seek & Destroy
7,Joji,Glimpse of Us
8,SZA,Used (feat. Don Toliver)
9,Sofia Carson,Come Back Home


In [28]:
# drop unused columns and move 'year' to the end so the column order matches the indexed dataset
def format_dataset(df):
    df = df.drop(columns=["track_id", "popularity", "genre"])
    moved_column = df.pop("year")
    df["year"] = moved_column
    return df

In [29]:
selected_10_songs = format_dataset(selected_10_songs)
create_tokenized_summary(selected_10_songs, 'track_name', 'artist_name')
top_10_summary_vector = selected_10_songs['tokenized_summary'].apply(lambda x: get_summary_vector(x, word2vec_model))
clean_tokenized_summary(selected_10_songs)
print(top_10_summary_vector[0])

[ 2.3160021 -0.2143867 -1.2958574 -0.3584495 -2.334128   0.6242531
  2.2698038  1.9013895  0.7554226 -1.120798   1.0283349  2.695714
 -1.2928238  2.043832 ]


In [30]:
top_10_songs_scaled = scaled_numeric_columns(selected_10_songs)
# Display the scaled values of the first song
print(top_10_songs_scaled[0])

[ 1.3650332   1.84352071  0.4463037   1.91004076  0.81649658  1.93012033
 -1.42607907 -0.33672787  0.28590593  0.35268685 -0.52754349  0.78345199
  0.65465367 -2.54399491]
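
`scaled_numeric_columns` fits a new `StandardScaler` on every call, so these ten songs are standardized against their own mean and standard deviation rather than against the statistics of the indexed songs. A possible alternative, sketched here with NumPy only, is to compute the statistics once on the reference data and reuse them for queries. `fit_standardizer`, `apply_standardizer`, and the two-column toy data are hypothetical:

```python
import numpy as np

def fit_standardizer(reference):
    """Return per-column mean and (population) standard deviation."""
    mean = reference.mean(axis=0)
    std = reference.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return mean, std

def apply_standardizer(rows, mean, std):
    """Standardize rows with statistics computed elsewhere."""
    return (rows - mean) / std

# Hypothetical two-column data (think tempo and loudness)
corpus = np.array([[100.0, -10.0], [120.0, -6.0], [140.0, -2.0]])
queries = np.array([[120.0, -6.0], [130.0, -4.0]])

mean, std = fit_standardizer(corpus)
scaled_queries = apply_standardizer(queries, mean, std)
print(scaled_queries)
```

With scikit-learn the same idea amounts to keeping the fitted `StandardScaler` object and calling its `transform` method on the query rows instead of `fit_transform`.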


In [31]:
top_10_song_embeddings = merged_embeddings(top_10_summary_vector, top_10_songs_scaled)

First song's embedding:  [ 2.31600213 -0.2143867  -1.29585743 -0.35844949 -2.3341279   0.62425309
  2.26980376  1.90138948  0.75542259 -1.12079799  1.02833486  2.695714
 -1.29282379  2.04383206  1.3650332   1.84352071  0.4463037   1.91004076
  0.81649658  1.93012033 -1.42607907 -0.33672787  0.28590593  0.35268685
 -0.52754349  0.78345199  0.65465367 -2.54399491]
Size for entire dataset:  10 ,  28


In [32]:
# mean aggregation method
mean_top_10_song_embeddings = np.mean(top_10_song_embeddings, axis = 0)

### Prepare individual personal song data

Seanna's top 10 favorite songs span a variety of genres and styles:
1. Teeth - 5 Seconds of Summer
2. I WANNA BE YOUR SLAVE - Måneskin
3. Enemy - from the series Arcane League of Legends - Imagine Dragons
4. Say Something - A Great Big World
5. Marry You - Bruno Mars
6. Gotta Have You - The Weepies
7. 100 Degrees - Rich Brian
8. The Monster - Eminem
9. You Belong With Me - Taylor Swift
10. Bailando - Spanish Version - Enrique Iglesias

In [33]:
seanna_data = {
    'track_name': [
        'Teeth',
        'I WANNA BE YOUR SLAVE',
        'Enemy - from the series Arcane League of Legends',
        'Say Something',
        'Marry You',
        'Gotta Have You',
        '100 Degrees',
        'The Monster',
        'You Belong With Me',
        'Bailando - Spanish Version'
    ],
    'artist_name': [
        '5 Seconds of Summer',
        'Måneskin',
        'Imagine Dragons',
        'A Great Big World',
        'Bruno Mars',
        'The Weepies',
        'Rich Brian',
        'Eminem',
        'Taylor Swift',
        'Enrique Iglesias'
    ]
}

# Create DataFrame
seanna_favorite_songs = pd.DataFrame(seanna_data)

In [34]:
seanna_favorite_songs = pd.merge(all_songs, seanna_favorite_songs, on=['track_name', 'artist_name'], how='inner')
seanna_favorite_songs

Unnamed: 0,artist_name,track_name,track_id,popularity,year,genre,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,Eminem,The Monster,48RrDBpOSSl1aLVCalGl5C,78,2013,hip-hop,0.781,0.853,1,-3.68,0,0.0715,0.0525,0.0,0.12,0.624,110.049,250189,4
1,A Great Big World,Say Something,78TKtlSLWK8pZAKKW3MyQL,56,2013,piano,0.453,0.146,2,-8.976,1,0.0343,0.867,3e-06,0.0945,0.0915,137.905,229400,3
2,Enrique Iglesias,Bailando - Spanish Version,32lm3769IRfcnrQV11LO4E,67,2014,pop,0.723,0.777,7,-3.503,1,0.108,0.0426,4e-06,0.0451,0.961,91.017,243413,4
3,5 Seconds of Summer,Teeth,26wLOs3ZuHJa2Ihhx6QIE6,76,2019,dance,0.756,0.448,3,-2.993,0,0.0404,0.0508,4e-06,0.11,0.431,139.031,204887,4
4,Rich Brian,100 Degrees,2ZDpSQfBdgkooeXw6oj3Uz,57,2019,hip-hop,0.756,0.648,0,-5.287,1,0.0731,0.118,0.0,0.515,0.657,80.979,166146,4
5,Måneskin,I WANNA BE YOUR SLAVE,4pt5fDVTg5GhEvEtlz9dKk,81,2021,indie-pop,0.75,0.608,1,-4.008,1,0.0387,0.00165,0.0,0.178,0.958,132.507,173347,4
6,Imagine Dragons,Enemy - from the series Arcane League of Legends,45lFaFCHXmpCiiMDvtihIv,1,2023,rock,0.728,0.783,11,-4.424,0,0.266,0.237,0.0,0.434,0.555,77.011,173381,4
7,The Weepies,Gotta Have You,1YjMWOorkBaP4MdKkKtp4y,50,2005,acoustic,0.678,0.363,11,-10.9,1,0.0318,0.872,0.000101,0.0798,0.543,75.004,199787,5
8,Taylor Swift,You Belong With Me,3GCL1PydwsLodcpv0Ll1ch,68,2008,pop,0.687,0.783,6,-4.44,1,0.0386,0.162,1.3e-05,0.114,0.443,129.964,231133,4
9,Bruno Mars,Marry You,22PMfvdz35fFKYnJyMn077,74,2010,dance,0.621,0.82,10,-4.865,1,0.0367,0.332,0.0,0.104,0.452,144.905,230192,4


In [35]:
seanna_favorite_songs_to_search = seanna_favorite_songs[['artist_name', 'track_name']]
seanna_favorite_songs_to_search

Unnamed: 0,artist_name,track_name
0,Eminem,The Monster
1,A Great Big World,Say Something
2,Enrique Iglesias,Bailando - Spanish Version
3,5 Seconds of Summer,Teeth
4,Rich Brian,100 Degrees
5,Måneskin,I WANNA BE YOUR SLAVE
6,Imagine Dragons,Enemy - from the series Arcane League of Legends
7,The Weepies,Gotta Have You
8,Taylor Swift,You Belong With Me
9,Bruno Mars,Marry You


In [36]:
seanna_favorite_songs = format_dataset(seanna_favorite_songs)
create_tokenized_summary(seanna_favorite_songs, 'track_name', 'artist_name')
seanna_summary_vector = seanna_favorite_songs['tokenized_summary'].apply(lambda x: get_summary_vector(x, word2vec_model))
clean_tokenized_summary(seanna_favorite_songs)

seanna_songs_scaled = scaled_numeric_columns(seanna_favorite_songs)
seanna_songs_embeddings = merged_embeddings(seanna_summary_vector, seanna_songs_scaled)
mean_seanna_song_embeddings = np.mean(seanna_songs_embeddings, axis = 0)

First song's embedding:  [ 2.18169737  0.22475466 -1.17476308  1.14897251 -0.20428205 -0.55098599
 -0.45541486  1.33938098  0.45629466 -0.69557673  0.78355527  1.24850774
 -2.04820204  0.08276778  0.95631581  1.03607778 -1.01388955  0.6677343
 -1.52752523 -0.03537872 -0.70704936 -0.42204769 -0.39115239  0.21535434
 -0.06651864  1.35618345  0.         -0.26832816]
Size for entire dataset:  10 ,  28


Yuhan's top 10 favorite songs share a similar genre and style:
1. Anti-Hero - Taylor Swift
2. Lover - Taylor Swift
3. Question...? - Taylor Swift
4. deja vu - Olivia Rodrigo
5. RADIO - HENRY
6. Wonderful U - AGA
7. Forever Young - Eve Ai
8. Something's Wrong with the Morning - Margo Guryan
9. The Most Beautiful Thing - Bruno Major
10. At My Worst - Pink Sweat$

In [37]:
yuhan_data = {
    'track_name': [
        'Anti-Hero',
        'Lover',
        'Question...?',
        'deja vu',
        'RADIO',
        'Wonderful U',
        'Forever Young',
        "Something's Wrong with the Morning",
        'The Most Beautiful Thing',
        'At My Worst'
    ],
    'artist_name': [
        'Taylor Swift',
        'Taylor Swift',
        'Taylor Swift',
        'Olivia Rodrigo',
        'HENRY',
        'AGA',
        'Eve Ai',
        'Margo Guryan',
        'Bruno Major',
        'Pink Sweat$'
    ]
}

# Create DataFrame
yuhan_favorite_songs = pd.DataFrame(yuhan_data)
yuhan_favorite_songs = pd.merge(all_songs, yuhan_favorite_songs, on=['track_name', 'artist_name'], how='inner')
yuhan_favorite_songs

Unnamed: 0,artist_name,track_name,track_id,popularity,year,genre,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,Margo Guryan,Something's Wrong with the Morning,0IqQoCYYaSeM2ThWKPGoXX,52,2014,pop,0.656,0.567,2,-8.128,0,0.0352,0.682,0.000315,0.106,0.71,133.558,105573,4
1,AGA,Wonderful U,2eSNpIOFoi1Q8wxw6CycXJ,47,2016,cantopop,0.557,0.436,6,-8.569,1,0.0676,0.809,0.0,0.151,0.246,179.997,248551,3
2,Eve Ai,Forever Young,25sQT3yCEgd1uE6LC9ivcs,51,2018,singer-songwriter,0.304,0.226,0,-10.707,1,0.0329,0.929,0.0,0.161,0.323,139.593,313907,4
3,Taylor Swift,Lover,1dGr1c8CrMLDpV6mPbImSI,83,2019,pop,0.359,0.543,7,-7.582,1,0.0919,0.492,1.6e-05,0.118,0.453,68.534,221307,4
4,Pink Sweat$,At My Worst,0ri0Han4IRJhzvERHOZTMr,71,2020,chill,0.813,0.415,0,-5.926,1,0.0349,0.777,0.0,0.131,0.667,91.921,170345,4
5,HENRY,RADIO,4Dyb1oDEx4togM79cHL8UK,48,2020,k-pop,0.761,0.766,0,-5.414,1,0.143,0.118,0.0,0.111,0.266,146.879,191985,4
6,Bruno Major,The Most Beautiful Thing,07koEqsKHZTlGVMC9eoEjO,67,2020,pop,0.806,0.362,7,-10.386,1,0.0344,0.541,0.0489,0.111,0.418,127.498,235427,4
7,Olivia Rodrigo,deja vu,6HU7h9RYOaPRFeh0R3UeAr,83,2021,pop,0.442,0.612,2,-7.222,1,0.112,0.584,6e-06,0.37,0.178,180.917,215507,4
8,Taylor Swift,Anti-Hero,0V3wPSX9ygBnCm8psDIegu,92,2022,pop,0.637,0.643,4,-6.571,1,0.0519,0.13,2e-06,0.142,0.533,97.008,200690,4
9,Taylor Swift,Question...?,0heeNYlwOGuUSe7TgUD27B,74,2022,pop,0.751,0.502,7,-8.763,1,0.167,0.2,0.0,0.296,0.106,108.943,210557,4


In [38]:
yuhan_favorite_songs_to_search = yuhan_favorite_songs[['artist_name', 'track_name']]
yuhan_favorite_songs_to_search

Unnamed: 0,artist_name,track_name
0,Margo Guryan,Something's Wrong with the Morning
1,AGA,Wonderful U
2,Eve Ai,Forever Young
3,Taylor Swift,Lover
4,Pink Sweat$,At My Worst
5,HENRY,RADIO
6,Bruno Major,The Most Beautiful Thing
7,Olivia Rodrigo,deja vu
8,Taylor Swift,Anti-Hero
9,Taylor Swift,Question...?


In [39]:
yuhan_favorite_songs = format_dataset(yuhan_favorite_songs)
create_tokenized_summary(yuhan_favorite_songs, 'track_name', 'artist_name')
yuhan_summary_vector = yuhan_favorite_songs['tokenized_summary'].apply(lambda x: get_summary_vector(x, word2vec_model))
clean_tokenized_summary(yuhan_favorite_songs)

yuhan_songs_scaled = scaled_numeric_columns(yuhan_favorite_songs)
yuhan_songs_embeddings = merged_embeddings(yuhan_summary_vector, yuhan_songs_scaled)
mean_yuhan_song_embeddings = np.mean(yuhan_songs_embeddings, axis = 0)

First song's embedding:  [ 3.79312015  0.443066   -1.0267359   1.2555654  -0.88453609 -0.05224168
  0.15272658  0.52212608  2.03030443 -1.15595007  0.75490409  1.32537699
 -2.10083055 -0.04597534  0.26800525  0.40781972 -0.51601569 -0.12088828
 -3.         -0.89490988  0.5636146  -0.31440234 -0.74871726  1.66240608
  0.17398382 -2.07684355  0.33333333 -2.13000299]
Size for entire dataset:  10 ,  28


## Store Embeddings to Pinecone

In [40]:
# !pip install -qU \
#   "pinecone-client[grpc]"==2.2.1

In [41]:
import os
import pinecone
import time


In [42]:
# Read the API key from the environment instead of hardcoding it (the variable name is an assumption)
PINECONE_API_KEY = os.environ['PINECONE_API_KEY']
PINECONE_ENV = 'us-east-1-aws'

In [43]:
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENV
)

### Store embeddings to Pinecone - Cosine

In [44]:
index_name = 'music-recommender-cosine'
dim = len(embedded_features['values'][0])

In [45]:
# only create index if it doesn't exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=dim,
        metric='cosine'
    )
    # wait a moment for the index to be fully initialized
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)

In [47]:
# now connect to the index
index_c = pinecone.GRPCIndex(index_name)
index_c.describe_index_stats()

{'dimension': 28,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [48]:
index_c.upsert_from_dataframe(embedded_features, batch_size=1000)

sending upsert requests:   0%|          | 0/1141542 [00:00<?, ?it/s]

collecting async responses:   0%|          | 0/1142 [00:00<?, ?it/s]

upserted_count: 1141542
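
The second progress bar counts 1142 responses, which is consistent with 1,141,542 rows being sent in batches of 1000. The sketch below only reproduces that arithmetic and a generic batching loop; it is not Pinecone's implementation, and both function names are hypothetical:

```python
import math

def batch_count(n_rows, batch_size):
    """Number of batches needed to cover n_rows."""
    return math.ceil(n_rows / batch_size)

def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

print(batch_count(1141542, 1000))                          # 1142
print([len(b) for b in iter_batches(list(range(5)), 2)])   # [2, 2, 1]
```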

In [49]:
index_c.describe_index_stats()

{'dimension': 28,
 'index_fullness': 0.2,
 'namespaces': {'': {'vector_count': 1141542}},
 'total_vector_count': 1141542}

### Store embeddings to Pinecone - Euclidean

In [50]:
index_name = 'music-recommender-euclidean'
dim = len(embedded_features['values'][0])

In [52]:
# only create index if it doesn't exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=dim,
        metric='euclidean'
    )
    # wait a moment for the index to be fully initialized
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)

In [53]:
# now connect to the index
index_e = pinecone.GRPCIndex(index_name)
index_e.describe_index_stats()

{'dimension': 28,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [54]:
index_e.upsert_from_dataframe(embedded_features, batch_size=1000)

sending upsert requests:   0%|          | 0/1141542 [00:00<?, ?it/s]

collecting async responses:   0%|          | 0/1142 [00:00<?, ?it/s]

upserted_count: 1141542

In [55]:
index_e.describe_index_stats()

{'dimension': 28,
 'index_fullness': 0.2,
 'namespaces': {'': {'vector_count': 1141542}},
 'total_vector_count': 1141542}

### Store embeddings to Pinecone - Dotproduct

In [56]:
index_name = 'music-recommender-dotproduct'
dim = len(embedded_features['values'][0])

In [57]:
# only create index if it doesn't exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=dim,
        metric='dotproduct'
    )
    # wait a moment for the index to be fully initialized
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)

In [58]:
# now connect to the index
index_d = pinecone.GRPCIndex(index_name)
index_d.describe_index_stats()

{'dimension': 28,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [59]:
index_d.upsert_from_dataframe(embedded_features, batch_size=1000)

sending upsert requests:   0%|          | 0/1141542 [00:00<?, ?it/s]

collecting async responses:   0%|          | 0/1142 [00:00<?, ?it/s]

upserted_count: 1141542

In [60]:
index_d.describe_index_stats()

{'dimension': 28,
 'index_fullness': 0.2,
 'namespaces': {'': {'vector_count': 1140000}},
 'total_vector_count': 1140000}

## Query

In [65]:
# display query results in a dataframe with recommended songs and similarity scores
def combined_query_result(index, song_embeddings):
    query_response = index.query(song_embeddings, top_k=10, include_metadata=True)
    result_songs = []
    for match in query_response['matches']:
        # look up the matched id in the original songs table
        song = songs_df[songs_df['id'] == match['id']]
        result_songs.append([song['name'].item(), song['artists'].item(), str(match['score'])])
    return pd.DataFrame(result_songs, columns=['song_name', 'artists', 'similarity_score'])

def individual_query_result(index, query_songs, query_embeddings):
    result = []
    fav_song_names = query_songs['track_name'].tolist()
    fav_song_artists = query_songs['artist_name'].tolist()

    for i in range(len(fav_song_names)):
        # ask for two matches so the second can be used when the first is the query song itself
        xc = index.query(query_embeddings[i], top_k=2, include_metadata=True)
        best = xc['matches'][0]
        song = songs_df[songs_df['id'] == best['id']]
        if fav_song_names[i] == song['name'].item() and fav_song_artists[i] in song['artists'].item():
            best = xc['matches'][1]
            song = songs_df[songs_df['id'] == best['id']]
        result.append([fav_song_names[i], fav_song_artists[i], song['name'].item(), song['artists'].item(), str(best['score'])])

    return pd.DataFrame(result, columns=['fav_song', 'artists', 'match_name', 'match_artists', 'similarity_score'])

### Query - Dotproduct

#### Combined Song Vector as a Single Query

In [66]:
# personal listening history top 10 averaged - Yuhan
combined_query_result(index_d, mean_yuhan_song_embeddings)

Unnamed: 0,song_name,artists,similarity_score
0,What You Want It to Be,['Sugar'],34.438194
1,Life's What You Make It,['Baby'],33.90038
2,You Can't Take What You Don't Have (You Don't ...,['Whiteheart'],33.849274
3,If You Don't Love Me Like You Say You Love Me,['Betty Wright'],33.57661
4,You You You You You,['The 6ths'],33.40665
5,You Don't Know What Love Is,['Jimmy Johnson'],33.36376
6,Don't You Want To Know,['Oh So'],33.29406
7,You Can't Get What You Want,['Joe Jackson'],33.26461
8,Don't You Want It,['Five'],33.150494
9,Can't Take It With You When You Go,['Mike Love'],33.124035
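
The query vector used here, `mean_yuhan_song_embeddings`, is built earlier in the notebook. Judging by its name and the "averaged" comment, it is presumably the element-wise mean of the ten individual song embeddings. A minimal sketch of that averaging, using made-up 4-dimensional vectors (the real embeddings have 28 dimensions):

```python
import numpy as np

# three made-up song embeddings, one row per song (illustrative values only)
song_embeddings = np.array([
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 2.0, 0.0,  1.0],
    [2.0, 4.0, 1.0,  0.0],
])

# averaging over axis 0 collapses the songs into one vector of the same dimension
mean_embedding = song_embeddings.mean(axis=0)
print(mean_embedding)        # [2. 2. 1. 0.]
print(mean_embedding.shape)  # (4,)
```

An averaged vector need not resemble any single song in the list, which may be one reason the combined results look unrelated to the individual favorites.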


In [67]:
# personal listening history top 10 averaged - Seanna
combined_query_result(index_d, mean_seanna_song_embeddings)

Unnamed: 0,song_name,artists,similarity_score
0,You Can't Take What You Don't Have (You Don't ...,['Whiteheart'],43.265182
1,You You You You You,['The 6ths'],43.222973
2,Don't You Want It,['Five'],43.167645
3,What You Want It to Be,['Sugar'],42.696896
4,Don't You Want It,['Lovers'],42.38134
5,Don't Let It Get You Down,['America'],42.100937
6,Life's What You Make It,['Baby'],42.079002
7,You Can't Always Get What You Want,['Aretha Franklin'],42.039806
8,You Can't Get What You Want,['Joe Jackson'],42.0105
9,If You Don't Love Me Like You Say You Love Me,['Betty Wright'],42.008553


In [68]:
# spotify top 10 averaged
combined_query_result(index_d, mean_top_10_song_embeddings)

Unnamed: 0,song_name,artists,similarity_score
0,What You Want It to Be,['Sugar'],33.897263
1,Can't Take It With You When You Go,['Mike Love'],33.8061
2,Don't Let It Get You Down,['America'],33.800323
3,You Can't Take What You Don't Have (You Don't ...,['Whiteheart'],33.770912
4,You You You You You,['The 6ths'],33.460445
5,You Can't Take It When You Go,['Dave Mason'],33.364513
6,You Can't Get What You Want,['Joe Jackson'],33.33207
7,Don't You Want To Know,['Oh So'],33.2719
8,Life's What You Make It,['Baby'],33.227203
9,Don't Want You to Go,['Angel'],33.019917
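
The three dot-product result lists above overlap heavily even though the query vectors come from different listeners. One possible reason, not verified in this notebook, is that the dot product grows with vector length, so songs whose embeddings have a large norm can outrank songs that point in a more similar direction. A toy illustration in two dimensions with made-up vectors:

```python
import numpy as np

query = np.array([1.0, 1.0])

candidates = {
    'same_direction_short': np.array([0.5, 0.5]),   # parallel to the query, small norm
    'other_direction_long': np.array([10.0, 0.0]),  # 45 degrees off, large norm
}

def cosine(a, b):
    """Cosine similarity: dot product of the two vectors after length normalization."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dot_scores = {name: float(np.dot(query, v)) for name, v in candidates.items()}
cos_scores = {name: cosine(query, v) for name, v in candidates.items()}

best_by_dot = max(dot_scores, key=dot_scores.get)
best_by_cos = max(cos_scores, key=cos_scores.get)
print(best_by_dot)  # other_direction_long
print(best_by_cos)  # same_direction_short
```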


#### Individual Song Vector as a Single Query

In [69]:
# personal listening history top 10, one by one - Yuhan
individual_query_result(index_d, yuhan_favorite_songs_to_search, yuhan_songs_embeddings)

Unnamed: 0,fav_song,artists,match_name,match_artists,similarity_score
0,Something's Wrong with the Morning,Margo Guryan,If You Don't Love Me Like You Say You Love Me,['Betty Wright'],61.160202
1,Wonderful U,AGA,You Don't Have To Drop A Heart To Break It,['The Ravens'],58.211906
2,Forever Young,Eve Ai,"Music for Dreams, Volume One",['The Lovely Moon'],100.98441
3,Lover,Taylor Swift,You Can't Take It,['Linda Jones'],40.952503
4,At My Worst,Pink Sweat$,Just Can't Take It,['Restless'],51.316917
5,RADIO,HENRY,Conversation,['Peter Peter'],80.25086
6,The Most Beautiful Thing,Bruno Major,Major Major Major,['The Jac'],106.24206
7,deja vu,Olivia Rodrigo,Indiana (En Vivo),['Hombres G'],37.27895
8,Anti-Hero,Taylor Swift,Sam Hall,['Kevin Evans'],30.86615
9,Question...?,Taylor Swift,"It's Not What You Did, It's Who You Are",['Howard Zinn'],53.591736


In [70]:
# personal listening history top 10, one by one - Seanna
individual_query_result(index_d, seanna_favorite_songs_to_search, seanna_songs_embeddings)

Unnamed: 0,fav_song,artists,match_name,match_artists,similarity_score
0,The Monster,Eminem,Monstercat Podcast Ep. 086 (Staff Picks 2015),['Monstercat Call of the Wild'],64.65959
1,Say Something,A Great Big World,You Go Your Way,['Philip Glass'],82.10528
2,Bailando - Spanish Version,Enrique Iglesias,Bargrooves Lounge (Continuous Mix 1),['Various Artists'],64.14451
3,Teeth,5 Seconds of Summer,1 For 2 For 1,['D-Styles'],53.986347
4,100 Degrees,Rich Brian,Too Slow Blues,['John Evans Band'],42.168617
5,I WANNA BE YOUR SLAVE,Måneskin,Do What You Say You're Gonna Do,['SaraBeth'],81.52937
6,Enemy - from the series Arcane League of Legends,Imagine Dragons,The First Present,"[""It's a Death Metal X-mas""]",63.110268
7,Gotta Have You,The Weepies,But What Can You Do,['Beautumn'],86.349884
8,You Belong With Me,Taylor Swift,Don't Let It Get You Down,['America'],72.11739
9,Marry You,Bruno Mars,Until You Say You Love Me,['Aretha Franklin'],44.59065


In [71]:
# spotify top 10 1by1
individual_query_result(index_d, top_10_songs_to_search, top_10_song_embeddings)

Unnamed: 0,fav_song,artists,match_name,match_artists,similarity_score
0,No Lie,Sean Paul,. . . So We . . .,['Illuminandi'],63.13508
1,Arcade,Duncan Laurence,Segment,['Mark E. Smith'],59.822723
2,Heartbreak Anniversary,Giveon,Mad Girl - Live,['Emilie Autumn'],39.142895
3,Where Are You Now,Lost Frequencies,Don't You Want It,['Lovers'],79.98185
4,Alone,Burna Boy,Where You Go I Go Too,['Lindstrøm'],49.52886
5,Anti-Hero,Taylor Swift,Brother Down,['Sam Roberts Band'],30.321486
6,Seek & Destroy,SZA,For The Family (Instrumental Version),['K-DEF'],41.068905
7,Glimpse of Us,Joji,"Music for Dreams, Volume One",['The Lovely Moon'],60.15206
8,Used (feat. Don Toliver),SZA,Baby D Intro (feat. Baby D),"['Bobby Gore', 'Baby D.']",67.32645
9,Come Back Home,Sofia Carson,Don't Let It Get You Down,['America'],61.75722


In [76]:
top_10_song_embeddings[5]

array([ 0.95477927,  0.12518816,  0.26304179,  1.10655558, -0.890567  ,
       -1.19093466,  0.081604  , -0.5917865 , -0.05989695, -1.51069283,
        0.85497081,  0.5811221 , -2.04147458,  1.32498038,  0.40567252,
        0.38554106, -0.51006137,  0.39044806,  0.81649658, -0.72634549,
       -1.1167231 , -0.33669355, -0.6244413 ,  0.84644843, -0.68645582,
        0.32974362,  0.65465367,  0.63599873])

In [77]:
yuhan_songs_embeddings[8]

array([ 0.95477927,  0.12518816,  0.26304179,  1.10655558, -0.890567  ,
       -1.19093466,  0.081604  , -0.5917865 , -0.05989695, -1.51069283,
        0.85497081,  0.5811221 , -2.04147458,  1.32498038,  0.16057699,
        0.92611902,  0.17200523,  0.81461395,  0.33333333, -0.53805708,
       -1.43327411, -0.33576803, -0.32558035,  0.74288772, -0.87309328,
       -0.20991622,  0.33333333,  1.14692469])
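
The two vectors above both belong to "Anti-Hero" (row 5 of the Spotify top 10 and row 8 of Yuhan's list), yet they are not identical. Comparing them element by element shows that the first 14 components agree and the last 14 do not. A plausible reading, which the notebook does not state, is that one half holds the text-derived features and the other half the 14 numeric audio features, and that the differing half was derived separately for each query set. The check itself, with the values copied from the two outputs:

```python
import numpy as np

anti_hero_top10 = np.array([
     0.95477927,  0.12518816,  0.26304179,  1.10655558, -0.890567  ,
    -1.19093466,  0.081604  , -0.5917865 , -0.05989695, -1.51069283,
     0.85497081,  0.5811221 , -2.04147458,  1.32498038,  0.40567252,
     0.38554106, -0.51006137,  0.39044806,  0.81649658, -0.72634549,
    -1.1167231 , -0.33669355, -0.6244413 ,  0.84644843, -0.68645582,
     0.32974362,  0.65465367,  0.63599873])

anti_hero_yuhan = np.array([
     0.95477927,  0.12518816,  0.26304179,  1.10655558, -0.890567  ,
    -1.19093466,  0.081604  , -0.5917865 , -0.05989695, -1.51069283,
     0.85497081,  0.5811221 , -2.04147458,  1.32498038,  0.16057699,
     0.92611902,  0.17200523,  0.81461395,  0.33333333, -0.53805708,
    -1.43327411, -0.33576803, -0.32558035,  0.74288772, -0.87309328,
    -0.20991622,  0.33333333,  1.14692469])

# element-wise comparison with NumPy's default floating-point tolerance
same = np.isclose(anti_hero_top10, anti_hero_yuhan)
print(same[:14].all())  # True
print(same[14:].any())  # False
```

Because the query vectors differ, the same song returns different matches and scores in the two dot-product tables (30.86615 for Yuhan's list and 30.321486 for the Spotify top 10).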

### Query - Euclidean

#### Combined Song Vector as a Single Query

In [81]:
# personal listening history top 10 averaged - Yuhan
combined_query_result(index_e, mean_yuhan_song_embeddings)

Unnamed: 0,song_name,artists,similarity_score
0,The Taco Song,['Nate Hale'],3.6114178
1,There's Nothing Like A Show on Broadway,"['Various Artists', 'Mel Brooks']",4.380821
2,Slow Boat,['Lennie Gallant'],4.5975456
3,I Am Mister Pip,['Robbie Tucker'],4.781353
4,Good King Wenceslas,['Skydiggers'],4.834404
5,Erase Me,"['Said the Sky', 'NÉONHÈART']",4.914715
6,Ordinary Day,['Hal Ketchum'],4.975313
7,You Are My Sunshine,"['Magnificent Sevenths', 'Milton Batiste', 'Al...",5.028639
8,Truth or Consequences,['Sumaia Jackson'],5.1045303
9,Happy Anniversary,['Becky Schlegel'],5.147007
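
Note that the scores in this table increase from top to bottom, whereas the dot-product scores decreased. With the Euclidean metric the returned score is a distance, so smaller means more similar, even though the helper labels the column `similarity_score`. A small sketch of a metric-aware label and ordering follows; `score_label` and `sort_best_first` are hypothetical helpers, not part of the notebook or the Pinecone client.

```python
def score_label(metric):
    """Column name that reflects how the score should be read."""
    return 'distance' if metric == 'euclidean' else 'similarity_score'

def sort_best_first(matches, metric):
    """Order (name, score) pairs so the best match comes first for the given metric."""
    higher_is_better = metric != 'euclidean'
    return sorted(matches, key=lambda m: m[1], reverse=higher_is_better)

# two rows copied from the Euclidean table above
euclidean_rows = [('Slow Boat', 4.5975456), ('The Taco Song', 3.6114178)]
print(score_label('euclidean'))                          # distance
print(sort_best_first(euclidean_rows, 'euclidean')[0])   # ('The Taco Song', 3.6114178)
```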


In [82]:
# personal listening history top 10 averaged - Seanna
combined_query_result(index_e, mean_seanna_song_embeddings)

Unnamed: 0,song_name,artists,similarity_score
0,There's Nothing Like A Show on Broadway,"['Various Artists', 'Mel Brooks']",3.9513474
1,Torment In Me,['Leaving Dionysus'],5.0317726
2,Catch Me,['Vicki Genfan'],5.0534134
3,Hillary Right,['Little Champions'],5.134693
4,Good King Wenceslas,['Skydiggers'],5.2687836
5,You Wear the Jesus,['Electro Spectre'],5.29834
6,January's Gone,['Bucks Fizz'],5.3161163
7,Fool's Game,['Belladonic Haze'],5.381874
8,Teardrops From My Eyes,['Veronica Martell'],5.446373
9,Linger Longer,['The Clean'],5.470455


In [83]:
# spotify top 10 averaged
combined_query_result(index_e, mean_top_10_song_embeddings)

Unnamed: 0,song_name,artists,similarity_score
0,The Taco Song,['Nate Hale'],3.9900131
1,Love,['Sasha Ilyukevich & The Highly Skilled Migran...,4.2678013
2,We Hung The Moon,['Jesse Dayton & Brennen Leigh'],4.405098
3,There's Nothing Like A Show on Broadway,"['Various Artists', 'Mel Brooks']",4.4356575
4,Truth or Consequences,['Sumaia Jackson'],4.4555817
5,Georgia In My Heart,['Dr Ika & Temur Kvitelashvili'],4.4678726
6,Me and Donnie Vee,['Phil Angotti'],4.9150696
7,Erase Me,"['Said the Sky', 'NÉONHÈART']",5.0514565
8,"Grandma, Grandpa And Me",['Kid Pan Alley'],5.0682755
9,See You on Rooftops - Alt Version,['Neil Halstead'],5.20689


#### Individual Song Vector as a Single Query

In [78]:
# personal listening history top 10, one by one - Yuhan
individual_query_result(index_e, yuhan_favorite_songs_to_search, yuhan_songs_embeddings)

Unnamed: 0,fav_song,artists,match_name,match_artists,similarity_score
0,Something's Wrong with the Morning,Margo Guryan,Your Love Is Like a Flower,['Herschel Sizemore'],10.503426
1,Wonderful U,AGA,That He Wrote,['Downy Mildew'],9.833748
2,Forever Young,Eve Ai,Miss Me,['Bitch and Animal'],15.354469
3,Lover,Taylor Swift,Country Boy,['Cole Prior Stevens'],5.436756
4,At My Worst,Pink Sweat$,1-800,['Bad Bad Hats'],7.838249
5,RADIO,HENRY,Radio,['Michael Sweet'],20.314468
6,The Most Beautiful Thing,Bruno Major,Now A Major Motion Picture,['International Observer'],15.42112
7,deja vu,Olivia Rodrigo,Figuración,['Pedro Aznar'],7.211113
8,Anti-Hero,Taylor Swift,Tunica,['Spencer Robinson'],3.114872
9,Question...?,Taylor Swift,LEAVE ME ALONE,"['KAYTRANADA', 'Shay Lia']",6.6771545


In [79]:
# personal listening history top 10, one by one - Seanna
individual_query_result(index_e, seanna_favorite_songs_to_search, seanna_songs_embeddings)

Unnamed: 0,fav_song,artists,match_name,match_artists,similarity_score
0,The Monster,Eminem,The Coldest Tiles,['Fake The Envy'],2.2480812
1,Say Something,A Great Big World,The Night We Called It A Day,['Cory Jamison'],11.965103
2,Bailando - Spanish Version,Enrique Iglesias,Este Madrid (Versión 1978),['Leño'],7.7511063
3,Teeth,5 Seconds of Summer,Kingdom of Avarice,['Fiction 8'],5.7408524
4,100 Degrees,Rich Brian,Twelve Sticks,['Rev. Gary Davis'],9.7794
5,I WANNA BE YOUR SLAVE,Måneskin,I'm Gonna Dance,['Lexie Green'],5.0640717
6,Enemy - from the series Arcane League of Legends,Imagine Dragons,Remixed Remixes of The Rite of Spring Monkey,['Dos Monos'],6.0059967
7,Gotta Have You,The Weepies,Who Do You Think Will Answer?,['Chuck Brown'],14.017212
8,You Belong With Me,Taylor Swift,Show Me Heaven,['Jessica Andrews'],5.2259903
9,Marry You,Bruno Mars,Us,['Céline Dion'],8.072178


In [80]:
# spotify top 10 1by1
individual_query_result(index_e, top_10_songs_to_search, top_10_song_embeddings)

Unnamed: 0,fav_song,artists,match_name,match_artists,similarity_score
0,No Lie,Sean Paul,Where I Dwell,"['Gangsta Blac', 'DJ Paul', 'JUICY.']",11.745789
1,Arcade,Duncan Laurence,"Remember me, my deir","['Anonymous\xa0', 'Custer LaRue', 'Baltimore C...",5.977783
2,Heartbreak Anniversary,Giveon,Border Village,['Melora Creager'],6.090435
3,Where Are You Now,Lost Frequencies,When You Leave Me,['SoulSonic'],8.506622
4,Alone,Burna Boy,bad.day,['Young Montana?'],4.862171
5,Anti-Hero,Taylor Swift,Rosin,['Jadea Kelly'],2.8585815
6,Seek & Destroy,SZA,Fighting Challenge,['Falcao & Monashee'],7.2206726
7,Glimpse of Us,Joji,Dance of Ganesha,['Ajeet'],6.313774
8,Used (feat. Don Toliver),SZA,You Niggaz Pussy (feat. V Slash),"['Juicy J', 'V Slash']",10.580589
9,Come Back Home,Sofia Carson,Come Back,['Romarzs'],6.1404037


### Query - Cosine

#### Combined Song Vector as a Single Query

In [84]:
# personal listening history top 10 averaged - Yuhan
combined_query_result(index_c, mean_yuhan_song_embeddings)

Unnamed: 0,song_name,artists,similarity_score
0,Love Is the Answer,['Lonnie Liston Smith'],0.91988355
1,This Guy's in Love with You - Remastered,['Oscar Peterson'],0.9193857
2,There's Nothing Like A Show on Broadway,"['Various Artists', 'Mel Brooks']",0.9188681
3,"Like a White Star, Tangled and Far, Tulip That...",['T. Rex'],0.9167306
4,Little Bit of Blue,"['What Time Is It, Mr. Fox?']",0.9139731
5,But Not For Me,['Rod Stewart'],0.9130943
6,This Is the Time,['Rosemary Baby'],0.90977967
7,Look Out For Love,['Annie Sellick'],0.9094874
8,Once Is Never Enough,['Bert Wilson & Rebirth'],0.9090391
9,Give Me Love (give Me Peace On Earth),['Dave Davies'],0.9054414


In [85]:
# personal listening history top 10 averaged - Seanna
combined_query_result(index_c, mean_seanna_song_embeddings)

Unnamed: 0,song_name,artists,similarity_score
0,I Love To Dance,['Langhorne Slim'],0.9341574
1,The Truth About You,['Calamine'],0.9328675
2,Show Me,['46 Long'],0.93216354
3,I Want to Be a Real Cowboy Girl,['The Sweetback Sisters'],0.9316267
4,About You,['The Farmers'],0.93073076
5,The Last Thing You Do,['The Donefors'],0.9286963
6,I'm in Love - Alternate Vocal - Let Me in Your...,['Aretha Franklin'],0.92866284
7,You're The Only One (Who Can Move Me That Way),['Duke Robillard'],0.9279845
8,It's You And Me,['The C-Quents'],0.9278324
9,Get Me to a Monastery,['The Divine Comedy'],0.92694354


In [86]:
# spotify top 10 averaged
combined_query_result(index_c, mean_top_10_song_embeddings)

Unnamed: 0,song_name,artists,similarity_score
0,Outta My Head (And Outta Yours Too),['James Christensen'],0.9401452
1,You Left Me,['Paul & Storm'],0.9281143
2,Alone With You,['Chris Jones & The Night Drivers'],0.9279254
3,(Wine Friend of Mine) Stand By Me,['Johnny Bush & Justin Trevino'],0.9269414
4,(Take Me Back) Mary Jane,['Young Heart Attack'],0.9247651
5,You Never Even Called Me By My Name (The Perfe...,['David Allan Coe'],0.92451304
6,This Is the Start,['Sly & Reggie'],0.9238507
7,We Hung The Moon,['Jesse Dayton & Brennen Leigh'],0.92303854
8,Love Is the Answer,['Lonnie Liston Smith'],0.9217581
9,All Alone,['Randy & The Wolfpack'],0.9208446


#### Individual Song Vector as a Single Query

In [87]:
# personal listening history top 10, one by one - Yuhan
individual_query_result(index_c, yuhan_favorite_songs_to_search, yuhan_songs_embeddings)

Unnamed: 0,fav_song,artists,match_name,match_artists,similarity_score
0,Something's Wrong with the Morning,Margo Guryan,Your Love Is Like a Flower,['Herschel Sizemore'],0.8969703
1,Wonderful U,AGA,What's Good About Good-bye,['Dionne Warwick'],0.89186144
2,Forever Young,Eve Ai,Miss Me,['Bitch and Animal'],0.8754276
3,Lover,Taylor Swift,Country Boy,['Cole Prior Stevens'],0.9185135
4,At My Worst,Pink Sweat$,1-800,['Bad Bad Hats'],0.9107069
5,RADIO,HENRY,Youth Leagues,['Robert Pollard'],0.9094042
6,The Most Beautiful Thing,Bruno Major,Now A Major Motion Picture,['International Observer'],0.86355996
7,deja vu,Olivia Rodrigo,Amor,['Mijares'],0.8883017
8,Anti-Hero,Taylor Swift,Tunica,['Spencer Robinson'],0.93293387
9,Question...?,Taylor Swift,LEAVE ME ALONE,"['KAYTRANADA', 'Shay Lia']",0.9014054


In [88]:
# personal listening history top 10, one by one - Seanna
individual_query_result(index_c, seanna_favorite_songs_to_search, seanna_songs_embeddings)

Unnamed: 0,fav_song,artists,match_name,match_artists,similarity_score
0,The Monster,Eminem,The Coldest Tiles,['Fake The Envy'],0.955262
1,Say Something,A Great Big World,Hard Times Come Again No More,['Cantus'],0.9191566
2,Bailando - Spanish Version,Enrique Iglesias,No Mas - Version Malaga,['El Chojin'],0.8958625
3,Teeth,5 Seconds of Summer,Kingdom of Avarice,['Fiction 8'],0.92771447
4,100 Degrees,Rich Brian,Twelve Sticks,['Rev. Gary Davis'],0.86593324
5,I WANNA BE YOUR SLAVE,Måneskin,I'm Gonna Dance,['Lexie Green'],0.95833045
6,Enemy - from the series Arcane League of Legends,Imagine Dragons,Remixed Remixes of The Rite of Spring Monkey,['Dos Monos'],0.93866104
7,Gotta Have You,The Weepies,Who Do You Think Will Answer?,['Chuck Brown'],0.91706765
8,You Belong With Me,Taylor Swift,I'll go my way by myself,['Junior Cook'],0.9566713
9,Marry You,Bruno Mars,Fill Me Up,['Linda Perry'],0.87168074


In [89]:
# spotify top 10 1by1
individual_query_result(index_c, top_10_songs_to_search, top_10_song_embeddings)

Unnamed: 0,fav_song,artists,match_name,match_artists,similarity_score
0,No Lie,Sean Paul,Where I Dwell,"['Gangsta Blac', 'DJ Paul', 'JUICY.']",0.9065889
1,Arcade,Duncan Laurence,My shepherd is the living Lord,"['Thomas Tomkins', 'Laurence Cummings', 'Oxfor...",0.9271634
2,Heartbreak Anniversary,Giveon,Desert Vampire,['Rasputina'],0.8501829
3,Where Are You Now,Lost Frequencies,This Is What You Get When You Mess With Love,['GusGus'],0.94880205
4,Alone,Burna Boy,bad.day,['Young Montana?'],0.9235947
5,Anti-Hero,Taylor Swift,Harbour,['Lily Wilson'],0.93706644
6,Seek & Destroy,SZA,Fighting Challenge,['Falcao & Monashee'],0.9010477
7,Glimpse of Us,Joji,Dance of Ganesha,['Ajeet'],0.9147812
8,Used (feat. Don Toliver),SZA,Spaceship (feat. Sheck Wes),"['Don Toliver', 'Sheck Wes']",0.9179843
9,Come Back Home,Sofia Carson,Baby Let Me See You Smile,['Patrick Lamb'],0.9355773
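
Across the individual-query tables, the Euclidean and cosine indexes often return the same best match, while the dot-product index returns different songs. As an illustration, the sketch below counts the agreement for Yuhan's ten favorites, with the match names copied in order from the three tables above.

```python
# best match per favorite song, in table order (rows 0-9)
euclidean = [
    "Your Love Is Like a Flower", "That He Wrote", "Miss Me", "Country Boy",
    "1-800", "Radio", "Now A Major Motion Picture", "Figuración",
    "Tunica", "LEAVE ME ALONE",
]
cosine = [
    "Your Love Is Like a Flower", "What's Good About Good-bye", "Miss Me", "Country Boy",
    "1-800", "Youth Leagues", "Now A Major Motion Picture", "Amor",
    "Tunica", "LEAVE ME ALONE",
]
dotproduct = [
    "If You Don't Love Me Like You Say You Love Me",
    "You Don't Have To Drop A Heart To Break It",
    "Music for Dreams, Volume One", "You Can't Take It", "Just Can't Take It",
    "Conversation", "Major Major Major", "Indiana (En Vivo)", "Sam Hall",
    "It's Not What You Did, It's Who You Are",
]

def agreement(a, b):
    """Number of positions at which two result lists name the same song."""
    return sum(x == y for x, y in zip(a, b))

print(agreement(euclidean, cosine))      # 7
print(agreement(euclidean, dotproduct))  # 0
```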
