## Setup Environment

In [12]:
!pip install pandas \
            nltk \
            gensim \
            scikit-learn \
            numpy

Collecting scikit-learn
  Downloading scikit_learn-1.3.2-cp311-cp311-macosx_10_9_x86_64.whl.metadata (11 kB)
Collecting threadpoolctl>=2.0.0 (from scikit-learn)
  Downloading threadpoolctl-3.2.0-py3-none-any.whl.metadata (10.0 kB)
Downloading scikit_learn-1.3.2-cp311-cp311-macosx_10_9_x86_64.whl (10.1 MB)
Downloading threadpoolctl-3.2.0-py3-none-any.whl (15 kB)
Installing collected packages: threadpoolctl, scikit-learn
Successfully installed scikit-learn-1.3.2 threadpoolctl-3.2.0


In [23]:
import pandas as pd
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from sklearn.preprocessing import StandardScaler
import numpy as np

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/zhengyuhan/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Load dataset of songs

Dataset: https://www.kaggle.com/datasets/rodolfofigueroa/spotify-12m-songs, an open-source dataset on Kaggle. It provides nearly 1.2 million songs from Spotify, retrieved using the Spotify API.

In [84]:
file_path = '../tracks_features.csv'
df = pd.read_csv(file_path)
print(df.head())

                       id                   name                      album  \
0  7lmeHLHBe4nmXzuXc0HDjk                Testify  The Battle Of Los Angeles   
1  1wsRitfRRtWyEapl0q22o8        Guerrilla Radio  The Battle Of Los Angeles   
2  1hR0fIFK2qRG3f3RF70pb7       Calm Like a Bomb  The Battle Of Los Angeles   
3  2lbASgTSoDO7MTuLAXlTW0              Mic Check  The Battle Of Los Angeles   
4  1MQTmpYOZ6fcMQc56Hdo7T  Sleep Now In the Fire  The Battle Of Los Angeles   

                 album_id                       artists  \
0  2eia0myWFgoHuttJytCxgX  ['Rage Against The Machine']   
1  2eia0myWFgoHuttJytCxgX  ['Rage Against The Machine']   
2  2eia0myWFgoHuttJytCxgX  ['Rage Against The Machine']   
3  2eia0myWFgoHuttJytCxgX  ['Rage Against The Machine']   
4  2eia0myWFgoHuttJytCxgX  ['Rage Against The Machine']   

                   artist_ids  track_number  disc_number  explicit  \
0  ['2d0hyoQ5ynDBnkvAbJKORj']             1            1     False   
1  ['2d0hyoQ5ynDBnkvAbJKORj'] 

## Preprocessing data

We select the numeric audio features we need and convert the non-numeric values (song name and artists) into numeric ones, so that each song can be represented as a vector embedding.
The selected features include:
- id (not sure if we need this?)
- name
- artists
- danceability
- energy
- key
- loudness
- mode
- speechiness
- acousticness
- instrumentalness
- liveness
- valence
- tempo
- duration_ms
- time_signature
- year (do we want this?)

In [16]:
selected_features = df.drop(columns=["album", "album_id", "artist_ids", "track_number", "disc_number", "explicit", "release_date"])
print(selected_features.head())

                       id                   name  \
0  7lmeHLHBe4nmXzuXc0HDjk                Testify   
1  1wsRitfRRtWyEapl0q22o8        Guerrilla Radio   
2  1hR0fIFK2qRG3f3RF70pb7       Calm Like a Bomb   
3  2lbASgTSoDO7MTuLAXlTW0              Mic Check   
4  1MQTmpYOZ6fcMQc56Hdo7T  Sleep Now In the Fire   

                        artists  danceability  energy  key  loudness  mode  \
0  ['Rage Against The Machine']         0.470   0.978    7    -5.399     1   
1  ['Rage Against The Machine']         0.599   0.957   11    -5.764     1   
2  ['Rage Against The Machine']         0.315   0.970    7    -5.424     1   
3  ['Rage Against The Machine']         0.440   0.967   11    -5.830     0   
4  ['Rage Against The Machine']         0.426   0.929    2    -6.729     1   

   speechiness  acousticness  instrumentalness  liveness  valence    tempo  \
0       0.0727       0.02610          0.000011    0.3560    0.503  117.906   
1       0.1880       0.01290          0.000071    0.1550    0.

In [17]:
# check whether the filtered features contain any missing values
selected_features.isna().any()

id                  False
name                 True
artists             False
danceability        False
energy              False
key                 False
loudness            False
mode                False
speechiness         False
acousticness        False
instrumentalness    False
liveness            False
valence             False
tempo               False
duration_ms         False
time_signature      False
year                False
dtype: bool
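
The check above shows that `name` is the only column with missing values. When `name` is later concatenated with `artists`, a missing name in pandas propagates as NaN and becomes the literal string `'nan'` after `astype(str)`. A minimal plain-Python sketch of a guard is shown below; `build_summary` is a hypothetical helper, not part of the notebook, and falling back to the artists alone is an assumption about the desired behavior.

```python
import math


def build_summary(name, artists):
    """Return 'name - artists', or just the artists when the name is missing.

    Hypothetical helper: a missing name is represented as None or float NaN,
    which is how pandas exposes missing values in an object column.
    """
    is_missing = name is None or (isinstance(name, float) and math.isnan(name))
    if is_missing:
        return artists
    return f"{name} - {artists}"


print(build_summary("Testify", "Rage Against The Machine"))
print(build_summary(float("nan"), "Rage Against The Machine"))
```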

Some songs have multiple artists. The `artists` column stores them as a stringified list, which we convert to a plain comma-separated string.
Example: ['Pietro Locatelli', 'Capella Istropolitana', 'Jaroslav Krcek'] becomes 'Pietro Locatelli, Capella Istropolitana, Jaroslav Krcek'

In [18]:
def convert_artists_name(artists_list):
    items_list = artists_list.strip("[]").replace("'", "").split(", ")
    return ", ".join(items_list)

selected_features["artists"] = selected_features["artists"].apply(convert_artists_name)
selected_features.iloc[1184]["artists"]

'Pietro Locatelli, Capella Istropolitana, Jaroslav Krcek'
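
One caveat with the string-replacement approach: `replace("'", "")` also removes apostrophes that belong to an artist name. A possible alternative, sketched here under the assumption that every value is a valid Python list literal, is to parse the string with `ast.literal_eval`. The name `parse_artists` is hypothetical.

```python
import ast


def parse_artists(raw):
    """Turn a stringified list such as "['A', 'B']" into 'A, B'.

    Falls back to the raw text when it is not a valid Python literal.
    """
    try:
        names = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw
    if isinstance(names, (list, tuple)):
        return ", ".join(str(n) for n in names)
    return str(names)


print(parse_artists("['Pietro Locatelli', 'Capella Istropolitana', 'Jaroslav Krcek']"))
print(parse_artists('["Guns N\' Roses"]'))  # the apostrophe is preserved
```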

In [19]:
# remove duplicate rows (same song name and same artists)
print("Shape before duplicated removal: ", selected_features.shape)
selected_features = selected_features.drop_duplicates(subset=['name', 'artists'])
print("Shape after duplicated removal: ", selected_features.shape)

Shape before duplicated removal:  (1204025, 17)
Shape after duplicated removal:  (1141555, 17)
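
The shapes above show that 62,470 rows were dropped. By default `drop_duplicates` keeps the first occurrence of each (`name`, `artists`) pair. The plain-Python sketch below illustrates that rule on toy rows; `drop_duplicate_songs` is a hypothetical stand-in for illustration, not the pandas implementation.

```python
def drop_duplicate_songs(rows):
    """Keep the first row for each (name, artists) pair, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["name"], row["artists"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


toy_rows = [
    {"id": "a", "name": "Testify", "artists": "Rage Against The Machine"},
    {"id": "b", "name": "Testify", "artists": "Rage Against The Machine"},
    {"id": "c", "name": "Mic Check", "artists": "Rage Against The Machine"},
]
print([row["id"] for row in drop_duplicate_songs(toy_rows)])
```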


In [20]:
print(selected_features.head())
print(selected_features.tail())

                       id                   name                   artists  \
0  7lmeHLHBe4nmXzuXc0HDjk                Testify  Rage Against The Machine   
1  1wsRitfRRtWyEapl0q22o8        Guerrilla Radio  Rage Against The Machine   
2  1hR0fIFK2qRG3f3RF70pb7       Calm Like a Bomb  Rage Against The Machine   
3  2lbASgTSoDO7MTuLAXlTW0              Mic Check  Rage Against The Machine   
4  1MQTmpYOZ6fcMQc56Hdo7T  Sleep Now In the Fire  Rage Against The Machine   

   danceability  energy  key  loudness  mode  speechiness  acousticness  \
0         0.470   0.978    7    -5.399     1       0.0727       0.02610   
1         0.599   0.957   11    -5.764     1       0.1880       0.01290   
2         0.315   0.970    7    -5.424     1       0.4830       0.02340   
3         0.440   0.967   11    -5.830     0       0.2370       0.16300   
4         0.426   0.929    2    -6.729     1       0.0701       0.00162   

   instrumentalness  liveness  valence    tempo  duration_ms  time_signature  \


## Create vectors/embeddings

We first convert the song and artist names into a vector. The vector has length 14, the same as the number of numeric columns it will be combined with. To tokenize the text in one pass, we merge the song name and the artist names into a single column.

In [21]:
selected_features['string_summary'] = selected_features['name'] + ' - ' + selected_features['artists']
selected_features['string_summary'] = selected_features['string_summary'].astype(str)

# Drop the original 'name' and 'artists' columns
selected_features.drop(['name', 'artists'], axis=1, inplace=True)
print(selected_features.head())

                       id  danceability  energy  key  loudness  mode  \
0  7lmeHLHBe4nmXzuXc0HDjk         0.470   0.978    7    -5.399     1   
1  1wsRitfRRtWyEapl0q22o8         0.599   0.957   11    -5.764     1   
2  1hR0fIFK2qRG3f3RF70pb7         0.315   0.970    7    -5.424     1   
3  2lbASgTSoDO7MTuLAXlTW0         0.440   0.967   11    -5.830     0   
4  1MQTmpYOZ6fcMQc56Hdo7T         0.426   0.929    2    -6.729     1   

   speechiness  acousticness  instrumentalness  liveness  valence    tempo  \
0       0.0727       0.02610          0.000011    0.3560    0.503  117.906   
1       0.1880       0.01290          0.000071    0.1550    0.489  103.680   
2       0.4830       0.02340          0.000002    0.1220    0.370  149.749   
3       0.2370       0.16300          0.000004    0.1210    0.574   96.752   
4       0.0701       0.00162          0.105000    0.0789    0.539  127.059   

   duration_ms  time_signature  year  \
0       210133             4.0  1999   
1       206200    

In [24]:
# Convert string summaries to lowercase and then tokenize
selected_features['tokenized_summary'] = selected_features['string_summary'].apply(lambda x: word_tokenize(x.lower()))

In [25]:
# Define Word2Vec model parameters (may adjust later)
vector_size = 14
window_size = 5
min_count = 1

# Train Word2Vec model
word2vec_model = Word2Vec(selected_features['tokenized_summary'], vector_size=vector_size, window=window_size, min_count=min_count)

In [27]:
# Convert string summaries to vectors
def get_summary_vector(summary, model):
    summary_vector = [model.wv[word] for word in summary if word in model.wv]
    return sum(summary_vector) / len(summary_vector) if summary_vector else [0] * vector_size

summary_vector = selected_features['tokenized_summary'].apply(lambda x: get_summary_vector(x, word2vec_model))

In [28]:
selected_features.drop(['string_summary', 'tokenized_summary'], axis=1, inplace=True)
print(summary_vector[0])

[ 0.7233145   0.47999406 -1.6708161  -1.3737744   1.0735456   1.3275248
 -0.41009608  0.37724844 -0.14633438 -1.8264831   0.62498635 -0.54044133
 -2.9819233   2.1566064 ]
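
The summary vector is the element-wise mean of the word vectors of all tokens the model knows, with a zero vector as the fallback when no token is known. The plain-Python sketch below illustrates that pooling step with a toy two-dimensional lookup table; `mean_vector` and `toy_lookup` are hypothetical and stand in for the trained Word2Vec model.

```python
def mean_vector(tokens, lookup, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [lookup[token] for token in tokens if token in lookup]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension together
    return [sum(column) / len(vectors) for column in zip(*vectors)]


toy_lookup = {"rage": [1.0, 0.0], "machine": [0.0, 1.0]}
print(mean_vector(["rage", "machine", "unknown"], toy_lookup, dim=2))
print(mean_vector(["unknown"], toy_lookup, dim=2))
```

Averaging discards word order, so two titles made of the same words in a different order map to the same vector.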


The numeric columns describe audio characteristics of each song. We standardize them so that every feature is on a comparable scale before it becomes part of the embedding.

In [29]:
# Extract the numeric columns (every remaining column except 'id')
numeric_columns = selected_features.drop(['id'], axis=1)

# Standardize the numeric columns
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_columns)

# Display the first row of the scaled array
print(scaled_data[0])

[-0.11476022  1.59348117  0.51038804  0.9231026   0.70123154 -0.10405628
 -1.0974239  -0.76136317  0.8647423   0.28528598  0.01103847 -0.23857505
  0.30032139 -0.70179981]
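
`StandardScaler` rescales each column to zero mean and unit variance, using the population standard deviation. The plain-Python sketch below illustrates the same formula on a single column, using the five `tempo` values printed earlier; `standardize` is a hypothetical helper for illustration only.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


tempo = [117.906, 103.680, 149.749, 96.752, 127.059]
print(standardize(tempo))
```

Without this step, large-valued columns such as `duration_ms` would dominate any distance or similarity computed on the raw values.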


Finally, we concatenate the summary vector (name and artists) with the scaled vector (audio characteristics) to form the embedding for each song.

In [30]:
song_embeddings = [
    np.concatenate([summary_row, scaled_row])
    for summary_row, scaled_row in zip(summary_vector, scaled_data)
]
print(song_embeddings[0])
print(len(song_embeddings), ", ", len(song_embeddings[0]))

[ 0.72331452  0.47999406 -1.67081606 -1.37377441  1.07354558  1.32752478
 -0.41009608  0.37724844 -0.14633438 -1.82648313  0.62498635 -0.54044133
 -2.98192334  2.15660644 -0.11476022  1.59348117  0.51038804  0.9231026
  0.70123154 -0.10405628 -1.0974239  -0.76136317  0.8647423   0.28528598
  0.01103847 -0.23857505  0.30032139 -0.70179981]
1141555 ,  28


We combine these into the final table for uploading to Pinecone. The table has two columns: `id` and `values`, which holds the embedding of each song.

In [57]:
embedded_features = selected_features[["id"]].copy()
embedded_features.loc[:, "values"] = song_embeddings
print(embedded_features.head())
print(embedded_features.shape)

                       id                                             values
0  7lmeHLHBe4nmXzuXc0HDjk  [0.7233145236968994, 0.4799940586090088, -1.67...
1  1wsRitfRRtWyEapl0q22o8  [0.7651256918907166, 0.7351270318031311, -1.51...
2  1hR0fIFK2qRG3f3RF70pb7  [1.1690152883529663, 1.0357924699783325, -1.99...
3  2lbASgTSoDO7MTuLAXlTW0  [0.9501161575317383, 0.7967706918716431, -1.63...
4  1MQTmpYOZ6fcMQc56Hdo7T  [1.0759998559951782, 0.8795831799507141, -1.78...
(1141555, 2)


In [59]:
len(embedded_features['values'][0])

28

## Store embeddings to Pinecone

In [33]:
!pip install -qU \
  "pinecone-client[grpc]"==2.2.1

In [34]:
import os
import pinecone
import time



In [45]:
# PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY') or 'YOUR_API_KEY'
# PINECONE_ENV = os.environ.get('PINECONE_ENVIRONMENT') or 'YOUR_ENV'

In [51]:
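# Read the credentials from the environment instead of hard-coding them in the notebook
PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY') or 'YOUR_API_KEY'
PINECONE_ENV = os.environ.get('PINECONE_ENVIRONMENT') or 'us-east-1-aws'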

In [52]:
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENV
)

In [70]:
index_name = 'music-recommender-test'
dim = len(embedded_features['values'][0])

In [71]:
# only create index if it doesn't exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=dim,
        metric='cosine'
    )
    # wait a moment for the index to be fully initialized
    time.sleep(1)

# now connect to the index
index = pinecone.GRPCIndex(index_name)
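
A fixed `time.sleep(1)` may be too short if index creation takes longer than a second. A more defensive pattern is to poll until the index reports that it is ready. The helper below is a generic sketch of that pattern; `wait_until` is hypothetical, and `is_ready` stands for a callable that would wrap whatever readiness check the Pinecone client exposes.

```python
import time


def wait_until(is_ready, timeout_s=60.0, interval_s=1.0):
    """Poll is_ready() until it returns True or the timeout expires.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(interval_s)
    return False


# Toy usage: the condition becomes true on the third check
checks = iter([False, False, True])
print(wait_until(lambda: next(checks), timeout_s=5.0, interval_s=0.01))
```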

In [72]:
index.describe_index_stats()

{'dimension': 28,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
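
The index is created with `metric='cosine'`, so matches are ranked by the cosine of the angle between embeddings rather than by their magnitude. As a rough illustration in plain Python (not a description of how Pinecone computes it internally), the score can be sketched as follows; `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined for a zero vector; report no similarity
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```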

In [77]:
test_data = embedded_features[:10]
store_data = embedded_features[10:]

In [78]:
index.upsert_from_dataframe(store_data, batch_size=1000)

sending upsert requests:   0%|          | 0/1141545 [00:00<?, ?it/s]

collecting async responses:   0%|          | 0/1142 [00:00<?, ?it/s]

upserted_count: 1141545
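
With `batch_size=1000`, the 1,141,545 rows are sent in 1,142 requests, which matches the `collecting async responses` counter above. The helper below is a hypothetical plain-Python sketch of that chunking, not the client's own code.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


batch_sizes = [len(batch) for batch in batched(range(1141545), 1000)]
print(len(batch_sizes), batch_sizes[-1])  # number of batches, size of the last one
```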

In [79]:
index.describe_index_stats()

{'dimension': 28,
 'index_fullness': 0.2,
 'namespaces': {'': {'vector_count': 1141545}},
 'total_vector_count': 1141545}

## Search for similar songs

1. Our personal favorite song (feed 1, get top 10)
2. Our listening history (feed 10, get top 10)
3. Spotify 2023 top 100 songs (feed the most streamed song, get top 10)
4. Spotify 2023 top 100 songs (feed 10, get top 10)

Pinecone search metric 
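
For the "feed 10, get top 10" scenarios, the ten seed embeddings have to be either queried one at a time or reduced to a single query vector. One simple option, assumed here and not taken from the notebook, is to average them element-wise and query with the resulting centroid. `centroid` is a hypothetical helper.

```python
def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


seeds = [[1.0, 2.0], [3.0, 4.0]]
print(centroid(seeds))
```

A centroid can blur distinct tastes in a mixed listening history; querying each seed separately and merging the results is an alternative worth comparing.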

### Prepare test data

Get the most streamed songs in 2023 (datasets: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023/data, https://www.kaggle.com/datasets/amitanshjoshi/spotify-1million-tracks)

In [115]:
file_path_top_songs = '../spotify-2023.csv'
top_songs = pd.read_csv(file_path_top_songs, encoding='latin-1')
list(top_songs.columns)

['track_name',
 'artist(s)_name',
 'artist_count',
 'released_year',
 'released_month',
 'released_day',
 'in_spotify_playlists',
 'in_spotify_charts',
 'streams',
 'in_apple_playlists',
 'in_apple_charts',
 'in_deezer_playlists',
 'in_deezer_charts',
 'in_shazam_charts',
 'bpm',
 'key',
 'mode',
 'danceability_%',
 'valence_%',
 'energy_%',
 'acousticness_%',
 'instrumentalness_%',
 'liveness_%',
 'speechiness_%']

In [98]:
file_path_all_songs = '../spotify_data.csv'
all_songs = pd.read_csv(file_path_all_songs)
all_songs = all_songs[['artist_name', 'track_name', 'year','loudness']]

   Unnamed: 0    artist_name        track_name                track_id  \
0           0     Jason Mraz   I Won't Give Up  53QF56cjZA9RTuuMZDrSA6   
1           1     Jason Mraz  93 Million Miles  1s8tP3jP4GZcyHDsjvw218   
2           2  Joshua Hyslop  Do Not Let Me Go  7BRCa8MPiyuvr2VU3O9W0F   
3           3   Boyce Avenue          Fast Car  63wsZUhUZLlh1OsyrZq7sz   
4           4   Andrew Belle  Sky's Still Blue  6nXIYClvJAfi6ujLiKqEq8   

   popularity  year     genre  danceability  energy  key  loudness  mode  \
0          68  2012  acoustic         0.483   0.303    4   -10.058     1   
1          50  2012  acoustic         0.572   0.454    3   -10.286     1   
2          57  2012  acoustic         0.409   0.234    3   -13.711     1   
3          58  2012  acoustic         0.392   0.251   10    -9.845     1   
4          54  2012  acoustic         0.430   0.791    6    -5.419     0   

   speechiness  acousticness  instrumentalness  liveness  valence    tempo  \
0       0.0429      

In [122]:
top_song_names = set(top_songs['track_name'])
selected_rows = []
for _, row in all_songs.iterrows():
    song = row['track_name']
    artist = str(row['artist_name'])
    if song in top_song_names:
        # All artist strings listed for this title in the top-songs table (a Series)
        artist_top = top_songs.loc[top_songs['track_name'] == song, 'artist(s)_name']
        # A Series has no single truth value, so test explicitly whether any entry
        # mentions this artist. Substring matching is an assumption about the intended rule.
        if artist_top.astype(str).str.contains(artist, regex=False).any():
            selected_rows.append(row)


### Query

In [80]:
test_data

Unnamed: 0,id,values
0,7lmeHLHBe4nmXzuXc0HDjk,"[0.7233145236968994, 0.4799940586090088, -1.67..."
1,1wsRitfRRtWyEapl0q22o8,"[0.7651256918907166, 0.7351270318031311, -1.51..."
2,1hR0fIFK2qRG3f3RF70pb7,"[1.1690152883529663, 1.0357924699783325, -1.99..."
3,2lbASgTSoDO7MTuLAXlTW0,"[0.9501161575317383, 0.7967706918716431, -1.63..."
4,1MQTmpYOZ6fcMQc56Hdo7T,"[1.0759998559951782, 0.8795831799507141, -1.78..."
5,2LXPNLSMAauNJfnC58lSqY,"[1.2189991474151611, 0.6400238275527954, -2.23..."
6,3moeHk8eIajvUEzVocXukf,"[0.1430703103542328, 0.9378200173377991, -2.30..."
7,4llunZfVXv3NvUzXVB3VVL,"[0.6131278872489929, -0.008440256118774414, -0..."
8,21Mq0NzFoVRvOmLTOnJjng,"[1.0829685926437378, 0.1389971375465393, -0.98..."
9,6s2FgJbnnMwFTpWJZzvb6z,"[0.38173824548721313, 0.20188066363334656, -1...."


In [81]:
# query with song "7lmeHLHBe4nmXzuXc0HDjk"

# create the query vector
xq = test_data['values'][0]

# now query
xc = index.query(xq, top_k=5, include_metadata=True)
xc

{'matches': [{'id': '5S4fFQBvN1DigjO6XqRM16',
              'metadata': {},
              'score': 0.962494,
              'sparse_values': {'indices': [], 'values': []},
              'values': []},
             {'id': '4ARUrlst7m0CLVMOyFg6XZ',
              'metadata': {},
              'score': 0.9586928,
              'sparse_values': {'indices': [], 'values': []},
              'values': []},
             {'id': '5mhW5HmdaH1c2Eyor47W80',
              'metadata': {},
              'score': 0.9574417,
              'sparse_values': {'indices': [], 'values': []},
              'values': []},
             {'id': '3yjyLEq4lSr0CbZuUc2uZr',
              'metadata': {},
              'score': 0.9555039,
              'sparse_values': {'indices': [], 'values': []},
              'values': []},
             {'id': '0paR0kCOxCUKqmtNDHjPaL',
              'metadata': {},
              'score': 0.94905937,
              'sparse_values': {'indices': [], 'values': []},
              'values': 

In [85]:
df[df['id'] == '5S4fFQBvN1DigjO6XqRM16']

Unnamed: 0,id,name,album,album_id,artists,artist_ids,track_number,disc_number,explicit,danceability,...,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,year,release_date
802610,5S4fFQBvN1DigjO6XqRM16,Projecting Power,Emo Diaries - Chapter Ten - The Hope I Hide In...,13W3r4Kq7uMOA94IXEoEYk,['The Holiday Plan'],['2Y7RpEHJ35w7FjeLZIefGd'],3,1,False,0.454,...,0.0525,0.103,0.0247,0.43,0.539,110.937,210853,4.0,2004,2004-04-27


In [86]:
df[df['id'] == '7lmeHLHBe4nmXzuXc0HDjk']

Unnamed: 0,id,name,album,album_id,artists,artist_ids,track_number,disc_number,explicit,danceability,...,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,year,release_date
0,7lmeHLHBe4nmXzuXc0HDjk,Testify,The Battle Of Los Angeles,2eia0myWFgoHuttJytCxgX,['Rage Against The Machine'],['2d0hyoQ5ynDBnkvAbJKORj'],1,1,False,0.47,...,0.0727,0.0261,1.1e-05,0.356,0.503,117.906,210133,4.0,1999,1999-11-02
