# Spotify Sound Guide by Sergey Sonkin

The code below is broken up into steps for 
* Step 0: Deal with env and global variables, including the artist's spotify id
* Step 1: Get artist's albums
* Step 2: Filter out duplicate albums
* Step 3: Get track information, engineer any new features
* Step 4: Repeat steps 1-3 but for artist's singles (done separately from albums to avoid duplicates)
* Step 5: Process features for recommendation
* Step 6: Get a seed song
* Step 7: Define our likeability metric
* Step 8: Start recommending!

## Step 0: Dealing with request variables

In [22]:
import os
import requests
import pandas as pd
import re

client_id = os.getenv("SPOTIFY_CLIENT_ID")
client_secret = os.getenv("SPOTIFY_CLIENT_SECRET")

data = "grant_type=client_credentials&client_id=" \
        + client_id \
        + "&client_secret=" \
        + client_secret
header = {
    "Content-Type": "application/x-www-form-urlencoded"
}
response = requests.post("https://accounts.spotify.com/api/token",headers=header,data=data)
access_token = response.json()['access_token']
headers = {
    'Authorization': 'Bearer {token}'.format(token=access_token)
}
BASE_URL = 'https://api.spotify.com/v1/'

## Some artist ids
global_artist_id = "4RnBFZRiMLRyZy0AzzTg2C" ##Run The Jewels
global_artist_id = "2kRfqPViCqYdSGhYSM9R0Q" ##Madison Beer
global_artist_id = "6qqNVTkY8uBg9cP3Jd7DAH" ##Billie Eilish
global_artist_id = "6fWVd57NKTalqvmjRd2t8Z" ##24kGoldn
global_artist_id = "2tIP7SsRs7vjIcLrU85W8J" ##The Kid Laroi
global_artist_id = "7dGJo4pcD2V6oG8kP0tJRR" ##Eminem
global_artist_id = "4EPyKMgsR7JDuW9tL0AYZP" ##lilbootycall
global_artist_id = "1VPmR4DJC1PlOtd0IADAO0" ##Suicideboys

## Step 1: Start by getting just albums

In [7]:
def get_albums(type='album',artist_id=global_artist_id):
    album_list = []
    counter = 0 ## Need counter to deal with limit of 50 per page
    while(True):
        print(counter)
        r = requests.get(BASE_URL + 'artists/' + artist_id + '/albums', 
                        headers=headers, 
                        params={'market':'US', 'include_groups': type, 'limit': 50, 'offset': 50*counter})
        d = r.json()
        if len(d['items']) == 0:
            print("Done!")
            break
        album_list += d['items']
        counter += 1
    return album_list

album_list = get_albums('album')

0
<Response [200]>
1
<Response [200]>
Done!


## Step 2: Filter out duplicate albums

How do we pick which of the duplicates to use? We choose based on the following criteria in order

1. Most explicit
2. Most recently released
3. Most number of tracks

In [8]:
def filter_duplicates(album_list):
    ## Generating list of album names
    names = [(i,re.sub(r'\W+', '',album['name'].lower())) for i,album in enumerate(album_list)]
    ## Generating list of duplicates (doubles)
    viewed = {}
    doubles = []
    for (index,name) in names:
        if name in viewed:
            other_index = viewed[name]
            doubles.append((name,index,other_index))
        else:
            viewed[name] = index

    ## How do we pick which of the dupli
    ## For each duplicate album, find which one has the explicit songs
    for (name,index_1,index_2) in doubles:
        album_id_1 = album_list[index_1]['id']
        album_id_2 = album_list[index_2]['id']
        r1 = requests.get(BASE_URL+'albums/'+album_id_1+'/tracks',headers=headers)
        r2 = requests.get(BASE_URL+'albums/'+album_id_2+'/tracks',headers=headers)
        tracks1 = r1.json()['items']
        tracks2 = r2.json()['items']
        explicit_1 = tracks1[0]['explicit']
        explicit_2 = tracks2[0]['explicit']
        ## If one is explicit but not the other, take the explicit version
        if explicit_1 and not explicit_2:
            viewed[name] = index_1
        elif explicit_2 and not explicit_1:
            viewed[name] = index_2
        ## If they're the same explicitness, take the more recently released version
        else:
            album_rd_1 = album_list[index_1].get("release_date",-1)
            album_rd_2 = album_list[index_2].get("release_date",-1)
            if album_rd_1 > album_rd_2:
                viewed[name] = index_1
            elif album_rd_2 > album_rd_1:
                viewed[name] = index_2
            else:
            ## If they're release on the same date, take the one with more tracks
                album_tracks_1 = album_list[index_1].get("total_tracks",-1)
                album_tracks_2 = album_list[index_2].get("total_tracks",-1)
                if album_tracks_1 > album_tracks_2:
                    viewed[name] = index_1
                elif album_tracks_2 > album_tracks_1:
                    viewed[name] = index_2
                ## If they have the same number of tracks honestly I'm defeated just pick the first
                else:
                    viewed[name] = index_1
    filtered_album_ids = list(viewed.values())
    print("We removed " + str(len(album_list) - len(filtered_album_ids)) + " duplicates")
    return filtered_album_ids

filtered_album_ids = filter_duplicates(album_list)

We removed 3 duplicates


## Step 3: Get track information

The star of the show is Spotify's "Get Tracks' Audio Features" endpoint. Its documention can be found here https://developer.spotify.com/documentation/web-api/reference/get-several-audio-features

### Step 3a: Initialize track information

First we create our pandas table with the basic information about each track we have from our queries above. 

If we have duplicate tracks, we filter for them as we go.

In [9]:
track_info = pd.DataFrame(columns = ["album_name","track_name","release_date","popularity",
                                     "acousticness","danceability","energy","instrumentalness","key","liveness",
                                     "loudness","mode","speechiness","tempo","time_signature","valence"])
track_info[['mode', 'time_signature','popularity']] = track_info[['mode', 'time_signature','popularity']].astype('int8')

## We will store track ids in a list for use in Step 3b
track_names = set()

def get_new_tracks(album_list,filtered_album_ids):
    global track_info
    track_ids = []
    for album_id in filtered_album_ids:
        ## Get the album
        album = album_list[album_id]
        album_name = album["name"]
        album_release_date = album["release_date"]
        ## Get the tracks for that album
        id = album['id']
        r = requests.get(BASE_URL+'albums/'+id+'/tracks',headers=headers)
        tracks = r.json()['items']
        ## For each track, just get the track name and id and set some initial values
        for track in tracks:
            track_id = track['id']
            track_name = track['name']
            track_popularity = track.get('popularity',50)
            if track_name not in track_names:
                ## Creating new DF with desired column names
                new_track_dict = {"album_name":album_name,
                                  "track_name":track_name,
                                  "release_date":pd.to_datetime(album_release_date),
                                  "popularity":track_popularity,
                                  "acousticness":0.0,
                                  "danceability":0.0,
                                  "energy":0.0,
                                  "instrumentalness":0.0,
                                  "key":0.0,
                                  "liveness":0.0,
                                  "loudness":0.0,
                                  "mode":0,
                                  "speechiness":0.0,
                                  "tempo":0.0,
                                  "time_signature":0,
                                  "valence":0.0}
                new_track_row = pd.DataFrame([new_track_dict],index=[track_id])
                ## Append to track_info, track_ids, and track_names
                track_info = track_info.append(new_track_row)
                track_ids.append(track_id)
                track_names.add(track_name)
            else:
                print("We have a duplicate track name: " + track_name)
    return track_ids

track_ids = get_new_tracks(album_list,filtered_album_ids)

### Step 3b: Get song details

Now for all of the (unique) tracks from 3a, we ask Spotify for its song features.

Every once in a while Spotify just doesn't have any song features so we drop the song completely. It's not helpful to us if we don't know anything about it, and it happens so rarely that it's not a concern for the usability of this project.

In [10]:
special_features = ["acousticness","danceability","energy","instrumentalness","key","liveness",
                    "loudness","mode","speechiness","tempo","time_signature","valence"]
def get_features(track_ids,track_info):
    ## Break our song IDs into 100 song batches
    l = len(track_ids)
    width = 100
    iters = (l // width) + 1
    for ii in range(iters):
        ## Get these 100 songs intro a string
        track_id_subset = track_ids[width*ii:width*ii+width]
        tracks_string = (",").join(track_id_subset)
        ## Get the request with this data
        r = requests.get(BASE_URL + 'audio-features/',headers=headers,
                        params={'ids':tracks_string})
        audio_features = r.json()['audio_features']
        ## We got the audio features, now extract as much as possible
        for jj,audio_feature in enumerate(audio_features):
            track_id = track_id_subset[jj]
            try:
                for f in special_features:
                    track_info.loc[track_id,f] = audio_feature[f]
            except:
                track_id = jj + width*ii
                print("We have an issue with track id " + str(track_id) + " " + str(track_ids[track_id]))
                track_info = track_info.drop(track_ids[track_id])


get_features(track_ids,track_info)

We have an issue with track id 171 7s7q9dpsSCMEnDR3WhExZy


In [11]:
track_info

Unnamed: 0,album_name,track_name,release_date,popularity,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
49YpGS0rVcRLtiDvx5JQyp,DIRTIESTNASTIEST$UICIDE,Sorry for the Delay,2022-12-16,50,0.00951,0.787,0.889,0.000322,2.0,0.6520,-3.125,1,0.1280,156.027,4,0.677
5dol1hrERJOReznLRJ2VVQ,DIRTIESTNASTIEST$UICIDE,BUCKHEAD,2022-12-16,50,0.00026,0.759,0.833,0.057300,11.0,0.1780,-5.010,1,0.0779,140.026,4,0.522
3QQXpvZd9qmzHZ02wDf2im,DIRTIESTNASTIEST$UICIDE,I Dream of Chrome,2022-12-16,50,0.04840,0.840,0.934,0.000000,0.0,0.0961,-3.717,1,0.1190,149.994,4,0.670
1UsvO5U72YRU8Xnq8Lp14O,DIRTIESTNASTIEST$UICIDE,Champagne Face,2022-12-16,50,0.02310,0.894,0.767,0.000024,10.0,0.5740,-4.695,0,0.1370,144.077,4,0.412
2CkpD7gqMXrrpTCJ9TZ0bw,DIRTIESTNASTIEST$UICIDE,The Serpent and the Rainbow,2022-12-16,50,0.00147,0.780,0.780,0.000000,0.0,0.4720,-2.857,1,0.0858,118.014,4,0.446
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6NEhZKGfYMHh7lieunprnP,KILL YOURSELF Part VII: The Fuck God Saga,Noxygen,2014-12-22,50,0.00000,0.000,0.000,0.000000,0.0,0.0000,0.000,0,0.0000,0.000,0,0.000
5ocBRvhwLtQfOm3KojKLiS,KILL YOURSELF Part VII: The Fuck God Saga,Grey Boys,2014-12-22,50,0.00000,0.000,0.000,0.000000,0.0,0.0000,0.000,0,0.0000,0.000,0,0.000
0XxSQS2cT0lRQgXNyyfUj2,KILL YOURSELF Part VII: The Fuck God Saga,Crucify Me Wearing Tommy,2014-12-22,50,0.00000,0.000,0.000,0.000000,0.0,0.0000,0.000,0,0.0000,0.000,0,0.000
40UUguoDiCMepoXCe3MR0o,KILL YOURSELF Part VII: The Fuck God Saga,Back from the Dead,2014-12-22,50,0.00000,0.000,0.000,0.000000,0.0,0.0000,0.000,0,0.0000,0.000,0,0.000


In [12]:
## Just outputting to a CSV so I can experiment in tableau
track_info.to_csv("out2.csv")

## Step 4: Repeat Steps 1-3 but for singles

We already put steps 1-3 into functions so this doesn't require much duplicate code at all!

In the future, instead of calling get_albums on albums and singles individually we could call for both at the same time saving on at most 1 API call per artist. 

If we go into large scale production and get bottlenecked by rate limits, this could be a good improvement. For now, it's pedantic and not worth the energy to rewrite otherwise working code.

### Step 4a: Get singles

In [13]:
single_list = get_albums('single')

0
<Response [200]>
1
<Response [200]>
2
<Response [200]>
Done!


### Step 4b: Filter duplicate singles

In [14]:
filtered_single_ids = filter_duplicates(single_list)

We removed 5 duplicates


### Step 4c: Retrieve singles and add if they're unique

In [15]:
track_ids = get_new_tracks(single_list,filtered_single_ids)

We have a duplicate track name: Big Shot Cream Soda
We have a duplicate track name: My Swisher Sweet, But My Sig Sauer
We have a duplicate track name: Escape From BABYLON
We have a duplicate track name: THE_EVIL_THAT_MEN_DO
We have a duplicate track name: THE_EVIL_THAT_MEN_DO
We have a duplicate track name: Materialism as a Means to an End
We have a duplicate track name: Avalon
We have a duplicate track name: Avalon
We have a duplicate track name: NEW PROFILE PIC
We have a duplicate track name: NEW PROFILE PIC
We have a duplicate track name: ...And to Those I Love, Thanks for Sticking Around
We have a duplicate track name: Fuck Your Culture
We have a duplicate track name: Scope Set
We have a duplicate track name: Meet Mr. NICEGUY
We have a duplicate track name: Carrollton
We have a duplicate track name: Aliens Are Ghosts
We have a duplicate track name: nothingleftnothingleft
We have a duplicate track name: For the Last Time
We have a duplicate track name: Fuckthepopulation
We have a du

### Step 4d: Retrieve features for these new singles

In [16]:
get_features(track_ids,track_info)

In [17]:
track_info

Unnamed: 0,album_name,track_name,release_date,popularity,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
49YpGS0rVcRLtiDvx5JQyp,DIRTIESTNASTIEST$UICIDE,Sorry for the Delay,2022-12-16,50,0.00951,0.787,0.889,0.000322,2.0,0.6520,-3.125,1,0.1280,156.027,4,0.677
5dol1hrERJOReznLRJ2VVQ,DIRTIESTNASTIEST$UICIDE,BUCKHEAD,2022-12-16,50,0.00026,0.759,0.833,0.057300,11.0,0.1780,-5.010,1,0.0779,140.026,4,0.522
3QQXpvZd9qmzHZ02wDf2im,DIRTIESTNASTIEST$UICIDE,I Dream of Chrome,2022-12-16,50,0.04840,0.840,0.934,0.000000,0.0,0.0961,-3.717,1,0.1190,149.994,4,0.670
1UsvO5U72YRU8Xnq8Lp14O,DIRTIESTNASTIEST$UICIDE,Champagne Face,2022-12-16,50,0.02310,0.894,0.767,0.000024,10.0,0.5740,-4.695,0,0.1370,144.077,4,0.412
2CkpD7gqMXrrpTCJ9TZ0bw,DIRTIESTNASTIEST$UICIDE,The Serpent and the Rainbow,2022-12-16,50,0.00147,0.780,0.780,0.000000,0.0,0.4720,-2.857,1,0.0858,118.014,4,0.446
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4Gy5kycvHxatuBiNQBCPA6,KILL YOURSELF Part I: The $uicide Saga,Kill Yourself,2014-01-01,50,0.48900,0.856,0.709,0.000085,10.0,0.0845,-6.976,0,0.0420,110.063,4,0.403
2LRyFZnPdogRD8fjdj0gHr,KILL YOURSELF Part I: The $uicide Saga,Mask & Da Glock,2014-01-01,50,0.02310,0.759,0.882,0.000015,9.0,0.5440,-5.996,1,0.1840,129.992,4,0.570
5SN1ffDyC7OtMlZjdOKgHZ,KILL YOURSELF Part I: The $uicide Saga,Maple Syrup,2014-01-01,50,0.04770,0.602,0.858,0.039000,2.0,0.2740,-6.671,1,0.0716,129.859,4,0.580
6cDsdfgV7UHdDc2AokAylv,KILL YOURSELF Part I: The $uicide Saga,Kill Yourself - Leaned Out Remix,2014-01-01,50,0.10700,0.649,0.738,0.071500,0.0,0.3040,-5.972,0,0.0452,93.245,4,0.523


### Step 5: Clean and process our data for recommendation

A couple of things to point out. 

#### Onehot encoding

We need to onehot encode our data because our similarity metric uses cosine similarity/distance, which does not work with categorical columns. 

Some classes that could be encoded but probably shouldn't be: Keys, Pitch Classes, Time Signature. 

These features (from Spotify) each have their own basis in numeric space (i.e. distance between Pitch Class 2 and 3 vs 2 and 4 does have mathematical significance) so we leave it as is.

#### Dealing with release date

We could convert release date into Unix time in seconds but I figured a more practical approach would be to instead look at the artists "progress" into their career: whether a song was released at the start or towards their most latest album.

Artists, now more than ever, tend to develop and change their sound over time especially with every major album release. Much of this could be attributed to the change from buying songs in records to browsing songs on streaming services but that's a separate rant.

By converting release date into a range of [0,1] we are mapping songs together while still also considering album groupings because album_name is being onehot encoded as well.

Arguably this is overfitting, but in testing it almost seems to not overfit enough. 

A large reason for this is because if an artist only started recently, the difference between 0 and 1 is huge, but the difference between 2022 and 2023 isn't as an artist isn't likely to change their sound so drastically in a year. 

In the future, one way to deal with this would be to creative a "minimum" of five years to every artist's career so that even if they have only been producing for a year, their release dates get mapped to 0.9 and 1 so they're not as far apart. Currently not needed, but I'm documenting this for myself in case I need to revisit this idea.

In [18]:
start = 0

def clean_data(track_info):
    ## In theory we could also get dummies for keys but pitch classes seem to have numeric analogue
    ## Same goes for time_signature
    cleaned = pd.get_dummies(track_info,columns=['album_name','mode'])
    cleaned = cleaned.drop(columns=['track_name','popularity','release_date'])

    start = track_info['release_date'].min()
    cleaned['duration'] = track_info['release_date'] - start
    cleaned['duration'] = cleaned['duration'].dt.days
    maxd = cleaned['duration'].max()
    cleaned['duration'] = cleaned['duration'] / maxd

    return cleaned

cleaned = clean_data(track_info)

## Step 6: Get a seed song and define our likeability metric

How do we pick the first song to recommend? Picking the most popular one is probably a safe bet - it's popular for a reason!

### Step 6a: Get most popular song for seed song

This code will break if none of their most popular songs are in our database.

This would only happen if all of their most popular songs are by another artist, which I haven't seen yet but let's be honest if that's the case you probably shouldn't listen to them anyways.

In [19]:
## Get most popular tracks
r = requests.get(BASE_URL + 'artists/' + global_artist_id + '/top-tracks',headers=headers,params={'market':'US'})
trracks = r.json()['tracks']
for track in trracks:
    most_id = track["id"]
    ## Make sure we're tracking this song's data
    if most_id in cleaned.index:
        most_pop = cleaned.loc[most_id].to_frame().transpose()
        break
most_pop

Unnamed: 0,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,speechiness,tempo,time_signature,...,album_name_Now the Moon's Rising,album_name_Radical $uicide,album_name_SHAMELESS $UICIDE,album_name_Scrape,"album_name_Sing Me a Lullaby, My Sweet Temptation",album_name_Stop Staring At the Shadows,album_name_YUNGDEATHLILLIFE,mode_0,mode_1,duration
30QR0ndUdiiMQMA9g1PGCm,0.124,0.792,0.511,9e-05,2.0,0.14,-6.876,0.0409,113.983,4.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.668961


### Step 7: Start computing distances

I'll create a video explaining why we're using this metric.

In [20]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


(r,_) = track_info.shape
prev = np.ones((r,1))

## Store track_ids of songs already recommended
viewed_tracks = []

def get_song_recommendation(stripped,new_track_info,liked=True):
    global prev
    ## Step 0: Mark this track as viewed
    prev_id = new_track_info.index[0]
    viewed_tracks.append(prev_id)
    ## Step 1: Get distances from new track
    dists = cosine_similarity(stripped,new_track_info)
    ## Step 2: If we hated that song, we want songs that are far away from it
    if not liked:
        dists = 1 - dists
    ## Step 3: Get new metric array
    prev = np.multiply(prev,dists)
    ## Step 4: From this array, get next song to recommend
    sortedd = np.argsort(prev,axis=0)
    for ii in range(1,20):
        jj = sortedd[-ii][0]
        new_track_id = stripped.iloc[jj].name
        if new_track_id not in viewed_tracks:
            return stripped.iloc[jj].to_frame().transpose()

### Step 8: Start recommending songs and hearing back!

In [21]:
from time import sleep

new_rec = most_pop
for _ in range(5):
    print("You should listen to spotify track id " + new_rec.index[0])
    sleep(1)
    enjoyed = input("Did you enjoy this track? Yes or No")
    if enjoyed == "Yes":
        new_rec = get_song_recommendation(cleaned,new_rec,True)
    else:
        new_rec = get_song_recommendation(cleaned,new_rec,False)

You should listen to spotify track id 30QR0ndUdiiMQMA9g1PGCm
You should listen to spotify track id 0XxSQS2cT0lRQgXNyyfUj2
You should listen to spotify track id 7zNjwnJWBWJDUchrHzhYMo
You should listen to spotify track id 7s7q9dpsSCMEnDR3WhExZy
You should listen to spotify track id 5vamLq20Q17oMMeHfpBEWY
