# 1. Gathering the data for our recommender system

To start our recommender system, we first need some data to work with. You can either use the Spotify API to retrieve songs from the Spotify database or find a dataset online. Thankfully, I found a Kaggle dataset that contains about 1.2 million songs. For the purpose of this project, we'll keep only the songs released from 2000 to 2020.

On closer review, the dataset is missing a key component that will drive our recommender system: the genres of the songs. Spotify attaches genres to artists rather than to individual songs, so I used the Spotify API to get each artist's genres and added them to that artist's songs.

In [89]:
import pandas as pd
import numpy as np
import re
import datetime
import json
import requests
import base64
import time
from urllib.parse import urlencode
from spotipy.oauth2 import SpotifyOAuth

from PIL import Image
import matplotlib.pyplot as plt
import math

In [2]:
songs = pd.read_csv('tracks_features.csv')

In [3]:
songs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1204025 entries, 0 to 1204024
Data columns (total 24 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   id                1204025 non-null  object 
 1   name              1204025 non-null  object 
 2   album             1204025 non-null  object 
 3   album_id          1204025 non-null  object 
 4   artists           1204025 non-null  object 
 5   artist_ids        1204025 non-null  object 
 6   track_number      1204025 non-null  int64  
 7   disc_number       1204025 non-null  int64  
 8   explicit          1204025 non-null  bool   
 9   danceability      1204025 non-null  float64
 10  energy            1204025 non-null  float64
 11  key               1204025 non-null  int64  
 12  loudness          1204025 non-null  float64
 13  mode              1204025 non-null  int64  
 14  speechiness       1204025 non-null  float64
 15  acousticness      1204025 non-null  float64
 16  

## Filtering songs from 2000-2020

In [4]:
new_songs = songs[songs['year'].between(2000,2020)].reset_index().copy()

In [6]:
new_songs

Unnamed: 0,index,id,name,album,album_id,artists,artist_ids,track_number,disc_number,explicit,...,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,year,release_date
0,22,2SwgVZn9S4NGueAaEAryf1,Man on a Mission,Do It for Love,4evw6IBex3N8x1oA2axMTH,Daryl Hall & John Oates,['77tT1kLj6mCWtFNqiOmP9H'],1,1,False,...,0.0315,0.29200,0.000025,0.1010,0.962,119.946,224307,4.0,2018,2018-04-10
1,23,0QCQ1Isa0YPVyIbs6JwpO1,Do It for Love,Do It for Love,4evw6IBex3N8x1oA2axMTH,Daryl Hall & John Oates,['77tT1kLj6mCWtFNqiOmP9H'],2,1,False,...,0.0586,0.10700,0.000000,0.0574,0.832,87.976,238000,4.0,2018,2018-04-10
2,24,3kIBEFhsZOeeKGebxRraOb,Someday We'll Know,Do It for Love,4evw6IBex3N8x1oA2axMTH,Daryl Hall & John Oates,['77tT1kLj6mCWtFNqiOmP9H'],3,1,False,...,0.0308,0.02330,0.000010,0.0819,0.461,109.977,268013,4.0,2018,2018-04-10
3,25,5dNDRw6qjDcnbW3luRhElU,Forever for You,Do It for Love,4evw6IBex3N8x1oA2axMTH,Daryl Hall & John Oates,['77tT1kLj6mCWtFNqiOmP9H'],4,1,False,...,0.0240,0.56200,0.000006,0.1860,0.370,97.030,277813,4.0,2018,2018-04-10
4,26,561UU4MvlsCenN1x7leYCh,Life's Too Short,Do It for Love,4evw6IBex3N8x1oA2axMTH,Daryl Hall & John Oates,['77tT1kLj6mCWtFNqiOmP9H'],5,1,False,...,0.0347,0.07600,0.013600,0.0731,0.974,116.013,209960,4.0,2018,2018-04-10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
991563,1204020,0EsMifwUmMfJZxzoMPXJKZ,Gospel of Juke,Notch - EP,38O5Ys0W9PFS5K7dMb7yKb,FVLCRVM,['7AjItKsRnEYRSiBt2OxK1y'],2,1,False,...,0.0672,0.00935,0.002240,0.3370,0.415,159.586,276213,4.0,2014,2014-01-09
991564,1204021,2WSc2TB1CSJgGE0PEzVeiu,Prism Visions,Notch - EP,38O5Ys0W9PFS5K7dMb7yKb,FVLCRVM,['7AjItKsRnEYRSiBt2OxK1y'],3,1,False,...,0.0883,0.10400,0.644000,0.0749,0.781,121.980,363179,4.0,2014,2014-01-09
991565,1204022,6iProIgUe3ETpO6UT0v5Hg,Tokyo 360,Notch - EP,38O5Ys0W9PFS5K7dMb7yKb,FVLCRVM,['7AjItKsRnEYRSiBt2OxK1y'],4,1,False,...,0.0564,0.03040,0.918000,0.0664,0.467,121.996,385335,4.0,2014,2014-01-09
991566,1204023,37B4SXC8uoBsUyKCWnhPfX,Yummy!,Notch - EP,38O5Ys0W9PFS5K7dMb7yKb,FVLCRVM,['7AjItKsRnEYRSiBt2OxK1y'],5,1,False,...,0.0409,0.00007,0.776000,0.1170,0.227,124.986,324455,4.0,2014,2014-01-09


## Adding the genres to our data using the SpotifyAPI

### SpotifyAPI - Client Credentials Flow

In [7]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
@author: CodingEntrepreneurs
@github: https://github.com/codingforentrepreneurs
"""

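import os

# Read the credentials from the environment instead of hardcoding them in the notebook.
# The variable names SPOTIFY_CLIENT_ID and SPOTIFY_CLIENT_SECRET are assumed; use your own.
client_id = os.environ.get('SPOTIFY_CLIENT_ID')
client_secret = os.environ.get('SPOTIFY_CLIENT_SECRET')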

class SpotifyAPI(object):
    access_token = None
    access_token_expires = datetime.datetime.now()
    access_token_did_expire = True
    client_id = None
    client_secret = None
    token_url = "https://accounts.spotify.com/api/token"
    
    def __init__(self, client_id, client_secret, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.client_id = client_id
        self.client_secret = client_secret

    def get_client_credentials(self):
        """
        Returns a base64 encoded string
        """
        client_id = self.client_id
        client_secret = self.client_secret
        if client_secret is None or client_id is None:
            raise Exception("You must set client_id and client_secret")
        client_creds = f"{client_id}:{client_secret}"
        client_creds_b64 = base64.b64encode(client_creds.encode())
        return client_creds_b64.decode()
    
    def get_token_headers(self):
        client_creds_b64 = self.get_client_credentials()
        return {
            "Authorization": f"Basic {client_creds_b64}"
        }
    
    def get_token_data(self):
        return {
            "grant_type": "client_credentials"
        } 
    
    def perform_auth(self):
        token_url = self.token_url
        token_data = self.get_token_data()
        token_headers = self.get_token_headers()
        r = requests.post(token_url, data=token_data, headers=token_headers)
        if r.status_code not in range(200, 300):
            raise Exception("Could not authenticate client.")
            # return False
        data = r.json()
        now = datetime.datetime.now()
        access_token = data['access_token']
        expires_in = data['expires_in'] # seconds
        expires = now + datetime.timedelta(seconds=expires_in)
        self.access_token = access_token
        self.access_token_expires = expires
        self.access_token_did_expire = expires < now
        return True
    
    def get_access_token(self):
        token = self.access_token
        expires = self.access_token_expires
        now = datetime.datetime.now()
        if expires < now:
            self.perform_auth()
            return self.get_access_token()
        elif token == None:
            self.perform_auth()
            return self.get_access_token() 
        return token
    
    def get_resource_header(self):
        access_token = self.get_access_token()
        headers = {
            "Authorization": f"Bearer {access_token}"
        }
        return headers
        
        
    def get_resource(self, lookup_id, resource_type='albums', version='v1'):
        endpoint = f"https://api.spotify.com/{version}/{resource_type}/{lookup_id}"
        headers = self.get_resource_header()
        r = requests.get(endpoint, headers=headers)
        if r.status_code not in range(200, 300):
            return {}
        return r.json()
    
    def get_album(self, _id):
        return self.get_resource(_id, resource_type='albums')
    
    def get_artist(self, _id):
        return self.get_resource(_id, resource_type='artists')
    
    def search(self, query, search_type='artist'): # type
        headers = self.get_resource_header()
        endpoint = "https://api.spotify.com/v1/search"
        data = urlencode({"q": query, "type": search_type.lower()})
        lookup_url = f"{endpoint}?{data}"
        r = requests.get(lookup_url, headers=headers)
        if r.status_code not in range(200, 300):
            return {}
        return r.json()

In [None]:
sp = SpotifyAPI(client_id, client_secret)

### Adding the genres to our Spotify catalog
Instead of iterating through every song in the catalog and requesting the genres for each row's artist_ids, we will build a dataframe of the unique artist_ids and request the genres once per artist. This way we make about 120,000 requests instead of about 990,000.
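The deduplication idea can be sketched without pandas. This is a minimal illustration, assuming each `artist_ids` value is a stringified Python list like the ones in the dataset. The helper name `unique_artist_ids` and the pairing of two artists on one row are made up for the example.

```python
import ast

def unique_artist_ids(rows):
    """Collect each artist ID once, preserving first-seen order."""
    seen = {}
    for raw in rows:
        # Each row is a string such as "['id1', 'id2']"; parse it into a real list
        for artist_id in ast.literal_eval(raw):
            seen.setdefault(artist_id, None)
    return list(seen)

rows = [
    "['77tT1kLj6mCWtFNqiOmP9H']",
    "['77tT1kLj6mCWtFNqiOmP9H', '7AjItKsRnEYRSiBt2OxK1y']",  # made-up pairing
    "['7AjItKsRnEYRSiBt2OxK1y']",
]
ids = unique_artist_ids(rows)
print(len(rows), "rows ->", len(ids), "unique artists")
```

Each artist is then requested once, no matter how many songs they appear on.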

In [12]:
new_songs['artist_ids'][0]

"['77tT1kLj6mCWtFNqiOmP9H']"

In [13]:
# Removing the following characters: ( ) " ' [ ] and any whitespace
new_songs['new_artist_ids'] = new_songs['artist_ids'].str.replace(r"[(\"'\[\]\s)]", '', regex=True).str.split(',')

  


In [None]:
def get_artist_genres(_id):
    try:
        return sp.get_artist(_id)['genres'] 
    except KeyError:
        return np.nan
    
unique_artist_ids = new_songs['new_artist_ids'].explode().unique()
artist_ids = pd.DataFrame(unique_artist_ids, columns=['artist_ids'])

# The time the code was executed
start_time = time.time()

# Getting the data
artist_ids['genres'] = artist_ids['artist_ids'].apply(get_artist_genres)

# Displaying how long it took
print("--- %s seconds ---" % (time.time() - start_time))
    
# Saving the results so we don't have to run again
artist_ids.to_csv(path_or_buf='/Users/leynahoang/Desktop/analytics/spotify-1mil-songs/artist_genres.csv')

Getting the genres for about 120,000 artists took about 4.7 hours. I am not sure whether this could be improved, but it was the approach I came up with that avoids iterating through every song. Next, I map the genres to the corresponding artists in our catalog.
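One possible way to cut the request count: the Spotify Web API also has a "Get Several Artists" endpoint (`GET /v1/artists?ids=...`) that accepts up to 50 artist IDs per call. That would bring roughly 120,000 requests down to about 2,400. The sketch below is not what was run for this project. `chunked`, `genres_in_batches`, and the injected `fetch_batch` callable are hypothetical names, and the fetcher is assumed to return the list stored under the response's `artists` key.

```python
def chunked(items, size=50):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def genres_in_batches(artist_ids, fetch_batch, size=50):
    """Map artist ID -> genres using one request per batch.

    `fetch_batch` is a hypothetical callable that takes a list of IDs and
    returns a list of artist dicts (the `artists` array of the response).
    """
    genres = {}
    for batch in chunked(artist_ids, size):
        for artist in fetch_batch(batch):
            # Unknown IDs may come back as None, so guard before indexing
            if artist:
                genres[artist["id"]] = artist.get("genres", [])
    return genres

# Offline stand-in for the real HTTP call, for demonstration only
def fake_fetch(batch):
    return [{"id": i, "genres": ["demo genre"]} for i in batch]

demo_ids = [f"artist{n}" for n in range(120)]
result = genres_in_batches(demo_ids, fake_fetch)
print(len(list(chunked(demo_ids))), "requests for", len(result), "artists")
```

A real `fetch_batch` would send the comma-separated IDs along with the headers from `sp.get_resource_header()`.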

The `new_artist_ids` column created earlier holds a list of artist IDs for each song, because songs often have more than one artist.

In [30]:
artist_genres = pd.read_csv('artist_genres.csv').drop('Unnamed: 0', axis=1)

In [32]:
# Filling in the artists with no genres with the string '[]', since genres are stored as strings after the CSV round trip
artist_genres['genres'] = artist_genres['genres'].fillna('[]')

In [41]:
artist_genres[artist_genres['artist_ids'] == '0CB5oYDxJ2M6vsLKA1ZNTW']['genres'].values[0]

"['gothenburg indie']"

As we can see above, the values in the dataframe look like lists, but they are actually strings. Let's clean that up.
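One way to do the conversion is `ast.literal_eval` from the standard library, which parses a string containing a Python literal back into the object. This is a small sketch rather than the cleaning used in the next cell. The helper name `parse_genres` is made up for illustration.

```python
import ast

def parse_genres(value):
    """Turn a stringified list such as "['gothenburg indie']" into a list."""
    # Missing values (NaN floats from the CSV) and empty strings become []
    if not isinstance(value, str) or not value.strip():
        return []
    return ast.literal_eval(value)

print(parse_genres("['gothenburg indie']"))
print(parse_genres("[]"))
print(parse_genres(float("nan")))
```

Unlike stripping characters with a regular expression, this keeps apostrophes and parentheses that are part of a genre name.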

In [88]:
start_time = time.time()

all_genres = []
for ix, row in new_songs.iterrows():
    genres = []
    for _id in row['new_artist_ids']:
        g = artist_genres[artist_genres['artist_ids'] == _id]['genres'].values[0]
        genres.append(g)
    
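    # Cleaning the genre list before finalizing
    new_genres = []
    for i in genres:
        # Skip artists with no genres (the string '[]' or an empty list)
        if str(i) == '[]':
            continue
        # Strip quotes and brackets, then split into individual genre names
        clean_i = [g.strip() for g in re.sub(r"[(\"'\[\])]", '', string=str(i)).split(',')]
        new_genres.append(clean_i)
    all_genres.append(new_genres)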

# Displaying how long it took
print("--- %s minutes ---" % ((time.time() - start_time)/60))
# this took about 3 hours


KeyboardInterrupt: 
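The loop above rescans the whole `artist_genres` dataframe for every artist of every song, which is likely why it runs for hours. A possible alternative, sketched here with plain Python containers and made-up sample data, is to build a dictionary once and look each artist up in constant time. The names `genre_lookup` and `genres_for_song` are hypothetical.

```python
# Build the lookup once: artist ID -> list of genres (already parsed into lists)
genre_lookup = {
    "0CB5oYDxJ2M6vsLKA1ZNTW": ["gothenburg indie"],
    "artist_without_genres": [],
}

def genres_for_song(artist_ids, lookup):
    """Return one genre list per artist, skipping artists with no genres."""
    song_genres = []
    for artist_id in artist_ids:
        # .get avoids an error when an ID is missing from the lookup
        genres = lookup.get(artist_id, [])
        if genres:
            song_genres.append(genres)
    return song_genres

songs = [
    ["0CB5oYDxJ2M6vsLKA1ZNTW"],
    ["0CB5oYDxJ2M6vsLKA1ZNTW", "artist_without_genres"],
    ["unknown_id"],
]
all_genres_demo = [genres_for_song(ids, genre_lookup) for ids in songs]
print(all_genres_demo)
```

With the real data, the dictionary could be built from the two columns with `dict(zip(artist_genres['artist_ids'], artist_genres['genres']))`, after the genre strings have been converted into lists.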

### Exporting the new Spotify catalog with the genres

In [87]:
new_songs['artist_genres'] = all_genres

new_songs.to_csv(path_or_buf='/Users/leynahoang/Desktop/analytics/spotify-1mil-songs/final_spotify_catalog.csv')