**Table of contents**<a id='toc0_'></a>    
- [🌐 Spotify Recommender System](#toc1_)    
  - [Getting the data](#toc1_1_)    
    - [Connect to the API](#toc1_1_1_)    
    - [Spotify search](#toc1_1_2_)    
    - [Get song information (audio features)](#toc1_1_3_)    
    - [Get album information (audio features of its songs)](#toc1_1_4_)    
    - [Get playlist information](#toc1_1_5_)    
    - [Playlist -> Album -> Songs -> Audio Features](#toc1_1_6_)    
  - [Unsupervised learning (clustering)](#toc1_2_)    
  - [Create the recommendation engine](#toc1_3_)    
- [Acknowledgments](#toc2_)    


# <a id='toc1_'></a>[🌐 Spotify Recommender System](#toc0_)

In [None]:
# You know the drill
# !pip install spotipy

In [None]:
import numpy as np
import pandas as pd
import random
import warnings
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import time
import getpass
from yellowbrick.cluster import KElbowVisualizer

warnings.filterwarnings('ignore')

## <a id='toc1_1_'></a>[Getting the data](#toc0_)

Spotify has an API with a dedicated Python wrapper called `spotipy` (ha, get it?), which can be used to retrieve song, album, and artist information. Additionally, Spotify computes a set of audio features for each track (liveness, instrumentalness, etc.), which are very useful in machine learning applications such as the one we'll build today!

Firstly, we will connect to the Spotify API using our credentials:

In [None]:
# Remember, you don't want other people to see your credentials!
client_id = getpass.getpass()

In [None]:
client_secret = getpass.getpass()

### <a id='toc1_1_1_'></a>[Connect to the API](#toc0_)

In [None]:
spotify = spotipy.Spotify(
    client_credentials_manager=SpotifyClientCredentials(
        client_id=client_id,
        client_secret=client_secret))
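
Instead of typing the credentials in every session, we could read them from environment variables. The sketch below assumes the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET`, which are the names `spotipy` looks up when `SpotifyClientCredentials()` is created without arguments. The helper `load_credentials` is hypothetical and only illustrates the lookup; a plain dictionary stands in for the real environment.

```python
import os

def load_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    Hypothetical helper for illustration; not part of spotipy.
    """
    try:
        return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None

# A plain dict stands in for os.environ in this demo
fake_env = {"SPOTIPY_CLIENT_ID": "my-id", "SPOTIPY_CLIENT_SECRET": "my-secret"}
demo_id, demo_secret = load_credentials(fake_env)
print(demo_id)
```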

### <a id='toc1_1_2_'></a>[Spotify search](#toc0_)

We can run a search similarly to how we would in the Spotify app:

In [None]:
song = spotify.search(q="Bohemian Rhapsody", limit=3)

In [None]:
song

The Spotify API returns JSON, which `spotipy` parses into Python dictionaries:

In [None]:
song.keys()

In [None]:
song['tracks']

We notice that `song['tracks']` is also a dictionary, so we can repeat the process:

In [None]:
song["tracks"].keys()

We have a couple of keys here:
- `href` - a link to the Web API endpoint returning the full result of the request
- `items` - the list of results on this page
- `limit` - the maximum number of items per page (the `limit` we passed to the search)
- `next` - URL of the next page of items (`None` if there is no next page)
- `offset` - the index of the first item on this page
- `previous` - URL of the previous page of items (`None` if there is no previous page)
- `total` - the total number of results available

The hits from our search are stored in `song["tracks"]["items"]`:

In [None]:
len(song["tracks"]["items"]) # As we expected, this is equal to 3
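
The same navigation can be practised without calling the API. The sketch below uses a hand-written dictionary, `fake_search`, that only mimics the shape of a search response; all of its values (including the URLs) are made up.

```python
# Hand-written stand-in for a search response; every value is made up
fake_search = {
    "tracks": {
        "href": "https://api.example/search?offset=0&limit=3",
        "items": [{"name": "Song A"}, {"name": "Song B"}, {"name": "Song C"}],
        "limit": 3,
        "next": "https://api.example/search?offset=3&limit=3",
        "offset": 0,
        "previous": None,
        "total": 42,
    }
}

page = fake_search["tracks"]

# Names of the hits on this page
names = [item["name"] for item in page["items"]]

# Is there another page, and how many results are still to come?
has_more = page["next"] is not None
remaining = page["total"] - (page["offset"] + len(page["items"]))

print(names)
print(has_more, remaining)
```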

We can select the first element and keep inspecting:

In [None]:
song["tracks"]["items"][0].keys()

Now we have many more details about the specific song, including some very relevant ones such as `album`, `artists`, `name`, and `uri`. URIs are Uniform Resource Identifiers, and Spotify assigns a unique URI to every song, album, and playlist.
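
A Spotify URI has the form `spotify:<type>:<id>`, and the same ID also appears in the share links from the app. As a small sketch, the hypothetical helper `spotify_id_and_type` (not part of `spotipy`) splits either form into its type and ID:

```python
from urllib.parse import urlparse

def spotify_id_and_type(link):
    """Split a Spotify URI or open.spotify.com URL into (type, id).

    Hypothetical helper for illustration; not part of spotipy.
    """
    if link.startswith("spotify:"):
        _, kind, ident = link.split(":", 2)
        return kind, ident
    # URL form: https://open.spotify.com/<type>/<id>?si=...
    # urlparse keeps the "?si=..." part out of .path
    parts = urlparse(link).path.strip("/").split("/")
    return parts[-2], parts[-1]

print(spotify_id_and_type("spotify:track:6YMPu36VGIknb8Ey1ohW3j"))
print(spotify_id_and_type("https://open.spotify.com/track/6YMPu36VGIknb8Ey1ohW3j"))
```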

In [None]:
song["tracks"]["items"][0]["artists"][0].keys()

Who were the artists playing Bohemian Rhapsody?

In [None]:
song["tracks"]["items"][0]["artists"][0]["name"]

![](https://media3.giphy.com/media/dhgg2GTU8pv8vmkdiW/giphy.gif?cid=ecf05e47vh8cfhakzo9clp91r1cewyp82u0r9o80g319kfgj&ep=v1_gifs_search&rid=giphy.gif&ct=g)

### <a id='toc1_1_3_'></a>[Get song information (audio features)](#toc0_)

Now that we've learnt how to access songs using Spotify's search function, we will extract audio features to build our subsequent clustering model. This time, instead of querying for a specific song, I'm using a link taken directly from Spotify:

In [None]:
song = spotify.track("https://open.spotify.com/track/6YMPu36VGIknb8Ey1ohW3j")

In [None]:
song.keys()

What song is it? :D

In [None]:
# Find out what the song is!

After retrieving the song, I can get its URI to further extract audio features:

In [None]:
# So... what is the URI? It is stored under the "uri" key of the track dictionary
song_uri = song["uri"]

In [None]:
spotify.audio_features(tracks=[song_uri])[0]

Nice! Now it's time to get even more songs :)

### <a id='toc1_1_4_'></a>[Get album information (audio features of its songs)](#toc0_)

We can also extract album information using a direct link:

In [None]:
album = spotify.album_tracks("https://open.spotify.com/album/2WT1pbYjLJciAR26yMebkH?si=Iqlrze6XRM6FfZQWfmRq3A")

and explore the JSON again:

In [None]:
album.keys()

In [None]:
# This is the number of songs in the album
len(album["items"])

We can explore details about the first song:

In [None]:
album["items"][0].keys()

In [None]:
album["items"][0]["name"]

Now we can get the titles of all songs in the album:

In [None]:
for song in album["items"]:
    print(song["name"])

We will get the URIs using a list comprehension so we can later extract the audio features:

In [None]:
album_uris = [song["uri"] for song in album["items"]]

In [None]:
album_track_feat = [spotify.audio_features(uri)[0] for uri in album_uris]

In [None]:
len(album_track_feat)

In [None]:
pd.DataFrame(album_track_feat)
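
Calling the API once per song works, but `spotify.audio_features` also accepts a list of up to 100 tracks per call, which saves requests on larger datasets. The sketch below shows one way to batch the URIs. The helpers `chunks` and `fetch_features_in_batches` are hypothetical, and `fake_fetch` stands in for `spotify.audio_features` so the example runs without credentials.

```python
def chunks(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fetch_features_in_batches(uris, fetch, size=100):
    """Call `fetch` once per batch and flatten the results.

    `fetch` stands in for `spotify.audio_features`: it must accept a list
    of URIs and return one result per URI.
    """
    features = []
    for batch in chunks(uris, size):
        features.extend(fetch(batch))
    return features

# Demo with a fake fetch function that records the size of each call
calls = []
def fake_fetch(batch):
    calls.append(len(batch))
    return [{"uri": uri} for uri in batch]

demo_uris = [f"spotify:track:{i}" for i in range(250)]
demo_feat = fetch_features_in_batches(demo_uris, fake_fetch)
print(len(demo_feat), calls)
```

With the real client, the call would look like `fetch_features_in_batches(album_uris, spotify.audio_features)`.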

### <a id='toc1_1_5_'></a>[Get playlist information](#toc0_)

We can apply the same strategy to extract all the songs from a playlist:

In [None]:
list_items = spotify.playlist_items("https://open.spotify.com/playlist/37i9dQZEVXbMDoHDwVN2tF?si=15bc8d87f6bf4560")

In [None]:
list_items.keys()

In [None]:
len(list_items["items"])

In [None]:
list_items["items"][0].keys()

In [None]:
list_items["items"][0]["track"].keys()

In [None]:
list_items["items"][0]["track"]["name"]

In [None]:
list_items["items"][0]["track"]["album"].keys()

In [None]:
list_items["items"][0]["track"]["album"]["uri"]
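
Note that `playlist_items` returns a single page of results (up to 100 items by default), so longer playlists need the `next` key we saw earlier. The sketch below follows the pages with `client.next(page)`, which is how `spotipy` fetches the following page of a paged result. `all_playlist_items` is a hypothetical helper, and `FakeClient` is a made-up stand-in so the example runs offline.

```python
def all_playlist_items(client, playlist_url):
    """Collect the items of every page of a playlist.

    `client` is expected to behave like a spotipy client: `playlist_items`
    returns one page, and `next(page)` returns the following page.
    """
    page = client.playlist_items(playlist_url)
    items = list(page["items"])
    while page["next"]:
        page = client.next(page)
        items.extend(page["items"])
    return items

class FakeClient:
    """Made-up stand-in that serves two hard-coded pages."""
    pages = [
        {"items": ["a", "b"], "next": "page-2"},
        {"items": ["c"], "next": None},
    ]

    def playlist_items(self, url):
        return self.pages[0]

    def next(self, page):
        return self.pages[1]

print(all_playlist_items(FakeClient(), "any-url"))
```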

### <a id='toc1_1_6_'></a>[Playlist -> Album -> Songs -> Audio Features](#toc0_)

Now we will combine all the previous steps together to build up a music dataset. We will extract all the songs in a playlist, then all the songs for each of their albums. For all the songs we collect, we will create a database with audio features that we can use later on: 

In [None]:
list_items = spotify.playlist_items("https://open.spotify.com/playlist/37i9dQZEVXbMDoHDwVN2tF?si=5f944fd835e14197")

In [None]:
list_items["items"][0].keys()

In [None]:
for item in list_items["items"]:
    print(item["track"]["album"]["name"])

From the playlist info, we can get all the albums' URIs:

In [None]:
album_uris = [item["track"]["album"]["uri"] for item in list_items["items"]]

Then, with the album URIs, we can get all the songs:

In [None]:
albums = [spotify.album_tracks(uri) for uri in album_uris]

In [None]:
# I can check all the songs my dataset will have and count them
count = 0
for album in albums:
    for song in album["items"]:
        count += 1
        print(song["name"])

In [None]:
count # How many songs did we get?

Now we can get all the song URIs to later extract the audio features:

In [None]:
song_uris = [song["uri"] for album in albums for song in album["items"]]

In [None]:
len(song_uris)

In [None]:
songs_feat = [spotify.audio_features(uri)[0] for uri in song_uris]

In [None]:
len(songs_feat)

In [None]:
songs_feat[0]

There are some songs that do not return any results, so we will remove those:

In [None]:
while None in songs_feat:
    songs_feat.remove(None)
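
The same clean-up can be written as a single-pass list comprehension. If we also keep a separate list of song names, filtering both together keeps each name next to its features. A small sketch with made-up data (`demo_names` and `demo_feat` are not real API output):

```python
# Made-up data: the second song has no audio features
demo_names = ["Song A", "Song B", "Song C"]
demo_feat = [{"energy": 0.8}, None, {"energy": 0.3}]

# Same effect as the while loop, in a single pass
feat_only = [feat for feat in demo_feat if feat is not None]

# Filtering names and features together keeps them aligned
pairs = [
    (name, feat)
    for name, feat in zip(demo_names, demo_feat)
    if feat is not None
]

print(len(feat_only))
print([name for name, _ in pairs])
```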

In [None]:
songs_feat_df = pd.DataFrame(songs_feat)

We can wrap all of the previous steps in a function to more easily extract audio features from a given playlist:

In [None]:
def get_features_from_playlist(url):
    list_items = spotify.playlist_items(url)
    album_uris = [item["track"]["album"]["uri"] for item in list_items["items"]]
    albums = [spotify.album_tracks(uri) for uri in album_uris]
    song_uris = [song["uri"] for album in albums for song in album["items"]]
    song_name = [song["name"] for album in albums for song in album["items"]]
    song_artist = [song["artists"][0]["name"] for album in albums for song in album["items"]]
    songs_feat = [spotify.audio_features(uri)[0] for uri in song_uris]

    # Songs without audio features come back as None. Replace them with an
    # empty dict (instead of removing them) so every row stays aligned with
    # its name and artist; those rows show up as NaN and can be dropped later.
    songs_feat = [feat if feat is not None else {} for feat in songs_feat]

    name_df = pd.DataFrame(song_name, columns=["name"])
    artist_df = pd.DataFrame(song_artist, columns=["artist"])
    feat_df = pd.DataFrame(songs_feat)

    final_df = pd.concat([name_df, artist_df, feat_df], axis=1)

    return final_df

Let's test it:

In [None]:
my_df = get_features_from_playlist("https://open.spotify.com/playlist/37i9dQZEVXbMDoHDwVN2tF?si=94f82b9354d2421b")

Review dataframe characteristics:

In [None]:
my_df.shape

In [None]:
my_df.head()

In [None]:
my_df.isna().sum()

We can fully remove songs with no features:

In [None]:
my_df = my_df.dropna()

In [None]:
my_df.dtypes

For our model, we only require the audio features, which are numeric, so we can filter out the rest:

In [None]:
my_df_num = my_df.select_dtypes(include=np.number)

In [None]:
my_df_num.head()

`duration_ms` and `time_signature` are not really interesting features for clustering the songs, so we will drop them:

In [None]:
my_df_num = my_df_num.drop(columns = ["duration_ms", "time_signature"])

## <a id='toc1_2_'></a>[Unsupervised learning (clustering)](#toc0_)

Scaling: we bring all features to the same range so that none of them dominates the distance calculations:

In [None]:
scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(my_df_num)
data_scaled = pd.DataFrame(data_scaled, columns=my_df_num.columns)

In [None]:
data_scaled.describe()
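
With its default settings, `MinMaxScaler` rescales each column to the range [0, 1] by computing `(x - min) / (max - min)` with the minimum and maximum seen during `fit`. The plain-Python sketch below reproduces that formula for a single made-up column; `fit_min_max` and `min_max_scale` are hypothetical helpers, not part of scikit-learn.

```python
def fit_min_max(values):
    """Return the (min, max) pair learned from the training values."""
    return min(values), max(values)

def min_max_scale(value, lo, hi):
    """Rescale `value` so that `lo` maps to 0 and `hi` maps to 1."""
    return (value - lo) / (hi - lo)

tempos = [60.0, 120.0, 180.0]            # made-up training column
lo, hi = fit_min_max(tempos)
scaled = [min_max_scale(t, lo, hi) for t in tempos]

# A new song outside the training range lands outside [0, 1]
new_song = min_max_scale(210.0, lo, hi)

print(scaled)
print(new_song)
```

The last value is worth keeping in mind: a song transformed later with the already fitted scaler can fall slightly outside [0, 1] if its features exceed what the training data contained.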

Clustering: we use the elbow method to decide on the number of clusters:

In [None]:
model = KMeans()
visualizer = KElbowVisualizer(model, k=(2,20))
visualizer.fit(data_scaled)
visualizer.show()  # poof() is the deprecated name of this method
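
By default, the elbow plot tracks the distortion score: the sum of squared distances from each point to the centre of its cluster. The score always shrinks as we add clusters, so we look for the point where the improvement flattens out. A minimal sketch of the score on made-up 2D points, with the hypothetical helper `within_cluster_ss` and cluster centres taken as the mean of each group:

```python
def within_cluster_ss(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        # Centroid = coordinate-wise mean of the cluster members
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        for point in members:
            total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total

points = [(0, 0), (0, 2), (10, 0), (10, 2)]      # two obvious groups
one_cluster = within_cluster_ss(points, [0, 0, 0, 0])
two_clusters = within_cluster_ss(points, [0, 0, 1, 1])
print(one_cluster, two_clusters)
```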

Having chosen the number of clusters (6 here), we fit the model and predict the cluster of each song:

In [None]:
model = KMeans(n_clusters=6)
model.fit(data_scaled)
clusters = model.predict(data_scaled)

In [None]:
len(clusters)

In [None]:
# Check the clusters
clusters

Now I can add the song clusters to the audio features dataframe to use in song recommendations:

In [None]:
my_df["clusters"] = clusters
my_df.head(3)

Review cluster distribution:

In [None]:
my_df["clusters"].value_counts()

One of the strategies we use to recommend similar songs is to select another song from the same cluster:

In [None]:
my_df[my_df["clusters"] == 1]

## <a id='toc1_3_'></a>[Create the recommendation engine](#toc0_)

Now, let's put our knowledge in action! We'll ask a user to input a song:

In [None]:
query = input("Please, input a song name and you will get one recommendation.")

In [None]:
query

And we will use their input to search for a song on Spotify:

In [None]:
song = spotify.search(q=query, limit=1)

We will get the audio features for the song to later figure out what cluster it should belong to:

In [None]:
song.keys()

In [None]:
song["tracks"].keys()

In [None]:
song["tracks"]["items"][0].keys()

In [None]:
song["tracks"]["items"][0]["uri"]

After extracting the URI, we can use it to get the audio features:

In [None]:
song_uri = song["tracks"]["items"][0]["uri"]
query_aud_feat = spotify.audio_features(song_uri)
query_aud_feat = pd.DataFrame(query_aud_feat[0], index=[0])
query_aud_feat

Convert the audio features dataframe to the same format as the dataframe we used during training:

In [None]:
query_aud_feat = query_aud_feat.select_dtypes(include=np.number)
query_aud_feat = query_aud_feat.drop(columns = ["duration_ms", "time_signature"])
query_aud_feat

Scale using the previously trained scaler:

In [None]:
query_scaled = scaler.transform(query_aud_feat)

And predict the song cluster using our previously trained model:

In [None]:
query_cluster = model.predict(query_scaled)
query_cluster[0]
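
For a fitted k-means model, predicting the cluster of a new song amounts to finding the closest cluster centre (available as `model.cluster_centers_` in scikit-learn). The sketch below shows that idea in plain Python with made-up centres; `nearest_centroid` is a hypothetical helper.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    return distances.index(min(distances))

# Made-up cluster centres in a two-feature scaled space
centroids = [(0.2, 0.8), (0.9, 0.1)]

print(nearest_centroid((0.3, 0.7), centroids))
print(nearest_centroid((0.8, 0.2), centroids))
```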

So now we can filter the original dataframe to get all songs from the same cluster:

In [None]:
query_same_cluster = my_df[my_df["clusters"] == query_cluster[0]]
query_same_cluster.head()

Then we can take a random song from the filtered dataset, which will be our recommendation for the user:

In [None]:
recommendation = query_same_cluster.sample()
recommendation
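
If the queried song is itself part of our dataset, a random pick could return the very same song. One possible refinement, sketched here with plain dictionaries instead of a dataframe, is to exclude it before sampling. `pick_recommendation` and the sample data are hypothetical.

```python
import random

def pick_recommendation(candidates, query_uri, rng=random):
    """Pick a random song from `candidates`, skipping the queried song itself."""
    pool = [song for song in candidates if song["uri"] != query_uri]
    return rng.choice(pool) if pool else None

# Made-up songs from the same cluster
same_cluster = [
    {"name": "Song A", "uri": "spotify:track:aaa"},
    {"name": "Song B", "uri": "spotify:track:bbb"},
]

pick = pick_recommendation(same_cluster, query_uri="spotify:track:aaa")
print(pick["name"])
```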

But we will present it to the user simply using the song name, rather than the full feature dataframe:

In [None]:
recom_title = recommendation["name"].item()
recom_artist = recommendation["artist"].item()
print(f"You should listen to '{recom_title}' by '{recom_artist}'")

Finally, we'll put everything into a function that we can run behind an app (remember which Python library we could use for this? ;) ):

In [None]:
def recommender():
    query = input("Please, input a song name and you will get one recommendation.")
    song = spotify.search(q=query, limit=1)
    song_uri = song["tracks"]["items"][0]["uri"]
    query_aud_feat = spotify.audio_features(song_uri)
    query_aud_feat = pd.DataFrame(query_aud_feat[0], index=[0])
    query_aud_feat = query_aud_feat.select_dtypes(include=np.number)
    query_aud_feat = query_aud_feat.drop(columns = ["duration_ms", "time_signature"])
    query_scaled = scaler.transform(query_aud_feat)
    query_cluster = model.predict(query_scaled)
    query_same_cluster = my_df[my_df["clusters"] == query_cluster[0]]
    recommendation = query_same_cluster.sample()
    recom_title = recommendation["name"].item()
    recom_artist = recommendation["artist"].item()
    return f"You should listen to '{recom_title}' by '{recom_artist}'"

In [None]:
recommender()
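
One thing to watch out for: if the search returns no hits, indexing `["items"][0]` raises an `IndexError`. A small guard, sketched below with the hypothetical helper `first_track_uri` and made-up responses, lets the app show a friendly message instead:

```python
def first_track_uri(search_result):
    """Return the URI of the first hit, or None when the search found nothing."""
    items = search_result.get("tracks", {}).get("items", [])
    return items[0]["uri"] if items else None

# Made-up responses that only mimic the shape of a search result
found = {"tracks": {"items": [{"uri": "spotify:track:abc123"}]}}
empty = {"tracks": {"items": []}}

print(first_track_uri(found))
print(first_track_uri(empty))
```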

# <a id='toc2_'></a>[Acknowledgments](#toc0_)

Thank you, Miguel SM, for the contents of this lesson!