# Spotify Music Recommendation Model

> Modeling recommendation relationships between Spotify tracks using the Million Playlist Dataset and graph theory.

Feel free to explore this notebook's contents, as it is well-documented with explanations for most aspects of the model-building process.

## Links

<ul>
    <li><a href="https://www.kaggle.com/datasets/himanshuwagh/spotify-million">Million Playlist Dataset</a></li>
</ul>

In [1]:
import numpy as np
import os
import json
import rustworkx as rx
import pickle

## Loading Data

Only reasonably popular, short, and diverse playlists are kept: each must have at least 5 followers, at most 70 tracks, and a num_tracks / num_artists ratio of at most 2 (that is, no more than two tracks per artist on average).
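
The loading cell skips a playlist if it has fewer than 5 followers, more than 70 tracks, or more than 2 tracks per artist on average. As a minimal sketch of those three checks, the hypothetical helper `keep_playlist` below (not part of the notebook) expresses them as one predicate; the example dictionaries are made up and contain only the fields the checks read.

```python
# Hypothetical helper mirroring the three checks in the loading cell.
MIN_FOLLOWERS = 5
MAX_TRACKS = 70
MAX_TRACKS_PER_ARTIST = 2

def keep_playlist(p):
    """Return True if a playlist passes the popularity, size, and diversity checks."""
    if p["num_followers"] < MIN_FOLLOWERS:
        return False
    if p["num_tracks"] > MAX_TRACKS:
        return False
    return p["num_tracks"] / p["num_artists"] <= MAX_TRACKS_PER_ARTIST

# Made-up playlists, one per outcome.
examples = [
    {"num_followers": 12, "num_tracks": 40, "num_artists": 30},  # kept
    {"num_followers": 2, "num_tracks": 40, "num_artists": 30},   # too few followers
    {"num_followers": 12, "num_tracks": 90, "num_artists": 60},  # more than 70 tracks
    {"num_followers": 12, "num_tracks": 40, "num_artists": 10},  # 4 tracks per artist
]
kept = [keep_playlist(p) for p in examples]
print(kept)
```

Factoring the checks into one function would also avoid repeating them in both passes over the data files.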

The uriToData dictionary maps each track's URI (its unique identifier) to a tuple of graph index, name, and artist. This lookup matters because the graph stores each track only as a node index whose payload is the URI.

The allPlaylists NumPy array stores all playlists as a 2D array in which each row is a playlist. To reduce memory usage, tracks are stored as their graph indices, and rows with fewer than 70 tracks are padded with -1.
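
To make the two structures concrete, here is a toy sketch with made-up URIs and names. It uses plain Python lists as a stand-in for the NumPy array and a row width of 4 instead of 70; the names `uri_to_data`, `rows`, `PAD`, and `WIDTH` are illustrative only.

```python
# Toy stand-in for uriToData and allPlaylists; all URIs and names are made up.
PAD = -1
WIDTH = 4  # the notebook uses 70

playlists = [
    ["uriA", "uriB", "uriC"],
    ["uriB", "uriD"],
]
names = {
    "uriA": ("Song A", "Artist 1"),
    "uriB": ("Song B", "Artist 1"),
    "uriC": ("Song C", "Artist 2"),
    "uriD": ("Song D", "Artist 2"),
}

# First pass: assign each new URI the next free graph index.
uri_to_data = {}
for tracks in playlists:
    for uri in tracks:
        if uri not in uri_to_data:
            name, artist = names[uri]
            uri_to_data[uri] = (len(uri_to_data), name, artist)

# Second pass: store each playlist as a fixed-width row of graph indices.
rows = []
for tracks in playlists:
    idxs = [uri_to_data[uri][0] for uri in tracks]
    rows.append(idxs + [PAD] * (WIDTH - len(idxs)))

print(uri_to_data["uriD"])  # (3, 'Song D', 'Artist 2')
print(rows)                 # [[0, 1, 2, -1], [1, 3, -1, -1]]
```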

<b>Note:</b> the data files are not provided in the GitHub repository (too large). You will have to download them from the dataset link.

In [4]:
%%time

files = os.listdir('./data')

uriToData = {}
numPlaylists = 0

for file in files:
    if file == '.DS_Store':
        continue
    with open(f"./data/{file}") as f:
        d = json.load(f)
        playlists = d["playlists"]
        for i in range(len(playlists)):
            if playlists[i]["num_followers"] < 5:
                continue
            if playlists[i]["num_tracks"] > 70:
                continue
            if playlists[i]["num_tracks"] / playlists[i]["num_artists"] > 2:
                continue
            playlist = playlists[i]["tracks"]
            numPlaylists += 1
            for j in range(len(playlist)):
                playlist[j]["track_uri"] = playlist[j]["track_uri"].split(':')[2]
                if playlist[j]["track_uri"] not in uriToData:
                    uriToData[playlist[j]["track_uri"]] = (len(uriToData), playlist[j]["track_name"], playlist[j]["artist_name"])

allPlaylists = np.full((numPlaylists, 70), -1, dtype='int32')
counter = 0

for file in files:
    if file == '.DS_Store':
        continue
    with open(f"./data/{file}") as f:
        d = json.load(f)
        playlists = d["playlists"]
        for i in range(len(playlists)):
            if playlists[i]["num_followers"] < 5:
                continue
            if playlists[i]["num_tracks"] > 70:
                continue
            if playlists[i]["num_tracks"] / playlists[i]["num_artists"] > 2:
                continue
            playlist = playlists[i]["tracks"]
            for j in range(len(playlist)):
                allPlaylists[counter, j] = uriToData[playlist[j]["track_uri"].split(':')[2]][0]
            counter += 1

CPU times: user 4min 52s, sys: 17.7 s, total: 5min 10s
Wall time: 5min 15s


## Populating the Graph

An undirected, edge-weighted graph is used to represent the relationships between tracks.

Two tracks are connected in the graph if they appear together in at least one playlist, and the weight of the edge counts how many playlists they share. For example, two tracks that share 20 playlists have a strong connection, while two tracks that share only 1 playlist have a weak one. With tracks as nodes, the neighbors joined to a node by its heaviest edges are the songs that should be recommended for it.

Intuitively, this makes sense: if you like a song and it appears alongside another song in 20 playlists, that other song is a reasonable recommendation.
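
The weighting idea can be sketched on a toy example using only the standard library. This is not the notebook's implementation: it keeps the weights in a `Counter` keyed by track pairs, and it deduplicates tracks within a playlist as a simplifying assumption. The track labels and the `neighbors` helper are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Three made-up playlists over tracks A-D.
playlists = [
    ["A", "B", "C"],
    ["A", "B"],
    ["B", "C", "D"],
]

# Edge weight = number of playlists in which both tracks appear.
weights = Counter()
for tracks in playlists:
    # sorted(set(...)) gives every unordered pair one canonical key.
    for a, b in combinations(sorted(set(tracks)), 2):
        weights[(a, b)] += 1

def neighbors(track):
    """Return (neighbor, weight) pairs for a track, heaviest edge first."""
    out = []
    for (a, b), w in weights.items():
        if track == a:
            out.append((b, w))
        elif track == b:
            out.append((a, w))
    return sorted(out, key=lambda pair: (-pair[1], pair[0]))

print(neighbors("B"))  # [('A', 2), ('C', 2), ('D', 1)]
```

Track B shares two playlists each with A and C but only one with D, so A and C rank ahead of D as recommendations for B.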

### Why RustworkX

Building the graph with Python's built-in data structures would take significantly longer (hours) and use much more memory. RustworkX, by contrast, is implemented in Rust and populates the graph in under a minute while using far less memory.

In [5]:
%%time
graph = rx.PyGraph(multigraph=False)

for k, v in uriToData.items():
    graph.add_node(k)

for i in range(numPlaylists):
    playlist = allPlaylists[i, :]
    for j in range(70):
        if playlist[j] == -1:
            break
        for k in range(j + 1, 70):
            if playlist[k] == -1:
                break
            if graph.has_edge(playlist[j], playlist[k]):
                graph.update_edge(playlist[j], playlist[k], graph.get_edge_data(playlist[j], playlist[k]) + 1)
            else:
                graph.add_edge(playlist[j], playlist[k], 1)
        

CPU times: user 55.4 s, sys: 109 ms, total: 55.5 s
Wall time: 55.7 s


## Creating an API for our Model

A few helper functions make it easy to get recommendations from the model and print them to the console.

In [6]:
# Helper functions for querying the graph and the URI lookup dictionary

def get_uri(url):
    # The track ID is the last path segment; drop any query string (e.g. "?si=...") first
    return url.split('?')[0].rstrip('/').split('/')[-1]

def get_recommendations(url, numRecs=5):
    try:
        trackIdx = uriToData[get_uri(url)][0]
    except KeyError:
        print("song not found in database")
        return None
    recommendations = sorted(dict(graph.incident_edge_index_map(trackIdx)).items(), key=lambda x: x[1][2], reverse=True)
    result = []
    for i in range(len(recommendations)):
        if i >= numRecs:
            break
        result.append((uriToData[graph[recommendations[i][1][1]]][1], recommendations[i][1][2], uriToData[graph[recommendations[i][1][1]]][2]))
    return result

def print_recommendations(recs):
    if recs is None:
        print("no recommendations")
        return
    for recSong, numAprs, artist in recs:
        print(f"{recSong} by {artist} - {numAprs}")

## Model Performance

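Retrieving 10 recommendations for a track takes about 1 ms in the run below, well under 5 ms. The cost does not depend on the overall size of the graph. It grows with the number of neighbors of the queried track, because that track's incident edges are sorted by weight. This makes the model fast enough for web app integration.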

In [11]:
%%time

# Fluorescent Adolescent by Arctic Monkeys
recommendations = get_recommendations("https://open.spotify.com/track/7e8utCy2JlSB8dRHKi49xM", numRecs=10)

CPU times: user 1.02 ms, sys: 12 μs, total: 1.03 ms
Wall time: 1.05 ms
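
The lookup cost comes from ranking the queried track's incident edges. Sorting all d of them takes O(d log d) time. If a track had a very large number of neighbors, one possible alternative (not what the notebook does) is a partial selection with `heapq.nlargest`, which takes roughly O(d log k) for the top k. The edge list and the `top_k` helper below are made up for illustration.

```python
import heapq

# Hypothetical incident edges of one track: (neighbor_index, shared_playlists).
incident = [(10, 4), (11, 7), (12, 1), (13, 5), (14, 6), (15, 2)]

def top_k(edges, k):
    """Return the k heaviest edges, heaviest first, without sorting all of them."""
    return heapq.nlargest(k, edges, key=lambda edge: edge[1])

print(top_k(incident, 3))  # [(11, 7), (14, 6), (13, 5)]
```

For tracks with only a few hundred neighbors the difference is unlikely to matter, so the full sort is a reasonable default.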


### Quality of Recommendations

The tracks recommended for Fluorescent Adolescent by Arctic Monkeys have some clear similarities with it. For example, Do I Wanna Know? is also by Arctic Monkeys. The Less I Know The Better, Electric Feel, and most of the other tracks also fall within alternative rock, and Lisztomania shares a nostalgic theme. The number after each track is the number of playlists it shares with the queried track.

In [13]:
print_recommendations(recommendations)

The Less I Know The Better by Tame Impala - 7
Electric Feel by MGMT - 6
Feels Like We Only Go Backwards by Tame Impala - 5
Houdini by Foster The People - 5
Do I Wanna Know? by Arctic Monkeys - 5
Pumped Up Kicks by Foster The People - 4
A-Punk by Vampire Weekend - 4
Lisztomania by Phoenix - 4
Take a Walk by Passion Pit - 4
Something Good Can Work by Two Door Cinema Club - 4


## Exporting Model and Data for Deployment

To deploy our model behind a web API, we need the graph to retrieve recommendations. We also need a data structure that efficiently stores and retrieves the URI, name, artist, and graph index of every track, so we can display the graph's recommendations.

The graph will be saved as a pickle file so it can be loaded into a Flask application (check out the api folder). The URI, name, artist, and graph index data will be saved as a CSV file so it can be imported into a SQL table.

### Why SQL over Python Dictionary

We use SQL instead of loading a Python dictionary into Flask for two reasons. First, a B-tree index lets the database find a track among millions of records in the order of milliseconds without holding the whole table in the web process's memory. A Python dictionary lookup is faster in isolation, but the full dictionary would have to be loaded into every Flask process. Second, we want an autocompleting search bar in our web app, and SQL's "LIKE" operator with wildcard string patterns makes that straightforward. Prefix patterns such as 'elec%' can typically use the index, whereas patterns with a leading wildcard generally cannot.
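
As a rough sketch of the autocomplete query, the example below uses Python's built-in `sqlite3` module with an in-memory database. The notebook does not say which SQL engine the web app uses, so SQLite, the table name `tracks`, the index name, and the `autocomplete` helper are all assumptions. The columns follow the exported CSV header, and the URIs and graph indices are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tracks (uri TEXT PRIMARY KEY, name TEXT, artist TEXT, graph_index INTEGER)"
)
# Index on the column that the search bar filters by.
conn.execute("CREATE INDEX idx_tracks_name ON tracks (name COLLATE NOCASE)")
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?, ?, ?)",
    [
        ("uri1", "Electric Feel", "MGMT", 0),
        ("uri2", "Feels Like We Only Go Backwards", "Tame Impala", 1),
        ("uri3", "Fluorescent Adolescent", "Arctic Monkeys", 2),
        ("uri4", "Lisztomania", "Phoenix", 3),
    ],
)

def autocomplete(prefix, limit=5):
    """Return (name, artist) rows whose name starts with the typed prefix."""
    # Note: LIKE wildcards typed by the user (% and _) are not escaped in this sketch.
    cursor = conn.execute(
        "SELECT name, artist FROM tracks WHERE name LIKE ? ORDER BY name LIMIT ?",
        (prefix + "%", limit),
    )
    return cursor.fetchall()

print(autocomplete("f"))
```

Passing the prefix as a bound parameter keeps user input out of the SQL text. Whether the index is actually used for a `LIKE` query depends on the engine and its collation settings, so it is worth checking the query plan on the real database.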

In [8]:
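import csv

# Serialize the graph for the web application
with open('./web/recommendation_graph.pkl', 'wb') as f:
    pickle.dump(graph, f)

# Write track metadata to CSV; UTF-8 keeps non-ASCII track and artist names intact
with open('names.csv', 'w', newline='', encoding='utf-8') as csvfile: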
    fieldnames = ['uri', 'name', 'artist', 'graph_index']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
    writer.writeheader()
    for k, v in uriToData.items():
        writer.writerow({'uri': k, 'name': v[1], 'artist': v[2], 'graph_index': v[0]})