# Daltonify: An Audio Feature Based Recommender System

## *Building the Recommender System*

#### Table of Contents

* [The Recommender System](#recommender-system)
* [Creating a Playlist](#create-playlist)
* [Evaluating Playlists](#eval-playlist)

### Import Libraries

In [1]:
## STANDARD IMPORTS
import pandas as pd 
import numpy as np
import re
## VISUALIZATIONS
import matplotlib.pyplot as plt
import seaborn as sns
## MODELING
from sklearn.metrics.pairwise import cosine_similarity
## SPOTIFY
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

In [3]:
### Spotify Credentials - must be set in local environment to run
auth_manager = SpotifyClientCredentials()
sp = spotipy.Spotify(auth_manager=auth_manager)

### Read in Sample Data

The sample data is stored in pre-extracted CSV files so that we don't have to pull a new sample set each time we test the recommender system.

In [12]:
### read in data
df = pd.read_csv('../data/country.csv')
track = pd.read_csv('../data/boston.csv')

We don't need all of the audio features right now, so we'll drop the unused ones.

In [13]:
drop_cols = ['key', 'mode', 'time_signature', 'duration_ms']
df.drop(columns=drop_cols, inplace=True)
track.drop(columns=drop_cols, inplace=True, errors='ignore')  ### errors='ignore' skips any of these columns missing from the track file

In [14]:
### check
df.head(2)

| | track_name | artist | album | track_id | popularity | danceability | energy | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Forever After All | Luke Combs | What You See Ain't Always What You Get (Deluxe... | 6IBcOGPsniK3Pso1wHIhew | 0.86 | 0.487 | 0.65 | -5.195 | 0.0253 | 0.191 | 0.0 | 0.0933 | 0.456 | 151.964 |
| 1 | Be Like That - feat. Swae Lee & Khalid | Kane Brown | Be Like That (feat. Swae Lee & Khalid) | 5f1joOtoMeyppIcJGZQvqJ | 0.87 | 0.727 | 0.626 | -8.415 | 0.0726 | 0.0469 | 2.6e-05 | 0.126 | 0.322 | 86.97 |


In [15]:
### check
track

| | track_name | artist | album | track_id | popularity | danceability | energy | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Boston | Dalton & the Sheriffs | Luckier by Half | 4HJ7mSMtHAdU55lLjGE4zW | 0.15 | 0.541 | 0.921 | -5.25 | 0.0443 | 0.00052 | 0.0784 | 0.159 | 0.613 | 99.98 |


In [18]:
data = pd.concat([df, track], ignore_index=True)
data.tail(2)

| | track_name | artist | album | track_id | popularity | danceability | energy | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1999 | White Trash | Tom MacDonald | White Trash | 7MqeeNEYUxb7JeFnaBSSwo | 0.57 | 0.737 | 0.775 | -5.65 | 0.0993 | 0.024 | 0.0 | 0.104 | 0.339 | 154.04 |
| 2000 | Boston | Dalton & the Sheriffs | Luckier by Half | 4HJ7mSMtHAdU55lLjGE4zW | 0.15 | 0.541 | 0.921 | -5.25 | 0.0443 | 0.00052 | 0.0784 | 0.159 | 0.613 | 99.98 |


## The Recommender System <a class="anchor" id="recommender-system"></a>
<hr/>

This is a content-based recommender system built on a subset of the Spotify audio features. It takes a single track and generates a list of tracks from a desired genre. Every candidate track is scored against the input track with cosine similarity. The tracks in the top half of similarity scores are then ranked by popularity, so the playlist favors popular tracks among those that sound most like the input.
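
To make the scoring concrete, here is a minimal sketch of cosine similarity on made-up feature vectors (the values are illustrative and not taken from the sample data). The helper name `cosine_sim` is hypothetical; the notebook itself relies on `sklearn.metrics.pairwise.cosine_similarity`, which computes the same quantity for every pair of rows.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two feature vectors: a.b / (|a| |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative vectors, in the order:
# danceability, energy, valence, instrumentalness, acousticness, speechiness
seed = [0.54, 0.92, 0.61, 0.08, 0.00, 0.04]
similar = [0.55, 0.80, 0.63, 0.00, 0.01, 0.03]
different = [0.30, 0.20, 0.15, 0.90, 0.95, 0.04]

# the track with a similar feature profile scores much closer to 1
print(round(cosine_sim(seed, similar), 3))
print(round(cosine_sim(seed, different), 3))
```

Because all six features are non-negative, the scores fall between 0 and 1. Cosine similarity compares the direction of the vectors and ignores their length, so it measures how similar the balance of features is.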

### Choosing features

The following audio features were used.
* danceability
* energy
* valence
* instrumentalness
* acousticness
* speechiness

These features were chosen because they are already on a 0 to 1 scale and can be compared directly. Loudness and tempo may also be valuable, but they first need to be rescaled to a 0 to 1 range before they can be used in the system. They and other features may be incorporated in future iterations.
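
One possible way to bring loudness and tempo onto a 0 to 1 scale is min-max scaling over the sample set. This is a sketch of that idea and not something the notebook currently does; the helper name `min_max_scale` and the example values are assumptions for illustration.

```python
import pandas as pd

def min_max_scale(series):
    """Rescale a numeric column to [0, 1] using its own min and max."""
    lo, hi = series.min(), series.max()
    if hi == lo:
        # constant column: avoid division by zero, map everything to 0
        return pd.Series(0.0, index=series.index)
    return (series - lo) / (hi - lo)

# Made-up loudness (dB) and tempo (BPM) values, for illustration only
sample = pd.DataFrame({
    'loudness': [-8.4, -5.2, -4.4, -6.2],
    'tempo': [87.0, 152.0, 132.0, 76.0],
})

# apply the scaling to each column independently
scaled = sample.apply(min_max_scale)
print(scaled)
```

The input track would need to be scaled with the same minimum and maximum as the sample set (for example, by scaling after the two are concatenated), otherwise its values are not comparable. scikit-learn's `MinMaxScaler` performs the same transformation.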

### Outline of the Recommender

The functions below carry out this general process:
1. Combine the given track and the sample of 2000 tracks into a single dataframe for scoring.
2. Score the tracks against each other using cosine similarity.
3. Locate the given track in the resulting similarity scores.
4. Keep the tracks at or above the median similarity score (the top half), rank them by popularity, and return the requested number of tracks.
5. List the URIs for the tracks in a simple text output that can be copied and pasted into Spotify to generate the playlist.
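
Step 4 can be illustrated on a toy table. The scores and popularity values below are invented; the point is only that the median cut removes the least similar half before popularity decides the order.

```python
import pandas as pd

# Invented scores and popularity values, for illustration only
toy = pd.DataFrame({
    'track_name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'score': [0.99, 0.97, 0.95, 0.80, 0.70, 0.60],
    'popularity': [0.20, 0.85, 0.50, 0.90, 0.95, 0.10],
})

# keep tracks at or above the median similarity score (0.875 here)
top_half = toy[toy['score'] >= toy['score'].median()]
# rank the survivors by popularity and keep the requested number
picks = top_half.sort_values(by='popularity', ascending=False).head(2)
print(picks['track_name'].tolist())  # ['B', 'C']
```

Tracks D and E are the most popular overall, but they fall below the median similarity score and are never considered.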

Other scoring metrics were attempted with varying results. Cosine Similarity was chosen for simplicity. Other scoring methods will be explored in the future.
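
The notebook does not say which other metrics were tried. As one illustrative alternative (an assumption, not the author's method), Euclidean distance can be turned into a similarity score. The helper names below are hypothetical. Unlike cosine similarity, this score is sensitive to the magnitude of the feature values and not just their proportions.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: compares direction only."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_similarity(a, b):
    """Map Euclidean distance d to a similarity in (0, 1] via 1 / (1 + d)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(1.0 / (1.0 + np.linalg.norm(a - b)))

# Two made-up vectors with the same proportions but different magnitudes
low = [0.2, 0.3, 0.1]
high = [0.6, 0.9, 0.3]

print(cosine_sim(low, high))            # 1.0 up to rounding: same direction
print(euclidean_similarity(low, high))  # below 1: the magnitudes differ
```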

In [20]:
def add_track_data(df, track):
    '''combines track sample set and given track into single dataframe'''
    ### append the given track after the sample set
    data = pd.concat([df, track], ignore_index=True)
    ### desired features for model (may change later)
    features = ['danceability','energy','valence','instrumentalness','acousticness','speechiness']
    X = data[features]
    return X, data

def pop_track_recommender(df, track):
    '''uses cosine similarity to score every track against the given track'''
    
    ### combine sample set and given track
    X, data = add_track_data(df, track)
    
    ### calculate similarity matrix
    similarity_matrix = cosine_similarity(X, X)
    
    ### the given track is appended after the sample set, so its row index is len(df)
    ### (a lookup by track_id would return several rows if the track is also in the sample)
    track_index = len(df)
    
    ### take the given track's row of the (symmetric) similarity matrix
    similarity_scores = pd.Series(similarity_matrix[track_index], index=data.index)

    ### CREATE DF OF ALL SCORES
    rec_tracks_df = data.copy()
    rec_tracks_df['score'] = similarity_scores
    rec_tracks_df.sort_values(by=['score', 'popularity'], ascending=False, inplace=True)

    return rec_tracks_df

def top_recommended_tracks(results, num_tracks):
    '''selects songs at or above the median similarity score (top 50%)
    and sorts them by popularity'''
    
    ### KEEP TRACKS AT OR ABOVE THE MEDIAN SIMILARITY SCORE
    top_half = results[results['score'] >= results['score'].median()].copy()
    ### SORT VALUES BY POPULARITY
    top_half.sort_values(by='popularity', ascending=False, inplace=True)
    ### SELECT DESIRED NUMBER OF TRACKS
    top_tracks = top_half[:num_tracks]
    
    return top_tracks

def recommender(df, track, num_tracks):
    '''combines functions above into single function call for simplicity'''
    results = pop_track_recommender(df, track)
    ### rank every eligible track, then drop the given track so it is not listed twice
    ranked = top_recommended_tracks(results, len(results))
    ranked = ranked[~ranked['track_id'].isin(track['track_id'])]
    top_tracks = ranked[:num_tracks]
    ### ADD TRACK TO TOP OF DATAFRAME SO INCLUDED IN PLAYLIST LIST
    playlist = pd.concat([track, top_tracks], ignore_index=True)
    return playlist


In [21]:
### Test function output
playlist = recommender(df, track, 15)
playlist

| | track_name | artist | album | track_id | popularity | danceability | energy | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo | score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Boston | Dalton & the Sheriffs | Luckier by Half | 4HJ7mSMtHAdU55lLjGE4zW | 0.15 | 0.541 | 0.921 | -5.25 | 0.0443 | 0.00052 | 0.0784 | 0.159 | 0.613 | 99.98 | |
| 1 | Forever After All | Luke Combs | What You See Ain't Always What You Get (Deluxe... | 6IBcOGPsniK3Pso1wHIhew | 0.86 | 0.487 | 0.65 | -5.195 | 0.0253 | 0.191 | 0.0 | 0.0933 | 0.456 | 151.964 | 0.973038 |
| 2 | 10,000 Hours (with Justin Bieber) | Dan + Shay | 10,000 Hours (with Justin Bieber) | 2wrJq5XKLnmhRXHIAf9xBa | 0.85 | 0.654 | 0.63 | -4.644 | 0.0259 | 0.153 | 0.0 | 0.111 | 0.43 | 89.991 | 0.954964 |
| 3 | Fortunate Son | Creedence Clearwater Revival | Willy And The Poor Boys | 4BP3uh0hFLFRb5cjsgLqDh | 0.81 | 0.64 | 0.663 | -7.516 | 0.0374 | 0.201 | 0.00806 | 0.152 | 0.663 | 132.77 | 0.958775 |
| 4 | I Hope | Gabby Barrett | Goldmine | 23T0OX7QOiIUFShSzbJ5Uo | 0.8 | 0.466 | 0.536 | -6.227 | 0.0429 | 0.0951 | 0.0 | 0.114 | 0.377 | 75.998 | 0.97759 |
| 5 | One Of Them Girls | Lee Brice | One Of Them Girls | 14GwnOeC9qYEKEA6uOZepa | 0.8 | 0.643 | 0.79 | -6.079 | 0.0462 | 0.294 | 0.0 | 0.107 | 0.8 | 95.987 | 0.955529 |
| 6 | More Than a Feeling | Boston | Boston | 1QEEqeFIZktqIpPI4jSVSF | 0.79 | 0.377 | 0.682 | -8.039 | 0.0299 | 0.000894 | 0.00217 | 0.0504 | 0.288 | 108.736 | 0.984003 |
| 7 | When It Rains It Pours | Luke Combs | This One's for You | 1mMLMZYXkMueg65jRRWG1l | 0.78 | 0.551 | 0.801 | -5.069 | 0.0303 | 0.013 | 6e-06 | 0.265 | 0.625 | 128.027 | 0.994759 |
| 8 | Pretty Heart | Parker McCollum | Pretty Heart | 6vC90OOjZR165Hw8CpsqEm | 0.78 | 0.562 | 0.683 | -4.427 | 0.0296 | 0.00531 | 0.0 | 0.107 | 0.385 | 132.003 | 0.98208 |
| 9 | Hurricane | Luke Combs | This One's for You | 6xHI9KjUjYT0FPtGO8Mxa1 | 0.78 | 0.464 | 0.813 | -6.185 | 0.0416 | 0.0153 | 0.0 | 0.254 | 0.515 | 75.977 | 0.997661 |


When returning a playlist, users most likely will not care about the details of the data. We probably only want to return the following columns.

In [22]:
playlist[['track_name', 'artist', 'album']]

| | track_name | artist | album |
|---|---|---|---|
| 0 | Boston | Dalton & the Sheriffs | Luckier by Half |
| 1 | Forever After All | Luke Combs | What You See Ain't Always What You Get (Deluxe... |
| 2 | 10,000 Hours (with Justin Bieber) | Dan + Shay | 10,000 Hours (with Justin Bieber) |
| 3 | Fortunate Son | Creedence Clearwater Revival | Willy And The Poor Boys |
| 4 | I Hope | Gabby Barrett | Goldmine |
| 5 | One Of Them Girls | Lee Brice | One Of Them Girls |
| 6 | More Than a Feeling | Boston | Boston |
| 7 | When It Rains It Pours | Luke Combs | This One's for You |
| 8 | Pretty Heart | Parker McCollum | Pretty Heart |
| 9 | Hurricane | Luke Combs | This One's for You |


In [23]:
def display_playlist(playlist):
    '''displays playlist track name, artist, album'''
    ### copy so that renaming the columns does not touch the original dataframe
    playlist_df = playlist[['track_name', 'artist', 'album']].copy()
    playlist_df.columns = ['Title', 'Artist', 'Album']
    ### start index at 1
    playlist_df.index = np.arange(1,len(playlist_df)+1)
    return playlist_df

In [24]:
display_playlist(playlist)

| | Title | Artist | Album |
|---|---|---|---|
| 1 | Boston | Dalton & the Sheriffs | Luckier by Half |
| 2 | Forever After All | Luke Combs | What You See Ain't Always What You Get (Deluxe... |
| 3 | 10,000 Hours (with Justin Bieber) | Dan + Shay | 10,000 Hours (with Justin Bieber) |
| 4 | Fortunate Son | Creedence Clearwater Revival | Willy And The Poor Boys |
| 5 | I Hope | Gabby Barrett | Goldmine |
| 6 | One Of Them Girls | Lee Brice | One Of Them Girls |
| 7 | More Than a Feeling | Boston | Boston |
| 8 | When It Rains It Pours | Luke Combs | This One's for You |
| 9 | Pretty Heart | Parker McCollum | Pretty Heart |
| 10 | Hurricane | Luke Combs | This One's for You |


## Creating a Playlist <a class="anchor" id="create-playlist"></a>
<hr/>

I'm still working on incorporating user authorization into the app so that users can easily import their generated playlist into Spotify. For now, the workaround is to generate a list of URIs that can be copied and pasted into a playlist in the Spotify desktop application.

Since we only store the IDs in our dataframe, we'll need to prepend the text `spotify:track:` to each ID and then write the resulting URIs to a text file that can be read later.

In [32]:
def make_track_URIs(track_ids):
    '''reformats track ids as track URIs'''
    ### need text spotify:track: in front of each ID to use in Spotify
    return ['spotify:track:' + track_id for track_id in track_ids]

def create_playlist_file(track_ids, path='../playlists/playlist.txt'):
    '''creates text file of Spotify URIs, one per line'''
    track_list = track_ids.values.tolist()
    track_URIs = make_track_URIs(track_list)
    ### write URIs to text file; the default path matches the file read back below
    with open(path, 'w') as playlist_file:
        playlist_file.writelines('%s\n' % uri for uri in track_URIs)

In [33]:
create_playlist_file(playlist['track_id'])

Now we check the result by reading the text file back in.

In [34]:
### open read-only and close the file when done
with open('../playlists/playlist.txt', 'r') as playlist_file:
    text = playlist_file.read() ### prints better in streamlit
text

'spotify:track:4HJ7mSMtHAdU55lLjGE4zW\nspotify:track:6IBcOGPsniK3Pso1wHIhew\nspotify:track:2wrJq5XKLnmhRXHIAf9xBa\nspotify:track:4BP3uh0hFLFRb5cjsgLqDh\nspotify:track:23T0OX7QOiIUFShSzbJ5Uo\nspotify:track:14GwnOeC9qYEKEA6uOZepa\nspotify:track:1QEEqeFIZktqIpPI4jSVSF\nspotify:track:1mMLMZYXkMueg65jRRWG1l\nspotify:track:6vC90OOjZR165Hw8CpsqEm\nspotify:track:6xHI9KjUjYT0FPtGO8Mxa1\nspotify:track:7aEtlGHoiPAfRB084NiDmx\nspotify:track:20OFwXhEXf12DzwXmaV7fj\nspotify:track:1D7cfiC5mxqHfTCcOiRBej\nspotify:track:698eQRku24PIYPQPHItKlA\nspotify:track:2DwbFtfC6sXBiVDPmju8Dd\nspotify:track:70YvYr2hGlS01bKRIho1HM\n'

In [36]:
### read the URIs back as a list, one per line
with open('../playlists/playlist.txt', 'r') as playlist_file:
    uris = playlist_file.read().splitlines()
uris

['spotify:track:4HJ7mSMtHAdU55lLjGE4zW',
 'spotify:track:6IBcOGPsniK3Pso1wHIhew',
 'spotify:track:2wrJq5XKLnmhRXHIAf9xBa',
 'spotify:track:4BP3uh0hFLFRb5cjsgLqDh',
 'spotify:track:23T0OX7QOiIUFShSzbJ5Uo',
 'spotify:track:14GwnOeC9qYEKEA6uOZepa',
 'spotify:track:1QEEqeFIZktqIpPI4jSVSF',
 'spotify:track:1mMLMZYXkMueg65jRRWG1l',
 'spotify:track:6vC90OOjZR165Hw8CpsqEm',
 'spotify:track:6xHI9KjUjYT0FPtGO8Mxa1',
 'spotify:track:7aEtlGHoiPAfRB084NiDmx',
 'spotify:track:20OFwXhEXf12DzwXmaV7fj',
 'spotify:track:1D7cfiC5mxqHfTCcOiRBej',
 'spotify:track:698eQRku24PIYPQPHItKlA',
 'spotify:track:2DwbFtfC6sXBiVDPmju8Dd',
 'spotify:track:70YvYr2hGlS01bKRIho1HM']
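
Once user authorization is in place, the copy-and-paste step could be replaced by an API call. The sketch below is one possible shape for that and is not the app's implementation. The helper `push_playlist` is hypothetical; it assumes an authenticated `spotipy.Spotify` client and calls its `current_user`, `user_playlist_create`, and `playlist_add_items` methods. Creating that client requires registered app credentials and a user login, so it is left as a comment.

```python
def push_playlist(sp, name, track_uris, batch_size=100):
    """Create a playlist for the current user and add the given track URIs.

    `sp` is assumed to be an authenticated spotipy.Spotify client.
    Items are added in batches because the Spotify Web API accepts
    at most 100 items per request.
    """
    user_id = sp.current_user()['id']
    new_playlist = sp.user_playlist_create(user_id, name, public=True)
    for start in range(0, len(track_uris), batch_size):
        batch = track_uris[start:start + batch_size]
        sp.playlist_add_items(new_playlist['id'], batch)
    return new_playlist['id']

# Usage sketch (needs a registered Spotify app and a browser login):
# import spotipy
# from spotipy.oauth2 import SpotifyOAuth
# sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope='playlist-modify-public'))
# with open('../playlists/playlist.txt') as f:
#     uris = f.read().splitlines()
# push_playlist(sp, 'Daltonify test', uris)
```

Passing the client in as an argument keeps the function independent of how authorization is eventually handled in the app.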

## Evaluating Playlists <a class="anchor" id="eval-playlist"></a>
<hr/>

The quality of any playlist is completely subjective. Ultimately, the goal is to generate a playlist which will increase new streams for an artist. This aspect of the project has yet to be tested in practice.

In this case, the playlist contains a lot of tracks by Luke Combs. I would like to see more variety in artists, but there is a limit to how random a sample we can obtain, and I'm not sure how best to adjust for this at this time.
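
One way to get more artist variety, offered as an untested idea and not part of the current recommender, is to cap how many tracks any single artist can contribute before the top tracks are selected. The function name `cap_tracks_per_artist` and its `max_per_artist` parameter are hypothetical; the column names match the dataframes used above.

```python
import pandas as pd

def cap_tracks_per_artist(tracks, max_per_artist=1):
    """Keep at most `max_per_artist` of each artist's most popular tracks."""
    ranked = tracks.sort_values(by='popularity', ascending=False)
    # groupby().head(n) keeps the first n rows of each group, in ranked order
    return ranked.groupby('artist').head(max_per_artist)

# Invented rows, for illustration only
tracks = pd.DataFrame({
    'track_name': ['T1', 'T2', 'T3', 'T4'],
    'artist': ['Artist A', 'Artist A', 'Artist B', 'Artist A'],
    'popularity': [0.86, 0.78, 0.80, 0.77],
})
capped = cap_tracks_per_artist(tracks)
print(capped['track_name'].tolist())  # ['T1', 'T3']
```

A cap like this could be applied to the top half of similarity scores inside `top_recommended_tracks`, before the requested number of tracks is taken.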

I have made this playlist public on my Spotify account. You can listen to it [here](https://open.spotify.com/playlist/67NmDFA5m1UwkbOb2mAzvA?si=ZBBWKtklRjqsV_oLO9YxBA).