# Daltonify: An Audio Feature Based Recommender System

## *Collection of Track Samples*

This notebook covers how track samples were obtained from Spotify for use in building the recommender system.

#### Table of Contents

* [Testing Pulling Track Samples from Spotify](#test-sample-pull)
* [Function to Pull Track Samples](#function-sample-pull)
* [Pulling Track Samples for Testing](#pull-some-samples)
* [Individual Track Samples](#single-track-samples)

### Import Libraries

In [9]:
## STANDARD IMPORTS
import pandas as pd 
import numpy as np
import datetime as dt

## SPOTIFY
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

In [5]:
### Spotify Credentials - must be set in local environment to run
auth_manager = SpotifyClientCredentials()
sp = spotipy.Spotify(auth_manager=auth_manager)
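
Since the credentials have to live in the local environment, a quick check before creating the client can save a confusing authentication error. The sketch below assumes the default variable names that spotipy reads (`SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET`); the helper name `missing_spotify_credentials` is made up for illustration and is not part of spotipy.

```python
import os

# Environment variable names that spotipy's SpotifyClientCredentials reads by default.
REQUIRED_VARS = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")


def missing_spotify_credentials(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_spotify_credentials()
if missing:
    print("Set these before running the notebook:", ", ".join(missing))
```

Passing the environment in as an argument keeps the check easy to try out with a plain dict.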

## Testing Pulling Track Samples from Spotify <a class="anchor" id="test-sample-pull"></a>
<hr/>

Spotify limits how many tracks can be pulled through a search. Currently, this is 2000 tracks. More could be collected over time; however, the goal is to pull samples on the fly in the app deployment. For now I'll use this limit to construct a playlist so that users don't have to wait too long for a result.

I'll first run the process on a smaller sample size, and use it to try a few different genre samples. Genre assignment is very subjective, and not every genre designation will have enough tracks to generate the full sample size.

This will need to be addressed later, but for now I'll keep the genres as generic as possible and adjust if I encounter issues.
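
One possible way to cope with genres that run out of tracks early is to stop paging as soon as a page comes back short. This is only a sketch under stated assumptions: `collect_pages` and `fetch_page` are hypothetical names, and `fetch_page(limit, offset)` stands in for whatever call returns one page of results (for example, a thin wrapper around the search call used in this notebook).

```python
def collect_pages(fetch_page, limit=50, max_tracks=2000):
    """Collect items page by page, stopping early once a page comes back short.

    fetch_page(limit, offset) is a hypothetical callable that returns a list of items.
    """
    items = []
    for offset in range(0, max_tracks, limit):
        page = fetch_page(limit, offset)
        items.extend(page)
        if len(page) < limit:  # the source ran out of results
            break
    return items


# Stand-in data source with only 120 items, to show the early stop.
catalog = [f"track-{i}" for i in range(120)]
fake_fetch = lambda limit, offset: catalog[offset:offset + limit]

sample = collect_pages(fake_fetch, limit=50, max_tracks=2000)
print(len(sample))
```

With the stand-in source, paging stops after the third page instead of issuing all 40 requests.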

In [3]:
genre = 'hip hop'

In [6]:
### TESTING VALUES
limit = 5  ## tracks per request; 50 for the full pull
max_requests = 20  ## total tracks to pull; 2000 for the full pull

track_df = []
for n in range(0, max_requests, limit):
    search_results = sp.search(q=f'genre: "{genre}"', type='track', limit=limit, offset = n, market='US')['tracks']['items']

    track_list = []
    
    for i in range(len(search_results)):
        track_info = [
            search_results[i].get('name'), 
            search_results[i].get('artists')[0]['name'], 
            search_results[i].get('album')['name'],
            search_results[i].get('id'),
            search_results[i].get('popularity'),
            ]
        track_list.append(track_info)

    ## create dataframe of track info
    track_list_df = pd.DataFrame(track_list, columns=['track_name', 'artist', 'album', 'track_id', 'popularity'])
    ## get audio features for tracks
    track_audio_features = pd.DataFrame.from_dict(sp.audio_features(tracks=track_list_df['track_id'].values.tolist()))
    drop_cols = ['type', 'id', 'uri', 'track_href', 'analysis_url']
    track_audio_features.drop(columns = drop_cols, inplace=True)
    ## concat both dataframes
    track_list_df = pd.concat([track_list_df, track_audio_features], axis=1)
    track_df.append(track_list_df)

tracks = pd.concat(track_df, ignore_index=True)
tracks['popularity'] = np.round(tracks['popularity']/100, 2)
genre = genre.replace(' ', '')

print('We got your song sample there are ', tracks.shape[0], ' entries.')

We got your song sample there are  20  entries.


In [7]:
tracks

Unnamed: 0,track_name,artist,album,track_id,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,Laugh Now Cry Later (feat. Lil Durk),Drake,Laugh Now Cry Later (feat. Lil Durk),2SAqBLGA283SUiwJ3xOUVI,0.94,0.761,0.518,0,-8.871,1,0.134,0.244,3.5e-05,0.107,0.522,133.976,261493,4
1,ROCKSTAR (feat. Roddy Ricch),DaBaby,BLAME IT ON BABY,7ytR5pFWmSjzHJIeQkgog4,0.94,0.746,0.69,11,-7.956,1,0.164,0.247,0.0,0.101,0.497,89.977,181733,4
2,POPSTAR (feat. Drake),DJ Khaled,POPSTAR (feat. Drake),6EDO9iiTtwNv6waLwa1UUq,0.91,0.8,0.56,5,-4.818,0,0.261,0.057,0.0,0.134,0.45,163.071,200221,4
3,"WHATS POPPIN (feat. DaBaby, Tory Lanez & Lil W...",Jack Harlow,"WHATS POPPIN (feat. DaBaby, Tory Lanez & Lil W...",2MbdDtCv5LUVjYy9RuGTgC,0.89,0.904,0.723,11,-5.224,0,0.26,0.0631,0.0,0.185,0.835,145.013,227478,4
4,Tyler Herro,Jack Harlow,Tyler Herro,4DuUwzP4ALMqpquHU0ltAB,0.87,0.794,0.756,5,-7.16,0,0.136,0.11,0.0,0.247,0.775,123.066,156498,4
5,Mr. Right Now (feat. Drake),21 Savage,SAVAGE MODE II,4Q34FP1AT7GEl9oLgNtiWj,0.87,0.647,0.667,5,-5.563,1,0.304,0.231,0.0,0.133,0.704,172.08,193839,4
6,Runnin,21 Savage,SAVAGE MODE II,5SWnsxjhdcEDc7LJjq9UHk,0.87,0.819,0.626,10,-4.574,0,0.202,0.00748,0.101,0.167,0.415,143.01,195906,4
7,my ex's best friend (with blackbear),Machine Gun Kelly,Tickets To My Downfall,7kDUspsoYfLkWnZR7qwHZl,0.88,0.731,0.675,5,-5.134,0,0.0434,0.00473,0.0,0.141,0.298,124.939,139461,4
8,Wolves (feat. Post Malone),Big Sean,Detroit 2,33gwZOGJWEZ7dRWPqPxBEZ,0.85,0.724,0.675,2,-5.267,1,0.0867,0.0978,9e-06,0.351,0.325,160.048,199758,4
9,SO DONE,The Kid LAROI,SO DONE,5psEZhQu6lukjhavJo4AbC,0.86,0.719,0.598,9,-6.254,1,0.077,0.232,0.0,0.115,0.303,142.592,126521,4


Spotify does not let us sort the search results, so there is only so much we can do to randomize what comes back. For now, varying the limit and offset is our only means of doing so.
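
As a rough illustration of using the offset for variety, the sketch below picks a handful of page offsets at random instead of walking them in order. The function name `random_offsets` and its parameters are made up for this example. It only varies which pages are requested; the order within each page is still whatever Spotify returns.

```python
import random


def random_offsets(limit=50, max_tracks=2000, pages=4, seed=None):
    """Pick `pages` distinct page offsets at random from the reachable range."""
    all_offsets = list(range(0, max_tracks, limit))
    rng = random.Random(seed)  # a seed makes the selection reproducible
    return sorted(rng.sample(all_offsets, k=min(pages, len(all_offsets))))


offsets = random_offsets(limit=50, max_tracks=2000, pages=4, seed=0)
print(offsets)
```

Each returned value could then be passed as the `offset` argument of a search request.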

## Function to Pull Track Samples <a class="anchor" id="function-sample-pull"></a>
<hr/>

Now that the above loop is working properly, we'll turn it into a function for easy use. Because Spotify limits the number of calls we can make, we'll generate some sample sets of data to build our recommender system.

In [16]:
def get_tracks(genre):
    '''Pulls a max of 2000 tracks for the given genre, exports them to ../data/<genre>.csv, and prints the track count. Returns None.'''
    limit = 50
    max_requests = 2000

    track_df = []
    for n in range(0, max_requests, limit):
        search_results = sp.search(q=f'genre: "{genre}"', type='track', limit=limit, offset = n, market='US')['tracks']['items']

        track_list = []
        
        for i in range(len(search_results)):
            track_info = [
                search_results[i].get('name'), 
                search_results[i].get('artists')[0]['name'], 
                search_results[i].get('album')['name'],
                search_results[i].get('id'),
                search_results[i].get('popularity'),
                ]
            track_list.append(track_info)

        ## create dataframe of track info
        track_list_df = pd.DataFrame(track_list, columns=['track_name', 'artist', 'album', 'track_id', 'popularity'])
        ## get audio features for tracks
        track_audio_features = pd.DataFrame.from_dict(sp.audio_features(tracks=track_list_df['track_id'].values.tolist()))
        drop_cols = ['type', 'id', 'uri', 'track_href', 'analysis_url']
        track_audio_features.drop(columns = drop_cols, inplace=True)
        ## concat both dataframes
        track_list_df = pd.concat([track_list_df, track_audio_features], axis=1)
        track_df.append(track_list_df)

    tracks = pd.concat(track_df, ignore_index=True)
    tracks['popularity'] = np.round(tracks['popularity']/100, 2)
    genre = genre.replace(' ', '')
    
    tracks.to_csv(f'../data/{genre}.csv', index=False)

    print('We got', tracks.shape[0], 'tracks from the', genre, 'genre.')
    
    pass 
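
The function joins track info and audio features by position, which works as long as every track gets a feature row back in the same order. If a feature entry were ever missing for a track (an assumption worth checking against the Spotify documentation), joining on the track ID would be more forgiving. Here is a minimal sketch using plain dicts; `merge_by_id` is a hypothetical helper and the sample values are made up.

```python
def merge_by_id(track_rows, feature_rows):
    """Join track metadata with audio features on the track ID.

    track_rows: list of dicts with a 'track_id' key.
    feature_rows: list of dicts with an 'id' key; entries may be None.
    Tracks without a matching feature entry are dropped.
    """
    features_by_id = {f["id"]: f for f in feature_rows if f is not None}
    merged = []
    for row in track_rows:
        features = features_by_id.get(row["track_id"])
        if features is None:
            continue  # no audio features for this track
        extra = {k: v for k, v in features.items() if k != "id"}
        merged.append({**row, **extra})
    return merged


track_rows = [
    {"track_id": "a1", "track_name": "First"},
    {"track_id": "b2", "track_name": "Second"},
]
feature_rows = [{"id": "a1", "tempo": 120.0}, None]  # made-up values

print(merge_by_id(track_rows, feature_rows))
```

Only the first track survives the join here, because the second has no feature entry.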

## Pulling Track Samples for Testing <a class="anchor" id="pull-some-samples"></a>
<hr/>


We'll pull a few samples of data that we'll use later. We'll use Hip Hop, Country, and Rock as our test genres.

#### Hip Hop

In [23]:
get_tracks('hip hop')

We got 2000 tracks from the hiphop genre.


In [13]:
### check
hiphop = pd.read_csv('../data/hiphop.csv')
hiphop.head(2)

Unnamed: 0,track_name,artist,album,track_id,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,Laugh Now Cry Later (feat. Lil Durk),Drake,Laugh Now Cry Later (feat. Lil Durk),2SAqBLGA283SUiwJ3xOUVI,0.94,0.761,0.518,0,-8.871,1,0.134,0.244,3.5e-05,0.107,0.522,133.976,261493,4
1,ROCKSTAR (feat. Roddy Ricch),DaBaby,BLAME IT ON BABY,7ytR5pFWmSjzHJIeQkgog4,0.94,0.746,0.69,11,-7.956,1,0.164,0.247,0.0,0.101,0.497,89.977,181733,4


#### Country

In [17]:
get_tracks('country')

We got 2000 tracks from the country genre.


In [18]:
### check
country = pd.read_csv('../data/country.csv')
country.head(2)

Unnamed: 0,track_name,artist,album,track_id,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,Forever After All,Luke Combs,What You See Ain't Always What You Get (Deluxe...,6IBcOGPsniK3Pso1wHIhew,0.86,0.487,0.65,0,-5.195,1,0.0253,0.191,0.0,0.0933,0.456,151.964,232533,4
1,Be Like That - feat. Swae Lee & Khalid,Kane Brown,Be Like That (feat. Swae Lee & Khalid),5f1joOtoMeyppIcJGZQvqJ,0.87,0.727,0.626,7,-8.415,1,0.0726,0.0469,2.6e-05,0.126,0.322,86.97,191406,4


#### Rock

In [20]:
get_tracks('rock')

We got 2000 tracks from the rock genre.


In [21]:
### check
rock = pd.read_csv('../data/rock.csv')
rock.head(2)

Unnamed: 0,track_name,artist,album,track_id,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,Dreams - 2004 Remaster,Fleetwood Mac,Rumours (Super Deluxe),0ofHAoxe9vBkTCp2UQIavz,0.9,0.828,0.492,0,-9.744,1,0.0276,0.0644,0.00428,0.128,0.789,120.151,257800,4
1,Sweater Weather,The Neighbourhood,I Love You.,2QjOHCTQ1Jl3zawyYOpxh6,0.87,0.612,0.807,10,-2.81,1,0.0336,0.0495,0.0177,0.101,0.398,124.053,240400,4


A version of this function is also used within the app deployment to pull the desired genre samples as needed. The version used by the app does not save the dataframe as a csv.

## Individual Track Samples <a class="anchor" id="single-track-samples"></a>
<hr/>

We'll also need some samples of individual tracks. These can be imported and combined for use in the recommender system. We wrote several functions to do this for us. They are split up this way because the app needs different pieces separately for display purposes.

In [10]:
def get_track_info(uri):
    track = sp.track(uri)
    track_info = {
        'track_name' : track['name'],
        'artist' : track['artists'][0]['name'],
        'album' : track['album']['name'],
        'artwork_url': track['album']['images'][1]['url'], ## second image in the list; typically the 300x300 size
        'release_date': dt.datetime.strptime(track['album']['release_date'],'%Y-%m-%d').strftime('%B %d, %Y'),
        'track_id' : track['id'],
        'popularity' : np.round(track['popularity']/100, 2),
    }
    return track_info

def get_track_audio_features(track_info):
    ''' uses dict of track_info to build on for audio features dataframe'''
    ### Call only pieces needed for playlist creation
    track_info_list = [
        track_info['track_name'],
        track_info['artist'],
        track_info['album'],
        track_info['track_id'],
        track_info['popularity']
    ]
    ### GET TRACK AUDIO FEATURES FROM SPOTIFY
    track_audio_features = sp.audio_features(tracks=track_info['track_id'])
    audio_features_df = pd.DataFrame.from_dict(track_audio_features)
    drop_cols = ['type', 'id', 'uri', 'track_href', 'analysis_url']
    audio_features_df.drop(columns = drop_cols, inplace=True)

    ### COMBINE TRACK INFO AND AUDIO FEATURES
    track_info_df = pd.DataFrame([track_info_list], columns=['track_name','artist', 'album', 'track_id', 'popularity'])
    track_audio_features = pd.concat([track_info_df, audio_features_df], axis=1)
    return track_audio_features

def get_track_data(uri):
    '''Combines track info and track audio features data'''
    track_info = get_track_info(uri)
    track_df = get_track_audio_features(track_info)
    return track_df
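
One thing to watch in `get_track_info`: the `'%Y-%m-%d'` format assumes a full date. Spotify album objects also carry a `release_date_precision` field (`"year"`, `"month"`, or `"day"`), and older releases may only give a year or a year and month, which would make `strptime` raise. A hedged sketch of a more tolerant formatter follows; `format_release_date` is a hypothetical helper, not something used elsewhere in this project.

```python
import datetime as dt

# Input and display formats keyed by Spotify's `release_date_precision` values.
DATE_FORMATS = {
    "day": ("%Y-%m-%d", "%B %d, %Y"),
    "month": ("%Y-%m", "%B %Y"),
    "year": ("%Y", "%Y"),
}


def format_release_date(release_date, precision="day"):
    """Format a release date string according to how precise it is."""
    parse_fmt, display_fmt = DATE_FORMATS[precision]
    return dt.datetime.strptime(release_date, parse_fmt).strftime(display_fmt)


print(format_release_date("2020-08-07", "day"))
print(format_release_date("1978-06", "month"))
print(format_release_date("1978", "year"))
```

Inside `get_track_info`, this could be called with `track['album']['release_date']` and `track['album']['release_date_precision']`.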

### Construct Individual Track Samples

#### Boston - Dalton & the Sheriffs

In [28]:
sid = '4HJ7mSMtHAdU55lLjGE4zW' ### Boston - Dalton & the Sheriffs

In [29]:
### construct dataframe for track
track = get_track_data(sid)
### check
track.head()

Unnamed: 0,track_name,artist,album,track_id,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,Boston,Dalton & the Sheriffs,Luckier by Half,4HJ7mSMtHAdU55lLjGE4zW,0.15,0.541,0.921,11,-5.25,1,0.0443,0.00052,0.0784,0.159,0.613,99.98,223440,4


In [30]:
### export to csv
track.to_csv('../data/boston.csv', index=False)

#### WAP - Cardi B

In [37]:
sid = '4Oun2ylbjFKMPTiaSbbCih'

In [38]:
### construct dataframe for track
track = get_track_data(sid)
### check
track.head()

Unnamed: 0,track_name,artist,album,track_id,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,WAP (feat. Megan Thee Stallion),Cardi B,WAP (feat. Megan Thee Stallion),4Oun2ylbjFKMPTiaSbbCih,0.97,0.935,0.454,1,-7.509,1,0.375,0.0194,0,0.0824,0.357,133.073,187541,4


In [39]:
### export to csv
track.to_csv('../data/WAP.csv', index=False)

#### Beast of Burden - The Rolling Stones

In [34]:
# sid = '0832Tptls5YicHPGgw7ssP'
sid = 'spotify:track:6ttUD7vz01HOy3hm9Kq3t5' ### not remastered version
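
The cells in this section pass both a bare track ID and a full `spotify:track:` URI, and both work with the functions above. If the identifier is ever needed in a consistent form (for a file name or a de-duplication key, say), a small normalizer along these lines could help. The name `track_id_from` is made up for illustration.

```python
def track_id_from(identifier):
    """Return the bare track ID from an ID or a `spotify:track:<id>` URI."""
    prefix = "spotify:track:"
    if identifier.startswith(prefix):
        return identifier[len(prefix):]
    return identifier


print(track_id_from("spotify:track:6ttUD7vz01HOy3hm9Kq3t5"))
print(track_id_from("4HJ7mSMtHAdU55lLjGE4zW"))
```

Both calls print a bare ID, so downstream code only has to handle one format.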

In [35]:
### construct dataframe for track
track = get_track_data(sid)
### check
track.head()

Unnamed: 0,track_name,artist,album,track_id,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,Beast Of Burden,The Rolling Stones,Honk (Deluxe),6ttUD7vz01HOy3hm9Kq3t5,0.39,0.781,0.844,1,-4.754,0,0.0305,0.454,0.00368,0.0518,0.902,100.709,265499,4


In [36]:
### export to csv
track.to_csv('../data/beastofburden.csv', index=False)