# Exploratory Analysis

## Eploring SpotiPy
- Explore available data
- Select features of interest
- Generate initial dataframe and database

###### Note:
###### - Potentially building a recomender system that can take a set of the most frequently played songs from one user, and match them with a second users profile; potentially between courting couples and friends
###### - Consider doing podcasts as a feature for the people that might be interested in that
###### - Also, consider calling the playlists `"{user_1} and {user_2}'s Playlist Baby"`

In [2]:
# imports

import sys
import json
import spotipy
import webbrowser
import numpy as np
import pandas as pd
# sklearn
from os import getenv
import spotipy.util as util
from dotenv import load_dotenv
from json.decoder import JSONDecodeError
from spotipy.oauth2 import SpotifyClientCredentials, SpotifyOAuth

notes: pivoting given that we cannot create two playlists for two users simultaneously
- we can create a single play list for one user given the other user's library
- so what we'll do it take the two user libraries and generate a playlist for a single user
- given the music from a second play list

In [3]:
# We are using the client Module from the Python library for the Spotify API
# (https://spotipy.readthedocs.io/en/2.13.0/#module-spotipy.client)
# Client Credentials Flow


load_dotenv()  # this imports all .env variables

# Setting up env variables to connect to API
uri = getenv('uri') # must match in the Spotify app dashboard
SPOTIFY_CLIENT_ID = getenv('SPOTIFY_CLIENT_ID')
SPOTIFY_CLIENT_SECRET = getenv('SPOTIFY_CLIENT_SECRET')
username = getenv('USER_ID')  #  user who's data we are collecting
# scope = 'playlist-modify-public'  #  determines the kind of access you have to a user profile
# scope = 'user-top-read'
scope = 'user-library-read'

# Access token to obtain user info
token = util.prompt_for_user_token(username='spotify',
                                   client_id=SPOTIFY_CLIENT_ID,
                                   client_secret=SPOTIFY_CLIENT_SECRET,
                                   scope=scope,
                                  redirect_uri=uri)

# activating spotify session
spotify_session = spotipy.Spotify(auth=token)

## Goals 
- Connect to user library using [scopes](https://developer.spotify.com/documentation/general/guides/scopes/)
- Scopes to connect to are [user-library-read](https://developer.spotify.com/documentation/general/guides/scopes/#user-library-read), [playlist-modify-public](https://developer.spotify.com/documentation/general/guides/scopes/#playlist-modify-public), and [user-top-read](https://developer.spotify.com/documentation/general/guides/scopes/#user-top-read)

##### **The goal here is to cnnect to the users' respective libraries, analyze them, and create a new playlist.**
- For this analysis I will exlpore the Audio Features Objects

# Exploring Top Tracks for a user

In [4]:
# Playing with the api: accessing user top read, modify playlist, read library

# User top artists (test)
top_artists = spotify_session.current_user_top_artists(limit=1)

# User top tracks (test)
top_tracks = spotify_session.current_user_top_tracks(limit=50, time_range='medium_term')

# Exploring Track_ids (test)
print(top_tracks['items'][0]['id'] == "0akyEssGRVHstqCSWXusJL")
top_track_id = top_tracks['items'][0]['id']

# Top ten tracks ids
top_50_tracks_id = [top_tracks['items'][x]['id'] for x in range(len(top_tracks['items']))]
print(top_50_tracks_id)

# Top track Audio Features Object
audio_feat = spotify_session.audio_features(tracks=top_50_tracks_id)

True
['0akyEssGRVHstqCSWXusJL', '02gaYAEdeR6poHcBH1KUQF', '6plO0gM4tUvRC9TKFGIuaN', '0NeJjNlprGfZpeX2LQuN6c', '54KsfVVnN4YWI2mMrnyUcC', '1s3WD4gbNoEXHiuSTmAKaK', '57mLRN6tfXwTRvp9oPWpop', '6KseaEAFSS63N2NPZtDnRL', '5iSpfk6cDOSYePagAoG639', '1jecO8NeYLsVWVptITz4c1', '7kWFRZdedr2gtfE8JDumVZ', '52N0IV8hLVkRmnpFclmCzK', '4zFPUEMucYleIIUnYVoeZf', '1IF5UcqRO42D12vYwceOY6', '2UxrK7r4cyQOSh7wvdQTe1', '17OkYffr0SdAcpcbwMkDDG', '71Mj2THXRicZhTFGzln3al', '0JfsIu62NVXNQl2s7ATN37', '0107Auhv91hE49iLoxtayt', '4cJOLN346rtOty3UPACsao', '4RNYL9drYkmWYpDyfknta9', '1sJev5Y7VI2Ke8AwUpnh0l', '1yTJg3lyUPmwbnve82twH5', '6Zy0ITa16EjCAbbGuPzdRi', '6fcS6fncRVP8rldHjriZHS', '7v3YlquaNhK2GYKzxovSEp', '3PUbNbybe6dTMWdUt9vQ02', '4OwOKRIKlO7wsDitlUN4QH', '51wUFdgpNsV8cVzu7i6N0l', '4RrOSjdnV8rkpIuOIfkKYS', '0xSqHQ5wv80hNkpU50vPc7', '2TzkIzgzIHhewMxyh1u4hh', '5ONAA8z6SvFBniu8zXz1Ax', '58iNllszkXpDOcYRgcfLfH', '6ThsXWiur66KlCzXVj9tXj', '3QIQtCPni57ZcSPzz7JDxt', '7ukRl9q1yVYO2j5SXwvjaB', '1VA0QtG2DXGF4k6fYz70PE', '2G3ud

In [5]:
# Obtainiing track-name and artist name

top_tracks.keys()
top_tracks['items'][0].keys()
top_tracks['items'][0]['name']                # Generates track name
top_tracks['items'][0]['artists'][0]['name']  # Generates artist name

'Ed Maverick'

In [6]:
# Audio Features Objects

# See the reference README file for a description of the Audio Features Objects, or explore the following link
# (https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/)

# Note: 'key' is -1 if no key is detected.  Consider when training model, or processing data

# The following explores the keys of the Audio Features Object for a single track:
print("The following are the key value pairs contained in the audio features for 10 tracks:", '\n')

k_lst = list(audio_feat[0].keys())      # will eventually become the column names
lst_v_lst= []                           # a list of lists of values
for i in range(len(audio_feat)):        # for loop to run through the Audio Feat. Objects
    v_lst = []
    for _, v in audio_feat[i].items():  # for loop through the 0th item to append values
        v_lst.append(v)                 # values appended to list
    lst_v_lst.append(v_lst)             # list of values appending to list
# print(k_lst)
lst_v_lst[0]

The following are the key value pairs contained in the audio features for 10 tracks: 



[0.83,
 0.159,
 1,
 -14.461,
 1,
 0.0383,
 0.946,
 2.02e-05,
 0.362,
 0.189,
 104.95,
 'audio_features',
 '0akyEssGRVHstqCSWXusJL',
 'spotify:track:0akyEssGRVHstqCSWXusJL',
 'https://api.spotify.com/v1/tracks/0akyEssGRVHstqCSWXusJL',
 'https://api.spotify.com/v1/audio-analysis/0akyEssGRVHstqCSWXusJL',
 207400,
 4]

# Creating a DF from user's top tracks:
- Tracks Audio Features (top_tracks_df)


In [7]:
# Generating the dataframe for the tracks

# Take the two lists (K_LST, V_LST) and turn those into the dataframeh; the K_LST will be the column names
# V_LST will be the values; and they will be indexed by the track id's

top_tracks_df = pd.DataFrame(lst_v_lst, columns=k_lst)
top_tracks_df.head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,0.83,0.159,1,-14.461,1,0.0383,0.946,2e-05,0.362,0.189,104.95,audio_features,0akyEssGRVHstqCSWXusJL,spotify:track:0akyEssGRVHstqCSWXusJL,https://api.spotify.com/v1/tracks/0akyEssGRVHs...,https://api.spotify.com/v1/audio-analysis/0aky...,207400,4
1,0.726,0.125,5,-9.194,0,0.0803,0.835,0.0,0.131,0.277,92.23,audio_features,02gaYAEdeR6poHcBH1KUQF,spotify:track:02gaYAEdeR6poHcBH1KUQF,https://api.spotify.com/v1/tracks/02gaYAEdeR6p...,https://api.spotify.com/v1/audio-analysis/02ga...,183711,4
2,0.78,0.23,4,-12.706,1,0.0448,0.913,0.00279,0.0798,0.125,123.937,audio_features,6plO0gM4tUvRC9TKFGIuaN,spotify:track:6plO0gM4tUvRC9TKFGIuaN,https://api.spotify.com/v1/tracks/6plO0gM4tUvR...,https://api.spotify.com/v1/audio-analysis/6plO...,240307,4
3,0.658,0.179,8,-10.866,1,0.0448,0.689,0.0,0.17,0.191,128.128,audio_features,0NeJjNlprGfZpeX2LQuN6c,spotify:track:0NeJjNlprGfZpeX2LQuN6c,https://api.spotify.com/v1/tracks/0NeJjNlprGfZ...,https://api.spotify.com/v1/audio-analysis/0NeJ...,238560,4
4,0.77,0.325,7,-11.301,1,0.0322,0.899,0.000556,0.22,0.721,103.085,audio_features,54KsfVVnN4YWI2mMrnyUcC,spotify:track:54KsfVVnN4YWI2mMrnyUcC,https://api.spotify.com/v1/tracks/54KsfVVnN4YW...,https://api.spotify.com/v1/audio-analysis/54Ks...,209652,4


In [8]:
# Combining 'track/artist name' and 'top_tracks_df' DataFrames and droping useless columns

# Dopping columns from 'top_tracks_df'
drop_col = ['type', 'track_href', 'analysis_url', 'uri']
top_tracks_df = top_tracks_df.drop(drop_col, axis=1)
top_tracks_df.head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,id,duration_ms,time_signature
0,0.83,0.159,1,-14.461,1,0.0383,0.946,2e-05,0.362,0.189,104.95,0akyEssGRVHstqCSWXusJL,207400,4
1,0.726,0.125,5,-9.194,0,0.0803,0.835,0.0,0.131,0.277,92.23,02gaYAEdeR6poHcBH1KUQF,183711,4
2,0.78,0.23,4,-12.706,1,0.0448,0.913,0.00279,0.0798,0.125,123.937,6plO0gM4tUvRC9TKFGIuaN,240307,4
3,0.658,0.179,8,-10.866,1,0.0448,0.689,0.0,0.17,0.191,128.128,0NeJjNlprGfZpeX2LQuN6c,238560,4
4,0.77,0.325,7,-11.301,1,0.0322,0.899,0.000556,0.22,0.721,103.085,54KsfVVnN4YWI2mMrnyUcC,209652,4


In [9]:
top_tracks_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   danceability      50 non-null     float64
 1   energy            50 non-null     float64
 2   key               50 non-null     int64  
 3   loudness          50 non-null     float64
 4   mode              50 non-null     int64  
 5   speechiness       50 non-null     float64
 6   acousticness      50 non-null     float64
 7   instrumentalness  50 non-null     float64
 8   liveness          50 non-null     float64
 9   valence           50 non-null     float64
 10  tempo             50 non-null     float64
 11  id                50 non-null     object 
 12  duration_ms       50 non-null     int64  
 13  time_signature    50 non-null     int64  
dtypes: float64(9), int64(4), object(1)
memory usage: 5.6+ KB


In [10]:
# Function: Generates DataFrame with user library

# Creating a large dataset from the user `Spotify`

In [11]:
# Getting all playlists for the user Spotify

playlists = spotify_session.user_playlists('spotify')
playlist_ids = []
while playlists:
    for i, playlist in enumerate(playlists['items']):
        playlist_ids.append(playlist['id'])
    if playlists['next']:
        playlists = spotify_session.next(playlists)
    else:
        playlists = None

In [12]:
# `.playlist_tracks()` method allows one to obtain all tracks in a playlist

response = spotify_session.playlist_tracks(playlist_ids[0],
                                           offset=1,
                                           fields='items.track.id')

In [13]:
# Obtaining track IDs for tracks in a playlist

# Note: this cell takes time to complete running; so run cautiously

trx = []
for i in playlist_ids:
    offset = 0
    while True:
        response = spotify_session.playlist_tracks(i,
                                                   offset=offset,
                                                   fields='items.track.id')
#         trx.append(response['items'])
        offset = offset + len(response['items'])
        if len(response['items']) == 0:
            break
        trx.append(response['items'])

retrying ...1secs


In [14]:
# Dropping empty lists

for k, v in enumerate(trx):
    if len(trx[k]) == 0:
#         print(trx[k])
        trx.pop(k)

In [15]:
# Creating a list of track-id strings

track_ids = []

for lst in trx:
#     print(lst)
    for tracks in lst:
        if tracks['track'] == None:
            continue
        track_ids.append(tracks['track']['id'])
#         print(tracks['track']['id'])

In [16]:
len(track_ids)

104165

In [17]:
# Removing None type track ids

for k, track in enumerate(track_ids):
    if track == None:
        track_ids.pop(k)
len(track_ids)

104147

In [18]:
# OAuth Creds
spot_cc = spotipy.oauth2.SpotifyOAuth(username='agustinvargas',
                                      client_id=SPOTIFY_CLIENT_ID,
                                      client_secret=SPOTIFY_CLIENT_SECRET,
                                      scope=scope,
                                      redirect_uri=uri)

# Token Access Dict
accs_token = spot_cc.get_access_token(as_dict=True)

# Refreshing token
refresh_accs_token = spot_cc.refresh_access_token(accs_token['refresh_token'])

# SpotiSesh
spotify_session = spotipy.Spotify(auth=accs_token['access_token'])

# pseudo code
# 
# end_offset = 0
# start_offset = 0
# while end_offset <= len(track_ids):
#     start_offset = start_offset + end_offset
#     end_offset = start_offset + 50
#     get the audio_feats for the first 50

len(track_ids)
j = track_ids[0]
audio_feat_2 = spotify_session.audio_features(tracks=j)
for _, v in audio_feat_2[0].items():
    print(v)
print(audio_feat_2[0])





            User authentication requires interaction with your
            web browser. Once you enter your credentials and
            give authorization, you will be redirected to
            a url.  Paste that url you were directed to to
            complete the authorization.

        


  if __name__ == '__main__':


Opened https://accounts.spotify.com/authorize?client_id=933b2272683d4796a86dd9a524514edb&response_type=code&redirect_uri=http%3A%2F%2Flocalhost%3A8888%2Flab&scope=user-library-read in your browser




Enter the URL you were redirected to:  http://localhost:8888/lab/workspaces/auto-m?code=AQBDFm9AmTG8iML_IHf1dcx0TbrbDGgWMYY81f5VV1FzNPV38oBfYcaGarpDI3TVyvcDWNkJnnqn79tkiLEI7twqQLDWNV0I6USG9uhGIkC7jXuJN3JM-LCo0EN9sQDfhChqy2S8G6cb0X5J-CMm38dHipwqSOT-WFZz4FZDRfWw9P1E-OrRHWly6fY7cb9n




0.935
0.454
1
-7.509
1
0.375
0.0194
0
0.0824
0.357
133.073
audio_features
4Oun2ylbjFKMPTiaSbbCih
spotify:track:4Oun2ylbjFKMPTiaSbbCih
https://api.spotify.com/v1/tracks/4Oun2ylbjFKMPTiaSbbCih
https://api.spotify.com/v1/audio-analysis/4Oun2ylbjFKMPTiaSbbCih
187541
4
{'danceability': 0.935, 'energy': 0.454, 'key': 1, 'loudness': -7.509, 'mode': 1, 'speechiness': 0.375, 'acousticness': 0.0194, 'instrumentalness': 0, 'liveness': 0.0824, 'valence': 0.357, 'tempo': 133.073, 'type': 'audio_features', 'id': '4Oun2ylbjFKMPTiaSbbCih', 'uri': 'spotify:track:4Oun2ylbjFKMPTiaSbbCih', 'track_href': 'https://api.spotify.com/v1/tracks/4Oun2ylbjFKMPTiaSbbCih', 'analysis_url': 'https://api.spotify.com/v1/audio-analysis/4Oun2ylbjFKMPTiaSbbCih', 'duration_ms': 187541, 'time_signature': 4}


In [19]:
# Configureing authentication to obtain token refresh details

# OAuth Creds
spot_cc = spotipy.oauth2.SpotifyOAuth(username='agustinvargas',
                                      client_id=SPOTIFY_CLIENT_ID,
                                      client_secret=SPOTIFY_CLIENT_SECRET,
                                      redirect_uri=uri)

# Token Access Dict
accs_token = spot_cc.get_access_token(as_dict=True)

# Token Expiration boolian
token_exp = spot_cc.is_token_expired(accs_token)

# Refreshing token
refresh_accs_token = spot_cc.refresh_access_token(accs_token['refresh_token'])





  # Remove the CWD from sys.path while we load stuff.


In [20]:
print('''Do not run this cell unless you are looking to update data of +100k song audiofeatures''')


# # Obtaining Audio Features for +100k songs

# # Client Authentication
# spotify_session = spotipy.client.Spotify(auth=accs_token['access_token'])

# # Access Token
# accs_token = spot_cc.get_access_token(as_dict=True)

# # Token Expiration Boolian
# token_exp = spot_cc.is_token_expired(accs_token)

# # The following loops over the TrackIDs (track_ids) list to obtain
# # audio features for each track

# lst_v_lst= []  # list to be populated with lists of audio features of a track
# for j in track_ids:
#     token_exp = spot_cc.is_token_expired(accs_token)           # Checks for token expiration
#     if token_exp == False:                                     # if token not expired, continue loop
#         audio_feat = spotify_session.audio_features(tracks=j)  # obtain aud_feats for song from SpotifyAPI
#         if audio_feat[0] is None:                              # if aud_feat don't exit for song skip
#             continue
#         else:
#             v_lst = []
#             for _, v in audio_feat[0].items():  # for loop through the 0th item to append values
#                 v_lst.append(v)                 # aud_feats for song appended to a list
#             lst_v_lst.append(v_lst)             # list of aud_feats appends to list of lists
#     else:
#         accs_token = spot_cc.refresh_access_token(accs_token['refresh_token'])    # If token expired, refresh token
#         token_exp = spot_cc.is_token_expired(accs_token)                          # Not really necessary
#         spotify_session = spotipy.client.Spotify(auth=accs_token['access_token']) # Set session to refreshed token

Do not run this cell unless you are looking to update data of +100k song audiofeatures


In [21]:
print('''This cell corresponds to previous cell as well, do not run uncommented''')

# # Collecting song IDs from generated aud_feat lists
# # so that we can gather accurate song title/artists
# # data frame to match

# # list to be populated with song IDs
# track_ids_aud_feat = []

# for k, v in enumerate(lst_v_lst):
#     track_ids_aud_feat.append(lst_v_lst[k][12])  # 12th index position contains ID strings
# len(track_ids_aud_feat). # 104k IDs gathered

This cell corresponds to previous cell as well, do not run uncommented


In [22]:


print('''This cell corresponds to previus cells, do not run uncommented''')


# # Generating lists containing Artist names and Track names

# # Refreshing token
# accs_token = spot_cc.refresh_access_token(accs_token['refresh_token'])
# spotify_session = spotipy.client.Spotify(auth=accs_token['access_token'])

# # Getting artist name and track name (test)
# track = spotify_session.track(track_ids_aud_feat[0])
# print(track['artists'][0]['name'])    # Generates artist name
# print(track['name'])                  # Generates track name      

# # Lists to be populated
# track_names_lst = []
# artist_names_lst = []

# # Iterating over list of track IDs (track_ids_aud_feat)
# for j in track_ids_aud_feat:
#     token_exp = spot_cc.is_token_expired(accs_token)
#     if token_exp == False:
#         track = spotify_session.track(j)
#         track_names_lst.append(track['name'])
#         artist_names_lst.append(track['artists'][0]['name'])
#     else:
#         accs_token = spot_cc.refresh_access_token(accs_token['refresh_token'])
#         token_exp = spot_cc.is_token_expired(accs_token)
#         spotify_session = spotipy.client.Spotify(auth=accs_token['access_token'])

This cell corresponds to previus cells, do not run uncommented


In [53]:
print("""Here we can simply import the CSV file created from all of the gathered song's aud_feats, titles, and artist names""")



# # DataFrame containing artists names and track names
# full_track_artist_names = pd.DataFrame(list(zip(track_names_lst, artist_names_lst)), columns=['track name', 'artist'])
# full_track_artist_names.head(3)

Here we can simply import the CSV file created from all of the gathered song's aud_feats, titles, and artist names


In [23]:
print('''This was the resulting dataframe from the gathered song data''')

# DataFrame from the obtained audio features
# full_df = pd.DataFrame(lst_v_lst, columns=k_lst)

This was the resulting dataframe from the gathered song data


## Importing newly generated song data from corresponding CSV files
- full_track_artist_names: composed of track names corresponding to Audio Features generated amanually (see previous commented out cells)
- full_df: contains all Audio Features for over 100k songs
- songs_100_df: contains dataset found online that contains same parameters

In [24]:
# Importing CSV files of +200k songs

full_df = pd.read_csv('/Users/flanuer/Downloads/Lambda/Course_material/misc_datasets/100k_song_aud_feat.csv', index_col='Unnamed: 0')
drop_col = ['uri']
full_df = full_df.drop(drop_col, axis=1)
full_df.head()

drop_cols = ['track name', 'artist', 'uri']
songs_100_df = pd.read_csv('/Users/flanuer/Downloads/Lambda/Course_material/misc_datasets/songs100k.csv', index_col='Unnamed: 0')
songs_100_df = songs_100_df.drop(drop_cols, axis=1)
songs_100_df.head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,id,duration_ms,time_signature
0,0.743,0.339,1,-7.678,1,0.409,0.00582,0.0,0.0812,0.118,203.927,2RM4jf1Xa9zPgMGRDiht8O,238373,4
1,0.846,0.557,8,-7.259,1,0.457,0.0244,0.0,0.286,0.371,159.009,1tHDG53xJNGsItRA3vfVgs,214800,4
2,0.603,0.723,9,-5.89,0,0.0454,0.025,0.0,0.0824,0.382,114.966,6Wosx2euFPMT14UXiWudMy,138913,4
3,0.8,0.579,5,-12.118,0,0.0701,0.0294,0.912,0.0994,0.641,123.003,3J2Jpw61sO7l6Hc7qdYV91,125381,4
4,0.783,0.792,7,-10.277,1,0.0661,3.5e-05,0.878,0.0332,0.928,120.047,2jbYvQCyPgX3CdmAzeVeuS,124016,4


In [25]:
# Joining full_df and songs_100_df

df = pd.concat([full_df, songs_100_df], ignore_index=True)
df.shape

(234807, 14)

In [26]:
print('''No need to do anything here.''')
# Exporting dataframe of +100k songs to csv file
# full_df.to_csv(r'/Users/flanuer/Downloads/Lambda/Course_material/misc_datasets/100k_song_aud_feat.csv')
# full_track_artist_names.to_csv(r'/Users/flanuer/Downloads/Lambda/Course_material/misc_datasets/100k_song_names.csv')

No need to do anything here.


In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 234807 entries, 0 to 234806
Data columns (total 14 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   danceability      234807 non-null  float64
 1   energy            234807 non-null  float64
 2   key               234807 non-null  int64  
 3   loudness          234807 non-null  float64
 4   mode              234807 non-null  int64  
 5   speechiness       234807 non-null  float64
 6   acousticness      234807 non-null  float64
 7   instrumentalness  234807 non-null  float64
 8   liveness          234807 non-null  float64
 9   valence           234807 non-null  float64
 10  tempo             234807 non-null  float64
 11  id                234807 non-null  object 
 12  duration_ms       234807 non-null  int64  
 13  time_signature    234807 non-null  int64  
dtypes: float64(9), int64(4), object(1)
memory usage: 25.1+ MB


In [28]:
# df.dropna(inplace=True)
df.isna().value_counts()

danceability  energy  key    loudness  mode   speechiness  acousticness  instrumentalness  liveness  valence  tempo  id     duration_ms  time_signature
False         False   False  False     False  False        False         False             False     False    False  False  False        False             234807
dtype: int64

In [29]:
top_tracks_df.head()
top_tracks_df.isna().value_counts()

danceability  energy  key    loudness  mode   speechiness  acousticness  instrumentalness  liveness  valence  tempo  id     duration_ms  time_signature
False         False   False  False     False  False        False         False             False     False    False  False  False        False             50
dtype: int64

# Baseline Explorations (ML)
- Select type of problem type (class/reg)
- Determine model baselines
- Model evaluations/comparisons

#### Note: Modeel takes in Audio Features of one user, and attempts to predict the songs of the other user; the resulting song IDs are used to generate a playlist

In [130]:
from sklearn import metrics
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import NearestNeighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer, LabelEncoder

In [154]:
# Setting up train, test, and val sets

target = 'id'
features = ['danceability',
            'energy',
            'key',
            'loudness',
            'mode',
            'speechiness',
            'acousticness',
            'instrumentalness',
            'liveness',
            'valence',
            'tempo',
            'duration_ms',
            'time_signature']

# Splitting DF into target and features 
y = df[target]
X = df[features]

# Train-Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=.5, random_state=42)

# Splitting the DFs
y_top = top_tracks_df[target]
X_top = top_tracks_df[features]

# User library
X_train_top, X_test_top, y_train_top, y_test_top = train_test_split(X_top, y_top, test_size=0.8)

In [155]:
# Check
X_train.shape, X_test.shape, X_val.shape, y_train.shape, y_test.shape, y_val.shape

((140884, 13), (46961, 13), (46962, 13), (140884,), (46961,), (46962,))

In [156]:

df['id'].nunique(), df.shape

(210124, (234807, 14))

In [158]:
nbrs = NearestNeighbors(n_neighbors=10).fit(X_train)

In [162]:
distances, indeces = nbrs.kneighbors(X_test_top)


In [163]:
distances[:3], indeces[:3]

(array([[5.00262096e+00, 1.40925573e+01, 1.83492842e+01, 2.24624883e+01,
         2.29419717e+01, 2.29419717e+01, 2.29419717e+01, 2.36305319e+01,
         2.49519690e+01, 2.59682061e+01],
        [1.24900090e-16, 5.68828932e+00, 7.62011736e+00, 1.01533027e+01,
         1.02972566e+01, 1.08865196e+01, 1.23518261e+01, 1.23518261e+01,
         1.23518261e+01, 1.23518261e+01],
        [1.01908687e+01, 1.28629808e+01, 1.40826989e+01, 1.51806596e+01,
         1.71709211e+01, 1.79428443e+01, 1.79428443e+01, 1.84838966e+01,
         1.85080220e+01, 1.85648789e+01]]),
 array([[ 99455,  30220,  64951, 100667,  82958,  43132, 105669,  62737,
          42681,  78163],
        [108804,  67777, 131652,  27409, 130950,    539,  71222,  43183,
          64829, 112580],
        [121123, 132748,   4303,  68766,  97712,  90908,  57475,  97228,
         126080,   8009]]))

In [164]:

indeces[0]

array([ 99455,  30220,  64951, 100667,  82958,  43132, 105669,  62737,
        42681,  78163])

In [251]:
tracks = []
for i in indeces:
#     print(i[0])
    tracks.append(df['id'][i[0]])
#     tracks.append(df['id'][i[1]])

In [252]:
tracks

['30aVFx8u8eYjy0P6flmPYp',
 '3i7K9L8g42Wu6LgtsdSJjF',
 '0L18rT0je17KgE1FRLf1AM',
 '19zmrH1uz5g3AAY5dJPrT0',
 '72kB1jI3G6H3zv33Qwmmhe',
 '18sytW2s53Of6NVudQyUlH',
 '0C1KlvRBPF9xsJukwEy9PU',
 '5qDHcacw8UaDk65tBLRRi9',
 '60mz2UG8P6BTE0sSl1MNMJ',
 '1IhbIqPsEyxPNNf6HPvMWA',
 '1MWRv3RY4BYG1BkNpMzDeu',
 '2vjRbqK0ulQauqXaUtiiMd',
 '4kRGpTEcDdZTAbc645OL2U',
 '31u9mT0DLuFjg5ACrB2NaH',
 '2MlyLlDv5w7IOoU83fJmnj',
 '5ZHjjyuxyAy93zs0XvVUcv',
 '0FdTBwYr1aNF1smfqaoCde',
 '4lodIS5T59kz41C8EYARpA',
 '43uUgljnKkjHerhN1lQy3o',
 '61zZe6nTfoyRsuIwMg3DR8',
 '6tp18AVA4FEjFTYaSD8exB',
 '12rCCzDNg0Wd7XLGgtvr93',
 '1go3ahvAMoI3Dq76CKhToi',
 '2nMeu6UenVvwUktBCpLMK9',
 '7HuTLuEGUH0dD0k7fW3QFE',
 '5IkfrGnbBKIHcqMDi8vLX9',
 '26fZwf1ImE4aUJ4XaqOkUg',
 '4TgHt7vKCimpywaiKfl0uj',
 '2ygvZOXrIeVL4xZmAWJT2C',
 '4jsdqalaKwDTdPGLvps128',
 '7sDO194LVW5x4vic4ZAjgy',
 '4qZkFcpbvgqs7LgGHRgmXa',
 '4JIgz47B5CToUHw6jtlWJp',
 '3TwSbKoAUJmDK2jU9pGurN',
 '58cKWhCT3I4yhxtNpBlMli',
 '15MTd64KUMG7CF6mOyovsQ',
 '5Cn7gsZSmvcC1b1t1m76ak',
 

In [255]:
# OAuth Creds
spot_cc = spotipy.oauth2.SpotifyOAuth(username='agustinvargas',
                                      client_id=SPOTIFY_CLIENT_ID,
                                      client_secret=SPOTIFY_CLIENT_SECRET,
                                      scope=scope,
                                      redirect_uri=uri)

# # Token Access Dict
accs_token = spot_cc.get_access_token(as_dict=True)


# # Refreshing token
refresh_accs_token = spot_cc.refresh_access_token(accs_token['refresh_token'])

# SpotiSesh
spotify_session = spotipy.Spotify(auth=accs_token)

spotify_session.tracks(tracks)

# pseudo code

# end_offset = 0
# start_offset = 0
# while end_offset <= len(track_ids):
#     start_offset = start_offset + end_offset
#     end_offset = start_offset + 50
#     get the audio_feats for the first 50

# len(tracks)
# j = tracks
# for _, v in audio_feat_2[0].items():
#     spot_trakcs = spotify_session.tracks(tracks=j)
#     print(v)
# print(audio_feat_2[0])





  if __name__ == '__main__':


SpotifyException: http status: 400, code:-1 - https://api.spotify.com/v1/tracks/?ids=30aVFx8u8eYjy0P6flmPYp,3i7K9L8g42Wu6LgtsdSJjF,0L18rT0je17KgE1FRLf1AM,19zmrH1uz5g3AAY5dJPrT0,72kB1jI3G6H3zv33Qwmmhe,18sytW2s53Of6NVudQyUlH,0C1KlvRBPF9xsJukwEy9PU,5qDHcacw8UaDk65tBLRRi9,60mz2UG8P6BTE0sSl1MNMJ,1IhbIqPsEyxPNNf6HPvMWA,1MWRv3RY4BYG1BkNpMzDeu,2vjRbqK0ulQauqXaUtiiMd,4kRGpTEcDdZTAbc645OL2U,31u9mT0DLuFjg5ACrB2NaH,2MlyLlDv5w7IOoU83fJmnj,5ZHjjyuxyAy93zs0XvVUcv,0FdTBwYr1aNF1smfqaoCde,4lodIS5T59kz41C8EYARpA,43uUgljnKkjHerhN1lQy3o,61zZe6nTfoyRsuIwMg3DR8,6tp18AVA4FEjFTYaSD8exB,12rCCzDNg0Wd7XLGgtvr93,1go3ahvAMoI3Dq76CKhToi,2nMeu6UenVvwUktBCpLMK9,7HuTLuEGUH0dD0k7fW3QFE,5IkfrGnbBKIHcqMDi8vLX9,26fZwf1ImE4aUJ4XaqOkUg,4TgHt7vKCimpywaiKfl0uj,2ygvZOXrIeVL4xZmAWJT2C,4jsdqalaKwDTdPGLvps128,7sDO194LVW5x4vic4ZAjgy,4qZkFcpbvgqs7LgGHRgmXa,4JIgz47B5CToUHw6jtlWJp,3TwSbKoAUJmDK2jU9pGurN,58cKWhCT3I4yhxtNpBlMli,15MTd64KUMG7CF6mOyovsQ,5Cn7gsZSmvcC1b1t1m76ak,5XEAV5D4h9CS29YFKOkxYK,4okemFndmfbjMZ4wBJCN0o,0gKVLYWZZkmJ7ZdheAIv1f:
 Only valid bearer authentication supported

In [254]:
spot_tracks = []

for i in tracks:
    spot_tracks.append(spotify_session.tracks([i]))
    print([i])

SpotifyException: http status: 400, code:-1 - https://api.spotify.com/v1/tracks/?ids=30aVFx8u8eYjy0P6flmPYp:
 Only valid bearer authentication supported

In [213]:
for i in spot_tracks['tracks']:
    print(i['name'])
    print(i['external_urls'])

F For You
{'spotify': 'https://open.spotify.com/track/30aVFx8u8eYjy0P6flmPYp'}
Honey Dew
{'spotify': 'https://open.spotify.com/track/3i7K9L8g42Wu6LgtsdSJjF'}
A Million Dreams (The Greatest Showman)
{'spotify': 'https://open.spotify.com/track/0L18rT0je17KgE1FRLf1AM'}
Last
{'spotify': 'https://open.spotify.com/track/19zmrH1uz5g3AAY5dJPrT0'}
Closing Doors - Edit
{'spotify': 'https://open.spotify.com/track/72kB1jI3G6H3zv33Qwmmhe'}
Breakin' My Heart (Pretty Brown Eyes)
{'spotify': 'https://open.spotify.com/track/18sytW2s53Of6NVudQyUlH'}
No Mercy (feat. The-Dream)
{'spotify': 'https://open.spotify.com/track/0C1KlvRBPF9xsJukwEy9PU'}
Adios Nonino (arr. B. Zimmerman)
{'spotify': 'https://open.spotify.com/track/5qDHcacw8UaDk65tBLRRi9'}
Home to Me
{'spotify': 'https://open.spotify.com/track/60mz2UG8P6BTE0sSl1MNMJ'}
Valkyrie
{'spotify': 'https://open.spotify.com/track/1IhbIqPsEyxPNNf6HPvMWA'}
Tawa
{'spotify': 'https://open.spotify.com/track/1MWRv3RY4BYG1BkNpMzDeu'}
The Season's Upon Us - Live From