Instructions
To move forward with the project, you need to create a collection of songs with their audio features - as large as possible!

These are the songs that we will cluster. Later, when the user inputs a song, we will find the cluster to which the song belongs and recommend a song from the same cluster. The more songs you have, the more accurate and diverse recommendations you'll be able to give. However, you might want to make sure the collected songs are "curated" in some way: try to find playlists that are diverse but that also meet certain standards.

The process of sending hundreds or thousands of requests can take some time - it's normal if you have to wait a few minutes (or, if you're ambitious, even hours) to get all the data you need.

One way to collect as many songs as possible is to start with all the songs of a big, diverse playlist, then go to every artist present in the playlist and grab every song from every album of that artist. The number of songs you collect per playlist will grow very quickly: each track leads to an artist, each artist to several albums, and each album to several tracks!
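
The expansion described above (playlist, then artists, then albums, then tracks) could be sketched roughly as follows. This is only an illustration: the function name `collect_artist_track_ids` is hypothetical, and `sp` is assumed to be an authenticated `spotipy.Spotify` client like the one created further down in this notebook.

```python
def collect_artist_track_ids(sp, artist_ids):
    """Return the set of track IDs found on every album of the given artists.

    `sp` is assumed to be an authenticated spotipy.Spotify client.
    """
    track_ids = set()
    for artist_id in set(artist_ids):          # visit each artist only once
        albums = sp.artist_albums(artist_id)   # first page of the artist's albums
        while albums:
            for album in albums["items"]:
                tracks = sp.album_tracks(album["id"])  # first page of album tracks
                while tracks:
                    track_ids.update(t["id"] for t in tracks["items"] if t.get("id"))
                    tracks = sp.next(tracks)   # None when there is no next page
            albums = sp.next(albums)
    return track_ids
```

Using a set means a track that appears on several albums is only kept once. The artist IDs themselves would come from the `artists` field of each playlist track.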

In [1]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import time
import pandas as pd


In [2]:
input_file = open("/Users/renev/OneDrive/Desktop/input.txt","r")

In [3]:
input_file

<_io.TextIOWrapper name='/Users/renev/OneDrive/Desktop/input.txt' mode='r' encoding='cp1252'>

In [4]:
string = input_file.read()
input_file.close()  # release the file handle once the contents are read

In [7]:
# string.split('\n')

In [8]:
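secrets_dict = {}
for line in string.split('\n'):
    if ':' in line:
        # Split on the first colon only, and drop stray whitespace around key and value
        key, value = line.split(':', 1)
        secrets_dict[key.strip()] = value.strip()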
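
The file-reading and parsing cells above could also be folded into one small helper. This is a minimal sketch, assuming the file contains one `key:value` pair per line; the name `read_secrets` and the UTF-8 encoding are assumptions, not part of the original notebook.

```python
def read_secrets(path):
    """Parse `key:value` lines from a text file into a dict."""
    secrets = {}
    # The with-block closes the file automatically, even if parsing fails
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or ":" not in line:
                continue  # skip blank lines and lines without a separator
            key, value = line.split(":", 1)  # split on the first colon only
            secrets[key.strip()] = value.strip()
    return secrets
```

It would be called as `secrets_dict = read_secrets(path_to_input_file)`, after which `secrets_dict['client_id']` and `secrets_dict['client_secret']` can be used as in the cells below.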

In [9]:
auth_manager = SpotifyClientCredentials(client_id = secrets_dict['client_id'], 
                                        client_secret = secrets_dict['client_secret'])

In [12]:
# secrets_dict['client_id']

In [13]:
# secrets_dict['client_secret']

In [14]:
sp = spotipy.Spotify(auth_manager=auth_manager)

Playlists

In [20]:
playlist = sp.user_playlist_tracks("spotify", "37i9dQZF1DWTmvXBN4DgpA")

# Top2000: https://open.spotify.com/playlist/37i9dQZF1DWTmvXBN4DgpA?si=9c054f5e1f984460&nd=1

In [22]:
# playlist

In [23]:
len(playlist["items"])

100

In [24]:
playlist['next']

'https://api.spotify.com/v1/playlists/37i9dQZF1DWTmvXBN4DgpA/tracks?offset=100&limit=100&additional_types=track'

In [26]:
# sp.next(playlist)

Get track_ids

In [168]:
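def get_playlist_track_id(user, playlist_id):
    # Collect the track IDs of a playlist, following the paginated 'tracks' object.
    track_ids = []
    results = sp.user_playlist(user, playlist_id)
    tracks = results['tracks']
    while tracks:
        track_ids.extend(item['track']['id'] for item in tracks['items'] if item['track'])
        tracks = sp.next(tracks)  # returns None once there are no more pages
    return track_ids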

In [27]:
def get_playlist_ids(user,playlist_id):
    results = sp.user_playlist_tracks(user,playlist_id)
    tracks = results['items']
    ids = []
    while results['next']:
        results = sp.next(results)
        tracks.extend(results['items'])
    for s in tracks:
        # Some playlist entries have no track object (e.g. removed tracks); skip them
        if s["track"]:
            ids.append(s["track"]["id"])
    return ids


In [28]:
track_ids = get_playlist_ids("spotify","37i9dQZF1DWTmvXBN4DgpA")

In [29]:
track_ids

['7rO7Pc5dkC2EIW1OKsCJtQ',
 '4u7EnebtmKWzUH433cf5Qv',
 '40riOy7x9W7GXjyGp4pjAv',
 '3FCto7hnn1shUyZL42YgfO',
 '5CQ30WqJwcep0pYcV4AMNc',
 '5Xak5fmy089t0FYmh3VJiY',
 '3n3F07lHLyRKwqg4q64eYA',
 '7LVHVU3tWfcxj5aiPFEW4Q',
 '1HzDhHApjdjXPLHF6GGYhu',
 '7Jh1bpe76CNTCgdgAdBw4Z',
 '4YJUTdZ0Pgl0ZeNyHYXeLd',
 '5U9iJk3ONTlZCSJrokOP1i',
 '54X78diSLoUDI3joC2bjMz',
 '2374M0fQpWi3dLnB54qaLX',
 '21cp8L9Pei4AgysZVihjSv',
 '3YRCqOhFifThpSRFJ1VWFM',
 '0nLiqZ6A27jJri2VCalIUs',
 '0CzeAbfKFnxnWjwo5iYiCG',
 '7Fg4jpwpkdkGCvq1rrXnvx',
 '1Cj2vqUwlJVG27gJrun92y',
 '7HrzErXq3TsKOY1gmdIShB',
 '4AUASx1KCTQFmpHu7qq6Kr',
 '3YfS47QufnLDFA71FUsgCM',
 '57bgtoPSgt236HzfBOd8kj',
 '1f3yAtsJtY87CTmM8RLnxf',
 '7pKfPomDEeI4TPT6EOYjn9',
 '37Tmv4NnfQeb0ZgUC4fOJj',
 '32dnKMni3I3gwUbWp4mi45',
 '3h7NZT1CATbn4GqUs51vwf',
 '5IX4TbIR5mMHGE4wiWwKW0',
 '6NP6BCW2M2I4vdcnXMAvjl',
 '6i81qFkru6Kj1IEsB7KNp2',
 '5tU9JM1v72X7oM808Am6Fq',
 '2g7gviEeJr6pyxO7G35EWQ',
 '7o2CTH4ctstm8TNelqjb51',
 '18AXbzPzBS8Y3AkgSxzJPb',
 '2VxeLyX666F8uXCJ0dZF8B',
 ...]

In [30]:
len(track_ids)

1997

In [198]:
# work through this >>>>>

def getTrackFeatures(track_id):
    # Fetch the metadata and the audio features for a single track.
    track_info = sp.track(track_id)
    # audio_features returns a list; its entry is None when no analysis is available
    track_features = sp.audio_features(track_id)[0] or {}

    # Track info
    name = track_info['name']
    album = track_info['album']['name']
    artist = track_info['album']['artists'][0]['name']
    release_date = track_info['album']['release_date']
    length = track_info['duration_ms']
    popularity = track_info['popularity']

    # Track features, in the column order used for the DataFrame
    feature_keys = ['acousticness', 'danceability', 'energy', 'instrumentalness',
                    'key', 'liveness', 'loudness', 'mode', 'speechiness',
                    'tempo', 'time_signature', 'valence']
    duration_ms = track_features.get('duration_ms')

    return ([track_id, name, album, artist, duration_ms, release_date, length, popularity]
            + [track_features.get(k) for k in feature_keys])

track_list = []
for track_id in track_ids:
    time.sleep(.3)  # pause between requests to stay under the API rate limit
    track_list.append(getTrackFeatures(track_id))
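
The loop above sends two requests per track. A possible way to cut the number of requests is to ask for audio features in batches, since `sp.audio_features` accepts a list of up to 100 track IDs. The sketch below is only an illustration: the helper names `chunked` and `fetch_audio_features` are hypothetical, and `sp` is assumed to be the authenticated client created earlier.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_audio_features(sp, track_ids, batch_size=100):
    """Return the audio-feature dicts for `track_ids`, requested in batches.

    `sp` is assumed to be an authenticated spotipy.Spotify client.
    Tracks for which the API returns None are skipped.
    """
    features = []
    for batch in chunked(track_ids, batch_size):
        features.extend(f for f in sp.audio_features(batch) if f is not None)
    return features
```

With 1997 track IDs this needs 20 requests for the audio features instead of 1997. The track metadata (name, album, popularity) would still have to be fetched separately and joined on `id`.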


In [199]:
features = pd.DataFrame(track_list, columns = ['id','name', 'album', 'artist', 'duration_ms','release_date', 'length', 'popularity',  'acousticness',
                                                'danceability','energy', 'instrumentalness','key',
                  'liveness','loudness','mode','speechiness', 'tempo','time_signature','valence'])

In [200]:
features

Unnamed: 0,id,name,album,artist,duration_ms,release_date,length,popularity,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,7rO7Pc5dkC2EIW1OKsCJtQ,Roller Coaster,Pressure Makes Diamonds,Danny Vera,269986,2019-02-15,269985,1,0.51000,0.400,0.383,0.007630,9,0.1210,-10.048,1,0.0279,96.944,4,0.285
1,4u7EnebtmKWzUH433cf5Qv,Bohemian Rhapsody - Remastered 2011,A Night At The Opera (2011 Remaster),Queen,354320,1975-11-21,354320,80,0.27100,0.414,0.404,0.000000,0,0.3000,-9.928,0,0.0499,71.105,4,0.224
2,40riOy7x9W7GXjyGp4pjAv,Hotel California - 2013 Remaster,Hotel California (2013 Remaster),Eagles,391376,1976-12-08,391376,82,0.00574,0.579,0.508,0.000494,2,0.0575,-9.484,1,0.0270,147.125,4,0.609
3,3FCto7hnn1shUyZL42YgfO,Piano Man,The Essential Billy Joel,Billy Joel,336093,2001-10-02,336093,65,0.60000,0.334,0.472,0.000004,0,0.3170,-8.792,1,0.0277,179.167,3,0.431
4,5CQ30WqJwcep0pYcV4AMNc,Stairway to Heaven - Remaster,Led Zeppelin IV (Deluxe Edition),Led Zeppelin,482830,1971-11-08,482830,78,0.58000,0.338,0.340,0.003200,9,0.1160,-12.049,0,0.0339,82.433,4,0.197
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1992,2PmaEYL0IBSdumcG4KUYhI,Zondag,Rob 100,Rob De Nijs,182507,2007-01-01,182506,35,0.15100,0.575,0.832,0.000000,6,0.2450,-7.168,1,0.0331,126.472,4,0.869
1993,0NBgmP6Yg5xZYsnIZ0pGdo,Atmosphere,The Best Of,Joy Division,250413,2008-03-20,250413,50,0.16500,0.565,0.467,0.284000,6,0.0808,-10.078,1,0.0303,120.170,4,0.426
1994,5ry3S7tuPmoBoEikcwQEEi,Laat Mij Maar Alleen,Het Beste Van Klein Orkest,Klein Orkest,201333,1987-01-01,201333,38,0.01960,0.448,0.740,0.000003,7,0.2310,-10.050,1,0.0633,152.996,4,0.940
1995,6mL95upmS5E97bZXNtaSUc,Sweet Jane,Berlin: Live at St. Ann's Warehouse,Lou Reed,331187,2008-11-04,331186,0,0.47900,0.827,0.458,0.116000,2,0.6980,-10.177,1,0.0639,114.633,4,0.879


In [201]:
# The index is written as an extra column; it comes back as 'Unnamed: 0' when the file is read
features.to_csv('features_new.csv')

In [36]:
features_new = pd.read_csv('features_new.csv')




In [39]:
features_new

Unnamed: 0.1,Unnamed: 0,id,name,album,artist,duration_ms,release_date,length,popularity,acousticness,...,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,0,7rO7Pc5dkC2EIW1OKsCJtQ,Roller Coaster,Pressure Makes Diamonds,Danny Vera,269986,2019-02-15,269985,1,0.51000,...,0.383,0.007630,9,0.1210,-10.048,1,0.0279,96.944,4,0.285
1,1,4u7EnebtmKWzUH433cf5Qv,Bohemian Rhapsody - Remastered 2011,A Night At The Opera (2011 Remaster),Queen,354320,1975-11-21,354320,80,0.27100,...,0.404,0.000000,0,0.3000,-9.928,0,0.0499,71.105,4,0.224
2,2,40riOy7x9W7GXjyGp4pjAv,Hotel California - 2013 Remaster,Hotel California (2013 Remaster),Eagles,391376,1976-12-08,391376,82,0.00574,...,0.508,0.000494,2,0.0575,-9.484,1,0.0270,147.125,4,0.609
3,3,3FCto7hnn1shUyZL42YgfO,Piano Man,The Essential Billy Joel,Billy Joel,336093,2001-10-02,336093,65,0.60000,...,0.472,0.000004,0,0.3170,-8.792,1,0.0277,179.167,3,0.431
4,4,5CQ30WqJwcep0pYcV4AMNc,Stairway to Heaven - Remaster,Led Zeppelin IV (Deluxe Edition),Led Zeppelin,482830,1971-11-08,482830,78,0.58000,...,0.340,0.003200,9,0.1160,-12.049,0,0.0339,82.433,4,0.197
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1992,1992,2PmaEYL0IBSdumcG4KUYhI,Zondag,Rob 100,Rob De Nijs,182507,2007-01-01,182506,35,0.15100,...,0.832,0.000000,6,0.2450,-7.168,1,0.0331,126.472,4,0.869
1993,1993,0NBgmP6Yg5xZYsnIZ0pGdo,Atmosphere,The Best Of,Joy Division,250413,2008-03-20,250413,50,0.16500,...,0.467,0.284000,6,0.0808,-10.078,1,0.0303,120.170,4,0.426
1994,1994,5ry3S7tuPmoBoEikcwQEEi,Laat Mij Maar Alleen,Het Beste Van Klein Orkest,Klein Orkest,201333,1987-01-01,201333,38,0.01960,...,0.740,0.000003,7,0.2310,-10.050,1,0.0633,152.996,4,0.940
1995,1995,6mL95upmS5E97bZXNtaSUc,Sweet Jane,Berlin: Live at St. Ann's Warehouse,Lou Reed,331187,2008-11-04,331186,0,0.47900,...,0.458,0.116000,2,0.6980,-10.177,1,0.0639,114.633,4,0.879


In [31]:
features_from_project = pd.read_csv('combined-csv-files-Copy1.csv')
features_from_project.head()


Unnamed: 0.1,Unnamed: 0,id,name,album,artist,duration_ms,release_date,length,popularity,acousticness,...,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,0.0,1Pir8D7cNrusAlODsBuhih,Твой бред,Твой бред,FOILAR,130986,2021-06-25,130985,33,0.0201,...,0.936,0.0,0,0.0885,-2.336,1,0.104,142.01,4,0.809
1,1.0,1rHcgLKBdI6iTmAiISOzU7,A 100,A 100,Lenny Tavárez,218913,2021-02-25,218912,55,0.53,...,0.556,2.21e-05,6,0.147,-4.89,1,0.41,92.008,4,0.47
2,2.0,70IVQO4Pfr0oMszh7KLfjK,すきっ!,ときおとめ,Tokimeki Sendenbu,316333,2018-03-28,316333,41,0.0188,...,0.952,5.09e-06,10,0.387,-2.318,1,0.0646,163.008,4,0.445
3,3.0,45nCkTGBWQ3kjt9uxZybUP,Keď som išiel...,Bolo nás jedenást,Milan Lasica,153467,1981-01-01,153466,21,0.31,...,0.959,0.0,4,0.351,-2.663,1,0.137,136.822,4,0.69
4,4.0,15mFoRfeFEHFksh7UGjrvT,分你一半,分你一半,不是花火呀,195200,2021-04-14,195200,46,0.869,...,0.307,0.0,3,0.371,-10.9,1,0.0714,75.067,4,0.473


In [33]:
features_from_project.shape

# After dropping duplicates during the project, there were 206429 rows × 21 columns

(207395, 21)

In [37]:
features_all = pd.concat([features_from_project, features_new], axis=0)

In [40]:
features_all.head()

Unnamed: 0.1,Unnamed: 0,id,name,album,artist,duration_ms,release_date,length,popularity,acousticness,...,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,0.0,1Pir8D7cNrusAlODsBuhih,Твой бред,Твой бред,FOILAR,130986,2021-06-25,130985,33,0.0201,...,0.936,0.0,0,0.0885,-2.336,1,0.104,142.01,4,0.809
1,1.0,1rHcgLKBdI6iTmAiISOzU7,A 100,A 100,Lenny Tavárez,218913,2021-02-25,218912,55,0.53,...,0.556,2.21e-05,6,0.147,-4.89,1,0.41,92.008,4,0.47
2,2.0,70IVQO4Pfr0oMszh7KLfjK,すきっ!,ときおとめ,Tokimeki Sendenbu,316333,2018-03-28,316333,41,0.0188,...,0.952,5.09e-06,10,0.387,-2.318,1,0.0646,163.008,4,0.445
3,3.0,45nCkTGBWQ3kjt9uxZybUP,Keď som išiel...,Bolo nás jedenást,Milan Lasica,153467,1981-01-01,153466,21,0.31,...,0.959,0.0,4,0.351,-2.663,1,0.137,136.822,4,0.69
4,4.0,15mFoRfeFEHFksh7UGjrvT,分你一半,分你一半,不是花火呀,195200,2021-04-14,195200,46,0.869,...,0.307,0.0,3,0.371,-10.9,1,0.0714,75.067,4,0.473


In [42]:
features_all.shape

(209392, 21)

In [43]:
print(207395+1997)

209392


In [44]:
# Drop duplicate track_ids

features_all = (features_all.sort_values("id")
                            .drop_duplicates(subset="id", keep="first"))

In [45]:
features_all.shape

(207568, 21)

In [46]:
print(206649+1997)

208646


In [47]:
print(208646-207568)

# 1078 of the 1997 playlist tracks were dropped as duplicates, so 1997 - 1078 = 919 tracks were added to the list

1078
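
The same bookkeeping can be done directly on the ID columns instead of by subtracting row counts. This is a minimal sketch using plain Python sets; the function name `count_added` is hypothetical.

```python
def count_added(existing_ids, new_ids):
    """Return (added, duplicates) for a batch of new track IDs.

    `added` counts IDs in `new_ids` that are not in `existing_ids`;
    `duplicates` counts the remaining entries of `new_ids`, i.e. IDs that
    were already known or that are repeated within the batch.
    """
    added = set(new_ids) - set(existing_ids)
    return len(added), len(new_ids) - len(added)
```

For example, `count_added(features_from_project["id"], list(features_new["id"]))` would give the number of genuinely new tracks and the number of duplicates in one step.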


In [48]:
features_all.to_csv('features_all.csv')