# Generate Dataset

The dataset will consist of the saved tracks from my Spotify account. This notebook documents their retrieval using the `spotipy` library, which wraps the Spotify Web API.

Running the cell below will open your default web browser so you can authenticate with your Spotify account and obtain an access token.

In [1]:
import sys
import time

import spotipy as sp
import spotipy.util as util
import pandas as pd

credentials = {
    'username': 'rzera',
    'scope': 'user-library-read',
    'client_id': '52899fd885774f98ba9481e894e0d22e',
    'client_secret': '1280a1394bfe430ea08c11ef1caecaa8',
    'redirect_uri': 'http://localhost'
}

token = util.prompt_for_user_token(**credentials)
if token:
    cli = sp.Spotify(auth=token)
    print('Token successfully acquired!')
else:
    print('Failed to retrieve token')

Token successfully acquired!
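
Hard-coding `client_id` and `client_secret` in a notebook exposes them to anyone who reads it. A minimal sketch of one alternative, assuming the values are exported as environment variables beforehand. The helper name `load_credentials` is hypothetical, and the `SPOTIPY_*` variable names are an assumption that follows spotipy's own naming convention.

```python
import os


def load_credentials(username, scope='user-library-read', env=None):
    """Build the keyword arguments for util.prompt_for_user_token.

    `env` defaults to os.environ; passing a plain dict makes testing easier.
    """
    env = os.environ if env is None else env
    required = ['SPOTIPY_CLIENT_ID', 'SPOTIPY_CLIENT_SECRET', 'SPOTIPY_REDIRECT_URI']
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {
        'username': username,
        'scope': scope,
        'client_id': env['SPOTIPY_CLIENT_ID'],
        'client_secret': env['SPOTIPY_CLIENT_SECRET'],
        'redirect_uri': env['SPOTIPY_REDIRECT_URI'],
    }
```

The returned dict has the same keys as `credentials` above, so it could be passed as `util.prompt_for_user_token(**load_credentials('rzera'))`.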


Let's query my saved tracks. Keep in mind that we can fetch at most 50 tracks per request, so we'll iterate until we have them all.

In [2]:
results = cli.current_user_saved_tracks(limit=50, offset=0)
len(results['items'])

50

What are the available attributes of a given track?

In [8]:
results['items'][0]['track'].keys()

dict_keys(['album', 'artists', 'available_markets', 'disc_number', 'duration_ms', 'explicit', 'external_ids', 'external_urls', 'href', 'id', 'is_local', 'name', 'popularity', 'preview_url', 'track_number', 'type', 'uri'])

Spotify provides an API that describes a track through audio features such as danceability, energy and tempo. Our task now is to use it to enrich our dataset-to-be.

In [11]:
cli.audio_features(['1wUi98zKIqQqeJaSsORKqm'])

[{'danceability': 0.803,
  'energy': 0.685,
  'key': 7,
  'loudness': -13.089,
  'mode': 1,
  'speechiness': 0.0507,
  'acousticness': 1.18e-05,
  'instrumentalness': 0.89,
  'liveness': 0.216,
  'valence': 0.0648,
  'tempo': 123.021,
  'type': 'audio_features',
  'id': '1wUi98zKIqQqeJaSsORKqm',
  'uri': 'spotify:track:1wUi98zKIqQqeJaSsORKqm',
  'track_href': 'https://api.spotify.com/v1/tracks/1wUi98zKIqQqeJaSsORKqm',
  'analysis_url': 'https://api.spotify.com/v1/audio-analysis/1wUi98zKIqQqeJaSsORKqm',
  'duration_ms': 433339,
  'time_signature': 4}]

Now we can keep querying Spotify until a request returns fewer tracks than we asked for!

In [15]:
def make_track_dict(track_item: dict, track_features: dict = None) -> dict:
    """Returns a row describing a track"""
    return {
        'id': track_item['track']['id'],
        'name': track_item['track']['name'],
        'added_at': track_item['added_at'],
        'artist_id': track_item['track']['artists'][0]['id'],
        'artist_name': track_item['track']['artists'][0]['name'],
        'popularity': track_item['track']['popularity'],
        'explicit': track_item['track']['explicit'],
        'duration_ms': track_item['track']['duration_ms']
    }

keep_going = True
step = 50
offset = 0
tracklist = []
start = time.time()

while keep_going:
    results = cli.current_user_saved_tracks(limit=step, offset=offset)
    for track in results['items']:
        tracklist.append(make_track_dict(track))
    
    fetch_len = len(results['items'])
    # stupid progress bar :D
    print(f'{"-" * int(offset/step)}> {offset + fetch_len}')

    # control
    offset += step
    if fetch_len < step:
        keep_going = False
        
print(f'{len(tracklist)} tracks fetched')
print(f'Elapsed time: {int(time.time() - start)}s')

> 50
-> 100
--> 150
---> 200
----> 250
-----> 300
------> 350
-------> 400
--------> 450
---------> 500
----------> 550
-----------> 600
------------> 650
-------------> 700
--------------> 750
---------------> 800
----------------> 850
-----------------> 900
------------------> 924
924 tracks fetched
Elapsed time: 9s


We should enrich this data with its audio features (`trackfeat`) so that everything lives in the same dataset.

```
TODO: Abstract a function for this while-iteration
```
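
One way the TODO above might be addressed is a small generator that hides the offset bookkeeping. This is only a sketch: `fetch_all` and its `fetch_page` callback are hypothetical names, and it assumes the source returns a short (or empty) page once the data runs out, as the loop above does.

```python
from typing import Callable, Iterator, List


def fetch_all(fetch_page: Callable[[int, int], List[dict]], step: int = 50) -> Iterator[dict]:
    """Yield items from an offset-paginated source.

    `fetch_page(limit, offset)` must return a list of at most `limit` items.
    Iteration stops after the first page shorter than `step`.
    """
    offset = 0
    while True:
        page = fetch_page(step, offset)
        yield from page
        if len(page) < step:
            break
        offset += step
```

With the client from the first cell it might be used as `list(fetch_all(lambda limit, offset: cli.current_user_saved_tracks(limit=limit, offset=offset)['items']))`.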

In [16]:
keep_going = True
lower = 0
step = 50

trackfeat = []
ids = [t['id'] for t in tracklist]

start = time.time()

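# also stop when the ids are exhausted, so that an exact multiple of `step`
# never sends an empty id list to the API
while keep_going and lower < len(ids):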
    fetch_feat = cli.audio_features(ids[lower:(lower + step)])
    trackfeat.append(fetch_feat)
    
    # stupid progress bar :D
    print(f'{"-" * int(lower/step)}> {lower + len(fetch_feat)}')
    
    lower += step
    if len(fetch_feat) < step:
        keep_going = False
        
# flatten the list of chunks into a single list of per-track features
trackfeat = [track for chunk in trackfeat for track in chunk]
        
print(f'{len(trackfeat)} tracks-features fetched')
print(f'Elapsed time: {int(time.time() - start)}s')

> 50
-> 100
--> 150
---> 200
----> 250
-----> 300
------> 350
-------> 400
--------> 450
---------> 500
----------> 550
-----------> 600
------------> 650
-------------> 700
--------------> 750
---------------> 800
----------------> 850
-----------------> 900
------------------> 924
924 tracks-features fetched
Elapsed time: 8s


Now we have two lists of `dict`s, where the entries at the same position should describe the same track. So it only makes sense to merge them. "Merge" is a vague word; what we want is:

> For each pair of dicts from `tracklist` and `trackfeat`, take all key-value pairs from both and put them into a single new dict.

It is important to know whether these dicts have overlapping keys:

In [17]:
overlapping = tracklist[0].keys() & trackfeat[0].keys()
overlapping

{'duration_ms', 'id'}

Okay, there are. But do they hold the same values? Meaning: can we *really* join them without concern?

In [18]:
for track, feat in zip(tracklist, trackfeat):
    for overlap in overlapping:
        if track[overlap] != feat[overlap]:
            print(f'Oops.. different found! ({track[overlap]} !={feat[overlap]})')
            
print('Yep, they are all the same :)')

Oops.. different found! (450095 !=450096)
Oops.. different found! (448526 !=448527)
Oops.. different found! (255150 !=255151)
Oops.. different found! (189090 !=189091)
Oops.. different found! (483436 !=483437)
Oops.. different found! (427567 !=427568)
Oops.. different found! (492879 !=492880)
Oops.. different found! (378536 !=378537)
Oops.. different found! (418030 !=418031)
Oops.. different found! (478032 !=478033)
Oops.. different found! (449506 !=449507)
Oops.. different found! (391509 !=391510)
Oops.. different found! (319120 !=319121)
Oops.. different found! (412266 !=412267)
Oops.. different found! (711026 !=711027)
Oops.. different found! (387826 !=387827)
Oops.. different found! (324995 !=324996)
Oops.. different found! (169319 !=169320)
Oops.. different found! (448590 !=448591)
Oops.. different found! (434909 !=434910)
Oops.. different found! (487544 !=487545)
Oops.. different found! (418016 !=418017)
Oops.. different found! (408018 !=408019)
Oops.. different found! (438240 !=

TypeError: 'NoneType' object is not subscriptable

Wow. It seems we found something unusual in our feature dataset: we hit a `None` value, and the comparison stopped there. Note also that several `duration_ms` values differ by one millisecond between the two sources.
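
The check above stops at the first `None`. A None-tolerant variant might collect the problems instead of printing them as it goes. This is a sketch only: `compare_overlaps` is a hypothetical helper, and it still assumes both lists are in the same order.

```python
def compare_overlaps(tracklist, trackfeat, keys):
    """Compare overlapping keys position by position.

    Returns (missing, mismatches): the positions whose features are None,
    and one (track_id, key, track_value, feature_value) tuple per difference.
    """
    missing, mismatches = [], []
    for pos, (track, feat) in enumerate(zip(tracklist, trackfeat)):
        if feat is None:
            missing.append(pos)
            continue
        for key in keys:
            if track[key] != feat[key]:
                mismatches.append((track['id'], key, track[key], feat[key]))
    return missing, mismatches
```

Calling `compare_overlaps(tracklist, trackfeat, overlapping)` would then report every missing entry and every mismatch in a single pass.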

The code below can be used to merge the `dict`s, but it assumes that both lists match exactly at every index. Because the iteration is *sequential*, a single missing or misplaced entry breaks the pairing, and I can't think of a low-effort fix right now.

So we'll skip the next code chunk. I'm not deleting it, though, since it might be useful some time.

In [20]:
# tracks = []
# for track, feat in zip(tracklist, trackfeat):
#     a_track = {}
#     if track['id'] == feat['id']:
#         for key, value in track.items():
#             a_track[key] = value
#         for key, value in feat.items():
#             a_track[key] = value
#         tracks.append(a_track)
        
# print(f'Tracklist has {len(tracklist[0].keys())} columns, featlist has {len(trackfeat[0].keys())} columns')
# print(f'Merged dict, tracks, has {len(tracks[0].keys())}')
# assert len(tracks[0].keys()) == (len(tracklist[0].keys()) + len(trackfeat[0].keys()) - len(tracklist[0].keys() & trackfeat[0].keys()))
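
An alternative to the positional merge above is to look features up by `id`, which tolerates `None` entries and differences in ordering. A minimal sketch, where `merge_by_id` is a hypothetical helper and not part of the original notebook:

```python
def merge_by_id(tracklist, trackfeat):
    """Merge track rows with their audio features, matching on 'id'."""
    # index the features by track id, dropping the None placeholders
    feat_by_id = {f['id']: f for f in trackfeat if f is not None}
    merged = []
    for track in tracklist:
        feat = feat_by_id.get(track['id'])
        if feat is None:
            continue  # no features for this track, so leave it out
        # on overlapping keys (e.g. duration_ms) the feature value wins
        merged.append({**track, **feat})
    return merged
```

Tracks without features are dropped here; keeping them with only their own keys would be an equally reasonable choice.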

It makes more sense to load them into `DataFrame`s and manipulate them with an appropriate tool, so that we can merge rows by their `id`:

In [50]:
df_track = pd.DataFrame(data=tracklist,
                       index=[t['id'] for t in tracklist])

# replace trackfeat's intrusive None entries with a placeholder row (id '0') that we can drop later
trackfeat_clean = [(f if f is not None else {'id': '0'}) for f in trackfeat]
df_feat = pd.DataFrame(trackfeat_clean,
                       index=[t['id'] for t in trackfeat_clean])

display(df_track.head(2))
display(df_feat.head(2))

| | added_at | artist_id | artist_name | duration_ms | explicit | id | name | popularity |
|---|---|---|---|---|---|---|---|---|
| 1wUi98zKIqQqeJaSsORKqm | 2019-01-15T16:38:49Z | 2XMpXAQ0B1J95en60YGE3V | Chaim | 433339 | False | 1wUi98zKIqQqeJaSsORKqm | Wrong Diagonal | 11 |
| 41Dn45cxNztFhMURz7L502 | 2019-01-15T10:32:29Z | 5kLzaeSHrmS7okc5XNE6lv | Butch | 503639 | False | 41Dn45cxNztFhMURz7L502 | Shahrzad - Matthias Meyer Remix | 40 |


| | acousticness | analysis_url | danceability | duration_ms | energy | id | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | track_href | type | uri | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1wUi98zKIqQqeJaSsORKqm | 1.2e-05 | https://api.spotify.com/v1/audio-analysis/1wUi... | 0.803 | 433339.0 | 0.685 | 1wUi98zKIqQqeJaSsORKqm | 0.89 | 7.0 | 0.216 | -13.089 | 1.0 | 0.0507 | 123.021 | 4.0 | https://api.spotify.com/v1/tracks/1wUi98zKIqQq... | audio_features | spotify:track:1wUi98zKIqQqeJaSsORKqm | 0.0648 |
| 41Dn45cxNztFhMURz7L502 | 0.000135 | https://api.spotify.com/v1/audio-analysis/41Dn... | 0.802 | 503639.0 | 0.658 | 41Dn45cxNztFhMURz7L502 | 0.858 | 11.0 | 0.0849 | -8.895 | 0.0 | 0.0442 | 123.178 | 4.0 | https://api.spotify.com/v1/tracks/41Dn45cxNztF... | audio_features | spotify:track:41Dn45cxNztFhMURz7L502 | 0.235 |


Removing unnecessary columns and bad-data rows:

In [49]:
df = pd.concat([df_track, df_feat], axis=1, sort=True).drop(index='0', columns=['analysis_url', 'track_href', 'uri', 'type'])
df.head(5)

| | added_at | artist_id | artist_name | duration_ms | explicit | id | name | popularity | acousticness | danceability | ... | id.1 | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 01Z4XXbApMlFBLLdgibLBp | 2018-09-27T19:22:51Z | 7cZsqiYF7SFQ0Ni24ly624 | Death On The Balcony | 495968.0 | False | 01Z4XXbApMlFBLLdgibLBp | Celestial Stranger | 11.0 | 0.017 | 0.65 | ... | 01Z4XXbApMlFBLLdgibLBp | 0.791 | 7.0 | 0.102 | -13.397 | 1.0 | 0.0407 | 121.98 | 4.0 | 0.575 |
| 01cbb52SrkVOB8VuFaPoGP | 2016-04-18T20:07:15Z | 3GBPw9NK25X1Wt2OUvOwY3 | Jack Johnson | 253933.0 | False | 01cbb52SrkVOB8VuFaPoGP | Breakdown | 41.0 | 0.661 | 0.584 | ... | 01cbb52SrkVOB8VuFaPoGP | 0.00024 | 0.0 | 0.838 | -11.161 | 1.0 | 0.0512 | 139.695 | 4.0 | 0.529 |
| 02Z6uQN4D8wADNH4Nvm6Rv | 2018-03-23T11:58:55Z | 3LRadUVuB9B2iHdArz1l1R | Anastacia Azevedo | 205492.0 | False | 02Z6uQN4D8wADNH4Nvm6Rv | Ideologia | 0.0 | 0.819 | 0.771 | ... | 02Z6uQN4D8wADNH4Nvm6Rv | 2e-06 | 9.0 | 0.0951 | -9.65 | 1.0 | 0.0419 | 112.522 | 3.0 | 0.63 |
| 03DRnHbnJdxyXjrIFYsCc3 | 2018-09-18T19:26:49Z | 6hjRjVNLWTCPYci9nxhI1G | Efdemin | 316615.0 | False | 03DRnHbnJdxyXjrIFYsCc3 | America | 14.0 | 0.24 | 0.656 | ... | 03DRnHbnJdxyXjrIFYsCc3 | 0.917 | 6.0 | 0.0878 | -16.676 | 0.0 | 0.0393 | 129.007 | 4.0 | 0.254 |
| 03WiEOE4jbqIoDGVDc8KBd | 2017-06-14T19:49:57Z | 50ZyjIaVHOy5Xt7FLJ7RZl | Sam Paganini | 354850.0 | False | 03WiEOE4jbqIoDGVDc8KBd | Daegon | 0.0 | 0.000826 | 0.803 | ... | 03WiEOE4jbqIoDGVDc8KBd | 0.915 | 0.0 | 0.109 | -8.625 | 1.0 | 0.0476 | 124.998 | 4.0 | 0.479 |
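
The `id.1` column in the output above appears because `pd.concat` keeps the `id` column of both frames (the same happens to `duration_ms`). A sketch of an alternative that joins on `id` with `DataFrame.merge` follows. The function name and the `_feat` suffix are hypothetical choices, and it assumes at least one track has features.

```python
import pandas as pd


def merge_tracks_and_features(tracklist, trackfeat):
    """Join track rows and audio features on 'id', one row per matched track."""
    df_track = pd.DataFrame(tracklist)
    # drop the None entries instead of substituting a placeholder row
    df_feat = pd.DataFrame([f for f in trackfeat if f is not None])
    df_feat = df_feat.drop(columns=['analysis_url', 'track_href', 'uri', 'type'],
                           errors='ignore')
    # overlapping columns other than 'id' get a '_feat' suffix on the feature side
    merged = df_track.merge(df_feat, on='id', how='inner', suffixes=('', '_feat'))
    return merged.set_index('id')
```

Keeping both `duration_ms` and `duration_ms_feat` makes the one-millisecond differences seen earlier easy to inspect before deciding which column to keep.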


Finally, let's export it :)

In [51]:
df.to_csv('saved_songs.csv', header=True)