To construct the dataset, I did the following:
1. Obtain the songs I had listened to<br>
    a. Get all songs I had saved (Spotify provides an endpoint to do so)<br>
    b. Aggregate these songs by album to compute the total songs saved for each album<br>
    c. Assume that albums from which I have saved 2 or more songs are albums that I had fully listened to<br>
    d. Get all the songs for those albums (Spotify provides an endpoint to do so)<br>
2. Find the label for each song<br>
    a. From the list in 1d, label the songs that also appear in the list from 1a as saved<br>
    b. Label the rest as unsaved<br>
3. Find the features for each song (Spotify provides an endpoint to do so)
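
The labeling logic in steps 1 and 2 can be sketched on toy data without calling Spotify at all. The function name `build_labels`, its parameters, and the toy ids below are hypothetical and only illustrate the procedure; the notebook itself works with the real API responses.

```python
import pandas as pd

def build_labels(saved, album_tracks, min_saved=2):
    """Label every track of each 'heard' album as saved or unsaved.

    saved: list of (song_id, album_id) pairs for the saved songs (step 1a).
    album_tracks: dict mapping album_id -> list of all song ids on that album.
    """
    df_saved = pd.DataFrame(saved, columns=["song_id", "album_id"])
    # Steps 1b/1c: count saved songs per album and keep albums at the threshold.
    counts = df_saved.groupby("album_id")["song_id"].count()
    heard = counts[counts >= min_saved].index
    # Step 1d: list every track on a heard album.
    rows = [(song, album) for album in heard for song in album_tracks[album]]
    df = pd.DataFrame(rows, columns=["song_id", "album_id"])
    # Step 2: a track is 'saved' if it is in the library, 'unsaved' otherwise.
    saved_ids = set(df_saved["song_id"])
    df["label"] = ["saved" if s in saved_ids else "unsaved" for s in df["song_id"]]
    return df

# Toy library: two songs saved from album A, one from album B.
saved = [("a1", "A"), ("a2", "A"), ("b1", "B")]
album_tracks = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"]}
labels = build_labels(saved, album_tracks)
print(labels)
```

Album B has only one saved song, so it is treated as not fully heard and none of its tracks enter the dataset.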

Get Authorization Token

In [1]:
import sys
import spotipy
import spotipy.util as util
import requests
import json
import pandas as pd
import numpy as np

scope = 'user-library-read'
username = 'srijanduggal17'
client_id = ''
client_secret = ''
redirect_uri = 'http://localhost/'

token = util.prompt_for_user_token(username,scope,client_id=client_id,client_secret=client_secret,redirect_uri=redirect_uri)

if token:
    sp = spotipy.Spotify(auth=token)
else:
    print("Can't get token for", username)

Get song and album uris for all saved tracks.

Data Format:
For every album that contains a song in my library, get every song on that album.
Label: whether I saved that song or not

List of my songs -> go through all of my Spotify tracks and get the song uris
<br>List of albums to look through -> go through all of my Spotify tracks and get the album uris
<br>List of all song uris for dataset -> go through list of albums and get all track uris
<br>Label song uris -> go through list of all song uris. Find the ones that are in list of my songs and label them with 1. Label the others as 0.
<br>Replace uris with audio features.

# Get all of my Song and Album uris

Get total number of songs in my library

In [2]:
nextOffset = 0
nextLimit = 50

results = sp.current_user_saved_tracks(limit=nextLimit, offset=nextOffset)
print('Total Tracks {}'.format(results['total']))

Total Tracks 2157


Get song and album ids for all songs in my library
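
The saved-tracks endpoint returns one page at a time, so the whole library has to be collected by advancing an offset until the response reports no next page. A minimal sketch of that loop follows, tested against a fake page function; `fetch_all`, `fake_page`, and `FAKE` are hypothetical names introduced only for this illustration.

```python
def fetch_all(get_page, limit=50):
    """Collect items from an offset-paginated endpoint.

    get_page(limit=..., offset=...) must return a dict with 'items' and
    'next', where 'next' is None on the last page.
    """
    items, offset = [], 0
    while True:
        page = get_page(limit=limit, offset=offset)
        items.extend(page["items"])
        if page["next"] is None:
            return items
        offset += limit

# Hypothetical stand-in for the API: 120 fake items served in pages.
FAKE = list(range(120))

def fake_page(limit, offset):
    chunk = FAKE[offset:offset + limit]
    has_more = offset + limit < len(FAKE)
    return {"items": chunk, "next": "more" if has_more else None}

all_items = fetch_all(fake_page, limit=50)
print(len(all_items))
```

Assuming the client method accepts `limit` and `offset` keywords, as `sp.current_user_saved_tracks` is called in this notebook, it could be passed in place of `fake_page`.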

In [3]:
# Collect one row per saved track, starting with the page fetched above
rows = []

while True:
    for item in results['items']:
        rows.append({
            'song_uri': item['track']['id'],
            'album_uri': item['track']['album']['id']
        })
    # Stop once Spotify reports that there is no further page
    if results['next'] is None:
        break
    nextOffset += nextLimit
    results = sp.current_user_saved_tracks(limit=nextLimit, offset=nextOffset)

# Build the DataFrame once instead of appending row by row
df_my_songs = pd.DataFrame(rows, columns=['song_uri', 'album_uri'])

In [4]:
print('Total Song Ids {}'.format(df_my_songs.shape[0]))
print('Total Album Ids {}'.format(df_my_songs.album_uri.nunique()))

Total Song Ids 2157
Total Album Ids 875


# Get Albums I listen to

Inspect my songs.
<br>An album counts as "heard" if I have saved more than 1 song from it.

In [12]:
print(results['items'][0]['track']['name'])

All We Need (feat. Shy Girls)


In [13]:
print('My total songs',df_my_songs.shape[0])
df_songs_per_album = df_my_songs.groupby('album_uri').count()

print('My total albums',df_songs_per_album.shape[0])

df_albums_heard = df_songs_per_album[df_songs_per_album.song_uri > 1]
albums_heard = df_albums_heard.index.values
print('My heard albums', df_albums_heard.shape[0])

df_albums_unheard = df_songs_per_album[df_songs_per_album.song_uri == 1]
print('My unheard albums / songs from unheard albums', df_albums_unheard.shape[0])

df_saved_ids = df_my_songs[df_my_songs.album_uri.isin(albums_heard)]
print('My songs from heard albums', df_saved_ids.shape[0])
print('\n', df_saved_ids.head())

savedSongIds = set(df_saved_ids.song_uri)

My total songs 2157
My total albums 875
My heard albums 259
My unheard albums / songs from unheard albums 616
My songs from heard albums 1541

                   song_uri               album_uri
5   649o53ULWYN1y7V2OI5kgo  6blMxezujKgPe8HjHNveuG
10  65ds47DOh963oroiiBChZ9  4zn2Kj85Hew0USyxc4TJEX
12  3D8dwH690MXQRhtIZTSS9c  4zn2Kj85Hew0USyxc4TJEX
13  2r8MLH3Zwro67ElDDqth1r  4zn2Kj85Hew0USyxc4TJEX
32  4CxFN5zON70B3VOPBYbd6P  16mjtcKPxpQ4ajFHmJ0hJC


# Get uris of all Songs from Albums I listen to

In [14]:
from progressbar import ProgressBar

albumSongIds = set()
pbar = ProgressBar()

for albumId in pbar(albums_heard):
    nextLimit = 50
    nextOffset = 0
    # First page of tracks for this album
    albumInfo = sp.album_tracks(albumId, limit=nextLimit)
    for item in albumInfo['items']:
        albumSongIds.add(item['id'])
    nextOffset += nextLimit
    # Keep requesting pages until there is no next page
    while albumInfo['next'] is not None:
        albumInfo = sp.album_tracks(albumId, limit=nextLimit, offset=nextOffset)
        for item in albumInfo['items']:
            albumSongIds.add(item['id'])
        nextOffset += nextLimit

100% |########################################################################|


In [15]:
print(len(albumSongIds))
print(len(savedSongIds))
unsavedSongIds = albumSongIds.difference(savedSongIds)
print(len(unsavedSongIds))

3797
1541
2256


# Get Audio Features for Saved Songs
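
The audio features are requested in batches of 50 ids rather than one track at a time. A minimal sketch of that batching, checked against a fake fetch function, is shown here; `chunked`, `fetch_in_batches`, and `fake_features` are hypothetical helpers and not part of spotipy.

```python
def chunked(ids, size=50):
    """Yield consecutive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def fetch_in_batches(ids, fetch, size=50):
    """Call fetch(batch) for each batch and concatenate the results."""
    results = []
    for batch in chunked(ids, size):
        results.extend(fetch(batch))
    return results

# Hypothetical stand-in for the features call: one dict per requested id.
def fake_features(batch):
    return [{"id": track_id, "tempo": 120.0} for track_id in batch]

ids = ["t{}".format(i) for i in range(123)]
features = fetch_in_batches(ids, fake_features)
print(len(features), [len(b) for b in chunked(ids)])
```

Iterating with `range(0, len(ids), size)` covers the final partial batch without a separate trailing request, and it produces no request at all when the list is empty.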

In [16]:
savedSongIds = list(savedSongIds)

# Request audio features in batches of 50 ids and collect one dict per track
batchSize = 50
saved_features = []
for startNdx in range(0, len(savedSongIds), batchSize):
    nextList = savedSongIds[startNdx:startNdx + batchSize]
    audio_features = sp.audio_features(nextList)
    # Defensive check: skip any entry for which no features came back
    saved_features.extend(f for f in audio_features if f is not None)

# Build the DataFrame once from the collected feature dicts
df_saved_songs = pd.DataFrame(saved_features)

In [17]:
print(df_saved_songs.shape)
df_saved_songs = df_saved_songs.drop(columns=['type', 'uri', 'track_href', 'analysis_url', 'duration_ms'])
df_saved_songs['label'] = 'saved'
print(df_saved_songs.head())

(1541, 18)
   danceability  energy  key  loudness  mode  speechiness  acousticness  \
0         0.695   0.807    9    -5.123     1       0.0346        0.0121   
1         0.510   0.533    7    -6.194     1       0.3590        0.1940   
2         0.764   0.475    0   -12.618     1       0.1140        0.3450   
3         0.615   0.800    7    -7.423     1       0.0571        0.0684   
4         0.656   0.804    8    -5.191     0       0.3630        0.1730   

   instrumentalness  liveness  valence    tempo                      id  \
0               0.0     0.266    0.400   91.947  4V8uu21mnpyg7BElNNJdPs   
1               0.0     0.117    0.262   88.154  7fheaybAcABeiuYU6VgDrQ   
2               0.0     0.140    0.340  129.974  0l4EfZzVD0LyJhibHIIVxo   
3               0.0     0.106    0.309  134.052  6EAE96fgYkea6qZ3pXigNG   
4               0.0     0.837    0.314  125.882  2WWruw7ul9N7eqoHELyMc2   

   time_signature  label  
0               4  saved  
1               4  saved  
2     

# Get Audio Features for Unsaved Songs

In [18]:
unsavedSongIds = list(unsavedSongIds)

# Request audio features in batches of 50 ids and collect one dict per track
batchSize = 50
unsaved_features = []
for startNdx in range(0, len(unsavedSongIds), batchSize):
    nextList = unsavedSongIds[startNdx:startNdx + batchSize]
    audio_features = sp.audio_features(nextList)
    # Defensive check: skip any entry for which no features came back
    unsaved_features.extend(f for f in audio_features if f is not None)

# Build the DataFrame once from the collected feature dicts
df_unsaved_songs = pd.DataFrame(unsaved_features)

In [19]:
print(df_unsaved_songs.shape)
df_unsaved_songs = df_unsaved_songs.drop(columns=['type', 'uri', 'track_href', 'analysis_url', 'duration_ms'])
df_unsaved_songs['label'] = 'unsaved'
print(df_unsaved_songs.head())

(2256, 18)
   danceability  energy  key  loudness  mode  speechiness  acousticness  \
0         0.332   0.216    2    -9.683     0       0.0284         0.984   
1         0.306   0.635    0    -3.860     1       0.0313         0.449   
2         0.388   0.889    3    -5.281     1       0.1160         0.307   
3         0.676   0.618    2    -7.495     0       0.1960         0.689   
4         0.573   0.344    9   -10.910     1       0.0361         0.811   

   instrumentalness  liveness  valence    tempo                      id  \
0          0.005610     0.170   0.2690   90.960  7uWHlTIGoJnSSlAvAcr9iW   
1          0.000004     0.287   0.0847  139.969  5OsKqfRR6OuGGaMcKPG1ti   
2          0.004860     0.142   0.3270  159.960  6JsePoT1VWserj2YIUu0hE   
3          0.008210     0.127   0.4660   79.799  0h6sfKXFb641F2E13rY4f2   
4          0.000000     0.130   0.3770  157.783  3ICdPHubhqTJ4Lm9NEb2W3   

   time_signature    label  
0               4  unsaved  
1               4  unsaved  


# Combine Into Dataset
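
When two labeled frames are stacked, each keeps its own row index unless told otherwise, so row numbers can repeat in the combined frame. The sketch below, on toy frames with made-up values, shows one way to renumber the rows and run two quick sanity checks before writing the CSV; the names `df_a`, `df_b`, and `combined` are illustrative only.

```python
import pandas as pd

df_a = pd.DataFrame({"id": ["x", "y"], "tempo": [90.0, 120.0], "label": "saved"})
df_b = pd.DataFrame({"id": ["z"], "tempo": [100.0], "label": "unsaved"})

# ignore_index=True renumbers rows 0..n-1 instead of repeating each frame's index.
combined = pd.concat([df_a, df_b], ignore_index=True)

# Sanity checks: no track id appears twice, and class counts match the inputs.
assert combined["id"].is_unique
label_counts = combined["label"].value_counts().to_dict()
print(label_counts)
```

Since the CSV is written with `index=False`, the repeated index does not reach the file; renumbering mainly matters if the combined frame is used further in memory.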

In [20]:
df = pd.concat([df_saved_songs, df_unsaved_songs])
print(df)
df.to_csv('new_dataset.csv', index=False)

      danceability  energy  key  loudness  mode  speechiness  acousticness  \
0            0.695   0.807    9    -5.123     1       0.0346        0.0121   
1            0.510   0.533    7    -6.194     1       0.3590        0.1940   
2            0.764   0.475    0   -12.618     1       0.1140        0.3450   
3            0.615   0.800    7    -7.423     1       0.0571        0.0684   
4            0.656   0.804    8    -5.191     0       0.3630        0.1730   
...            ...     ...  ...       ...   ...          ...           ...   
2251         0.444   0.470    0    -6.897     0       0.0611        0.6170   
2252         0.569   0.857    7    -5.571     1       0.0337        0.0101   
2253         0.483   0.404    2    -8.498     1       0.0319        0.7630   
2254         0.626   0.827    2    -4.277     1       0.0447        0.4230   
2255         0.465   0.269    0   -12.140     1       0.0311        0.7910   

      instrumentalness  liveness  valence    tempo             