# Dataset Construction

Note: running the code in this notebook requires a Spotify account and Spotify API credentials (a client ID and client secret).

## Process Overview

To construct the dataset, I did the following:
1. Obtained the songs I had listened to<br>
    a. Got all the songs I had saved (Spotify provides an endpoint for this)<br>
    b. Aggregated these songs by album to compute the number of saved songs per album<br>
    c. Assumed that any album from which I had saved 2 or more songs was an album I had listened to in full<br>
    d. Got all the songs on those albums (Spotify provides an endpoint for this)<br>
2. Found the label for each song<br>
    a. From the list of songs in 1d, labeled those that also appear in the list from 1a as saved<br>
    b. Labeled the rest as unsaved<br>
3. Found the features for each song (Spotify provides an endpoint for this)

### Imports and API Preparation

In [1]:
import sys
import spotipy
import spotipy.util as util
import requests
import json
import pandas as pd
import numpy as np

# Declare API access keys and request information
# Note: to run the code in this notebook, a Spotify API key and account are needed
scope = 'user-library-read'
username = ''
client_id = ''
client_secret = ''
redirect_uri = 'http://localhost/'

token = util.prompt_for_user_token(
    username,
    scope,
    client_id=client_id,
    client_secret=client_secret,
    redirect_uri=redirect_uri
)

if token:
    sp = spotipy.Spotify(auth=token)
else:
    print("Can't get token for", username)
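
Leaving the credentials as string literals makes it easy to commit them by accident. One alternative, sketched below, is to read them from environment variables. The helper `load_spotify_credentials` is hypothetical and not part of spotipy; the variable names `SPOTIPY_CLIENT_ID`, `SPOTIPY_CLIENT_SECRET`, and `SPOTIPY_REDIRECT_URI` are the ones spotipy's documentation uses.

```python
import os

REQUIRED_VARS = (
    "SPOTIPY_CLIENT_ID",
    "SPOTIPY_CLIENT_SECRET",
    "SPOTIPY_REDIRECT_URI",
)


def load_spotify_credentials(env=None):
    """Hypothetical helper: read API credentials from environment variables."""
    env = os.environ if env is None else env
    # Fail early, naming every variable that is unset or empty
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "client_id": env["SPOTIPY_CLIENT_ID"],
        "client_secret": env["SPOTIPY_CLIENT_SECRET"],
        "redirect_uri": env["SPOTIPY_REDIRECT_URI"],
    }
```

The returned dictionary could then be unpacked into the call above, for example `util.prompt_for_user_token(username, scope, **load_spotify_credentials())`.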

### Obtain Songs I have Listened To

Get total number of songs in my library

In [2]:
nextOffset = 0
nextLimit = 50

results = sp.current_user_saved_tracks(limit=nextLimit, offset=nextOffset)
print('Total Tracks: {}'.format(results['total']))

Total Tracks: 2157


Get song and album ids for all songs in my library

In [3]:
# Collect one row per saved song, then build the DataFrame once at the end
# (DataFrame.append was removed in pandas 2.0)
# Note: despite the column names, these values are Spotify ids, not URIs
rows = []

# Add results of initial request
for item in results['items']:
    rows.append({
        'song_uri': item['track']['id'],
        'album_uri': item['track']['album']['id']
    })
nextOffset += nextLimit

# Continue requesting song and album ids until there are no more pages
while results['next'] is not None:
    results = sp.current_user_saved_tracks(limit=nextLimit, offset=nextOffset)
    for item in results['items']:
        rows.append({
            'song_uri': item['track']['id'],
            'album_uri': item['track']['album']['id']
        })
    nextOffset += nextLimit

df_my_songs = pd.DataFrame(rows, columns=['song_uri', 'album_uri'])
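
The request-then-loop pattern above recurs for album tracks later on, so it may be worth factoring out. The sketch below is one way to do that; `iter_items` and `fake_fetch` are hypothetical names, and the fake endpoint only mimics the `items` and `next` keys used above so that the example runs without network access.

```python
def iter_items(fetch_page, limit=50):
    """Yield every item from an offset-paginated endpoint.

    fetch_page(limit=..., offset=...) must return a dict with 'items' and
    'next' keys, like the responses used in this notebook.
    """
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        yield from page['items']
        if page['next'] is None:
            break
        offset += limit


# Stand-in for the API so the sketch runs offline (made-up data)
FAKE_LIBRARY = ['track{}'.format(i) for i in range(7)]


def fake_fetch(limit, offset):
    items = FAKE_LIBRARY[offset:offset + limit]
    has_more = offset + limit < len(FAKE_LIBRARY)
    return {'items': items, 'next': 'more' if has_more else None}


all_items = list(iter_items(fake_fetch, limit=3))
print(len(all_items))  # 7
```

With a real client, `iter_items(sp.current_user_saved_tracks)` should walk the whole library, since that method accepts `limit` and `offset` keyword arguments.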

Aggregate songs by albums

In [4]:
df_songs_per_album = df_my_songs.groupby('album_uri').count()

Find albums I have listened to

In [5]:
df_albums_heard = df_songs_per_album[df_songs_per_album.song_uri > 1]
albums_heard = df_albums_heard.index.values
print('Albums I have listened to:', df_albums_heard.shape[0])

Albums I have listened to: 259
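
To make the aggregation and threshold concrete, here is a small sketch of the same two steps on made-up data, using only the standard library. The ids and the names `saved_tracks`, `MIN_SAVED`, and `toy_albums_heard` are illustrative assumptions, not values from my library.

```python
from collections import Counter

# Toy (song id, album id) pairs standing in for df_my_songs
saved_tracks = [
    ('s1', 'albumA'), ('s2', 'albumA'), ('s3', 'albumA'),
    ('s4', 'albumB'),
    ('s5', 'albumC'), ('s6', 'albumC'),
]

MIN_SAVED = 2  # same rule as above: 2 or more saved songs

# Count saved songs per album, then keep albums that meet the threshold
songs_per_album = Counter(album for _, album in saved_tracks)
toy_albums_heard = sorted(
    album for album, n in songs_per_album.items() if n >= MIN_SAVED
)
print(toy_albums_heard)  # ['albumA', 'albumC']
```

`albumB` is dropped because only one of its songs was saved, which under the assumption in step 1c is not enough evidence that I heard the whole album.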


Get all songs from albums I have listened to

In [6]:
from progressbar import ProgressBar

albumSongIds = set()

pbar = ProgressBar()
for albumId in pbar(albums_heard):
    nextLimit = 50
    nextOffset = 0
    albumInfo = sp.album_tracks(albumId, limit=nextLimit)

    # Add first set of tracks of album
    for item in albumInfo['items']:
        albumSongIds.add(item['id'])
    nextOffset += nextLimit

    # Add the rest of the album's tracks
    while albumInfo['next'] is not None:
        albumInfo = sp.album_tracks(albumId, limit=nextLimit, offset=nextOffset)
        for item in albumInfo['items']:
            albumSongIds.add(item['id'])
        nextOffset += nextLimit

100% |########################################################################|


### Find the label for each song

In [7]:
df_saved_ids = df_my_songs[df_my_songs.album_uri.isin(albums_heard)]
savedSongIds = set(df_saved_ids.song_uri)
unsavedSongIds = albumSongIds.difference(savedSongIds)

print('Songs I have listened to: {}'.format(len(albumSongIds)))
print('Songs I have listened to and saved: {}'.format(len(savedSongIds)))
print('Songs I have listened to and not saved: {}'.format(len(unsavedSongIds)))

Songs I have listened to: 3797
Songs I have listened to and saved: 1541
Songs I have listened to and not saved: 2256
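
The counts above are consistent: 1541 saved plus 2256 unsaved gives the 3797 songs listened to. The sketch below shows the same set logic on made-up ids, with the checks written out explicitly; the names `album_song_ids`, `saved_song_ids`, and `labels` are illustrative assumptions.

```python
# Toy ids: every song on the "heard" albums, and the subset that was saved
album_song_ids = {'s1', 's2', 's3', 's5', 's6', 's7'}
saved_song_ids = {'s1', 's2', 's3', 's5', 's6'}

# Everything on a heard album that was not saved counts as unsaved
unsaved_song_ids = album_song_ids - saved_song_ids

labels = {song: 'saved' for song in saved_song_ids}
labels.update({song: 'unsaved' for song in unsaved_song_ids})

# The two groups should be disjoint and together cover every album song
assert saved_song_ids.isdisjoint(unsaved_song_ids)
assert saved_song_ids <= album_song_ids
assert len(saved_song_ids) + len(unsaved_song_ids) == len(album_song_ids)
```

If a saved id were ever missing from the album listing, the subset check would fail and the saved and unsaved counts would no longer add up to the total.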


### Get Audio Features for All Songs

For Saved songs:

In [8]:
savedSongIds = list(savedSongIds)

# Request audio features in batches of 50 ids and collect the results in a
# list (DataFrame.append was removed in pandas 2.0)
batchSize = 50
saved_features = []
for startNdx in range(0, len(savedSongIds), batchSize):
    nextList = savedSongIds[startNdx:startNdx + batchSize]
    audio_features = sp.audio_features(nextList)
    # The endpoint may return None for a track with no features; skip those
    saved_features.extend(f for f in audio_features if f is not None)

# Create DataFrame to store features for saved songs
df_saved_songs = pd.DataFrame(saved_features)
df_saved_songs['label'] = 'saved'

For Unsaved songs:

In [9]:
unsavedSongIds = list(unsavedSongIds)

# Request audio features in batches of 50 ids and collect the results in a
# list (DataFrame.append was removed in pandas 2.0)
batchSize = 50
unsaved_features = []
for startNdx in range(0, len(unsavedSongIds), batchSize):
    nextList = unsavedSongIds[startNdx:startNdx + batchSize]
    audio_features = sp.audio_features(nextList)
    # The endpoint may return None for a track with no features; skip those
    unsaved_features.extend(f for f in audio_features if f is not None)

# Create DataFrame to store features for unsaved songs
df_unsaved_songs = pd.DataFrame(unsaved_features)
df_unsaved_songs['label'] = 'unsaved'
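
The saved and unsaved cells both split a list of ids into batches of 50 before calling the audio features endpoint. A minimal sketch of that batching as a reusable generator follows; `chunked` is a hypothetical helper name and the ids are made up.

```python
def chunked(ids, size=50):
    """Yield consecutive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


toy_ids = ['id{}'.format(i) for i in range(120)]
batch_sizes = [len(batch) for batch in chunked(toy_ids)]
print(batch_sizes)  # [50, 50, 20]
```

Each batch could then be passed to `sp.audio_features(batch)`. An empty id list yields no batches, so no request is made in that case.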

### Combine into Dataset

In [10]:
df = pd.concat([df_saved_songs, df_unsaved_songs])

# One-hot encode the mode variable (Spotify uses 1 = major, 0 = minor)
df['mode_major'] = df['mode']
df['mode_minor'] = 1 - df['mode']

# Drop unnecessary columns
df = df.drop(columns=['type', 'uri', 'track_href', 'analysis_url', 'duration_ms', 'mode', 'id'])

print('Shape of Dataset {}'.format(df.shape))
print(df.head())

Shape of Dataset (3797, 14)
   danceability  energy  key  loudness  speechiness  acousticness  \
0         0.823   0.467    0   -10.394        0.301         0.543   
1         0.445   0.378    0    -8.043        0.031         0.318   
2         0.810   0.451   10    -6.348        0.249         0.152   
3         0.637   0.569    6    -5.858        0.550         0.173   
4         0.747   0.492   11    -8.399        0.110         0.271   

   instrumentalness  liveness  valence    tempo  time_signature  label  \
0          0.000000     0.135   0.6180   95.024               4  saved   
1          0.041600     0.142   0.0729   71.835               4  saved   
2          0.053700     0.108   0.3590   85.417               4  saved   
3          0.000000     0.180   0.1480  140.269               4  saved   
4          0.000011     0.263   0.1890  124.870               4  saved   

   mode_major  mode_minor  
0           0           1  
1           1           0  
2           0           1  


Export Data

In [11]:
df.to_csv('dataset.csv', index=False)