# Dataset Construction

Note: Running the code in this notebook requires a Spotify account and Spotify API credentials (a client ID and client secret).

## Process Overview

To construct the dataset, I did the following:
1. Retrieved all the songs saved in my library (Spotify provides an endpoint for this)
2. Retrieved the audio features for each song (Spotify provides an endpoint for this)

### Imports and API Preparation

In [2]:
import sys
import spotipy
import spotipy.util as util
import requests
import json
import pandas as pd
import numpy as np

# Declare API access keys and request information
scope = 'user-library-read'
username = 'srijanduggal17'
client_id = ''
client_secret = ''
redirect_uri = 'http://localhost/'

token = util.prompt_for_user_token(username,scope,client_id=client_id,client_secret=client_secret,redirect_uri=redirect_uri)

if token:
    sp = spotipy.Spotify(auth=token)
else:
    print("Can't get token for", username)
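
Hard-coding the client ID and secret in a notebook makes them easy to leak when the notebook is shared. One option, sketched here as an assumption rather than what this notebook did, is to read them from environment variables. The variable names are the ones spotipy uses for its own credential lookup; `load_spotify_credentials` is a hypothetical helper, and the values passed in the example are placeholders.

```python
import os


def load_spotify_credentials(env=None):
    """Return (client_id, client_secret, redirect_uri) read from environment variables."""
    env = os.environ if env is None else env
    client_id = env.get('SPOTIPY_CLIENT_ID', '')
    client_secret = env.get('SPOTIPY_CLIENT_SECRET', '')
    # Fall back to the redirect URI used in this notebook
    redirect_uri = env.get('SPOTIPY_REDIRECT_URI', 'http://localhost/')
    if not client_id or not client_secret:
        raise RuntimeError('Set SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET first')
    return client_id, client_secret, redirect_uri


# Placeholder values, passed explicitly so the example runs without real credentials
creds = load_spotify_credentials({
    'SPOTIPY_CLIENT_ID': 'placeholder-id',
    'SPOTIPY_CLIENT_SECRET': 'placeholder-secret',
})
print(creds[2])
```

The returned values could then be passed to `util.prompt_for_user_token` in place of the literals above.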

### Obtain Songs I Have Saved

Get the total number of songs in my library

In [14]:
nextOffset = 0
nextLimit = 50

results = sp.current_user_saved_tracks(limit=nextLimit, offset=nextOffset)
print('Total Tracks: {}'.format(results['total']))

Total Tracks: 2156


Get the song IDs and names for all songs in my library

In [15]:
# Collect one row per saved track (the 'song_uri' column holds the Spotify track ID).
# Rows are gathered in a list because DataFrame.append was removed in pandas 2.0.
rows = []

# Page through the library, starting with the results of the initial request
while True:
    for item in results['items']:
        rows.append({
            'song_uri': item['track']['id'],
            'name': item['track']['name']
        })
    # Stop once Spotify reports that there is no further page
    if results['next'] is None:
        break
    nextOffset += nextLimit
    results = sp.current_user_saved_tracks(limit=nextLimit, offset=nextOffset)

# Create DataFrame to store results
df_my_songs = pd.DataFrame(rows, columns=['song_uri', 'name'])
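
The offset-based pagination used here can be tried without a Spotify account. The sketch below is a minimal illustration, not the notebook's code: `fake_saved_tracks` is a hypothetical stand-in that mimics only the `items`, `next`, and `total` fields read above, and `fetch_all` is an illustrative helper name.

```python
def fake_saved_tracks(limit, offset, total=123):
    """Hypothetical stand-in for a paged endpoint; mimics only the fields used above."""
    ids = range(offset, min(offset + limit, total))
    items = [{'track': {'id': 'id{}'.format(i), 'name': 'Song {}'.format(i)}} for i in ids]
    has_more = offset + limit < total
    return {'items': items, 'next': 'more' if has_more else None, 'total': total}


def fetch_all(get_page, limit=50):
    """Request pages at increasing offsets until the response has no 'next' page."""
    rows, offset = [], 0
    while True:
        page = get_page(limit=limit, offset=offset)
        for item in page['items']:
            rows.append({'song_uri': item['track']['id'], 'name': item['track']['name']})
        if page['next'] is None:
            return rows
        offset += limit


rows = fetch_all(fake_saved_tracks)
print(len(rows))  # 123, matching the stub's 'total'
```

Comparing the number of collected rows with the reported `total` is a cheap check that no page was skipped.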

Inspect results

In [16]:
print(df_my_songs.head())

                 song_uri                     name
0  3TMUdD9vE4DoqDYi7VXStt              Fool's Gold
1  3TKpJrY9q49Mj1JOsM9zGL                   Family
2  0CJvDUBeEL1Rmpx7MH28CQ                  For You
3  3GRSqlALWISqLeNncZMbpX                  Mean It
4  1yTTMcUhL7rtz08Dsgb7Qb  The Bones - with Hozier


### Get Audio Features

In [20]:
savedSongIds = list(df_my_songs.song_uri)

# Get features for each song, requesting 50 IDs at a time
batchSize = 50
features = []
for startNdx in range(0, len(savedSongIds), batchSize):
    nextList = savedSongIds[startNdx:startNdx + batchSize]
    audio_features = sp.audio_features(nextList)
    # Skip tracks for which Spotify returns no features (None entries)
    features.extend(f for f in audio_features if f is not None)

# Create DataFrame to store features for saved songs
# (built in one step because DataFrame.append was removed in pandas 2.0)
df_saved_songs = pd.DataFrame(features)
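
The batching step can be isolated into a small helper. This is a sketch under the assumption that each request should carry at most `size` IDs; `chunked` is an illustrative name, and the IDs are made up to match the library size reported earlier.

```python
def chunked(ids, size=50):
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Made-up IDs, same count as the library above
example_ids = ['id{}'.format(i) for i in range(2156)]
batches = list(chunked(example_ids))
print(len(batches), len(batches[-1]))  # 44 6
```

Each batch could then be passed to `sp.audio_features(batch)`, which avoids tracking start and end indices by hand.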

### Combine into Dataset

In [32]:
df = df_saved_songs.copy()

# Add names
df = pd.merge(df, df_my_songs, left_on='id', right_on='song_uri', how='left')

# Encode Mode variable as two indicator columns (Spotify reports 1 for major, 0 for minor)
df['mode_major'] = df['mode']
df['mode_minor'] = 1 - df['mode']

# Drop unnecessary columns
df = df.drop(columns=['type', 'uri', 'track_href', 'analysis_url', 'duration_ms', 'mode', 'id'])

print('Shape of Dataset {}'.format(df.shape))
print(df.head())

Shape of Dataset (2156, 15)
   danceability  energy  key  loudness  speechiness  acousticness  \
0         0.680   0.729    7    -5.097       0.0418        0.3670   
1         0.584   0.607   11    -6.605       0.0356        0.4260   
2         0.703   0.643    7    -5.544       0.0706        0.1920   
3         0.746   0.450    7    -8.543       0.0872        0.0407   
4         0.561   0.597   11    -6.000       0.0405        0.2860   

   instrumentalness  liveness  valence    tempo  time_signature  \
0               0.0    0.1590    0.830  120.029               4   
1               0.0    0.1010    0.374  117.817               4   
2               0.0    0.1430    0.528  102.059               4   
3               0.0    0.1720    0.336   95.998               4   
4               0.0    0.0979    0.355   76.826               4   

                 song_uri                     name  mode_major  mode_minor  
0  3TMUdD9vE4DoqDYi7VXStt              Fool's Gold           1           0  
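
The merge and mode encoding can be checked on a toy example. The sketch below uses made-up values in place of `df_saved_songs` and `df_my_songs`. The `validate='one_to_one'` argument is an addition not used in the notebook: it makes pandas raise an error if an ID repeats on either side, which would otherwise silently add rows.

```python
import pandas as pd

# Toy stand-ins for df_saved_songs and df_my_songs (values are made up)
features = pd.DataFrame({'id': ['a', 'b', 'c'], 'mode': [1, 0, 1], 'tempo': [120.0, 95.5, 76.8]})
names = pd.DataFrame({'song_uri': ['a', 'b', 'c'], 'name': ['One', 'Two', 'Three']})

# Left merge keeps every feature row; validate raises MergeError on duplicate keys
toy = pd.merge(features, names, left_on='id', right_on='song_uri',
               how='left', validate='one_to_one')

# Two complementary indicator columns for the binary mode variable
toy['mode_major'] = toy['mode']
toy['mode_minor'] = 1 - toy['mode']
toy = toy.drop(columns=['mode', 'id'])

print(toy.shape)  # (3, 5)
```

Because the two indicator columns always sum to 1, a model that includes an intercept would typically need only one of them.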


Export Data

In [33]:
df.to_csv('dataset.csv', index=False)