# Creating the "Advanced" Dataset
#### 11/02/2023
In this notebook, I use the Spotify API again, this time to get the audio features for each song. I then combine these features with the basic dataset to produce the advanced dataset, which is my final task for this project.

## Reading in the track IDs
In order to collect the audio features for each song, we'll need to get the track IDs. Let's pull them from the basic dataset we made last time:

In [1]:
import pandas as pd
import requests
import json
from tqdm import tqdm
import time

In [2]:
df = pd.read_csv('billboard_basic.csv')
track_ids = df['track_id']

## Gathering all the features
Using these IDs, we can ask the Spotify API for each track's audio features. First, let's authenticate:

### Authentication

In [3]:
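import base64
import os

# Request setup
# Read the credentials from environment variables rather than hardcoding them
# in the notebook; the variable names below are placeholders
CLIENT_ID = os.environ['SPOTIFY_CLIENT_ID']
CLIENT_SECRET = os.environ['SPOTIFY_CLIENT_SECRET']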
AUTH_URL = 'https://accounts.spotify.com/api/token'
headers = {
    'Authorization': 'Basic ' + base64.b64encode(f'{CLIENT_ID}:{CLIENT_SECRET}'.encode()).decode()
}
data = {
    'grant_type': 'client_credentials'
}
auth_response = requests.post(AUTH_URL, headers=headers, data=data)

# Get the access token
access_token = auth_response.json().get('access_token')

# Check that we fetched a token before using it
if access_token:
    print("Successfully authenticated!")
else:
    raise RuntimeError(f"Failed to authenticate (HTTP {auth_response.status_code})")

# Set up headers for future API calls
headers = {
    'Authorization': 'Bearer ' + access_token
}


Successfully authenticated!


Now let's define a function that fetches the audio features for the track at a given index and saves them to a JSON cache file:

In [4]:
BASE_URL = "https://api.spotify.com/v1/"

def save_audio_features(index):
    try:
        with open("cache.json", "r") as cache_file:
            cache = json.load(cache_file)
    except (FileNotFoundError, json.JSONDecodeError):
        # If the cache file doesn't exist or is empty, initialize an empty cache
        cache = {}
    
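    # If this index is already in the cache, skip it
    if str(index) in cache:
        return
    
    URL = BASE_URL + f"audio-features/{track_ids[index]}"
    response = requests.get(URL, headers=headers)
    features = response.json()
    
    # Any error payload (429 rate limit, 401 expired token, ...) stops the run
    if 'error' in features:
        raise Exception(f"Spotify API error {response.status_code}: {features['error']}")
    
    # JSON object keys are strings, so store the index as a string too
    cache[str(index)] = features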
    with open("cache.json", "w") as cache_file:
        json.dump(cache, cache_file)
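
The function above sends one request per track. Spotify's Web API also documents a batch form of the same endpoint (`audio-features?ids=...`) that accepts up to 100 comma-separated IDs per call. The sketch below only shows how the IDs could be grouped for it; `chunked` and `batch_params` are hypothetical helper names, not part of this notebook, and the batch limit should be checked against the current API reference.

```python
BATCH_URL = "https://api.spotify.com/v1/audio-features"
MAX_IDS_PER_REQUEST = 100  # assumed limit, taken from Spotify's endpoint documentation


def chunked(items, size=MAX_IDS_PER_REQUEST):
    """Split a sequence into consecutive lists of at most `size` items."""
    items = list(items)
    return [items[i:i + size] for i in range(0, len(items), size)]


def batch_params(ids):
    """Build the query parameters for one batch request (hypothetical helper)."""
    return {"ids": ",".join(ids)}
```

Each batch could then be fetched with `requests.get(BATCH_URL, headers=headers, params=batch_params(batch))`, with the per-track results expected under the `audio_features` key of the response. For roughly 2,300 tracks that would mean about 23 requests instead of 2,300.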

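Now here's the hard part. Sending too many requests in a short time risks a 429 (Too Many Requests) error that kills the run, so I'll break the job into ranges of indices and work through them over the course of a few hours. Indices that are already in the cache are skipped, so a range can be rerun safely.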

In [12]:
for index in tqdm(range(2000, 2300)):
    save_audio_features(index)

100%|█████████████████████████████████████████| 300/300 [01:42<00:00,  2.92it/s]
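
An alternative to spacing the batches out by hand is to wait and retry whenever a 429 comes back. The following is a minimal sketch, not what this notebook actually ran: `call_with_retry` is a hypothetical helper, and it assumes the `Retry-After` header, when present, holds a number of seconds.

```python
import time


def call_with_retry(make_request, max_attempts=5, default_wait=1.0, sleep=time.sleep):
    """Call make_request() until it returns a response that is not a 429.

    make_request: zero-argument callable returning an object with
    .status_code and .headers (for example a requests.Response).
    """
    for attempt in range(max_attempts):
        response = make_request()
        if response.status_code != 429:
            return response
        # Assumption: Retry-After is in seconds; otherwise back off exponentially.
        retry_after = response.headers.get("Retry-After")
        if retry_after is not None:
            wait = float(retry_after)
        else:
            wait = default_wait * 2 ** attempt
        sleep(wait)
    raise RuntimeError(f"Still rate limited after {max_attempts} attempts")
```

Inside `save_audio_features`, the request line could then become `response = call_with_retry(lambda: requests.get(URL, headers=headers))`.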


In [13]:
# Let's make a new dataframe from the json file
df2 = pd.read_json('cache.json', orient='index')

In [14]:
df2

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,0.529,0.496,7,-9.007,1,0.0290,0.17300,0.000000,0.2510,0.278,136.859,audio_features,3y4LxiYMgDl4RethdzpmNe,spotify:track:3y4LxiYMgDl4RethdzpmNe,https://api.spotify.com/v1/tracks/3y4LxiYMgDl4...,https://api.spotify.com/v1/audio-analysis/3y4L...,250547,4
1,0.609,0.923,9,-3.908,1,0.0338,0.16000,0.000005,0.2950,0.961,115.996,audio_features,0n2SEXB2qoRQg171q7XqeW,spotify:track:0n2SEXB2qoRQg171q7XqeW,https://api.spotify.com/v1/tracks/0n2SEXB2qoRQ...,https://api.spotify.com/v1/audio-analysis/0n2S...,294987,4
2,0.777,0.601,2,-5.931,1,0.1260,0.04060,0.002010,0.0348,0.680,97.911,audio_features,3XKIUb7HzIF1Vu9usunMzc,spotify:track:3XKIUb7HzIF1Vu9usunMzc,https://api.spotify.com/v1/tracks/3XKIUb7HzIF1...,https://api.spotify.com/v1/audio-analysis/3XKI...,261973,4
3,0.725,0.487,8,-5.959,0,0.0368,0.26000,0.000011,0.4310,0.599,136.086,audio_features,1m2xMsxbtxv21Brome189p,spotify:track:1m2xMsxbtxv21Brome189p,https://api.spotify.com/v1/tracks/1m2xMsxbtxv2...,https://api.spotify.com/v1/audio-analysis/1m2x...,296693,4
4,0.636,0.761,3,-6.389,0,0.0306,0.05170,0.000000,0.0642,0.736,93.896,audio_features,4cKGldbhGJniI8BrB3K6tb,spotify:track:4cKGldbhGJniI8BrB3K6tb,https://api.spotify.com/v1/tracks/4cKGldbhGJni...,https://api.spotify.com/v1/audio-analysis/4cKG...,257067,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2295,0.527,0.461,7,-5.908,1,0.0269,0.11800,0.000000,0.0831,0.227,128.153,audio_features,2ccuOtUjIyx3tPcsnpeBzJ,spotify:track:2ccuOtUjIyx3tPcsnpeBzJ,https://api.spotify.com/v1/tracks/2ccuOtUjIyx3...,https://api.spotify.com/v1/audio-analysis/2ccu...,214405,3
2296,0.745,0.650,2,-11.814,1,0.3460,0.04510,0.007580,0.1110,0.386,144.047,audio_features,5vUnjhBzRJJIAOJPde6zDx,spotify:track:5vUnjhBzRJJIAOJPde6zDx,https://api.spotify.com/v1/tracks/5vUnjhBzRJJI...,https://api.spotify.com/v1/audio-analysis/5vUn...,152137,4
2297,0.714,0.472,2,-7.375,1,0.0864,0.01300,0.000005,0.2660,0.238,131.121,audio_features,3nqQXoyQOWXiESFLlDF1hG,spotify:track:3nqQXoyQOWXiESFLlDF1hG,https://api.spotify.com/v1/tracks/3nqQXoyQOWXi...,https://api.spotify.com/v1/audio-analysis/3nqQ...,156943,4
2298,0.463,0.840,0,-5.807,1,0.0385,0.00204,0.000002,0.1970,0.575,99.991,audio_features,5ekA7j4MPQa3NZbZQSpRfF,spotify:track:5ekA7j4MPQa3NZbZQSpRfF,https://api.spotify.com/v1/tracks/5ekA7j4MPQa3...,https://api.spotify.com/v1/audio-analysis/5ekA...,214293,4


Finally, we've gotten all the song features. Now, it's time to connect them to their respective songs.

## Concatenating the two dataframes
This part should be easy since the two dataframes have matching indices:

In [18]:
combined_df = pd.concat([df, df2], axis=1)
combined_df

Unnamed: 0,track_id,title,artist,album,release_date,popularity,duration_ms,explicit,year,ranking,...,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms.1,time_signature
0,3y4LxiYMgDl4RethdzpmNe,Breathe,Faith Hill,Breathe,1999-11-09,69,250546,False,2000,1,...,0.2510,0.278,136.859,audio_features,3y4LxiYMgDl4RethdzpmNe,spotify:track:3y4LxiYMgDl4RethdzpmNe,https://api.spotify.com/v1/tracks/3y4LxiYMgDl4...,https://api.spotify.com/v1/audio-analysis/3y4L...,250547,4
1,0n2SEXB2qoRQg171q7XqeW,Smooth (feat. Rob Thomas),Santana,Supernatural (Remastered),1999-06-15,73,294986,False,2000,2,...,0.2950,0.961,115.996,audio_features,0n2SEXB2qoRQg171q7XqeW,spotify:track:0n2SEXB2qoRQg171q7XqeW,https://api.spotify.com/v1/tracks/0n2SEXB2qoRQ...,https://api.spotify.com/v1/audio-analysis/0n2S...,294987,4
2,3XKIUb7HzIF1Vu9usunMzc,Maria Maria (feat. The Product G&B),Santana,Supernatural (Remastered),1999-06-15,79,261973,False,2000,3,...,0.0348,0.680,97.911,audio_features,3XKIUb7HzIF1Vu9usunMzc,spotify:track:3XKIUb7HzIF1Vu9usunMzc,https://api.spotify.com/v1/tracks/3XKIUb7HzIF1...,https://api.spotify.com/v1/audio-analysis/3XKI...,261973,4
3,1m2xMsxbtxv21Brome189p,I Wanna Know,Joe,My Name Is Joe,2000-04-18,69,296693,False,2000,4,...,0.4310,0.599,136.086,audio_features,1m2xMsxbtxv21Brome189p,spotify:track:1m2xMsxbtxv21Brome189p,https://api.spotify.com/v1/tracks/1m2xMsxbtxv2...,https://api.spotify.com/v1/audio-analysis/1m2x...,296693,4
4,4cKGldbhGJniI8BrB3K6tb,Everything You Want,Vertical Horizon,Everything You Want,1999-06-14,64,257066,False,2000,5,...,0.0642,0.736,93.896,audio_features,4cKGldbhGJniI8BrB3K6tb,spotify:track:4cKGldbhGJniI8BrB3K6tb,https://api.spotify.com/v1/tracks/4cKGldbhGJni...,https://api.spotify.com/v1/audio-analysis/4cKG...,257067,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2295,2ccuOtUjIyx3tPcsnpeBzJ,Flower Shops (feat. Morgan Wallen),ERNEST,Flower Shops (feat. Morgan Wallen),2021-12-31,72,214405,False,2022,96,...,0.0831,0.227,128.153,audio_features,2ccuOtUjIyx3tPcsnpeBzJ,spotify:track:2ccuOtUjIyx3tPcsnpeBzJ,https://api.spotify.com/v1/tracks/2ccuOtUjIyx3...,https://api.spotify.com/v1/audio-analysis/2ccu...,214405,3
2296,5vUnjhBzRJJIAOJPde6zDx,TO THE MOON,JNR CHOI,TO THE MOON,2021-11-06,72,152137,True,2022,97,...,0.1110,0.386,144.047,audio_features,5vUnjhBzRJJIAOJPde6zDx,spotify:track:5vUnjhBzRJJIAOJPde6zDx,https://api.spotify.com/v1/tracks/5vUnjhBzRJJI...,https://api.spotify.com/v1/audio-analysis/5vUn...,152137,4
2297,3nqQXoyQOWXiESFLlDF1hG,Unholy (feat. Kim Petras),Sam Smith,Unholy (feat. Kim Petras),2022-09-22,86,156943,False,2022,98,...,0.2660,0.238,131.121,audio_features,3nqQXoyQOWXiESFLlDF1hG,spotify:track:3nqQXoyQOWXiESFLlDF1hG,https://api.spotify.com/v1/tracks/3nqQXoyQOWXi...,https://api.spotify.com/v1/audio-analysis/3nqQ...,156943,4
2298,5ekA7j4MPQa3NZbZQSpRfF,One Mississippi,Kane Brown,Different Man,2022-09-09,62,214293,False,2022,99,...,0.1970,0.575,99.991,audio_features,5ekA7j4MPQa3NZbZQSpRfF,spotify:track:5ekA7j4MPQa3NZbZQSpRfF,https://api.spotify.com/v1/tracks/5ekA7j4MPQa3...,https://api.spotify.com/v1/audio-analysis/5ekA...,214293,4
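
Because `pd.concat(..., axis=1)` pairs rows purely by index label, it can be worth confirming that the track IDs on both sides really agree, as they appear to in the output above. Here is a small sketch of such a check; `ids_aligned` is a hypothetical helper, and the column names `track_id` and `id` are taken from the two dataframes shown earlier.

```python
import pandas as pd


def ids_aligned(left, right, left_col="track_id", right_col="id"):
    """Return True when both frames share an index and the IDs match row by row."""
    # Different index labels or order means concat would misalign or pad rows.
    if not left.index.equals(right.index):
        return False
    return bool((left[left_col] == right[right_col]).all())
```

Running `ids_aligned(df, df2)` before the concatenation would flag a missing or misordered cache entry instead of silently attaching features to the wrong song.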


## Pre-processing (final touches)
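Lastly, let's remove the Spotify metadata columns that we don't need (the object type, the duplicate track ID, and the URI and URL fields), then save the result.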

In [21]:
billboard_advanced = combined_df.drop(columns=['type', 'id', 'uri', 'track_href', 'analysis_url'])

In [22]:
billboard_advanced.to_csv('billboard_advanced.csv', index=False)