<img src="https://github.com/rjpost20/Onramp-Project/blob/main/data/pexels-vishnu-r-nair-1105666.jpg?raw=true">
Image by <a href="https://www.pexels.com/@vishnurnair/" >Vishnu R Nair</a> on <a href="https://www.pexels.com/photo/people-at-concert-1105666/" >Pexels.com</a>

# *Onramp x Vanguard Spotify Project*

## Notebook 1: Data Ingestion

## *By Ryan Posternak*

<br>

## Table of Contents:

* __Step 1: Data Ingestion__
 * __Establish Connection to Spotify's API__
 * __Part 1: Artists Dataframe__
 * __Part 2: Albums Dataframe__
 * __Part 3: Tracks Dataframe__
 * __Part 4: Track Features Dataframe__
<br><br>
* Step 2: Data Transformation
 * Part 1: Handling Null/Missing Values
 * Part 2: Deduplication
   * 2.1: Remove duplicate albums
   * 2.2: Remove duplicate songs
<br><br>
* Step 3: Storage
 * Part 1: artist Table
 * Part 2: album Table
 * Part 3: track Table
 * Part 4: track_feature Table
<br><br>
* Step 4: Analytics / Visualizations
 * Part 1: SQL Query Analytics
   * 1.1: Top songs by artist in terms of `duration_ms`
   * 1.2: Top artists in the database by number of `followers`
   * 1.3: Top songs by artist in terms of `tempo`
   * 1.4: Average `track_feature` values by artist
   * 1.5: Max and min average `track_feature` and `duration_ms` values by album
   * 1.6: Proportion of `explicit` tracks by artist
 * Part 2: Data Visualizations
   * 2.1: Mean `track_feature` values by `genre`
   * 2.2: Distribution of `track_feature` and `duration_ms` values
   * 2.3: Evolution of artists, visualized
   * 2.4: Evolution of genres, visualized
   * 2.5: Correlations between `track_feature` + `duration_ms` + `followers` values

## Imports

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)
from pprint import pprint
import time, re, os

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

<br>

# Step 1: Data Ingestion

### Establish connection to Spotify's API

In [2]:
# Load API keys (sensitive information!) from a local file rather than hard-coding them in the notebook.
with open("API.txt") as f:
    text = f.readlines()
    client_id = text[0].strip()
    client_secret = text[1].strip()
    redirect_uri = text[2].strip()
    
# Assign API keys to a Spotipy credential manager
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, 
                                                      client_secret=client_secret, 
                                                      requests_timeout=None) # Default timeout setting is too short

# Connect to Spotipy by passing in credential manager
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

sp

<spotipy.client.Spotify at 0x136b54d30>
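
The cell above reads the credentials from a local text file. As an alternative, here is a minimal sketch of loading them from environment variables instead. `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are the variable names Spotipy itself looks for when no credentials are passed explicitly; the helper name `load_spotify_credentials` and the demo values are assumptions made for illustration only.

```python
import os

def load_spotify_credentials(env=os.environ):
    """Return (client_id, client_secret) from an environment-style mapping.

    Hypothetical helper: it raises a clear error up front instead of
    letting a missing variable surface later inside the API client.
    """
    try:
        return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None

# Demonstration with a stand-in mapping (placeholder values, not real keys)
demo_env = {"SPOTIPY_CLIENT_ID": "my-id", "SPOTIPY_CLIENT_SECRET": "my-secret"}
demo_id, demo_secret = load_spotify_credentials(demo_env)
print(demo_id, demo_secret)
```

The resulting pair could then be passed to `SpotifyClientCredentials` exactly as in the cell above.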

## Part 1: Artists Dataframe

### Obtain artist data for top 20 favorite artists

In [3]:
# Define list of 20 favorite artists
artists = ['Beyoncé', 'Billie Eilish', 'Bob Dylan', 'Bob Marley & The Wailers', 'Cuco', 'Doja Cat', 'Drake', \
           'Ellie Goulding', 'J. Cole', 'Jack Johnson', 'James Taylor', 'Khalid', 'Kid Cudi', 'Pink Floyd', \
           'Post Malone', 'Simon & Garfunkel', 'The Beatles', 'The Chainsmokers', 'The Weeknd', 'blackbear']

assert len(artists) == 20

In [4]:
# Preview artist output format
preview = sp.search('The Beatles', limit=1, type='artist')
pprint(preview)

{'artists': {'href': 'https://api.spotify.com/v1/search?query=The+Beatles&type=artist&offset=0&limit=1',
             'items': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/3WrFJ7ztbogyGnTHbHJFl2'},
                        'followers': {'href': None, 'total': 23583228},
                        'genres': ['beatlesque',
                                   'british invasion',
                                   'classic rock',
                                   'merseybeat',
                                   'psychedelic rock',
                                   'rock'],
                        'href': 'https://api.spotify.com/v1/artists/3WrFJ7ztbogyGnTHbHJFl2',
                        'id': '3WrFJ7ztbogyGnTHbHJFl2',
                        'images': [{'height': 640,
                                    'url': 'https://i.scdn.co/image/ab6761610000e5ebe9348cc01ff5d55971b22433',
                                    'width': 640},
                                   {'height': 

In [5]:
# Create function to fill dictionary based on API-call data lists and column names
def add_to_dict(dictionary, column_names, lists):
    for column_name, lst in zip(column_names, lists):
        dictionary[column_name] = lst
    print('Dictionary successfully filled')

In [6]:
# Define dictionary to contain all artist data
artists_dict = {}

# Define list containers for artists info
artist_ids, artist_names, external_urls, genres, image_urls, followers, \
popularities, types, artist_uris = [], [], [], [], [], [], [], [], []

# Append artist info to each respective list
for artist in artists:
    artist_info = sp.search(artist, limit=1, type='artist')
    info_items = artist_info['artists']['items'][0]
    
    artist_ids.append(info_items['id'])
    artist_names.append(info_items['name'])
    external_urls.append(info_items['external_urls']['spotify'])
    genres.append(info_items['genres'][0])  # Take first genre from list
    image_urls.append(info_items['images'][0]['url'])  # Take first image url from list
    followers.append(info_items['followers']['total'])
    popularities.append(info_items['popularity'])
    types.append(info_items['type'])
    artist_uris.append(info_items['uri'])
    
    # Set a delay after each API call to prevent loop from crashing
    time.sleep(0.1)

# Create column_name: feature_list key-value pairs for artists_dict dictionary
column_names = ['artist_id', 'artist_name', 'external_url', 'genre', 'image_url', 'followers', 'popularity', 
                'type', 'artist_uri']
lists = [artist_ids, artist_names, external_urls, genres, image_urls, followers, popularities, 
         types, artist_uris]

add_to_dict(artists_dict, column_names, lists)

Dictionary successfully filled
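
The loop above takes the first entry of `genres` and `images` for every artist, which worked for all 20 artists here. If an artist were to come back with an empty list for either field, indexing with `[0]` would raise an `IndexError`. A minimal sketch of a guard follows; `first_or_none` and the toy record are hypothetical, not part of the notebook.

```python
def first_or_none(values):
    """Return the first element of a list, or None if the list is empty or missing."""
    return values[0] if values else None

# Toy record shaped like one search result, but with no genres or images
sample_item = {"genres": [], "images": []}

genre = first_or_none(sample_item["genres"])
image = first_or_none(sample_item["images"])
image_url = image["url"] if image else None
print(genre, image_url)  # None None
```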


In [7]:
# Compile into Pandas dataframe
artists_df = pd.DataFrame(data=artists_dict)
assert artists_df.shape[0] == 20

# Preview artists dataframe
artists_df.head()

| | artist_id | artist_name | external_url | genre | image_url | followers | popularity | type | artist_uri |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6vWDO969PvNqNYHIOW5v0m | Beyoncé | https://open.spotify.com/artist/6vWDO969PvNqNY... | dance pop | https://i.scdn.co/image/ab6761610000e5eb676338... | 32173211 | 88 | artist | spotify:artist:6vWDO969PvNqNYHIOW5v0m |
| 1 | 6qqNVTkY8uBg9cP3Jd7DAH | Billie Eilish | https://open.spotify.com/artist/6qqNVTkY8uBg9c... | art pop | https://i.scdn.co/image/ab6761610000e5ebd8b998... | 68915938 | 88 | artist | spotify:artist:6qqNVTkY8uBg9cP3Jd7DAH |
| 2 | 74ASZWbe4lXaubB36ztrGX | Bob Dylan | https://open.spotify.com/artist/74ASZWbe4lXaub... | album rock | https://i.scdn.co/image/ab6772690000c46cf79ca0... | 5782357 | 71 | artist | spotify:artist:74ASZWbe4lXaubB36ztrGX |
| 3 | 2QsynagSdAqZj3U9HgDzjD | Bob Marley & The Wailers | https://open.spotify.com/artist/2QsynagSdAqZj3... | reggae | https://i.scdn.co/image/b5aae2067db80f694a980e... | 10865517 | 78 | artist | spotify:artist:2QsynagSdAqZj3U9HgDzjD |
| 4 | 2Tglaf8nvDzwSQnpSrjLHP | Cuco | https://open.spotify.com/artist/2Tglaf8nvDzwSQ... | bedroom pop | https://i.scdn.co/image/ab6761610000e5ebcfb53b... | 1851319 | 71 | artist | spotify:artist:2Tglaf8nvDzwSQnpSrjLHP |


<br>

## Part 2: Albums Dataframe

### Obtain album data for (max) ten albums for each of top 20 favorite artists

In [8]:
# Preview albums output format
preview = sp.artist_albums(artist_id='6vWDO969PvNqNYHIOW5v0m', limit=10, country='US')
pprint(preview['items'][0])

{'album_group': 'album',
 'album_type': 'album',
 'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/6vWDO969PvNqNYHIOW5v0m'},
              'href': 'https://api.spotify.com/v1/artists/6vWDO969PvNqNYHIOW5v0m',
              'id': '6vWDO969PvNqNYHIOW5v0m',
              'name': 'Beyoncé',
              'type': 'artist',
              'uri': 'spotify:artist:6vWDO969PvNqNYHIOW5v0m'}],
 'external_urls': {'spotify': 'https://open.spotify.com/album/6FJxoadUE4JNVwWHghBwnb'},
 'href': 'https://api.spotify.com/v1/albums/6FJxoadUE4JNVwWHghBwnb',
 'id': '6FJxoadUE4JNVwWHghBwnb',
 'images': [{'height': 640,
             'url': 'https://i.scdn.co/image/ab67616d0000b2730e58a0f8308c1ad403d105e7',
             'width': 640},
            {'height': 300,
             'url': 'https://i.scdn.co/image/ab67616d00001e020e58a0f8308c1ad403d105e7',
             'width': 300},
            {'height': 64,
             'url': 'https://i.scdn.co/image/ab67616d000048510e58a0f8308c1ad403d105e7'

In [9]:
# Display names of albums
for album in preview['items']:
    pprint(album['name'])

'RENAISSANCE'
'RENAISSANCE'
'The Lion King: The Gift [Deluxe Edition]'
'The Lion King: The Gift [Deluxe Edition]'
'The Lion King: The Gift'
'HOMECOMING: THE LIVE ALBUM'
'HOMECOMING: THE LIVE ALBUM'
'Lemonade'
'Lemonade'
'BEYONCÉ [Platinum Edition]'


**Remarks:**
- It looks like duplicate albums will be an issue. Further, it looks like slight variations on the name or edition (e.g. regular edition vs. deluxe edition) will also be an issue. We'll address this with a RegEx search to remove any text inside square brackets or parentheses when checking for duplicate albums, which should catch most of these duplicates.
- Artists will likely have one or more live-recorded albums in the mix (here, 'HOMECOMING: THE LIVE ALBUM'), which one would assume are almost entirely repeats of songs found on their studio albums. To address this, we'll skip any album with the word "live" in its title. Since this could mean losing a lot of albums, we'll set a relatively high limit of 40 in the API search.

In [10]:
# Display names of albums with album variation comments removed
for album in preview['items']:
    pprint(re.sub(r"[\(\[].*?[\)\]]", "", album['name']).strip())  # Raw string avoids invalid-escape warnings

'RENAISSANCE'
'RENAISSANCE'
'The Lion King: The Gift'
'The Lion King: The Gift'
'The Lion King: The Gift'
'HOMECOMING: THE LIVE ALBUM'
'HOMECOMING: THE LIVE ALBUM'
'Lemonade'
'Lemonade'
'BEYONCÉ'
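
The two clean-up rules from the remarks above can be sketched as small helper functions. This is an illustrative sketch, not the notebook's exact logic: the helper names are hypothetical, and the whole-word match for "live" is an assumption (a plain substring test such as `'live' in name.lower()` would also match titles like the made-up "Deliverance (Remastered)" used below).

```python
import re

# Same bracket/parenthesis pattern as above, compiled from a raw string
BRACKETED = re.compile(r"[\(\[].*?[\)\]]")
# Assumption: "live" should only count when it appears as a whole word
LIVE_WORD = re.compile(r"\blive\b", flags=re.IGNORECASE)

def normalize_album_name(name):
    """Strip edition notes such as '[Deluxe Edition]' or '(Remastered)'."""
    return BRACKETED.sub("", name).strip()

def is_live_album(name):
    """Return True if the title contains 'live' as a standalone word."""
    return LIVE_WORD.search(name) is not None

for name in ["The Lion King: The Gift [Deluxe Edition]",
             "HOMECOMING: THE LIVE ALBUM",
             "Deliverance (Remastered)"]:  # last title is made up for the demo
    print(normalize_album_name(name), "|", is_live_album(name))
```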


Below I structure an API call of up to 40 albums per artist, but I break the loop once 10 albums have been retrieved for an individual artist. I request 40 albums because I skip all identically named albums (after the RegEx substitution above), all albums whose title contains any word in the `skips` list, and, for five older artists, any albums released after specified dates (to avoid later-released remastered/special-edition albums). The extra albums in the API call account for those skips.

I skip albums with these words because the goal is to obtain solely original-release albums — no compilations and no live-recorded albums (as these same songs are presumably found on recorded albums). The analytics and visualizations should be more informative if we can obtain the original albums (and the associated original release dates) rather than the later-released remastered, recompiled, or otherwise special-edition albums.

Additionally, though I call for a max of 10 albums per artist, many artists (especially contemporary ones) will have far fewer than that because they have not released that many albums thus far. Furthermore, even more albums will be dropped later on when I do a manual review of duplicate albums. Due to the above, most artists will not end up with 10 albums in the dataframes, but for some prolific artists such as The Beatles, obtaining the majority of their discography by allowing up to 10 will make for more informative visualizations.

Specifying an `album_type` of 'album' helps filter out compilation albums, but some still get through, not to mention live-recorded albums.

For The Beatles, James Taylor and Pink Floyd, I had trouble isolating their original-release albums because of the abundance of albums available under their names (many of them live-recorded or compilation albums). To improve the search, I skip any albums released after the years in which they put out their original studio albums; the code below applies the same kind of release-date cutoff to Bob Marley & The Wailers and Bob Dylan. These artists may have some original albums released after the cutoff, but that is inconsequential, as seven albums are obtained for all three artists before the loop reaches that point in time.
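
The selection rules described above (deduplicate on the cleaned name, skip titles containing a word from `skips`, apply an optional release-date cutoff, stop at 10 albums) can be sketched as a single function. This is a simplified illustration on invented toy data: `select_albums`, its parameters, and the album names and dates below are all hypothetical, and the real loop works on the API's album items rather than these stand-ins.

```python
import re

def select_albums(albums, skips=("live",), cutoff=None, max_albums=10):
    """Return up to max_albums albums after applying the skip rules.

    albums: list of dicts with 'name' and 'release_date' keys.
    cutoff: optional 'YYYY-MM-DD' string; later releases are dropped.
    ISO-formatted dates compare correctly as plain strings.
    """
    kept, seen = [], set()
    for album in albums:
        base_name = re.sub(r"[\(\[].*?[\)\]]", "", album["name"]).strip()
        if base_name in seen:
            continue  # same album under a different edition label
        if any(skip in album["name"].lower() for skip in skips):
            continue  # substring match, as in the notebook's skip rule
        if cutoff is not None and album["release_date"] > cutoff:
            continue  # released after the artist's cutoff date
        kept.append(album)
        seen.add(base_name)
        if len(kept) == max_albums:
            break
    return kept

# Invented toy data, newest edition first
sample = [
    {"name": "Album A (Remastered)", "release_date": "1969-09-26"},
    {"name": "Album A", "release_date": "1969-09-26"},
    {"name": "Album B (Live)", "release_date": "1972-03-01"},
    {"name": "Album C", "release_date": "1970-05-08"},
    {"name": "Album D [Deluxe]", "release_date": "2000-11-13"},
]
print([a["name"] for a in select_albums(sample, cutoff="1975-01-01")])
# ['Album A (Remastered)', 'Album C']
```

Note that the first edition encountered is the one kept, which is why "Album A (Remastered)" survives rather than "Album A"; the same ordering effect applies to the loop in the notebook.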

In [11]:
# Define dictionary to contain all album data
albums_dict = {}

# Set up containers for albums info
album_ids, album_names, external_urls, image_urls, release_dates, total_tracks, types, \
album_uris, album_artist_ids = [], [], [], [], [], [], [], [], []

skips = ['live']

# Loop through each artist_id in artist_ids list
for artist_id in artist_ids:
    # API call for up to 40 albums by artist (extra albums account for the ones skipped below)
    albums_info = sp.artist_albums(artist_id=artist_id, album_type='album', limit=40, country='US')
    
    # Prevent duplicate albums from being added
    dup_album_check = []
    
    # Retrieve info for each album
    for album in albums_info['items']:

        # Record unique album name (after removing notes in parentheses or brackets)
        unique_album_name = re.sub(r"[\(\[].*?[\)\]]", "", album['name']).strip()
        # Skip album if in dup_album_check list
        if unique_album_name in dup_album_check:
            continue  
        # Skip album if any word in skips list is in album title
        if any([skip in album['name'].lower().strip(')]') for skip in skips]):
            continue
        
        # Only keep Beatles albums released before 1975
        if artist_id == '3WrFJ7ztbogyGnTHbHJFl2' and album['release_date'] > '1975-01-01':
            continue
        # Only keep Bob Marley albums released before 1984  
        if artist_id == '2QsynagSdAqZj3U9HgDzjD' and album['release_date'] > '1984-01-01':
            continue
        # Only keep James Taylor and Bob Dylan albums released before 1990
        if (artist_id in ('0vn7UBvSQECKJm2817Yf1P', '74ASZWbe4lXaubB36ztrGX')
                and album['release_date'] > '1990-01-01'):
            continue
        # Only keep Pink Floyd albums released before 1995
        if artist_id == '0k17h0D3J5VfsdmQ1iZtE9' and album['release_date'] > '1995-01-01':
            continue

        album_ids.append(album['id'])
        album_names.append(album['name'])
        external_urls.append(album['external_urls']['spotify'])
        image_urls.append(album['images'][0]['url'])  # Take first image url from list
        release_dates.append(album['release_date'])
        total_tracks.append(album['total_tracks'])
        types.append(album['type'])
        album_uris.append(album['uri'])
        album_artist_ids.append(artist_id)

        dup_album_check.append(unique_album_name)
        # Set maximum number of albums per artist to 10
        if len(dup_album_check) == 10:
            break
    
    # Set a delay after each API call
    time.sleep(0.1)
    
    
# Create column_name: feature_list key-value pairs for albums_dict dictionary
column_names = ['album_id', 'album_name', 'external_url', 'image_url', 'release_date', 'total_tracks', 
                'type', 'album_uri', 'artist_id']
lists = [album_ids, album_names, external_urls, image_urls, release_dates, total_tracks, 
         types, album_uris, album_artist_ids]

add_to_dict(albums_dict, column_names, lists)

Dictionary successfully filled


In [12]:
# Compile into Pandas dataframe
albums_df = pd.DataFrame(data=albums_dict)

# Verify we retrieved albums for all 20 artists
assert set(artist_ids) == set(album_artist_ids)  

# Preview albums dataframe
print('Albums:', albums_df.shape[0])
albums_df.head()

Albums: 150


| | album_id | album_name | external_url | image_url | release_date | total_tracks | type | album_uri | artist_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6FJxoadUE4JNVwWHghBwnb | RENAISSANCE | https://open.spotify.com/album/6FJxoadUE4JNVwW... | https://i.scdn.co/image/ab67616d0000b2730e58a0... | 2022-07-29 | 16 | album | spotify:album:6FJxoadUE4JNVwWHghBwnb | 6vWDO969PvNqNYHIOW5v0m |
| 1 | 7kUuNU2LRmr9XbwLHXU9UZ | The Lion King: The Gift [Deluxe Edition] | https://open.spotify.com/album/7kUuNU2LRmr9Xbw... | https://i.scdn.co/image/ab67616d0000b27360e232... | 2020-07-31 | 17 | album | spotify:album:7kUuNU2LRmr9XbwLHXU9UZ | 6vWDO969PvNqNYHIOW5v0m |
| 2 | 7dK54iZuOxXFarGhXwEXfF | Lemonade | https://open.spotify.com/album/7dK54iZuOxXFarG... | https://i.scdn.co/image/ab67616d0000b27389992f... | 2016-04-23 | 13 | album | spotify:album:7dK54iZuOxXFarGhXwEXfF | 6vWDO969PvNqNYHIOW5v0m |
| 3 | 2UJwKSBUz6rtW4QLK74kQu | BEYONCÉ [Platinum Edition] | https://open.spotify.com/album/2UJwKSBUz6rtW4Q... | https://i.scdn.co/image/ab67616d0000b2730d1d6e... | 2014-11-24 | 20 | album | spotify:album:2UJwKSBUz6rtW4QLK74kQu | 6vWDO969PvNqNYHIOW5v0m |
| 4 | 1gIC63gC3B7o7FfpPACZQJ | 4 | https://open.spotify.com/album/1gIC63gC3B7o7Ff... | https://i.scdn.co/image/ab67616d0000b273ff5429... | 2011-06-24 | 14 | album | spotify:album:1gIC63gC3B7o7FfpPACZQJ | 6vWDO969PvNqNYHIOW5v0m |


<br>

## Part 3: Tracks Dataframe

### Obtain track data for (max) 50 tracks (minus duplicates) in each album obtained above

In [13]:
# Preview tracks output format
preview = sp.album_tracks(album_id='6FJxoadUE4JNVwWHghBwnb', limit=10, market='US')
pprint(preview['items'][0])

{'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/6vWDO969PvNqNYHIOW5v0m'},
              'href': 'https://api.spotify.com/v1/artists/6vWDO969PvNqNYHIOW5v0m',
              'id': '6vWDO969PvNqNYHIOW5v0m',
              'name': 'Beyoncé',
              'type': 'artist',
              'uri': 'spotify:artist:6vWDO969PvNqNYHIOW5v0m'}],
 'disc_number': 1,
 'duration_ms': 208014,
 'explicit': True,
 'external_urls': {'spotify': 'https://open.spotify.com/track/1MpCaOeUWhox2Fgigbe1cL'},
 'href': 'https://api.spotify.com/v1/tracks/1MpCaOeUWhox2Fgigbe1cL',
 'id': '1MpCaOeUWhox2Fgigbe1cL',
 'is_local': False,
 'is_playable': True,
 'name': "I'M THAT GIRL",
 'preview_url': 'https://p.scdn.co/mp3-preview/c7cece6b1b9cb3637fc48924f23baf9c7e1ec15c?cid=a3c3419e623d4410ad1aadf01bc737d5',
 'track_number': 1,
 'type': 'track',
 'uri': 'spotify:track:1MpCaOeUWhox2Fgigbe1cL'}


In [14]:
# Display names of 10 songs
for track in preview['items']:
    pprint(track['name'])

"I'M THAT GIRL"
'COZY'
'ALIEN SUPERSTAR'
'CUFF IT'
'ENERGY (feat. Beam)'
'BREAK MY SOUL'
'CHURCH GIRL'
'PLASTIC OFF THE SOFA'
"VIRGO'S GROOVE"
'MOVE (feat. Grace Jones & Tems)'


**Remarks:**
- It looks like duplicate songs might not be as big an issue with track searches. I don't see a need to use a RegEx search like in the last API call, but we'll still check for identically named songs and skip those.

In [15]:
# Define dictionary to contain all tracks data
tracks_dict = {}

# Set up containers for tracks info
track_ids, song_names, external_urls, durations_ms, explicit, disc_numbers, types, \
song_uris, track_album_ids = [], [], [], [], [], [], [], [], []

# Loop through each album_id in album_ids list
for album_id in album_ids:
    # API call for (max) 50 tracks per album
    tracks_info = sp.album_tracks(album_id=album_id, limit=50, market='US')
    
    # Prevent duplicate tracks from being added
    dup_track_check = []
    
    # Retrieve info for each track
    for track in tracks_info['items']:
        
        track_name = track['name']
        if track_name in dup_track_check:  # Skip track if in dup_track_check list
            continue
            
        track_ids.append(track['id'])
        song_names.append(track['name'])
        external_urls.append(track['external_urls']['spotify'])
        durations_ms.append(track['duration_ms'])
        explicit.append(track['explicit'])
        disc_numbers.append(track['disc_number'])
        types.append(track['type'])
        song_uris.append(track['uri'])
        track_album_ids.append(album_id)
        
        dup_track_check.append(track_name)
    
    # Set a delay after each API call
    time.sleep(0.01)


# Create column_name: feature_list key-value pairs for tracks_dict dictionary
column_names = ['track_id', 'song_name', 'external_url', 'duration_ms', 'explicit', 'disc_number', 
                'type', 'song_uri', 'album_id']
lists = [track_ids, song_names, external_urls, durations_ms, explicit, disc_numbers, 
        types, song_uris, track_album_ids]

add_to_dict(tracks_dict, column_names, lists)

Dictionary successfully filled


In [16]:
# Compile into Pandas dataframe
tracks_df = pd.DataFrame(data=tracks_dict)

# Verify we retrieved tracks for all albums in album_ids
assert set(album_ids) == set(track_album_ids)

# Preview tracks dataframe
print('Tracks:', tracks_df.shape[0])
tracks_df.head()

Tracks: 2152


| | track_id | song_name | external_url | duration_ms | explicit | disc_number | type | song_uri | album_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1MpCaOeUWhox2Fgigbe1cL | I'M THAT GIRL | https://open.spotify.com/track/1MpCaOeUWhox2Fg... | 208014 | True | 1 | track | spotify:track:1MpCaOeUWhox2Fgigbe1cL | 6FJxoadUE4JNVwWHghBwnb |
| 1 | 0mKGwFMHzTprtS2vpR3b6s | COZY | https://open.spotify.com/track/0mKGwFMHzTprtS2... | 210372 | True | 1 | track | spotify:track:0mKGwFMHzTprtS2vpR3b6s | 6FJxoadUE4JNVwWHghBwnb |
| 2 | 1Hohk6AufHZOrrhMXZppax | ALIEN SUPERSTAR | https://open.spotify.com/track/1Hohk6AufHZOrrh... | 215459 | True | 1 | track | spotify:track:1Hohk6AufHZOrrhMXZppax | 6FJxoadUE4JNVwWHghBwnb |
| 3 | 1xzi1Jcr7mEi9K2RfzLOqS | CUFF IT | https://open.spotify.com/track/1xzi1Jcr7mEi9K2... | 225388 | True | 1 | track | spotify:track:1xzi1Jcr7mEi9K2RfzLOqS | 6FJxoadUE4JNVwWHghBwnb |
| 4 | 0314PeD1sQNonfVWix3B2K | ENERGY (feat. Beam) | https://open.spotify.com/track/0314PeD1sQNonfV... | 116727 | False | 1 | track | spotify:track:0314PeD1sQNonfVWix3B2K | 6FJxoadUE4JNVwWHghBwnb |


<br>

## Part 4: Track Features Dataframe

### Obtain track features data for each of the tracks obtained above

To avoid running a separate API call for each track (which is slow and may cause the loop to crash partway through), I chunk the `track_ids` list into lists of up to 100 tracks each (the maximum number of tracks allowed per Spotipy `audio_features` API call).

In [17]:
# Preview track features output format
preview = sp.audio_features('1MpCaOeUWhox2Fgigbe1cL')[0]
pprint(preview)

{'acousticness': 0.0616,
 'analysis_url': 'https://api.spotify.com/v1/audio-analysis/1MpCaOeUWhox2Fgigbe1cL',
 'danceability': 0.554,
 'duration_ms': 208014,
 'energy': 0.535,
 'id': '1MpCaOeUWhox2Fgigbe1cL',
 'instrumentalness': 1.32e-05,
 'key': 5,
 'liveness': 0.124,
 'loudness': -8.959,
 'mode': 0,
 'speechiness': 0.186,
 'tempo': 105.865,
 'time_signature': 4,
 'track_href': 'https://api.spotify.com/v1/tracks/1MpCaOeUWhox2Fgigbe1cL',
 'type': 'audio_features',
 'uri': 'spotify:track:1MpCaOeUWhox2Fgigbe1cL',
 'valence': 0.136}


In [18]:
# Create function to return successive n-sized chunks from a list
def chunks(lst, n):
    chunked_lists = []
    for i in range(0, len(lst), n):
        chunked_lists.append(lst[i:i + n])
    return chunked_lists
        
chunked_track_ids = chunks(track_ids, 100)
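
As a quick check of the chunking arithmetic, here is a generator-based variant of the same idea applied to placeholder IDs. The name `iter_chunks` and the fake IDs are assumptions for illustration; the count of 2,152 matches the number of tracks collected above.

```python
def iter_chunks(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder IDs standing in for the 2,152 track IDs collected above
fake_ids = [f"id_{i}" for i in range(2152)]
batches = list(iter_chunks(fake_ids, 100))
print(len(batches), len(batches[-1]))  # 22 52
```

With 2,152 IDs this gives 21 full batches of 100 plus a final batch of 52, so the features loop needs 22 API calls instead of 2,152.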

In [19]:
# Define dictionary to contain all track features data
track_features_dict = {}

# Set up containers for track features info
track_features_track_ids, danceability, energy, instrumentalness, liveness, loudness, speechiness, \
tempo, types, valence, track_features_song_uris = [], [], [], [], [], [], [], [], [], [], []

# Append track features info to each respective list
for chunked_track_id_lst in chunked_track_ids:
    # API call for chunk of track features info
    track_features_chunk = sp.audio_features(chunked_track_id_lst)
    
    # Loop through each track_features dictionary in track_features_chunk
    for track_features in track_features_chunk:
        
        track_features_track_ids.append(track_features['id'])
        danceability.append(track_features['danceability'])
        energy.append(track_features['energy'])
        instrumentalness.append(track_features['instrumentalness'])
        liveness.append(track_features['liveness'])
        loudness.append(track_features['loudness'])
        speechiness.append(track_features['speechiness'])
        tempo.append(track_features['tempo'])
        types.append(track_features['type'])
        valence.append(track_features['valence'])
        track_features_song_uris.append(track_features['uri'])
    
    # Set a delay after each API call
    time.sleep(0.1)
    
    
    
# Create column_name: feature_list key-value pairs for track_features_dict dictionary
column_names = ['track_id', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 
                'speechiness', 'tempo', 'type', 'valence', 'song_uri']
lists = [track_features_track_ids, danceability, energy, instrumentalness, liveness, loudness, 
         speechiness, tempo, types, valence, track_features_song_uris]

add_to_dict(track_features_dict, column_names, lists)

Dictionary successfully filled
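
The loop above assumes every requested track comes back with a features dictionary, which held for this dataset. As a defensive sketch, and assuming the API can return `None` in place of a dictionary for a track without audio features, the entries could be separated before indexing into them. The helper name and the toy response below are hypothetical.

```python
def split_missing(requested_ids, features):
    """Separate usable feature dicts from IDs whose entry came back as None."""
    found = [f for f in features if f is not None]
    found_ids = {f["id"] for f in found}
    missing = [tid for tid in requested_ids if tid not in found_ids]
    return found, missing

# Toy response: the second track has no features
requested = ["a", "b", "c"]
response = [{"id": "a", "tempo": 120.0}, None, {"id": "c", "tempo": 98.5}]
found, missing = split_missing(requested, response)
print(len(found), missing)  # 2 ['b']
```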


In [20]:
# Compile into Pandas dataframe
track_features_df = pd.DataFrame(data=track_features_dict)

# Verify we retrieved track features for all tracks in track_ids
assert set(track_ids) == set(track_features_track_ids)

# Preview track features dataframe
print('Track features:', track_features_df.shape[0])
track_features_df.head()

Track features: 2152


| | track_id | danceability | energy | instrumentalness | liveness | loudness | speechiness | tempo | type | valence | song_uri |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1MpCaOeUWhox2Fgigbe1cL | 0.554 | 0.535 | 1.3e-05 | 0.124 | -8.959 | 0.186 | 105.865 | audio_features | 0.136 | spotify:track:1MpCaOeUWhox2Fgigbe1cL |
| 1 | 0mKGwFMHzTprtS2vpR3b6s | 0.556 | 0.63 | 0.00468 | 0.155 | -8.15 | 0.102 | 149.147 | audio_features | 0.367 | spotify:track:0mKGwFMHzTprtS2vpR3b6s |
| 2 | 1Hohk6AufHZOrrhMXZppax | 0.545 | 0.641 | 6.6e-05 | 0.171 | -6.398 | 0.0998 | 121.892 | audio_features | 0.464 | spotify:track:1Hohk6AufHZOrrhMXZppax |
| 3 | 1xzi1Jcr7mEi9K2RfzLOqS | 0.78 | 0.689 | 1e-05 | 0.0698 | -5.668 | 0.141 | 115.042 | audio_features | 0.642 | spotify:track:1xzi1Jcr7mEi9K2RfzLOqS |
| 4 | 0314PeD1sQNonfVWix3B2K | 0.903 | 0.519 | 0.000106 | 0.155 | -9.151 | 0.26 | 114.991 | audio_features | 0.587 | spotify:track:0314PeD1sQNonfVWix3B2K |


<br>

### Save dataframes as `pkl` files

In [21]:
if not os.path.exists('data/artists_df_unprocessed.pkl'):
    artists_df.to_pickle('data/artists_df_unprocessed.pkl')
    
if not os.path.exists('data/albums_df_unprocessed.pkl'):
    albums_df.to_pickle('data/albums_df_unprocessed.pkl')
    
if not os.path.exists('data/tracks_df_unprocessed.pkl'):
    tracks_df.to_pickle('data/tracks_df_unprocessed.pkl')
    
if not os.path.exists('data/track_features_df_unprocessed.pkl'):
    track_features_df.to_pickle('data/track_features_df_unprocessed.pkl')
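
The four save steps above follow the same pattern, so they could also be written as one loop. The sketch below is a hypothetical helper, not the notebook's code: `save_if_missing` accepts any object with a `to_pickle(path)` method (a pandas dataframe in the notebook), and creating the directory when it is absent is an added assumption.

```python
import os

def save_if_missing(frames, directory="data"):
    """Pickle each frame to <directory>/<name>_unprocessed.pkl unless it exists.

    frames: dict mapping a name such as 'artists_df' to an object with a
    to_pickle(path) method. Returns the list of paths actually written.
    """
    os.makedirs(directory, exist_ok=True)  # assumption: create the folder if needed
    written = []
    for name, frame in frames.items():
        path = os.path.join(directory, f"{name}_unprocessed.pkl")
        if not os.path.exists(path):
            frame.to_pickle(path)
            written.append(path)
    return written

# In the notebook this could be called roughly as:
# save_if_missing({'artists_df': artists_df, 'albums_df': albums_df,
#                  'tracks_df': tracks_df, 'track_features_df': track_features_df})
```

Because existing files are never overwritten, rerunning the notebook keeps the first saved snapshot, which matches the behaviour of the cell above.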