# Getting Data 

Data Last Accessed: 12/5/2020

Author: Sean Reidy sreidy1@umbc.edu

## Table Of Contents:

 - 1. Import Packages
 - 2. Setting up API Access 
 - 3. Designing the Dataset 
 - 4. Data Querying & Pre-Processing Functions 
 - 5. Save & Export Data 

## 1. Import Packages 

In [1]:
#A Python Wrapper for the Spotify Web API
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
# A Python Wrapper for the Genius API
import lyricsgenius
from lyricsgenius import Genius
import numpy as np

import pandas as pd
import getpass
import pickle
import sys
import time

 - The helper below draws a progress bar while data is pulled from Spotify and Genius, since that process can take a long time. 

In [2]:
def drawProgressBar(percent, barLen = 20):
    # percent float from 0 to 1. 
    sys.stdout.write("\r")
    sys.stdout.write("[{:<{}}] {:.0f}%".format("=" * int(barLen * percent), barLen, percent * 100))
    sys.stdout.flush()
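
As a quick check of the format string above, the sketch below builds the same bar as a string instead of writing it to stdout. `format_bar` is a hypothetical helper introduced here for illustration, and the clamping of out-of-range values is an added assumption that `drawProgressBar` does not make.

```python
def format_bar(percent, bar_len=20):
    """Return the text drawProgressBar would write (hypothetical helper)."""
    # Assumption: clamp to [0, 1] so a bad input cannot overflow the bar.
    percent = max(0.0, min(1.0, percent))
    filled = "=" * int(bar_len * percent)
    return "[{:<{}}] {:.0f}%".format(filled, bar_len, percent * 100)


# Print the bar at a few stages of a five-step job.
for step in range(5):
    print(format_bar(step / 4))
```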

## 2. Setting up API Access 
### Setting up Spotify Web API Credentials & Using Spotipy

 - We will need the following:
     + [A Spotify account](https://developer.spotify.com/)
     + An app created in the [Spotify Developer Dashboard](https://developer.spotify.com/dashboard/applications) 
     + From within the newly created Spotify app:
         + Spotify Client ID 
         + Spotify Client Secret 

In [3]:
SPOTIFY_CLIENT_ID = getpass.getpass(prompt = "your-spotify-client-id")

your-spotify-client-id········


In [4]:
SPOTIFY_CLIENT_SECRET = getpass.getpass(prompt = "your-spotify-client-secret")

your-spotify-client-secret········


#### Get Spotify OAuth Token

In [5]:
import warnings
warnings.filterwarnings('ignore')

token = SpotifyClientCredentials(client_id=SPOTIFY_CLIENT_ID, client_secret=SPOTIFY_CLIENT_SECRET)
cache_token = token.get_access_token()
#create the Spotify client as sp; it will be used in future API calls
sp = spotipy.Spotify(cache_token)

#### Test Spotify API

In [6]:
# Enter artist name 
artist_name = "My Bloody Valentine"
#Search for the artist name and print the first 20 matching tracks 
results = sp.search(q=artist_name, limit=20)
for i, t in enumerate(results['tracks']['items']):
    print(' ', i, t['name'])

  0 When You Sleep
  1 My Bloody Valentine
  2 Only Shallow
  3 Sometimes
  4 Soon
  5 To Here Knows When
  6 Loomer
  7 I Only Said
  8 Come in Alone
  9 Touched
  10 Blown a Wish
  11 What You Want
  12 Lose My Breath
  13 Off Your Face
  14 My Bloody Valentine
  15 Soft as Snow (But Warm Inside)
  16 Cupid Come
  17 Honey Power (EP Version)
  18 (When You Wake) You're Still in a Dream
  19 Feed Me with Your Kiss


### Setting up Genius Web API Credentials & Using lyricsgenius

 - We will need the following:
     + [A Genius account](https://genius.com/developers)
     + An app created in the [Genius Developer Dashboard](https://genius.com/api-clients) 
         + You will have to set an API URL. (I used this GitHub project.) 
     + From within the newly created Genius app:
         + Click Get Access Token 
            + Genius Client Access Token

In [7]:
GENIUS_CLIENT_ACCESS_TOKEN = getpass.getpass(prompt = "your-genius-client-access-token")

your-genius-client-access-token········


#### Get Genius OAuth Token

In [8]:
genius = Genius(GENIUS_CLIENT_ACCESS_TOKEN, verbose = False)

#### Test Genius API

In [9]:
trackName = "When You Sleep"
artistName = "My Bloody Valentine"
song = genius.search_song(trackName,artistName)
print(song.lyrics)

[Verse 1: Kevin Shields]
When I look at you
Oh, I don't know what's real
Once in a while
And you make me laugh
And I'll sleep tomorrow
And it won't be long
Once in a while
Then you take me down
Then you walk away

[Verse 2: Kevin Shields]
When you say "I do"
Oh, I don't believe in you
I can't forget it, ooh
When you sleep tomorrow
And it won't be long
Once in a while
When you make me smile
And you turn your long blonde hair

[Verse 3: Kevin Shields]
When I look at you
Oh, I don't know what's real
Once in a while
And you make me laugh
And I'll sleep tomorrow
And it won't be long
Once in a while
Then you take me down
Then you walk away


## 3.  Designing the Dataset 

![design](../media/spotifydatadiagram.png)

##### Audio Features 

The Spotify Web API provides the following audio features for each track ([more info here](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/)); the list also includes the track metadata stored alongside them:

- track_id : a spotify primary key; unique for each track
- artist: name of artist
- album: name of album
- trackName: title of track
- acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. 
- duration_ms: The duration of the track in milliseconds.
- energy: A measure from 0.0 to 1.0 representing perceived intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. 
- instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context.
- key: The estimated overall key of the track. Integers map to pitches using standard [Pitch Class notation](https://en.wikipedia.org/wiki/Pitch_class) 
- liveness: Detects the presence of an audience in the recording.
- loudness: The overall loudness of a track in decibels (dB). 
- mode: indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
- speechiness: Speechiness detects the presence of spoken words in a track.
- tempo: The estimated tempo of the track in beats per minute (BPM).
- time_signature: An estimated overall time signature of a track. 
- valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. 
- category_id: The Spotify Category ID of the track 
- popularity: The value will be between 0 and 100, with 100 being the most popular. The artist’s popularity is calculated from the popularity of all the artist’s tracks.
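
The `key` and `mode` fields are stored as integers, so a small lookup can make them readable. This is a minimal sketch, assuming the usual pitch-class numbering (0 = C, 1 = C#/Db, and so on) and that Spotify reports -1 when no key was detected; `describe_key` is a hypothetical helper, not part of the Spotify API.

```python
# Pitch-class names for the integer `key` values (0 = C, 1 = C#/Db, ...).
PITCH_CLASSES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]


def describe_key(key, mode):
    """Turn the numeric key and mode fields into a label such as 'A minor'."""
    # Assumption: anything outside 0..11 (including -1) means "no key detected".
    if key is None or not 0 <= key < len(PITCH_CLASSES):
        return "unknown"
    scale = "major" if mode == 1 else "minor"
    return "{} {}".format(PITCH_CLASSES[key], scale)


print(describe_key(9, 0))
```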


## 4. Data Querying & Pre-Processing Functions 

In [10]:
#generate_playlist_ids: Creates a 2D list of playlists from a genre dict

#Parameters
#:genre_dict: (dictionary) A Python dictionary where each key is a root genre, and each root genre maps to a list of 
#                         sub genres 
#:n_playlists: (int) The maximum number of playlists to request per sub genre (default 20)

#Returns 
#:playlist_id_list: (list) A 2D list of all playlists found by the Spotify Web API search; each row is
#                          [root genre, sub genre, playlist id, playlist name]


def generate_playlist_ids(genre_dict, n_playlists = 20):
    
    playlist_id_list = []
    #loop over all root genres
    for root_genre, sub_genres in genre_dict.items():
        #for every sub genre 
        for search_genre in sub_genres:
            #get up to n_playlists playlists for each sub genre 
            searched_playlists = sp.search(q = search_genre, type = 'playlist', limit = n_playlists, market = 'US')
            playlists = searched_playlists['playlists']['items']
            #loop over the returned playlists 
            for i in playlists:
                #gather the playlist info and add it to the list 
                playlist_id_list.append([root_genre,search_genre,i['id'],i['name']])
    return playlist_id_list


#### Testing generate_playlist_ids

In [11]:
genre_dict = {
    'pop' : ['pop','post-teen pop','dance pop','electropop','pop dance','indie pop'],
    'rap' : ['rap','hip hop','souther hip hop','gangster rap','trap','dirty south rap'],
    'edm' : ['edm','electro house','big room','pop edm','pop dance','complextro'],
    'r&b' : ['r&b', 'urban contemporary','new jack swing','neo soul','hip pop','pop r&b'],
    'country' : ['country', 'country road','contemporary country','moden country rock','country rock','country dawn'],
    'rock' : ['rock','album rock','classic rock','permanet wave','hard rock','modern rock']
    }
 
#for the genres in genre_dict, get one playlist and print the first 10 
test_playlist_list = generate_playlist_ids(genre_dict,1)
print(test_playlist_list[0:10])

[['pop', 'pop', '37i9dQZF1DXdY5tVYFPWb2', 'City Pop: シティ・ポップの今'], ['pop', 'post-teen pop', '3IdDjr6DGMU0qqsJsd38mf', 'Post-Teen Pop'], ['pop', 'dance pop', '37i9dQZF1DWZQaaqNMbbXa', 'Dance Pop Hits'], ['pop', 'electropop', '4frhr6RQM2fMOm2mpvOVo6', 'ElectroPop 2020'], ['pop', 'pop dance', '5mN5EWwgU8sJsNPTqjbnze', 'Pop Dance Remixes - 2000s 2010s 2020s'], ['pop', 'indie pop', '37i9dQZF1DWWEcRhUVtL8n', 'Indie Pop'], ['rap', 'rap', '37i9dQZF1DX0XUsuxWHRQd', 'RapCaviar'], ['rap', 'hip hop', '37i9dQZF1DWT5MrZnPU1zD', 'Hip Hop Controller'], ['rap', 'souther hip hop', '4lcyWQDOzPfcbZrcBI3FOW', 'Southern Hip Hop'], ['rap', 'gangster rap', '0ZRwrJ2EDGyKR6YgQPWXeO', 'Gangster Rap Workout ']]
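
Because some search terms appear under more than one root genre (`'pop dance'` is listed under both `pop` and `edm`), the same playlist can be returned twice. A minimal sketch of dropping repeats by playlist id, keeping the first root genre seen, might look like this. `dedupe_playlists` is a hypothetical helper, and the ids in the sample rows are made up.

```python
def dedupe_playlists(playlist_rows):
    """Keep the first [root_genre, sub_genre, playlist_id, name] row per playlist id."""
    seen_ids = set()
    unique_rows = []
    for row in playlist_rows:
        playlist_id = row[2]
        if playlist_id in seen_ids:
            continue
        seen_ids.add(playlist_id)
        unique_rows.append(row)
    return unique_rows


# Made-up rows in the same shape that generate_playlist_ids returns.
sample_rows = [
    ["pop", "pop dance", "id-1", "Pop Dance Remixes"],
    ["edm", "pop dance", "id-1", "Pop Dance Remixes"],
    ["edm", "big room", "id-2", "Big Room"],
]
print(dedupe_playlists(sample_rows))
```

Whether a shared playlist should count towards one root genre or both is a labelling decision; this sketch simply keeps the first one.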


In [12]:
#get_track_lyrics: Using the Genius API, look up the lyrics of a track; return None if no lyrics are found 

#Parameters
#:spotify_artist: (string) The artist name from the Spotify API
#:spotify_trackName: (string) The title of the track from the Spotify API
#:pre_processing: (bool) Placeholder flag for future NLP pre-processing; currently has no effect

#Returns 
#:lyrics_str: (string) The full track lyrics if found, None otherwise 

def get_track_lyrics(spotify_artist,spotify_trackName, pre_processing = False):
    
    try: 

        #Search the Genius API for the song; search_song expects the title first, then the artist
        track = genius.search_song(spotify_trackName,spotify_artist)
        #if track is None, then no matching song was found 
        if track is None:
            return None
    
        # Pre-process the lyrics for future NLP work (not implemented yet)
        if pre_processing:
            pass 
    
        return track.lyrics
    except Exception: 
        #treat any lookup failure as "no lyrics found"
        return None
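
The `pre_processing` branch is a placeholder. One possible step, sketched here under the assumption that section headers such as `[Verse 1: Kevin Shields]` are unwanted for NLP work, is to strip bracketed header lines. `strip_section_headers` is a hypothetical helper; the `Genius` client also accepts `remove_section_headers=True`, which `Api_refresh` uses later in this notebook.

```python
import re

# Matches a whole line that is only a bracketed header, e.g. "[Verse 1: Kevin Shields]".
SECTION_HEADER = re.compile(r"^\[[^\]]*\][ \t]*$", re.MULTILINE)


def strip_section_headers(lyrics):
    """Remove bracketed header lines and collapse the blank lines they leave behind."""
    without_headers = SECTION_HEADER.sub("", lyrics)
    # Collapse runs of three or more newlines into one blank line, then trim the ends.
    return re.sub(r"\n{3,}", "\n\n", without_headers).strip()


sample = "[Verse 1]\nI broke free\n\n[Chorus]\nI am going to make it"
print(strip_section_headers(sample))
```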
    
        

#### Testing get_track_lyrics

In [13]:
test_artist = "The Mountain Goats"
test_song = "This Year"

test_lyrics = get_track_lyrics(test_artist,test_song)
print(test_lyrics)

[Verse 1]
I broke free on a Saturday morning
I put the pedal to the floor
Headed north on Mills Avenue
And listened to the engine roar

[Verse 2]
My broken house behind me
And good things ahead
A girl named Cathy
Wants a little of my time
Six cylinders underneath the hood
Crashing and kicking
Aha! Listen to the engine whine

[Chorus]
I am going to make it through this year
If it kills me
I am going to make it though this year
If it kills me

[Verse 3]
I played video games in a drunken haze
I was seventeen years young
Hurt my knuckles punching the machines
The taste of Scotch rich on my tongue

[Verse 4]
And then Cathy showed up
And we hung out
Trading swigs from a bottle
All bitter and clean
Locking eyes
Holding hands
Twin high maintenance machines

[Chorus]
I am going to make it through this year
If it kills me
I am going to make it though this year
If it kills me

[Verse 5]
I drove home in the California dusk
I could feel the alcohol inside of me hum
Pictured the look on my stepfathe

In [14]:
#query_spotify_playlist: Creates a dataframe of songs and assorted metadata from a playlist.

# Parameters 
# :playlist_id: (string) a unique primary key for the playlist (found from the playlist URL extension)

# Returns 
# :playlist_df: a pandas dataframe of the playlist 
 
def query_spotify_playlist( playlist_id):
    
    track_feature_list = ["acousticness","danceability","duration_ms","energy","instrumentalness",
                          "key","liveness","loudness","mode","speechiness","tempo","time_signature","valence"]
    metadata_list = ["track_id","artist","artist_id","album","trackName"]
    
    colNames = metadata_list + track_feature_list
    
    #set DataFrame for the playlist 
    playlist_df = pd.DataFrame(columns = colNames)
    
    #pull playlist data from Spotify Web API
    playlist_raw_dict = sp.playlist_tracks(playlist_id)
    playlist = playlist_raw_dict['items']
    check_if_next = playlist_raw_dict['next']
    total_len = playlist_raw_dict['total']
    
    #Handle playlists of over 100 songs; the Spotify Web API returns at most 100 tracks per request
    spotify_offset = 100
    while check_if_next is not None:
        next_playlist_chunk = sp.playlist_tracks(playlist_id, offset = spotify_offset)
        playlist.extend(next_playlist_chunk['items'])
        check_if_next = next_playlist_chunk['next']
        spotify_offset += 100

            
            
    n = len(playlist)
    count = 0
    #loop over all items in a playlist 
    for i in playlist:
        progress = count / n 
        drawProgressBar(progress)
        
        playlist_elem = {}
        
        # Metadata (hardcoded locations of metadata)
        playlist_elem["track_id"] = i["track"]["id"]
        playlist_elem["artist"] = i["track"]["album"]["artists"][0]["name"]
        playlist_elem["artist_id"] = i["track"]["album"]["artists"][0]["id"]
        playlist_elem["album"] = i["track"]["album"]["name"]
        playlist_elem["trackName"] = i["track"]["name"]
        
        # Artist Genre list, 
        artist = sp.artist(i["track"]["album"]["artists"][0]["id"])
        track_genre_list = artist['genres']
        playlist_elem['genre'] = str(track_genre_list)
        
        # Artist Popularity
        artist_pop = artist['popularity']
        playlist_elem['popularity'] = artist_pop
        
        # Track Audio Features
        audio_features = sp.audio_features(playlist_elem["track_id"])[0]
        for feature in track_feature_list:
            playlist_elem[feature] = audio_features[feature]
            
        # Get Lyrics String 
        playlist_elem['lyrics'] = get_track_lyrics(playlist_elem["artist"],playlist_elem["trackName"])
        
            
        #print(playlist_elem) 
        # add track data to playlist dataframe     
        track_df = pd.DataFrame(playlist_elem, index = [0])
        playlist_df = pd.concat([playlist_df, track_df], ignore_index = True)
        
        count += 1
    
    drawProgressBar(1)  
    return playlist_df 
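
The offset-based paging used in `query_spotify_playlist` can be isolated into a small loop. The sketch below is not the notebook's code: `fetch_all_items` and `fake_fetch_page` are hypothetical names, and the fake page function stands in for the API so the example runs without credentials. It only assumes that each page is a dict with `items` and `next` keys, as in the responses used above.

```python
def fetch_all_items(fetch_page, page_size=100):
    """Collect every item from a paged endpoint.

    fetch_page(offset) must return a dict with 'items' (a list) and
    'next' (None on the last page).
    """
    items = []
    offset = 0
    while True:
        page = fetch_page(offset)
        items.extend(page["items"])
        if page["next"] is None:
            return items
        offset += page_size


def fake_fetch_page(offset, total=250, page_size=100):
    """Stand-in for the API call: serves the integers 0..total-1 in pages."""
    chunk = list(range(offset, min(offset + page_size, total)))
    has_more = offset + page_size < total
    return {"items": chunk, "next": "more" if has_more else None}


all_items = fetch_all_items(fake_fetch_page)
print(len(all_items))
```

With the real client, the page function could be something like `lambda offset: sp.playlist_tracks(playlist_id, offset = offset)`, which is the call already used in this notebook.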

#### Testing query_spotify_playlist

In [15]:
playlistID_1 = "37i9dQZF1DX4jsulumEbDn"
playlist1_df = query_spotify_playlist(playlistID_1)
playlist1_df.head()



Unnamed: 0,track_id,artist,artist_id,album,trackName,acousticness,danceability,duration_ms,energy,instrumentalness,...,liveness,loudness,mode,speechiness,tempo,time_signature,valence,genre,popularity,lyrics
0,2Ud3deeqLAG988pfW0Kwcl,LCD Soundsystem,066X20Nz7iquqkkCW6Jxy6,Sound of Silver,All My Friends,0.148,0.701,462267,0.788,0.625,...,0.0977,-5.611,1,0.0425,142.584,4,0.795,"['alternative dance', 'alternative rock', 'dan...",64.0,[Verse 1]\nThat's how it starts\nWe go back to...
1,2cmRpmO04TLaKPzmAzySYZ,LCD Soundsystem,066X20Nz7iquqkkCW6Jxy6,This Is Happening,Dance Yrself Clean,0.00557,0.739,536471,0.611,0.725,...,0.04,-9.829,1,0.0622,98.004,4,0.794,"['alternative dance', 'alternative rock', 'dan...",64.0,[Verse 1]\nWalking up to me expecting\nWalking...
2,53PkA8aXiwH4ppa0V0iO7o,LCD Soundsystem,066X20Nz7iquqkkCW6Jxy6,american dream,oh baby,0.0192,0.58,349693,0.622,0.653,...,0.648,-12.005,1,0.0352,169.442,4,0.781,"['alternative dance', 'alternative rock', 'dan...",64.0,Oh baby\nOh baby\nYou're having a bad dream\nH...
3,2VGDntFPvgvqSiUf9ITEfW,LCD Soundsystem,066X20Nz7iquqkkCW6Jxy6,Sound of Silver,Someone Great,0.0418,0.722,390013,0.894,0.00374,...,0.11,-5.903,1,0.0325,113.003,4,0.396,"['alternative dance', 'alternative rock', 'dan...",64.0,[Verse 1]\nI wish that we could talk about it\...
4,1XlDNpWy8dyEljyRd0RC2J,LCD Soundsystem,066X20Nz7iquqkkCW6Jxy6,LCD Soundsystem,Losing My Edge,0.000208,0.769,473067,0.959,0.631,...,0.331,-8.717,0,0.0634,116.168,4,0.893,"['alternative dance', 'alternative rock', 'dan...",64.0,"Yeah, I'm losing my edge\nI'm losing my edge\n..."


In [16]:
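#query_multiple_spotify_playlists: Takes a 2D list of playlists and returns one concatenated pandas dataframe 

# Parameters 
# :playlists: (2D list) rows of ['root genre','sub genre','playlist_id','playlist name']
# :echo: (bool) if True, print each playlist as it is pulled 

# Returns 
# :spotify_df: final dataframe with the tracks of every playlist, labelled with root_genre and sub_genre 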

def query_multiple_spotify_playlists(playlists, echo = False):
    
    # Create playlist_dict to hold all output playlists 
    playlist_dict = {}
    
    #Loop over every playlist in playlists  
    for playlist in playlists:
        
        playlist_root_genre = playlist[0]
        playlist_sub_genre = playlist[1]
        playlist_id = playlist[2]
        playlist_name = playlist[3]
        
        # Print current playlist being pulled from the API
        if echo:
            print(playlist_name + " " + playlist_id + " " + playlist_root_genre + " " + playlist_sub_genre)  
    
        # if API unable to find playlist data, skip playlist 
        #try:
        #query track data from playlist, save as dataframe 
        playlist_df = query_spotify_playlist(playlist_id)
        
            #label track data with its root genre and sub genre
        playlist_df = playlist_df.assign(root_genre = playlist_root_genre)
        playlist_df = playlist_df.assign(sub_genre = playlist_sub_genre)
        
            #add playlist_df to playlist_dict
        playlist_dict[playlist_name] = playlist_df
            
        #except:
         #   print("playlist Error: Unable to query " + playlist_name + " of genre " + playlist_root_genre)
          #  print(playlist_id)
           # continue
        if echo:
            print("")

        
    #check if all playlists dataframes added to dictionary     
    #assert len(playlists) == len(playlist_dict)
    #return dict of data_frames 
    print("")
    spotify_df = pd.concat(playlist_dict)
    return spotify_df
    

### Rewriting for Multithreading 

In [24]:
def multi_get_lyrics(playlist_elem):
    spotify_artist = playlist_elem["artist"]
    spotify_trackName = playlist_elem["trackName"]
    try:
        #Search the Genius API for the song; search_song expects the title first, then the artist
        track = genius.search_song(spotify_trackName,spotify_artist)
    except Exception:
        #wait briefly and retry once
        time.sleep(.5)
        track = genius.search_song(spotify_trackName,spotify_artist)
    #if track is None, then no matching song was found 
    if track is None:
        playlist_elem['lyrics'] = np.nan
    else:
        playlist_elem['lyrics'] = track.lyrics
    return playlist_elem


In [23]:
def multi_get_artist_info(playlist_elem):
    try:
        artist = sp.artist(playlist_elem['artist_id'])
    except:
        time.sleep(0.25)
        artist = sp.artist(playlist_elem['artist_id'])
        
    artist_pop = artist['popularity']
    playlist_elem['popularity'] = artist_pop
    track_genre_list = artist['genres']
    playlist_elem['spotify_genre_list'] = str(track_genre_list)
    return playlist_elem

In [22]:
def multi_get_audio_features(playlist_elem):
    # Track Audio Features
    track_feature_list = ["acousticness","danceability","duration_ms","energy","instrumentalness",
                          "key","liveness","loudness","mode","speechiness","tempo","time_signature","valence"]
    try:
        audio_features = sp.audio_features(playlist_elem["track_id"])[0]
    except Exception:
        #wait briefly and retry once
        time.sleep(0.25)
        audio_features = sp.audio_features(playlist_elem["track_id"])[0]
    for feature in track_feature_list:
        playlist_elem[feature] = audio_features[feature]

    return playlist_elem
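
The three `multi_get_*` helpers each retry once after a fixed sleep. A more general alternative, which is not what the notebook does, is a shared wrapper that retries several times with exponential backoff. The sketch below assumes that any exception is worth retrying; `call_with_retry` and `flaky_lookup` are hypothetical names, and the flaky function stands in for an API call.

```python
import time


def call_with_retry(func, *args, retries=3, base_delay=0.25, sleep=time.sleep):
    """Call func(*args), retrying on any exception with exponential backoff.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(retries + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == retries:
                raise
            # Wait base_delay, then 2x, then 4x, ... between attempts.
            sleep(base_delay * (2 ** attempt))


# Stand-in for an API call that fails twice before succeeding.
attempts = {"count": 0}


def flaky_lookup(track_id):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"track_id": track_id}


# Pass a no-op sleep so the example finishes instantly.
result = call_with_retry(flaky_lookup, "abc123", sleep=lambda seconds: None)
print(result, attempts["count"])
```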
        
    

In [6]:
#multi_query_spotify_playlist: Extracts track metadata from the raw items of one playlist.

# Parameters 
# :plst_info: (list) [root genre, sub genre, raw playlist items, playlist name], as returned by multi_get_all_tracks

# Returns 
# :playlist_data_list: a list of pandas Series, one per track 
 
def multi_query_spotify_playlist(plst_info):
    root_genre = plst_info[0]
    sub_genre = plst_info[1]
    playlist = plst_info[2]
    playlist_name = plst_info[3]
    
   
    
    #set list for the playlist 
    playlist_data_list = []
    
      

    #loop over all items in a playlist 
    for i in playlist:
        #progress = count / n 
        #drawProgressBar(progress)
    
            
        #playlist_elem = {}
        playlist_elem = pd.Series(dtype = object)
        # Metadata (hardcoded locations of metadata)
        playlist_elem["track_id"] = i["track"]["id"]
        playlist_elem["artist"] = i["track"]["album"]["artists"][0]["name"]
        playlist_elem["artist_id"] = i["track"]["album"]["artists"][0]["id"]
        playlist_elem["album"] = i["track"]["album"]["name"]
        playlist_elem["trackName"] = i["track"]["name"]
        
 
        #artist = sp.artist(i["track"]["album"]["artists"][0]["id"])
        #track_genre_list = artist['genres']
        #playlist_elem['genre'] = str(track_genre_list)
        playlist_elem['root genre'] = root_genre
        playlist_elem['sub genre'] = sub_genre
        
        # Artist Popularity
        #artist_pop = artist['popularity']
        #playlist_elem['popularity'] = artist_pop
  
        #print(playlist_elem) 
        # add track data to playlist dataframe     
        playlist_data_list.append(playlist_elem)
        
        #count += 1
    
    #drawProgressBar(1)  
    return playlist_data_list 



In [21]:
def multi_get_all_tracks(plst_info):
    
    root_genre = plst_info[0]
    sub_genre = plst_info[1]
    playlist_id = plst_info[2]
    playlist_name = plst_info[3]
    #pull playlist data from Spotify Web API
    try:
        playlist_raw_dict = sp.playlist_tracks(playlist_id)
    except:
        time.sleep(0.25)
        playlist_raw_dict = sp.playlist_tracks(playlist_id)
        
    playlist = playlist_raw_dict['items']
    check_if_next = playlist_raw_dict['next']
    total_len = playlist_raw_dict['total']
    
    #Handling playlist of over 100 songs, spotify web api limits to 100 tracks at a time
    if check_if_next != None:
        spotify_offset = 100
        while check_if_next != None:
            try:
                next_playlist_chunk = sp.playlist_tracks(playlist_id,offset = spotify_offset)
            except:
                time.sleep(0.25)
                next_playlist_chunk = sp.playlist_tracks(playlist_id,offset = spotify_offset)
            playlist.extend(next_playlist_chunk['items'])
            check_if_next = next_playlist_chunk['next']
            spotify_offset += 100
    return [root_genre,sub_genre,playlist,playlist_name]

    

In [30]:
def Api_refresh():
    warnings.filterwarnings('ignore')
    token = SpotifyClientCredentials(client_id=SPOTIFY_CLIENT_ID, client_secret=SPOTIFY_CLIENT_SECRET)
    cache_token = token.get_access_token()
    #rebuild the global sp and genius clients used by the API calls
    global sp
    global genius
    sp = spotipy.Spotify(cache_token)
    genius = Genius(GENIUS_CLIENT_ACCESS_TOKEN, verbose = False, timeout = 20, sleep_time = 0.5, remove_section_headers = True)

In [36]:
import concurrent.futures
from tqdm import tqdm
import warnings


def get_data(n_playlists):
    Api_refresh()
    
    genre_dict = {
    'pop' : ['pop','post-teen pop','dance pop','electropop','pop dance','indie pop'],
    'rap' : ['rap','hip hop','souther hip hop','gangster rap','trap','dirty south rap'],
    'edm' : ['edm','electro house','big room','pop edm','pop dance','complextro'],
    'r&b' : ['r&b', 'urban contemporary','new jack swing','neo soul','hip pop','pop r&b'],
    'country' : ['country', 'country road','contemporary country','moden country rock','country rock','country dawn'],
    'rock' : ['rock','album rock','classic rock','permanet wave','hard rock','modern rock']
    }
    

    playlist_list = generate_playlist_ids(genre_dict,n_playlists)

    n_max_threads = 10
    

    output_a = []
    with tqdm(total=len(playlist_list)) as pbar:
        with concurrent.futures.ThreadPoolExecutor(max_workers = n_max_threads) as executor:
            future_to_result = {executor.submit(multi_get_all_tracks, plylist ): plylist for plylist in playlist_list}
            for future in concurrent.futures.as_completed(future_to_result):
                
                try:
                    df = future.result()
                    output_a.append(df)
                    pbar.update(1) 
                except Exception:
                    continue

     # EXTRACT TRACK INFO FROM JSON AND LOOK UP ARTIST   
     #Refresh API 
    Api_refresh()
    n_max_threads = len(playlist_list)
    output_b = []
    #n_max_threads = min(10,len(output_a))
    with tqdm(total=len(output_a)) as pbar:
        with concurrent.futures.ThreadPoolExecutor(max_workers = n_max_threads) as executor:
            future_to_result = {executor.submit(multi_query_spotify_playlist, plylist ): plylist for plylist in output_a}
            for future in concurrent.futures.as_completed(future_to_result):
                
                try:
                    df = future.result()
                    output_b.append(df)
                    pbar.update(1) 
                except Exception:
                    continue
    #flatten to large list of tracks          
     #Refresh API 
    n_max_threads = 10
    Api_refresh()
    output_b_flat = [j for i in output_b for j in i]
    # GET AUDIO FEATURES FOR EACH TRACK          
    output_c = []
    #n_max_threads = min(10,len(output_b_flat))
    with tqdm(total=len(output_b_flat)) as pbar:
        with concurrent.futures.ThreadPoolExecutor(max_workers = n_max_threads) as executor:
            future_to_result = {executor.submit(multi_get_audio_features, elem ): elem for elem in output_b_flat}
            for future in concurrent.futures.as_completed(future_to_result):
                
                try:
                    df = future.result()
                    output_c.append(df)
                    pbar.update(1) 
                except Exception:
                    continue
    # GET ARTIST INFO (POPULARITY AND GENRES) FOR EACH TRACK   
    #Refresh API 
    Api_refresh()
    output_d = []
    #n_max_threads = min(10,len(output_c))
    with tqdm(total=len(output_c)) as pbar:
        with concurrent.futures.ThreadPoolExecutor(max_workers = n_max_threads) as executor:
            future_to_result = {executor.submit(multi_get_artist_info, elem ): elem for elem in output_c}
            for future in concurrent.futures.as_completed(future_to_result):
                
                try:
                    df = future.result()
                    output_d.append(df)
                    pbar.update(1) 
                except Exception:
                    continue
    #GET LYRICS
    #Refresh API 
    Api_refresh()
    output_e = []
    with tqdm(total=len(output_d)) as pbar:
        with concurrent.futures.ThreadPoolExecutor(max_workers = n_max_threads) as executor:
            future_to_result = {executor.submit(multi_get_lyrics, elem ): elem for elem in output_d}
            for future in concurrent.futures.as_completed(future_to_result):
                
                try:
                    df = future.result()
                    output_e.append(df)
                    pbar.update(1) 
                except Exception:
                    continue
                
    return output_e    
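
Each stage of `get_data` repeats the same thread-pool pattern and silently skips any item whose future raises, so failed items disappear from the output. A minimal sketch of factoring that pattern into one helper that also records the failures is shown below. `run_stage` and `double_unless_three` are hypothetical names; the second is a stand-in for an API call, and the progress bar is left out to keep the example free of third-party imports.

```python
import concurrent.futures


def run_stage(func, items, max_workers=10):
    """Apply func to every item on a thread pool.

    Returns (results, failures): results in completion order, and failures
    as (item, exception) pairs so dropped items can be inspected or retried.
    """
    results, failures = [], []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_item = {executor.submit(func, item): item for item in items}
        for future in concurrent.futures.as_completed(future_to_item):
            try:
                results.append(future.result())
            except Exception as exc:
                failures.append((future_to_item[future], exc))
    return results, failures


def double_unless_three(x):
    """Stand-in for an API call; fails for one input."""
    if x == 3:
        raise ValueError("simulated API failure")
    return x * 2


results, failures = run_stage(double_unless_three, [1, 2, 3, 4])
print(sorted(results), [item for item, _ in failures])
```

Keeping the failures makes it possible to tell how many tracks were lost at each stage and to retry only those.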

In [37]:
spotify_df = get_data(5)

100%|██████████| 177/177 [00:48<00:00,  3.64it/s]
 94%|█████████▍| 166/177 [01:45<00:07,  1.57it/s]
  2%|▏         | 589/33252 [00:43<24:07, 22.56it/s]  Max Retries reached
100%|█████████▉| 33251/33252 [54:53<00:00, 10.10it/s]  
  6%|▌         | 2011/33251 [03:19<13:21, 38.97it/s]  Max Retries reached
  7%|▋         | 2407/33251 [03:59<13:48, 37.22it/s]  Max Retries reached
Max Retries reached
Max Retries reached
 43%|████▎     | 14180/33251 [23:29<07:08, 44.48it/s]  Max Retries reached
Max Retries reached
 49%|████▊     | 16187/33251 [26:49<08:44, 32.53it/s]  Max Retries reached
 62%|██████▏   | 20497/33251 [33:59<04:40, 45.47it/s]  Max Retries reached
 69%|██████▉   | 23013/33251 [38:09<03:30, 48.69it/s]  Max Retries reached
Max Retries reached
 93%|█████████▎| 31038/33251 [51:29<01:00, 36.41it/s]Max Retries reached
Max Retries reached
100%|██████████| 33251/33251 [55:19<00:00, 10.02it/s]
 80%|███████▉  | 26561/33251 [2:20:59<35:30,  3.14it/s]  


In [38]:
len(spotify_df)

26561

In [39]:
spotify_df_cat = pd.DataFrame(spotify_df)

In [40]:
spotify_df_cat

        

Unnamed: 0,track_id,artist,artist_id,album,trackName,root genre,sub genre,acousticness,danceability,duration_ms,...,liveness,loudness,mode,speechiness,tempo,time_signature,valence,popularity,spotify_genre_list,lyrics
0,26auE2wH2FdGQWZf8aQ7G4,Halsey,26VFTg2z8YR0cCuwLzESi2,BADLANDS (Live From Webster Hall),Castle,pop,post-teen pop,0.252000,0.627,277623,...,0.0946,-7.461,0,0.0328,129.965,4,0.163,89,"['dance pop', 'electropop', 'etherpop', 'indie...","Sick of all these people talking, sick of all ..."
1,4k0yY041y3wLd41uDjqROv,Halsey,26VFTg2z8YR0cCuwLzESi2,BADLANDS (Live From Webster Hall),Gasoline,pop,post-teen pop,0.223000,0.731,199593,...,0.1290,-7.328,0,0.0399,120.001,4,0.319,89,"['dance pop', 'electropop', 'etherpop', 'indie...",Are you insane like me? Been in pain like me?\...
2,2qxmye6gAegTMjLKEBoR3d,Alec Benjamin,5IH6FPUwQTxPSXurCrcIov,Narrated For You,Let Me Down Slowly,pop,post-teen pop,0.740000,0.652,169354,...,0.1240,-5.714,0,0.0318,150.073,4,0.483,79,"['electropop', 'pop']",This night is cold in the kingdom\nI can feel ...
3,2nMeu6UenVvwUktBCpLMK9,Lana Del Rey,00FQb4jTyendYWaN8pK0wa,Young And Beautiful,Young And Beautiful,pop,post-teen pop,0.262000,0.324,236053,...,0.1100,-8.920,0,0.0368,113.986,4,0.151,85,"['art pop', 'pop']","I've seen the world, done it all\nHad my cake ..."
4,7i9AEaOWJrfVBsinUSefma,Melanie Martinez,63yrD80RY3RNEM2YDpUpO8,K-12,Detention,pop,post-teen pop,0.488000,0.832,236973,...,0.1030,-6.195,0,0.0676,94.998,4,0.783,82,"['dance pop', 'electropop', 'pop']",I'm not a bad guy\nSo don’t treat me bad if I'...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26556,0hd4rC19ldUBmabhSHxiwS,Gossip,3sFTupo9UGgrujjN21BjwR,Standing in the Way of Control,Standing In the Way of Control,rock,modern rock,0.006940,0.697,256400,...,0.2030,-6.174,1,0.0290,118.341,4,0.943,53,"['alternative dance', 'dance-punk', 'electropo...",Your back's against the wall\nThere's no-one h...
26557,1g8dRVzAvmSOTfTz5m0H9K,Civil Twilight,6i4aN0I3l7uldsLTjbZOF8,Civil Twilight,Letters From The Sky,rock,modern rock,0.561000,0.342,275040,...,0.0994,-5.955,1,0.0449,150.040,4,0.118,42,"['cape town indie', 'piano rock', 'south afric...",One of these days the sky's gonna break\nAnd e...
26558,3E7VTYxBDovFZIxUpobRur,Biffy Clyro,1km0R7wy712AzLkA1WjKET,Ellipsis (Deluxe),Animal Style,rock,modern rock,0.000448,0.472,210373,...,0.1420,-2.503,0,0.0684,139.728,4,0.635,66,"['modern alternative rock', 'modern rock', 'ro...",I picked it up slow but now it's just a ritual...
26559,1XsN9Flu0VvZpXmrkBtZGt,Stuck in the Sound,5sTzirFL1wjNa3GuSiUHsy,Pursuit,Let’s Go,rock,modern rock,0.009700,0.486,211453,...,0.1780,-5.070,0,0.0309,89.454,4,0.275,51,"['french rock', 'rock independant francais']","Here I am, tied and bound\nEvery night feeling..."


In [42]:
spotify_df_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26561 entries, 0 to 26560
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   track_id            26561 non-null  object 
 1   artist              26561 non-null  object 
 2   artist_id           26561 non-null  object 
 3   album               26561 non-null  object 
 4   trackName           26561 non-null  object 
 5   root genre          26561 non-null  object 
 6   sub genre           26561 non-null  object 
 7   acousticness        26561 non-null  float64
 8   danceability        26561 non-null  float64
 9   duration_ms         26561 non-null  int64  
 10  energy              26561 non-null  float64
 11  instrumentalness    26561 non-null  float64
 12  key                 26561 non-null  int64  
 13  liveness            26561 non-null  float64
 14  loudness            26561 non-null  float64
 15  mode                26561 non-null  int64  
 16  spee

## 5. Save & Export Data 

#### Save spotify_df to CSV and pickle 

In [41]:
spotify_df_cat.to_pickle("spotify_df.pkl")
spotify_df_cat.to_csv("spotify_df.csv")

In [59]:
spotify_df[153]

track_id                                         6hktRRKI55CMQJb0JhlBg2
artist                                                        Dubscribe
artist_id                                        2jQWqQ0Z06KiQnysJHn5Mq
album                                                          Absolute
trackName                                                    Get Wicked
root genre                                                          edm
sub genre                                                 electro house
acousticness                                                   0.000581
danceability                                                      0.355
duration_ms                                                      179582
energy                                                            0.956
instrumentalness                                                  0.172
key                                                                   9
liveness                                                        