# Data Mining & Wrangling
### Using The Spotify Developer API

In [3]:
# Import Packages
import spotipy
import requests
import sys
import spotipy.util as util
from spotipy.oauth2 import SpotifyClientCredentials
import numpy as np
import random
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import json
import time
import sys
import urllib
from sklearn.preprocessing import MultiLabelBinarizer

## Data Mining

### Connect With The Spotify API

To begin pulling playlist data from the Spotify API, first a connection with the API needs to be made. For this, both a so-called "client id" and "client secret id" are required. Once these "id's" are obtained, we follow the below outlined steps to set up the API connection:

In [3]:
# ID and Password for accessing Spotify API
client_id = "client_id"
client_secret = "client_secret_id"

# Setup the credentials
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)

# Make the connection
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

### Collect Spotify's Featured Playlist Data

The main idea of this project is twofold: (i) to infer about key predictors (whether track features or artist features) which are statistically significant in determining a playlist's success in terms of number of followers; and (ii) to create a custom playlist that is deemed to be succesful (i.e., would obtain many followers).

To this extent, the first step in doing any further analysis is to obtain the playlists we want to run our predictions on. We decide to focus on Spotify's own "featured" playlists - i.e., those produced by Spotify itself given specific genres / moods / artists etc.. 

The initial step is to pull Spotify's featured playlists and obtain a number of base playlist features.

In [None]:
# Get all spotify playlists
playlists = sp.user_playlists('spotify')

# Empty list to hold playlist information
spotify_playlists = []

# Loop to get data for each playlist
while playlists:
    
    for i, playlist in enumerate(playlists['items']):
        names = playlist['name']
        track_count = playlist['tracks']['total']
        ids = playlist['id']
        uri = playlist['uri']
        href = playlist['href']
        public = playlist['public']
        data_aggregation = names, track_count, ids, uri, href, public
        spotify_playlists.append(data_aggregation)
        
    if playlists['next']:
        playlists = sp.next(playlists)
    
    else:
        playlists = None

The obtained baseline playlist features are converted into a large dataframe next.

In [4]:
# Convert list into a dataframe
data = pd.DataFrame(np.array(spotify_playlists).reshape(len(spotify_playlists),6), 
                    columns=['Name', 'No. of Tracks', 'ID', 'URI', 'HREF', 'Public'])
data.head()

Unnamed: 0,Name,No. of Tracks,ID,URI,HREF,Public
0,Today's Top Hits,50,37i9dQZF1DXcBWIGoYBM5M,spotify:user:spotify:playlist:37i9dQZF1DXcBWIG...,https://api.spotify.com/v1/users/spotify/playl...,True
1,RapCaviar,63,37i9dQZF1DX0XUsuxWHRQd,spotify:user:spotify:playlist:37i9dQZF1DX0XUsu...,https://api.spotify.com/v1/users/spotify/playl...,True
2,mint,61,37i9dQZF1DX4dyzvuaRJ0n,spotify:user:spotify:playlist:37i9dQZF1DX4dyzv...,https://api.spotify.com/v1/users/spotify/playl...,True
3,Are & Be,51,37i9dQZF1DX4SBhb3fqCJd,spotify:user:spotify:playlist:37i9dQZF1DX4SBhb...,https://api.spotify.com/v1/users/spotify/playl...,True
4,Rock This,64,37i9dQZF1DXcF6B6QPhFDv,spotify:user:spotify:playlist:37i9dQZF1DXcF6B6...,https://api.spotify.com/v1/users/spotify/playl...,True


For each playlist, the number of followers is obtained - this number will be the response variable for our regression based models.

In [8]:
# Pull the number of followers per playlist
playlist_follower = []

# Loop over playlists and get followers
for i in range(0, len(data['URI'])-1): 
    
    # If number of followers is greater than 0
    if data['No. of Tracks'][i] > 0:
        uri = data['URI'][i]
        username = uri.split(':')[2]
        playlist_id = uri.split(':')[4]
        results = sp.user_playlist(username, playlist_id)
        followers = results['followers']['total']
        playlist_follower.append(followers)
    
    # If follower count is 0, append 0   
    else: 
        followers = 0
        playlist_follower.append(followers)

Finally - the number of followers is concatenated to the playlist dataframe.

In [9]:
# Add a new column for followers 
data['Followers'] = pd.DataFrame({'Followers': playlist_follower})
data.head()

Unnamed: 0,Name,No. of Tracks,ID,URI,HREF,Public,Followers
0,Today's Top Hits,50,37i9dQZF1DXcBWIGoYBM5M,spotify:user:spotify:playlist:37i9dQZF1DXcBWIG...,https://api.spotify.com/v1/users/spotify/playl...,True,18247159.0
1,RapCaviar,63,37i9dQZF1DX0XUsuxWHRQd,spotify:user:spotify:playlist:37i9dQZF1DX0XUsu...,https://api.spotify.com/v1/users/spotify/playl...,True,8375355.0
2,mint,61,37i9dQZF1DX4dyzvuaRJ0n,spotify:user:spotify:playlist:37i9dQZF1DX4dyzv...,https://api.spotify.com/v1/users/spotify/playl...,True,4616753.0
3,Are & Be,51,37i9dQZF1DX4SBhb3fqCJd,spotify:user:spotify:playlist:37i9dQZF1DX4SBhb...,https://api.spotify.com/v1/users/spotify/playl...,True,3806312.0
4,Rock This,64,37i9dQZF1DXcF6B6QPhFDv,spotify:user:spotify:playlist:37i9dQZF1DXcF6B6...,https://api.spotify.com/v1/users/spotify/playl...,True,4004115.0


Following the above outlined steps, we are able to produce a dataframe consisting of, in excess 1400, playlists with  relevant information such as playlist id, number of playlist tracks, and number of playlist followers.

### Collect Spotify Audio Features Per Track in Playlist

Using the dataframe of playlists - and specifically the playlist id column - we iterate over all tracks in every playlist and pull relevant audio features which could potentially be helpful in predicting the success of a playlist.
Audio features refers to acousticness, energy, key, valence and etc.

To this extent, we defin a function to pull all playlists' tracks.

In [10]:
# New function to get tracks in playlist
def get_playlist_tracks(username, playlist_id):
    results = sp.user_playlist_tracks(username, playlist_id)
    tracks = results['items']
    while results['next']:
        results = sp.next(results)
        tracks.extend(results['items'])
    return tracks

Running the feature extraction from Spotify could take a significant amount of time and also tend to raise errors in the process. To avoid losing information when such error occurs, a dictionary is used in cache memory.

In [11]:
# Subsample of data to pull
Spotify_playlists = data.iloc[0:10]

# Create playlist cache in memory
playlist_tracks = dict()

The playlists are prepped for audio feature extraction.

In [12]:
# Collect audio features per track per playlist
for playlist in Spotify_playlists["ID"]:
    if Spotify_playlists.loc[Spotify_playlists['ID'] == playlist, 'No. of Tracks'].item() > 0:
        try:
            playlist_tracks[playlist] = get_playlist_tracks('spotify', playlist)
            time.sleep(random.randint(1, 3))
        except:
            pass

In [14]:
# Define an example list of songs for the first 10 playlists
songs_playlist = []

for item,playlist in enumerate(playlist_tracks):
    track_len = len(playlist_tracks[playlist])
    for song_item,song in enumerate(playlist_tracks[playlist]):
        songs_playlist.append((playlist,playlist_tracks[playlist][song_item]['track']['id']))
        
print("Number of Songs in Playlists: {}".format(len(songs_playlist)))

Number of Songs in Playlists: 663


Again, a dictionary in cache memory is set up for the main audio feature extraction loop.

In [15]:
# Create audio feature dictionary and set sleeping time thresholds
songs = [item[1] for item in songs_playlist]

audio_feat = dict()
limit_songs_small = 10
limit_songs_medium = 200

Audio features are extracted using the below code - note running this code on all playlists takes a significant amount of time (measured in hours).

In [16]:
# Audio feature extraction - saves information in cache
for item,song in enumerate(songs):
    if song not in audio_feat:
        try:
            audio_feat[song] = sp.audio_features(song)
        except:
            pass

        if item % limit_songs_small == 0:
            time.sleep(random.randint(0, 1))

        if item % limit_songs_medium == 0:
            time.sleep(random.randint(0, 1))

        out = np.floor(item * 1. / len(songs_playlist) * 100)
        sys.stdout.write("\r%d%%" % out)
        sys.stdout.flush()

sys.stdout.write("\r%d%%" % 100)

100%

Once all the audio features are extracted, they are converted into the main audio feature dataframe and saved as a large csv file.

In [17]:
# Convert raw data into dictionaries
acousticness = dict()
danceability = dict()
duration_ms = dict()
energy = dict()
instrumentalness = dict()
key = dict()
liveness = dict()
loudness = dict()
mode = dict()
speechiness = dict()
tempo = dict()
time_signature = dict()
valence = dict()

for item,song in enumerate(audio_feat):
    try:
        acousticness[song] = audio_feat[song][0]['acousticness']
        danceability[song] = audio_feat[song][0]['danceability']
        duration_ms[song] = audio_feat[song][0]['duration_ms']
        energy[song] = audio_feat[song][0]['energy']
        instrumentalness[song] = audio_feat[song][0]['instrumentalness']
        key[song] = audio_feat[song][0]['key']
        liveness[song] = audio_feat[song][0]['liveness']
        loudness[song] = audio_feat[song][0]['loudness']
        mode[song] = audio_feat[song][0]['mode']
        speechiness[song] = audio_feat[song][0]['speechiness']
        tempo[song] = audio_feat[song][0]['tempo']
        time_signature[song] = audio_feat[song][0]['time_signature']
        valence[song] = audio_feat[song][0]['valence']
    except TypeError:
        pass

In [18]:
# Creation of audio feature dataframes from dictionaries
acc_df = pd.DataFrame(pd.Series(acousticness)).reset_index().rename(columns={'index': 'song', 0: 'acousticness'})
dan_df = pd.DataFrame(pd.Series(danceability)).reset_index().rename(columns={'index': 'song', 0: 'dance'})
dur_df = pd.DataFrame(pd.Series(duration_ms)).reset_index().rename(columns={'index': 'song', 0: 'duration'})
ene_df = pd.DataFrame(pd.Series(energy)).reset_index().rename(columns={'index': 'song', 0: 'energy'})
inst_df = pd.DataFrame(pd.Series(instrumentalness)).reset_index().rename(columns={'index': 'song', 0: 'instrumentalness'})
key_df = pd.DataFrame(pd.Series(key)).reset_index().rename(columns={'index': 'song', 0: 'key'})
live_df = pd.DataFrame(pd.Series(liveness)).reset_index().rename(columns={'index': 'song', 0: 'liveness'})
loud_df = pd.DataFrame(pd.Series(loudness)).reset_index().rename(columns={'index': 'song', 0: 'loudness'})
mode_df = pd.DataFrame(pd.Series(mode)).reset_index().rename(columns={'index': 'song', 0: 'mode'})
spee_df = pd.DataFrame(pd.Series(speechiness)).reset_index().rename(columns={'index': 'song', 0: 'speech'})
temp_df = pd.DataFrame(pd.Series(tempo)).reset_index().rename(columns={'index': 'song', 0: 'tempo'})
time_df = pd.DataFrame(pd.Series(time_signature)).reset_index().rename(columns={'index': 'song', 0: 'time'})
vale_df = pd.DataFrame(pd.Series(valence)).reset_index().rename(columns={'index': 'song', 0: 'valence'})

In [19]:
# Merge individual dataframes into one features dataframe
playlist_df = pd.DataFrame(songs_playlist,columns=['playlist','song'])

frame_V1 = [acc_df,dan_df,dur_df,ene_df,inst_df,key_df,live_df,loud_df,mode_df,spee_df,temp_df,time_df,vale_df]
features = pd.concat(frame_V1,axis=1).T.groupby(level=0).first().T

frame_V2 = [features,playlist_df]
features_df = pd.concat(frame_V2,axis=1).T.groupby(level=0).first().T.dropna()

features_df.head()

Unnamed: 0,acousticness,dance,duration,energy,instrumentalness,key,liveness,loudness,mode,playlist,song,speech,tempo,time,valence
0,0.365,0.307,258933,0.481,0.0,3,0.207,-8.442,0,37i9dQZF1DXcBWIGoYBM5M,00kkWwGsR9HblTUHb3BmdX,0.128,68.894,3,0.329
1,0.993,0.322,160897,0.0121,0.927,5,0.127,-31.994,1,37i9dQZF1DXcBWIGoYBM5M,01T3AjynqSMVfiAQCAfrKJ,0.0491,112.464,4,0.118
2,0.994,0.375,58387,0.00406,0.908,7,0.0842,-31.824,0,37i9dQZF1DXcBWIGoYBM5M,02BumRY2OTFMkMxrXSVMat,0.0671,139.682,1,0.358
3,0.992,0.393,288280,0.0429,0.925,9,0.0821,-25.727,0,37i9dQZF1DXcBWIGoYBM5M,02mkkozonPEDCenOhuWwLc,0.0341,135.405,4,0.0394
4,0.992,0.373,99867,0.117,0.909,10,0.111,-25.222,0,37i9dQZF1DXcBWIGoYBM5M,02xmGU9unopKjpblPRC67j,0.0511,125.288,3,0.189


In [20]:
# Save as csv file
features_df.to_csv('track_features(track_indices).csv', sep=',')

### Collect Spotify Artist Information Per Track in Playlist

Following a similar procedure as the audio feature extraction, artist information for every track in every playlist is extracted next.

First, a function is defined to retrieve artist information given an artist name.

In [21]:
# Subsample of data to pull
Spotify_playlists = data.iloc[0:10]

# Collect artist information per track found in Step II
playlist_tracks = dict()

In [22]:
# New function to get artists in playlist
def get_artist(name):
    results = sp.search(q='artist:' + name, type='artist')
    items = results['artists']['items']
    if len(items) > 0:
        return items[0]
    else:
        return None

The playlists are prepped for audio feature extraction.

In [23]:
# Collect tracks per playlist
for playlist in Spotify_playlists["ID"]:
    if Spotify_playlists.loc[Spotify_playlists['ID'] == playlist, 'No. of Tracks'].item() > 0:
        try:
            playlist_tracks[playlist] = get_playlist_tracks('spotify', playlist)
            time.sleep(random.randint(1, 3))
        except:
            pass

In [24]:
# Define an example list of songs for the first 10 playlists
artist_list = []
song_dict = dict()
playlist_dict = dict()

for play_index,playlist in enumerate(playlist_tracks):
    songs = playlist_tracks[playlist]
    for song_index,song in enumerate(songs):
        no_artists = len(song['track']['artists'])
        for number in range(no_artists):
            name = song['track']['artists'][number]['name']
            song_id = song['track']['id']
            artist_list.append((playlist,song_id,name))
            song_dict[name] = song_id
            playlist_dict[name] = playlist

Again, a dictionary in cache memory is setup for the main artist feature extraction loop.

In [25]:
# Create artist feature dictionary and set sleeping time thresholds
artists = list(set([item[2] for item in artist_list]))

artist_info = dict()
limit_artist_small = 10
limit_artist_medium = 200

Artist features are extracted using the code below - note running this code on all playlists takes a significant amount of time (measured in hours).

In [26]:
# Artist feature extraction - saves information in cache
for item,artist in enumerate(artists):
    if artist not in artist_info:
        try:
            artist_info[artist] = get_artist(artist)
        except:
            pass
    
    if item % limit_artist_small == 0:
        time.sleep(random.randint(0, 1))
    
    if item % limit_artist_medium == 0:
        time.sleep(random.randint(0, 1))
        
    out = np.floor(item * 1. / len(artists) * 100)
    sys.stdout.write("\r%d%%" % out)
    sys.stdout.flush()

sys.stdout.write("\r%d%%" % 100)

100%

Once all the artist features are extracted, they are converted into the main artist feature dataframe and saved as a large csv file.

In [27]:
# Convert raw data into dictionaries
followers = dict()
genres = dict()
popularity = dict()

for item,artist in enumerate(artist_info):
    try:
        followers[artist] = artist_info[artist]['followers']['total']
        genres[artist] = artist_info[artist]['genres']
        popularity[artist] = artist_info[artist]['popularity']
    except TypeError:
        pass

In [28]:
# Creation of artist feature dataframes from dictionaries
follow_df = pd.DataFrame(pd.Series(followers)).reset_index().rename(columns={'index': 'artist', 0: 'followers'})
genres_df = pd.DataFrame(pd.Series(genres)).reset_index().rename(columns={'index': 'artist', 0: 'genres'})
popularity_df = pd.DataFrame(pd.Series(popularity)).reset_index().rename(columns={'index': 'artist', 0: 'popularity'})
song_df = pd.DataFrame(pd.Series(song_dict)).reset_index().rename(columns={'index': 'artist', 0: 'song'})
playlist_df = pd.DataFrame(pd.Series(playlist_dict)).reset_index().rename(columns={'index': 'artist', 0: 'playlist'})

In [29]:
# Merge individual dataframes into one features dataframe
frame_V1 = [follow_df,genres_df,popularity_df,song_df, playlist_df]
artist_information = pd.concat(frame_V1,axis=1).T.groupby(level=0).first().T
artist_information.head()

Unnamed: 0,artist,followers,genres,playlist,popularity,song
0,10 Years,157035,"[alternative metal, nu metal, post-grunge, rap...",37i9dQZF1DXcF6B6QPhFDv,63,0uyDAijTR0tOuH24hxDhE5
1,21 Savage,2323273,"[dwn trap, rap, trap music]",37i9dQZF1DX0XUsuxWHRQd,98,2vaMWMPMgsWX4fwJiKmdWm
2,24hrs,28839,"[dwn trap, trap music, underground hip hop]",37i9dQZF1DX0XUsuxWHRQd,73,2c5D6B8oXAwc6easamdgVA
3,3LAU,175224,"[big room, brostep, deep big room, edm, electr...",37i9dQZF1DX4JAvHpjipBk,67,6yxobtnNHKRAA0cvoNxJhe
4,50 Cent,2686486,"[east coast hip hop, gangster rap, hip hop, po...",37i9dQZF1DX0XUsuxWHRQd,85,32aYDW8Qdnv1ur89TUlDnm


In [30]:
# Save as csv file
artist_information.to_csv('artists(track_indices).csv', sep=',')

## Data Wrangling

### Loading Data Frames

Once all data is extracted from Spotify, the next step is to combine the separate dataframes (i.e., for playlists, audio features and artists) and to perform some initial feature engineering in the hope of creating useful data for inference and prediction of playlist success.

The first step is to load all the dataframes separately.

In [48]:
# Load playlist dataframe
playlist_df = pd.read_csv('Playlist.csv')
playlist_df.head()

Unnamed: 0.1,Unnamed: 0,Name,No. of Tracks,ID,URI,HREF,Public,Followers
0,0,Today's Top Hits,50,37i9dQZF1DXcBWIGoYBM5M,spotify:user:spotify:playlist:37i9dQZF1DXcBWIG...,https://api.spotify.com/v1/users/spotify/playl...,True,18079985.0
1,1,RapCaviar,61,37i9dQZF1DX0XUsuxWHRQd,spotify:user:spotify:playlist:37i9dQZF1DX0XUsu...,https://api.spotify.com/v1/users/spotify/playl...,True,8283836.0
2,2,mint,61,37i9dQZF1DX4dyzvuaRJ0n,spotify:user:spotify:playlist:37i9dQZF1DX4dyzv...,https://api.spotify.com/v1/users/spotify/playl...,True,4593498.0
3,3,Are & Be,51,37i9dQZF1DX4SBhb3fqCJd,spotify:user:spotify:playlist:37i9dQZF1DX4SBhb...,https://api.spotify.com/v1/users/spotify/playl...,True,3773823.0
4,4,Rock This,60,37i9dQZF1DXcF6B6QPhFDv,spotify:user:spotify:playlist:37i9dQZF1DXcF6B6...,https://api.spotify.com/v1/users/spotify/playl...,True,3989695.0


In [42]:
# Load track features dataframe
tracks_df = pd.read_csv('tracks_df_sub.csv').drop(['Unnamed: 0','Unnamed: 0.1'],axis=1)
tracks_df.head()

Unnamed: 0,acousticness,dance,duration,energy,instrumentalness,key,liveness,loudness,mode,playlist,song,speech,tempo,time,valence
0,0.0395,0.299,214973,0.921,0.737,4,0.589,-6.254,1,37i9dQZF1DXcBWIGoYBM5M,0076oEQq8IToGfnzU3bTHY,0.193,174.982,4,0.0532
1,0.365,0.307,258933,0.481,0.0,3,0.207,-8.442,0,37i9dQZF1DXcBWIGoYBM5M,00kkWwGsR9HblTUHb3BmdX,0.128,68.894,3,0.329
2,0.0787,0.63,261731,0.656,0.000906,0,0.0953,-6.423,0,37i9dQZF1DXcBWIGoYBM5M,01JkrDSrakX5UO5knhpKNA,0.0276,133.012,4,0.432
3,0.000192,0.521,188834,0.837,0.051,5,0.0929,-4.581,1,37i9dQZF1DXcBWIGoYBM5M,01KsbekyuQQXpVnxIfNRaC,0.122,80.027,4,0.623
4,0.993,0.322,160897,0.0121,0.927,5,0.127,-31.994,1,37i9dQZF1DXcBWIGoYBM5M,01T3AjynqSMVfiAQCAfrKJ,0.0491,112.464,4,0.118


In [38]:
# Load artist information dataframe
artist_df_sub = pd.read_csv('artist_df_sub.csv').drop(['Unnamed: 0','Unnamed: 0.1'],axis=1)
artist_df_sub.head()

Unnamed: 0,artist,followers,genres,playlist,popularity,song
0,*NSYNC,498511.0,"['boy band', 'dance pop', 'europop', 'pop', 'p...",37i9dQZF1DWXDAhqlN7e6W,75.0,35zGjsxI020C2NPKp2fzS7
1,10 Years,154800.0,"['alternative metal', 'nu metal', 'post-grunge...",37i9dQZF1DWWJOmJ7nRx0C,63.0,4qmoz9OUEBaXUzlWQX4ZU4
2,2 Chainz,1926728.0,"['dwn trap', 'pop rap', 'rap', 'southern hip h...",37i9dQZF1DX7QOv5kjbU68,91.0,4XoP1AkbOurU9CeZ2rMEz2
3,21 Savage,2224587.0,"['dwn trap', 'rap', 'trap music']",37i9dQZF1DX7QOv5kjbU68,98.0,4ckuS4Nj4FZ7i3Def3Br8W
4,24hrs,27817.0,"['dwn trap', 'trap music', 'underground hip hop']",37i9dQZF1DX0XUsuxWHRQd,74.0,2c5D6B8oXAwc6easamdgVA


As we can see from the above - artists are grouped by a list of genres by Spotify. Therefore,genres are one-hot encoded in order to make these genre lists predictors that we can run models on.

In [40]:
# One-hot encode genre labels
mlb = MultiLabelBinarizer(sparse_output=True)
pre_data = mlb.fit_transform(artist_df_sub['genres'].str.split(','))
classes = [i.strip('[]') for i in mlb.classes_]
genre_sub = pd.DataFrame(pre_data.toarray(),columns=classes)
_, i = np.unique(genre_sub.columns, return_index=True)
genre_sub = genre_sub.iloc[:, i]

# Drop genre column from artist sub dataframe
artist_df_sub_mid = artist_df_sub.drop('genres', axis=1)

# Concatenate artist sub dataframe and genre dataframe
artist_sub_frames = [artist_df_sub_mid,genre_sub]
artist_df = pd.concat(artist_sub_frames,axis=1,join='inner')

Once all the genres are one-hot encoded, the dataframes are grouped by playlist to enable the following feature engineering.

In [43]:
# Group-by function on artists
group_artists_by_playlist = artist_df.groupby('playlist') 
print("Number of playlists: ", len(group_artists_by_playlist))

# Group-by function on tracks
group_tracks_by_playlist = tracks_df.groupby('playlist')
print("Number of playlists: ", len(group_tracks_by_playlist))

Number of playlists:  1546
Number of playlists:  1465


### Feature Engineering

In terms of artists, feature engineering led to the following predictors:

* Thirty columns represent the names of top 30 artists (in terms of appearing most often in popular playlists). They are categorical variables indicating whether a playlist has a specific artist.
* Five columns represent the number of times top 50 artists (in terms of artist followers in aggregate) appear in the playlists (bucketed in 10 artists each)
* Two columns represent the mean and standard deviation of artists followers per playlist
* Two columns represent the mean and standard deviation of artists popularity per playlist
* Artist genres are one-hot encoded

First, the top 50 artists (in terms of number of Spotify followers) are extracted. Then, we count the amount of times these artists show up in a given playlist and record the counts as predictors in the final dataframe.

In [44]:
top_10_followers = list(artist_df.sort_values('followers',ascending=False)['artist'].unique()[:10])
top_10_20_followers = list(artist_df.sort_values('followers',ascending=False)['artist'].unique()[10:20])
top_20_30_followers = list(artist_df.sort_values('followers',ascending=False)['artist'].unique()[20:30])
top_30_40_followers = list(artist_df.sort_values('followers',ascending=False)['artist'].unique()[30:40])
top_40_50_followers = list(artist_df.sort_values('followers',ascending=False)['artist'].unique()[40:50])

artist_df['top_0_10'] = np.where(artist_df['artist'].isin(top_10_followers), 1, 0)
artist_df['top_10_20'] = np.where(artist_df['artist'].isin(top_10_20_followers), 1, 0)
artist_df['top_20_30'] = np.where(artist_df['artist'].isin(top_20_30_followers), 1, 0)
artist_df['top_30_40'] = np.where(artist_df['artist'].isin(top_30_40_followers), 1, 0)
artist_df['top_40_50'] = np.where(artist_df['artist'].isin(top_40_50_followers), 1, 0)

Second, we obtain the list of 30 artists who appear most often in playlists with 35,000+ followers. By looping over the playlists, the additional predictors are created as below.

In [5]:
popular_artists=['Post Malone', 'JAY Z', 'Lil Wayne', 'Rihanna', '21 Savage',
       'Young Thug', 'A$AP Rocky', 'Galantis', 'Van Morrison',
       'Chance The Rapper', 'Led Zeppelin', 'Otis Redding',
       'Axwell /\\ Ingrosso', 'Wiz Khalifa', 'Yo Gotti', 'Ryan Adams',
       'Miguel', 'Birdy', 'John Mayer', 'Kanye West', 'First Aid Kit',
       'Deorro', 'Ellie Goulding', 'Radiohead', 'Commodores', 'Diddy',
       'SZA', 'Nicki Minaj', 'SYML']

In [45]:
# Artist feature engineering
artist_feature_list=[]

for key, item in group_artists_by_playlist:
    
    #add in top 30 artists
    category_artist_count=[]
    for ele in popular_artists:
        present=False
        for artist in item['artist']:
            if ele==artist:
                present=True
        category_artist_count.append(present*1)
    
    followers_mean=item['followers'].mean()
    followers_std=item['followers'].std()
    
    popularity_mean=item['popularity'].mean()
    popularity_std=item['popularity'].std()
    
    top_10 = item['top_0_10'].sum()
    top_10_20 = item['top_10_20'].sum()
    top_20_30 = item['top_20_30'].sum()
    top_30_40 = item['top_30_40'].sum()
    top_40_50 = item['top_40_50'].sum()
    
    tmp=[key, followers_mean,followers_std,popularity_mean,popularity_std,\
         top_10,top_10_20,top_20_30,top_30_40,top_40_50]
    for i in range(len(popular_artists)):
        tmp.append(category_artist_count[i])
    artist_feature_list.append(tuple(tmp))
    
# Save feature names
artist_feature_names = ['followers_mean','followers_std','popularity_mean','popularity_std',
                       'top_0_10','top_10_20','top_20_30','top_30_40','top_40_50']
for i in range(len(popular_artists)):
        artist_feature_names.append(popular_artists[i])

All the genres in a playlist are encoded to ones in the one-hot encoded genre columns.

In [46]:
# Splitting of genres and enumeration per playlist
genre_list = []

for key, item in group_artists_by_playlist:
    for genre in classes:
        genre_list.append(item[genre].max())

Finally, the main artist data frame is created below:

In [49]:
# Reshape genres into array of proper dimensions
genre_arr = np.array(genre_list).reshape(len(artist_feature_list),len(classes))

# Create genre sub dataframe per playlist
artist_genres_df = pd.DataFrame(genre_arr)
artist_genres_df.columns = classes

#dataframe for artist grouped by playlist
artist_features_df = pd.DataFrame(artist_feature_list).set_index(0)
artist_features_df.columns = artist_feature_names

# column for number of followers
artist_features_df['Playlist_Followers'] = playlist_df[['Followers']].groupby(playlist_df['ID']).first()
artist_features_df['ID']=artist_features_df.index

artist_main_df = artist_features_df.reset_index().drop(0, axis=1)
artist_main_df.head()

Unnamed: 0,followers_mean,followers_std,popularity_mean,popularity_std,top_0_10,top_10_20,top_20_30,top_30_40,top_40_50,Playlist_Followers,ID
0,134413.666667,365459.0,42.833333,19.575645,0,0,0,0,0,24.0,01WIu4Rst0xeZnTunWxUL7
1,103320.580645,332015.0,48.903226,15.029648,0,0,0,0,0,330.0,05dTMGk8MjnpQg3bKuoXcc
2,566814.56,1427308.0,60.28,15.512146,0,0,0,1,0,73.0,070FVPBKvfu6M5tf4I9rt2
3,199831.484848,295385.9,58.69697,15.62747,0,0,0,0,0,6173.0,08vPKM3pmoyF6crB2EtASQ
4,223253.774194,491843.8,49.516129,19.489948,0,0,0,0,0,145.0,08ySLuUm0jMf7lJmFwqRMu


In [50]:
# Concatenate grouped artist sub dataframe and genre dataframe
artist_sub_groups = [artist_main_df,artist_genres_df]
artist_df_groups = pd.concat(artist_sub_groups,axis=1,join='inner')
artist_df_groups = artist_df_groups.rename(columns={'': "'no_genre'"})

Similar to the artist feature engineering, the playlists' audio features are engineered next. Specifically, for each audio feature (such as acousticness, duraition, energy) mined from Spotify, the mean and standard deviation across all playlist tracks is computed.

In [51]:
# Feature Engineering for track df: save to feature_list 
feature_list = []

for key, item in group_tracks_by_playlist:

    acousticness_mean =item['acousticness'].mean()
    acousticness_std = item['acousticness'].std()
    
    dance_mean =item['dance'].mean()
    dance_std = item['dance'].std()
    
    duration_mean =item['dance'].mean()
    duration_std = item['dance'].std()
    
    energy_mean =item['energy'].mean()
    energy_std = item['energy'].std()
    
    instrumentalness_mean =item['instrumentalness'].mean()
    instrumentalness_std = item['instrumentalness'].std()
    
    key_mean =item['energy'].mean()
    key_std = item['energy'].std()
    
    liveness_mean =item['liveness'].mean()
    liveness_std = item['liveness'].std()
    
    loudness_mean =item['loudness'].mean()
    loudness_std = item['loudness'].std()
    
    mode_mean =item['mode'].mean()
    mode_std = item['mode'].std()
    
    speech_mean =item['speech'].mean()
    speech_std = item['speech'].std()
    
    tempo_mean =item['tempo'].mean()
    tempo_std = item['tempo'].std()
    
    time_mean =item['time'].mean()
    time_std = item['time'].std()
    
    valence_mean =item['valence'].mean()
    valence_std = item['valence'].std()
        
    feature_list.append((key, acousticness_mean, acousticness_std, dance_mean, dance_std, energy_mean, energy_std, 
                        instrumentalness_mean, instrumentalness_std, key_mean, key_std, liveness_mean, liveness_std,
                        loudness_mean, loudness_std, mode_mean, mode_std, speech_mean, speech_std, tempo_mean, tempo_std,
                        time_mean, time_std, valence_mean, valence_std))
# Save feature names
feature_names =  ['acousticness_mean','acousticness_std','dance_mean', 'dance_std', 'energy_mean', 'energy_std', 
                        'instrumentalness_mean', 'instrumentalness_std', 'key_mean', 'key_std', 'liveness_mean', 
                        'liveness_std','loudness_mean', 'loudness_std', 'mode_mean', 'mode_std', 'speech_mean', 
                        'speech_std','tempo_mean', 'tempo_std','time_mean', 'time_std', 'valence_mean', 'valence_std',
                  ]

The engineered audio features are converted into a dataframe as follows:

In [52]:
features_df = pd.DataFrame(feature_list).set_index(0)
features_df.columns = feature_names

# Column for number of followers
features_df['Followers'] = playlist_df[['Followers']].groupby(playlist_df['ID']).first()
features_df['ID'] = features_df.index

features_main_df = features_df.reset_index().drop(0, axis=1)
features_main_df.head()

Unnamed: 0,acousticness_mean,acousticness_std,dance_mean,dance_std,energy_mean,energy_std,instrumentalness_mean,instrumentalness_std,key_mean,key_std,...,speech_mean,speech_std,tempo_mean,tempo_std,time_mean,time_std,valence_mean,valence_std,Followers,ID
0,0.641282,0.326942,0.467911,0.241057,0.27594,0.225821,0.11965,0.277109,0.27594,0.225821,...,0.383051,0.403365,101.045969,51.857504,3.338462,1.553996,0.319263,0.246235,24.0,01WIu4Rst0xeZnTunWxUL7
1,0.249844,0.321182,0.55514,0.172088,0.666567,0.230578,0.077776,0.240452,0.666567,0.230578,...,0.13726,0.226812,130.850167,30.525135,4.0,0.454859,0.496127,0.256787,6198.0,056jpfChuMP5D1NMMaDXRR
2,0.278816,0.262749,0.634392,0.14027,0.596,0.166902,0.192559,0.34146,0.596,0.166902,...,0.08221,0.131105,122.768255,28.215783,4.0,0.2,0.656235,0.245299,330.0,05dTMGk8MjnpQg3bKuoXcc
3,0.22881,0.251421,0.6004,0.178801,0.6122,0.192433,0.179571,0.336604,0.6122,0.192433,...,0.05215,0.025935,114.439167,21.997673,4.0,0.262613,0.481787,0.251199,73.0,070FVPBKvfu6M5tf4I9rt2
4,0.394114,0.362573,0.599424,0.151256,0.541097,0.289705,0.203059,0.332371,0.541097,0.289705,...,0.106724,0.112448,110.134788,25.125111,4.0,0.353553,0.511997,0.243171,6173.0,08vPKM3pmoyF6crB2EtASQ


Finally, the last step is to create the main dataframe using an inner merge on both the audio feature dataframe and artist dataframe. This inner merge leads to a loss of 126 playlists in total (i.e., there was no overlap between the two dataframes across these playlists).

In [53]:
# Concatenate the two dataframes
master_df = pd.merge(features_main_df, artist_df_groups, how='inner', on='ID')
master_df.head()

Unnamed: 0,acousticness_mean,acousticness_std,dance_mean,dance_std,energy_mean,energy_std,instrumentalness_mean,instrumentalness_std,key_mean,key_std,...,'wrestling','wrock','ye ye','yoik','zapstep','zeuhl','zim','zolo','zydeco','no_genre'
0,0.641282,0.326942,0.467911,0.241057,0.27594,0.225821,0.11965,0.277109,0.27594,0.225821,...,0,0,0,0,0,0,0,0,0,1
1,0.278816,0.262749,0.634392,0.14027,0.596,0.166902,0.192559,0.34146,0.596,0.166902,...,0,0,0,0,0,0,0,0,0,1
2,0.22881,0.251421,0.6004,0.178801,0.6122,0.192433,0.179571,0.336604,0.6122,0.192433,...,0,0,0,0,0,0,0,0,0,1
3,0.394114,0.362573,0.599424,0.151256,0.541097,0.289705,0.203059,0.332371,0.541097,0.289705,...,0,0,0,0,0,0,0,0,0,1
4,0.194509,0.27847,0.531067,0.150001,0.7594,0.249805,0.115499,0.25802,0.7594,0.249805,...,0,0,0,0,0,0,0,0,0,1


The master dataframe is saved for both EDA and modelling purposes next and final dataframe size is presented.

In [54]:
master_df.to_csv('spotify_data_master.csv', sep=',')

In [56]:
print("Number of Playlists: {}".format(master_df.shape[0]))
print("Number of Predictors: {}".format(master_df.shape[1]))

Number of Playlists: 1420
Number of Predictors: 3245


### String Parsing / Natural Language Processing

Here, we further analyze the names of the playlist based on the rationale that listeners usually search for key terms like 'Best', 'Hit', 'Workout' when they look for certain type of playlists. Due to the small size of our data, we adopt the string parsing approach for our model (which could be easily scaled with Python's NLTK package in larger models) as we do not the number of predictors to exceed the dimensions of our model. 

- After reading in the full dataset and the playlist dataset, we perform a left join based on playlist ID and add the playlist name to the full dataset
- We search for 12 categories of specific strings that cover 'Best', 'Workout', 'Party', 'Chill', 'Acoustic', '2000s', '1990s', '1980s', '1970s', '1960s', and '1950s' using the str.contain function
- After creating these 12 boolean variables, we transform them to binary ones (0 or 1) by multiplying 1
- Lastly, we include those binary variables in the dataframe as predictor variables

In [5]:
# Read-in the full data set
full_df = pd.read_csv('data/spotify_data_master_V3.csv')

# Drop the first index column as it is a duplicate
full_df = full_df.drop("Unnamed: 0", axis=1)

# Filter non-zero genre columns only
full_df = full_df.loc[:, (full_df != 0).any(axis=0)]

In [6]:
# Read in playlist df for merging
playlist_df = pd.read_csv('data/Playlist.csv')

# Drop the first index column as it is a duplicate
playlist_df = playlist_df.drop("Unnamed: 0", axis=1)

In [7]:
# Left Join by Playlist ID
new_df = pd.merge(full_df, playlist_df[['Name', 'ID']], on='ID', how='left')
new_df.shape

(1420, 1494)

In [8]:
# Make list of duplicate columns to drop
duplicate_columns = []
for i in full_df.columns:
    if i[-1] == '1': 
        duplicate_columns.append(i)

In [9]:
# Drop columns not to be used in analysis
full_df_concise = new_df.drop(duplicate_columns, inplace=False, axis=1)
full_df_concise = full_df_concise.drop(['Playlist_Followers','ID'], inplace=False, axis=1)

In [10]:
# Search For Sub Strings
Str_Best = full_df_concise.Name.str.contains('Best|Top|Hit|best|top|hit|Hot|hot|Pick|pick')
Str_Workout = full_df_concise.Name.str.contains('Workout|workout|Motivation|motivation|Power|power|Cardio|')
Str_Party = full_df_concise.Name.str.contains('Party|party')
Str_Chill = full_df_concise.Name.str.contains('Chill|chill|Relax|relax')
Str_Acoustic = full_df_concise.Name.str.contains('Acoustic|acoustic')
Str_2000s = full_df_concise.Name.str.contains('20')
Str_1990s = full_df_concise.Name.str.contains('90|91|92|93|94|95|96|97|98|99')
Str_1980s = full_df_concise.Name.str.contains('80|81|82|83|84|85|86|87|88|89')
Str_1970s = full_df_concise.Name.str.contains('70|71|72|73|74|75|76|77|78|79')
Str_1960s = full_df_concise.Name.str.contains('60|61|62|63|64|65|66|67|68|69')
Str_1950s = full_df_concise.Name.str.contains('50s')

# Convert Boolean into Integers
Str_Best = Str_Best*1
Str_Workout = Str_Workout*1 
Str_Party = Str_Party*1
Str_Chill = Str_Chill*1
Str_Acoustic = Str_Acoustic*1
Str_2000s = Str_2000s*1
Str_1990s = Str_1990s*1
Str_1980s = Str_1980s*1
Str_1970s = Str_1970s*1
Str_1960s = Str_1960s*1
Str_1950s = Str_1950s*1

# Add to Dataframe
full_df_concise['Str_Best'] = Str_Best
full_df_concise['Str_Workout'] = Str_Workout
full_df_concise['Str_Party'] = Str_Party
full_df_concise['Str_Chill'] = Str_Chill
full_df_concise['Str_Acoustic'] = Str_Acoustic
full_df_concise['Str_2000s'] = Str_2000s
full_df_concise['Str_1990s'] = Str_1990s
full_df_concise['Str_1980s'] = Str_1980s
full_df_concise['Str_1970s'] = Str_1970s
full_df_concise['Str_1960s'] = Str_1960s
full_df_concise['Str_1950s'] = Str_1950s

In [11]:
# Check New Column
full_df_concise.columns[-11:-1]

Index(['Str_Best', 'Str_Workout', 'Str_Party', 'Str_Chill', 'Str_Acoustic',
       'Str_2000s', 'Str_1990s', 'Str_1980s', 'Str_1970s', 'Str_1960s'],
      dtype='object')

## Interaction Terms with Audio Features and Genre

The following section describes the process of creating interaction terms between genres and audio features. Interaction terms are considered because genre may have an effect on the relationships between audio features and the number of playlist followers. For example, different levels of energy may be more popular for rap music than for acoustic music.

The first step is to bucket the genres (with a total of more than 100 specific genres) into broader categories. As listed below, some of the most common broad genres includ: house, hip hop, pop, dance, r&b, acoustic, and soul. 

In [12]:
broad_genres = ['house','hip hop','pop','dance','r&b','rap','acoustic','soul']

broad_genres = pd.DataFrame(np.zeros((full_df_concise.shape[0], len(broad_genres))), columns = broad_genres)

In [13]:
for genre in broad_genres:  
    for data_col in full_df_concise.columns:
        if genre in data_col:
            indices = full_df_concise[(full_df_concise[data_col]==1)].index
            broad_genres[genre][indices] = 1

Next, interaction terms are generated between genre categories and certain audio features. Below are the interaction terms that are created. These features are selected through a separate analysis in which all of the genres, audio features, and all possible interactions are used as predictors to model the number of playlist followers. We find that the interaction terms listed below are significant. 

In [15]:
# Adding significant interaction terms from previous model
interaction_columns = ['house_acousticness_mean','hip hop_acousticness_std','pop_liveness_std','dance_liveness_std',
                      'r&b_acousticness_std','rap_energy_std','rap_key_std','acoustic_acousticness_std','acoustic_acousticness_mean',
                      'acoustic_energy_std','acoustic_key_std','soul_acousticness_std']


full_df_concise['house_acousticness_mean'] = broad_genres['house']*full_df_concise['acousticness_mean']
full_df_concise['hip hop_acousticness_std'] = broad_genres['hip hop']*full_df_concise['acousticness_std']
full_df_concise['pop_liveness_std'] = broad_genres['pop']*full_df_concise['liveness_std']
full_df_concise['dance_liveness_std'] = broad_genres['dance']*full_df_concise['liveness_std']
full_df_concise['r&b_acousticness_std'] = broad_genres['r&b']*full_df_concise['acousticness_std']
full_df_concise['rap_energy_std'] = broad_genres['rap']*full_df_concise['energy_std']
full_df_concise['rap_key_std'] = broad_genres['rap']*full_df_concise['key_std']
full_df_concise['acoustic_acousticness_std'] = broad_genres['acoustic']*full_df_concise['acousticness_std']
full_df_concise['acoustic_acousticness_mean'] = broad_genres['acoustic']*full_df_concise['acousticness_mean']
full_df_concise['acoustic_energy_std'] = broad_genres['acoustic']*full_df_concise['energy_std']
full_df_concise['acoustic_key_std'] = broad_genres['acoustic']*full_df_concise['key_std']
full_df_concise['soul_acousticness_std'] = broad_genres['soul']*full_df_concise['acousticness_std']

In [16]:
full_df_concise[interaction_columns].describe()

Unnamed: 0,house_acousticness_mean,hip hop_acousticness_std,pop_liveness_std,dance_liveness_std,r&b_acousticness_std,rap_energy_std,rap_key_std,acoustic_acousticness_std,acoustic_acousticness_mean,acoustic_energy_std,acoustic_key_std,soul_acousticness_std
count,1420.0,1418.0,1418.0,1418.0,1418.0,1418.0,1418.0,1418.0,1420.0,1418.0,1418.0,1418.0
mean,0.224109,0.235339,0.156279,0.137165,0.239961,0.210305,0.210305,0.102606,0.115892,0.080324,0.080324,0.17331
std,0.21228,0.144852,0.056181,0.073726,0.143718,0.094412,0.094412,0.150939,0.190786,0.117964,0.117964,0.162756
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.129984,0.111896,0.160846,0.20446,0.20446,0.0,0.0,0.0,0.0,0.0
50%,0.221718,0.302918,0.155873,0.149616,0.306497,0.238572,0.238572,0.0,0.0,0.0,0.0,0.240984
75%,0.366849,0.341949,0.185451,0.180391,0.344083,0.267752,0.267752,0.285148,0.22857,0.220703,0.220703,0.332228
max,0.961,0.428986,0.351859,0.351859,0.444861,0.371096,0.371096,0.444861,0.961,0.347747,0.347747,0.420705


By now, the final dataframe has been created. We will leverage this dataframe and its features to conduct EDA and to construct models in the following sections. 