# Spotify Song Popularity Prediction

- Sayan Biswas (biswas.say@northeastern.edu)


# Data and Analysis Plan

## Data Extraction & Cleaning

We use the Spotify Web API to collect data. Because our aim is to predict a song's popularity, we collect features that are likely to be correlated with popularity from several API endpoints.

We note that popularity varies over time. To avoid having to model these time dynamics, we limit our analysis to songs released in the past two weeks. Doing so mitigates most of the effect a song's age has on its popularity.

The `tag:new` filter, passed in an album search query, limits responses to albums which are at most two weeks old. We then derive track information by querying the API for the tracks associated with each 'new' album.

To reduce the chance of an imbalanced dataset (most of the songs might otherwise be popular), we also query with `tag:hipster`, which returns albums that are unpopular and were released in the last two weeks. This helps ensure a good mix of songs across the full range of popularity.
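
As a minimal sketch of how the two tagged queries could be expressed as request parameters, the snippet below builds and URL-encodes them. The helper name `build_album_search_params` is hypothetical (it is not part of this notebook); the parameter names mirror those used by `search_album()` further down.

```python
from urllib.parse import urlencode


def build_album_search_params(tag, limit=50, offset=0, market='US'):
    """Hypothetical helper: query parameters for an album search by tag."""
    return {'q': f'tag:{tag}', 'type': 'album', 'market': market,
            'limit': limit, 'offset': offset}


# one query string per tag; the colon in 'tag:new' is percent-encoded
for tag in ('new', 'hipster'):
    print(urlencode(build_album_search_params(tag)))
```

Encoding the parameters this way avoids hand-assembling the query string, which matters once a query contains characters such as `:` or spaces.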

### Pipeline

1. Get all albums released in the last 2 weeks
    - `get_new_album()`
1. For each album, build a dataframe of the tracks it includes
    - `get_tracks_per_album()`
1. Aggregate the sets of tracks (one from each album) into a single `df_track`
    - `get_all_tracks()`
1. For each track, append the song features (e.g. danceability, loudness)
    - `get_track_features()`
1. For each track, also append its popularity score
    - `get_track_popularity()`
1. Merge the song features and popularity together

## Access Token Management
One can manually produce an access token through Spotify's web interface, as we did in class. This access token expires after an hour, and navigating the web interface each time is cumbersome. We provide a `get_access_token()` function which fetches a fresh access token programmatically to mitigate the issue.
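
Because the token is only good for an hour, one option is to cache it and refetch shortly before it expires, rather than requesting a new one on every call. The sketch below is an assumption about how that could look: `TokenCache`, `lifetime_s` and `margin_s` are hypothetical names, and the fetcher and clock are faked so the example runs without network access.

```python
import time


class TokenCache:
    """Hypothetical helper: caches a token and refetches it shortly before expiry."""

    def __init__(self, fetch_token, lifetime_s=3600, margin_s=60, clock=time.monotonic):
        self._fetch_token = fetch_token  # zero-argument callable returning a token
        self._lifetime_s = lifetime_s    # assumed token lifetime (one hour)
        self._margin_s = margin_s        # refresh this many seconds early
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        # fetch on first use, or once we are within the safety margin of expiry
        if self._token is None or now >= self._expires_at - self._margin_s:
            self._token = self._fetch_token()
            self._expires_at = now + self._lifetime_s
        return self._token


# demonstration with a fake fetcher and a fake clock (no API calls)
calls = []


def fake_fetch():
    calls.append(1)
    return f'token-{len(calls)}'


fake_now = [0.0]
cache = TokenCache(fake_fetch, clock=lambda: fake_now[0])

first = cache.get()    # first use: fetches
fake_now[0] = 1000.0
second = cache.get()   # still valid: reuses the cached token
fake_now[0] = 3600.0
third = cache.get()    # past the expiry margin: fetches again
```

In real use the fetcher would be something like `lambda: get_access_token(client_id, client_secret)`.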

### Note:
todo: This section would be outstanding level work... above and beyond

In [1]:
import requests
import pandas as pd

def get_access_token(client_id, client_secret):
    """ gets a fresh access token (good for an hour) 
    
    NOTE: you're welcome to "steal" this function in your own spotify
    API calls but we should give a shoutout to its authors:
    
    Written by:
        Sayan Biswas (biswas.say@northeastern.edu)

    https://developer.spotify.com/documentation/general/guides/authorization-guide/
    
    Args:
        client_id (str): id associated with a spotify app
        client_secret (str): secret associated with a spotify app

    Returns:
        access_token (str): access token
    """
    # query spotify API for an access token
    auth_url = 'https://accounts.spotify.com/api/token'
    
    data = {'grant_type': 'client_credentials',
            'client_id': client_id,
            'client_secret': client_secret}
    
    auth_response = requests.post(auth_url, data=data)
    
    # fail loudly on bad credentials or other HTTP errors
    auth_response.raise_for_status()
    
    # extract access token from response
    access_token = auth_response.json()['access_token']
    
    return access_token

In [2]:
# after building your own app you can get client_id, client_secret
# via https://developer.spotify.com/dashboard/applications
client_id = 'f5e3a712c7bc4f38a71d35e5bb327875'
client_secret = '614a7ba5f2de44c995689b230075f045'

get_access_token(client_id=client_id, client_secret=client_secret)

'BQAlOZR6L9VFyKwmXTsNiAZYy1TDVqIngLhkyQKExsX1K-T_0HdHNDcDWAgeJYx9bz1LAULGUYuJ2cEtHZg'

### NOTE: 
You're welcome to "steal" this function in your own Spotify API calls, but it's appropriate to show some appreciation for its authors in the docstring:

    Sayan Biswas (biswas.say@northeastern.edu)
    
(Appreciation aside ... failure to include this is plagiarism)

## Getting all albums of past two weeks

In [5]:
def search_album(query, limit, offset, access_token, market='US'):
    """ searches for an album

    see link below for further doc
    https://developer.spotify.com/documentation/web-api/reference/#endpoint-search

    Args:
        query (str): query string
        market (str): An ISO 3166-1 alpha-2 country code or the string from_token
        limit (int): Maximum number of results to return.
        offset (int): The index of the first result to return
        access_token (str): access token

    Returns:
        df_album (pd.DataFrame): one row per album
    """
    # build url and query parameters (requests handles the URL encoding)
    search_url = 'https://api.spotify.com/v1/search'
    params = {'q': query, 'type': 'album', 'market': market,
              'limit': limit, 'offset': offset}

    # query API
    headers = {'Authorization': f'Bearer {access_token}'}
    response = requests.get(search_url, headers=headers, params=params)

    # fail loudly on an expired token or other HTTP errors
    response.raise_for_status()

    return pd.DataFrame(response.json()['albums']['items'])

In [6]:
access_token = get_access_token(client_id=client_id, client_secret=client_secret)
search_album('tag:new', limit=1, offset=0, access_token=access_token)


Unnamed: 0,album_type,artists,external_urls,href,id,images,name,release_date,release_date_precision,total_tracks,type,uri
0,single,[{'external_urls': {'spotify': 'https://open.s...,{'spotify': 'https://open.spotify.com/album/5L...,https://api.spotify.com/v1/albums/5LuoozUhs2pl...,5LuoozUhs2pl3glZeAJl89,"[{'height': 640, 'url': 'https://i.scdn.co/ima...",Scary Hours 2,2021-03-05,day,3,album,spotify:album:5LuoozUhs2pl3glZeAJl89


In [7]:
def get_new_album(query, limit,start_offset, end_offset, market='US'):
    """ searches for all albums utilizing the function "search_album" created above


    Args:
        query (str): query string
        limit (int): Maximum number of results to return
        start_offset (int): The index of the first result to return in the first set of albums returned
        end_offset (int): The index of the first result to return in the last set of albums returned
        market (str): An ISO 3166-1 alpha-2 country code or the string from_token

    Returns:
        df_album (pd.DataFrame): one row per album
    """
    
    # collect one dataframe per call to "search_album", then concatenate once
    # (DataFrame.append was removed in pandas 2.0)
    df_list = []
    
    # offset steps through the subsequent sets of albums from the endpoint
    for offset in range(start_offset, end_offset, limit):
    
        # refreshing the access_token to ensure it doesn't expire
        # (client_id and client_secret are the globals defined above)
        access_token = get_access_token(client_id=client_id, client_secret=client_secret)
        df = search_album(query, limit, offset, access_token, market=market)
        df_list.append(df)
    
    return pd.concat(df_list, ignore_index=True) if df_list else pd.DataFrame()

In [8]:
df_album_new = get_new_album('tag:new', limit=50, start_offset=0, end_offset=1000)
df_album_hipster = get_new_album('tag:hipster', limit=50, start_offset=0, end_offset=1000)
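
To make the paging concrete, the sketch below lists the offsets that the calls above step through. The helper name `page_offsets` is hypothetical; it only mirrors the `range(start_offset, end_offset, limit)` loop inside `get_new_album()`.

```python
def page_offsets(start_offset, end_offset, limit):
    """Hypothetical helper: offsets requested when paging `limit` items at a time."""
    return list(range(start_offset, end_offset, limit))


offsets = page_offsets(start_offset=0, end_offset=1000, limit=50)

# number of requests, the first few offsets, and the last offset requested
print(len(offsets), offsets[:3], offsets[-1])
```

With these arguments each tag is queried 20 times, so each of `df_album_new` and `df_album_hipster` holds at most 1000 albums.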

## Getting all tracks in the albums released in past 2 weeks

In [9]:
def get_tracks_per_album(album_id,limit,offset,market,access_token):
    """ finds all tracks in an album

    see link below for further doc
    https://developer.spotify.com/console/get-album-tracks/

    Args:
        album_id (str): Find all tracks for this album_id
        limit (int): Maximum number of results to return
        offset (int): The index of the first result to return
        market (str): An ISO 3166-1 alpha-2 country code or the string from_token
        access_token (str): access token
        

    Returns:
        df_track (pd.DataFrame): one row per track
    """
    
    # build url of query
    track_url = 'https://api.spotify.com/v1/albums/' 
    params = f'{album_id}'
    endpoint = f'/tracks/?market={market}&limit={limit}&offset={offset}'
    url = track_url + params + endpoint

    # query API
    
    headers = {'Authorization': f'Bearer {access_token}'}
    response = requests.get(url, headers=headers)


    response = response.json()
    return pd.DataFrame(response['items'])
    

In [10]:
get_tracks_per_album(df_album_new["id"][0], limit=50, offset=0, market='US', access_token=access_token)



Unnamed: 0,artists,disc_number,duration_ms,explicit,external_urls,href,id,is_local,is_playable,name,preview_url,track_number,type,uri
0,[{'external_urls': {'spotify': 'https://open.s...,1,178153,True,{'spotify': 'https://open.spotify.com/track/3a...,https://api.spotify.com/v1/tracks/3aQem4jVGdht...,3aQem4jVGdhtg116TmJnHz,False,True,What’s Next,,1,track,spotify:track:3aQem4jVGdhtg116TmJnHz
1,[{'external_urls': {'spotify': 'https://open.s...,1,192956,True,{'spotify': 'https://open.spotify.com/track/65...,https://api.spotify.com/v1/tracks/65OVbaJR5O1R...,65OVbaJR5O1RmwOQx0875b,False,True,Wants and Needs (feat. Lil Baby),,2,track,spotify:track:65OVbaJR5O1RmwOQx0875b
2,[{'external_urls': {'spotify': 'https://open.s...,1,383036,True,{'spotify': 'https://open.spotify.com/track/4F...,https://api.spotify.com/v1/tracks/4FRW5Nza1Ym9...,4FRW5Nza1Ym91BGV4nFWXI,False,True,Lemon Pepper Freestyle (feat. Rick Ross),,3,track,spotify:track:4FRW5Nza1Ym91BGV4nFWXI


In [11]:
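def get_all_tracks(album_ids, limit, offset, market, access_token):
    """ finds all tracks for a collection of albums

    Args:
        album_ids (iterable of str): album ids to find tracks for
        limit (int): Maximum number of tracks to return per album
        offset (int): The index of the first track to return per album
        market (str): An ISO 3166-1 alpha-2 country code or the string from_token
        access_token (str): access token

    Returns:
        df_track (pd.DataFrame): one row per track
    """
    # collect one dataframe per album, then concatenate once
    # (DataFrame.append was removed in pandas 2.0)
    df_list = [get_tracks_per_album(album_id, limit, offset, market, access_token)
               for album_id in album_ids]

    return pd.concat(df_list, ignore_index=True) if df_list else pd.DataFrame()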

In [12]:
df_track_new = get_all_tracks(df_album_new["id"], limit=50, offset=0, market='US', access_token=access_token)
df_track_hipster = get_all_tracks(df_album_hipster["id"], limit=50, offset=0, market='US', access_token=access_token)


## Getting all audio features associated with the track

In [13]:
def get_track_features(track_ids,access_token,batch_size):
    
    """ finds audio features for a set of tracks

    see link below for further doc
    https://developer.spotify.com/console/get-audio-features-several-tracks/

    Args:
        track_ids (Series): Find all audio-features for these tracks
        access_token (str): access token
        batch_size (int) : The number of ids to query for in one API call
        

    Returns:
        df_audio (pd.DataFrame): one row of audio features per track
    """
    
    # collect one dataframe per batch, then concatenate once
    # (DataFrame.append was removed in pandas 2.0)
    df_list = []
    
    # converting series object to list
    track_list = track_ids.to_list()
    
    # dividing the list of track_ids into batches for fewer API calls
    for i in range(0, len(track_list), batch_size):
        ids = ','.join(track_list[i:i+batch_size])

        # build url of query
        audio_features_url = 'https://api.spotify.com/v1/audio-features'
        endpoint = f'?ids={ids}'
        url = audio_features_url + endpoint
        
        # query API
        headers = {'Authorization': f'Bearer {access_token}'}
        response = requests.get(url, headers=headers)
        response = response.json()
        
        # a few tracks do not have audio features (returned as None); remove them
        features = [value for value in response['audio_features'] if value is not None]
        df_list.append(pd.DataFrame(features))
    
    return pd.concat(df_list, ignore_index=True) if df_list else pd.DataFrame()
    

In [14]:
access_token = get_access_token(client_id=client_id, client_secret=client_secret)
df_audio_new = get_track_features(df_track_new["id"],access_token=access_token,batch_size=100)
df_audio_hipster = get_track_features(df_track_hipster["id"],access_token=access_token,batch_size=100)
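
The batching used above can be illustrated in isolation. This is a small sketch with made-up ids; `chunk_ids` is a hypothetical helper that mirrors the slicing and comma-joining done inside `get_track_features()`.

```python
def chunk_ids(track_ids, batch_size):
    """Hypothetical helper: yield comma-joined batches of at most batch_size ids."""
    ids = list(track_ids)
    for i in range(0, len(ids), batch_size):
        yield ','.join(ids[i:i + batch_size])


# five made-up ids in batches of two: the last batch is simply shorter
example_ids = [f'id{n}' for n in range(5)]
batches = list(chunk_ids(example_ids, batch_size=2))
print(batches)
```

Each string in `batches` corresponds to the `ids` query parameter of one API call, so five ids with `batch_size=2` need three calls rather than five.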

## Getting popularity of all the tracks

In [15]:
def get_track_popularity(track_ids,batch_size,access_token,market='US'):
    
    """ finds track details (including popularity) for a set of tracks

    see link below for further doc
    https://developer.spotify.com/documentation/web-api/reference/#endpoint-get-several-tracks
    
    Args:
        track_ids (Series): Find popularity for these tracks
        access_token (str): access token
        batch_size (int) : The number of ids to query for in one API call
        

    Returns:
        df_popularity (pd.DataFrame): one row of track details per track (including popularity)
    """
    
    
    # collect one dataframe per batch, then concatenate once
    # (DataFrame.append was removed in pandas 2.0)
    df_list = []

    # converting to a list so that batches are sliced by position
    track_list = list(track_ids)

    for i in range(0, len(track_list), batch_size):
        ids = ','.join(track_list[i:i+batch_size])

        # build url of query
        track_url = 'https://api.spotify.com/v1/tracks'
        endpoint = f'?ids={ids}&market={market}'
        url = track_url + endpoint

        # query API
        headers = {'Authorization': f'Bearer {access_token}'}
        response = requests.get(url, headers=headers)
        response = response.json()

        df_list.append(pd.DataFrame(response['tracks']))

    return pd.concat(df_list, ignore_index=True) if df_list else pd.DataFrame()
    

In [16]:
df_popularity_new = get_track_popularity(df_track_new["id"],batch_size=50,access_token=access_token)
df_popularity_hipster = get_track_popularity(df_track_hipster["id"],batch_size=50,access_token=access_token)

## Merging the audio-features data & popularity data

In [84]:
df_new = df_audio_new.merge(df_popularity_new, on='id',how='inner')
df_hipster = df_audio_hipster.merge(df_popularity_hipster, on='id',how='inner')
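
A toy sketch of what the inner merge does, using made-up values rather than real API output: rows are kept only when the `id` appears in both frames, and columns that share a name (here `type`) come back with `_x` and `_y` suffixes, which is where names such as `type_x` and `uri_y` originate.

```python
import pandas as pd

# made-up stand-ins for the audio-features and popularity dataframes
df_audio_toy = pd.DataFrame({'id': ['a', 'b', 'c'],
                             'danceability': [0.78, 0.58, 0.77],
                             'type': ['audio_features'] * 3})
df_popularity_toy = pd.DataFrame({'id': ['a', 'b', 'd'],
                                  'popularity': [85, 84, 10],
                                  'type': ['track'] * 3})

# inner merge: only ids present in both frames ('a' and 'b') survive
df_merged_toy = df_audio_toy.merge(df_popularity_toy, on='id', how='inner')
print(df_merged_toy.columns.tolist())
print(len(df_merged_toy))
```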

In [85]:
# concatenating both the dataframes together to get one final dataframe
df_final = pd.concat([df_new,df_hipster])
df_final.head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,...,external_urls,href,is_local,is_playable,name,popularity,preview_url,track_number,type_y,uri_y
0,0.781,0.594,0,-6.959,0,0.0485,0.0136,0.0,0.162,0.0628,...,{'spotify': 'https://open.spotify.com/track/3a...,https://api.spotify.com/v1/tracks/3aQem4jVGdht...,False,True,What’s Next,85,,1,track,spotify:track:3aQem4jVGdhtg116TmJnHz
1,0.578,0.449,1,-6.349,1,0.286,0.0618,2e-06,0.119,0.1,...,{'spotify': 'https://open.spotify.com/track/65...,https://api.spotify.com/v1/tracks/65OVbaJR5O1R...,False,True,Wants and Needs (feat. Lil Baby),84,,2,track,spotify:track:65OVbaJR5O1RmwOQx0875b
2,0.77,0.637,1,-5.53,1,0.345,0.103,0.0,0.171,0.431,...,{'spotify': 'https://open.spotify.com/track/4F...,https://api.spotify.com/v1/tracks/4FRW5Nza1Ym9...,False,True,Lemon Pepper Freestyle (feat. Rick Ross),82,,3,track,spotify:track:4FRW5Nza1Ym91BGV4nFWXI
3,0.777,0.58,0,-6.928,0,0.0525,0.0125,0.0,0.161,0.0636,...,{'spotify': 'https://open.spotify.com/track/3m...,https://api.spotify.com/v1/tracks/3mDFLytDotXo...,False,True,What's Next,67,,1,track,spotify:track:3mDFLytDotXo2p0rvfGbkA
4,0.588,0.412,7,-7.397,0,0.329,0.0574,8e-06,0.114,0.121,...,{'spotify': 'https://open.spotify.com/track/6Z...,https://api.spotify.com/v1/tracks/6ZoZ4KGIDD23...,False,True,Wants and Needs (feat. Lil Baby),65,,2,track,spotify:track:6ZoZ4KGIDD23DohdVk0Ybw


## Data Cleaning

In [86]:
def data_clean(df):
    """ cleans the data

    Args:
        df (pd.DataFrame): the dataframe to be cleaned

    Returns:
        df_cleaned (pd.DataFrame): cleaned dataframe
    """

    # removing columns we won't be working with
    df_cleaned = df.drop(columns=['type_x','uri_x', 'track_href', 'analysis_url', 'duration_ms_x',
    'time_signature','disc_number', 'duration_ms_y',
    'explicit', 'external_ids', 'external_urls', 'href', 'is_local',
    'is_playable','preview_url','type_y', 'uri_y','album','artists','track_number'])

    # ensuring there are no duplicate rows in the data; this runs after the column
    # drop because the dict/list-valued columns removed above are unhashable
    df_cleaned = df_cleaned.drop_duplicates()

    return df_cleaned

    

In [87]:
df_final = data_clean(df_final)
df_final.head()


Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,id,name,popularity
0,0.781,0.594,0,-6.959,0,0.0485,0.0136,0.0,0.162,0.0628,129.895,3aQem4jVGdhtg116TmJnHz,What’s Next,85
1,0.578,0.449,1,-6.349,1,0.286,0.0618,2e-06,0.119,0.1,136.006,65OVbaJR5O1RmwOQx0875b,Wants and Needs (feat. Lil Baby),84
2,0.77,0.637,1,-5.53,1,0.345,0.103,0.0,0.171,0.431,94.966,4FRW5Nza1Ym91BGV4nFWXI,Lemon Pepper Freestyle (feat. Rick Ross),82
3,0.777,0.58,0,-6.928,0,0.0525,0.0125,0.0,0.161,0.0636,129.918,3mDFLytDotXo2p0rvfGbkA,What's Next,67
4,0.588,0.412,7,-7.397,0,0.329,0.0574,8e-06,0.114,0.121,136.068,6ZoZ4KGIDD23DohdVk0Ybw,Wants and Needs (feat. Lil Baby),65
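
One detail worth illustrating about duplicate removal in pandas: `drop_duplicates` only takes effect when it is called and its return value is kept, since by default it returns a new dataframe instead of modifying the original. A toy sketch with made-up rows:

```python
import pandas as pd

# made-up data with one exact duplicate row
df_toy = pd.DataFrame({'id': ['a', 'a', 'b'], 'popularity': [85, 85, 67]})

df_toy.drop_duplicates               # attribute access only: nothing is called
n_before = len(df_toy)

df_dedup = df_toy.drop_duplicates()  # returns a new dataframe without the duplicate
n_after = len(df_dedup)

print(n_before, n_after)
```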


In [88]:
# saving the dataframes into csv file so that we don't have to get data from Spotify API each time we want to use it
df_new.to_csv("df_new.csv")
df_hipster.to_csv("df_hipster.csv")
df_final.to_csv("df_final.csv")
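
When these files are read back with `pd.read_csv`, the saved index shows up as an extra `Unnamed: 0` column unless it is handled. The sketch below uses an in-memory buffer and made-up rows to compare the default behaviour with `index=False`; whether to drop the index is a choice, not something this notebook requires.

```python
import io

import pandas as pd

df_toy = pd.DataFrame({'id': ['a', 'b'], 'popularity': [85, 84]})

# default: the index is written as an unnamed first column
buf_with_index = io.StringIO()
df_toy.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with_index = pd.read_csv(buf_with_index).columns.tolist()

# index=False: only the data columns are written
buf_no_index = io.StringIO()
df_toy.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
cols_no_index = pd.read_csv(buf_no_index).columns.tolist()

print(cols_with_index)
print(cols_no_index)
```

An alternative that keeps the saved index is to read the file back with `pd.read_csv(path, index_col=0)`.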