In [2]:
# Import packages: pandas for data wrangling, Spotipy for access to the Spotify Web API
import pandas as pd
import spotipy as spy
import spotipy.util as util

The first step is to acquire the data for our project. We need data that includes both the musical attributes and the lyrics of individual songs. Since there do not appear to be any readily available datasets that include both of these along with genre labels, we will instead assemble a dataset from API calls.

After some research, I found that Spotify, one of the leading music streaming services, has an extensive public Web API for accessing data on the songs in its catalog. The primary Spotify API is accessed through its web-based interface, but there is also a Python wrapper, aptly named 'Spotipy', that makes it possible to query song data directly from a Python environment such as a Jupyter notebook.

In this section, we will use this API to access the necessary data and store it in a Pandas dataframe that we can build upon in later analysis. Below you can see that we are storing the Client ID and Client Secret, two codes generated on Spotify's developer website after registering and creating an application. These are required to authorize the Spotipy library to access the music data that Spotify stores.

In [2]:
client_id = "ef7d245737124f1692a1b2a5f9a4ac81"
client_secret = "76f0f293190b4f29ac46ba285c30880a"

After storing these two codes in variables, we can use them to authenticate, generate an access token, and then instantiate the Spotify client (sp) that we will use to query data.

In [3]:
# Reuse the credentials stored above instead of repeating the literal strings
token = spy.oauth2.SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
cache_token = token.get_access_token()
sp = spy.Spotify(cache_token)
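
Since a Client ID and Client Secret written directly into a notebook are visible to anyone the notebook is shared with, one alternative is to read them from environment variables. The sketch below is a minimal illustration: the helper name `load_spotify_credentials` is hypothetical, and it assumes the two values have been exported as `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET`, which are the variable names Spotipy itself looks for.

```python
import os


def load_spotify_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    The function name is hypothetical; the variable names follow the
    convention Spotipy uses for its own environment lookups.
    """
    try:
        return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        # Fail early with a readable message instead of a bare KeyError
        raise RuntimeError(f"Set {missing.args[0]} before running the notebook") from None
```

The returned pair could then be assigned to `client_id` and `client_secret` in place of the literal strings.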

  


After some additional exploration into how the data is stored and accessed on Spotify's end, it appears that there isn't a way to query genre labels on the back end for individual songs, or even at the artist level. Surprisingly, however, the front-end Spotify application and web interface let you search for music by genre from the search bar using a specific syntax (e.g., typing genre:"Rock" into Spotify's front-end search). You can also query song data through any public playlist, whether it was created by a user or by Spotify itself.

As such, I went ahead and built playlists for 5 different genres: Rock, Pop, Hip-Hop, Country, and EDM. There are of course many more genres and subgenres than these, but these five were chosen because they are distinct enough musically and lyrically that we can extract and analyze the features that define them using our planned analysis. They also act as umbrellas under which other genres could potentially fit (e.g., Alternative as a subgenre of Rock, Christian as a subgenre of Rock or Country, Indie as a subgenre of Pop or Rock), something that could perhaps be identified using an unsupervised approach and lyrical topic analysis.

These playlists were then generated using the aforementioned front-end search. Artists labeled with mixed genres on Spotify (e.g., Pop/Rock, Country/Rock, Pop/Dance, Country/Hip-Hop) have become increasingly common in today's music, where there seems to be a growing effort to blend genres to maximize appeal and create a more unique sound. To limit these, I narrowed the front-end genre search syntax to find artists only in the specified genre and not in any of the other 4. An example of this is: genre:"Rock" OR genre:"Alternative" NOT genre:"Pop" NOT genre:"Hip-Hop" NOT genre:"Rap" NOT genre:"Country".
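
To keep the five search strings consistent, the query can be assembled programmatically rather than typed by hand. The helper below is a hypothetical sketch (the name `build_genre_query` and its parameters are not part of Spotify or Spotipy); it only builds the text to paste into the search bar and reproduces the Rock example from above.

```python
def build_genre_query(include, exclude):
    """Build a search string matching any genre in `include`
    and none of the genres in `exclude`."""
    wanted = " OR ".join(f'genre:"{genre}"' for genre in include)
    unwanted = " ".join(f'NOT genre:"{genre}"' for genre in exclude)
    return f"{wanted} {unwanted}".strip()


rock_query = build_genre_query(
    include=["Rock", "Alternative"],
    exclude=["Pop", "Hip-Hop", "Rap", "Country"],
)
print(rock_query)
```

The same function can be called once per genre, swapping which genres are included and which are excluded.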

Using these queries, I was able to generate playlists on Spotify containing 2000 songs for each genre, based on the genre that Spotify had assigned to the artist or album. With 10,000 songs in total, distributed equally across genres, this was expected to be a large enough dataset. Since the playlists were created on my Spotify account and set to public, we can access their data using the unique playlist IDs. Below are custom functions that make the API call, take the corresponding output, which comes back as a deeply nested JSON structure, and extract the relevant musical attributes and identifying information.

In [4]:
def get_playlist_tracks(username, playlist_id):
    results = sp.user_playlist_tracks(username, playlist_id)
    tracks = results['items']
    while results['next']:
        results = sp.next(results)
        tracks.extend(results['items'])
    return tracks
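
The loop in this function is needed because the API returns a playlist one page at a time (as far as I am aware, at most 100 items per request), with each page carrying a `next` link to the following one. The sketch below illustrates the same follow-the-cursor pattern without any network calls. `FakeClient` and `collect_items` are hypothetical stand-ins, and the `next` field is simplified to a list index instead of a URL.

```python
class FakeClient:
    """Stand-in for the Spotipy client that serves pre-built pages."""

    def __init__(self, pages):
        self._pages = pages

    def first_page(self):
        return self._pages[0]

    def next(self, page):
        # The real client follows the page's 'next' URL; here it is an index
        return self._pages[page["next"]]


def collect_items(client):
    """Gather the items from every page by following 'next' until it is empty."""
    page = client.first_page()
    items = list(page["items"])
    while page["next"]:
        page = client.next(page)
        items.extend(page["items"])
    return items


pages = [
    {"items": ["a", "b"], "next": 1},
    {"items": ["c", "d"], "next": 2},
    {"items": ["e"], "next": None},
]
all_items = collect_items(FakeClient(pages))
```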

In [5]:
def analyze_playlist(creator, playlist_id):
    
    # Create empty dataframe
    playlist_features_list = ["artist","album","track_name", "track_id", "playlist_name", "danceability", "energy","key","loudness","mode", "speechiness","instrumentalness","liveness","valence","tempo", "duration_ms","time_signature"]
    
    playlist_df = pd.DataFrame(columns = playlist_features_list)
    
    # Loop through every track in the playlist, extract features and append the features to the playlist dataframe
    
    playlist = get_playlist_tracks(creator, playlist_id)
    for track in playlist:
        # Create empty dict
        playlist_features = {}
        # Get metadata
        playlist_features["artist"] = track["track"]["album"]["artists"][0]["name"]
        playlist_features["album"] = track["track"]["album"]["name"]
        playlist_features["track_name"] = track["track"]["name"]
        playlist_features["track_id"] = track["track"]["id"]
        
        # Get audio features
        audio_features = sp.audio_features(playlist_features["track_id"])[0]
        for feature in playlist_features_list[5:]:
            playlist_features[feature] = audio_features[feature]
        
        # Concat the dfs
        track_df = pd.DataFrame(playlist_features, index = [0])
        playlist_df = pd.concat([playlist_df, track_df], ignore_index = True, sort=False)
        
    # Add genre, the playlist title, to playlist dataframe
    
    playlist_df["playlist_name"] = next(iter(sp.user_playlist(user=creator, playlist_id=playlist_id, fields="name").values()))
        
    return playlist_df
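
The function above requests audio features one track at a time, which means 2000 API calls per playlist. One possible speed-up, sketched below, is to request features in batches. This assumes that `sp.audio_features` accepts a list of up to 100 track IDs per call, which is my understanding of the Spotipy documentation. The helpers `chunked` and `fetch_features_in_batches` are hypothetical, and `fake_fetch` stands in for the real API call so that the sketch runs offline.

```python
def chunked(ids, size=100):
    """Yield successive slices of `ids` containing at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_features_in_batches(track_ids, fetch, size=100):
    """Call `fetch` once per batch of IDs and flatten the results.

    `fetch` is assumed to behave like sp.audio_features: it takes a list
    of track IDs and returns one result per ID, in the same order.
    """
    features = []
    for batch in chunked(track_ids, size):
        features.extend(fetch(batch))
    return features


calls = []


def fake_fetch(batch):
    # Record the batch size and return a minimal placeholder per track
    calls.append(len(batch))
    return [{"id": track_id} for track_id in batch]


track_ids = [f"track{i}" for i in range(250)]
features = fetch_features_in_batches(track_ids, fake_fetch)
print(calls)
```

With 250 IDs the placeholder is called three times instead of 250, which is the kind of reduction that would apply to each 2000-song playlist.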

Using these custom functions, we can extract the data we need into a clean Pandas dataframe that we can then use for exploratory data analysis and predictive modeling. Below I've generated a dataframe for each individual genre before combining them into one. The genres were pulled separately in order to keep the API calls running smoothly, with only 2000 songs pulled at a time; each pull took a few minutes to run.

In [6]:
#Extract track data for each genre playlist
rock_df = analyze_playlist(creator="Varun Raja", playlist_id="2bhOyiIYRserqMmA4jRBrR")
edm_df = analyze_playlist(creator="Varun Raja", playlist_id="7q4rZG8iAh2FfYVrhxl10y")
pop_df = analyze_playlist(creator="Varun Raja", playlist_id="0ezO8YZ1LyC9nZE0YYRfd7")
rap_df = analyze_playlist(creator="Varun Raja", playlist_id="6ozqmD6b88DFVdDUWRpzVc")
country_df = analyze_playlist(creator="Varun Raja", playlist_id="46rjJgToFor8OWYmQ083O4")

After successfully extracting the data for each of the 5 genres from the Spotify API into their respective dataframes, we can now combine them into a single dataframe by creating a list of the dataframes and concatenating them. Below you can see this along with a preview of our final dataframe.

In [7]:
genre_dfs = [rock_df, edm_df, pop_df, rap_df, country_df]
spotify_df = pd.concat(genre_dfs)
spotify_df

Unnamed: 0,artist,album,track_name,track_id,playlist_name,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,Red Hot Chili Peppers,Stadium Arcadium,Dani California,10Nmj3JCNoMeBQ87uw5j8k,Rock,0.556,0.913,0,-2.360,1,0.0437,8.59e-06,0.3460,0.730,96.184,282160,4
1,Red Hot Chili Peppers,Californication (Deluxe Edition),Californication,48UPSzbZjgc449aqz8bxox,Rock,0.592,0.767,9,-2.788,0,0.0270,0.00165,0.1270,0.328,96.483,329733,4
2,Foo Fighters,"Echoes, Silence, Patience & Grace",The Pretender,7x8dCjCr0x6x2lXKujYD34,Rock,0.433,0.959,9,-4.040,1,0.0431,0,0.0280,0.365,172.984,269373,4
3,Nirvana,Nevermind (Remastered),Smells Like Teen Spirit,5ghIJDpPoe3CfHMGu71E6T,Rock,0.502,0.912,1,-4.556,1,0.0564,0.000173,0.1060,0.720,116.761,301920,4
4,Red Hot Chili Peppers,Californication (Deluxe Edition),Scar Tissue,1G391cbiT3v3Cywg8T7DM1,Rock,0.595,0.717,0,-4.803,1,0.0295,0.00274,0.1080,0.547,88.969,215907,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,Jason Aldean,Relentless,Do You Wish It Was Me,0NffUi6VS3WFN66ZKB7sAN,Country,0.587,0.610,7,-6.069,1,0.0261,3.49e-06,0.0724,0.658,142.195,264507,4
1996,Jason Boland & The Stragglers,Comal County Blue,Comal County Blue,2aYLDRT1RyTgYgDo1b0Kzt,Country,0.596,0.447,4,-11.055,1,0.0267,0.00264,0.1030,0.538,133.931,241973,4
1997,William Clark Green,Hebert Island,This Is Us,6IbmwZ92FXQeuqh7ASV92j,Country,0.512,0.563,9,-7.132,1,0.0276,0,0.1340,0.358,84.858,215467,4
1998,Midland,On The Rocks,At Least You Cried,6y0mAMhVwWpbqQyWeBWDbp,Country,0.536,0.847,7,-5.051,1,0.0403,0.00632,0.3180,0.698,123.989,158000,4


In [11]:
print(spotify_df.shape) 
print(spotify_df.describe())
print(spotify_df.columns)

(10000, 20)
       danceability        energy      loudness   speechiness  \
count  10000.000000  10000.000000  10000.000000  10000.000000   
mean       0.606127      0.681767     -6.407974      0.100645   
std        0.151107      0.193031      2.745441      0.108305   
min        0.061700      0.016700    -26.967000      0.022600   
25%        0.506000      0.551000     -7.762250      0.035300   
50%        0.605000      0.705000     -5.962500      0.051000   
75%        0.713000      0.837000     -4.573750      0.111000   
max        0.981000      0.999000      0.878000      0.944000   

       instrumentalness      liveness       valence         tempo  \
count      10000.000000  10000.000000  10000.000000  10000.000000   
mean           0.064241      0.194992      0.451779    123.758013   
std            0.190840      0.156917      0.226044     29.042635   
min            0.000000      0.014100      0.025800     43.509000   
25%            0.000000      0.097600      0.272000     9

As shown above in the dataframe preview, shape, and summary statistics, the dataframe contains a total of 10,000 records representing songs across a number of artists and albums, with 2000 songs labeled in each genre. The columns include the basic song identification data (track name, artist, album, and playlist name, which is the genre) in addition to various musical attributes that represent quantitative measures of different musical components of a song.

These include metrics such as danceability and energy, which are measured on a scale from 0.0 to 1.0, and loudness, which is measured in decibels, as described on the Spotify API website (link). The other metrics include tempo, measured in the musical standard of beats per minute, as well as duration in milliseconds and time signature in beats per measure. Finally, the data includes key, stored as an integer in pitch class notation that identifies each of the 12 keys, and a binary mode, with 0 representing minor and 1 representing major.

Since the last group of metrics is stored in a format that isn't easily interpretable, we will do some data cleaning and transforming to make it more understandable. First, we will take the pitch class notation the keys are presented in and map each value to its respective letter key by creating a lookup dictionary (e.g., 0 = C, 2 = D). We will likewise translate the binary notation in the mode field into 'Minor' and 'Major'. From there, we can combine the two fields into a single value of letter plus major or minor (e.g., key=0, mode=1 -> C Major). From my knowledge of music theory, these keys represent the tonic and scale that a song is built around.

In [38]:
key_dict = {0:"C", 1:"C#", 2:"D", 3:"D#", 4:"E", 5:"F", 6:"F#", 7:"G", 8:"G#", 9:"A", 10:"Bb", 11:"B"}
mode_dict = {1:"Major", 0:"Minor"}

In [42]:
spotify_df["key"] = spotify_df["key"].map(key_dict)
spotify_df["mode"] = spotify_df["mode"].map(mode_dict)

In [47]:
spotify_df["full_key"] = spotify_df["key"] + " " + spotify_df["mode"]
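
One case worth keeping in mind: as far as I understand Spotify's documentation, the key field is set to -1 when no key was detected for a track. Such a value is not in `key_dict`, so the mapping above would leave it as a missing value. The sketch below shows the same lookup in plain Python with that case handled explicitly. The names `KEY_NAMES`, `MODE_NAMES`, and the helper `full_key` are hypothetical; the mappings copy the dictionaries defined above.

```python
KEY_NAMES = {0: "C", 1: "C#", 2: "D", 3: "D#", 4: "E", 5: "F",
             6: "F#", 7: "G", 8: "G#", 9: "A", 10: "Bb", 11: "B"}
MODE_NAMES = {1: "Major", 0: "Minor"}


def full_key(key, mode):
    """Return a label such as 'C Major', or None if either code is unknown."""
    name = KEY_NAMES.get(key)
    quality = MODE_NAMES.get(mode)
    if name is None or quality is None:
        return None
    return f"{name} {quality}"


print(full_key(0, 1))
print(full_key(-1, 1))
```

A quick count of missing values in the `full_key` column after the mapping would show whether any tracks in the dataset fall into this case.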

With the key and mode translated, the last useful transformation is to convert the song's 'duration_ms' field from milliseconds into a more familiar unit. Since the standard minutes:seconds notation would not be continuous when charted, we instead opted to simply convert milliseconds to minutes by dividing by 60,000 (the number of milliseconds in a minute). This way we can see how many minutes long each song is, which is how music players generally communicate song length.

In [62]:
%%capture
spotify_df["duration_minutes"] = spotify_df["duration_ms"] / 60000

In [64]:
spotify_df = spotify_df[['track_name', 'artist', 'album', 'track_id', 'playlist_name',
       'danceability', 'energy', 'full_key', 'key','mode', 'loudness', 'speechiness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_minutes', 'duration_ms',
       'time_signature']]

In [55]:
def convert_to_datetime(millis):
    # Convert a duration in milliseconds to an "h:mm:ss" string
    total_seconds = int(millis) // 1000
    minutes, seconds = divmod(total_seconds, 60)
    hours, minutes = divmod(minutes, 60)
    # Zero-pad minutes and seconds so that 282160 ms reads as 0:04:42
    return "%d:%02d:%02d" % (hours, minutes, seconds)

After completing these minor data transformations, we can review the updated dataframe and see that it looks just about complete. However, since we also want to analyze a song's lyrics and use them to predict genre alongside the musical attributes, we will need to find a way to pull lyrics into this dataframe. As such, we will save the existing dataframe to a CSV file and load it into our next notebook, where we will use Genius to retrieve the lyrics for as many songs in our existing dataframe as possible.

In [65]:
spotify_df

Unnamed: 0,track_name,artist,album,track_id,playlist_name,danceability,energy,full_key,key,mode,loudness,speechiness,instrumentalness,liveness,valence,tempo,duration_minutes,duration_ms,time_signature
0,Dani California,Red Hot Chili Peppers,Stadium Arcadium,10Nmj3JCNoMeBQ87uw5j8k,Rock,0.556,0.913,C Major,C,Major,-2.360,0.0437,0.000009,0.3460,0.730,96.184,4.702667,282160,4
1,Californication,Red Hot Chili Peppers,Californication (Deluxe Edition),48UPSzbZjgc449aqz8bxox,Rock,0.592,0.767,A Minor,A,Minor,-2.788,0.0270,0.001650,0.1270,0.328,96.483,5.495550,329733,4
2,The Pretender,Foo Fighters,"Echoes, Silence, Patience & Grace",7x8dCjCr0x6x2lXKujYD34,Rock,0.433,0.959,A Major,A,Major,-4.040,0.0431,0.000000,0.0280,0.365,172.984,4.489550,269373,4
3,Smells Like Teen Spirit,Nirvana,Nevermind (Remastered),5ghIJDpPoe3CfHMGu71E6T,Rock,0.502,0.912,C# Major,C#,Major,-4.556,0.0564,0.000173,0.1060,0.720,116.761,5.032000,301920,4
4,Scar Tissue,Red Hot Chili Peppers,Californication (Deluxe Edition),1G391cbiT3v3Cywg8T7DM1,Rock,0.595,0.717,C Major,C,Major,-4.803,0.0295,0.002740,0.1080,0.547,88.969,3.598450,215907,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,Do You Wish It Was Me,Jason Aldean,Relentless,0NffUi6VS3WFN66ZKB7sAN,Country,0.587,0.610,G Major,G,Major,-6.069,0.0261,0.000003,0.0724,0.658,142.195,4.408450,264507,4
9996,Comal County Blue,Jason Boland & The Stragglers,Comal County Blue,2aYLDRT1RyTgYgDo1b0Kzt,Country,0.596,0.447,E Major,E,Major,-11.055,0.0267,0.002640,0.1030,0.538,133.931,4.032883,241973,4
9997,This Is Us,William Clark Green,Hebert Island,6IbmwZ92FXQeuqh7ASV92j,Country,0.512,0.563,A Major,A,Major,-7.132,0.0276,0.000000,0.1340,0.358,84.858,3.591117,215467,4
9998,At Least You Cried,Midland,On The Rocks,6y0mAMhVwWpbqQyWeBWDbp,Country,0.536,0.847,G Major,G,Major,-5.051,0.0403,0.006320,0.3180,0.698,123.989,2.633333,158000,4


In [10]:
# Write the dataframe to a CSV file to use in the next portion
spotify_df.to_csv("spotify_genre_fixtrack_df.csv", index=False)
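
Several album names in the dataset contain commas (for example "Echoes, Silence, Patience & Grace"), so it is worth confirming that such fields survive a round trip through CSV. By default, pandas quotes these fields when writing. The sketch below demonstrates the same quoting behavior with Python's standard `csv` module on a single hand-written row, as an illustration only and not a check of the actual file.

```python
import csv
import io

rows = [
    {"track_name": "The Pretender", "artist": "Foo Fighters",
     "album": "Echoes, Silence, Patience & Grace"},
]

# Write to an in-memory buffer; fields containing commas are quoted by default
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["track_name", "artist", "album"])
writer.writeheader()
writer.writerows(rows)

# Read the text back and confirm the album name comes out as one field
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0]["album"])
```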