In [1]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import os
import pandas as pd

We opted to use Spotify's API to get the music. The API keys used here are stored in our environment variables.

In [2]:
cid = os.getenv('SPOTIFY_CLIENT_ID')
secret = os.getenv('SPOTIFY_SECRET')
ccm = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=ccm)
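
Because `os.getenv` returns `None` when a variable is not set, a missing key only surfaces later as an authentication error. A minimal sketch of a fail-fast check is shown here; the helper name `require_env` is our own invention for illustration and is not part of spotipy.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Possible usage with the variable names from the cell above:
# cid = require_env('SPOTIFY_CLIENT_ID')
# secret = require_env('SPOTIFY_SECRET')
```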

Since we need playlists of music representative of each genre, we decided to exclusively use Spotify's hand-picked playlists, indicated by the Spotify logo at the top left of the playlist icon.

However, we noticed that not every song has a preview mp3 available in the JSON response. This means that any given playlist would yield only about 20-30 songs whose previews we can analyze.

To circumvent this, we sampled broadly from playlists within the genres, removing duplicates and ensuring we had at least 100 mp3s per genre.
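
As a rough illustration of how preview availability can be checked, the sketch below counts the tracks in a list of playlist items that carry a preview URL. It assumes each item is a dict with a `track` entry holding a `preview_url` field, which is the shape the code later in this notebook relies on. The helper name `preview_coverage` and the sample data are made up for the example.

```python
def preview_coverage(items):
    """Count playlist items whose track has a preview URL.

    `items` is assumed to look like the 'items' list of a playlist response:
    dicts with a 'track' key whose value has a 'preview_url' key.
    Returns (tracks_with_preview, total_tracks).
    """
    tracks = [item.get('track') for item in items]
    tracks = [t for t in tracks if t is not None]
    with_preview = sum(1 for t in tracks if t.get('preview_url'))
    return with_preview, len(tracks)

# Made-up sample data, for illustration only.
sample_items = [
    {'track': {'id': 'a', 'preview_url': 'https://example.com/a.mp3'}},
    {'track': {'id': 'b', 'preview_url': None}},
    {'track': None},
]
print(preview_coverage(sample_items))  # (1, 2)
```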

The names of the playlists chosen are as follows:

* Rock: "90's Rock Anthems", "80's Rock Anthems", "00's Rock Anthems", "Rock Classics"
* Classical: "Classical Essentials", "Classical Reading", "Classical Focus"
* Hip Hop: "I Love My West Coast Classics", "I Love My '90s Hip Hop", "Get Turnt"
* Punk: "Classic Punk", "Punk Essentials", "New Punk Tracks"
* Jazz: "Jazz Classics", "Jazz Classics Blue Note Edition", "Late Night Jazz", "Smooth Jazz", "Jazz X-Press"

In [3]:
# Rock
rock_new_1 = '37i9dQZF1DX1rVvRgjX59F' #90's Rock Anthems
rock_new_2 = '37i9dQZF1DX1spT6G94GFC' #80's Rock Anthems
rock_new_3 = '37i9dQZF1DX3oM43CtKnRV' #00's Rock Anthems
rock_new_4 = '37i9dQZF1DWXRqgorJj26U' #Rock Classics

#Classical
classical_new_1 = '37i9dQZF1DWWEJlAGA9gs0' #Classical Essentials
classical_new_2 = '37i9dQZF1DWYkztttC1w38' #Classical Reading
classical_new_3 = '37i9dQZF1DXd5zUwdn6lPb' #Classical Focus

#Hip Hop
hiphop_new_1 = '37i9dQZF1DX9sQDbOMReFI' #I Love My West Coast Classics
hiphop_new_2 = '37i9dQZF1DX186v583rmzp' #I Love My '90s Hip Hop
hiphop_new_3 = '37i9dQZF1DWY4xHQp97fN6' #Get Turnt

#Punk
punk_new_1 = '37i9dQZF1DX3LDIBRoaCDQ' #Classic Punk
punk_new_2 = '37i9dQZF1DXd6tJtr4qeot' #Punk Essentials
punk_new_3 = '37i9dQZF1DX0KpeLFwA3tO' #New Punk Tracks

#Jazz
jazz_new_1 = '37i9dQZF1DXbITWG1ZJKYt' #Jazz Classics
jazz_new_2 = '37i9dQZF1DWTR4ZOXTfd9K' #Jazz Classics Blue Note Edition
jazz_new_3 = '37i9dQZF1DX4wta20PHgwo' #Late Night Jazz
jazz_new_4 = '37i9dQZF1DXdwTUxmGKrdN' #Smooth Jazz
jazz_new_5 = '37i9dQZF1DX85XJl1mZAlp' #Jazz X-Press

From each playlist, we selected the track attributes that describe qualities of the music, such as duration and whether or not the track is explicit.

We also used the Spotify API's audio features object for each track to extract additional features, such as key, energy, and time signature.

Additionally, we took from each song the URL of a 30-second preview, which will be the subject of further analysis later on.

All of these values were then put into a pandas dataframe.

In [4]:
def features(playlist_id):
    """Return a dataframe of track attributes and audio features for one playlist."""
    tracklist = sp.playlist(playlist_id)['tracks']['items']
    tracks_features = []
    for i in tracklist:
        t = i['track']
        if t is None or t['id'] is None:
            continue  # entry has no usable track ID, so it cannot be looked up
        track_id = t['id']
        duration_ms = t['duration_ms']
        explicit = t['explicit']
        popularity = t['popularity']
        preview = t['preview_url']  # None when no 30-second preview exists
        # audio_features returns a list; the entry is None if no features are available
        af = sp.audio_features(track_id)[0]
        if af is None:
            continue
        track_features = [track_id, duration_ms, explicit, popularity, af['danceability'],
                          af['energy'], af['key'], af['loudness'], af['mode'], af['speechiness'],
                          af['acousticness'], af['instrumentalness'], af['liveness'],
                          af['valence'], af['tempo'], af['time_signature'], preview]
        tracks_features.append(track_features)
    df = pd.DataFrame(data=tracks_features, columns=['ID', 'Duration', 'Explicit', 'Popularity',
                                                     'Danceability', 'Energy', 'Key', 'Loudness',
                                                     'Mode', 'Speechiness', 'Acousticness',
                                                     'Instrumentalness', 'Liveness', 'Valence',
                                                     'Tempo', 'Time Signature', 'Preview'])
    return df
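
The function above makes one audio-features request per track and reads only the items contained in the playlist response itself, which the API returns in pages. The sketch below shows one possible way to follow the paging links and to request audio features in batches. It is not what we ran for this notebook. It assumes the installed spotipy version provides `playlist_items`, `next`, and an `audio_features` call that accepts a list of up to 100 IDs. The helper names `chunked`, `all_playlist_items`, and `batched_audio_features` are hypothetical.

```python
def chunked(seq, size):
    """Yield consecutive slices of `seq` with at most `size` elements each."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

def all_playlist_items(client, playlist_id):
    """Collect every item of a playlist by following the paging links."""
    page = client.playlist_items(playlist_id)
    items = list(page['items'])
    while page.get('next'):
        page = client.next(page)
        items.extend(page['items'])
    return items

def batched_audio_features(client, track_ids, batch_size=100):
    """Fetch audio features in batches instead of one request per track.

    Returns a dict mapping track ID to its audio-features dict
    (or None when the API returns no features for that track).
    """
    features_by_id = {}
    for batch in chunked(track_ids, batch_size):
        for track_id, af in zip(batch, client.audio_features(batch)):
            features_by_id[track_id] = af
    return features_by_id

# Possible usage with the client created earlier:
# items = all_playlist_items(sp, rock_new_1)
# ids = [i['track']['id'] for i in items if i['track'] and i['track']['id']]
# af_by_id = batched_audio_features(sp, ids)
```

Passing the client as an argument keeps the helpers independent of the global `sp` object, which also makes them easy to try out with a stand-in client.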

Using this function, we got a dataframe of features from each playlist.

In [5]:
#Rock
rock_df_1 = features(rock_new_1)
rock_df_2 = features(rock_new_2)
rock_df_3 = features(rock_new_3)
rock_df_4 = features(rock_new_4)

#Classical
classical_df_1 = features(classical_new_1)
classical_df_2 = features(classical_new_2)
classical_df_3 = features(classical_new_3)

#Hip Hop
hiphop_df_1 = features(hiphop_new_1)
hiphop_df_2 = features(hiphop_new_2)
hiphop_df_3 = features(hiphop_new_3)

#Punk
punk_df_1 = features(punk_new_1)
punk_df_2 = features(punk_new_2)
punk_df_3 = features(punk_new_3)

#Jazz
jazz_df_1 = features(jazz_new_1)
jazz_df_2 = features(jazz_new_2)
jazz_df_3 = features(jazz_new_3)
jazz_df_4 = features(jazz_new_4)
jazz_df_5 = features(jazz_new_5)
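
The same step could also be written as a loop over a mapping from genre to playlist IDs, which avoids one variable per playlist. This is only a sketch of that alternative. `collect_by_genre` and `fake_fetch` are hypothetical names, and the placeholder IDs stand in for the real playlist IDs so the example runs without API access.

```python
import pandas as pd

def collect_by_genre(playlists_by_genre, fetch):
    """Return {genre: [dataframe, ...]} by calling `fetch` on every playlist ID.

    `fetch` is any callable that maps a playlist ID to a dataframe,
    for example the `features` function defined earlier.
    """
    return {
        genre: [fetch(playlist_id) for playlist_id in playlist_ids]
        for genre, playlist_ids in playlists_by_genre.items()
    }

# Stand-in fetcher and placeholder IDs, for illustration only.
def fake_fetch(playlist_id):
    return pd.DataFrame({'ID': [playlist_id + '-track']})

frames = collect_by_genre({'rock': ['p1', 'p2'], 'jazz': ['p3']}, fake_fetch)
print({genre: len(dfs) for genre, dfs in frames.items()})  # {'rock': 2, 'jazz': 1}
```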

Once these smaller dataframes were acquired, we merged them, being sure to remove any duplicate songs that appeared in more than one playlist.

In [6]:
rock_df = pd.concat([rock_df_1, rock_df_2, rock_df_3, rock_df_4], ignore_index=True).drop_duplicates()
classical_df = pd.concat([classical_df_1, classical_df_2, classical_df_3], ignore_index=True).drop_duplicates()
hiphop_df = pd.concat([hiphop_df_1, hiphop_df_2, hiphop_df_3], ignore_index=True).drop_duplicates()
punk_df = pd.concat([punk_df_1, punk_df_2, punk_df_3], ignore_index=True).drop_duplicates()
jazz_df = pd.concat([jazz_df_1, jazz_df_2, jazz_df_3, jazz_df_4, jazz_df_5], ignore_index=True).drop_duplicates()
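
Note that `drop_duplicates()` without arguments compares entire rows, so two rows for the same track are only merged if every column matches. If any column could differ between two fetches of the same track, deduplicating on the `ID` column alone is stricter. The toy data below is made up to show the difference and is not taken from our playlists.

```python
import pandas as pd

# Made-up rows: track 'y' appears twice with a different popularity value.
a = pd.DataFrame({'ID': ['x', 'y'], 'Popularity': [70, 55]})
b = pd.DataFrame({'ID': ['y', 'z'], 'Popularity': [56, 40]})

merged = pd.concat([a, b], ignore_index=True)
print(len(merged.drop_duplicates()))             # 4: whole rows differ, so nothing is dropped
print(len(merged.drop_duplicates(subset='ID')))  # 3: one row per track ID
```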

Since we only care about tracks that have previews we can analyze, we then limited the datasets to these observations only.

In [7]:
rock_df = rock_df[rock_df['Preview'].notnull()]
classical_df = classical_df[classical_df['Preview'].notnull()]
hiphop_df = hiphop_df[hiphop_df['Preview'].notnull()]
punk_df = punk_df[punk_df['Preview'].notnull()]
jazz_df = jazz_df[jazz_df['Preview'].notnull()]

In [8]:
rock_df.head()

Unnamed: 0,ID,Duration,Explicit,Popularity,Danceability,Energy,Key,Loudness,Mode,Speechiness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Time Signature,Preview
1,62nQ8UZVqR2RMvkJHkcO2o,318226,False,72,0.285,0.846,2,-6.472,1,0.0438,0.0404,0.0,0.182,0.287,108.808,4,https://p.scdn.co/mp3-preview/c5742cc09643dc07...
2,59WN2psjkt1tyaxjspN8fp,313573,True,78,0.466,0.833,7,-4.215,1,0.304,0.0266,0.0,0.0327,0.661,88.785,4,https://p.scdn.co/mp3-preview/af0c42e6dacc0b8b...
3,3d9DChrdc6BOeFsbrZ3Is0,264306,False,81,0.559,0.345,4,-13.496,1,0.0459,0.0576,0.000105,0.141,0.458,84.581,4,https://p.scdn.co/mp3-preview/90e41778392f27b6...
6,5jafMI8FLibnjkYTZ33m0c,257480,False,73,0.418,0.383,4,-11.782,1,0.0257,0.0718,0.0177,0.0896,0.352,87.773,4,https://p.scdn.co/mp3-preview/7cc3982631523940...
9,4PtZE0h5oyPhCtPjg3NeYQ,255573,False,64,0.527,0.838,3,-6.013,1,0.0323,0.0206,0.000662,0.07,0.721,117.454,4,https://p.scdn.co/mp3-preview/b1dd1977653f3668...


This left us with more than 100 observations from each genre to analyze.

In [9]:
print(f'Rock: {len(rock_df)}')
print(f'Classical: {len(classical_df)}')
print(f'Hip Hop: {len(hiphop_df)}')
print(f'Punk: {len(punk_df)}')
print(f'Jazz: {len(jazz_df)}')

Rock: 130
Classical: 190
Hip Hop: 142
Punk: 147
Jazz: 141


With these dataframes created, we exported them to CSV files for cleaning and analysis.

In [10]:
rock_df.to_csv('data/rock_df.csv')
classical_df.to_csv('data/classical_df.csv')
hiphop_df.to_csv('data/hiphop_df.csv')
punk_df.to_csv('data/punk_df.csv')
jazz_df.to_csv('data/jazz_df.csv')
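
One detail to keep in mind when reading these files back: `to_csv` writes the dataframe index as an unnamed first column by default, and a plain `pd.read_csv` then loads it as a column called `Unnamed: 0`. The sketch below demonstrates this with made-up data and an in-memory buffer instead of the files above. Passing `index_col=0` on read is one way to restore the index.

```python
import io
import pandas as pd

# Made-up data with a non-contiguous index, similar to a filtered dataframe.
df = pd.DataFrame({'ID': ['x', 'y'], 'Tempo': [108.8, 88.7]}, index=[1, 6])

buffer = io.StringIO()
df.to_csv(buffer)  # the index is written as the first, unnamed column

buffer.seek(0)
plain = pd.read_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(plain.columns))     # ['Unnamed: 0', 'ID', 'Tempo']
print(list(restored.columns))  # ['ID', 'Tempo']
```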