# The Rise and Fall of Different Music Genres
#### By: Will Slotterback, Erik Ryde, and Todd Roberts
When deciding what to work on for this project, we first thought about interesting datasets we could use. One that came to mind was the Billboard Hot 100. Billboard has been tracking the 100 most popular songs in the U.S. every week since the late 1950s, so it has amassed a massive dataset of American musical tastes. This data already paints an interesting picture of changing preferences, but on its own it's hard to tease out broad musical trends.

To do that we need a way to categorize the songs. The obvious choice is genre, but Billboard only records a song's popularity, not its genre. To get genres we turn to the Spotify API, which lets us look up the genres of an artist and attach them to the songs that artist has produced. This isn't a perfect method, since artists often cross genres and produce new types of songs, but it is accurate enough to produce valuable results. With this data we can plot the rise and fall of different musical genres in American history and answer questions like: What is the most popular genre of all time? Spoiler: It's not Ska.

### Libraries Used
* [Pandas](https://pandas.pydata.org/)
* [StatsModels](https://www.statsmodels.org/stable/index.html)
* [Bokeh](https://bokeh.pydata.org/en/latest/)

## Gathering our Data
### Songs and Artists
The first step in collecting our data is scraping the Billboard Hot 100 to gather the list of most popular songs. Billboard used to provide an API for this but has since deprecated it, so we chose instead to use the [billboard-top-100](https://github.com/darthbatman/billboard-top-100) library. Billboard chart URLs follow the predictable pattern `billboard.com/charts/*chart_name*/*date*`, so this library takes a chart name and a date and scrapes the HTML for that chart to produce a list of the songs on it. We wrote a utility function around this that fetches all of the charts starting from a specific date and writes them to a file. This function (and a few other utility functions) can be seen in our [source](https://github.com/slotterbackW/music-genres/blob/master/billboard-api/billboard.js).
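
The URL pattern above lends itself to a simple loop over weekly dates. As a rough sketch in Python (our actual scraper is the JavaScript utility linked above, and the helper name `weekly_chart_urls` is hypothetical), the list of chart pages to fetch could be generated like this, assuming the date in the URL is formatted as `YYYY-MM-DD`:

```python
from datetime import date, timedelta

BASE_URL = 'https://www.billboard.com/charts'

def weekly_chart_urls(chart_name, start, end):
    """Yield one chart URL per week from start to end (inclusive)."""
    current = start
    while current <= end:
        # date.isoformat() gives YYYY-MM-DD, the format we assume the URL expects
        yield '{}/{}/{}'.format(BASE_URL, chart_name, current.isoformat())
        current += timedelta(weeks=1)

urls = list(weekly_chart_urls('hot-100', date(1958, 8, 9), date(1958, 8, 30)))
print(urls[0])
print(len(urls))
```

Each URL would then be handed to whatever scraper turns the chart page into a list of songs.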

Running this function with a start date of *August 9th, 1958*, the earliest date for which Billboard charts are available, produced a very large dataset, as you can see below.

In [1]:
with open('./data/songs.txt') as songs_file:
    list_of_songs = songs_file.readlines()
    print(len(list_of_songs))

314501


Wow, that's a lot of music! For reference, the average song is about three minutes long, so if you listened to every entry in our dataset one after another it would take you about 655 days, or over a year and a half.
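
As a quick sanity check on that figure, the arithmetic (assuming three minutes per entry and back-to-back listening) works out like this:

```python
entries = 314501          # lines counted above
minutes_per_song = 3      # rough average, as assumed in the text

total_minutes = entries * minutes_per_song
days = total_minutes / (60 * 24)
years = days / 365

print(round(days))        # 655
print(round(years, 1))    # 1.8
```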

Our dataset is not complete, however. We want to look at broad trends, but right now the individual songs are too granular, so we need to categorize them by genre. As mentioned above, we'll use the Spotify API to fetch the genres for a song's artist and then associate them with the song. Unfortunately our data isn't the cleanest right now, so before we fetch genres we need to clean it up a little.

The first problem with our data is that the artist names we have don't always represent one artist. For example, the 1960 classic *Stuck On You* lists "Elvis Presley With The Jordanaires" as the artist. As is obvious to a human reader, this is actually two artists, *Elvis Presley* and *The Jordanaires*, but the computer doesn't know that. Our solution is to split artist names on a series of delimiters and say that a song is officially by the first artist in the resulting list.

The other issue with our data is that there are lots of duplicates. Because especially successful songs will last on the Hot 100 chart for many weeks the same song could show up multiple times. We don't want to waste resources fetching the same artist name multiple times, so making sure the names of artists we have are unique is an important concern of ours.

OK, let's get into some code. The first thing we'll do is grab the raw list of artists.

In [2]:
raw_artists = []
with open('./data/songs.txt') as songs_file:
    # Our data is separated by bar characters and artists are in column 2 (0-based)
    raw_artists = [line.split('|')[2] for line in songs_file]

Then we'll polish the raw artist data so that it only contains the primary artist we care about. To do this we'll use a method we wrote which splits a string based on a list of delimiters. [[source](https://github.com/slotterbackW/music-genres/blob/master/analysis_helpers.py#L20)]
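
Our real `multi_split` lives in the linked source; the version below is only an illustrative sketch of the idea, not that implementation. The name `multi_split_sketch` and the choice to match delimiters case-insensitively as whole, whitespace-separated words are assumptions made for this example.

```python
import re

def multi_split_sketch(text, delimiters):
    """Split text on any of the delimiters, treated as whole words."""
    # Try longer delimiters first so 'featuring' wins over shorter ones
    ordered = sorted(delimiters, key=len, reverse=True)
    alternatives = '|'.join(re.escape(d) for d in ordered)
    # Require whitespace on both sides so 'and' does not split 'Sandy'
    pattern = r'\s+(?:' + alternatives + r')\s+'
    parts = re.split(pattern, text, flags=re.IGNORECASE)
    return [part.strip() for part in parts if part.strip()]

delims = ['and', 'with', 'featuring', 'ft.', 'ft', '&', 'X']
print(multi_split_sketch('Elvis Presley With The Jordanaires', delims)[0])
```

Requiring whitespace around each delimiter is a deliberate choice here: a naive substring split on `and` would cut names like "Sandy Nelson" in half.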

As mentioned above, we also want to remove duplicates from this data, so we'll store the artist names as the keys of a dictionary, where each name can appear only once.

In [3]:
DELIMITERS = ['and', 'with', 'featuring', 'ft.', 'ft', '&', 'X']
# .split() only takes one argument, so we wrote our own "multi_split" function
# which splits a string based on a list of delimiters
from analysis_helpers import multi_split, sample_dict

# Now we'll use the delimiters to split artist names and grab the first one
# We also want to remove duplicates so we'll use a dictionary to store the results
artists = {}
for raw_artist in raw_artists[1:]:
    # We don't care about the value so we just use 0
    artists[multi_split(raw_artist, DELIMITERS)[0]] = 0
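
A set would express the same deduplication a little more directly than a dictionary with throwaway values. A minimal sketch, using a short made-up list in place of our real `raw_artists`:

```python
sample_artists = ['Elvis Presley', 'Ricky Nelson', 'Elvis Presley', 'Bobby Darin']

# A set keeps one copy of each name but forgets the original order
unique_artists = set(sample_artists)

# dict.fromkeys also deduplicates, and keeps first-seen order (Python 3.7+)
ordered_unique = list(dict.fromkeys(sample_artists))

print(len(unique_artists))
print(ordered_unique)
```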

### Genres
Now that we have a collection of unique artist names, the next step is to fetch the genres for those artists from the Spotify API so that we can associate each artist's genres with their songs and eventually do our analysis. Our method is to search for the artist's name with Spotify's search API and treat the first result returned as the artist we're looking for. We then grab the genres listed for that artist and write the artist's name and genres to a file. The result of this process (which you can see in our [source](https://github.com/slotterbackW/music-genres/blob/master/spotify.py#L25)) is shown below.

In [4]:
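import ast

artist_genres = {}
with open('./data/artists.txt') as artists_file:
    # Skip the header row
    header = artists_file.readline()
    for line in artists_file:
        split_line = line.strip().split('|')
        # The genres column holds a Python list literal; literal_eval parses it
        # without executing arbitrary code the way eval would
        artist_genres[split_line[0]] = ast.literal_eval(split_line[2])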

sample_dict(artist_genres)

{'Ricky Nelson': ['adult standards',
  'brill building pop',
  'bubblegum pop',
  'christmas',
  'doo-wop',
  'folk rock',
  'lounge',
  'merseybeat',
  'nashville sound',
  'rhythm and blues',
  'rock-and-roll',
  'rockabilly'],
 'Domenico Modugno': ['classic italian pop', 'italian pop'],
 'Bobby Darin': ['adult standards',
  'brill building pop',
  'christmas',
  'easy listening',
  'lounge',
  'rock-and-roll',
  'rockabilly',
  'soul',
  'swing',
  'vocal jazz'],
 'Kalin Twins': [],
 'Jack Scott': ['brill building pop',
  'deep adult standards',
  'doo-wop',
  'rock-and-roll',
  'rockabilly'],
 'Elvis Presley': ['christmas', 'rock-and-roll', 'rockabilly'],
 'Duane Eddy': ['adult standards',
  'brill building pop',
  'christmas',
  'doo-wop',
  'rhythm and blues',
  'rock-and-roll',
  'rockabilly',
  'surf music'],
 'Jimmy Clanton': ['brill building pop',
  'doo-wop',
  'rhythm and blues',
  'swamp pop'],
 'The Coasters': ['adult standards',
  'brill building pop',
  'christmas',
  '
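
The lookup described above (search for the artist name, take the first hit, read its genres) can be sketched roughly as follows. This is not the code in our linked source: the helper names are hypothetical, the access token is a placeholder you would obtain through Spotify's authorization flow, and the response shape assumed here (`artists`, then `items`, then `genres`) is our understanding of the search endpoint's JSON.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = 'https://api.spotify.com/v1/search'

def build_search_request(artist_name, token):
    """Build a request asking Spotify for the single best artist match."""
    query = urllib.parse.urlencode({'q': artist_name, 'type': 'artist', 'limit': 1})
    return urllib.request.Request(
        SEARCH_URL + '?' + query,
        headers={'Authorization': 'Bearer ' + token},
    )

def first_artist_genres(payload):
    """Return the genres of the first artist in a search response, or []."""
    items = payload.get('artists', {}).get('items', [])
    return items[0].get('genres', []) if items else []

def fetch_genres(artist_name, token):
    """Perform the search over the network and extract the genres."""
    with urllib.request.urlopen(build_search_request(artist_name, token)) as resp:
        return first_artist_genres(json.load(resp))

# Offline example with a hand-written payload in the assumed shape
fake = {'artists': {'items': [{'name': 'Elvis Presley', 'genres': ['rockabilly']}]}}
print(first_artist_genres(fake))
```

Returning an empty list when nothing matches mirrors entries like `'Kalin Twins': []` in the output above, although whether those came from an empty search or an artist without genres is not something this sketch can tell you.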

### Full Dataset
Now that we have aggregated all of the genres associated with our Billboard Hot 100 artists, we can create one file representing our full dataset. This is done in our [source](https://github.com/slotterbackW/music-genres/blob/master/songs_cleaning.py), which writes the combined data to a new songs file. The results can be seen below.

In [5]:
with open('./data/full_songs.txt') as cleaned_songs:
    songs_header = cleaned_songs.readline().strip().split('|')
    songs_data = [song.strip().split('|') for song in cleaned_songs.readlines()]
songs_data[:10]

[['1958-08-09', 'Poor Little Fool', 'Ricky Nelson', '1', 'Rock'],
 ['1958-08-09',
  'Nel Blu Dipinto Di Blu (Volaré)',
  'Domenico Modugno',
  '2',
  'Pop'],
 ['1958-08-09', 'Splish Splash', 'Bobby Darin', '4', 'Pop Standards'],
 ['1958-08-09', 'My True Love', 'Jack Scott', '6', 'Rock'],
 ['1958-08-09', 'Hard Headed Woman', 'Elvis Presley', '7', 'Rock'],
 ['1958-08-09', "Rebel-'rouser", 'Duane Eddy', '8', 'Rock'],
 ['1958-08-09', 'Just A Dream', 'Jimmy Clanton', '9', 'Pop'],
 ['1958-08-09', 'Yakety Yak', 'The Coasters', '11', 'Jazz & Blues'],
 ['1958-08-09', 'If Dreams Came True', 'Pat Boone', '12', 'Pop Standards'],
 ['1958-08-09', 'Fever', 'Peggy Lee', '13', 'Pop Standards']]

## Analysis
<br>
<div style="width: 500px;">[Kowalski, Analysis!](images/analysis.jpg)</div>

Let's start by creating a dictionary that maps each year to a dictionary of that year's songs, where each song maps to the list of its weekly chart rankings. When we score songs later, a song ranked #1 gets a score of 100, #2 gets 99, and so on. This will help us find the most popular songs of each year.


In [6]:
# Helper function to get the year from a date string
def get_year(date):
    return int(date.split('-')[0])

# Initialize current year variable to be the first year in the dataset
current_year = get_year(songs_data[0][0])
years_to_songs = {}
years_to_songs[current_year] = {}

for row in songs_data:
    row_year = get_year(row[0])
    song_name = row[1]
    if row_year != current_year:
        current_year = row_year
        years_to_songs[current_year] = {}
    if song_name in years_to_songs[current_year]:
        years_to_songs[current_year][song_name].append(int(row[3]))
    else:
        years_to_songs[current_year][song_name] = [int(row[3])]

# Now let's see the (shortened) results
{key:sample_dict(value) for key, value in sample_dict(years_to_songs).items()}

{1958: {'Poor Little Fool': [1, 4, 6, 5, 6, 13, 19, 36, 47, 71],
  'Nel Blu Dipinto Di Blu (Volaré)': [2,
   1,
   2,
   1,
   1,
   1,
   1,
   2,
   4,
   6,
   12,
   14,
   18,
   57,
   88],
  'Splish Splash': [4, 10, 16, 18, 45, 58, 72],
  'My True Love': [6, 3, 5, 7, 7, 12, 13, 19, 24, 31, 63, 62, 94],
  'Hard Headed Woman': [7, 13, 21, 26, 56, 78, 94],
  "Rebel-'rouser": [8, 8, 11, 13, 19, 41, 52],
  'Just A Dream': [9, 5, 4, 4, 4, 5, 6, 9, 12, 14, 19, 29, 59, 86],
  'Yakety Yak': [11, 18, 35, 37, 74, 90],
  'If Dreams Came True': [12, 15, 18, 23, 24, 55, 82, 81, 82],
  'Fever': [13, 9, 8, 12, 12, 22, 33, 40, 38, 56, 83]},
 1959: {'Smoke Gets In Your Eyes': [2, 2, 1, 1, 1, 4, 4, 12, 16, 25, 40, 45],
  'Problems': [4, 9, 11, 21, 32, 58, 77],
  'One Night': [5, 8, 13, 17, 28, 35, 61, 79, 92],
  'My Happiness': [6, 3, 2, 2, 6, 6, 6, 11, 17, 21, 37, 50, 85, 94],
  'Tom Dooley': [7, 11, 18, 24, 33, 53, 86],
  "A Lover's Question": [8, 7, 6, 7, 9, 11, 17, 25, 32, 40, 53, 77],
  'Gott

Great, this will help us figure out what the most popular song of any given year is. As we mentioned before, the popularity of a song will be assessed as `101-RANK` meaning that the top song will earn 100 points for that week. The most popular song of the year will therefore be the song with the highest total adjusted popularity.
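
To make the scoring concrete, here is the calculation for a single song, using the 1958 ranks listed for 'Poor Little Fool' in the sample output above:

```python
ranks = [1, 4, 6, 5, 6, 13, 19, 36, 47, 71]

# Each week on the chart earns 101 - rank points
weekly_points = [101 - rank for rank in ranks]
total = sum(weekly_points)

print(weekly_points[:3])  # [100, 97, 95]
print(total)              # 802
```

Because every charting week adds points, a song that lingers on the chart for a long time can outscore one that peaked higher but dropped off quickly.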

In [7]:
# Finds the most popular song in one year's worth of chart data
# input: dictionary mapping song names to a list of their weekly chart ranks
# output: tuple of the top song's name and its total adjusted score
def song_of_the_year(song_dict):
    max_score = 0
    max_song_name = ''
    for song, scores in song_dict.items():
        adjusted_scores = [101 - score for score in scores]
        total_song_score = sum(adjusted_scores)
        if total_song_score > max_score:
            max_score = total_song_score
            max_song_name = song
    return max_song_name, max_score

# Now let's use a dictionary comprehension and our new function
top_songs = {int(year):song_of_the_year(song_dict) for year, song_dict in years_to_songs.items()}

# Here are the results
top_songs

{1958: ("It's All In The Game", 1712),
 1959: ('The Battle Of New Orleans', 1752),
 1960: ('The Twist', 2330),
 1961: ('Moon River', 1714),
 1962: ('Limbo Rock', 1766),
 1963: ('Days Of Wine And Roses', 1685),
 1964: ('Hello, Dolly!', 1840),
 1965: ('The "In" Crowd', 1787),
 1966: ('Born Free', 1258),
 1967: ('Mercy, Mercy, Mercy', 1822),
 1968: ('Little Green Apples', 1767),
 1969: ('Sugar, Sugar', 1797),
 1970: ('Get Ready', 1489),
 1971: ("You've Got A Friend", 1898),
 1972: ('The First Time Ever I Saw Your Face', 1530),
 1973: ('Why Me', 2142),
 1974: ('One Hell Of A Woman', 1599),
 1975: ('Rhinestone Cowboy', 1774),
 1976: ('A Fifth Of Beethoven', 2069),
 1977: ('I Just Want To Be Your Everything', 2361),
 1978: ('Hot Child In The City', 1964),
 1979: ('Sad Eyes', 1895),
 1980: ('Call Me', 1932),
 1981: ("Jessie's Girl", 2307),
 1982: ("Don't You Want Me", 2030),
 1983: ('Flashdance...What A Feeling', 2025),
 1984: ("What's Love Got To Do With It", 1929),
 1985: ('Take On Me', 160

Looking at the total popularity score of each top song, we might be interested in plotting those scores over time to see if there is a trend. For the plotting we're going to use a library called [Bokeh](https://bokeh.pydata.org/en/latest/docs/user_guide/quickstart.html#). We chose this library because it makes creating prettier, more interactive plots easier than many of its alternatives.

In [8]:
# Bokeh accepts data in lists or Pandas DataFrames, so we'll use a DataFrame,
# which will also come in handy for some statistical analysis
import pandas as pd

# To create a DataFrame from our dictionary while keeping the years
# in a usable format, we need to convert to a list
top_songs_df = [[k, v[0], v[1]] for k,v in top_songs.items()]
top_songs_df = pd.DataFrame(top_songs_df, columns=['Year', 'SongTitle', 'PopularityScore'])

# This data looks nice in a table!
top_songs_df.head(10)

| | Year | SongTitle | PopularityScore |
|---|---|---|---|
| 0 | 1958 | It's All In The Game | 1712 |
| 1 | 1959 | The Battle Of New Orleans | 1752 |
| 2 | 1960 | The Twist | 2330 |
| 3 | 1961 | Moon River | 1714 |
| 4 | 1962 | Limbo Rock | 1766 |
| 5 | 1963 | Days Of Wine And Roses | 1685 |
| 6 | 1964 | Hello, Dolly! | 1840 |
| 7 | 1965 | The "In" Crowd | 1787 |
| 8 | 1966 | Born Free | 1258 |
| 9 | 1967 | Mercy, Mercy, Mercy | 1822 |


We also want to see if there's a relationship between the year and the popularity score of that year's most popular song. Maybe with the advent of the internet, the most popular songs stay at the top of the charts for longer than ever before. Let's do some statistical analysis.

In [9]:
import statsmodels.formula.api as smf

# Fit a linear regression of popularity score on year and check out the results
top_model = smf.ols(formula='PopularityScore ~ Year', data=top_songs_df).fit()
print(top_model.summary())

# ...and add the fitted line to our DataFrame so we can plot it too.
# Looking the parameters up by name avoids relying on their position,
# and using the Year column avoids hard-coding the range of years.
intercept, slope = top_model.params['Intercept'], top_model.params['Year']
top_songs_df['RegLine'] = intercept + slope * top_songs_df['Year']
top_songs_df.head(10)

                            OLS Regression Results                            
Dep. Variable:        PopularityScore   R-squared:                       0.750
Model:                            OLS   Adj. R-squared:                  0.746
Method:                 Least Squares   F-statistic:                     177.0
Date:                Wed, 28 Nov 2018   Prob (F-statistic):           2.07e-19
Time:                        14:13:24   Log-Likelihood:                -463.62
No. Observations:                  61   AIC:                             931.2
Df Residuals:                      59   BIC:                             935.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -9.189e+04   7109.627    -12.925      0.0

| | Year | SongTitle | PopularityScore | RegLine |
|---|---|---|---|---|
| 0 | 1958 | It's All In The Game | 1712 | 1256.785828 |
| 1 | 1959 | The Battle Of New Orleans | 1752 | 1304.35854 |
| 2 | 1960 | The Twist | 2330 | 1351.931253 |
| 3 | 1961 | Moon River | 1714 | 1399.503966 |
| 4 | 1962 | Limbo Rock | 1766 | 1447.076679 |
| 5 | 1963 | Days Of Wine And Roses | 1685 | 1494.649392 |
| 6 | 1964 | Hello, Dolly! | 1840 | 1542.222105 |
| 7 | 1965 | The "In" Crowd | 1787 | 1589.794818 |
| 8 | 1966 | Born Free | 1258 | 1637.36753 |
| 9 | 1967 | Mercy, Mercy, Mercy | 1822 | 1684.940243 |
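
Under the hood, the regression line is an ordinary least squares fit of a straight line. As an illustration only (statsmodels does the real work above, and `fit_line` is a hypothetical helper), the slope and intercept of a one-variable fit can be computed from the textbook formulas:

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Toy data lying exactly on y = 1 + 2x
intercept, slope = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(intercept, slope)
```

Each `RegLine` value in the table is then just the intercept plus the slope times the year.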


Now that we have everything we need, let's plot the data. Use your mouse to hover over data points and see more information about them!

In [10]:
# First let's set up Bokeh
import bokeh.models as bkm
import bokeh.plotting as bkp
from bokeh.io import output_notebook

output_notebook()

In [11]:
# Bokeh reads columns straight out of a ColumnDataSource built from our DataFrame
source = bkm.ColumnDataSource(data=top_songs_df)

p = bkp.figure(title="Popularity Score of Each Year's Most Popular Song Over Time",
               x_axis_label='Year', y_axis_label='Popularity Score',
               tools='box_zoom,pan,save,reset,tap,wheel_zoom')

# Line for the top song's score in each year
g1 = bkm.Line(x='Year', y='PopularityScore', line_width=2,
              line_color='#0099cc')
g1_r = p.add_glyph(source_or_glyph=source, glyph=g1)

# Attach the hover tool only to the song line so the tooltips show song details
g1_hover = bkm.HoverTool(renderers=[g1_r],
                         tooltips=[('Song', '@SongTitle'), ('Year', '@Year'),
                                   ('Score', '@PopularityScore')])
p.add_tools(g1_hover)

# Dashed line for the linear regression fit
g2 = bkm.Line(x='Year', y='RegLine', line_width=2,
              line_color='#ff9933', line_dash=(6, 6))
g2_r = p.add_glyph(source_or_glyph=source, glyph=g2)

# Label both lines in a legend
legend = bkm.Legend(items=[
    ('Top Songs', [g1_r]),
    ('Regression Line, R²={:.3f}'.format(top_model.rsquared), [g2_r]),
])
p.add_layout(legend)

bkp.show(p)