# The Rise and Fall of Different Music Genres
#### By: Will Slotterback, Erik Ryde, and Todd Roberts
In deciding what to work on for this project, we first thought about interesting datasets we could use. One that came to mind was the Billboard Hot 100. Billboard has tracked the 100 most popular songs in the U.S. every week since the late 1950s, so it has amassed a massive record of American musical tastes. This data already paints an interesting picture of changing preferences, but on its own it's hard to tease out broad musical trends. To do that we need a way to categorize the songs. The obvious choice is genre, but Billboard only records a song's popularity, not its genre. To get genres we turn to the Spotify API, which lets us look up an artist's genres and attach them to the songs that artist has produced. This isn't a perfect method, since artists often cross genres and produce new types of songs, but it is accurate enough to produce valuable results. Using this data we can plot the rise and fall of different musical genres in American history and answer questions like: What is the most popular genre of all time? Spoiler: It's not Ska.

## Gathering our Data
The first step in collecting our data is scraping the Billboard Hot 100 to gather the list of most popular songs. Billboard used to provide an API for this but has since deprecated it, so we chose instead to use the [billboard-top-100](https://github.com/darthbatman/billboard-top-100) library. Billboard chart URLs follow the predictable pattern `billboard.com/charts/*chart_name*/*date*`, so this library takes a chart name and a date and scrapes the HTML for that chart to produce a list of its songs. We wrote a utility function around this that fetches every chart starting from a specific date and writes the results to a file. This function (and a few other utility functions) can be seen in our [source](https://github.com/slotterbackW/music-genres/blob/master/billboard-api/billboard.js).
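To make the URL pattern above concrete, here is a minimal Python sketch that generates one chart URL per week. It is not the scraping code from our source (that is written in JavaScript), and the `https://www.` prefix, the `hot-100` chart name, and the `YYYY-MM-DD` date format are assumptions made for illustration.

```python
from datetime import date, timedelta


def weekly_chart_urls(chart_name, start, end):
    """Yield one chart URL per week from start to end (inclusive)."""
    current = start
    while current <= end:
        # Assumed URL shape: billboard.com/charts/<chart_name>/<YYYY-MM-DD>
        yield f"https://www.billboard.com/charts/{chart_name}/{current.isoformat()}"
        current += timedelta(weeks=1)


# The first four weekly charts, starting from the earliest date in our dataset
urls = list(weekly_chart_urls("hot-100", date(1958, 8, 9), date(1958, 8, 30)))
for url in urls:
    print(url)
```

Generating the dates up front keeps the scraping loop simple: each URL can be fetched and parsed independently, and a failed week can be retried without restarting the whole run.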

Running this function with a start date of *August 9th, 1958*, the earliest date for which Billboard charts are available, produced a very large dataset, as you can see below.

In [1]:
with open('./data/songs.txt') as songs_file:
    list_of_songs = songs_file.readlines()
    print(len(list_of_songs))

314501


Wow, that's a lot of music! For reference, the average song is about three minutes long, so if you listened to every entry in our dataset one after another it would take you about 655 days, or over a year and a half.
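The listening-time estimate above is simple arithmetic, sketched here so it can be rechecked if the dataset grows. The three-minute average is the rough figure assumed in the text, not something measured from the data.

```python
chart_entries = 314501   # number of lines counted in songs.txt above
minutes_per_song = 3     # rough average assumed in the text

# minutes -> hours -> days
total_days = chart_entries * minutes_per_song / 60 / 24
total_years = total_days / 365

print(f"{total_days:.0f} days, or roughly {total_years:.1f} years")
```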

Our dataset is not complete, however. We want to look at broad trends in the data, but right now the individual songs are too granular, so we need to categorize them by genre. As mentioned above, we'll use the Spotify API to fetch the genres for a song's artist and then associate them with the song. Unfortunately our data isn't the cleanest right now, so before we fetch genres we need to clean it up a little bit.

The first problem with our data is that the artist names we have don't always represent one artist. For example, the 1960s classic *Stuck On You* lists "Elvis Presley With The Jordanaires" as the artist. To a human reader this is obviously two artists, *Elvis Presley* and *The Jordanaires*, but the computer doesn't know that. Our solution is to split artist names on a series of delimiters and treat the first artist in the resulting list as the song's official artist.
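The splitting idea described above can be sketched as follows. This is an illustrative version, not the helper from our repository, which may behave differently; the function name `split_on_delimiters` and the choice to match delimiters only as whole, whitespace-separated words (case-insensitively) are assumptions made here.

```python
import re


def split_on_delimiters(name, delimiters):
    """Split an artist credit on any delimiter that appears as its own word."""
    # Escape each delimiter so characters like "." and "&" are matched literally
    alternatives = "|".join(re.escape(d) for d in delimiters)
    # Require whitespace on both sides so "and" does not split a name like "Sandy"
    pattern = r"\s+(?:" + alternatives + r")\s+"
    parts = re.split(pattern, name, flags=re.IGNORECASE)
    return [part.strip() for part in parts if part.strip()]


EXAMPLE_DELIMITERS = ["and", "with", "featuring", "ft.", "ft", "&", "X"]

credit = "Elvis Presley With The Jordanaires"
primary_artist = split_on_delimiters(credit, EXAMPLE_DELIMITERS)[0]
print(primary_artist)  # Elvis Presley
```

One trade-off of any delimiter-based approach is that acts whose real name contains a delimiter get cut short: a credit such as "Tony And Joe" is reduced to just "Tony".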

The other issue with our data is that there are lots of duplicates. Because especially successful songs stay on the Hot 100 chart for many weeks, the same song can show up multiple times. We don't want to waste resources fetching the same artist multiple times, so we need to make sure our artist names are unique.
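As a small illustration of the de-duplication step, Python dictionaries keep only one copy of each key and preserve insertion order, so the unique names come out in the order they first charted. The short list of names below is a made-up sample, not our full dataset.

```python
# A made-up sample with repeats, standing in for the full artist column
names = ["Dean Martin", "Big Bopper", "Dean Martin", "Buddy Holly"]

# dict.fromkeys keeps the first occurrence of each name, in order
unique_names = list(dict.fromkeys(names))
print(unique_names)  # ['Dean Martin', 'Big Bopper', 'Buddy Holly']
```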

OK, let's get into some code. The first thing we'll do is grab the raw list of artists.

In [2]:
raw_artists = []
with open('./data/songs.txt') as songs_file:
    # Our data is separated by bar characters and artists are in column 2 (0-based)
    raw_artists = [line.split('|')[2] for line in songs_file]

Then we'll polish the raw artist data so that it only contains the primary artist we care about. To do this we'll use a method we wrote which splits a string based on a list of delimiters. [[source](https://github.com/slotterbackW/music-genres/blob/master/analysis_helpers.py#L20)]

As mentioned above, we also want to remove duplicates from this data, so we'll store the artist names as keys in a dictionary.

In [3]:
DELIMITERS = ['and', 'with', 'featuring', 'ft.', 'ft', '&', 'X']
# .split() only takes one argument, so we wrote our own "multi_split" function
# which splits a string based on a list of delimiters
from analysis_helpers import multi_split

# Now we'll use the delimiters to split artist names and grab the first one
# We also want to remove duplicates so we'll use a dictionary to store the results
artists = {}
for raw_artist in raw_artists[1:]:
    # We don't care about the value so we just use 0
    artists[multi_split(raw_artist, DELIMITERS)[0]] = 0

['Ricky Nelson',
 'Domenico Modugno',
 'Perez Prado',
 'Bobby Darin',
 'Kalin Twins',
 'Jack Scott',
 'Elvis Presley',
 'Duane Eddy',
 'Jimmy Clanton',
 'The Johnny Otis Show',
 'The Coasters',
 'Pat Boone',
 'Peggy Lee',
 'The Elegants',
 'Frankie Avalon',
 'Doris Day',
 'The Danleers',
 'Poni-Tails',
 'Patti Page',
 'Jody Reynolds',
 'The Olympics',
 'Johnny Cash',
 'Jerry Butler',
 'The Rinky-Dinks',
 'Buddy Knox',
 'Jimmie Rodgers',
 'Bobby Freeman',
 'Johnny Mathis',
 'The Four Lads',
 'Perry Como',
 'Chuck Willis',
 'The Crickets',
 'Bobby Day',
 'The Everly Brothers',
 'Connie Francis',
 'Don Gibson',
 'Dean Martin',
 'Big Bopper',
 'Dean Martin',
 'Buddy Holly',
 'Jimmie Rodgers',
 'Robin Luke',
 'The Everly Brothers',
 'Bobby Freeman',
 'Jim Reeves',
 'Nat King Cole',
 'Sheb Wooley',
 'The Slades',
 'Clyde McPhatter',
 'Bobby Hendricks',
 'Faron Young',
 'Eddie Cochran',
 'Bobby Day',
 'Elvis Presley',
 'Tony',
 'Jack Scott',
 'Pat Boone',
 'The Drifters',
 'Gerry Granahan',
 

Now that we have a collection of unique artist names, the next step is to fetch the genres for those artists from the Spotify API so that we can associate each artist's genres with their songs and eventually do our analysis. Our method is to search for the artist's name with Spotify's search API and assume that the first result returned is the artist we're looking for. We then grab the genres listed for that artist and write the artist's name and genres to a file. The result of this process (which you can see in our [source](https://github.com/slotterbackW/music-genres/blob/master/spotify.py#L25)) is shown below.
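The "first search result wins" approach can be sketched roughly as below. This is not the code from our source; it is a minimal standard-library version assuming Spotify's `/v1/search` endpoint with `type=artist` and a response shaped like `{"artists": {"items": [...]}}`. The helper names are hypothetical, and the access token is a placeholder you would need to obtain separately.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.spotify.com/v1/search"


def build_search_request(artist_name, access_token):
    """Build (but do not send) a search request for a single artist."""
    query = urllib.parse.urlencode({"q": artist_name, "type": "artist", "limit": 1})
    return urllib.request.Request(
        f"{SEARCH_URL}?{query}",
        headers={"Authorization": f"Bearer {access_token}"},
    )


def first_artist_genres(search_payload):
    """Return the genres of the first artist in a search response, or None."""
    items = search_payload.get("artists", {}).get("items", [])
    if not items:
        return None  # the search found nothing for this name
    return items[0].get("genres", [])


def fetch_genres(artist_name, access_token):
    """Send the search request and extract the first result's genres."""
    request = build_search_request(artist_name, access_token)
    with urllib.request.urlopen(request) as response:
        return first_artist_genres(json.load(response))
```

Keeping the parsing in its own function makes it easy to distinguish "no match at all" (`None`) from "matched, but Spotify lists no genres" (an empty list), which is what an entry like `'Kalin Twins': []` in the output below represents.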

In [8]:
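import ast

artist_genres = {}
with open('./data/artists.txt') as artists_file:
    artists_file.readline()  # skip the header row
    for line in artists_file:
        split_line = line.strip().split('|')
        # Column 2 holds a Python list literal; literal_eval parses it
        # without executing arbitrary code the way eval would
        artist_genres[split_line[0]] = ast.literal_eval(split_line[2])
artist_genres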

{'Ricky Nelson': ['adult standards',
  'brill building pop',
  'bubblegum pop',
  'christmas',
  'doo-wop',
  'folk rock',
  'lounge',
  'merseybeat',
  'nashville sound',
  'rhythm and blues',
  'rock-and-roll',
  'rockabilly'],
 'Domenico Modugno': ['classic italian pop', 'italian pop'],
 'Bobby Darin': ['adult standards',
  'brill building pop',
  'christmas',
  'easy listening',
  'lounge',
  'rock-and-roll',
  'rockabilly',
  'soul',
  'swing',
  'vocal jazz'],
 'Kalin Twins': [],
 'Jack Scott': ['brill building pop',
  'deep adult standards',
  'doo-wop',
  'rock-and-roll',
  'rockabilly'],
 'Elvis Presley': ['christmas', 'rock-and-roll', 'rockabilly'],
 'Duane Eddy': ['adult standards',
  'brill building pop',
  'christmas',
  'doo-wop',
  'rhythm and blues',
  'rock-and-roll',
  'rockabilly',
  'surf music'],
 'Jimmy Clanton': ['brill building pop',
  'doo-wop',
  'rhythm and blues',
  'swamp pop'],
 'The Coasters': ['adult standards',
  'brill building pop',
  'christmas',
  '

Now that we have aggregated all of the genres associated with our Billboard Hot 100 artists, we need to make sure that our song data, which contains the dates, is using the same artist names as the ones we cleaned above. This is done in our [source](https://github.com/slotterbackW/music-genres/blob/master/songs_cleaning.py) and rewritten to a new songs file, the results of which can be seen below. With the data munging complete, we're ready to do some analysis!

In [1]:
with open('./data/cleaned_songs.txt') as cleaned_songs:
    songs_header = cleaned_songs.readline().strip().split('|')
    songs_data = [song.strip().split('|') for song in cleaned_songs.readlines()]
songs_data

[['1958-08-09', 'Poor Little Fool', 'Ricky Nelson', '1'],
 ['1958-08-09', 'Nel Blu Dipinto Di Blu (Volaré)', 'Domenico Modugno', '2'],
 ['1958-08-09', 'Patricia', 'Perez Prado', '3'],
 ['1958-08-09', 'Splish Splash', 'Bobby Darin', '4'],
 ['1958-08-09', 'When', 'Kalin Twins', '5'],
 ['1958-08-09', 'My True Love', 'Jack Scott', '6'],
 ['1958-08-09', 'Hard Headed Woman', 'Elvis Presley', '7'],
 ['1958-08-09', "Rebel-'rouser", 'Duane Eddy', '8'],
 ['1958-08-09', 'Just A Dream', 'Jimmy Clanton', '9'],
 ['1958-08-09', 'Willie And The Hand Jive', 'The Johnny Otis Show', '9'],
 ['1958-08-09', 'Yakety Yak', 'The Coasters', '11'],
 ['1958-08-09', 'If Dreams Came True', 'Pat Boone', '12'],
 ['1958-08-09', 'Fever', 'Peggy Lee', '13'],
 ['1958-08-09', 'Little Star', 'The Elegants', '14'],
 ['1958-08-09', 'Ginger Bread', 'Frankie Avalon', '15'],
 ['1958-08-09', 'Everybody Loves A Lover', 'Doris Day', '16'],
 ['1958-08-09', 'One Summer Night', 'The Danleers', '17'],
 ['1958-08-09', 'Born Too Late', 

Let's start by creating a dictionary of dictionaries which maps years to songs and their ranking. We'll say that a song ranked at #1 gets a score of 100, #2 gets a 99 and so on. This will help us visualize the most popular songs of each year.
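As a rough sketch of that structure, the snippet below turns rows shaped like `songs_data` into a year-to-song-to-score mapping using `score = 101 - rank`. Summing a song's weekly scores within a year is an assumption made for this illustration, the function names are hypothetical, and the third sample row is made up to show how a repeat week accumulates.

```python
from collections import defaultdict


def rank_to_score(rank):
    """Map chart position to a score: #1 -> 100, #2 -> 99, ..., #100 -> 1."""
    return 101 - rank


def scores_by_year(rows):
    """Build {year: {(title, artist): total score}} from [date, title, artist, rank] rows."""
    scores = defaultdict(lambda: defaultdict(int))
    for chart_date, title, artist, rank in rows:
        year = int(chart_date[:4])  # dates look like '1958-08-09'
        # Assumption: a song's weekly scores are summed within each year
        scores[year][(title, artist)] += rank_to_score(int(rank))
    return scores


sample_rows = [
    ['1958-08-09', 'Poor Little Fool', 'Ricky Nelson', '1'],
    ['1958-08-09', 'Patricia', 'Perez Prado', '3'],
    ['1958-08-16', 'Poor Little Fool', 'Ricky Nelson', '2'],  # made-up row for illustration
]

yearly_scores = scores_by_year(sample_rows)
print(dict(yearly_scores[1958]))
```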