# The Rise and Fall of Different Music Genres
#### By: Will Slotterback, Erik Ryde, and Todd Roberts
In deciding what to work on for this project, we first thought about interesting datasets we could use. One dataset that came to mind was the Billboard Hot 100. Billboard has been tracking the 100 most popular songs in the U.S. every week since the late 1950s, so it has amassed an enormous record of American musical tastes. This data already provides a really interesting picture of changing preferences in the U.S., but on its own it's hard to tease out broad musical trends. To do this we need a way to categorize the different songs. The obvious choice is to do it by genre, but Billboard doesn't have data on the genre of a song, just its popularity. To get this data we instead need to turn to the Spotify API, where we can look up the genre of an artist and associate it with the songs that artist has produced. This isn't a perfect method (artists do often cross genres and produce new types of songs), but it is accurate enough to produce valuable results. Using this data we can then plot the rise and fall of different musical genres in American history and answer questions like: What is the most popular genre of all time? Spoiler: It's not Ska.

## Gathering the Data
The first step in collecting our data is scraping the Billboard Hot 100 to gather the list of most popular songs. Billboard used to provide an API for this, but has since deprecated it, so we chose instead to use the [billboard-top-100](https://github.com/darthbatman/billboard-top-100) library. Billboard chart URLs follow the predictable pattern `billboard.com/charts/*chart_name*/*date*`, so this library takes a chart name and a date and scrapes the HTML for that chart to produce a list of the songs on it. We wrote a utility function around this that allowed us to fetch all of the charts starting from a specific date and write them to a file. This function (and a few other utility functions) can be seen in our [source](https://github.com/slotterbackW/music-genres/blob/master/api/billboard.js).
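
To make the URL pattern concrete, here is a minimal Python sketch that generates one chart URL per week between two dates. It is only an illustration, not the code from our source (which is JavaScript): the `weekly_chart_urls` helper, the `https://www.billboard.com` base URL, and the `YYYY-MM-DD` date format are all assumptions.

```python
from datetime import date, timedelta

# Assumed base URL; the text only gives the path pattern
BASE_URL = "https://www.billboard.com/charts"


def weekly_chart_urls(chart_name, start, end):
    """Yield one chart URL per week from start to end (inclusive)."""
    current = start
    while current <= end:
        # Assumption: the date segment is formatted as YYYY-MM-DD
        yield f"{BASE_URL}/{chart_name}/{current.isoformat()}"
        current += timedelta(weeks=1)


# Four weekly charts starting from the earliest date we used
urls = list(weekly_chart_urls("hot-100", date(1958, 8, 9), date(1958, 8, 30)))
for url in urls:
    print(url)
```

Each generated URL would then be handed to the scraper, one request per week.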

Running this function with the start date *August 9th, 1958*, the earliest date for which Billboard charts are available, produced a very large dataset, as you can see below.

In [3]:
with open('./data/songs.txt') as songs_file:
    list_of_songs = songs_file.readlines()
    print(len(list_of_songs))

311158


Wow, that's a lot of music! For reference, the average song is about three minutes long, so if you listened to all the songs in our dataset one after another it would take you about 648 days, or over a year and a half.
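
The listening-time estimate is simple arithmetic on the line count printed above. A quick check, assuming a flat three minutes per entry (the variable names here are just for illustration):

```python
NUM_ENTRIES = 311158      # line count printed above
MINUTES_PER_SONG = 3      # rough average assumed in the text

total_minutes = NUM_ENTRIES * MINUTES_PER_SONG
total_days = total_minutes / 60 / 24  # minutes -> hours -> days

print(f"{total_days:.1f} days, or about {total_days / 365:.2f} years")
```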

Our dataset is not complete, however. We want to be able to look at broad trends in the data we have, but right now the individual songs are too granular. We need to categorize these songs by their genre. As mentioned above, to do this we'll use the Spotify API to fetch the genre for a song's artist and then associate that genre with the song. Unfortunately our data isn't the cleanest right now, so before we fetch genres we need to clean it up a little bit.

The first problem with our data is that the artist names we have don't always represent one artist. For example, the 1960s classic *Stuck On You* lists "Elvis Presley With The Jordanaires" as the artist. As is obvious to a human reader, this is actually two artists, Elvis Presley and The Jordanaires, but the computer doesn't know that. Our method for solving this is to split artist names on a series of delimiters, and say that a song is officially by the first artist in that list.

The other issue with our data is that there are lots of duplicates. Because especially successful songs stay on the Hot 100 chart for many weeks, the same song can show up multiple times. We don't want to waste resources fetching the same artist name multiple times, so we need to make sure the artist names we look up are unique.

OK, enough talk. Let's get into some code. The first thing we'll do is grab the raw list of artists.

In [4]:
with open('./data/songs.txt') as songs_file:
    # Each line is comma separated; the artist is the third field (index 2).
    # strip() drops stray whitespace and the trailing newline, if any.
    raw_artists = [line.split(',')[2].strip() for line in songs_file]

Then we'll polish the raw artist data so that it only contains the primary artist we care about.

In [None]:
DELIMITERS = ['with', 'featuring', 'ft.', 'ft', '&', 'X']

# TODO
# Notes:
# Our data doesn't include the #1 artist for each week... so need to fix that
# Shouldn't have used commas as a delimiter because many song names and artist listings have them already
# Figure out how to remove the DELIMITERS from the artist name and grab the first one.
artist_names = []
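
One possible way to handle the last TODO item, together with the de-duplication described earlier, is sketched below. This is a sketch under assumptions, not a finished implementation: the helper names `primary_artist` and `unique_primary_artists` are hypothetical, delimiters are matched case-insensitively, and a delimiter only counts when it has whitespace on both sides (so the "with" inside "Smith" is left alone).

```python
import re

# Same delimiter list as above, repeated so this sketch is self-contained
DELIMITERS = ['with', 'featuring', 'ft.', 'ft', '&', 'X']

# One pattern for all delimiters. Longer delimiters are tried first, and each
# must be surrounded by whitespace so it cannot match inside a name.
_DELIMITER_PATTERN = re.compile(
    r"\s+(?:"
    + "|".join(re.escape(d) for d in sorted(DELIMITERS, key=len, reverse=True))
    + r")\s+",
    flags=re.IGNORECASE,
)


def primary_artist(name):
    """Return the text before the first delimiter, i.e. the first listed artist."""
    return _DELIMITER_PATTERN.split(name.strip(), maxsplit=1)[0].strip()


def unique_primary_artists(raw_names):
    """Reduce raw names to primary artists, dropping duplicates in first-seen order."""
    return list(dict.fromkeys(primary_artist(n) for n in raw_names))


sample = [
    "Elvis Presley With The Jordanaires",
    "Elvis Presley",
    "The Beatles",
]
artist_names = unique_primary_artists(sample)
print(artist_names)
```

A known limitation of this approach: duos whose name contains a delimiter, such as "Simon & Garfunkel", are reduced to "Simon", so a list of exceptions may be needed before the names are sent to Spotify.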