# The Rise and Fall of Different Music Genres
#### By: Will Slotterback, Erik Ryde, and Todd Roberts
In deciding what to work on for this project, we first thought about interesting datasets we could use. One that came to mind was the Billboard Hot 100. Billboard has been tracking the 100 most popular songs in the U.S. every week since the late 1950s, so it has amassed a massive dataset of American musical tastes. This data already provides a really interesting picture of changing preferences, but on its own it's hard to tease out broad musical trends. To do this we need a way to categorize the different songs. The obvious choice is to do it by genre, but Billboard doesn't have data on a song's genre, just its popularity. To get this data we instead turn to the Spotify API, where we can look up the genre of an artist and associate it with the songs that artist has produced. This isn't a perfect method, since artists often cross genres and produce new types of songs, but it is accurate enough to produce valuable results. Using this data we can then plot the rise and fall of different musical genres in American history and answer questions like: What is the most popular genre of all time? Spoiler: It's not Ska.

## Gathering our Data
The first step in collecting our data is scraping the Billboard Hot 100 to gather the list of most popular songs. Billboard used to provide an API for this but has since deprecated it, so we chose instead to use the [billboard-top-100](https://github.com/darthbatman/billboard-top-100) library. Billboard chart URLs follow the predictable pattern `billboard.com/charts/*chart_name*/*date*`, so this library takes a chart name and a date and scrapes the HTML for that chart to produce a list of the songs on it. We wrote a utility function around this that allowed us to fetch all of the charts starting from a specific date and write them to a file. This function (and a few other utility functions) can be seen in our [source](https://github.com/slotterbackW/music-genres/blob/master/api/billboard.js).
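
To make the URL pattern concrete, here is a minimal sketch of how one URL per weekly chart could be enumerated. It is written in Python to match the rest of this notebook and is not our actual scraper, which is JavaScript. The helper name `weekly_chart_urls`, the `hot-100` chart name, and the `YYYY-MM-DD` date format are assumptions made for illustration.

```python
from datetime import date, timedelta


def weekly_chart_urls(start, end, chart_name="hot-100"):
    """Yield one chart URL per week from start to end (inclusive)."""
    current = start
    while current <= end:
        # Assumed date format: ISO 8601 (YYYY-MM-DD).
        yield f"https://www.billboard.com/charts/{chart_name}/{current.isoformat()}"
        current += timedelta(weeks=1)


# Four weekly charts: August 9, 16, 23 and 30, 1958.
urls = list(weekly_chart_urls(date(1958, 8, 9), date(1958, 8, 30)))
print(len(urls))
print(urls[0])
```

Stepping by exactly one week keeps every generated date on the same weekday as the start date, which is why the start date matters.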

Running this function with the date *August 9th, 1958*, the earliest date for which Billboard charts are available, produced a very large dataset, as you can see below.

In [1]:
with open('./data/songs.txt') as songs_file:
    list_of_songs = songs_file.readlines()
    print(len(list_of_songs))

314501


Wow, that's a lot of music! For reference, the average song is about three minutes long, so if you listened to every entry in our dataset one after another it would take you about 655 days, or over a year and a half.
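
The arithmetic behind that estimate can be checked in a few lines. This is only a back-of-the-envelope sketch: the three-minute average is the rough figure quoted above, and the count is the number of lines in `songs.txt`, so a song that stayed on the chart for several weeks is counted once per week.

```python
ENTRIES = 314501        # lines in songs.txt, from the cell above
MINUTES_PER_SONG = 3    # rough average song length assumed in the text

total_minutes = ENTRIES * MINUTES_PER_SONG
total_days = total_minutes / (60 * 24)
total_years = total_days / 365

print(f"{total_days:.0f} days, or about {total_years:.1f} years")
# prints "655 days, or about 1.8 years"
```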

Our dataset is not complete, however. We want to be able to look at broad trends in the data we have, but right now the individual songs are too granular. We need to categorize these songs by their genre. As mentioned above, to do this we'll use the Spotify API to fetch the genre for a song's artist and then associate that genre with the song. Unfortunately our data isn't the cleanest right now, so before we fetch genres we need to clean it up a little bit.

The first problem with our data is that the artist names we have don't always represent one artist. For example, the 1960s classic *Stuck On You* lists "Elvis Presley With The Jordanaires" as the artist. As is obvious to a human reader, this is actually two artists, *Elvis Presley* and *The Jordanaires*, but the computer doesn't know that. Our method for solving this is to split artist names on a series of delimiters and say that a song is officially by the first artist in the resulting list.
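
As a rough illustration of the splitting idea, here is a small stand-in written with the standard `re` module. It is not our `multi_split` helper: the function name `split_artists`, the case-insensitive matching, and the rule that a delimiter must have whitespace on both sides are all assumptions made for this sketch.

```python
import re

DELIMITERS = ['and', 'with', 'featuring', 'ft.', 'ft', '&', 'X']


def split_artists(raw_name, delimiters=DELIMITERS):
    """Split a credited artist string into individual artist names."""
    # Try longer delimiters first so 'ft.' is preferred over 'ft'.
    ordered = sorted(delimiters, key=len, reverse=True)
    alternatives = '|'.join(re.escape(d) for d in ordered)
    # Require whitespace on both sides so 'and' does not split 'Brandy'.
    pattern = re.compile(rf'\s+(?:{alternatives})\s+', flags=re.IGNORECASE)
    return [part.strip() for part in pattern.split(raw_name) if part.strip()]


# The first element is what we would treat as the primary artist.
print(split_artists('Elvis Presley With The Jordanaires')[0])
# prints "Elvis Presley"
```

Requiring whitespace around each delimiter is a trade-off: it protects names that merely contain a delimiter, but a band whose name includes a standalone "and" would still be cut in two.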

The other issue with our data is that there are lots of duplicates. Because especially successful songs stay on the Hot 100 chart for many weeks, the same song can show up multiple times. We don't want to waste resources fetching the same artist name multiple times, so making sure our artist names are unique is an important concern.

Ok let's get into some code. The first thing we'll do is grab the raw list of artists.

In [2]:
raw_artists = []
with open('./data/songs.txt') as songs_file:
    # Our data is separated by bar characters
    raw_artists = [line.split('|')[2] for line in songs_file]

Then we'll polish the raw artist data so that it only contains the primary artist we care about. To do this we'll use a method we wrote which splits a string based on a list of delimiters. [[source](https://github.com/slotterbackW/music-genres/blob/master/analysis_helpers.py#L20)]

In [3]:
from analysis_helpers import multi_split

DELIMITERS = ['and', 'with', 'featuring', 'ft.', 'ft', '&', 'X']

# raw_artists[1:] skips the first line of the file. We consider the first
# artist whose name appears in the artist field the "primary" artist.
artist_names = [multi_split(raw_artist, DELIMITERS)[0] for raw_artist in raw_artists[1:]]

artist_names

['Ricky Nelson',
 'Domenico Modugno',
 'Perez Prado',
 'Bobby Darin',
 'Kalin Twins',
 'Jack Scott',
 'Elvis Presley',
 'Duane Eddy',
 'Jimmy Clanton',
 'The Johnny Otis Show',
 'The Coasters',
 'Pat Boone',
 'Peggy Lee',
 'The Elegants',
 'Frankie Avalon',
 'Doris Day',
 'The Danleers',
 'Poni-Tails',
 'Patti Page',
 'Jody Reynolds',
 'The Olympics',
 'Johnny Cash',
 'Jerry Butler',
 'The Rinky-Dinks',
 'Buddy Knox',
 'Jimmie Rodgers',
 'Bobby Freeman',
 'Johnny Mathis',
 'The Four Lads',
 'Perry Como',
 'Chuck Willis',
 'The Crickets',
 'Bobby Day',
 'The Everly Brothers',
 'Connie Francis',
 'Don Gibson',
 'Dean Martin',
 'Big Bopper',
 'Dean Martin',
 'Buddy Holly',
 'Jimmie Rodgers',
 'Robin Luke',
 'The Everly Brothers',
 'Bobby Freeman',
 'Jim Reeves',
 'Nat King Cole',
 'Sheb Wooley',
 'The Slades',
 'Clyde McPhatter',
 'Bobby Hendricks',
 'Faron Young',
 'Eddie Cochran',
 'Bobby Day',
 'Elvis Presley',
 'Tony',
 'Jack Scott',
 'Pat Boone',
 'The Drifters',
 'Gerry Granahan',
 

OK, we've cleaned up our data so that we now have useful artist names. The next step is to remove duplicates. To do this we'll create a dictionary keyed by artist name and rely on the fact that dictionary keys are unique to make sure each artist name appears only once.

In [4]:
artist_dict = {name: 0 for name in artist_names}

We can now see how many duplicates we eliminated by checking the length of the two collections.

In [5]:
artist_names_len = len(artist_names)
artist_dict_len = len(artist_dict)
print(f'Artist names has {artist_names_len} values.')
print(f'Artist dict has {artist_dict_len} values.')
print(f'We eliminated {artist_names_len - artist_dict_len} duplicate artist names.')

Artist names has 314500 values.
Artist dict has 7188 values.
We eliminated 307312 duplicate artist names.
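
The same uniqueness trick can be sketched with `collections.Counter`, which is a dictionary subclass that also records how many times each name appeared. This is an alternative shown for illustration, not what the cells above use. The four names in `sample_names` are copied from the artist list printed earlier.

```python
from collections import Counter

sample_names = ['Dean Martin', 'Big Bopper', 'Dean Martin', 'Buddy Holly']

# Each distinct name becomes one key; the value is its number of occurrences.
counts = Counter(sample_names)
duplicates = len(sample_names) - len(counts)

print(list(counts))   # unique names in first-seen order
print(duplicates)     # how many repeated entries were collapsed
```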


Now that we have a collection of unique artist names the next step is to fetch the genres for those artists from the Spotify API. This is so that we can correlate the genre of an artist with their songs and eventually do our analysis.

In [10]:
# First get an auth token for the Spotify API
import spotipy
import spotipy.util as util

# This is our local secrets.py module, which shadows the standard-library `secrets`
from secrets import SPOTIFY_API_KEY

username = 'slotterback'  # using my Spotify account
scope = 'user-read-private'
client_id = '6ed1f612bf9a487c9d0f7048115291d3'
redirect_uri = 'http://localhost/'
client_secret = SPOTIFY_API_KEY()
token = util.prompt_for_user_token(username, scope, client_id, client_secret, redirect_uri)
token

'BQAatscKcDNoAl3Eo3-TTtcw1q1TIVSKfvL5v4veQV0RCIp6PEk-iN0J7ycQPkoKIO0FDutdqkGaGN0rO5rNyDgkDaZUOq7X0Ft08h6eyORc4SVhW5N-1qQh48nlVgE_oxl9_Ob_TZobxqlLLIRLRlf7Ofpk0YS1HRE'

In [14]:
sp = spotipy.Spotify(auth=token)
artist_names_to_ids = {}

# Iterate over the de-duplicated names so each artist is searched only once
for artist in artist_dict:
    results = sp.search(q='artist:' + artist, type='artist')
    # Keep the first search result whose name matches exactly
    for item in results['artists']['items']:
        if item['name'] == artist:
            print(f"Found artist {artist}, Id: {item['id']}")
            artist_names_to_ids[artist] = item['id']
            break
    
artist_names_to_ids

Found artist Ricky Nelson, Id: 73sSFVlM6pkweLXE8qw1OS
Found artist Domenico Modugno, Id: 4llklDtTTyMYMY2LfFOkTI
Found artist Bobby Darin, Id: 0EodhzA6yW1bIdD5B4tcmJ
Found artist Kalin Twins, Id: 6LXtFndRkOihPIa2dWY3FH
Found artist Jack Scott, Id: 4ucP0bNegd7Q4ewdOKIBfz
Found artist Elvis Presley, Id: 43ZHCT0cAZBISjO8DG9PnE
Found artist Duane Eddy, Id: 1I5Cu7bqjkRg85idwYsD91
Found artist Jimmy Clanton, Id: 2XZXvrqedRMiKv6UWjAT4B
Found artist The Coasters, Id: 3QZKZBEmr54lAVI5XvmjnM
Found artist Pat Boone, Id: 7fmKtIgmxqNEKjATioVNsu
Found artist Peggy Lee, Id: 602DnpaSXJB4b9DZrvxbDc
Found artist The Elegants, Id: 7bNoMfBqbaLJrfH3Vw1q6L
Found artist Frankie Avalon, Id: 5zNOI87gG4RttFmYAZWaxQ
Found artist Doris Day, Id: 3ESG6pj6a0LvUKklENalT6
Found artist The Danleers, Id: 1W0oUYvRe6jjI2SuaiigFv
Found artist Patti Page, Id: 4nZN9kln8toEzOifhWG2uF
Found artist Jody Reynolds, Id: 4j07I8NDDAMIy4BWc6aqOj
Found artist The Olympics, Id: 3KtW4xANJkgfEnFgMxaj8h
Found artist Johnny Cash, Id: 6kACVP

KeyboardInterrupt: 
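
Once artist IDs are available, the genres themselves still have to be fetched. The following is a minimal sketch of one way that step could look, assuming `client` is a `spotipy.Spotify` instance. It uses the library's `artists()` call, which takes a list of IDs, and requests them in batches of at most 50. The helper name `fetch_genres` and the `batch_size` parameter are hypothetical and do not appear in our source.

```python
def fetch_genres(client, artist_ids, batch_size=50):
    """Map each artist ID to the list of genres Spotify reports for it."""
    genres_by_id = {}
    for start in range(0, len(artist_ids), batch_size):
        batch = artist_ids[start:start + batch_size]
        # One request per batch instead of one request per artist.
        response = client.artists(batch)
        for artist in response['artists']:
            # Skip any entry that came back empty.
            if artist is not None:
                genres_by_id[artist['id']] = artist.get('genres', [])
    return genres_by_id


# Hypothetical usage with the objects from the cells above:
# genres = fetch_genres(sp, list(artist_names_to_ids.values()))
```

Batching matters here because a lookup over thousands of artists is slow when each one needs its own request, as the interrupted run above suggests.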