# The Rise and Fall of Different Music Genres
#### By: Will Slotterback, Erik Ryde, and Todd Roberts
In deciding what to work on for this project, we first thought about interesting datasets we could use. One dataset that came to mind was the Billboard Hot 100. Billboard has been tracking the 100 most popular songs in the U.S. every week since the late 1950s, so it has amassed a massive dataset of American musical tastes. This data already provides a really interesting picture of changing preferences, but on its own it's hard to tease out broad musical trends. To do this we need a way to categorize the songs so we can observe trends within each category. The obvious choice is to categorize by genre, but Billboard doesn't have data on a song's genre, just its popularity. To get this data we instead turn to the Spotify API, which lets us look up the genres of an artist and attach them to the songs that artist has produced. This isn't a perfect method (artists often cross genres and produce new types of songs), but it is accurate enough to produce valuable results. Using this data we can then plot the rise and fall of different musical genres in American history and answer questions like: What is the most popular genre of all time? Spoiler: It's not Ska.

#### Libraries Used
* [billboard-top-100](https://github.com/darthbatman/billboard-top-100)
  * Used to obtain Billboard Hot 100 data
* [spotipy](https://spotipy.readthedocs.io/en/latest/)
  * Used to obtain artist genre data from Spotify
* [Pandas](https://pandas.pydata.org/)
  * Used for data processing and analysis
* [StatsModels](https://www.statsmodels.org/stable/index.html)
  * Used for statistical analysis
* [Bokeh](https://bokeh.pydata.org/en/latest/)
  * Used for plotting

## Gathering our Data
### Finding Most Popular Songs
The first step in collecting our data is scraping the Billboard Hot 100 to gather the list of most popular songs. Billboard used to provide an API for this but has since deprecated it, so we chose instead to use the [billboard-top-100](https://github.com/darthbatman/billboard-top-100) library. Billboard chart URLs follow the predictable pattern `billboard.com/charts/*chart_name*/*date*`, so the library takes a chart name and a date and scrapes the HTML for that chart to produce a list of its songs. We wrote a utility function around this that fetches all of the charts starting from a specific date and writes them to a file. This function (and a few other utility functions) can be seen in our [source](https://github.com/slotterbackW/music-genres/blob/master/billboard-api/billboard.js).
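To illustrate the URL pattern, here is a minimal Python sketch that lists one chart URL per week. Our actual scraper is the JavaScript utility linked above; `chart_urls` is a hypothetical helper, and the `hot-100` chart name, the `YYYY-MM-DD` date format, and the seven-day spacing between charts are assumptions made for the example.

```python
from datetime import date, timedelta

def chart_urls(start, end, chart_name="hot-100"):
    """Yield one chart URL per week from start to end (inclusive)."""
    current = start
    while current <= end:
        # isoformat() gives YYYY-MM-DD, the format assumed for the *date* slot
        yield f"https://www.billboard.com/charts/{chart_name}/{current.isoformat()}"
        current += timedelta(weeks=1)

urls = list(chart_urls(date(1958, 8, 9), date(1958, 8, 30)))
print(len(urls))   # 4
print(urls[0])
```

Generating the dates up front keeps the scraping loop simple: each URL can be fetched and parsed independently of the others.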

Running this function with the date: *August 9th, 1958* (the earliest date Billboard charts are available) produces a very large dataset, as you can see below.

In [1]:
with open('./data/songs.txt') as songs_file:
    list_of_songs = songs_file.readlines()
    print(len(list_of_songs))

314501


Wow, that's a lot of music! For reference, the average song is about three minutes long, so if you listened to all the songs in our dataset one after another it would take you 655 days, or over a year and a half.
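The listening-time estimate is simple arithmetic, sketched below. The three-minute average is the rough assumption from the text, and the count covers every weekly chart entry in the file, so a song that charted for several weeks is counted once per week.

```python
ENTRIES = 314501          # lines counted in songs.txt above
MINUTES_PER_SONG = 3      # rough average assumed in the text

total_minutes = ENTRIES * MINUTES_PER_SONG
total_days = total_minutes / (60 * 24)

print(round(total_days))            # 655
print(round(total_days / 365, 2))   # 1.8
```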

Our dataset is not complete, however. We want to be able to look at broad trends in the data we have, but right now the individual songs are too granular. We need to categorize these songs by their genre. As mentioned above, to do this we'll use the Spotify API to fetch the genres for a song's artist and then attach them to the song. Unfortunately our data isn't the cleanest right now, so before we fetch genres we need to clean it up a little bit.

The first problem with our data is that the artist names we have don't always represent one artist. For example, the 1960s classic *Stuck On You* lists "Elvis Presley With The Jordanaires" as the artist. As is obvious to a human reader, this is actually two artists, *Elvis Presley* and *The Jordanaires*, but the computer doesn't know that. To fix this we'll split artist names on a series of delimiters and say that a song is officially *by* the first artist in the resulting list.

The other issue with our data is that there are lots of duplicates. Because especially successful songs will last on the Hot 100 chart for many weeks the same song could show up multiple times in our dataset. We don't want to waste resources fetching the same song's artist multiple times, so making sure the names of artists we have are unique is an important concern of ours.

We can actually combine both of these polishing steps into one, which we'll demonstrate below.

First, we'll grab the list of artist names.

In [2]:
raw_artists = []
with open('./data/songs.txt') as songs_file:
    # Our data is separated by bar characters and artists are in column 2 (0-based)
    raw_artists = [line.split('|')[2] for line in songs_file]

Then we'll polish the raw artist name data so that it only contains the primary artist we care about. This will turn names like *Elvis Presley With The Jordanaires* into just *Elvis Presley*. To do this we'll use a function we wrote that splits a string on a list of delimiters [[source](https://github.com/slotterbackW/music-genres/blob/master/analysis_helpers.py#L20)], and then grab the first artist from the resulting list.

As mentioned above, we also want to remove duplicates from this data so once we've found the name of the primary artist for a song, we'll store that name in a dictionary.

In [3]:
DELIMITERS = ['and', 'with', 'featuring', 'ft.', 'ft', '&', 'X']
# .split() only takes one argument, so we wrote our own "multi_split" function
# which splits a string based on a list of delimiters
from analysis_helpers import multi_split, sample_dict

# Now we'll use the delimiters to split artist names and grab the first one
# We also want to remove duplicates so we'll use a dictionary to store the results
artists = {}
for raw_artist in raw_artists[1:]:
    # We don't care about the value so we just use 0
    artists[multi_split(raw_artist, DELIMITERS)[0]] = 0

This gives us a clean list of artist names which we can now use to fetch the genres for those artists from the Spotify API. Using this we can correlate the genre of an artist with their songs and eventually do our analysis.
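For readers who don't want to open the helper module, the following is a minimal sketch of how a multi-delimiter split could work. It is not the code from `analysis_helpers.py`: `multi_split_sketch` is a hypothetical stand-in, and matching delimiters case-insensitively and only as whitespace-separated tokens are assumptions made for this example. It also uses a set instead of a dictionary to drop duplicates.

```python
import re

DELIMITERS = ['and', 'with', 'featuring', 'ft.', 'ft', '&', 'X']

def multi_split_sketch(text, delimiters):
    """Split text on any delimiter that appears as a standalone token."""
    # Try longer delimiters first so 'ft.' is preferred over 'ft'
    ordered = sorted(delimiters, key=len, reverse=True)
    alternatives = '|'.join(re.escape(d) for d in ordered)
    # Require whitespace on both sides so 'and' does not split 'Brandy'
    pattern = r'\s+(?:' + alternatives + r')\s+'
    parts = re.split(pattern, text, flags=re.IGNORECASE)
    return [part.strip() for part in parts if part.strip()]

raw = ['Elvis Presley With The Jordanaires', 'Elvis Presley', 'Brandy & Monica']
# A set comprehension keeps each primary artist exactly once
primary_artists = {multi_split_sketch(name, DELIMITERS)[0] for name in raw}
print(sorted(primary_artists))  # ['Brandy', 'Elvis Presley']
```

Any delimiter-based rule has a cost: a group whose name contains a delimiter word is truncated to the part before it.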

### Gathering the Genres of Artists
To get an artist's genre we're going to use Spotify's search API to search for the name of the artist and then say that the first result returned is the artist we're looking for. We'll then grab the genres listed for that artist and write the artist's name and genres to a file. The result of this process (which you can see in our [source](https://github.com/slotterbackW/music-genres/blob/master/spotify.py#L25)) is shown below.
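The lookup described above can be sketched roughly as follows. This is an illustration rather than our `spotify.py`: `first_artist_genres` and `fetch_genres` are hypothetical names, and the client is assumed to be a `spotipy.Spotify` instance whose `search` call returns Spotify's usual `artists` / `items` response structure.

```python
def first_artist_genres(search_response):
    """Return (name, genres) for the first artist in a search response, or None."""
    items = search_response.get('artists', {}).get('items', [])
    if not items:
        return None
    first = items[0]
    return first['name'], first.get('genres', [])

def fetch_genres(client, artist_name):
    # `client` is expected to behave like a spotipy.Spotify instance
    response = client.search(q=f'artist:{artist_name}', type='artist', limit=1)
    return first_artist_genres(response)

# Example usage (requires Spotify API credentials in the environment):
#   import spotipy
#   from spotipy.oauth2 import SpotifyClientCredentials
#   sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())
#   print(fetch_genres(sp, 'Elvis Presley'))
```

Trusting the first search result is a simplification: a misspelled or ambiguous name can match the wrong artist, and some artists come back with an empty genre list.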

In [4]:
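import ast

artist_genres = {}
with open('./data/artists.txt') as artists_file:
    header = artists_file.readline()  # Skip the header row
    for line in artists_file:
        split_line = line.strip().split('|')
        # Column 0 is the artist name and column 2 is a Python-style list of genres;
        # ast.literal_eval parses that list without executing arbitrary code
        artist_genres[split_line[0]] = ast.literal_eval(split_line[2])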

# Output a sample of this data below
sample_dict(artist_genres)

{'Ricky Nelson': ['adult standards',
  'brill building pop',
  'bubblegum pop',
  'christmas',
  'doo-wop',
  'folk rock',
  'lounge',
  'merseybeat',
  'nashville sound',
  'rhythm and blues',
  'rock-and-roll',
  'rockabilly'],
 'Domenico Modugno': ['classic italian pop', 'italian pop'],
 'Bobby Darin': ['adult standards',
  'brill building pop',
  'christmas',
  'easy listening',
  'lounge',
  'rock-and-roll',
  'rockabilly',
  'soul',
  'swing',
  'vocal jazz'],
 'Kalin Twins': [],
 'Jack Scott': ['brill building pop',
  'deep adult standards',
  'doo-wop',
  'rock-and-roll',
  'rockabilly'],
 'Elvis Presley': ['christmas', 'rock-and-roll', 'rockabilly'],
 'Duane Eddy': ['adult standards',
  'brill building pop',
  'christmas',
  'doo-wop',
  'rhythm and blues',
  'rock-and-roll',
  'rockabilly',
  'surf music'],
 'Jimmy Clanton': ['brill building pop',
  'doo-wop',
  'rhythm and blues',
  'swamp pop'],
 'The Coasters': ['adult standards',
  'brill building pop',
  'christmas',
  '

This gives us the genre data which we can correlate with songs to build our full dataset.

### Full Dataset
Now that we have gathered the genres associated with our Billboard Hot 100 artists, we can create one file representing our full dataset. Our [source](https://github.com/slotterbackW/music-genres/blob/master/songs_cleaning.py) assigns each song a single broad genre based on its artist and writes the result to a new songs file, a sample of which can be seen below.
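Spotify returns many fine-grained tags per artist, while the full dataset uses one broad label per song. One way such a reduction could work is keyword voting, sketched below. This is purely illustrative: the keyword table and `broad_category` are hypothetical and are not the mapping in `songs_cleaning.py`, so the sketch should not be expected to reproduce the labels in our data.

```python
from collections import Counter

# Hypothetical keyword table: NOT the mapping used to build full_songs.txt
CATEGORY_KEYWORDS = {
    'Rock': ['rock'],
    'Pop': ['pop'],
    'Country': ['country', 'nashville'],
    'Jazz & Blues': ['jazz', 'blues', 'swing'],
}

def broad_category(spotify_genres, default='Unknown'):
    """Pick the category whose keywords match the most Spotify genre tags."""
    votes = Counter()
    for tag in spotify_genres:
        for category, keywords in CATEGORY_KEYWORDS.items():
            if any(keyword in tag for keyword in keywords):
                votes[category] += 1
    return votes.most_common(1)[0][0] if votes else default

print(broad_category(['rock-and-roll', 'rockabilly', 'christmas']))  # Rock
print(broad_category([]))                                            # Unknown
```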

In [5]:
with open('./data/full_songs.txt') as cleaned_songs:
    songs_header = cleaned_songs.readline().strip().split('|')
    songs_data = [song.strip().split('|') for song in cleaned_songs.readlines()]

# Output a sample of this data to the notebook
songs_data[:10]

[['1958-08-09', 'Poor Little Fool', 'Ricky Nelson', '1', 'Rock'],
 ['1958-08-09',
  'Nel Blu Dipinto Di Blu (Volaré)',
  'Domenico Modugno',
  '2',
  'Pop'],
 ['1958-08-09', 'Splish Splash', 'Bobby Darin', '4', 'Pop Standards'],
 ['1958-08-09', 'My True Love', 'Jack Scott', '6', 'Rock'],
 ['1958-08-09', 'Hard Headed Woman', 'Elvis Presley', '7', 'Rock'],
 ['1958-08-09', "Rebel-'rouser", 'Duane Eddy', '8', 'Rock'],
 ['1958-08-09', 'Just A Dream', 'Jimmy Clanton', '9', 'Pop'],
 ['1958-08-09', 'Yakety Yak', 'The Coasters', '11', 'Jazz & Blues'],
 ['1958-08-09', 'If Dreams Came True', 'Pat Boone', '12', 'Pop Standards'],
 ['1958-08-09', 'Fever', 'Peggy Lee', '13', 'Pop Standards']]

## Kowalski!
<br>
<div style="width: 500px;">![Kowalski, Analysis!](images/analysis.jpg)</div>

We now have the full dataset needed to do our analysis. We'll be looking at a few key questions. The first is how the popularity of songs has changed over time. For example, have songs lasted longer on the Billboard Hot 100 as time goes on, or has wider access to music caused our musical tastes to diverge? We'll also look at how the popularity of different genres has evolved over time. After that we'll analyze whether the top song of the year generally comes from the top genre of the year, and finally we'll look at the current musical landscape, i.e. which genres and songs are most popular in 2018.
### Song Popularity Scores over Time
As mentioned above, we'll start by looking at how popularity has evolved over time. Let's create a dictionary that maps each year to a dictionary of that year's songs, where each song stores a list of its weekly scores along with its artist and genre. We'll say that a song ranked #1 gets a score of 100, #2 gets a 99, and so on down to 1 point for #100. This will help us find the most popular songs of each year.

In [6]:
# Helper function to get the year from a date string
def get_year(date):
    return int(date.split('-')[0])

# Helper function to return the score from a ranking
def get_score(ranking):
    return 101 - int(ranking)

# Initialize current year variable to be the first year in the dataset
current_year = get_year(songs_data[0][0])
years_to_songs = {}
years_to_songs[current_year] = {}

for row in songs_data:
    row_year = get_year(row[0])
    song_name = row[1]
    artist = row[2]
    genre = row[4]
    if row_year != current_year:
        current_year = row_year
        years_to_songs[current_year] = {}
    if song_name in years_to_songs[current_year]:
        years_to_songs[current_year][song_name][0].append(get_score(row[3]))
        
    else:
        years_to_songs[current_year][song_name] = [[get_score(row[3])], artist, genre]

# Now let's see the (shortened) results
{key:sample_dict(value) for key, value in sample_dict(years_to_songs).items()}

{1958: {'Poor Little Fool': [[100, 97, 95, 96, 95, 88, 82, 65, 54, 30],
   'Ricky Nelson',
   'Rock'],
  'Nel Blu Dipinto Di Blu (Volaré)': [[99,
    100,
    99,
    100,
    100,
    100,
    100,
    99,
    97,
    95,
    89,
    87,
    83,
    44,
    13],
   'Domenico Modugno',
   'Pop'],
  'Splish Splash': [[97, 91, 85, 83, 56, 43, 29],
   'Bobby Darin',
   'Pop Standards'],
  'My True Love': [[95, 98, 96, 94, 94, 89, 88, 82, 77, 70, 38, 39, 7],
   'Jack Scott',
   'Rock'],
  'Hard Headed Woman': [[94, 88, 80, 75, 45, 23, 7], 'Elvis Presley', 'Rock'],
  "Rebel-'rouser": [[93, 93, 90, 88, 82, 60, 49], 'Duane Eddy', 'Rock'],
  'Just A Dream': [[92, 96, 97, 97, 97, 96, 95, 92, 89, 87, 82, 72, 42, 15],
   'Jimmy Clanton',
   'Pop'],
  'Yakety Yak': [[90, 83, 66, 64, 27, 11], 'The Coasters', 'Jazz & Blues'],
  'If Dreams Came True': [[89, 86, 83, 78, 77, 46, 19, 20, 19],
   'Pat Boone',
   'Pop Standards'],
  'Fever': [[88, 92, 93, 89, 89, 79, 68, 61, 63, 45, 18],
   'Peggy Lee',
 

Great, this will help us figure out what the most popular song of any given year is. As we mentioned before, the popularity of a song will be assessed as `101-RANK` meaning that the top song will earn 100 points for that week. The most popular song of the year will therefore be the song with the highest total adjusted popularity.
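To make the scoring rule concrete, here is a minimal sketch on made-up rows. Unlike the dictionary above, which is keyed on the song title alone, this variant keys on year, title, and artist, so two recordings that share a title in the same year are not pooled. That is an alternative design shown for comparison, not the code that produced the results in this notebook.

```python
from collections import defaultdict

def get_score(ranking):
    return 101 - int(ranking)

# Hypothetical rows in the same layout as songs_data: [date, title, artist, rank, genre]
rows = [
    ['1960-08-01', 'Same Title', 'Artist A', '1', 'Rock'],
    ['1960-08-08', 'Same Title', 'Artist A', '3', 'Rock'],
    ['1960-08-08', 'Same Title', 'Artist B', '50', 'Pop'],
]

totals = defaultdict(int)
for date, title, artist, rank, genre in rows:
    year = int(date.split('-')[0])
    # Keying on (year, title, artist) keeps two recordings of one title apart
    totals[(year, title, artist)] += get_score(rank)

print(totals[(1960, 'Same Title', 'Artist A')])  # 198
print(totals[(1960, 'Same Title', 'Artist B')])  # 51
```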

In [7]:
# Finds the most popular song of a single year
# input: dictionary mapping song names to [list of weekly scores, artist, genre]
# output: tuple of (song name, total score, artist, genre) for the top song
def song_of_the_year(song_dict):
    max_score = 0
    max_song_name = ''
    max_song_artist = ''
    max_song_genre = ''
    for song, scores in song_dict.items():
        total_song_score = sum(scores[0])
        if total_song_score > max_score:
            max_score = total_song_score
            max_song_name = song
            max_song_artist = scores[1]
            max_song_genre = scores[2]
    return max_song_name, max_score, max_song_artist, max_song_genre

# Now let's use a dictionary comprehension and our new function
top_songs = {int(year):song_of_the_year(song_dict) for year, song_dict in years_to_songs.items()}

# Here are the results
top_songs

{1958: ("It's All In The Game", 1712, 'Tommy Edwards', 'Pop Standards'),
 1959: ('The Battle Of New Orleans', 1752, 'Johnny Horton', 'Country'),
 1960: ('The Twist', 2330, 'Hank Ballard', 'Jazz & Blues'),
 1961: ('Moon River', 1714, 'Jerry Butler', 'Soul'),
 1962: ('Limbo Rock', 1766, 'The Champs', 'Rock'),
 1963: ('Days Of Wine And Roses', 1685, 'Henry Mancini', 'Pop Standards'),
 1964: ('Hello, Dolly!', 1840, 'Louis Armstrong', 'Jazz & Blues'),
 1965: ('The "In" Crowd', 1787, 'Dobie Gray', 'Soul'),
 1966: ('Born Free', 1258, 'Roger Williams', 'Pop Standards'),
 1967: ('Mercy, Mercy, Mercy', 1822, 'Cannonball Adderley', 'Jazz & Blues'),
 1968: ('Little Green Apples', 1767, 'Roger Miller', 'Country'),
 1969: ('Sugar, Sugar', 1797, 'The Archies', 'Pop'),
 1970: ('Get Ready', 1489, 'Rare Earth', 'Rock'),
 1971: ("You've Got A Friend", 1898, 'James Taylor', 'Rock'),
 1972: ('The First Time Ever I Saw Your Face', 1530, 'Roberta Flack', 'Soul'),
 1973: ('Why Me', 2142, 'Kris Kristofferson',

We can now plot this data over time to see if there is a trend in popularity. To do this plotting, we're going to use a library called [Bokeh](https://bokeh.pydata.org/en/latest/docs/user_guide/quickstart.html#). We chose this library because it makes creating prettier, more interactive plots easier than any of the alternatives we found.

In [8]:
# Bokeh accepts data in lists or Pandas DataFrames, but we'll use Pandas to give ourselves the option
# of doing some additional statistical analysis
import pandas as pd

# To create a DataFrame from our dictionary while keeping the years
# in a usable format, we need to convert to a list
top_songs_df = [[k, v[0], v[2], v[3], v[1]] for k,v in top_songs.items()]
top_songs_df = pd.DataFrame(top_songs_df, columns=['Year', 'SongTitle', 'Artist', 'Genre', 'PopularityScore'])

# This data looks nice in a table!
top_songs_df.head(10)

| | Year | SongTitle | Artist | Genre | PopularityScore |
|---|---|---|---|---|---|
| 0 | 1958 | It's All In The Game | Tommy Edwards | Pop Standards | 1712 |
| 1 | 1959 | The Battle Of New Orleans | Johnny Horton | Country | 1752 |
| 2 | 1960 | The Twist | Hank Ballard | Jazz & Blues | 2330 |
| 3 | 1961 | Moon River | Jerry Butler | Soul | 1714 |
| 4 | 1962 | Limbo Rock | The Champs | Rock | 1766 |
| 5 | 1963 | Days Of Wine And Roses | Henry Mancini | Pop Standards | 1685 |
| 6 | 1964 | Hello, Dolly! | Louis Armstrong | Jazz & Blues | 1840 |
| 7 | 1965 | The "In" Crowd | Dobie Gray | Soul | 1787 |
| 8 | 1966 | Born Free | Roger Williams | Pop Standards | 1258 |
| 9 | 1967 | Mercy, Mercy, Mercy | Cannonball Adderley | Jazz & Blues | 1822 |


We also want to see if there's a relationship between the year and popularity score of the most popular songs of the year. Maybe with the advent of the internet, the most popular songs have stayed at the top of the charts for longer than before. Let's do some statistical analysis.

In [9]:
import statsmodels.formula.api as sm

# Let's check out our linear regression model results
top_model = sm.ols(formula='PopularityScore ~ Year', data=top_songs_df).fit()
print(top_model.summary())

# ...and add the fitted values to our DataFrame so we can plot the linear regression line too
# (fittedvalues holds Intercept + slope * Year for every row, in the same order as the DataFrame)
top_songs_df['RegLine'] = top_model.fittedvalues
top_songs_df.head(10)

                            OLS Regression Results                            
Dep. Variable:        PopularityScore   R-squared:                       0.750
Model:                            OLS   Adj. R-squared:                  0.746
Method:                 Least Squares   F-statistic:                     177.0
Date:                Tue, 04 Dec 2018   Prob (F-statistic):           2.07e-19
Time:                        22:07:50   Log-Likelihood:                -463.62
No. Observations:                  61   AIC:                             931.2
Df Residuals:                      59   BIC:                             935.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -9.189e+04   7109.627    -12.925      0.0

| | Year | SongTitle | Artist | Genre | PopularityScore | RegLine |
|---|---|---|---|---|---|---|
| 0 | 1958 | It's All In The Game | Tommy Edwards | Pop Standards | 1712 | 1256.785828 |
| 1 | 1959 | The Battle Of New Orleans | Johnny Horton | Country | 1752 | 1304.35854 |
| 2 | 1960 | The Twist | Hank Ballard | Jazz & Blues | 2330 | 1351.931253 |
| 3 | 1961 | Moon River | Jerry Butler | Soul | 1714 | 1399.503966 |
| 4 | 1962 | Limbo Rock | The Champs | Rock | 1766 | 1447.076679 |
| 5 | 1963 | Days Of Wine And Roses | Henry Mancini | Pop Standards | 1685 | 1494.649392 |
| 6 | 1964 | Hello, Dolly! | Louis Armstrong | Jazz & Blues | 1840 | 1542.222105 |
| 7 | 1965 | The "In" Crowd | Dobie Gray | Soul | 1787 | 1589.794818 |
| 8 | 1966 | Born Free | Roger Williams | Pop Standards | 1258 | 1637.36753 |
| 9 | 1967 | Mercy, Mercy, Mercy | Cannonball Adderley | Jazz & Blues | 1822 | 1684.940243 |
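In the table above, consecutive `RegLine` values differ by about 47.6, which is the fitted slope: the top song's score rises by roughly 48 points per year. For intuition about where such a line comes from, here is a minimal sketch of the closed-form least squares fit for a single predictor, run on made-up points. `fit_line` is a hypothetical helper; the notebook itself relies on StatsModels.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up points lying exactly on y = 10 + 2x
xs = [0, 1, 2, 3]
ys = [10, 12, 14, 16]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)  # 10.0 2.0
```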


Now that we have everything we need, let's plot the data. 

Use your mouse to hover over data points and see more information about them (might have to click on the graph first)!

In [10]:
from bokeh.plotting import figure, output_notebook, show, ColumnDataSource
from bokeh.models import HoverTool, NumeralTickFormatter

# This forces the output to stay inline in our notebook
output_notebook()

# This sets up information we want in our tooltips
source = ColumnDataSource(data=dict(
    Year=top_songs_df['Year'],
    SongTitle=top_songs_df['SongTitle'],
    Artist=top_songs_df['Artist'],
    PopularityScore=top_songs_df['PopularityScore'],
    RegLine=top_songs_df['RegLine']
))

# Creating our figure object
p = figure(title="Popularity Score of Each Year's Most Popular Song Over Time",
           x_axis_label='Year', y_axis_label='Popularity Score')

# Graph popularity scores
p.line(x='Year', y='PopularityScore', line_width=2, line_color='#0099cc', legend="Popularity Score", source=source)

# Graph regression line for popularity scores
p.line(x='Year', y='RegLine', line_width=2, line_color='#ff9933', line_dash=(6, 6), legend="Regression Line", source=source)

# We need to add the hover tool information
p.add_tools(HoverTool(tooltips=[
    ("Song", "@SongTitle"),
    ("Artist", "@Artist"),
    ("Year", "@Year"),
    ("Score", "@PopularityScore")
]))

# Move legend over
p.legend.location = "top_left"

show(p)

As you can see, the popularity score of each year's top song has increased sharply over time, and the regression above (R² = 0.750) supports that trend. This suggests that as our world becomes more connected, our musical tastes are converging. Another interesting observation from this data is just how popular the song *Foolish Games/You Were Meant For Me* is. This Folk/Pop hit by Jewel stayed high on the Billboard charts from 1996 all the way until 1998. It was certainly the most popular song in 1996, but is it the most popular song of all time? Stay tuned to find out!

### Popularity of Different Genres over Time

For this next section of our analysis we're going to be looking at the popularity of different genres through time. From this analysis we'll be able to visualize macro trends in the music industry and witness the rise and fall of different genres.

Right now our data isn't necessarily in the best form, so we're going to transform it to be a little more useful. Specifically, we're going to create a nested dictionary where the keys are years and the values are dictionaries mapping each genre to its total popularity score for that year.
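As a compact illustration of that structure, the sketch below builds the same kind of nested mapping from made-up entries using `defaultdict` and `Counter`, then lines the totals up as one list per genre for plotting. The sample triples and variable names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (year, genre, score) triples standing in for songs_data
entries = [(1958, 'Rock', 100), (1958, 'Pop', 99), (1959, 'Rock', 98)]

genre_totals = defaultdict(Counter)   # year -> Counter of genre -> total score
for year, genre, score in entries:
    genre_totals[year][genre] += score

years_sorted = sorted(genre_totals)
all_genres = sorted({genre for totals in genre_totals.values() for genre in totals})
# A Counter returns 0 for a missing key, so absent genres are zero-filled for free
series = {genre: [genre_totals[year][genre] for year in years_sorted]
          for genre in all_genres}
print(series)  # {'Pop': [99, 0], 'Rock': [100, 98]}
```

Because missing genres read as zero, no explicit existence checks or padding passes are needed.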

In [11]:
genre_scores = {}
for song in songs_data:
    year = get_year(song[0])
    genre = song[4]
    score = get_score(song[3])
    # Handle accessing keys which don't exist yet
    if year not in genre_scores:
        genre_scores[year] = {}
    if genre not in genre_scores[year]:
        genre_scores[year][genre] = 0
    # Add score to genre
    genre_scores[year][genre] += score

sample_dict(genre_scores)

{1958: {'Rock': 18426,
  'Pop': 7392,
  'Pop Standards': 25822,
  'Jazz & Blues': 12247,
  'Country': 6983,
  'Soul': 2458,
  'Christian/Gospel': 1030,
  'Holiday': 1060,
  'Ska/Reggae/Folk': 1239},
 1959: {'Jazz & Blues': 49213,
  'Rock': 39426,
  'Pop Standards': 54175,
  'Ska/Reggae/Folk': 3569,
  'Country': 15273,
  'Soul': 11919,
  'Pop': 16859,
  'Holiday': 2519,
  'Christian/Gospel': 2070,
  'Classical': 30},
 1960: {'Country': 20775,
  'Pop': 19499,
  'Pop Standards': 56566,
  'Jazz & Blues': 39763,
  'Christian/Gospel': 1128,
  'Soul': 19295,
  'Rock': 43824,
  'Holiday': 2720,
  'Ska/Reggae/Folk': 2958,
  'R&B/Hip-hop': 945},
 1961: {'Pop Standards': 36645,
  'Rock': 43582,
  'Holiday': 4316,
  'Pop': 28805,
  'Country': 13930,
  'Soul': 25414,
  'Jazz & Blues': 40498,
  'Christian/Gospel': 510,
  'Ska/Reggae/Folk': 610},
 1962: {'Jazz & Blues': 26347,
  'Pop': 35958,
  'Rock': 41567,
  'Country': 13655,
  'Pop Standards': 37437,
  'Soul': 28358,
  'Holiday': 1313,
  'Ska/Reg

Great! Now we have our data. Let's graph it.

In [12]:
# Create data structures for graphing
years = list(genre_scores.keys()) # List of years
# Dictionary of genres where each value is a list of total popularity scores for each year in that genre
genre_scores_list = {'Rock': [], 'Metal/Punk': [], 'Christian/Gospel': [], 
                     'Country': [], 'Dance/Electronic': [], 'Holiday': [], 
                     'Latin': [], 'Pop': [], 'Pop Standards': [], 'Classical': [], 
                     'R&B/Hip-hop': [], 'Jazz & Blues': [], 'Soul': [], 'Ska/Reggae/Folk': []}
for index, year in enumerate(years):
    genres = genre_scores[year]
    for genre, score in genres.items():
        genre_scores_list[genre].append(score)
    # Fill in genres which didn't get a score this year
    for genre, scores in genre_scores_list.items():
        if len(scores) < index + 1:
            scores.append(0)
    

# List of tools to show
tools_to_show = 'hover'

# X and Y values must be specified as ints so Bokeh doesn't use scientific notation when displaying them
tooltips = [
    ("Genre", "$name"),
    ("Year","$x{int}"),
    ("Score","$y{int}")
]

# Creating our figure object
p = figure(title="Popularity of Each Genre Over Time", 
            x_axis_label='Year', y_axis_label='Popularity Score', tools=tools_to_show, tooltips=tooltips,
            plot_width=900)

colors = {
    'Rock': "#E74C3C",
    'Metal/Punk': "#273746",
    'Christian/Gospel': "#9B59B6",
    'Country': "#48C9B0",
    'Dance/Electronic': "#27AE60",
    'Holiday': "#F7DC6F",
    'Latin': "#F39C12",
    'Pop': "#2980B9",
    'Pop Standards': "#F08080",
    'Classical': "#D35400",
    'R&B/Hip-hop': "#F1C40F",
    'Jazz & Blues': "#922B21",
    'Soul': "#C39BD3",
    'Ska/Reggae/Folk': "#7F8C8D"
}

# Graph each genre's score
for genre, scores in genre_scores_list.items():
    p.line(years, scores, name=f"{genre}", legend=f"{genre}", line_color=colors[genre], line_width=2)

# Move legend to top left
p.legend.location = "top_left"
# Enable clicking on legend items to show/hide them
p.legend.click_policy = "hide"

# Overwrite axis formatting so it doesn't use scientific notation
p.xaxis[0].formatter = NumeralTickFormatter(format="0")
p.yaxis[0].formatter = NumeralTickFormatter(format="0")

show(p)

This is pretty cool data. As you may have expected, Rock's peak popularity was in the mid-1980s while Pop and R&B/Hip-hop seem to be the dominant genres of today. While these genres may have dominated their times, did the most popular song of the year always coincide with the most popular genre? Let's find out!

### How Often is the Top Song of the Year in the Top Genre?

In [13]:
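# Finds the top genre of each year
# input: dictionary mapping years to a dictionary of genre -> total popularity score
# output: list of [year, top genre, top genre's score] rows, one per year
def top_genres(genre_dict):
    genre_oty = []
    for year, genres in genre_dict.items():
        top_genre = ''
        max_genre_score = 0
        for genre, genre_score in genres.items():
            if genre_score > max_genre_score:
                top_genre = genre
                max_genre_score = genre_score
        genre_oty.append([year, top_genre, max_genre_score])
    return genre_oty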

In [14]:
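# Maps each year to True when that year's top song belongs to that year's top genre
# Both DataFrames must have one row per year, in the same order, so their indexes line up
def song_genre_intersection(s_df, g_df):
    intersections = {}
    for idx in s_df.index:
        year = s_df.loc[idx, 'Year']
        intersections[year] = s_df.loc[idx, 'Genre'] == g_df.loc[idx, 'Genre']
    return intersections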

genre_scores_df = pd.DataFrame(top_genres(genre_scores), columns=['Year', 'Genre', 'GenreScore'])
intersections = song_genre_intersection(top_songs_df, genre_scores_df)
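The same comparison can be expressed without DataFrames. The sketch below uses made-up totals and hypothetical variable names; it picks each year's top genre with `max` and matches on the year key rather than on row position, so it does not depend on both inputs being in the same order.

```python
# Hypothetical inputs in the shapes used above
genre_scores_demo = {
    2000: {'Rock': 500, 'Pop': 900},
    2001: {'Rock': 700, 'Pop': 600},
}
top_song_genre_demo = {2000: 'Pop', 2001: 'Pop'}

# max(..., key=dict.get) returns the genre with the largest total
top_genre_by_year = {year: max(genres, key=genres.get)
                     for year, genres in genre_scores_demo.items()}

matches = {year: top_song_genre_demo[year] == top_genre
           for year, top_genre in top_genre_by_year.items()}
print(top_genre_by_year)  # {2000: 'Pop', 2001: 'Rock'}
print(matches)            # {2000: True, 2001: False}
```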


In [15]:
# Code based on example here: https://bokeh.pydata.org/en/latest/docs/gallery/pie_chart.html
from math import pi
from bokeh.transform import cumsum

# Set up input data: count the years where the top song is (or isn't) in the top genre
matches = sum(1 for value in intersections.values() if value)
input_data = {
    'Top Song In Top Genre': matches,
    'Top Song Outside Top Genre': len(intersections) - matches
}
data = pd.Series(input_data).reset_index(name='value').rename(columns={'index':'type'})
data['angle'] = data['value']/data['value'].sum() * 2*pi
# Add chart colors
c = ["#2980B9", "#E74C3C"]
data['color'] = c

# Set up the figure
p = figure(title="Frequency of Top Song in the Top Genre", plot_width=900, tools="hover", tooltips="@type: @value")
# Actually plot the data
p.wedge(x=0, y=1, radius=0.4,
        start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),
        line_color="white", fill_color="color", legend='type', source=data)

p.axis.axis_label=None
p.axis.visible=False
p.grid.grid_line_color = None

show(p)

Interestingly enough, the top song of the year falls outside that year's most popular genre far more often than not. In fact, this happens approximately 2/3 of the time.
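For readers curious how the pie chart is laid out, each category's wedge spans a share of the full circle (2π radians) proportional to its count, and each wedge starts where the previous one ended. A minimal sketch with made-up counts, chosen only to keep the arithmetic easy to follow:

```python
from itertools import accumulate
from math import pi

# Hypothetical counts, not the values from our data
counts = {'inside top genre': 1, 'outside top genre': 3}

total = sum(counts.values())
angles = [value / total * 2 * pi for value in counts.values()]
end_angles = list(accumulate(angles))      # running total of the angles
start_angles = [0.0] + end_angles[:-1]     # each wedge starts where the last ended

for label, start, end in zip(counts, start_angles, end_angles):
    print(f"{label}: {start / pi:.2f}π to {end / pi:.2f}π")
# inside top genre: 0.00π to 0.50π
# outside top genre: 0.50π to 2.00π
```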

### Current Musical Landscape
For our last piece of analysis we'll be looking at the current musical landscape. We'll be analyzing which genres are popular, and which songs are topping the charts this year.

To start, let's create a variable with the data we need. We'll need all of the information about songs for the year 2018, so we'll build a dictionary that maps each 2018 song to its first 2018 chart date, artist, total score, and genre.

In [16]:
songs_2018 = {}
# Loop backwards from the newest entry so we can stop as soon as we leave 2018
for song in reversed(songs_data):
    if get_year(song[0]) == 2018:
        song_name = song[1]
        if song_name in songs_2018:
            songs_2018[song_name][2] += get_score(song[3])
            # We are walking back in time, so this ends up as the song's earliest 2018 chart date
            songs_2018[song_name][0] = song[0]
        else:
            songs_2018[song_name] = [song[0], song[2], get_score(song[3]), song[4]]
    else:
        # No need to loop through songs before 2018
        break

# Show sample output
sample_dict(songs_2018)

{'Drowns The Whiskey': ['2018-06-30', 'Jason Aldean', 707, 'Country'],
 'Electricity': ['2018-09-22', 'Silk City', 15, 'Dance/Electronic'],
 'This Feeling': ['2018-10-06', 'The Chainsmokers', 41, 'Dance/Electronic'],
 'Kamikaze': ['2018-09-15', 'Lil Mosey', 115, 'R&B/Hip-hop'],
 'Desperate Man': ['2018-07-28', 'Eric Church', 163, 'Country'],
 'Break Up In The End': ['2018-03-10', 'Cole Swindell', 448, 'Country'],
 'Be Alright': ['2018-11-10', 'Dean Lewis', 7, 'Pop'],
 'New Patek': ['2018-09-29', 'Lil Uzi Vert', 269, 'R&B/Hip-hop'],
 'Venom': ['2018-09-15', 'Eminem', 190, 'R&B/Hip-hop'],
 'Burning Man': ['2018-11-03', 'Dierks Bentley', 11, 'Country']}

Now let's graph this data!

In [17]:
songs_dict = {
    'date': [],
    'pdDate': [],
    'title': [],
    'artist': [],
    'score': [],
    'genre': [],
    'color': []
}
for song, song_data in songs_2018.items():
    songs_dict['date'].append(song_data[0])
    songs_dict['pdDate'].append(pd.to_datetime(song_data[0]))
    songs_dict['title'].append(song)
    songs_dict['artist'].append(song_data[1])
    songs_dict['score'].append(song_data[2])
    songs_dict['genre'].append(song_data[3])
    songs_dict['color'].append(colors[song_data[3]])


# This sets up information we want in our tooltips
source = ColumnDataSource(songs_dict)

f = figure(title="2018 Songs", plot_width=900, x_axis_type="datetime", x_axis_label='Song Release Date', y_axis_label='Popularity Score')
f.circle(x="pdDate", y="score", fill_color="color", source=source, size=15, fill_alpha=.7, legend="genre")

f.add_tools(HoverTool(tooltips=[
    ("Song", "@title"),
    ("Artist", "@artist"),
    ("Score", "@score"),
    ("Date", "@date")
]))

# Move legend to top right
f.legend.location = "top_right"
# Enable clicking on legend items to show/hide them
f.legend.click_policy = "hide"
    
show(f)

As you can see, most of the popular songs in 2018 are either Pop or R&B/Hip-hop. Note that the x-axis shows the first week each song appeared on a 2018 chart rather than its actual release date. Many songs cluster toward the left side of the graph because they were likely released before 2018 and carried over onto the first charts of the year, so they are counted as 2018 songs.

## Conclusion

Collecting song data for this project certainly wasn't easy. We had to combine data from two different sources and do a decent amount of work to polish our data into a format we could use. Once we had usable data, however, we were able to plot it and tease out some interesting overarching trends in the music industry. We were surprised to find that musical tastes appear to be converging as our world becomes increasingly interconnected. We were also pretty surprised to find that the most popular song of the year usually comes from outside the most popular genre of that year.

In our analysis of musical genres over time, some things that stuck out to us were the rise of Rock through the 1980s and its subsequent decline, as well as its displacement by Pop and R&B/Hip-hop in later years. This changing trend was further backed up by our analysis of music in 2018, which showed the continued popularity of Pop and R&B/Hip-hop.

We really enjoyed plotting our data and invite anyone interested in observing musical trends to try playing with it as well.

## Ethical Implications

As mentioned in our project proposal, we believe the potential of this project to cause ethical issues is very low. Our results are only intended to provide cool, fun, and interesting facts for music enthusiasts and those looking to expand their music genre trivia knowledge. We also collected our data from publicly available, open sources, which mitigates any potential privacy concerns. While our conclusions will likely not directly affect social trends, they could potentially be misused or wrongly cited as sources of social analysis, but we believe this to be an unlikely outcome. Our methods for evaluating popularity were purely numerical and are published with the study, which creates transparency and maintains the scientific integrity of the study.