# 01 - Scraping Billboard Charts
This notebook contains code for gathering yearly rankings of Billboard's Hot 100 singles.

In [528]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import bs4
import requests
import warnings
import re

from matplotlib import rcParams

rcParams['font.family'] = 'serif'
rcParams['font.serif'] = 'times new roman'

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

warnings.filterwarnings('ignore')

## Connecting to Wikipedia
Wikipedia has published the Billboard Year-End Hot 100 charts from **1960** to **2017**. The chart for each year is located on its own webpage, so we first iterate through **58** different URLs, collecting the HTML content for each.

In [529]:
url_base = 'https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_{}'

htmls = dict()
for year in range(1960, 2018):
    url = url_base.format(year)
    html = requests.get(url=url).content
    # name the parser explicitly so bs4 does not warn and pick one on its own
    htmls[year] = bs4.BeautifulSoup(html, 'html.parser')
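
As a quick sanity check on the year range, the following sketch builds the same URLs without sending any requests. The helper name `chart_urls` is hypothetical and is not part of the original notebook.

```python
# Same URL template as in the cell above.
URL_BASE = 'https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_{}'

def chart_urls(first_year=1960, last_year=2017):
    """Map each year to its chart URL (hypothetical helper, inclusive range)."""
    return {year: URL_BASE.format(year)
            for year in range(first_year, last_year + 1)}

urls = chart_urls()
print(len(urls))     # number of pages that would be requested
print(urls[1960])    # URL of the first chart
```

Because `range` excludes its upper bound, `range(1960, 2018)` is what covers every year through 2017.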

## Parsing Wiki tables
Each HTML file stores the year-end chart data as a table. Below, we parse each table's rows to extract the data needed, which is then converted to a dataframe.

In [336]:
dfs = list()    # for collecting dataframes

for year, soup in htmls.items():
    # collect all relevant table rows into a list
    data = soup\
        .find('table', {'class':'wikitable sortable'})\
        .find_all('tr')
    rows = [datum.contents for datum in data]
    
    # in each <tr>, the first and every other child contains no data;
    # the first <tr> (the header row) is dropped
    rows = [row[1::2] for row in rows][1:]
    
    # parse each row and store data in lists
    ranks = list()
    songs = list()
    artists_base = list() # collects only primary artist
    artists_all = list() # collects primary and featured artists
    for row in rows:
        # ranks must be cast as strings because of "Tie" as a possible value
        ranks.append(str(row[0].contents[0]))
        artists_all.append(' '.join(row[2].find_all(string=True)))
        
        # most primary artist data is a hyperlink, but some are just plaintext
        if not isinstance(row[2].contents[0], bs4.NavigableString):
            artists_base.append(row[2].contents[0].get('title'))
        else:
            artists_base.append(str(row[2].contents[0]))
        if len(row[1].contents) == 1:
            songs.append(str(row[1].contents[0]))
        else:
            songs.append(str(row[1].contents[1].find_all(string=True)[0]))
    
    # remove extra quotation marks from beginning and end of song titles
    songs = [song.strip("\"") for song in songs]
    
    # convert collected data for each year into its own dataframe
    # to be combined later
    data = dict(rank=ranks,
                song=songs,
                artist_base=artists_base,
                artist_all=artists_all,
                year=year)
    df = pd.DataFrame(data)
    dfs.append(df)
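
The slicing step can be hard to picture, so here is a minimal sketch of it on plain Python lists. The strings stand in for the bs4 nodes in each row's `.contents`; the alternating whitespace layout and the header labels are assumptions made for illustration, while the two data rows reuse values from the 2017 chart.

```python
# Stand-ins for the children of each <tr>: non-data strings alternate
# with the cells we want (assumed layout, plain strings instead of bs4 nodes).
raw_rows = [
    ['\n', 'No.', '\n', 'Title', '\n', 'Artist(s)', '\n'],   # header row
    ['\n', '96', '\n', '"Havana"', '\n', 'Camila Cabello featuring Young Thug', '\n'],
    ['\n', '97', '\n', '"What Lovers Do"', '\n', 'Maroon 5 featuring SZA', '\n'],
]

# keep every second child starting at position 1, then drop the header row
cells = [row[1::2] for row in raw_rows][1:]

# remove the surrounding quotation marks from each title
songs = [title.strip('"') for _, title, _ in cells]

print(cells)
print(songs)
```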

## Combining parsed data
Having successfully parsed the chart data for each year, the next step is to combine all the charts into a single dataframe. We also lowercase the song and artist columns for consistency.

In [554]:
billboard = pd.concat(dfs, axis=0)
billboard[['song', 'artist_all', 'artist_base']]\
    = billboard[['song', 'artist_all', 'artist_base']].applymap(str.lower)
billboard = billboard.reset_index()
billboard.tail(5)

| | index | artist_all | artist_base | rank | song | year |
|---|---|---|---|---|---|---|
| 5796 | 95 | camila cabello featuring young thug | camila cabello | 96 | havana | 2017 |
| 5797 | 96 | maroon 5 featuring sza | maroon 5 | 97 | what lovers do | 2017 |
| 5798 | 97 | blackbear | blackbear (musician) | 98 | do re mi | 2017 |
| 5799 | 98 | xxxtentacion | xxxtentacion | 99 | look at me! | 2017 |
| 5800 | 99 | keith urban featuring carrie underwood | keith urban | 100 | the fighter | 2017 |


We manually fix errors whose correction cannot be safely automated. For example, automating the conversion from **lamp-lighter** to **lamplighter** would also mean **happy-go-lucky** is converted to **happygolucky**. Additionally, many song and artist names stored on Wikipedia are less-popular variants that need to be manually normalized.
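
The hyphen problem is easy to reproduce. The sketch below uses a hypothetical blanket rule, `strip_hyphens`, that deletes every hyphen; it is not something the notebook applies.

```python
def strip_hyphens(title):
    """Naive rule (hypothetical): delete every hyphen in a title."""
    return title.replace('-', '')

# the intended fix...
print(strip_hyphens('the old lamp-lighter'))
# ...and the unwanted side effect on a title that should keep its word breaks
print(strip_hyphens('happy-go-lucky'))
```

A rule that replaced hyphens with spaces would have the opposite failure, turning **lamp-lighter** into two words, which is why these cases are handled row by row.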

In [555]:
billboard.at[33, 'song'] = 'night' # fix HTML error
billboard.at[61, 'artist_base'] = 'the safaris' # remove (page not found)
billboard.at[126, 'artist_base'] = 'dick and deedee' # change from "dee dee"
billboard.at[283, 'artist_base'] = 'dick and deedee' # change from "dee dee"
billboard.at[57, 'song'] = 'the old lamplighter' # change from "lamp-lighter"
billboard.at[238, 'song'] = 'theme from dr kildare' # remove parentheticals

for index in [145, 211, 266, 285, 331, 375]:
    billboard.at[index, 'artist_base'] = 'dion' # remove last name

billboard.at[347, 'song'] = 'memphis' # remove tennessee
billboard.at[388, 'song'] = 'fly me to the moon' # remove bossa nova
billboard.at[442, 'artist_base'] = 'the rip chords' # use artist_all value
billboard.at[489, 'song'] = 'walk dont run' # remove year
billboard.at[686, 'song'] = 'just my style' # remove "she's"
billboard.at[871, 'artist_base'] = 'tommy boyce' # remove second artist
billboard.at[936, 'artist_base'] = 'smokey robinson and the miracles' # add lead
billboard.at[2798, 'song'] = 'fight for your right' # use shortened title
billboard.at[1025, 'artist_base'] = 'the moments' # use artist_all
billboard.at[1084, 'artist_base'] = '100 proof aged in soul' # longer name
billboard.at[1191, 'song'] = 'i am i said' # remove ellipses
billboard.at[1193, 'song'] = 'dont knock my love part 1'
billboard.at[1200, 'song'] = 'somos novios its impossible' # add translation
billboard.at[1264, 'artist_base'] = 'dr hook' # remove "medicine show"
billboard.at[1290, 'artist_base'] = 'stephen schwartz'
billboard.at[1371, 'song'] = 'do you want to dance' # fix from "wanna"
billboard.at[1481, 'artist_base'] = 'donny and marie osmond'
billboard.at[1481, 'song'] = 'im leaving it all up to you' # add "all"
billboard.at[1565, 'song'] = 'one man woman one woman man' # fix spacing
billboard.at[1749, 'song'] = 'swayin to the music slow dancin' # fix from "dancing"
billboard.at[1773, 'artist_base'] = 'david dundas' # remove lord
billboard.at[1800, 'artist_base'] = 'cj and co' # shorten from "company"
billboard.at[1946, 'artist_base'] = 'dr hook' # remove "medicine show"
billboard.at[2190, 'artist_base'] = 'hall and oates'
billboard.at[2320, 'artist_base'] = 'frida' # use stage name
billboard.at[2601, 'artist_base'] = 'dionne warwick' # remove "and friends"
billboard.at[2670, 'song'] = 'silent running'
billboard.at[2689, 'artist_base'] = 'run dmc'

for index in [2860, 2863, 3065]:
    billboard.at[index, 'artist_base'] = 'pebbles' # use artist_all

for index in [2718, 2978]:
    billboard.at[index, 'artist_base'] = 'tiffany' # remove last name
    
for index in [2989, 3204, 3321, 3531, 3575]:
    billboard.at[index, 'artist_base'] = 'vanessa williams' # remove middle initial

for index in [3356, 3461, 3551, 3889, 4380]:
    billboard.at[index, 'artist_base'] = '2pac' # use stage name

for index in [3453, 3580]:
    billboard.at[index, 'artist_base'] = 'immature' # change from "imx"

for index in [3500, 3537, 3549, 3598, 3616, 3802, 3914, 3976, 4265, 4289]:
    billboard.at[index, 'artist_base'] = 'brandy' # remove last name

for index in [3703, 3705, 3819, 3847, 3868, 3995, 4215, 4218, 4742]: 
    billboard.at[index, 'artist_base'] = 'puff daddy' # change from "sean combs"

billboard.at[3890, 'artist_base'] = 'luke' # change from "luther campbell"

for index in [3943, 4331]:
    billboard.at[index, 'artist_base'] = 'tyrese' # remove last name

billboard.at[3993, 'song'] = 'lesson in leavin' # remove "a"
billboard.at[4086, 'artist_base'] = 'kandi' # remove last name
billboard.at[4174, 'artist_base'] = 'romeo' # remove last name
billboard.at[4353, 'song'] = 'why dont you and i'
billboard.at[4475, 'song'] = 'why dont you and i'
billboard.at[4593, 'song'] = 'numb encore'
billboard.at[4967, 'artist_base'] = 'young money' # remove "entertainment"
billboard.at[4967, 'song'] = 'every girl in the world'
billboard.at[5321, 'song'] = 'cups' # remove parenthetical

for index in [5424, 5552, 5590]:
    billboard.at[index, 'artist_base'] = 'sia' # remove last name
    
billboard.at[2190, 'song'] = 'youve lost that lovin feeling' # add ending "g"
billboard.at[5323, 'song'] = 'scream and shout' # convert "&"

In [556]:
billboard[billboard.song.str.contains('scream')]

| | index | artist_all | artist_base | rank | song | year |
|---|---|---|---|---|---|---|
| 3556 | 55 | michael jackson and janet jackson | michael jackson | 56 | scream | 1995 |
| 5244 | 43 | usher | usher (entertainer) | 44 | scream | 2012 |
| 5323 | 22 | will.i.am featuring britney spears | will.i.am | 23 | scream and shout | 2013 |


## Removing disambiguations
One quirk of scraping data from Wikipedia is that some artist_base names carry parentheticals that "disambiguate" the entity (e.g., **tlc** could refer to a band or a television channel). Since such disambiguations are unnecessary for our purposes, we remove them.

In [557]:
billboard[billboard.artist_base.str.contains(r'\(')].sample(5)

| | index | artist_all | artist_base | rank | song | year |
|---|---|---|---|---|---|---|
| 1682 | 81 | orleans | orleans (band) | 82 | still the one | 1976 |
| 2395 | 94 | stephen bishop | stephen bishop (musician) | 95 | it might be you | 1983 |
| 1291 | 90 | yes | yes (band) | 91 | roundabout | 1972 |
| 4554 | 53 | usher and alicia keys | usher (entertainer) | 54 | my boo | 2005 |
| 4838 | 37 | plies featuring ne-yo | plies (rapper) | 38 | bust it baby (part 2) | 2008 |


In [558]:
billboard[['artist_base', 'artist_all']]\
    = billboard[['artist_base', 'artist_all']].applymap(
        lambda x: re.sub(r'\(.*', '', x).strip()
    )
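
To make the effect of the pattern concrete, here is the same substitution applied to a few individual names. The wrapper name `strip_disambiguation` is hypothetical, and the last input string is invented purely to illustrate a side effect.

```python
import re

def strip_disambiguation(name):
    """Drop everything from the first '(' onward, then trim whitespace."""
    return re.sub(r'\(.*', '', name).strip()

for name in ['usher (entertainer)', 'yes (band)', 'camila cabello']:
    print(repr(strip_disambiguation(name)))

# invented example: text that follows the parenthetical is removed as well
print(repr(strip_disambiguation('plies (rapper) featuring ne-yo')))
```

Because the pattern matches from the first opening parenthesis to the end of the string, anything that comes after a parenthetical is dropped along with it.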

## Extracting featured artists
A number of songs are performed by a primary artist together with a **featured** artist whom the primary artist invites to record. Fortunately, Wikipedia consistently denotes such collaborations with the word "featuring"; the check below confirms that no entry contains "feat" without the full word.

In [559]:
billboard.artist_all.str.contains('featuring').sum()\
    == billboard.artist_all.str.contains('feat').sum()

True

In [560]:
def extract_featured_artist(x):
    match = re.match(r".*\sfeaturing\s(.*)", x)
    if match:
        return match[1]
    else:
        return np.nan

billboard['artist_featured'] = billboard.artist_all.apply(extract_featured_artist)
billboard.sample(5)

| | index | artist_all | artist_base | rank | song | year | artist_featured |
|---|---|---|---|---|---|---|---|
| 4521 | 20 | ciara featuring ludacris | ciara | 21 | oh | 2005 | ludacris |
| 1639 | 38 | captain & tennille | captain & tennille | 39 | lonely night (angel face) | 1976 | NaN |
| 1193 | 92 | wilson pickett | wilson pickett | 93 | dont knock my love part 1 | 1971 | NaN |
| 2783 | 82 | crowded house | crowded house | 83 | something so strong | 1987 | NaN |
| 4323 | 22 | fabolous featuring tamia | fabolous | 23 | into you | 2003 | tamia |
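
The behavior of the pattern can be checked on single strings without pandas. This is a minimal sketch: the function name `featured_artist` is hypothetical, it returns `None` rather than `np.nan` to stay within the standard library, and the input with two "featuring" clauses is invented.

```python
import re

# same pattern as above, compiled once and written as a raw string
FEATURING = re.compile(r'.*\sfeaturing\s(.*)')

def featured_artist(artist_all):
    """Return the text after 'featuring', or None when there is no match."""
    match = FEATURING.match(artist_all)
    return match[1] if match else None

print(featured_artist('ciara featuring ludacris'))
print(featured_artist('crowded house'))

# the leading '.*' is greedy, so only the text after the LAST 'featuring' is kept
print(featured_artist('a featuring b featuring c'))
```

Entries with more than one "featuring" clause would therefore lose the earlier featured names.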


## Resolving ties

Occasionally, songs tie for a Billboard rank, in which case Wikipedia records the rank as "Tie". To address this, we convert all rank values of "Tie" to null values. Because the data is ordered sequentially by rank, we can perform a simple linear interpolation to impute the null values.

In [561]:
billboard['rank'] = billboard['rank'].replace('Tie', np.nan).astype(float).interpolate()

Linear interpolation fails only when the tie falls on the last row of a year's chart. The next rank in the combined dataframe is then 1 (the start of the following year), so a tie that follows rank 100 is interpolated to 50.5. This occurs only once, in the 1969 chart, and we manually update that rank.

In [562]:
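# chained indexing assigns to a copy, so select the row with .loc instead;
# the 'index' column holds each row's position within its own year
billboard.loc[(billboard.year == 1969) & (billboard['index'] == 100), 'rank'] = 100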
billboard['rank'] = billboard['rank'].astype(int)
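
To see where the value 50.5 comes from, the sketch below performs the same kind of equally spaced linear interpolation on plain Python lists. It is an illustration rather than pandas' implementation; the function name `interpolate_gaps` is hypothetical, and it assumes the first and last entries are known.

```python
def interpolate_gaps(values):
    """Fill None entries linearly between the nearest known neighbours.

    Assumes the first and last entries are not None.
    """
    filled = list(values)
    i = 0
    while i < len(filled):
        if filled[i] is None:
            start = i - 1                      # last known value before the gap
            end = i
            while filled[end] is None:         # first known value after the gap
                end += 1
            step = (filled[end] - filled[start]) / (end - start)
            for k in range(i, end):
                filled[k] = filled[start] + step * (k - start)
            i = end
        else:
            i += 1
    return filled

# a tie in the middle of a chart is imputed sensibly
print(interpolate_gaps([9, None, 11]))
# a tie after rank 100 is averaged with rank 1 of the following year
print(interpolate_gaps([100, None, 1]))
```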

## Save data

In [563]:
billboard.to_csv('../data/billboard.csv', index=False)