## Build Hot 100 archive

> There are some significant flaws in the files resulting from these scripts that make the data fairly unusable. Since titles and artists can contain both quotes and commas, columns are not parsed correctly. This could possibly be fixed in the script, or after the fact, but I don't have it in me to figure it out.
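One possible fix for the quoting problem noted above is to let Python's standard `csv` module do the escaping instead of building each line by hand. This is a minimal sketch, not what the notebook does: the rows are made-up examples, and it writes to an in-memory buffer rather than the real output file.

```python
import csv
import io

# Hypothetical rows: a title with embedded double quotes and an artist with a comma.
rows = [
    ["1997-12-27", 'The "Quoted" Song', "Artist, The", 1, 1, 1, 10],
    ["1997-12-27", "Plain Title", "Plain Artist", 2, 3, 2, 8],
]

# Write to an in-memory buffer; a real run would pass an open file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["date", "title", "artist", "current", "previous", "peak", "weeks"])
writer.writerows(rows)

# Reading the text back recovers the original fields, quotes and commas included.
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed[1])
```

`csv.writer` quotes any field that contains a comma or a quote and doubles embedded quotes, which is the form `pd.read_csv` expects by default.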

The idea of this notebook is to build an archive of the Hot 100 from the current date back to the oldest chart date, __1958-08-02__. It's currently set up to pull one year's worth at a time, with the idea of stacking the yearly files later. (I started doing it by year after hitting the timeouts described below, in case they were caused by a local read error on a large file. That doesn't appear to be the case.)

There is a rate limit on requests to the billboard site. I've had it time out after 10 requests, but I've also had it time out after one if I've run other requests recently.
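One common way to cope with intermittent timeouts like these is to retry a failed request after a growing delay. The helper below is a hypothetical sketch, not part of the notebook: `fetch_with_backoff`, its parameters, and the choice to catch every `Exception` are all assumptions, since the exact error raised on a timeout isn't recorded here.

```python
import time


def fetch_with_backoff(fetch, retries=3, base_delay=10, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each failure.

    fetch is any zero-argument callable (hypothetical stand-in for a chart request).
    sleep is injectable so the waiting can be skipped in tests.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            # Give up after the last allowed attempt.
            if attempt == retries - 1:
                raise
            # Wait base_delay, then 2x, then 4x, ... before retrying.
            sleep(base_delay * 2 ** attempt)
```

In this notebook it could wrap a request as `fetch_with_backoff(lambda: billboard.ChartData(chart_type, date=some_date))`, where `some_date` is whichever chart date is being fetched.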

For each chart, we have `chart.previousDate` to work with, which allows us to walk back in time. The loop works like this:

- Open our file
- Check for the oldest date; start new if there are no results yet
- Find the next oldest chart
- Start a loop and counter and write the results of that week's chart
- Set the chart date to the next oldest date
- Check if that is in our current year. Break if not.
- Wait a time interval and loop again if counter is not maxed
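The steps above can be sketched without touching the network. In this simplified illustration, `walk_back`, `get_chart`, and `FakeChart` are hypothetical names; the fake charts only carry the `date` and `previousDate` attributes the loop relies on, and the file-writing step is reduced to collecting dates.

```python
import time
from collections import namedtuple


def walk_back(get_chart, start_date, year, max_charts, wait_seconds, sleep=time.sleep):
    """Collect chart dates going back in time until the year changes or the cap is hit.

    get_chart(date) is a hypothetical stand-in for the chart request; it must
    return an object with .date and .previousDate string attributes.
    """
    dates = []
    chart = get_chart(start_date)
    for _ in range(max_charts):
        dates.append(chart.date)            # the notebook writes the chart rows here
        chart = get_chart(chart.previousDate)
        if chart.date[:4] != year:          # stepped into the prior year: stop
            break
        sleep(wait_seconds)                 # pause between requests
    return dates


# Fake data standing in for real charts.
FakeChart = namedtuple("FakeChart", ["date", "previousDate"])
fake_charts = {
    "1997-01-11": FakeChart("1997-01-11", "1997-01-04"),
    "1997-01-04": FakeChart("1997-01-04", "1996-12-28"),
    "1996-12-28": FakeChart("1996-12-28", "1996-12-21"),
}

dates = walk_back(fake_charts.__getitem__, "1997-01-11", "1997",
                  max_charts=53, wait_seconds=0)
print(dates)
```

Passing the fetch function in as an argument keeps the stopping logic testable without hitting the rate limit.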

This doesn't completely solve the rate limit, but it does pretty well with a 10-second wait between weekly charts.


In [1]:
import billboard
from datetime import datetime, timedelta, date
import os
import pandas as pd
import time

## Settings

In [14]:
# chart type from api
chart_type = 'hot-100'

# year we are working on
output_year = "1997"

# output path
outfilename = "../data/hot-100-" + output_year + ".csv"

print(outfilename)

../data/hot-100-1997.csv


## Create the file

In [15]:
# headers
header = 'date,title,artist,current,previous,peak,weeks\n'

# set exists flag
file_exists = os.path.exists(outfilename)

# write the header if the file does not exist yet
if not file_exists:
    with open(outfilename, 'a') as outputfile:
        outputfile.write(header)
        print("File created with header")
# if the file exists but is empty, write the header
else:
    file_empty = os.stat(outfilename).st_size == 0
    if file_empty:
        with open(outfilename, 'a') as outputfile:
            outputfile.write(header)
            print("Added header")
    else:
        print("File has data")

File created with header


## Chart loop

This loop checks the oldest date in the current year's file. If the file is new, it starts with the last chart in December and works back through older charts. If there are charts already, it picks up where it left off.

Beyond `output_year` above, there are two settings to help control rate limiting:

- counter: How many loops it will do before stopping.
- timer_interval: How long to wait before getting the next chart.

In [16]:
# set the counter
counter = 53

# set the time interval (seconds)
timer_interval = 10

# read in file
top_100 = pd.read_csv(outfilename)

# find the oldest week in output
oldest_date = top_100.date.min()

# if oldest_date isnull, then use begin_chart date
if pd.isnull(oldest_date):
    begin_chart_date = output_year + "-12-25"
    chart = billboard.ChartData(chart_type, date=begin_chart_date)
    print("Starting new year")
    print("Beginning date: " + chart.date)
# else, use next previous date
else:
    chart = billboard.ChartData(chart_type, date=oldest_date)
    chart = billboard.ChartData(chart_type, str(chart.previousDate))
    print("Picking up after: " + oldest_date)
    print("Beginning date: " + chart.date)

# open the file once; the with block closes it when the loop ends or errors out
with open(outfilename, 'a') as outputfile:
    for i in range(counter):
        for position in range(100):
            song = chart[position]
            # title and artist are wrapped in quotes but embedded quotes are not escaped
            line_out = str(chart.date) + ',' + '"' + song.title + '"' + ',' + '"' \
            + song.artist + '"' + ','  + str(song.rank) + ',' + str(song.lastPos) \
            + ',' + str(song.peakPos) + ',' + str(song.weeks) + '\n'
            outputfile.write(line_out)
        print(chart.date + ": " + str(chart[0]))
        # step back to the previous week's chart
        chart = billboard.ChartData(chart_type, str(chart.previousDate))
        # check if year is over
        if chart.date[:4] != output_year:
            print("Year is over")
            break
        else:
            time.sleep(timer_interval)
    print('done')

Starting new year
Beginning date: 1997-12-27
1997-12-27: 'Candle In The Wind 1997/Something About The Way You Look Tonight' by Elton John
1997-12-20: 'Candle In The Wind 1997/Something About The Way You Look Tonight' by Elton John
1997-12-13: 'Candle In The Wind 1997/Something About The Way You Look Tonight' by Elton John
1997-12-06: 'Candle In The Wind 1997/Something About The Way You Look Tonight' by Elton John
1997-11-29: 'Candle In The Wind 1997/Something About The Way You Look Tonight' by Elton John
1997-11-22: 'Candle In The Wind 1997/Something About The Way You Look Tonight' by Elton John
1997-11-15: 'Candle In The Wind 1997/Something About The Way You Look Tonight' by Elton John
1997-11-08: 'Candle In The Wind 1997/Something About The Way You Look Tonight' by Elton John
1997-11-01: 'Candle In The Wind 1997/Something About The Way You Look Tonight' by Elton John
1997-10-25: 'Candle In The Wind 1997/Something About The Way You Look Tonight' by Elton John
1997-10-18: 'Candle In Th

## Some testing

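This checks the length of the last file processed. It should be 5200 (52 weekly charts of 100 songs), unless the year has 53 Saturdays, in which case it should be 5300. That happens when the year starts on a Saturday, or is a leap year that starts on a Friday, like 2016.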

In [17]:
# read in the file
chart_peek = pd.read_csv(outfilename)

# check the length
len(chart_peek)

5200
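The expected length can be worked out for any year by counting its Saturdays. This sketch assumes every chart date falls on a Saturday, as the 1997 dates above do, and that every chart has exactly 100 entries; `count_saturdays` is a hypothetical helper name.

```python
from datetime import date, timedelta


def count_saturdays(year):
    """Return how many Saturdays fall within the given calendar year."""
    d = date(year, 1, 1)
    # Move forward to the first Saturday (Monday is 0, Saturday is 5).
    d += timedelta(days=(5 - d.weekday()) % 7)
    count = 0
    while d.year == year:
        count += 1
        d += timedelta(days=7)
    return count


for year in (1997, 2016):
    print(year, count_saturdays(year) * 100)
```

Comparing `len(chart_peek)` against `count_saturdays(int(output_year)) * 100` would flag a year with missing or duplicated weeks.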