## Build Billboard 200 archive

> The 1988 and 1989 archives are not complete. The artist field won't parse, and charts from Oct. 15, 1988 through Nov. 15, 1989 have problems.

This notebook builds a year's worth of Billboard 200 charts at a time (it could be changed to pull other charts by adjusting the `chart_type` setting). The idea is to build the archive one year at a time, which lets me fix a single year if it has a problem. There were also issues with timeouts on other charts.

A little about the Billboard 200: [The chart grew from a weekly top 10 list in 1956 to become a top 200 in May 1967, and acquired its present title in March 1992](https://en.wikipedia.org/wiki/Billboard_200).

There is a rate limit on requests to the Billboard site. I've had it time out after 10 requests, but I've also had it time out after a single request when I had run other requests recently.
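One common way to cope with an unpredictable rate limit is to retry a failed request after a growing pause. The sketch below is not part of this notebook's loop; `fetch_with_backoff` and its parameters are hypothetical names, and catching a bare `Exception` is an assumption because the exact error raised on a timeout depends on the client.

```python
import time


def fetch_with_backoff(fetch, retries=4, base_delay=10, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure.

    All names here are hypothetical; fetch is any zero-argument callable.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except Exception as error:  # assumption: the real error type may differ
            if attempt == retries - 1:
                raise  # out of attempts, let the caller see the failure
            delay = base_delay * 2 ** attempt
            print(f"Request failed ({error}); waiting {delay} seconds")
            sleep(delay)
```

In this notebook the call might look like `fetch_with_backoff(lambda: billboard.ChartData(chart_type, date=begin_chart_date))`, though I have not tested how the site responds to repeated retries.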

For each chart, we have `chart.previousDate` to work with, which allows us to walk back in time. The loop works like this:

- Open our file
- Check for the oldest date; start a new year if there are no results yet
- Find the next oldest chart
- Start a loop and counter and write the results of that week's chart
- Set the chart date to the next oldest date
- Check if that is in our current year. Break if not.
- Wait a time interval and loop again if counter is not maxed

This doesn't completely solve the rate limit, but it does pretty well with a 10-second wait between weekly charts.
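The walk-back logic can be tried without touching the network. This is a minimal sketch that swaps the real chart objects for a stand-in: `FakeChart`, `FAKE_CHARTS` and `walk_year` are hypothetical names, and only the `date` and `previousDate` fields mirror what the loop relies on.

```python
from collections import namedtuple

# Stand-in for a chart object: just a date and a link to the previous week.
FakeChart = namedtuple("FakeChart", ["date", "previousDate"])

# Three placeholder weeks; the oldest one falls in the prior year.
FAKE_CHARTS = {
    "1977-01-08": FakeChart("1977-01-08", "1977-01-01"),
    "1977-01-01": FakeChart("1977-01-01", "1976-12-25"),
    "1976-12-25": FakeChart("1976-12-25", None),
}


def walk_year(get_chart, start_date, year, max_weeks):
    """Collect chart dates from start_date backward, staying inside `year`."""
    collected = []
    chart = get_chart(start_date)
    for _ in range(max_weeks):
        collected.append(chart.date)
        if chart.previousDate is None:
            break  # nothing older to fetch
        chart = get_chart(chart.previousDate)
        if chart.date[:4] != year:
            break  # crossed into another year
        # a real run would pause here before the next request
    return collected


print(walk_year(FAKE_CHARTS.get, "1977-01-08", "1977", max_weeks=53))
# ['1977-01-08', '1977-01-01']
```

Passing the fetch function in as an argument keeps the loop logic separate from the request itself, so the year boundary and the counter can be checked in isolation.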


In [84]:
import billboard
from datetime import datetime, timedelta, date
import os
import pandas as pd
import time

## Settings

In [85]:
# chart type from api
chart_type = 'billboard-200'

# year we are working on
output_year = "1977"

# output path
outfilename = "../data/billboard-200-" + output_year + ".csv"

print(outfilename)

../data/billboard-200-1977.csv


## Create the file

In [86]:
# headers
header = 'date,title,artist,current,previous,peak,weeks,new\n'

# set exists flag
file_exists = os.path.exists(outfilename)

# if the file does not exist, create it and write the header
if not file_exists:
    with open(outfilename, 'a') as outputfile:
        outputfile.write(header)
        print("File created with header")
# otherwise, check whether the file is empty and write the header if so
else:
    file_empty = os.stat(outfilename).st_size == 0
    if file_empty:
        with open(outfilename, 'a') as outputfile:
            outputfile.write(header)
            print("Added header")
    else:
        print("File has data")

File created with header


## Chart loop

This loop checks the oldest date in the current year's file. If the file is new, it starts with the last chart in December and then works back through older charts. If there are charts already, it picks up where it left off.

Beyond `output_year` above, there are two settings to help control rate limiting:

- `counter`: How many loops it will do before stopping.
- `timer_interval`: How many seconds to wait before getting the next chart.

In [87]:
# set the counter
counter = 53

# set the time interval (seconds)
timer_interval = 10

# read in file
chart_file = pd.read_csv(outfilename)

# find the oldest week in the output
oldest_date = chart_file.date.min()

# if oldest_date isnull, then use begin_chart date
if pd.isnull(oldest_date):
    begin_chart_date = output_year + "-12-25"
    chart = billboard.ChartData(chart_type, date=begin_chart_date)
    print("Starting new year")
    print("Beginning date: " + chart.date)
# else, use next previous date
else:
    chart = billboard.ChartData(chart_type, date=oldest_date)
    chart = billboard.ChartData(chart_type, str(chart.previousDate))
    print("Picking up after: " + oldest_date)
    print("Beginning date: " + chart.date)

with open(outfilename, 'a') as outputfile:
    for i in range(1, counter + 1):
        # write every entry on this week's chart (older charts may have fewer than 200)
        for song in chart.entries:
            # double embedded quotes so quoted titles and artists stay valid CSV
            title = song.title.replace('"', '""')
            artist = song.artist.replace('"', '""')
            line_out = str(chart.date) + ',' + '"' + title + '"' + ',' \
            + '"' + artist + '"' + ','  + str(song.rank) + ',' \
            + str(song.lastPos) + ',' + str(song.peakPos) + ',' \
            + str(song.weeks) + ',' + str(song.isNew) + '\n'
            # reuse the handle opened above instead of reopening the file per row
            outputfile.write(line_out)
        print(chart.date + ": " + str(chart[0]))
        # step back to the previous week's chart
        chart = billboard.ChartData(chart_type, str(chart.previousDate))
        # check if year is over
        if chart.date[:4] != output_year:
            print("Year is over")
            break
        else:
            time.sleep(timer_interval)
    print('done')

Starting new year
Beginning date: 1977-12-31
1977-12-31: 'Simple Dreams' by Linda Ronstadt
1977-12-24: 'Simple Dreams' by Linda Ronstadt
1977-12-17: 'Simple Dreams' by Linda Ronstadt
1977-12-10: 'Simple Dreams' by Linda Ronstadt
1977-12-03: 'Simple Dreams' by Linda Ronstadt
1977-11-26: 'Rumours' by Fleetwood Mac
1977-11-19: 'Rumours' by Fleetwood Mac
1977-11-12: 'Rumours' by Fleetwood Mac
1977-11-05: 'Rumours' by Fleetwood Mac
1977-10-29: 'Rumours' by Fleetwood Mac
1977-10-22: 'Rumours' by Fleetwood Mac
1977-10-15: 'Rumours' by Fleetwood Mac
1977-10-08: 'Rumours' by Fleetwood Mac
1977-10-01: 'Rumours' by Fleetwood Mac
1977-09-24: 'Rumours' by Fleetwood Mac
1977-09-17: 'Rumours' by Fleetwood Mac
1977-09-10: 'Rumours' by Fleetwood Mac
1977-09-03: 'Rumours' by Fleetwood Mac
1977-08-27: 'Rumours' by Fleetwood Mac
1977-08-20: 'Rumours' by Fleetwood Mac
1977-08-13: 'Rumours' by Fleetwood Mac
1977-08-06: 'Rumours' by Fleetwood Mac
1977-07-30: 'Rumours' by Fleetwood Mac
1977-07-23: 'Rumours' b

## Some testing

This checks the length of the last file processed. It should be 10,400 rows (52 weeks of 200 entries), or 10,600 when the year holds 53 charts, which happens when there is a chart on both the first and last day of the year, or in some leap years.
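The expected totals can be worked out from the calendar. This sketch assumes chart dates always fall on Saturdays (as they do in the 1977 output above) and that every chart has 200 entries; `saturdays_in_year` and `expected_rows` are hypothetical helper names.

```python
from datetime import date, timedelta


def saturdays_in_year(year):
    """Count the Saturdays that fall inside the given calendar year."""
    day = date(year, 1, 1)
    # move forward to the first Saturday (weekday() == 5)
    day += timedelta(days=(5 - day.weekday()) % 7)
    count = 0
    while day.year == year:
        count += 1
        day += timedelta(days=7)
    return count


def expected_rows(year, entries_per_chart=200):
    """Rows expected in a year's file, one row per chart entry."""
    return saturdays_in_year(year) * entries_per_chart


print(expected_rows(1977))  # 10600
print(expected_rows(1978))  # 10400
```

1977 began and ended on a Saturday, which is why it holds 53 charts.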

In [88]:
# read in the file
chart_peek = pd.read_csv(outfilename)

# check the length
len(chart_peek)

10600
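A row count alone won't show whether a week was skipped and another written twice. As a further check, the sketch below looks for neighbouring chart dates that are not exactly seven days apart. It assumes weekly spacing, which holds for the 1977 output above; `find_gaps` is a hypothetical helper that takes the values of the `date` column as strings.

```python
from datetime import date, timedelta


def find_gaps(chart_dates):
    """Return (newer, older) pairs of neighbouring dates not exactly 7 days apart."""
    # each date repeats once per chart entry, so reduce to unique dates first
    ordered = sorted(set(chart_dates), reverse=True)
    gaps = []
    for newer, older in zip(ordered, ordered[1:]):
        if date.fromisoformat(newer) - date.fromisoformat(older) != timedelta(days=7):
            gaps.append((newer, older))
    return gaps


print(find_gaps(["1977-12-31", "1977-12-24", "1977-12-10"]))
# [('1977-12-24', '1977-12-10')]
```

With the file loaded above, something like `find_gaps(chart_peek.date.tolist())` should return an empty list for a complete year.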