## Build Billboard 200 archive

> The files these scripts produce have significant flaws that make the data fairly unusable. Titles and artists can contain both quotes and commas, so the columns are not parsed correctly. This could possibly be fixed in the script, or after the fact, but I don't have it in me to figure it out.
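
One possible fix, sketched here and not tested against the real archive, is to let Python's standard `csv` module do the quoting instead of concatenating strings by hand. The `write_rows` helper and the sample rows are made up for illustration; only the column names come from the header used later in this notebook.

```python
import csv
import io

HEADER = ["date", "title", "artist", "current", "previous", "peak", "weeks", "new"]


def write_rows(handle, rows):
    """Write chart rows with csv.writer so embedded quotes and commas are escaped."""
    writer = csv.writer(handle, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(HEADER)
    writer.writerows(rows)


# Hypothetical rows with awkward titles and artists (not real chart data).
sample_rows = [
    ["2020-01-04", 'An "Awkward" Title, Vol. 1', "Artist One, Artist Two", 1, 2, 1, 5, False],
    ["2020-01-04", "Plain Title", 'The "Quoted" Band', 2, 1, 1, 9, False],
]

buffer = io.StringIO()
write_rows(buffer, sample_rows)

# Read the text back and count the fields in each row.
buffer.seek(0)
parsed = list(csv.reader(buffer))
print([len(row) for row in parsed])
```

In the notebook itself, the same idea would mean passing `song.title`, `song.artist`, and the other fields to `writer.writerow` as a list, assuming the output file is opened with `newline=''` as the `csv` documentation recommends.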

The API archive goes back to April 1967.

This notebook builds a year's worth of Billboard 200 charts at a time (it could be changed to pull other charts by adjusting the `chart_type` setting). The idea is to build the archive one year at a time, which lets me redo a single year if it has a problem. There were also issues with timeouts on other charts.

A little about the [Billboard 200](https://en.wikipedia.org/wiki/Billboard_200): The chart grew from a weekly top 10 list in 1956 to become a top 200 in May 1967, and acquired its present title in March 1992. ... On April 1, 1967, the chart was expanded to 175 positions, then finally to 200 positions on May 13, 1967.

There is a rate limit on requests to the billboard site. I've had it time out after 10 requests, but I've also had it time out after one if I've run other requests recently.

For each chart, we have `chart.previousDate` to work with, which allows us to walk back in time. The loop works like this:

- Open our file
- Check for the oldest date; start a new year if there are no results yet
- Find the next oldest chart
- Start a loop and counter and write the results of that week's chart
- Set the chart date to the next oldest date
- Check if that is in our current year. Break if not.
- Wait a time interval and loop again if counter is not maxed

This doesn't completely avoid the rate limit, but it does pretty well with a 10-second wait between weekly charts.
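
The steps above can be sketched without touching the network. This is a simplified illustration, not the notebook's actual cell: `FakeChart`, `fake_fetch`, and `walk_back` are hypothetical names, and the fake fetcher simply assumes each chart's previous date is seven days earlier.

```python
import time
from collections import namedtuple
from datetime import date, timedelta

# Stand-in for a chart object: only the two attributes the loop relies on.
FakeChart = namedtuple("FakeChart", ["date", "previousDate"])


def fake_fetch(chart_date):
    """Hypothetical fetcher: the previous chart is assumed to be 7 days earlier."""
    previous = date.fromisoformat(chart_date) - timedelta(days=7)
    return FakeChart(date=chart_date, previousDate=previous.isoformat())


def walk_back(fetch, start_date, year, max_charts, wait_seconds, sleep=time.sleep):
    """Collect chart dates from start_date backwards until the year changes
    or max_charts charts have been handled."""
    collected = []
    chart = fetch(start_date)
    for _ in range(max_charts):
        collected.append(chart.date)        # the notebook writes the rows here
        chart = fetch(chart.previousDate)   # step to the next oldest chart
        if chart.date[:4] != year:          # stop once we leave the target year
            break
        sleep(wait_seconds)                 # pause between requests
    return collected


# Dry run: no real waiting and no network access.
dates = walk_back(fake_fetch, "2020-12-26", "2020", max_charts=53,
                  wait_seconds=10, sleep=lambda seconds: None)
print(dates[0], dates[-1], len(dates))
```

Passing the fetcher in as an argument keeps the loop testable; in the real notebook it would presumably wrap `billboard.ChartData(chart_type, date=...)`.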


In [1]:
import billboard
from datetime import datetime, timedelta, date
import os
import pandas as pd
import time

## Settings

In [13]:
# chart type from api
chart_type = 'billboard-200'

# year we are working on
output_year = "2020"

# output path
outfilename = "../data/billboard-200-" + output_year + ".csv"

print(outfilename)

../data/billboard-200-2020.csv


## Create the file

In [14]:
# headers
header = 'date,title,artist,current,previous,peak,weeks,new\n'

# set exists flag
file_exists = os.path.exists(outfilename)

# create the file and write the header if it does not exist
if not file_exists:
    with open(outfilename, 'a') as outputfile:
        outputfile.write(header)
        print("File created with header")
# if the file exists but is empty, write the header
else:
    file_empty = os.stat(outfilename).st_size == 0
    if file_empty:
        with open(outfilename, 'a') as outputfile:
            outputfile.write(header)
            print("Added header")
    else:
        print("File has data")

File has data


## Chart loop

This loop checks the oldest date already saved in the current year's file. If the file is new, it starts with the last chart in December and works back through older charts. If there are charts already, it picks up where it left off.
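
As a rough illustration of that decision, the sketch below finds the resume point with the standard `csv` module instead of pandas. The `find_resume_point` helper and the sample rows are hypothetical; it relies on the dates being ISO-formatted strings, which sort correctly as plain text.

```python
import csv
import io


def find_resume_point(handle, year):
    """Return (date, is_new): the oldest chart date already saved, or a
    late-December starting date when the file has no rows yet."""
    dates = [row["date"] for row in csv.DictReader(handle)]
    if not dates:
        return year + "-12-25", True
    return min(dates), False


header = "date,title,artist,current,previous,peak,weeks,new\n"

# A file with only a header, and one with two weeks already saved (made-up rows).
empty_file = io.StringIO(header)
partial_file = io.StringIO(
    header
    + '2020-12-26,"Title A","Artist A",1,2,1,3,False\n'
    + '2020-12-19,"Title B","Artist B",1,1,1,4,False\n'
)

new_start = find_resume_point(empty_file, "2020")
resume = find_resume_point(partial_file, "2020")
print(new_start, resume)
```

When the second value is `False`, the loop would fetch the chart for the returned date and then move to its `previousDate`, since that week is already in the file.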

Beyond `output_year` above, there are two settings to help control rate limiting:

- counter: How many loops it will do before stopping.
- timer_interval: How long to wait before getting the next chart.

In [15]:
# set the counter
counter = 53

# set the time interval
timer_interval = 10

# read in file
chart_file = pd.read_csv(outfilename)

# find the oldest week in output
oldest_date = chart_file.date.min()

# if oldest_date isnull, then use begin_chart date
if pd.isnull(oldest_date):
    begin_chart_date = output_year + "-12-25"
    chart = billboard.ChartData(chart_type, date=begin_chart_date)
    print("Starting new year")
    print("Beginning date: " + chart.date)
# else, use next previous date
else:
    chart = billboard.ChartData(chart_type, date=oldest_date)
    chart = billboard.ChartData(chart_type, str(chart.previousDate))
    print("Picking up after: " + oldest_date)
    print("Beginning date: " + chart.date)

with open(outfilename, 'a') as outputfile:
    start_time = time.time()
    for i in range(1, counter + 1):
        # write one row per chart position
        for position in range(0, 200):
            song = chart[position]
            # NOTE: quotes inside titles or artists are not escaped here
            line_out = str(chart.date) + ',' + '"' + song.title + '"' + ',' \
            + '"' + song.artist + '"' + ','  + str(song.rank) + ',' \
            + str(song.lastPos) + ',' + str(song.peakPos) + ',' \
            + str(song.weeks) + ',' + str(song.isNew) + '\n'
            outputfile.write(line_out)
        print(chart.date + ": " + str(chart[0]))
        # move to the next oldest chart
        chart = billboard.ChartData(chart_type, str(chart.previousDate))
        # check if year is over
        if chart.date[:4] != output_year:
            print("Year is over")
            break
        else:
            time.sleep(timer_interval)
    print('done')

Starting new year
Beginning date: 2020-12-26
2020-12-26: 'Evermore' by Taylor Swift
2020-12-19: 'Wonder' by Shawn Mendes
2020-12-12: 'El Ultimo Tour del Mundo' by Bad Bunny
2020-12-05: 'BE' by BTS
2020-11-28: 'Power Up' by AC/DC
2020-11-21: 'Positions' by Ariana Grande
2020-11-14: 'Positions' by Ariana Grande
2020-11-07: 'What You See Is What You Get' by Luke Combs
2020-10-31: 'Folklore' by Taylor Swift
2020-10-24: 'Shoot For The Stars Aim For The Moon' by Pop Smoke
2020-10-17: 'Savage Mode II' by 21 Savage & Metro Boomin
2020-10-10: 'Tickets To My Downfall' by Machine Gun Kelly
2020-10-03: 'Folklore' by Taylor Swift
2020-09-26: 'Top' by YoungBoy Never Broke Again
2020-09-19: 'Detroit 2' by Big Sean
2020-09-12: 'Folklore' by Taylor Swift
2020-09-05: 'Folklore' by Taylor Swift
2020-08-29: 'Folklore' by Taylor Swift
2020-08-22: 'Folklore' by Taylor Swift
2020-08-15: 'Folklore' by Taylor Swift
2020-08-08: 'Folklore' by Taylor Swift
2020-08-01: 'Legends Never Die' by Juice WRLD
2020-07-25:

## Some testing

This checks the length of the last file processed. It should be 10,400 rows (52 charts of 200 positions) or 10,600 rows (53 charts) in a year that happens to contain 53 chart dates.
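
The expected count can be worked out ahead of time. This sketch assumes charts are dated on Saturdays, as the 2020 dates in the output above are, and that every chart has exactly 200 rows; `count_saturdays` and `expected_rows` are illustrative helpers, not part of the notebook.

```python
from datetime import date, timedelta


def count_saturdays(year):
    """Count the Saturdays that fall within a calendar year."""
    day = date(year, 1, 1)
    # Move forward to the first Saturday (weekday() == 5).
    day += timedelta(days=(5 - day.weekday()) % 7)
    count = 0
    while day.year == year:
        count += 1
        day += timedelta(days=7)
    return count


def expected_rows(year, positions=200):
    """Rows expected in a full year's file, assuming one chart per Saturday."""
    return count_saturdays(year) * positions


print(expected_rows(2020))
print(expected_rows(2022))
```

Under those assumptions, a year has 53 chart dates only when it starts on a Saturday, or when it is a leap year that starts on a Friday.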

In [16]:
# read in the file
chart_peek = pd.read_csv(outfilename)

# check the length
len(chart_peek)

ParserError: Error tokenizing data. C error: Expected 8 fields in line 8459, saw 9
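
To see which lines trip the parser, one option is to count fields per row with the standard `csv` module. This is only a diagnostic sketch: `find_bad_rows` is a hypothetical helper, and the sample line is invented to mimic what hand-concatenated quoting produces when a title contains both quotes and a comma.

```python
import csv
import io


def find_bad_rows(handle, expected_fields=8):
    """Return (line_number, field_count) for rows with the wrong number of fields."""
    bad = []
    reader = csv.reader(handle)
    for row in reader:
        if len(row) != expected_fields:
            bad.append((reader.line_num, len(row)))
    return bad


# Made-up sample: the last line wraps a title containing quotes and a comma
# in plain double quotes, the way the string concatenation in this notebook does.
sample = io.StringIO(
    "date,title,artist,current,previous,peak,weeks,new\n"
    '2020-01-04,"Plain Title","Artist A",1,2,1,5,False\n'
    '2020-01-04,"Say "Hi", Again","Artist B",2,1,1,9,False\n'
)

bad_rows = find_bad_rows(sample)
print(bad_rows)
```

Running something like this against the real file (opened with `newline=''`) should point at the same kind of line pandas complains about, though it only locates the damaged rows and does not repair them.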
