## Build Hot 100 archive

The idea of this notebook is to build an archive of the Hot 100, from the current date back to the oldest chart date, __1958-08-02__.

However, there is a rate limit on requests to the Billboard site. I've had it time out after 10 requests, but I've also had it time out after a single request when I had run other requests recently.

For each chart, we have `chart.previousDate` to work with, which allows us to walk back in time. The loop works like this:

- Open our file
- Check for the oldest date
- Loop through a number of charts and write them
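
The steps above can be sketched as a small generator. This is a minimal sketch, not the notebook's actual loop: `walk_back` and its `fetch` parameter are hypothetical names, and `fetch(date_string)` is assumed to return an object with `date` and `previousDate` attributes, as `billboard.ChartData` does in the cells below. Passing the fetch function in keeps the sketch runnable without touching the network.

```python
from types import SimpleNamespace


def walk_back(fetch, start_date, limit):
    """Yield up to `limit` charts, each one older than the last.

    `fetch` is a hypothetical callable: it takes a date string and returns
    a chart-like object with `date` and `previousDate` attributes.
    The chart for `start_date` itself is not yielded, since it is assumed
    to be in the file already.
    """
    chart = fetch(start_date)
    for _ in range(limit):
        # Stop when there is no earlier chart to step back to.
        if not chart.previousDate:
            return
        chart = fetch(str(chart.previousDate))
        yield chart


# Stand-in for the real request: three fake weekly charts,
# the oldest of which has no predecessor.
FAKE_CHARTS = {
    "2012-02-11": "2012-02-04",
    "2012-02-04": "2012-01-28",
    "2012-01-28": None,
}


def fake_fetch(chart_date):
    return SimpleNamespace(date=chart_date, previousDate=FAKE_CHARTS[chart_date])


dates = [chart.date for chart in walk_back(fake_fetch, "2012-02-11", limit=5)]
print(dates)
```

With the real library, `fetch` could be something like `lambda d: billboard.ChartData(chart_type, date=d)`, though the generator above does nothing about the rate limit on its own.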

What I'd like to do is put this on a timer to avoid the rate-limit error, or restart automatically after the error occurs.
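
One possible way to restart after the error is a retry wrapper that waits longer after each failure. This is a sketch under assumptions, not something the notebook currently does: `fetch_with_retry`, its parameters, and the delay values are all illustrative, and because the source does not say which exception the rate limit raises, the sketch catches `Exception` broadly. The `sleep` parameter exists only so the demonstration can run without actually waiting.

```python
import time


def fetch_with_retry(fetch, chart_date, attempts=4, base_delay=30, sleep=time.sleep):
    """Call `fetch(chart_date)`, waiting longer after each failure.

    All names and defaults here are illustrative. `fetch` is any callable
    that raises an exception when the request fails or times out.
    """
    for attempt in range(attempts):
        try:
            return fetch(chart_date)
        except Exception as error:
            # Give up after the last attempt and let the caller see the error.
            if attempt == attempts - 1:
                raise
            delay = base_delay * 2 ** attempt  # 30s, 60s, 120s, ...
            print(f"Request failed ({error}); retrying in {delay}s")
            sleep(delay)


# Demonstration with a stand-in that fails twice and then succeeds.
calls = []
waits = []


def flaky_fetch(chart_date):
    calls.append(chart_date)
    if len(calls) < 3:
        raise TimeoutError("simulated timeout")
    return f"chart for {chart_date}"


# Record the delays instead of sleeping so the demo finishes immediately.
result = fetch_with_retry(flaky_fetch, "2012-02-04", sleep=waits.append)
print(result, waits)
```

In the notebook, `fetch` would wrap the `billboard.ChartData` call. Whether 30 seconds is a long enough first delay for the Billboard site is untested.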


In [10]:
import billboard
from datetime import datetime, timedelta, date
import os
import pandas as pd
import time

## Settings

In [16]:
outfilename = '../data/hot-100.csv'
chart_type = 'hot-100'
headers = 'date,title,artist,current,previous,peak,weeks\n'

## Create the file

In [3]:
file_exists = os.path.exists(outfilename)

# write the header if the file does not exist yet
if not file_exists:
    with open(outfilename, 'a') as outputfile:
        outputfile.write(headers)
    print("File created with header")
# write the header if the file exists but is empty
else:
    file_empty = os.stat(outfilename).st_size == 0
    if file_empty:
        with open(outfilename, 'a') as outputfile:
            outputfile.write(headers)
        print("Added header")
    else:
        print("File has data")

File has data


## Chart loop

Each run grabs `counter` weekly charts (currently 6; a value of 52 would grab about one year at a time). There is a pause of 10 seconds after each request to avoid the rate limit on scraping. To change the number of weeks grabbed in the loop, update the `counter` variable.

In [38]:
# read in file
top_100 = pd.read_csv(outfilename)

# number of weekly charts to fetch in this run
counter = 6

# find the oldest week in the output
oldest_date = top_100.date.min()
print("Oldest date in file: " + oldest_date)

# fetch the oldest chart already saved, then step back to the week before it
chart_date = oldest_date
chart = billboard.ChartData(chart_type, date=chart_date)
chart = billboard.ChartData(chart_type, date=str(chart.previousDate))

with open(outfilename, 'a') as outputfile:
    for i in range(counter):
        for song in chart.entries:
            # double any embedded quotes so the quoted fields stay valid CSV
            title = song.title.replace('"', '""')
            artist = song.artist.replace('"', '""')
            line_out = (
                f'{chart.date},"{title}","{artist}",{song.rank},'
                f'{song.lastPos},{song.peakPos},{song.weeks}\n'
            )
            outputfile.write(line_out)
        print(chart.date + ": " + str(chart[0]))
        # step back one week, then pause to stay under the rate limit
        chart = billboard.ChartData(chart_type, date=str(chart.previousDate))
        time.sleep(10)
    print('done')

Oldest date in file: 2012-02-11
2012-02-04: 'Set Fire To The Rain' by Adele
2012-01-28: 'We Found Love' by Rihanna Featuring Calvin Harris
2012-01-21: 'We Found Love' by Rihanna Featuring Calvin Harris
2012-01-14: 'Sexy And I Know It' by LMFAO
2012-01-07: 'Sexy And I Know It' by LMFAO
2011-12-31: 'We Found Love' by Rihanna Featuring Calvin Harris
done
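
With a fixed 10-second pause per request, it is easy to estimate how much of the walk back to 1958-08-02 is left. The sketch below assumes one chart every 7 days for the whole period, which the source does not confirm; `weeks_remaining` and `estimated_hours` are hypothetical helper names, and the estimate counts only the pauses, not the requests themselves or any rate-limit timeouts.

```python
from datetime import date

FIRST_CHART = date(1958, 8, 2)  # oldest Hot 100 date, from the introduction


def weeks_remaining(oldest_in_file, first_chart=FIRST_CHART):
    """Number of weekly charts still to fetch, assuming one chart every 7 days."""
    return max((oldest_in_file - first_chart).days // 7, 0)


def estimated_hours(weeks, pause_seconds=10):
    """Lower bound on run time: counts only the pause, not the request itself."""
    return weeks * pause_seconds / 3600


# Oldest date written by the run above.
weeks = weeks_remaining(date.fromisoformat("2011-12-31"))
print(f"{weeks} charts left, at least {estimated_hours(weeks):.1f} hours of pauses")
```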


## Some testing

This counts the rows for each chart date and shows the 10 oldest weeks in the file, to make sure there are 100 entries for each week.

In [39]:
chart_peek = pd.read_csv(outfilename)

grouped = chart_peek.groupby(['date']).count().sort_values('date', ascending=False)

grouped.tail(10)

| date | title | artist | current | previous | peak | weeks |
|---|---|---|---|---|---|---|
| 2012-03-03 | 100 | 100 | 100 | 100 | 100 | 100 |
| 2012-02-25 | 100 | 100 | 100 | 100 | 100 | 100 |
| 2012-02-18 | 100 | 100 | 100 | 100 | 100 | 100 |
| 2012-02-11 | 100 | 100 | 100 | 100 | 100 | 100 |
| 2012-02-04 | 100 | 100 | 100 | 100 | 100 | 100 |
| 2012-01-28 | 100 | 100 | 100 | 100 | 100 | 100 |
| 2012-01-21 | 100 | 100 | 100 | 100 | 100 | 100 |
| 2012-01-14 | 100 | 100 | 100 | 100 | 100 | 100 |
| 2012-01-07 | 100 | 100 | 100 | 100 | 100 | 100 |
| 2011-12-31 | 100 | 100 | 100 | 100 | 100 | 100 |
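
Reading the table by eye only covers the last 10 weeks. A check over the whole file could instead list just the weeks that do not have the expected number of rows. The sketch below uses only the standard library and is an illustration, not part of the notebook: `incomplete_weeks` is a hypothetical helper, and the sample rows are made up to keep the demonstration short.

```python
import csv
import io
from collections import Counter


def incomplete_weeks(csv_file, expected=100):
    """Return {date: row_count} for every chart date without `expected` rows."""
    counts = Counter(row["date"] for row in csv.DictReader(csv_file))
    return {day: n for day, n in counts.items() if n != expected}


# Tiny in-memory sample with the same header as the archive.
# The rows are invented, and expected=2 keeps the sample short.
sample = io.StringIO(
    "date,title,artist,current,previous,peak,weeks\n"
    '2012-01-07,"Song A","Artist A",1,1,1,10\n'
    '2012-01-07,"Song B","Artist B",2,3,2,8\n'
    '2011-12-31,"Song A","Artist A",1,2,1,9\n'
)
problems = incomplete_weeks(sample, expected=2)
print(problems)
```

Against the real archive, the call would be along the lines of `with open(outfilename, newline='') as f: incomplete_weeks(f)`, where an empty dictionary means every week has 100 rows.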
