## Build Hot 100 archive

The idea of this notebook is to build an archive of the Hot 100, walking back from the current date to the oldest chart date, __1958-08-02__.

However, there is a rate limit on requests to the billboard site. I've had it time out after 10 requests, but I've also had it time out after one if I've run other requests recently. 

For each chart, we have `chart.previousDate` to work with, which allows us to walk back in time. The loop works like this:

- Open our file
- Check for the oldest date already saved
- Find the next oldest chart
- Start a loop with a counter and write the results of that week's chart
- Set the chart date to the next oldest date
- Wait a time interval
- Check the counter and go again or stop

This doesn't completely solve the rate limit, but it does pretty well with a pause of 10 seconds per week.
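
The steps above can be sketched as a small function. This is only a minimal sketch, not the notebook's actual loop: `walk_back`, `fetch`, and `fake_fetch` are hypothetical names, and `fetch` stands in for a call like `billboard.ChartData(chart_type, date=...)` so the example runs without touching the site. It assumes each chart object exposes `date` and `previousDate`.

```python
import time
from types import SimpleNamespace


def walk_back(fetch, start_date, weeks, pause=0.0):
    """Collect up to `weeks` charts, starting at start_date and moving back in time.

    `fetch` is a hypothetical callable that takes a date string and returns
    an object with `date` and `previousDate` attributes.
    """
    charts = []
    chart_date = start_date
    for _ in range(weeks):
        chart = fetch(chart_date)
        charts.append(chart)
        chart_date = chart.previousDate
        if not chart_date:
            # No older chart to walk to, so stop early.
            break
        # Wait between requests to stay under the rate limit.
        time.sleep(pause)
    return charts


# Fake data standing in for the real site: each date maps to the previous one.
FAKE_PREVIOUS = {
    "2019-03-30": "2019-03-23",
    "2019-03-23": "2019-03-16",
    "2019-03-16": None,
}


def fake_fetch(chart_date):
    return SimpleNamespace(date=chart_date, previousDate=FAKE_PREVIOUS[chart_date])


dates = [c.date for c in walk_back(fake_fetch, "2019-03-30", weeks=5)]
print(dates)
```

Passing the fetch function in as an argument keeps the walking logic separate from the network call, which makes it easy to try out the loop without using up requests.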


In [1]:
import billboard
from datetime import datetime, timedelta, date
import os
import pandas as pd
import time

## Settings

In [12]:
outfilename = '../data/hot-100-new.csv'
chart_type = 'hot-100'
headers = 'date,title,artist,current,previous,peak,weeks\n'

## Create the file

In [3]:
file_exists = os.path.exists(outfilename)

# if the file does not exist, create it and write the header
if not file_exists:
    with open(outfilename, 'a') as outputfile:
        outputfile.write(headers)
        print("File created with header")
# if the file exists but is empty, write the header
else:
    file_empty = os.stat(outfilename).st_size == 0
    if file_empty:
        with open(outfilename, 'a') as outputfile:
            outputfile.write(headers)
            print("Added header")
    else:
        print("File has data")

File has data
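
The same check can be folded into one helper. This is a sketch under assumptions: `ensure_header` is a hypothetical name, not something the notebook defines, and the demo writes to a throwaway temporary file rather than the real archive.

```python
import os
import tempfile


def ensure_header(path, header):
    """Write `header` to `path` if the file is missing or empty.

    Returns True when the header was written, False when the file already had data.
    """
    if os.path.exists(path) and os.path.getsize(path) > 0:
        return False
    with open(path, "a") as outputfile:
        outputfile.write(header)
    return True


# Try it on a throwaway file rather than the real archive.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.csv")
demo_header = "date,title,artist,current,previous,peak,weeks\n"

first = ensure_header(demo_path, demo_header)   # file missing, so the header is written
second = ensure_header(demo_path, demo_header)  # file now has data, so nothing is written
print(first, second)
```

Because the missing-file and empty-file cases both end with the same write, treating them as one condition avoids repeating the `open` block.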


## Chart loop

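The number of weeks grabbed per run is set by the `counter` variable, and the pause after each request, which is there to avoid the rate limit on scraping, by `timer_interval`. The cell below grabs 5 weeks with a 5-second pause; to grab roughly one year at a time, set `counter` to 52, and raise `timer_interval` to 10 if requests start timing out.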

In [13]:
# read in the existing archive
top_100 = pd.read_csv(outfilename)

# set the number of weeks to fetch in this run
counter = 5

# set the time interval (seconds) to wait after each request
timer_interval = 5

# find the oldest week already in the output
oldest_date = top_100.date.min()
print(oldest_date)

# fetch the oldest saved chart, then step back one week
# so we start with the first chart that is not yet in the file
chart_date = oldest_date
chart = billboard.ChartData(chart_type, date=chart_date)
chart = billboard.ChartData(chart_type, str(chart.previousDate))

# open the file once and append every week fetched in this run
with open(outfilename, 'a') as outputfile:
    for i in range(1, counter + 1):
        for position in range(0, 100):
            song = chart[position]
            # double any quotes inside the title/artist so the CSV stays valid
            title = song.title.replace('"', '""')
            artist = song.artist.replace('"', '""')
            line_out = str(chart.date) + ',"' + title + '","' + artist + '",' \
                + str(song.rank) + ',' + str(song.lastPos) + ',' \
                + str(song.peakPos) + ',' + str(song.weeks) + '\n'
            outputfile.write(line_out)
        print(chart.date + ": " + str(chart[0]))
        # move to the next oldest chart, then pause before the next request
        chart = billboard.ChartData(chart_type, str(chart.previousDate))
        time.sleep(timer_interval)
    print('done')

2019-04-06
2019-03-30: '7 Rings' by Ariana Grande
2019-03-23: '7 Rings' by Ariana Grande
2019-03-16: 'Sucker' by Jonas Brothers
2019-03-09: 'Shallow' by Lady Gaga & Bradley Cooper
2019-03-02: '7 Rings' by Ariana Grande
done
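
The loop builds each CSV line by string concatenation. As an alternative, Python's standard `csv` module can handle the quoting, which matters when a title contains a comma or a double quote. The sketch below is illustrative only: `chart_rows` is a hypothetical helper, the songs are made-up dictionaries rather than real chart entries, and the output goes to an in-memory buffer instead of the archive file.

```python
import csv
import io


def chart_rows(chart_date, songs):
    """Yield one CSV row per song, in the archive's column order."""
    for song in songs:
        yield [chart_date, song["title"], song["artist"], song["rank"],
               song["lastPos"], song["peakPos"], song["weeks"]]


# Made-up entries for illustration; the second title has a comma and double quotes.
songs = [
    {"title": "Plain Title", "artist": "Some Artist",
     "rank": 1, "lastPos": 2, "peakPos": 1, "weeks": 3},
    {"title": 'The "Quoted", Title', "artist": "Another Artist",
     "rank": 2, "lastPos": 1, "peakPos": 1, "weeks": 7},
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(chart_rows("2019-03-30", songs))

# Reading the text back recovers the original fields, quotes and comma included.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
print(parsed[1][1])
```

Note that `csv.writer` only quotes fields that need it by default, so rows written this way would look slightly different from the always-quoted titles and artists in the existing file, although `pd.read_csv` reads both forms.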


## Some testing

This counts the rows for each week and shows the 10 oldest weeks in the file, to make sure there are 100 entries for each.

In [None]:
chart_peek = pd.read_csv(outfilename)

grouped = chart_peek.groupby(['date']).count().sort_values('date', ascending=False)

grouped.tail(10)
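
To check every week rather than just the last ten, the counts can be filtered for anything that is not 100. The sketch below uses only the standard library so it runs on its own; `incomplete_weeks` is a hypothetical helper name, and the sample data is made up (with an expected count of 2 to keep it short).

```python
import csv
import io
from collections import Counter


def incomplete_weeks(csv_file, expected=100):
    """Return {date: row_count} for every week whose row count differs from `expected`."""
    counts = Counter(row["date"] for row in csv.DictReader(csv_file))
    return {week: n for week, n in counts.items() if n != expected}


# Tiny made-up sample: expect 2 rows per week, and one week is short.
sample = io.StringIO(
    "date,title,artist,current,previous,peak,weeks\n"
    '2019-03-30,"A","X",1,1,1,5\n'
    '2019-03-30,"B","Y",2,3,2,4\n'
    '2019-03-23,"A","X",1,2,1,4\n'
)
short = incomplete_weeks(sample, expected=2)
print(short)
```

Against the real archive, the same function could be called as `incomplete_weeks(open(outfilename, newline=''))`; an empty result would mean every week has the expected number of rows.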