A GitHub Actions-based web scraper that publishes a CSV file of US domestic box office statistics each night. Downloads are available from the GitHub releases tab.

tjwaterman99/boxofficemojo-scraper

US Domestic box office film revenues

This project publishes a daily export of box office revenues scraped from Box Office Mojo. Each daily export contains all revenue data from January 1st, 2000 up to the current day.

Data published for a specific day is available under the releases tab.

To download the latest version of the raw dataset, use the following URL:

https://github.com/tjwaterman99/boxofficemojo-scraper/releases/latest/download/revenues_per_day.csv.gz

For example:

import pandas as pd

url = 'https://github.com/tjwaterman99/boxofficemojo-scraper/releases/latest/download/revenues_per_day.csv.gz'
df = pd.read_csv(url, parse_dates=['date'], index_col='id')
df.head()
| id | date | title | revenue | theaters | distributor |
| --- | --- | --- | --- | --- | --- |
| 362a6861-2040-4257-b414-b932f5c69f10 | 2018-03-08 00:00:00 | Black Panther | 4251525 | 4084 | Walt Disney Studios Motion Pictures |
| 25320541-0e30-e62b-2573-284863c73e4a | 2018-03-08 00:00:00 | Red Sparrow | 1270235 | 3056 | Twentieth Century Fox |
| 08f98020-cf73-de6b-4803-2213649f9ea0 | 2018-03-08 00:00:00 | Game Night | 931272 | 3502 | Warner Bros. |
| 4a9c0497-0a38-540f-30b2-a06d16dfa784 | 2018-03-08 00:00:00 | Death Wish | 860755 | 2847 | Metro-Goldwyn-Mayer (MGM) |
| e7986901-67fc-537d-9407-c3fc4c7a2faf | 2018-03-08 00:00:00 | Peter Rabbit | 620538 | 3607 | Sony Pictures Entertainment (SPE) |
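Once loaded, the export works like any tidy DataFrame. The sketch below aggregates total revenue per day using a small hand-built frame that mirrors the export's columns (the third row's figures are made up for illustration; the real data comes from the URL above):

```python
import pandas as pd

# A tiny stand-in for the real export, with the same columns.
# The 2018-03-09 row is illustrative, not a real Box Office Mojo figure.
df = pd.DataFrame({
    'date': pd.to_datetime(['2018-03-08', '2018-03-08', '2018-03-09']),
    'title': ['Black Panther', 'Red Sparrow', 'Black Panther'],
    'revenue': [4251525, 1270235, 3900000],
    'theaters': [4084, 3056, 4084],
})

# Total domestic revenue per day across all films
daily_total = df.groupby('date')['revenue'].sum()
print(daily_total.loc['2018-03-08'])  # 5521760
```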

Development

Development requires Python 3.6+ and access to a Postgres database.

Create a virtual environment.

virtualenv venv --python=python3

Install the requirements.

pip install -r requirements.txt

Set the PG environment variables. These are used by dbt during the build steps.

export PGHOST=127.0.0.1
export PGPORT=5432
export PGUSER=postgres
export PGPASSWORD=postgres
export PGDATABASE=postgres
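As a quick sanity check, you can assemble a connection string from those variables in Python. This is purely illustrative; dbt and psql read the PG* variables directly, and the defaults below simply mirror the exports above:

```python
import os

# Read the PG* variables, falling back to the values exported above.
# Illustrative only -- dbt and psql pick these up on their own.
host = os.environ.get('PGHOST', '127.0.0.1')
port = os.environ.get('PGPORT', '5432')
user = os.environ.get('PGUSER', 'postgres')
database = os.environ.get('PGDATABASE', 'postgres')

dsn = f'postgresql://{user}@{host}:{port}/{database}'
print(dsn)
```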

Create the schema on the postgres database.

psql -c "create schema raw;"
psql -f schema.sql

Load the current data.

psql -c "\copy raw.boxofficemojo_revenues from $PWD/parsed.json"

Build the dbt models.

dbt run --project-dir $PWD/dbt --profiles-dir $PWD/.dbt

Rebuilding

Parsed data is saved to the parsed.json file each day by a GitHub Actions workflow. To rebuild the project with new data, fetch the most recent commits on the main branch.

git pull

Then rebuild the raw schema, reinsert the parsed.json data, and rebuild and test the dbt models.

psql -f schema.sql
psql -c "\copy raw.boxofficemojo_revenues from $PWD/parsed.json"
dbt run --project-dir $PWD/dbt --profiles-dir $PWD/.dbt
dbt test --project-dir $PWD/dbt --profiles-dir $PWD/.dbt

To rebuild the parsed.json file from scratch, use the parser.py script.

python parser.py parse-all > parsed.json
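The exact shape of parsed.json is defined by parser.py, but assuming it is newline-delimited JSON (one document per line, a common \copy-friendly layout), records can be serialized like this sketch. The record fields here are hypothetical stand-ins mirroring the dataset's columns:

```python
import json

# Hypothetical records mirroring the dataset's columns; the real
# values are produced by parser.py from Box Office Mojo pages.
records = [
    {'date': '2018-03-08', 'title': 'Black Panther',
     'revenue': 4251525, 'theaters': 4084},
]

# One JSON document per line (newline-delimited JSON).
lines = [json.dumps(r) for r in records]
print(lines[0])
```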
