A GitHub Actions-based web scraper that publishes a CSV file of US domestic box office statistics each night. Downloads are available from the GitHub releases tab.

tjwaterman99/boxofficemojo-scraper

US Domestic box office film revenues

This project publishes a daily export of box office revenues scraped from Box Office Mojo. Each daily export contains all revenue data from January 1st, 2000 up to the current day.

Data published for a specific day is available under the releases tab.

To download the latest version of the raw dataset, use the following URL:

https://github.com/tjwaterman99/boxofficemojo-scraper/releases/latest/download/revenues_per_day.csv.gz

For example:

import pandas as pd

url = 'https://github.com/tjwaterman99/boxofficemojo-scraper/releases/latest/download/revenues_per_day.csv.gz'
df = pd.read_csv(url, parse_dates=['date'], index_col='id')
df.head()
| id | date | title | revenue | theaters | distributor |
| --- | --- | --- | --- | --- | --- |
| 362a6861-2040-4257-b414-b932f5c69f10 | 2018-03-08 00:00:00 | Black Panther | 4251525 | 4084 | Walt Disney Studios Motion Pictures |
| 25320541-0e30-e62b-2573-284863c73e4a | 2018-03-08 00:00:00 | Red Sparrow | 1270235 | 3056 | Twentieth Century Fox |
| 08f98020-cf73-de6b-4803-2213649f9ea0 | 2018-03-08 00:00:00 | Game Night | 931272 | 3502 | Warner Bros. |
| 4a9c0497-0a38-540f-30b2-a06d16dfa784 | 2018-03-08 00:00:00 | Death Wish | 860755 | 2847 | Metro-Goldwyn-Mayer (MGM) |
| e7986901-67fc-537d-9407-c3fc4c7a2faf | 2018-03-08 00:00:00 | Peter Rabbit | 620538 | 3607 | Sony Pictures Entertainment (SPE) |
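Once loaded, the export works like any tidy DataFrame. The sketch below aggregates total revenue per day using a small hand-built frame that mirrors the export's columns (the third row's figures are made up for illustration; the real data comes from the URL above):

```python
import pandas as pd

# A tiny stand-in for the real export, with the same columns.
# The 2018-03-09 row is illustrative, not a real Box Office Mojo figure.
df = pd.DataFrame({
    'date': pd.to_datetime(['2018-03-08', '2018-03-08', '2018-03-09']),
    'title': ['Black Panther', 'Red Sparrow', 'Black Panther'],
    'revenue': [4251525, 1270235, 3900000],
    'theaters': [4084, 3056, 4084],
})

# Total domestic revenue per day across all films
daily_total = df.groupby('date')['revenue'].sum()
print(daily_total.loc['2018-03-08'])  # 5521760
```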

Development

Development requires Python 3.6+ and access to a Postgres database.

Create a virtual environment.

virtualenv venv --python=python3

Install the requirements.

pip install -r requirements.txt

Set the PG environment variables. These are used by dbt during the build steps.

export PGHOST=127.0.0.1
export PGPORT=5432
export PGUSER=postgres
export PGPASSWORD=postgres
export PGDATABASE=postgres
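As a quick sanity check, you can assemble a connection string from those variables in Python. This is purely illustrative; dbt and psql read the PG* variables directly, and the defaults below simply mirror the exports above:

```python
import os

# Read the PG* variables, falling back to the values exported above.
# Illustrative only -- dbt and psql pick these up on their own.
host = os.environ.get('PGHOST', '127.0.0.1')
port = os.environ.get('PGPORT', '5432')
user = os.environ.get('PGUSER', 'postgres')
database = os.environ.get('PGDATABASE', 'postgres')

dsn = f'postgresql://{user}@{host}:{port}/{database}'
print(dsn)
```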

Create the schema on the postgres database.

psql -c "create schema raw;"
psql -f schema.sql

Load the current data.

psql -c "\copy raw.boxofficemojo_revenues from $PWD/parsed.json"

Build the dbt models.

dbt run --project-dir $PWD/dbt --profiles-dir $PWD/.dbt

Rebuilding

Parsed data is saved to the parsed.json file each day by a GitHub Actions workflow. To rebuild the project with new data, fetch the most recent commits on the main branch.

git pull

Then rebuild the raw schema, reinsert the parsed.json data, and rebuild and test the dbt models.

psql -f schema.sql
psql -c "\copy raw.boxofficemojo_revenues from $PWD/parsed.json"
dbt run --project-dir $PWD/dbt --profiles-dir $PWD/.dbt
dbt test --project-dir $PWD/dbt --profiles-dir $PWD/.dbt

To rebuild the parsed.json file from scratch, use the parser.py script.

python parser.py parse-all > parsed.json
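The exact shape of parsed.json is defined by parser.py, but assuming it is newline-delimited JSON (one document per line, a common \copy-friendly layout), records can be serialized like this sketch. The record fields here are hypothetical stand-ins mirroring the dataset's columns:

```python
import json

# Hypothetical records mirroring the dataset's columns; the real
# values are produced by parser.py from Box Office Mojo pages.
records = [
    {'date': '2018-03-08', 'title': 'Black Panther',
     'revenue': 4251525, 'theaters': 4084},
]

# One JSON document per line (newline-delimited JSON).
lines = [json.dumps(r) for r in records]
print(lines[0])
```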
