Newspaper front pages
This project is about programmatically analyzing the front pages of newspapers.
It consists of two sections:
- The code to download the current day's newspapers from the Newseum website, and parse out text, bounding boxes, font sizes, and font faces.
- The analyses (mostly Jupyter notebooks) of the resulting data. See the
analysis/ subdirectory for more information.
For desired contributions, see
Requirements: Python 3, node, bash, qpdf, jq
brew install qpdf
brew install jq
sudo apt-get update
sudo apt-get install qpdf
sudo apt-get install jq
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
psycopg2 is only necessary as a dependency if writing to postgres.
ipython is there because I use it for development.
./runDaily.sh <-- This downloads the front pages of today's newspapers into
data/[date]/, performs the extractions, and loads the results into a default postgres database. Note: you will need to run
createdb frontpages first to use the default settings.
What runDaily.sh is doing:
./download.sh <-- This downloads the front page of today's newspapers into
./decrypt.sh <-- This runs all the PDFs through a passwordless decrypt (automatically done by PDF viewers), and deletes the original PDFs.
./parse.sh <-- This extracts XML files from the decrypted PDFs, and saves them in the same data directory.
python parse_xml_to_db.py XML_FILES_DIR OUTPUT_DB OUTPUT_TABLE, where
XML_FILES_DIR is the data/[date] directory above. This aggregates the XML output from the earlier steps up to the PDF textbox level, and saves it into a database.
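To illustrate what aggregating to the textbox level can look like, here is a minimal sketch. It is not the code in parse_xml_to_db.py. It assumes a pdfminer-style XML layout (page > textbox > textline > text, one character per text element, with font and size attributes); the actual files written by parse.sh may differ. The function name aggregate_textboxes, the sample XML, and the output field names are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample in an ASSUMED pdfminer-style layout; the real
# output of parse.sh may use different element or attribute names.
SAMPLE_XML = """
<pages>
  <page id="1" bbox="0,0,612,792">
    <textbox id="0" bbox="50,700,400,740">
      <textline bbox="50,700,400,740">
        <text font="Times-Bold" size="36.0">H</text>
        <text font="Times-Bold" size="36.0">i</text>
      </textline>
    </textbox>
    <textbox id="1" bbox="50,600,300,620">
      <textline bbox="50,600,300,620">
        <text font="Helvetica" size="10.0">o</text>
        <text font="Helvetica" size="10.0">k</text>
      </textline>
    </textbox>
  </page>
</pages>
"""


def aggregate_textboxes(xml_string):
    """Collapse per-character elements into one row per textbox."""
    root = ET.fromstring(xml_string.strip())
    rows = []
    for page in root.iter("page"):
        for box in page.iter("textbox"):
            chars = [t for t in box.iter("text") if t.text]
            if not chars:
                continue  # skip empty boxes
            # Only characters that carry the attribute contribute to
            # the font statistics (spacing elements may lack them).
            sizes = [float(t.get("size")) for t in chars if t.get("size")]
            fonts = [t.get("font") for t in chars if t.get("font")]
            rows.append({
                "page": int(page.get("id")),
                "bbox": tuple(float(v) for v in box.get("bbox").split(",")),
                "text": "".join(t.text for t in chars),
                "avg_font_size": sum(sizes) / len(sizes) if sizes else None,
                # Most common font face among the box's characters.
                "font_face": Counter(fonts).most_common(1)[0][0] if fonts else None,
            })
    return rows


if __name__ == "__main__":
    for row in aggregate_textboxes(SAMPLE_XML):
        print(row)
```

Each resulting row carries the text, bounding box, font size, and font face for one textbox, which is the kind of record that could then be written to a database table.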
python ingest_newspaper_metadata.py METADATA_FILE DB_URI OUTPUT_TABLE, where
METADATA_FILE is data/[date]-metadata.json, a file output by
download.sh. This writes metadata about newspapers that haven't been seen yet to a database table.
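The "only newspapers that haven't been seen yet" behavior can be sketched as follows. This is an illustration, not the code in ingest_newspaper_metadata.py: it uses Python's built-in sqlite3 in place of postgres so that it runs without a database server, and the record fields (slug, title, city), the sample data, and the function name ingest_new_metadata are assumptions rather than the real metadata schema.

```python
import json
import sqlite3

# Hypothetical metadata records; the real data/[date]-metadata.json
# produced by download.sh may have a different structure.
SAMPLE_METADATA = """
[
  {"slug": "NY_NYT", "title": "The New York Times", "city": "New York"},
  {"slug": "WA_ST", "title": "The Seattle Times", "city": "Seattle"}
]
"""


def ingest_new_metadata(conn, table, records):
    """Insert only records whose slug is not already in the table.

    The table name is interpolated into the SQL, so it must come from
    a trusted source (here, the command line of the person running it).
    """
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table}" '
        "(slug TEXT PRIMARY KEY, title TEXT, city TEXT)"
    )
    seen = {row[0] for row in conn.execute(f'SELECT slug FROM "{table}"')}
    # Keying by slug also drops duplicates within the incoming batch.
    new = {r["slug"]: r for r in records if r["slug"] not in seen}
    conn.executemany(
        f'INSERT INTO "{table}" (slug, title, city) '
        "VALUES (:slug, :title, :city)",
        list(new.values()),
    )
    conn.commit()
    return len(new)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    records = json.loads(SAMPLE_METADATA)
    print(ingest_new_metadata(conn, "newspapers", records))  # first run
    print(ingest_new_metadata(conn, "newspapers", records))  # rerun
```

Because already-seen slugs are filtered out before the insert, rerunning the ingest on the same day's metadata file adds no rows.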
Inside the
analysis/ directory, there is a separate README explaining how to set up the notebook to play with the analyses.