Analysis of the front page of newspapers

Newspaper front pages

This project is about programmatically analyzing the front pages of newspapers.

Mentions of "Syria" by day across many newspapers

It consists of two sections:

  • The code to download the current day's newspapers from the Newseum website and parse out text, bounding boxes, font sizes, and font faces.
  • The analyses (mostly Jupyter notebooks) of the resulting data. See the analysis/ subdirectory for more information.
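
As a rough illustration of the kind of data the parser yields and the kind of question the analyses ask, here is a minimal sketch of one extracted text box and a per-day count like the "Syria" chart above. The `TextBox` fields and the `mentions_by_day` helper are hypothetical names chosen for this example, not the project's actual schema, and counting each newspaper at most once per day is an assumption about how such a chart might be built.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TextBox:
    """One text box from a front page (hypothetical field names)."""
    newspaper: str
    day: date
    text: str
    bbox: tuple  # assumed layout: (left, top, right, bottom) in page units
    font_size: float
    font_face: str


def mentions_by_day(boxes, term):
    """Count, per day, the newspapers with at least one box containing `term`."""
    term = term.lower()
    # Use a set so a paper that mentions the term in several boxes counts once.
    seen = {(b.day, b.newspaper) for b in boxes if term in b.text.lower()}
    return Counter(day for day, _ in seen)
```

Keeping the font size and bounding box on each record is what would let an analysis weight a mention by its prominence on the page rather than just counting it.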

For desired contributions, see :).

Crawler/parser installation:

Requirements: Python 3, node, bash, qpdf, jq

System dependencies:

macOS (with Homebrew):

  • qpdf: brew install qpdf
  • jq: brew install jq

Debian/Ubuntu (first run sudo apt-get update):

  • qpdf: sudo apt-get install qpdf
  • jq: sudo apt-get install jq

virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

psycopg2 is only needed if you are writing to Postgres. ipython is included because I use it for development.


npm install


./ <-- This downloads the front pages of today's newspapers into data/[date]/, performs the extractions, and loads the results into a default Postgres database. Note: you will need to run createdb frontpages to use the default settings.

Detailed usage:

(What the command above does:)

  1. ./ <-- This downloads the front pages of today's newspapers into data/[date]/.
  2. ./ <-- This runs all the PDFs through a passwordless decrypt (which PDF viewers do automatically) and deletes the original PDFs.
  3. ./ <-- This extracts XML files from the decrypted PDFs and saves them in the same data directory.
  4. python XML_FILES_DIR OUTPUT_DB OUTPUT_TABLE, where XML_FILES_DIR is the data/[date] directory above. This aggregates the XML output from the earlier steps up to the PDF textbox level and saves it into a database.
  5. python METADATA_FILE DB_URI OUTPUT_TABLE, where METADATA_FILE is data/[date]-metadata.json, a file output by an earlier step. This writes metadata about newspapers that haven't been seen yet to a database table.
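
To make step 2 more concrete, here is a minimal Python sketch of a passwordless decrypt over a directory of PDFs, assuming qpdf is on the PATH. The function names and the temporary file suffix are hypothetical and not taken from the project's scripts; the only qpdf feature relied on is its `--decrypt` option. The sketch replaces each file in place, which is an assumption; the project's own script may name its output files differently.

```python
import subprocess
from pathlib import Path


def decrypt_command(src, dst):
    """Build the qpdf invocation that writes a decrypted copy of `src` to `dst`."""
    return ["qpdf", "--decrypt", str(src), str(dst)]


def decrypt_all(data_dir, run=subprocess.run):
    """Decrypt every PDF in `data_dir`, replacing each original with its copy.

    `run` is injectable so the function can be exercised without qpdf installed.
    """
    decrypted = []
    # sorted() materializes the listing before any file is created or removed.
    for src in sorted(Path(data_dir).glob("*.pdf")):
        tmp = src.with_name(src.stem + ".decrypted.tmp")  # hypothetical temp name
        run(decrypt_command(src, tmp), check=True)  # raises on a non-zero exit
        src.unlink()     # delete the original, as step 2 describes
        tmp.rename(src)  # assumption: keep the original file name for later steps
        decrypted.append(src)
    return decrypted
```

Calling `decrypt_all("data/2017-04-07")` (an example directory name) would then leave only decrypted PDFs in that directory. Because `check=True` treats any non-zero exit status as a failure, a stricter or more lenient policy may be preferable in practice.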


Inside the analysis/ directory, there is a separate README explaining how to set up the notebook to play with the analyses.
