This repo has a variety of tools for collecting and inspecting data from the New York Times website (www.nytimes.com).
Prerequisites:
- A Postgres database with the correct schema (see Postgres database)
- Node 14.5+
- Yarn 1.22+
- An API key for the official New York Times API
$ git clone https://github.com/tomjcleveland/nyt-scraper.git
$ cd nyt-scraper
$ yarn
$ export PGUSER=nyt_app \
PGHOST=<HOSTNAME> \
PGPASSWORD=<PASSWORD> \
PGDATABASE=nyt \
PGPORT=5432 \
NYT_KEY=<NYT_API_KEY>
$ node crawl.js
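Before running the crawler, you can sanity-check that the PG* variables are being picked up. Here's a minimal sketch using the pg package (node-postgres), which the scripts presumably depend on; a bare new Pool() reads PGUSER, PGHOST, PGPASSWORD, PGDATABASE, and PGPORT from the environment automatically:

// check-db.js -- minimal connectivity check (a sketch, not part of the repo)
const { Pool } = require("pg");

const pool = new Pool(); // connection details come from the PG* env vars above
pool
  .query("SELECT NOW() AS now")
  .then((res) => console.log("Connected to Postgres at", res.rows[0].now))
  .catch((err) => console.error("Connection failed:", err))
  .finally(() => pool.end());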
Here's a high-level overview of what lives where:
├── README.md
├── lib <-- JS libraries
├── package.json <-- NodeJS package config
├── public <-- Static files for website
├── schema.sql <-- Postgres schema
├── scripts <-- Scraping scripts
├── server.js <-- Web server
└── views <-- HTML templates for website
There are three scripts for scraping data from the New York Times and saving it to a database.
- crawl.js (a rough sketch follows this list)
  - Visits www.nytimes.com
  - Collects all headlines on the homepage
  - Associates headlines with articles
  - Saves headline info to Postgres
- getPopularity.js
  - Uses the official NYT API to fetch article popularity data
  - Saves this data to Postgres
- backfill.js
  - Samples 50 articles from Postgres
  - Updates data about these articles, including:
    - A snapshot of the current article body
    - Updated article metadata (authors, tags, etc.)
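To give a rough idea of what the headline crawl involves, here is a minimal sketch. It is not the actual crawl.js implementation: the axios and cheerio packages, the CSS selector, and the headlines table and its columns are all assumptions for illustration.

// crawl-sketch.js -- illustrative only; the real crawl.js may work differently
const axios = require("axios");
const cheerio = require("cheerio");
const { Pool } = require("pg");

async function crawlHomepage() {
  // Fetch the homepage HTML and load it into cheerio for parsing
  const { data: html } = await axios.get("https://www.nytimes.com");
  const $ = cheerio.load(html);

  // Collect (title, url) pairs; the selector is illustrative and would need
  // to match the homepage's actual markup
  const headlines = [];
  $("section a h3").each((_, el) => {
    const title = $(el).text().trim();
    const url = $(el).closest("a").attr("href");
    if (title && url) headlines.push({ title, url });
  });

  // Persist to Postgres (the "headlines" table and columns are hypothetical)
  const pool = new Pool();
  for (const h of headlines) {
    await pool.query(
      "INSERT INTO headlines (title, url, retrievedat) VALUES ($1, $2, NOW())",
      [h.title, h.url]
    );
  }
  await pool.end();
}

crawlHomepage().catch(console.error);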
This repo also contains code for a simple website for viewing the article data.
Start the web server like this:
$ export PGUSER=nyt_app \
PGHOST=<HOSTNAME> \
PGPASSWORD=<PASSWORD> \
PGDATABASE=nyt \
PGPORT=5432 \
BASE_URL=<URL_OF_WEBSITE>
$ node server.js
The site should then be available at http://localhost:3000.
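For orientation, a stripped-down sketch of a server like this might look as follows. It assumes Express and the pg package; the /api/articles route and its query are illustrative, not taken from server.js (nyt.articlestats is the materialized view refreshed by the cron jobs below):

// server-sketch.js -- illustrative only
const express = require("express");
const { Pool } = require("pg");

const app = express();
const pool = new Pool(); // connection details come from the PG* env vars

// Serve static assets from ./public, matching the repo layout above
app.use(express.static("public"));

// Illustrative route: return a few rows from the articlestats materialized view
app.get("/api/articles", async (req, res) => {
  try {
    const { rows } = await pool.query("SELECT * FROM nyt.articlestats LIMIT 20");
    res.json(rows);
  } catch (err) {
    res.status(500).json({ error: err.message });
  }
});

app.listen(3000, () => console.log("Listening on http://localhost:3000"));

The real server.js presumably renders the HTML templates in views/ rather than returning JSON; the sketch just shows the data flow from Postgres to the browser.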
Cron jobs are commands that run on a schedule; cron comes pre-installed on most Unix-based systems.
This project uses several cron jobs.
The following jobs run the scraping scripts at fixed intervals, so that we can build up time-series data:
*/1 * * * * cd /home/ubuntu/nyt-scraper && NODE_ENV=production PGUSER=nyt_app PGHOST=<HOSTNAME> PGPASSWORD=<PASSWORD> PGDATABASE=nyt PGPORT=5432 NYT_KEY=<NYT_API_KEY> node crawl.js >> /var/log/nyt/cron.log 2>&1
35 * * * * cd /home/ubuntu/nyt-scraper && NODE_ENV=production PGUSER=nyt_app PGHOST=<HOSTNAME> PGPASSWORD=<PASSWORD> PGDATABASE=nyt PGPORT=5432 NYT_KEY=<NYT_API_KEY> node getPopularity.js >> /var/log/nyt/popularity.log 2>&1
49 */1 * * * cd /home/ubuntu/nyt-scraper && NODE_ENV=production PGUSER=nyt_app PGHOST=<HOSTNAME> PGPASSWORD=<PASSWORD> PGDATABASE=nyt PGPORT=5432 NYT_KEY=<NYT_API_KEY> node backfill.js >> /var/log/nyt/backfill.log 2>&1
And these cron jobs refresh cached data so that our web server can run more efficiently:
*/30 * * * * PGPASSWORD=<PASSWORD> psql -h <HOSTNAME> -U postgres -d nyt -c 'REFRESH MATERIALIZED VIEW CONCURRENTLY nyt.articlestats;' >> /var/log/nyt/psql.log 2>&1
7 */1 * * * PGPASSWORD=<PASSWORD> psql -h <HOSTNAME> -U postgres -d nyt -c 'SELECT refreshSearchIndex();' >> /var/log/nyt/search-index.log 2>&1
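If you'd rather trigger these refreshes from Node instead of psql, the equivalent calls with the pg package would be roughly:

// refresh-sketch.js -- runs the same statements as the psql cron jobs above.
// Note: the cron jobs run as the postgres admin user; the nyt_app role may
// not have the privileges needed for these statements.
const { Pool } = require("pg");

async function refreshCaches() {
  const pool = new Pool(); // uses the PG* env vars, as above
  await pool.query("REFRESH MATERIALIZED VIEW CONCURRENTLY nyt.articlestats;");
  await pool.query("SELECT refreshSearchIndex();");
  await pool.end();
}

refreshCaches().catch(console.error);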
All scraped data is written to a Postgres database. The schema for this database is defined in schema.sql.
Once you have a Postgres server running, you can apply the schema like this (assuming the admin user is postgres):
$ psql -h <HOSTNAME> -f schema.sql -U postgres
You'll also need to add a password for the nyt_app user, which your scripts will use to connect to the database:
$ psql -h <HOSTNAME> -U postgres -d nyt
=> ALTER ROLE nyt_app WITH LOGIN;
=> ALTER ROLE nyt_app WITH PASSWORD 'mysupersecretpassword';
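As a quick sanity check that the schema applied and the nyt_app credentials work, you can connect with the same PG* variables the scripts use and list the tables in the database (again a sketch, using the pg package):

// check-schema.js -- lists user tables; a sketch, not part of the repo
const { Pool } = require("pg");

const pool = new Pool(); // reads PGUSER=nyt_app, PGPASSWORD, etc. from the env
pool
  .query(
    "SELECT table_schema, table_name FROM information_schema.tables " +
      "WHERE table_schema NOT IN ('pg_catalog', 'information_schema') ORDER BY 1, 2"
  )
  .then((res) => console.table(res.rows))
  .catch(console.error)
  .finally(() => pool.end());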
