NLP in Journalism Workshop at PyDays




  1. Clone this repo with
  2. Create a virtualenv environment with virtualenv env.
  3. Activate it with source env/bin/activate
  4. Install Python library requirements with pip install -r requirements.txt.
  5. Install Redis. Start the server with redis-server.
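
Before moving on, it can help to confirm that something is actually listening where Redis is expected. The sketch below is only a rough check: it assumes Redis runs on its default port 6379 on localhost, and the helper name is_port_open is made up for this example. It shows that a TCP port accepts connections, not that the process behind it is Redis.

```python
import socket


def is_port_open(host="127.0.0.1", port=6379, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds.

    6379 is the Redis default port; change it if your server
    is configured differently (assumption, not from the workshop).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    print("Redis port reachable:", is_port_open())
```

If this prints False, check that redis-server is still running in another terminal.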

Loading dataset

We are using the Vox articles dataset, which contains all articles published before March 2017. Download the dataset (in TSV format) and copy it into the data/ directory, so that we have data/vox_Articles.tsv.
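
To get a feel for the file before loading it, a TSV can be read with Python's built-in csv module. This is a minimal sketch, not the workshop's loader: it assumes the file has a header row, and the column names title and body in the sample are hypothetical stand-ins for whatever the real dataset uses.

```python
import csv
import io


def read_articles(tsv_file):
    """Yield one dict per row from an open TSV file object.

    Keys come from the header row, so they depend on the dataset.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        yield row


# Small in-memory sample so the sketch runs without the real dataset.
# The column names here are assumptions, not the dataset's actual schema.
sample = io.StringIO(
    "title\tbody\n"
    "First headline\t<p>Some text</p>\n"
    "Second headline\t<p>More text</p>\n"
)
articles = list(read_articles(sample))
print(len(articles), "articles read")
```

For the real file, open it with open("data/vox_Articles.tsv", newline="", encoding="utf-8") and pass the file object to read_articles. Article bodies can be long, so the csv field size limit may need raising with csv.field_size_limit.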

Then we'll want to load and clean the data. In general, this involves:

  • removing HTML tags
  • removing stop words
  • tokenizing
  • stemming
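
The steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the workshop's cleaning code: the stop word list is a tiny made-up sample, and naive_stem is a crude suffix stripper standing in for a real stemmer such as the Porter stemmer. The sketch tokenizes before filtering stop words, since the filter works on individual tokens.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop word list; real pipelines use a much longer one.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "on"}


class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the text of an HTML fragment with tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def clean(html):
    """Run all four steps: strip tags, tokenize, drop stop words, stem."""
    tokens = tokenize(strip_html(html))
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


cleaned = clean("<p>The reporters are covering elections</p>")
print(cleaned)  # ['reporter', 'cover', 'election']
```

A real stemmer handles many cases this one gets wrong (irregular plurals, doubled consonants), so treat naive_stem as a placeholder.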

In order for the Flask API to work, we'll need to build a SQLite database with our articles. To do this, run python --load_from ./data/vox_Articles.tsv.
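
For readers curious what building the database might involve, here is a rough sketch using the sqlite3 module. It is not the repo's loading script: the table name articles, its columns, and the helper names build_database and list_article_ids are all assumptions made for illustration.

```python
import sqlite3


def build_database(conn, articles):
    """Create a hypothetical articles table and insert the given rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO articles (title, body) VALUES (?, ?)",
        [(a["title"], a["body"]) for a in articles],
    )
    conn.commit()


def list_article_ids(conn):
    """Return all article ids in ascending order."""
    rows = conn.execute("SELECT id FROM articles ORDER BY id")
    return [row[0] for row in rows]


# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
build_database(
    conn,
    [
        {"title": "First headline", "body": "Some text"},
        {"title": "Second headline", "body": "More text"},
    ],
)
ids = list_article_ids(conn)
print(ids)  # [1, 2]
```

Passing a file path instead of ":memory:" to sqlite3.connect would persist the database on disk.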

Once you’ve loaded the data into SQLite and set up Redis, we can run the API, which lets us see the data in a more organized fashion: python The API should be running on

We can test that it’s up with, which should return a list of article ids from the database that you can query. You can also pick one of the article ids and try<article_id>, which will output specific data about that article.