NLP in Journalism Workshop at PyDays
- Clone this repo with
- Create a virtualenv environment with
- Activate it with
- Install Python library requirements with
pip install -r requirements.txt.
- Install Redis. Start the server with
We are using the Vox articles dataset, which contains all articles published on Vox.com before March 2017.
You can download the dataset (in TSV format) from https://data.world/elenadata/vox-articles. Copy it into the
data/ directory, so that we have
Then we'll want to load and clean the data. In general, this involves:
- removing HTML tags
- removing stop words
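As a rough illustration of those two cleaning steps, here is a minimal sketch using only the standard library. The function names and the tiny stop-word list are assumptions made for this example; they are not necessarily what the workshop code uses, and a real pipeline would likely rely on a fuller stop-word list.

```python
import html
import re

# Tiny illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {
    "a", "an", "and", "the", "of", "to", "in", "is",
    "it", "on", "for", "that", "this", "with",
}

TAG_RE = re.compile(r"<[^>]+>")        # anything that looks like an HTML tag
TOKEN_RE = re.compile(r"[a-z0-9']+")   # lowercase word-like tokens


def strip_html(text):
    """Replace HTML tags with spaces, then decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub(" ", text))


def remove_stop_words(text):
    """Lowercase the text, tokenize it, and drop tokens in STOP_WORDS."""
    tokens = TOKEN_RE.findall(text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def clean(text):
    """Apply both cleaning steps and return the remaining tokens."""
    return remove_stop_words(strip_html(text))


print(clean("<p>The senate voted on the <b>bill</b> &amp; it passed.</p>"))
# ['senate', 'voted', 'bill', 'passed']
```

A regex is enough for a demo, but messy real-world markup is usually better handled by a proper HTML parser.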
In order for the Flask API to work, we'll need to build a SQLite database with our articles. To do this, run
python main.py --load_from ./data/vox_Articles.tsv.
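To give a feel for what loading a TSV file into SQLite can look like, here is a small sketch. It is not the contents of main.py: the `load_tsv` helper, the `articles` table layout, and the `title` and `body` column names are all hypothetical, and the sample data is made up so the snippet runs on its own.

```python
import csv
import io
import sqlite3


def load_tsv(tsv_file, conn):
    """Insert every row of a tab-separated file into a hypothetical `articles` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles "
        "(id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
    )
    reader = csv.DictReader(tsv_file, delimiter="\t")
    # `title` and `body` are assumed column names, not the real dataset schema.
    rows = [(row["title"], row["body"]) for row in reader]
    conn.executemany("INSERT INTO articles (title, body) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


# Made-up sample data standing in for the real TSV file.
sample = io.StringIO(
    "title\tbody\n"
    "First headline\tSome text\n"
    "Second headline\tMore text\n"
)
conn = sqlite3.connect(":memory:")
print(load_tsv(sample, conn))
print(conn.execute("SELECT id, title FROM articles").fetchall())
```

Using an in-memory database keeps the example self-contained; pointing `sqlite3.connect` at a file path would persist the table instead.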
Once the data is loaded into SQLite and Redis is set up, we can run the API, which lets us browse the articles in a more organized fashion:
python api.py. The API should be running on
We can test that it’s up with
http://0.0.0.0:8000/articles, which should return a list of article ids from the database that you can query. You can also pick one of the article ids and try
http://0.0.0.0:8000/articles/<article_id>, which will output specific data about that article.
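The same two endpoints can also be queried from Python. The sketch below uses only the standard library; the helper names are made up for this example, and it assumes the API responds with JSON, which the text above does not confirm.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://0.0.0.0:8000"  # address used in the URLs above


def article_url(article_id=None):
    """Build the URL for the article list, or for one article if an id is given."""
    if article_id is None:
        return f"{BASE_URL}/articles"
    return f"{BASE_URL}/articles/{article_id}"


def fetch_json(url):
    """GET a URL and decode the body, assuming the API returns JSON."""
    with urlopen(url, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def main():
    """Print the article listing; only works while the API is running."""
    print(fetch_json(article_url()))
```

With the API running, calling `main()` prints the listing, and `fetch_json(article_url(some_id))` fetches a single article for an id taken from that listing.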