This repository will contain our toolkit for extracting information from web pages. It will be built in stages to contain the following capabilities:
- Several structure extractors to identify the main content of a page and tables
- A host of data extractors for common entities, including people, places, phone, email, dates, etc.
- A trainable algorithm to rank extractions
- Automated experimentation to measure precision and recall of extractions
conda-env create .
source activate etk_env
python -m spacy download en
python -m unittest discover
jupyter notebook etk_examples.ipynb
or
jupyter notebook etk_extraction_using_config.ipynb
Before running the code in the notebook, change the kernel to
Python [conda env:etk_env]