etk

This repository will contain our toolkit for extracting information from web pages. It will be built in stages to provide the following capabilities:

Several structure extractors to identify the main content of a page and tables
A host of data extractors for common entities, including people, places, phone numbers, email addresses, dates, etc.
A trainable algorithm to rank extractions
Automated experimentation to measure precision and recall of extractions
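
To make the last capability concrete, here is a minimal sketch of how precision and recall could be computed for one extractor's output against a hand-labeled gold set. It does not use the etk API: the function name `precision_recall` and the sample data are hypothetical and only illustrate the two metrics.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted values against a gold set.

    Hypothetical helper for illustration; not part of etk.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: fraction of extracted values that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: fraction of gold values that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up example: three extractions, two of which appear in the gold set.
extracted = {"alice@example.com", "bob@example.com", "not-an-email"}
gold = {
    "alice@example.com",
    "bob@example.com",
    "carol@example.com",
    "dave@example.com",
}
precision, recall = precision_recall(extracted, gold)
print("precision=%.2f recall=%.2f" % (precision, recall))
```

Comparing values as sets assumes exact string matches; a real evaluation would likely normalize values (for example, phone number formats) before comparing.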

Setup

conda-env create .
source activate etk_env
python -m spacy download en

python -m unittest discover

jupyter notebook etk_examples.ipynb
or
jupyter notebook etk_extraction_using_config.ipynb

Before running the code in either notebook, change the kernel to Python [conda env:etk_env].
