This repository documents ongoing work on the evaluation of main text extraction from web pages.
The contribution is twofold:
- A meta-tool to use various text extraction systems at once
- An evaluation procedure based on a reference cleaned version
To come: unsupervised evaluation.
The evaluation tool can compare different outputs to a single reference. It can also be used to compare different versions of a text produced in other text extraction settings, such as OCR or ASR.
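As an illustration of comparing several outputs to a single reference, here is a minimal sketch using only the standard library. It is not the evaluation code of this repository: the function names `token_scores` and `compare_outputs` are hypothetical, and whitespace tokenization with a longest-matching-blocks alignment is an assumption chosen for brevity.

```python
from difflib import SequenceMatcher


def token_scores(candidate, reference):
    """Token-level precision, recall and F1 of a candidate text against a reference."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    # Align the two token sequences and count the tokens found in matching blocks.
    matcher = SequenceMatcher(None, cand_tokens, ref_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    precision = matched / len(cand_tokens) if cand_tokens else 0.0
    recall = matched / len(ref_tokens) if ref_tokens else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def compare_outputs(outputs, reference):
    """Score every output (tool name -> extracted text) against the same reference."""
    return {name: token_scores(text, reference) for name, text in outputs.items()}
```

The same function applies unchanged to OCR or ASR outputs, since it only assumes plain text on both sides.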
Most tools work with Python 3 only. You need to have pip installed (https://pip.pypa.io/en/stable/installing/).
Run the following command to install all the packages (this can take a while):
pip install -r requirements.txt
NB: If you are a Windows user, take a look at this page: https://projects.raspberrypi.org/en/projects/using-pip-on-windows/2
You can now run this command:
python test_all_tools.py
Directories:
- Corpus/html: raw HTML files
- Corpus/cleaned: cleaned files, one directory per tool
- Corpus/reference: reference cleaned versions (needed for evaluation)
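Given the Corpus/html, Corpus/cleaned and Corpus/reference directories described above, the following sketch shows one way to pair each tool's cleaned files with their references. The helper `pair_with_references` is hypothetical, and it assumes that a cleaned file and its reference share the same file name, which the repository may or may not guarantee.

```python
from pathlib import Path


def pair_with_references(corpus_root):
    """Yield (tool, cleaned_path, reference_path) for each cleaned file that has a reference.

    Assumption: Corpus/cleaned/<tool>/<name> is evaluated against Corpus/reference/<name>.
    """
    root = Path(corpus_root)
    cleaned_dir = root / "cleaned"
    reference_dir = root / "reference"
    if not cleaned_dir.is_dir():
        return
    for tool_dir in sorted(p for p in cleaned_dir.iterdir() if p.is_dir()):
        for cleaned_file in sorted(p for p in tool_dir.iterdir() if p.is_file()):
            reference_file = reference_dir / cleaned_file.name
            # Files without a reference are skipped, since they cannot be evaluated.
            if reference_file.is_file():
                yield tool_dir.name, cleaned_file, reference_file
```

A call such as `list(pair_with_references("Corpus"))` would then give the file pairs to feed to an evaluation function.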
We defined three categories: (I) tools designed to extract all the textual content (recall-oriented tools), usually not focused on press articles; (II) tools focusing on the readability of web pages and (III) tools dedicated to text content extraction.
Category | Tool | Version | GitHub repository | Reference |
---|---|---|---|---|
I | Html2text | 2020.1.16 | Alir3z4/html2text/ | [https://core.ac.uk/download/pdf/127601559.pdf] |
I | Inscriptis | 1.0 | weblyzard/inscriptis | |
II | Newspaper3k | 0.2.8 | codelucas/newspaper | |
II | News-please | 1.4.25 | fhamborg/news-please | |
II | Readability | 0.7.1 | buriy/python-readability | |
III | Boilerpy3 | 1.0.2 | jmriebold/BoilerPy3 | [https://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf] |
III | Dragnet | 2.0.4 | dragnet-org/dragnet | [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.402.4694&rep=rep1&type=pdf] |
III | Goose3 | 3.1.6 | goose3/goose3 | |
III | JusText | 2.2.0 | miso-belica/jusText | [https://is.muni.cz/th/45523/fi_d/phdthesis.pdf] |
III | Trafilatura | 0.4.1 | adbar/trafilatura | [https://hal.archives-ouvertes.fr/hal-02447264/document] |
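To illustrate the meta-tool idea of running several extractors at once, here is a minimal sketch of a registry of extraction functions behind one interface. It is not the code of test_all_tools.py: `EXTRACTORS`, `run_all` and `baseline_extract` are hypothetical names, and the only registered extractor is a naive standard-library baseline. The tools from the table would presumably be added the same way, each wrapped in a function that takes an HTML string and returns text.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes while skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def baseline_extract(html):
    """Naive recall-oriented baseline: every visible text node, one per line."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(collector.parts)


# Hypothetical registry: tool name -> function(html string) -> text.
EXTRACTORS = {"baseline": baseline_extract}


def run_all(html, extractors=EXTRACTORS):
    """Run every registered extractor; a failing tool yields None instead of aborting."""
    results = {}
    for name, extract in extractors.items():
        try:
            results[name] = extract(html)
        except Exception as error:
            print(f"{name} failed: {error}")
            results[name] = None
    return results
```

Catching exceptions per tool is a design choice for this sketch, so that one crashing extractor does not prevent the others from producing output.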
Web-Assembled Data-Driven Language-oriented Evaluation. Just because.
Authors: Gaël Lejeune & Adrien Barbaresi.
##TODO: Add instructions and a Makefile for processing everything
Node.js issues (readabilipy): see https://www.digitalocean.com/community/tutorials/how-to-install-node-js-on-ubuntu-18-04-fr
Encoding errors: UTF-8 should be the norm, but in practice it is not. Some issues and possible solutions (Non-ISO extended-ASCII text): https://superuser.com/questions/669700/non-iso-extended-ascii-text
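One possible workaround for files that are not valid UTF-8 is to try a short list of encodings in order. The sketch below is an assumption about how this could be handled, not what the repository does: `read_text_with_fallback` is a hypothetical helper, and the fallback order (utf-8, then cp1252, then latin-1) is a guess that suits Western European text only.

```python
def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_used), trying each encoding in order.

    The order is an assumption; latin-1 accepts any byte sequence, so it acts
    as a last resort that never fails but may produce wrong characters.
    """
    with open(path, "rb") as handle:
        raw = handle.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached when a custom list without a catch-all encoding is given.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"
```

Returning the encoding that succeeded makes it possible to log which files deviate from UTF-8.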
Current work: Windows OS issues: "DLL load failed while importing"
- cchardet: PyYoshi/cChardet#61
- lxml (via Goose import)