
AItools 4 - Acquisition - Web Page Content Extraction

Library and command-line program for extracting the main-content sentences from web pages. The program can run either locally or on Hadoop:

java -cp aitools4-aq-web-page-content-extraction-bin.jar local --help

java -cp aitools4-aq-web-page-content-extraction-bin.jar local --input foo.html,bar.warc.gz --output out

hadoop jar aitools4-aq-web-page-content-extraction-bin.jar hadoop --input foo.html,bar.warc.gz --output out
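The commands above pass several inputs as one comma-separated `--input` value. As a minimal sketch, the shell helper below builds such a value from a list of file names. The function name `join_inputs` is hypothetical and not part of the program. The sketch assumes that no file name contains a comma and that `--input` accepts more than the two entries shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the tool): join all arguments with
# commas, matching the comma-separated --input form used above.
join_inputs() {
  joined=""
  for f in "$@"; do
    if [ -z "$joined" ]; then
      joined="$f"
    else
      joined="$joined,$f"
    fi
  done
  printf '%s\n' "$joined"
}

# Possible usage (assumes the jar is in the current directory):
# java -cp aitools4-aq-web-page-content-extraction-bin.jar local \
#   --input "$(join_inputs *.html)" --output out
```

The usage line is left commented out so that the snippet can be sourced without the jar being present.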

See the documentation of for more information. To extract all text from a web page, use

When you use this software, cite it as

Johannes Kiesel, Benno Stein, and Stefan Lucks.
A Large-scale Analysis of the Mnemonic Password Advice.
In Proceedings of the 24th Annual Network and Distributed System Security Symposium (NDSS 17),
February 2017. 


The default settings of the extractor are the ones used in the paper.

Dependencies (packed into the aitools4-aq-web-page-content-extraction-bin.jar)

  • aitools3-ie-languagedetection (available on request)
  • aitools3-ie-stopwords (available on request)
  • apache-commons-cli-1.2
  • apache-hadoop-2.5.2
  • apache-httpcomponents-client-4.5.2
  • icu4j-
  • jericho-html-3.3
  • Lemur project WARC classes (for WARC files and Hadoop)