AItools 4 - Acquisition - Web Page Content Extraction

A library and command-line program for extracting main-content sentences from web pages. The program can run either locally or on Hadoop:

java -cp aitools4-aq-web-page-content-extraction-bin.jar de.aitools.aq.web.extractor.PotthastJerichoExtractor local --help

java -cp aitools4-aq-web-page-content-extraction-bin.jar de.aitools.aq.web.extractor.PotthastJerichoExtractor local --input foo.html,bar.warc.gz --output out

hadoop jar aitools4-aq-web-page-content-extraction-bin.jar de.aitools.aq.web.extractor.PotthastJerichoExtractor hadoop --input foo.html,bar.warc.gz --output out

See the documentation of de.aitools.aq.web.extractor.PotthastJerichoExtractor for more information. To extract all text from a web page, use de.aitools.aq.web.extractor.JerichoHtmlSentenceExtractor.
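
The local invocation shown above can also be started from Java code by spawning it as a child process. The following is only a minimal sketch: the class name ExtractorLauncher and the helper buildLocalCommand are hypothetical and not part of this library. It assumes that java is on the PATH and that the jar lies in the working directory, and it uses only the local mode and the --input and --output options from the commands above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the library: it only wraps the
// documented command line and calls no library API directly.
final class ExtractorLauncher {

    // Main class named in the command-line examples of this README.
    static final String MAIN_CLASS =
        "de.aitools.aq.web.extractor.PotthastJerichoExtractor";

    // Builds the argument list for a local run. Several inputs are joined
    // with commas, as in "--input foo.html,bar.warc.gz".
    static List<String> buildLocalCommand(
            String jarPath, List<String> inputs, String output) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-cp");
        command.add(jarPath);
        command.add(MAIN_CLASS);
        command.add("local");
        command.add("--input");
        command.add(String.join(",", inputs));
        command.add("--output");
        command.add(output);
        return command;
    }

    public static void main(String[] args)
            throws IOException, InterruptedException {
        List<String> command = buildLocalCommand(
            "aitools4-aq-web-page-content-extraction-bin.jar",
            List.of("foo.html", "bar.warc.gz"),
            "out");
        // Forward the child's stdout and stderr to this process and
        // pass its exit code through.
        Process process = new ProcessBuilder(command).inheritIO().start();
        System.exit(process.waitFor());
    }
}
```

Running the extractor as a separate process keeps the sketch independent of the extractor's Java API, which is described in the documentation of the class itself.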

If you use this software, please cite it as:

Johannes Kiesel, Benno Stein, and Stefan Lucks.
A Large-scale Analysis of the Mnemonic Password Advice.
In Proceedings of the 24th Annual Network and Distributed System Security Symposium (NDSS 17),
February 2017. 

The default settings of the extractor are the ones used in the paper.

Dependencies (packed into the aitools4-aq-web-page-content-extraction-bin.jar)

  • aitools3-ie-languagedetection (available on request)
  • aitools3-ie-stopwords (available on request)
  • apache-commons-cli-1.2
  • apache-hadoop-2.5.2
  • apache-httpcomponents-client-4.5.2
  • icu4j-4.8.1.1
  • jericho-html-3.3
  • Lemur project WARC classes (for WARC files and Hadoop)