AItools 4 - Acquisition - Web Page Content Extraction

A library and command-line program that extracts the main-content sentences from web pages. The program can run either locally or on Hadoop:

java -cp aitools4-aq-web-page-content-extraction-bin.jar de.aitools.aq.web.extractor.PotthastJerichoExtractor local --help

java -cp aitools4-aq-web-page-content-extraction-bin.jar de.aitools.aq.web.extractor.PotthastJerichoExtractor local --input foo.html,bar.warc.gz --output out

hadoop jar aitools4-aq-web-page-content-extraction-bin.jar de.aitools.aq.web.extractor.PotthastJerichoExtractor hadoop --input foo.html,bar.warc.gz --output out
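
The commands above pass several inputs to --input as one comma-separated list. When there are many files, that list can be built with a small shell helper. The following is a minimal sketch, not part of the tool: the function names join_inputs and extract_local are hypothetical, and it assumes that file names contain no commas and that --input accepts any number of comma-separated entries in the form shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the extractor): join all arguments
# with commas to form the value for --input. The body runs in a
# subshell so the changed IFS does not leak into the caller.
join_inputs() (
  IFS=,
  printf '%s\n' "$*"
)

# Hypothetical wrapper around the local invocation shown in the README.
# Usage: extract_local OUTPUT_DIR FILE...
# Assumes the jar is in the current directory.
extract_local() {
  out_dir=$1
  shift
  java -cp aitools4-aq-web-page-content-extraction-bin.jar \
    de.aitools.aq.web.extractor.PotthastJerichoExtractor local \
    --input "$(join_inputs "$@")" --output "$out_dir"
}
```

For example, `extract_local out foo.html bar.warc.gz` would run the same local command as above with `--input foo.html,bar.warc.gz --output out`.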

See the documentation of de.aitools.aq.web.extractor.PotthastJerichoExtractor for more information. To extract all text from a web page, use de.aitools.aq.web.extractor.JerichoHtmlSentenceExtractor.

If you use this software, please cite it as:

Johannes Kiesel, Benno Stein, and Stefan Lucks.
A Large-scale Analysis of the Mnemonic Password Advice.
In Proceedings of the 24th Annual Network and Distributed System Security Symposium (NDSS 17),
February 2017. 


The default settings of the extractor are the ones used in the paper.

Dependencies (packed into the aitools4-aq-web-page-content-extraction-bin.jar)

  • aitools3-ie-languagedetection (available on request)
  • aitools3-ie-stopwords (available on request)
  • apache-commons-cli-1.2
  • apache-hadoop-2.5.2
  • apache-httpcomponents-client-4.5.2
  • icu4j-4.8.1.1
  • jericho-html-3.3
  • Lemur project WARC classes (for WARC files and Hadoop)
