webis-de/aitools4-aq-web-page-content-extraction

AItools 4 - Acquisition - Web Page Content Extraction

Library and command-line program for extracting main-content sentences from web pages. The program can run either locally or on Hadoop:

java -cp aitools4-aq-web-page-content-extraction-bin.jar de.aitools.aq.web.extractor.PotthastJerichoExtractor local --help

java -cp aitools4-aq-web-page-content-extraction-bin.jar de.aitools.aq.web.extractor.PotthastJerichoExtractor local --input foo.html,bar.warc.gz --output out

hadoop jar aitools4-aq-web-page-content-extraction-bin.jar de.aitools.aq.web.extractor.PotthastJerichoExtractor hadoop --input foo.html,bar.warc.gz --output out
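The commands above pass several inputs to --input as one comma-separated list. When the inputs live in a directory, that list can be assembled with a small shell helper. The following is only a sketch: the function name build_input_list is hypothetical, and it assumes that --input accepts any number of comma-separated paths (as the two-file example suggests) and that no path contains a comma.

```shell
# Hypothetical helper: print a comma-separated list of all .html and
# .warc.gz files found directly in the directory given as $1.
build_input_list() {
  dir="$1"
  list=""
  for f in "$dir"/*.html "$dir"/*.warc.gz; do
    # Skip patterns that matched nothing (the glob stays literal then).
    [ -e "$f" ] || continue
    # Prepend a comma only when the list already has an entry.
    list="${list:+$list,}$f"
  done
  printf '%s\n' "$list"
}

# Example use with the local mode shown above (left commented out here):
# java -cp aitools4-aq-web-page-content-extraction-bin.jar \
#   de.aitools.aq.web.extractor.PotthastJerichoExtractor local \
#   --input "$(build_input_list pages)" --output out
```

Here pages is a placeholder directory name. For an empty directory the helper prints an empty line, so it is worth checking the result before starting the extractor.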

See the documentation of de.aitools.aq.web.extractor.PotthastJerichoExtractor for more information. To extract all text from a web page, use de.aitools.aq.web.extractor.JerichoHtmlSentenceExtractor.

If you use this software, please cite it as:

Johannes Kiesel, Benno Stein, and Stefan Lucks.
A Large-scale Analysis of the Mnemonic Password Advice.
In Proceedings of the 24th Annual Network and Distributed System Security Symposium (NDSS 17),
February 2017. 

The default settings of the extractor are the ones used in the paper.

Dependencies (packed into the aitools4-aq-web-page-content-extraction-bin.jar)

  • aitools3-ie-languagedetection (available on request)
  • aitools3-ie-stopwords (available on request)
  • apache-commons-cli-1.2
  • apache-hadoop-2.5.2
  • apache-httpcomponents-client-4.5.2
  • icu4j-4.8.1.1
  • jericho-html-3.3
  • Lemur project WARC classes (for WARC files and Hadoop)
