Technical Considerations

Marcel Heinz edited this page Oct 12, 2018 · 8 revisions

How to access Articles?

For every task, we present a possibly very different way to mine Wikipedia's taxonomy and then the article parts such as the title, Infobox, and summary.

Wikipedia API

TODO

DBpedia SPARQL Endpoint

TODO (see mine folder)

Wikipedia Dumps

Categories

  1. Download the dumps from https://dumps.wikimedia.org/enwiki/
  2. To extract the articles and categories under certain root categories, you will need the 'page.sql' and 'categorylinks.sql' dumps.
  3. To mine article texts, download the pages-articles-multistream.xml dump.
    1. From here on, the working format is a free choice. I chose to continue in CSV format. Beware: the character combination ',"' that appears in some articles may break your CSV processing.
    2. The WikiOnto scripts check whether an article is in scope and then extract its text.
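
The steps above can be sketched in Python. This is only an illustration under stated assumptions, not the WikiOnto implementation: it assumes the 'categorylinks.sql' rows have already been joined with 'page.sql' and reduced to `(member_title, category_title, kind)` tuples, where `kind` is `"page"` or `"subcat"`. The function names `collect_scope` and `write_rows`, the tuple layout, and the `max_depth` parameter are hypothetical. The CSV part relies on Python's standard `csv` module with full quoting, so that a ',"' inside an article text is escaped rather than read as a field boundary.

```python
import csv
import io
from collections import defaultdict, deque


def collect_scope(links, root, max_depth=3):
    """Breadth-first walk over category links, starting at `root`.

    `links` is an iterable of (member_title, category_title, kind) tuples
    (an assumed, simplified layout). Returns the set of visited categories
    and the set of articles found in them.
    """
    members_of = defaultdict(list)
    for member, category, kind in links:
        members_of[category].append((member, kind))

    categories = {root}
    articles = set()
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        for member, kind in members_of.get(category, ()):
            if kind == "page":
                articles.add(member)
            elif kind == "subcat" and depth < max_depth and member not in categories:
                # The visited check guards against cycles in the category graph.
                categories.add(member)
                queue.append((member, depth + 1))
    return categories, articles


def write_rows(rows, handle):
    """Write rows as CSV, quoting every field so that commas and double
    quotes inside article text cannot shift the columns."""
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL)
    writer.writerows(rows)


def read_rows(handle):
    """Read back rows written by `write_rows`."""
    return list(csv.reader(handle))
```

To use it, call `collect_scope(links, "Computer_languages")` with your own root category, then pass `(title, text)` rows for the in-scope articles to `write_rows` with a file opened using `newline=""`. The depth limit is a design choice: Wikipedia's category graph drifts off topic quickly, so an unbounded walk tends to pull in unrelated articles.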
