Technical Considerations

Marcel Heinz edited this page Sep 24, 2018 · 8 revisions

How to access Articles?

For each task, we present a possibly very different way to mine Wikipedia's taxonomy and then extract article parts such as the title, infobox, and summary.

Wikipedia API

TODO

DBpedia SPARQL Endpoint

TODO (see mine folder)

Wikipedia Dumps

Categories

1. Download the dumps from https://dumps.wikimedia.org/enwiki/
2. To extract articles and categories under certain root categories, you need the 'page' and 'categorylinks' dumps.
3. Load the SQL dumps into a MySQL database. Beware: follow the advice at https://stackoverflow.com/questions/30387731/loading-enwiki-latest-categorylinks-sql-into-mysql for better performance when loading the dumps. On Windows, you might want to install Windows-grep and Cygwin.
4. Join the tables on categorylinks.cl_from = page.page_id (as advised here: https://stackoverflow.com/questions/4789843/a-sql-query-that-acquires-the-list-of-categories-given-a-page-title-from-wikiped).
5. You may want to dump the results of the join into a CSV file. If you run your MySQL server locally, the easiest way is to use INTO OUTFILE '<path.csv>' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n'. You can speed up the export by further refining the query from step 4: if you only want subcategory relationships, use WHERE page.page_namespace = 14; if you only want article relationships, use WHERE page.page_namespace = 0.
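
The join, the namespace filter, and the CSV export described above could be assembled roughly as in the following sketch. It is an illustration under assumptions, not the query used in this project: the helper names (build_join_query, with_outfile, demo_rows) and the selected columns page_title and cl_to are choices made here, and the two sample rows are invented. The demo runs against an in-memory SQLite database as a stand-in for the loaded MySQL dumps. The INTO OUTFILE clause is MySQL-specific, so it is only built as a string and never executed.

```python
import sqlite3

# MediaWiki namespace numbers: 0 = articles, 14 = category pages.
ARTICLE_NS = 0
CATEGORY_NS = 14


def build_join_query(namespace=None):
    """Return the page/categorylinks join, optionally limited to one namespace."""
    query = (
        "SELECT page.page_title, categorylinks.cl_to "
        "FROM categorylinks "
        "JOIN page ON categorylinks.cl_from = page.page_id"
    )
    if namespace is not None:
        # int() ensures only a plain number is interpolated into the SQL.
        query += f" WHERE page.page_namespace = {int(namespace)}"
    return query


def with_outfile(query, path):
    """Append the MySQL-only export clause; `path` is a placeholder."""
    return (
        f"{query} INTO OUTFILE '{path}' "
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n'"
    )


def demo_rows(namespace=None):
    """Run the join on a tiny in-memory SQLite stand-in for the real dumps."""
    conn = sqlite3.connect(":memory:")
    # Simplified tables with invented sample rows (the real dumps have more columns).
    conn.executescript(
        """
        CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);
        CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
        INSERT INTO page VALUES (1, 0, 'Python_(programming_language)');
        INSERT INTO page VALUES (2, 14, 'Scripting_languages');
        INSERT INTO categorylinks VALUES (1, 'Scripting_languages');
        INSERT INTO categorylinks VALUES (2, 'Programming_languages');
        """
    )
    sql = build_join_query(namespace) + " ORDER BY page.page_id"
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    # Subcategory relationships only (child category, parent category).
    print(demo_rows(CATEGORY_NS))
    # The statement one might run on a local MySQL server for article relationships.
    print(with_outfile(build_join_query(ARTICLE_NS), "<path.csv>"))
```

With the namespace set to 14, each result row pairs a category page with one of its parent categories. With the namespace set to 0, each row pairs an article with a category it belongs to. Applying the filter inside the query keeps the exported CSV smaller than filtering it afterwards.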