Technical Considerations
Marcel Heinz edited this page Sep 24, 2018 · 8 revisions
For each task, we describe one possible approach (the approaches can differ considerably from task to task): first how to mine Wikipedia's category taxonomy, then how to mine article parts such as the title, infobox, and summary.
TODO
TODO (see mine folder)
- Download the dumps from https://dumps.wikimedia.org/enwiki/
- To extract the articles and categories under certain root categories, you need the 'page' and 'categorylinks' dumps (enwiki-latest-page.sql.gz and enwiki-latest-categorylinks.sql.gz).
- Load the SQL dumps into a MySQL database. Beware: follow the advice at https://stackoverflow.com/questions/30387731/loading-enwiki-latest-categorylinks-sql-into-mysql for better performance when loading the dumps. On Windows, you might want to install Windows-grep and Cygwin.
- You can join the tables on categorylinks.cl_from = page.page_id (as advised here: https://stackoverflow.com/questions/4789843/a-sql-query-that-acquires-the-list-of-categories-given-a-page-title-from-wikiped)
- You may want to dump the results of the join into a CSV file. If you run your MySQL server locally, the easiest way is to append INTO OUTFILE '<path.csv>' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' to the query (the file is written on the server host). You can speed up the export by narrowing the join from the previous step. If you only want subcategory relationships, use WHERE page.page_namespace = 14. If you only want article relationships, use WHERE page.page_namespace = 0.
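
The join and the namespace filters from the steps above can be tried on a toy dataset before running them against the full dumps. The following is only a minimal sketch: it uses Python's built-in sqlite3 module in place of MySQL, the two tables contain only the columns used here, and the sample rows as well as the helper names `links` and `to_csv` are made up for illustration. Because SQLite has no INTO OUTFILE, the CSV is produced with Python's csv module instead.

```python
import csv
import io
import sqlite3

# Toy stand-in for the MySQL tables loaded from the dumps.
# Assumption: only the columns used by the join are modelled here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
INSERT INTO page VALUES
  (1, 14, 'Programming_languages'),
  (2, 14, 'Functional_languages'),
  (3, 0, 'Haskell'),
  (4, 0, 'Python_(programming_language)');
INSERT INTO categorylinks VALUES
  (2, 'Programming_languages'),
  (3, 'Functional_languages'),
  (4, 'Programming_languages');
""")

# Same join condition as in the list above: categorylinks.cl_from = page.page_id.
JOIN_SQL = """
SELECT page.page_title, categorylinks.cl_to
FROM categorylinks
JOIN page ON categorylinks.cl_from = page.page_id
WHERE page.page_namespace = ?
ORDER BY page.page_title
"""


def links(namespace):
    """Return (member title, parent category) pairs for one namespace."""
    return conn.execute(JOIN_SQL, (namespace,)).fetchall()


def to_csv(rows):
    """Serialize rows as comma-separated text, quoting only where needed."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


subcategory_links = links(14)  # namespace 14: category pages
article_links = links(0)       # namespace 0: articles

print(to_csv(subcategory_links), end="")
print(to_csv(article_links), end="")
```

Each row pairs a member page with the category it belongs to, so the namespace 14 rows describe the subcategory edges of the taxonomy and the namespace 0 rows attach articles to categories.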