Technical Considerations
Marcel Heinz edited this page Oct 12, 2018 · 8 revisions
For each task, we present one possible way to mine Wikipedia's taxonomy and then article parts such as the title, infobox, and summary. The approaches can differ considerably from task to task.
TODO
TODO (see mine folder)
- Download dumps here https://dumps.wikimedia.org/enwiki/
- For extracting articles and categories under certain root categories, you will need the 'page.sql' and the 'categorylinks.sql' dumps.
- Load the SQL dumps into a MySQL database. Beware: follow the advice at https://stackoverflow.com/questions/30387731/loading-enwiki-latest-categorylinks-sql-into-mysql for better performance when loading the dumps. On Windows, you might want to install Windows-grep and Cygwin.
- You can join the tables on categorylinks.cl_from = page.page_id (as advised here: https://stackoverflow.com/questions/4789843/a-sql-query-that-acquires-the-list-of-categories-given-a-page-title-from-wikiped)
- You may want to dump the results of the join into a CSV file. If you run your MySQL server locally, the easiest way is to append INTO OUTFILE '<path.csv>' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' to the query. You can speed up the export by further restricting the join query from the previous step: if you only want subcategory relationships, use WHERE page.page_namespace = 14; if you only want article relationships, use WHERE page.page_namespace = 0.
- If you want to read the CSVs with Python's csv module, first read the advice at https://stackoverflow.com/questions/23897193/handling-escaped-quotes-with-pythons-csv-reader, as you will encounter backslash-escaped quotes inside titles.
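The join, export, and read-back steps above can be sketched in Python as follows. This is a minimal sketch, not the WikiOnto code: the helper names `build_export_query` and `read_export`, the selected columns, and the output path are assumptions made for illustration. It relies on MySQL's default behaviour of escaping embedded quotes with a backslash when writing with INTO OUTFILE, which is why the reader is configured with `escapechar` instead of doubled quotes.

```python
import csv
import io


def build_export_query(namespace, out_path):
    """Build a MySQL statement that joins page and categorylinks and writes CSV.

    namespace: 14 for subcategory relationships, 0 for article relationships.
    out_path:  path on the MySQL server host (hypothetical; not escaped here,
               so only pass trusted values).
    """
    return (
        "SELECT page.page_id, page.page_title, categorylinks.cl_to\n"
        "FROM categorylinks\n"
        "JOIN page ON categorylinks.cl_from = page.page_id\n"
        f"WHERE page.page_namespace = {int(namespace)}\n"
        f"INTO OUTFILE '{out_path}'\n"
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'\n"
        "LINES TERMINATED BY '\\n';"
    )


def read_export(csv_file):
    """Parse rows from the exported file into lists of strings.

    Embedded quotes arrive backslash-escaped, so doubled-quote handling
    is switched off and the backslash is declared as the escape character.
    """
    reader = csv.reader(
        csv_file, quotechar='"', escapechar="\\", doublequote=False
    )
    return list(reader)


if __name__ == "__main__":
    print(build_export_query(14, "/tmp/subcategories.csv"))
    # A made-up row whose title contains escaped quotes and a comma.
    sample = io.StringIO('12,"Foo_\\"bar\\",_baz","Root"\n')
    print(read_export(sample))
```

For a real export, open the file with `open(path, newline="", encoding="utf-8")` and pass the file object to `read_export`. The encoding is an assumption and depends on how the dump was loaded.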
- For mining article texts, download the pages-articles-multistream.xml dump.
- From here on, the format you work in is a free choice. I chose to continue working in CSV format. Beware: the combination ',"' that appears in some articles may break your CSV processing.
- The WikiOnto scripts check whether an article is in scope and then extract its text.
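The scope check and text extraction can be sketched as below. This is an illustration under stated assumptions, not the WikiOnto scripts themselves: `iter_in_scope_articles`, `write_rows`, and `SAMPLE_DUMP` are hypothetical names, and the set of in-scope titles is assumed to come from the category export. The sketch streams the XML with `iterparse` so the dump never has to fit in memory, converts spaces in titles to underscores to match the titles in page.sql, and writes with `csv.QUOTE_ALL` so that a ',"' inside an article text is quoted properly.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Tiny made-up stand-in for pages-articles-multistream.xml.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Java (programming language)</title><ns>0</ns>
    <revision><text>Java is a language ,"mostly" used on servers.</text></revision>
  </page>
  <page><title>Category:Java</title><ns>14</ns>
    <revision><text>Category page.</text></revision>
  </page>
  <page><title>Coffee</title><ns>0</ns>
    <revision><text>Coffee is a drink.</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_in_scope_articles(xml_stream, titles_in_scope):
    """Yield (title, wikitext) for main-namespace pages whose title is in scope."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {local_name(c.tag): (c.text or "") for c in elem.iter()}
        # page.sql stores titles with underscores, the XML dump with spaces.
        title = fields.get("title", "").replace(" ", "_")
        if fields.get("ns") == "0" and title in titles_in_scope:
            yield title, fields.get("text", "")
        elem.clear()  # release the page's children once it has been handled


def write_rows(rows, out_file):
    """Write rows as CSV with every field quoted and embedded quotes doubled."""
    csv.writer(out_file, quoting=csv.QUOTE_ALL).writerows(rows)


if __name__ == "__main__":
    scope = {"Java_(programming_language)"}
    stream = io.BytesIO(SAMPLE_DUMP.encode("utf-8"))
    buffer = io.StringIO()
    write_rows(iter_in_scope_articles(stream, scope), buffer)
    print(buffer.getvalue())
```

For the real dump, something like `bz2.open(path, "rb")` from the standard library could replace the `BytesIO` stream, since the download is bz2-compressed. Files written this way use doubled quotes, so they can be read back with a default `csv.reader`.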