Wikilinks is a parsing framework for Wikipedia written in Python. The framework is intended to extract different link features (e.g., network topological, visual) from Wikipedia in order to study human navigation. It can be used in combination with the clickstream dataset by Ellery Wulczyn and Dario Taraborelli from Wikimedia. The corresponding Wikipedia XML dump can be found here. Click here for more recent dumps.
The framework extracts the
title of an article from the XML dump. Redirects are resolved using the XML dump. The corresponding HTML file for each article is then crawled from the Wikimedia API and processed.
For each link (
target_article_id pair in the
links table) in the zero namespace of Wikipedia, it then extracts the following information:
- target_position_in_text: the target link's position in the text
- target_position_in_text_only: the target link's position in the text only; all links in tables are ignored
- target_position_in_section: the target link's position in its section
- target_position_in_section_in_text_only: the target link's position in the text of its section only; all links in tables of the section are ignored
- section_name: the name of the section
- section_number: the number of the section
- target_position_in_table: the position of the target link in the table
- table_number: the number of the table
- table_css_class: the cascading style sheet (CSS) class of the table (can be used to classify the tables, e.g., infobox, navbox, etc.)
- table_css_style: further styling of the table, extracted from the style attribute of the table tag (can be used to classify the tables, e.g., infobox, navbox, etc.)
- target_x_coord_1920_1080: the x coordinate of the upper left corner of the target link as rendered at a resolution of 1920x1080
- target_y_coord_1920_1080: the y coordinate of the upper left corner of the target link as rendered at a resolution of 1920x1080
For each article in the article table, we also extract the length of the rendered HTML page and store it in the page_length_1920_1080 field of the page_length table. The page length can be used in different ways, e.g., for normalization.
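As one illustration of such a normalization, a link's vertical coordinate can be divided by the length of the rendered page, which gives a relative position between 0 and 1. This is only a minimal sketch, assuming both values are pixel measurements for the 1920x1080 rendering; the function name `normalized_y` is hypothetical and not part of the framework.

```python
def normalized_y(target_y_coord, page_length):
    """Return a link's vertical position as a fraction of the page length.

    Both arguments are assumed to be pixel values from the same rendering.
    """
    if page_length <= 0:
        raise ValueError("page_length must be positive")
    return float(target_y_coord) / page_length


# A link 540 px from the top of a 2160 px long page sits a quarter of the way down.
print(normalized_y(540, 2160))  # 0.25
```

Normalized positions make links on short and long pages comparable, which raw pixel coordinates do not.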
Building the database
CREATE DATABASE `wikilinks` DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_bin;
GRANT ALL ON `wikilinks`.* TO `wikilinks`@`localhost` IDENTIFIED BY 'wikilinks';
GRANT ALL ON `wikilinks`.* TO `wikilinks`@`%` IDENTIFIED BY 'wikilinks';
We use binary collation so that strings, e.g., article titles, are compared byte by byte and therefore case-sensitively (see the Stack Overflow entry).
Please copy the
conf_template.py file to
conf.py and change the settings according to your database setup and preferences.
Modules description and use
The builder.py script should be the first script to execute after creating the database. It should be rather self-explanatory and allows one to:
- Create the basic database structure (create tables: articles and redirects).
- Create the reference entries for articles by parsing the Wikipedia dump files and resolving redirects.
crawler.py uses the
rev_id of an article in the 'articles' table to crawl the corresponding HTML file.
This process takes around 2 days with 20 threads. The size of the zipped dump is around 60GB.
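To make the crawling step more concrete, the sketch below builds a request URL for the rendered HTML of a single revision using the MediaWiki `action=parse` API with the `oldid` parameter. It is an illustration only and not necessarily how crawler.py issues its requests; the endpoint constant and the function name `build_parse_url` are assumptions.

```python
from urllib.parse import urlencode

# Assumption: the English Wikipedia endpoint; the framework's own settings may differ.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_parse_url(rev_id):
    """Build a MediaWiki API URL requesting the rendered HTML of one revision."""
    params = {
        "action": "parse",  # render the revision's wikitext to HTML
        "oldid": rev_id,    # the revision id, e.g. rev_id from the articles table
        "prop": "text",     # return only the rendered HTML body
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


print(build_parse_url(12345))
```

Requesting a fixed revision rather than the current page title keeps the crawled HTML consistent with the XML dump the articles were read from.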
startlinkinserter.py script creates and populates the tables:
page_length. An Xvfb screen has to be available at DISPLAY 1 before the script can be run, since it extracts the visual positions of the links.
You will need a lot of RAM for this process and it can take some days to finish.
After the links are extracted, the
links_index.sql script should be executed in order to create index structures.
The tableclassinserter.py script creates and populates the table
table_css_class. After the CSS classes are extracted, the
table_css_class_index.sql script should be executed in order to create index structures.
Importing and classifying the clickstream data.
The scripts for creating and classifying the clickstream data are located in the
The first script to execute is
clickstream.sql. It creates the
clickstream table and imports the transitions data (referrer-resource pairs).
The unique_links.sql script has to be executed after the
links table is populated. Since a link can occur multiple times in an article, the
unique_links.sql script creates a table containing only distinct links.
This table represents the Wikipedia network.
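The deduplication itself is done in SQL, but the idea can be sketched in Python: collapse repeated (source, target) pairs into a single edge. The function name `unique_links` and the tuple layout are assumptions made for this illustration only.

```python
def unique_links(links):
    """Return the distinct (source_id, target_id) pairs in first-seen order."""
    seen = set()
    distinct = []
    for source_id, target_id in links:
        pair = (source_id, target_id)
        if pair not in seen:
            seen.add(pair)
            distinct.append(pair)
    return distinct


# Article 1 links to article 2 twice; the network keeps a single edge.
links = [(1, 2), (1, 3), (1, 2), (2, 3)]
print(unique_links(links))  # [(1, 2), (1, 3), (2, 3)]
```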
clickstream_derived.sql is the last one to be executed. This script matches the transitions in the clickstream data with the links extracted by the parser. Additionally, it classifies the transitions for the purpose of studying navigation according to the following schema:
- internal-link: a link from article a to article b, both in the zero namespace.
- internal-self-loop: a link from article a to itself, where a is in the zero namespace.
- internal-teleportation: a transition from article a to article b, both in the zero namespace, but in article a there is no (network structural) link to article b.
- internal-nonexistent: a transition from article a, where a is in the zero namespace, but
- sm-entrypoint: transitions from social media websites (Facebook and Twitter) to an article in the zero namespace.
- se-entrypoint: transitions from search engines (Google, Yahoo! and Bing) to an article in the zero namespace.
- wikipedia-entrypoint: transitions from other Wikipedia projects (language editions) to an article in the zero namespace.
- wikimedia-entrypoint: transitions from other Wikimedia projects to an article in the zero namespace.
- noreferrer: transitions without a known source (e.g., typed directly into the browser's address bar) to an article in the zero namespace.
- other: transitions from a source that is known but not relevant (no search engine, no social media, no Wiki project, etc.) to an article in the zero namespace.
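The internal part of this schema can be sketched as a small decision function. The sketch below covers only internal-link, internal-self-loop and internal-teleportation; it is not the SQL used by clickstream_derived.sql, and the function name `classify_internal` as well as the in-memory sets of articles and links are assumptions made for the example.

```python
def classify_internal(source, target, articles, links):
    """Classify a transition between two zero-namespace articles.

    articles: set of known zero-namespace article titles (assumed input)
    links:    set of (source, target) pairs extracted by the parser (assumed input)
    Returns None when an endpoint is not a known article; those cases are
    outside the scope of this sketch.
    """
    if source not in articles or target not in articles:
        return None
    if source == target:
        return "internal-self-loop"
    if (source, target) in links:
        return "internal-link"
    # Both articles exist, but the source page has no link to the target.
    return "internal-teleportation"


articles = {"A", "B", "C"}
links = {("A", "B")}
print(classify_internal("A", "B", articles, links))  # internal-link
print(classify_internal("A", "C", articles, links))  # internal-teleportation
print(classify_internal("A", "A", articles, links))  # internal-self-loop
```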
Creates the Wikipedia network in the graph-tool format from the unique links extracted by the parser.
Creates a network in the graph-tool format from the transitions in the clickstream that could be mapped to links in the
The heatmaps.py script uses the clickstream data and the link data to create heatmaps showing the screen regions in which links are placed and consumed.
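The core of such a heatmap is a two-dimensional histogram over link coordinates. The following is a minimal pure-Python sketch of that binning step, not the code in heatmaps.py; the function name `bin_positions`, the cell size of 120 pixels and the restriction to the first 1920x1080 screen are assumptions for illustration.

```python
def bin_positions(points, width=1920, height=1080, cell=120):
    """Count (x, y) points per grid cell; rows index y, columns index x.

    Points outside the width x height area are skipped.
    """
    cols = -(-width // cell)   # ceiling division
    rows = -(-height // cell)
    grid = [[0] * cols for _ in range(rows)]
    for x, y in points:
        if 0 <= x < width and 0 <= y < height:
            grid[int(y // cell)][int(x // cell)] += 1
    return grid


# Two links near the upper left corner, one near the lower right corner,
# and one outside the screen area that is ignored.
grid = bin_positions([(10, 10), (100, 50), (1900, 1000), (2000, 5)])
print(grid[0][0], grid[8][15])  # 2 1
```

Counting each link once gives a placement heatmap; adding the number of observed transitions instead of 1 would give a consumption heatmap over the same grid.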
The rbo.py script calculates the rank-biased overlap (RBO) between ranked transition lists.
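Rank-biased overlap compares two rankings by averaging the overlap of their top-d prefixes, with geometrically decreasing weights so that agreement at the top counts most. The sketch below sums these weighted agreements up to the length of the shorter list, without the extrapolation to unseen ranks that the full measure defines. It is not the implementation in rbo.py; the function name `rbo_truncated` and the default p = 0.9 are assumptions, and each list is assumed to contain no duplicates.

```python
def rbo_truncated(list_a, list_b, p=0.9):
    """Truncated rank-biased overlap of two duplicate-free ranked lists.

    Computes (1 - p) * sum over d of p**(d - 1) * |A[:d] & B[:d]| / d,
    for d from 1 to the length of the shorter list.
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    depth = min(len(list_a), len(list_b))
    score = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(list_a[:d]) & set(list_b[:d]))
        score += p ** (d - 1) * overlap / d
    return (1 - p) * score


print(rbo_truncated(["a", "b", "c"], ["a", "b", "c"], p=0.5))  # 0.875
print(rbo_truncated(["a", "b"], ["b", "a"], p=0.5))            # 0.25
```

Because the sum is truncated, even identical lists score below 1; the missing weight is p raised to the evaluation depth (0.125 in the first example).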
- import categories and assign a category to each article.
- extract links from captions of figures.
- extract the anchor text of the links.
- configurable number of threads for the crawler and for the parsers.
- Paul Laufer
- Daniel Lamprecht
- Philipp Singer
- Florian Lemmerich
This project is published under the MIT License.