This is Levitation, a project to convert Wikipedia database dumps into Git
repositories. It has been successfully tested with a small Wiki
(pdc.wikipedia.org) having 3,700 articles and 40,000 revisions. Importing
those took 10 minutes on a Core 2 Duo 1.66 GHz. RAM usage is minimal: pages
are imported one after the other, so the process never needs more memory than
it takes to hold all revisions of a single page. You should be safe with 1 GB
of RAM.

See below (“Things that work”) for the current status.

Some knowledge of Git is required to use this tool.


How it should be done:

You can get recent dumps of all Wikimedia wikis at:
http://download.wikimedia.org/backup-index.html

The pages-meta-history.xml file is what we want. (In case you’re wondering:
Wikimedia does not offer content SQL dumps anymore, and there is no full-
history dump for en.wikipedia.org because of its size.) It includes all pages
in all namespaces and all of their revisions. The problem is the data’s order.

In the dump files, the first <page> tag contains the first page that has been
created and all of its revisions. The second <page> tag contains the second
page and all of its revisions, and so on. However, when importing into Git, you
need that data sorted by the revision’s time, across all pages.

Confused? Let me rephrase that. The data in the XML dump is grouped by pages.
Assuming the Wiki was created in 2005, you get the very first revision of the
very first page and then every following revision _of_that_page_, even if the
last revision is from 2 days ago. Then, the stream goes on with the second page
that was ever created, back in 2005, and again all of its history. The pages in
the XML are ordered by the date they were first created.

In contrast, Git wants the first commit to be the state _of_the_whole_Wiki_ at
the time the first page was created. That one is easy, as it is the first
revision in the XML stream. The second commit is not so easy, though. If the
second change on that Wiki was to modify the one (and only) existing page,
that’s fine. If, however, a second page was created, the next commit needs to
add that page’s content, without first doing something with the other revisions
of the first page.

Therefore you need to reorder the data in the dump. Reordering several TB of
data isn’t the most trivial thing to do. A careful balance between performance,
disk and memory usage must be found. The plan is as follows.

In order to minimize the amount of disk space required, the import.py tool
reads the XML stream from stdin, allowing you to pipe a file that is
decompressed on the fly (using bzip2 or 7zip) through it.

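As a rough illustration of the streaming approach (only a sketch; the real
import.py may parse the XML differently), the dump can be walked page by page
without ever holding the whole file in memory:

import sys
import xml.etree.ElementTree as ET

def pages(stream=sys.stdin.buffer):
    # Stream the dump instead of loading it at once: iterparse() hands back
    # each element as soon as it is complete, and clearing a finished <page>
    # keeps memory bounded by the revisions of one page.
    for event, elem in ET.iterparse(stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]   # drop the XML namespace, if any
        if tag == "page":
            yield elem                      # caller reads its <revision>s
            elem.clear()                    # then free the finished page
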
import.py’s stdout should be connected to git-fast-import, which will take care
of writing the revisions to disk. Since they are in the wrong order, only blobs
will be created at that time. import.py will use the “mark” feature of
git-fast-import to provide the MediaWiki revision ID (which is unique across
all pages) to git-fast-import. That way, we will later be able to reference
that particular revision when creating commits.

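To sketch what that part of the stream looks like (the helper below is
illustrative, not the actual import.py code), a single revision text could be
written as a marked blob like this:

import sys

def emit_blob(revision_id, text):
    # git-fast-import reads "blob", an optional "mark :<n>" line and a
    # "data <byte count>" header followed by the raw content. Marking the
    # blob with the MediaWiki revision ID lets a later commit refer to
    # exactly this blob as ":<revision_id>".
    out = sys.stdout.buffer
    data = text.encode("utf-8")
    out.write(b"blob\n")
    out.write(b"mark :%d\n" % revision_id)
    out.write(b"data %d\n" % len(data))
    out.write(data + b"\n")
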
However, we still need to remember all the revisions’ metadata and be able to
access it quickly, given the revision ID. To be able to do that, an
additional metadata file will be written while reading the XML stream. It
contains, for each revision, a fixed-width dataset, at the position
(revision-number * dataset-width).

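Below is a sketch of such a fixed-width file. The field layout is made up for
illustration; only the seek-to-(revision-number * dataset-width) idea comes
from the plan above, with a 17-byte record chosen to match the METAFILE figure
under “Storage requirements” below.

import struct

# Made-up layout: 4-byte page ID, 4-byte user ID, 8-byte timestamp,
# 1-byte flags = 17 bytes per revision. The real field layout may differ.
REV_RECORD = struct.Struct("<IIQB")

def write_revision_meta(f, revision_id, page_id, user_id, timestamp, flags):
    f.seek(revision_id * REV_RECORD.size)   # slot for this revision ID
    f.write(REV_RECORD.pack(page_id, user_id, timestamp, flags))

def read_revision_meta(f, revision_id):
    f.seek(revision_id * REV_RECORD.size)
    return REV_RECORD.unpack(f.read(REV_RECORD.size))
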
Additionally, we need to keep track of user-id/user-name relations. A second
file keeps track of those, using 255-byte strings per user, again at the
position (user-id * 256), since the first byte stores the string length.

The same is true for page titles, therefore we need a third file.

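Both lookup files can work the same way; here is a sketch assuming the
256-byte slots described above (one length byte plus up to 255 bytes of
UTF-8):

def write_name(f, record_id, name):
    # One 256-byte slot per ID: the first byte holds the length, the
    # remaining 255 bytes hold the UTF-8 name, zero-padded. (A real
    # implementation should avoid cutting a multi-byte character in half.)
    data = name.encode("utf-8")[:255]
    f.seek(record_id * 256)
    f.write(bytes([len(data)]) + data.ljust(255, b"\0"))

def read_name(f, record_id):
    f.seek(record_id * 256)
    length = f.read(1)[0]
    return f.read(length).decode("utf-8")
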
Now we have to walk through the revisions file and create a commit using the
blob we already created, collecting page title and author name from the other
files as we go.

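Each resulting commit in the fast-import stream could then look roughly like
this (again only a sketch with placeholder field names; the path handling is
discussed further below):

def emit_commit(out, revision_id, path, author, email, timestamp, message):
    # ":<revision_id>" refers to the blob marked earlier; "M 100644" stages
    # it at the given path. Without a "from" line, git-fast-import simply
    # builds on the current tip of the branch.
    msg = message.encode("utf-8")
    who = "%s <%s> %d +0000" % (author, email, timestamp)
    out.write(b"commit refs/heads/master\n")
    out.write(("author %s\ncommitter %s\n" % (who, who)).encode("utf-8"))
    out.write(b"data %d\n" % len(msg))
    out.write(msg + b"\n")
    out.write(("M 100644 :%d %s\n\n" % (revision_id, path)).encode("utf-8"))
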
The real content files will need to be put into different directories because
of limitations in the maximum number of files per directory.


Now to the actual implementation:


Things that work:

- Read a Wikipedia XML full-history dump and output it in a format suitable
  for piping into git-fast-import(1). The resulting repository contains one
  file per page. All revisions are available in the history. There are some
  restrictions; read below.

- Use the original modification summary as commit message.

- Read the Wiki URL from the XML file and set user mail addresses accordingly.


Things that are still missing:

- Store additional information in the commit message that specifies page and
  revision ID as well as whether the edit was marked as “minor”.

- Use the page’s name as file name instead of the page ID. To do that, we need
  a meta file that stores the page names. Page name _should_ be limited to 255
  _bytes_ of UTF-8, however MediaWiki is not quite clear about whether that
  should actually be _characters_. Additionally, the page name does not
  include the namespace name, even though in the dump it’s prefixed with
  the namespace. The meta file needs to store the namespace ID as well. Since
  there usually are only a few namespaces (10 or 20), we can keep their names
  in RAM.

- Use the author name in the commit instead of the user ID. To do that, we
  need a meta file that stores the author names. Author name _should_ be
  limited to 255 _bytes_ of UTF-8, however MediaWiki is not quite clear about
  whether that should actually be _characters_.

- Think of a neat file hierarchy for the file tree: You cannot throw 3 million
  articles into a single directory. Instead, you need additional subdirectory
  levels. I’ll probably go for the first n bytes of the hashed article name
  (see the sketch after this list).

- Use a locally timezoned timestamp for the commit date instead of a UTC one.

- Allow IPv6 addresses as IP edit usernames. (Although, as far as I can see,
  MediaWiki itself cannot handle IPv6 addresses, so we have some time.)
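
Here is the sketch referred to in the list above: one possible way to spread
pages across subdirectories. The hash, the nesting depth and the file
extension are assumptions for illustration, not necessarily what import.py
will end up doing.

import hashlib

def page_path(title, levels=2):
    # e.g. "1a/2b/Some_page.mediawiki": the first few hex digits of the
    # hashed title pick the subdirectories, keeping any single directory
    # from holding millions of files.
    digest = hashlib.sha1(title.encode("utf-8")).hexdigest()
    dirs = [digest[2 * i:2 * i + 2] for i in range(levels)]
    return "/".join(dirs + [title.replace("/", "_") + ".mediawiki"])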


Things that are strange:

- The resulting Git repo is larger than the _uncompressed_ XML file. Delta
  compression is working fine. I suspect the problem is large trees because
  of the still missing subdirectories.


Things that are cool:

- “git checkout master~30000” takes you back 30,000 edits in time — and on my
  test machine it only took about a second.

- The XML data might be in the wrong order to directly create commits from it,
  but it is in the right order for blob delta compression: When passing blobs
  to git-fast-import, delta compression will be tried based on the previous
  blob — which is the same page, one revision before. Therefore, delta
  compression will succeed and save you tons of storage.


Example usage:

This will import the pdc.wikipedia.org dump into a new Git repository “repo”:

rm -rf repo; git init --bare repo && \
./import.py < ~/pdcwiki-20091103-pages-meta-history.xml | \
GIT_DIR=repo git fast-import | \
sed 's/^progress //'


Storage requirements:

Let “maxrev” be the highest revision ID in the file. You can probably retrieve
it using something like: tac dump.xml | grep -m 1 '^ <id>'

The revision metadata storage (METAFILE) needs maxrev*17 bytes.

The revision comment storage (COMMFILE) needs maxrev*257 bytes.

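For a sense of scale: the pdc.wikipedia.org test dump mentioned above has
roughly 40,000 revisions, so METAFILE would come to about 40,000 * 17 ≈ 680 KB
and COMMFILE to about 40,000 * 257 ≈ 10 MB.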

Contacting the author:

This monster is written by Tim “Scytale” Weber. It is an experiment to see
whether the current “relevance war” in the German Wikipedia can be ended by
decentralizing content.

Find ways to contact me on http://scytale.name/contact/, talk to me on Twitter
(@Scytale) or on IRC (freenode, #oqlt).

Get the most up-to-date code at http://scytale.name/proj/levitation/.


This whole bunch of tasty bytes is licensed under the terms of the WTFPLv2.