This is Levitation, a project to convert Wikipedia database dumps into Git
repositories. It has been successfully tested with a small wiki
(pdc.wikipedia.org) of 3,700 articles and 40,000 revisions. Importing those
took 10 minutes on a Core 2 Duo 1.66 GHz. RAM usage is minimal: pages are
imported one after the other, so at most the revisions of a single page have
to be kept in memory at any time. You should be safe with 1 GB of RAM.


How it should be done:
You can get recent dumps of all Wikimedia wikis at:
http://download.wikimedia.org/backup-index.html

The pages-meta-history.xml file is what we want. (In case you’re wondering:
Wikimedia does not offer content SQL dumps anymore, and there are no full-
history dumps for en.wikipedia.org because of its size.) It includes all pages
in all namespaces and all of their revisions. The problem is the data’s order.
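
To give an idea of the dump’s shape, here is a minimal Python sketch
(illustration only, not Levitation’s import.py) that streams over the <page>
and <revision> elements and counts them without loading the whole file; the
tag names follow the MediaWiki export schema and may carry an XML namespace
that has to be stripped first:

  import sys
  import xml.etree.ElementTree as etree

  def local(tag):
      # e.g. "{http://www.mediawiki.org/xml/export-0.3/}page" -> "page"
      return tag.rsplit('}', 1)[-1]

  pages = revisions = 0
  for event, elem in etree.iterparse(sys.stdin):
      tag = local(elem.tag)
      if tag == 'revision':
          revisions += 1
      elif tag == 'page':
          pages += 1
          elem.clear()  # drop the finished page's subtree to keep memory low
  print('%d pages, %d revisions' % (pages, revisions))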

In the dump files, the first <page> tag contains the first page that has been
created and all of its revisions. The second <page> tag contains the second
page and all of its revisions, and so on. Git, however, needs the commits in
chronological order across all pages. As long as the wiki consists of a single
page, that’s fine. If, however, a second page was created, the next commit
needs to add that page’s content, without first doing something with the other
revisions of the first page.
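
A toy example (made-up page names and timestamps) of the mismatch between the
two orderings:

  # Order in the dump: grouped by page, each page's revisions together.
  dump_order = [
      ('Page A', 1, '2004-01-01T12:00:00Z'),
      ('Page A', 2, '2004-03-01T12:00:00Z'),
      ('Page B', 3, '2004-02-01T12:00:00Z'),
  ]
  # Order the commits need: strictly by timestamp, across all pages.
  commit_order = sorted(dump_order, key=lambda rev: rev[2])
  # -> Page A rev 1, then Page B rev 3, then Page A rev 2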

Therefore you need to reorder the data in the dump. Reordering several TB of
data isn’t the most trivial thing to do. A careful balance between performance,
disk and memory usage must be found. The plan is as follows.

In order to minimize the amount of disk space required, the import.py tool
reads the XML stream from stdin, allowing you to pipe in a file that is
decompressed on the fly instead of having to store the uncompressed XML on
disk.

The real content files will need to be put into different directories because
of limitations in the maximum number of files per directory.
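
One way to do that (an illustration of the general idea, matching the “first n
bytes of the hashed article name” plan mentioned further below, not a final
decision) is to derive the directory from a hash of the page name:

  import hashlib

  def content_path(page_name, levels=2, width=2):
      # A few directory levels taken from the hex digest, then the digest
      # itself as the file name, e.g. 'ab/cd/abcd...'.
      digest = hashlib.md5(page_name.encode('utf-8')).hexdigest()
      shards = [digest[i * width:(i + 1) * width] for i in range(levels)]
      return '/'.join(shards + [digest])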


Sorry if all of this seems a bit confused; it was written in a hurry.

Now to the actual implementation:


Things that work:

- Read a Wikipedia XML full-history dump and output it in a format suitable
for piping into git-fast-import(1). The resulting repository contains one
file per page. All revisions are available in the history. There are some
restrictions; read below.
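
For readers unfamiliar with git-fast-import(1), the following hand-written
Python sketch shows roughly what such a stream looks like. It is simplified
and not the actual output of import.py; the user, the timestamp, the page ID
“4711” and the revision text are made up:

  import sys

  def data(buf):
      # fast-import "data" blocks are prefixed with their length in bytes
      # (real code has to count bytes, not characters).
      sys.stdout.write('data %d\n%s\n' % (len(buf), buf))

  # One blob per revision text ...
  sys.stdout.write('blob\nmark :1\n')
  data('The wiki text of revision 1 of this page.\n')

  # ... and one commit per revision, pointing the page's file at that blob.
  sys.stdout.write('commit refs/heads/master\nmark :2\n')
  sys.stdout.write('committer Some User <user@example.invalid> '
                   '1257500000 +0000\n')
  data('Revision 1')
  sys.stdout.write('M 100644 :1 4711\n\n')

Piped into git fast-import inside a freshly initialized repository, this would
create a single commit that adds the file “4711”.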


Things that are still missing:

- Use the original modification summary as the commit message. Currently we
use the rather boring "Revision X". However, these summaries need to be
temporarily stored in an external file, because we cannot keep them in RAM
for larger wikis. Also store additional information in the commit message
that specifies the page and revision ID (one possible layout is sketched
after this list).

- Use the page’s name as file name instead of the page ID. To do that, we need
a meta file that stores the page names. The page name _should_ be limited to
255 _bytes_ of UTF-8; however, MediaWiki is not quite clear about whether that
should actually be _characters_. Additionally, the page name does not
include the namespace name, even though in the dump it is prefixed with
the namespace. The meta file needs to store the namespace ID as well. Since
there usually are only a few namespaces (10 or 20), we can keep their names
in RAM.

- Use the author name in the commit instead of the user ID. To do that, we
need a meta file that stores the author names. The author name _should_ be
limited to 255 _bytes_ of UTF-8; however, MediaWiki is not quite clear about
whether that should actually be _characters_.

- Think of a neat file hierarchy for the file tree: You cannot throw 3 million
articles into a single directory. Instead, you need additional subdirectory
levels. I’ll probably go for the first n bytes of the hashed article name.

- Use a locally timezoned timestamp for the commit date instead of a UTC one.

- Allow IPv6 addresses as IP edit usernames. (As far as I can see, MediaWiki
itself cannot handle IPv6 addresses yet, so we have some time.)
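
Regarding the first point above: the commit message layout is not specified
yet, but purely as a guess, to make the idea concrete, it could keep the
summary as the subject line and append the page and revision ID:

  def commit_message(summary, page_id, rev_id):
      # Fall back to the current boring message when the summary is empty.
      subject = summary or 'Revision %d' % rev_id
      return '%s\n\nPage-Id: %d\nRevision-Id: %d\n' % (
          subject, page_id, rev_id)

  # commit_message('Fixed a typo.', 4711, 42)
  # -> 'Fixed a typo.\n\nPage-Id: 4711\nRevision-Id: 42\n'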

Things that are strange:

- The resulting Git repo is larger than the _uncompressed_ XML file. Delta
compression is working fine. I suspect the problem is large trees because
of the still missing subdirectories.


Things that are cool:

- “git checkout master~30000” takes you back 30,000 edits in time — and on my
test machine it only took about a second.

- The XML data might be in the wrong order to directly create commits from it,
but it is in the right order for blob delta compression: When passing blobs
to git-fast-import, delta compression will be tried based on the previous
blob — which is the same page, one revision before. Therefore, delta
compression will succeed and save you tons of storage.


Example usage:
This will import the pdc.wikipedia.org dump into a new Git repository “repo”:
  ./import.py < ~/pdcwiki-20091103-pages-meta-history.xml | \
    GIT_DIR=repo git fast-import | \
    sed 's/^progress //'


Contacting the author:

This monster is written by Tim “Scytale” Weber. It is an experiment to find
out whether the current “relevance war” in the German Wikipedia can be ended
by decentralizing content.

Find ways to contact me on http://scytale.name/contact/, talk to me on Twitter
(@Scytale) or on IRC (freenode, #oqlt).

Get the most up-to-date code at http://github.com/scy/levitation.


This whole bunch of tasty bytes is licensed under the terms of the WTFPLv2.
