This is a Perl port of scy's levitation. It reads MediaWiki dump files revision
by revision and writes a data stream to stdout suitable for git fast-import.

The first 1000 pages of the German Wikipedia and all their revisions (about
390000) can be dumped in about 15 minutes on relatively moderate hardware.

Dependencies
------------

You need at least Perl 5.10. The Perl interpreter has to be compiled with
threads support.

You also need a working C compiler for the inline SHA1 C function. Currently
this _must_ be gcc 4.3, callable as 'gcc-4.3'. This will be fixed soon.

You need the following modules and their dependencies from CPAN:

- Regexp::Common
- Inline
- JSON::XS
- Compress::Raw::Zlib
- Carp::Assert
- CDB_File
- XML::Bare >= 0.44
- Deep::Hash::Utils

Some Linux distributions already package the first five modules. Under
Debian / Ubuntu the following command should install them:

    sudo apt-get install libregexp-common-perl \
        libinline-perl libjson-xs-perl \
        libcompress-raw-zlib-perl libcarp-assert-perl

Usage
-----

First, initialize a git repository:

    cd /tmp
    mkdir blawiki
    cd blawiki
    git init

Then, "levitate". This is a three-step process:

    cat /path/to/blawiki-dump.xml | /path/to/levitation-perl/step1.pl
    LC_ALL=C sort rev-table.txt > rev-sorted.txt
    /path/to/levitation-perl/step2.pl | /path/to/levitation-perl/gfi.pl

Alternatively, you can change to an empty directory and call the "levitate"
helper script with the path to a dump as its parameter (the dump may be 7z,
bz2, gz or xml):

    mkdir /tmp/blawiki
    cd /tmp/blawiki
    /path/to/levitation-perl/levitate /path/to/blawiki-dump...

Lots of progress information is printed to standard error, so it may be best to
redirect that to a file.

Have fun.
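
Appendix: what a fast-import stream looks like
----------------------------------------------

For readers unfamiliar with the stream format mentioned at the top, the sketch below prints a minimal git fast-import stream for a single page revision: one blob holding the page text, followed by one commit that adds it as a file. It is only an illustration of the format, not the output of this tool. The function name `emit_revision`, the file name, the committer identity and the branch name are all assumptions made for the example, and the stream that levitation actually writes may be structured differently.

```shell
#!/bin/sh
# Hypothetical helper, not part of levitation-perl: print one blob and one
# commit in git fast-import syntax for a single page revision.
#   $1 = file path inside the repository (assumed naming scheme)
#   $2 = page text
#   $3 = commit message
#   $4 = revision time as a Unix timestamp
emit_revision() {
    path=$1 text=$2 msg=$3 when=$4

    # "data <n>" announces exactly n bytes of payload, so count bytes.
    # tr strips the padding some wc implementations add.
    text_len=$(printf '%s' "$text" | wc -c | tr -d ' ')
    msg_len=$(printf '%s' "$msg" | wc -c | tr -d ' ')

    # The blob carries the page text and is labelled with mark :1.
    printf 'blob\nmark :1\ndata %s\n%s\n' "$text_len" "$text"

    # The commit refers back to the blob through that mark.
    printf 'commit refs/heads/master\n'
    printf 'committer Example User <user@example.invalid> %s +0000\n' "$when"
    printf 'data %s\n%s\n' "$msg_len" "$msg"
    printf 'M 100644 :1 %s\n\n' "$path"
}

# Print the stream for one made-up revision.
emit_revision 'Example_page.mediawiki' 'Hello, wiki!' 'First revision' 1234567890
```

Inside a repository created with `git init`, piping the output of such a script into `git fast-import` should create a single commit on `master`. A real import emits a blob and a commit for every revision and uses a distinct mark for each blob.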