Postprocess XML output from wikiprep (Wikipedia preprocessor) into JSON
Python
Switch branches/tags
Nothing to show
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
.hgignore
README
wikiprep-to-json.py
xmlwikiprep.py

README

wikiprep-postprocess
--------------------

written by Joseph Turian
copyright (C) 2011, MetaOptimize LLC
released under the BSD license

Postprocess XML output from wikiprep (Wikipedia preprocessor) into one
JSON element per line (suitable for mongoimport).
This uses cElementTree.

SETUP:
    # Download a wikidump
    wget http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-articles.xml.bz2
    # Install wikiprep
    git clone http://code.zemanta.com/tsolc/git/wikiprep wikiprep-zemanta.git
    # ...

    # Run wikiprep
    wikiprep -format composite -compress -f enwiki-20110526-pages-articles.xml.bz2

USAGE:
    ./wikiprep-to-json.py enwiki-20110526-pages-articles.gum.xml.*.gz > enwiki-20110526-pages-articles.json

REQUIREMENTS:
    My Python common library:
        http://github.com/turian/common/