A Utility Library for Wikipedia dumps


Forked Changes

Added support for a plain-text output format alongside the existing XML format. Use case: extracting only the text from Wikipedia so that it can be used as a corpus for machine learning experiments.

To run it, download wikiforia-x.y.z.jar from the dist/ directory, open a terminal, cd to the download location, and run:

java -jar wikiforia-x.y.z.jar \
     -pages [path to the file ending with multistream.xml.bz2] \
     -output [output path] \
     -outputformat plain-text
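
If the extraction is one step in a larger Java pipeline, the same invocation can be launched programmatically. The following is only a minimal sketch: the class name WikiforiaCommand and its build and run helpers are hypothetical and not part of Wikiforia. It assumes that java is on the PATH and uses only the flags shown in the command above.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical helper, not part of Wikiforia: wraps the command line shown above.
class WikiforiaCommand {

    // Builds the argument list using only the flags documented in this README.
    static List<String> build(String jarPath, String pagesPath, String outputPath) {
        return List.of(
            "java", "-jar", jarPath,
            "-pages", pagesPath,
            "-output", outputPath,
            "-outputformat", "plain-text");
    }

    // Starts the process, forwards its console output, and returns its exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: WikiforiaCommand <jar> <pages.xml.bz2> <output>");
            return;
        }
        int exitCode = run(build(args[0], args[1], args[2]));
        System.out.println("wikiforia exited with code " + exitCode);
    }
}
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when the dump or output path contains spaces.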

For everything else, read the original Wikiforia README.md in the upstream marcusklang repository.