This repository contains Java and Scala code that lets Hadoop parse MediaWiki XML dumps through InputFormat classes.
The Java code provides an InputFormat for the old Hadoop API (mapred), and the Scala code provides one for the new API (mapreduce).
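
To illustrate the kind of record splitting an InputFormat for XML dumps has to perform, here is a minimal sketch in plain Java with no Hadoop dependency. It is not this repository's code: the class `PageRecordSketch` and the method `splitPages` are hypothetical names, and treating each `<page>` element as one record is an assumption (the actual classes may use a different record granularity, such as revisions). A real InputFormat would also stream the input and handle records that straddle split boundaries, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of this repository.
public class PageRecordSketch {
    private static final String START = "<page>";
    private static final String END = "</page>";

    /** Returns every complete <page>...</page> block in the input, in order. */
    public static List<String> splitPages(String xml) {
        List<String> pages = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(START, from);
            if (start < 0) {
                break; // no further pages
            }
            int end = xml.indexOf(END, start);
            if (end < 0) {
                break; // trailing page is incomplete: skip it
            }
            end += END.length();
            pages.add(xml.substring(start, end));
            from = end; // continue searching after this page
        }
        return pages;
    }

    public static void main(String[] args) {
        String dump = "<mediawiki><page><title>A</title></page>"
                + "<page><title>B</title></page></mediawiki>";
        for (String page : splitPages(dump)) {
            System.out.println(page);
        }
    }
}
```

In a Hadoop job, each such block would typically be handed to the mapper as the value of one key/value record.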