Convert Wikipedia XML dump files to JSON or Text files
Text corpora are required for algorithm design and benchmarking in information retrieval, machine learning, and natural language processing.
Wikipedia data is well suited for this because it is large (about 7 million documents in the English Wikipedia) and available in many languages.
Unfortunately, the XML format of the Wikipedia dump is unwieldy and not directly usable by most tools. WikipediaExport solves this problem by converting the XML dump to plain text or JSON, two formats that can easily be consumed by many tools.
Download Wikipedia dump files at:
http://dumps.wikimedia.org/enwiki/latest/
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Export to text file:
dotnet WikipediaExport.dll inputpath="C:\data\wikipedia\enwiki-latest-pages-articles.xml" format=text
Export to JSON file:
dotnet WikipediaExport.dll inputpath="C:\data\wikipedia\enwiki-latest-pages-articles.xml" format=json
Text file
Five consecutive lines constitute a single document (see the reading sketch after this field list):
title
content
domain
url
docDate (Unix time: milliseconds since the beginning of 1970)
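A minimal C# sketch of reading the text export back into document records, assuming the five-line-per-document layout above (the WikiDoc record and its field types are illustrative, not part of the tool):

using System.Collections.Generic;
using System.IO;

public record WikiDoc(string Title, string Content, string Domain, string Url, long DocDate);

public static class TextExportReader
{
    // Groups every five consecutive lines of the text export into one document:
    // title, content, domain, url, docDate (Unix time in milliseconds).
    public static IEnumerable<WikiDoc> Read(string path)
    {
        using var reader = new StreamReader(path);
        while (reader.ReadLine() is string title)
        {
            string content = reader.ReadLine() ?? "";
            string domain  = reader.ReadLine() ?? "";
            string url     = reader.ReadLine() ?? "";
            long.TryParse(reader.ReadLine(), out long docDate);
            yield return new WikiDoc(title, content, domain, url, docDate);
        }
    }
}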
JSON file
title
content (all "\r" have been replaced with " ")
domain
url
docDate (Unix time: milliseconds since the beginning of 1970)
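The exact on-disk layout of the JSON export (a single array versus one object per line) is not described above; as a hedged sketch, a single document record with these fields could be deserialized with System.Text.Json as shown below (the class name and the sample values are illustrative):

using System;
using System.Text.Json;

public class WikiJsonDoc
{
    public string title { get; set; } = "";
    public string content { get; set; } = "";  // exporter has already replaced "\r" with " "
    public string domain { get; set; } = "";
    public string url { get; set; } = "";
    public long docDate { get; set; }           // Unix time in milliseconds
}

public static class JsonExportExample
{
    public static void Main()
    {
        // Hypothetical single-document record using the fields listed above.
        string json = "{\"title\":\"Example\",\"content\":\"Example article text ...\"," +
                      "\"domain\":\"en.wikipedia.org\",\"url\":\"https://en.wikipedia.org/wiki/Example\"," +
                      "\"docDate\":1577836800000}";
        WikiJsonDoc? doc = JsonSerializer.Deserialize<WikiJsonDoc>(json);
        Console.WriteLine(doc?.title);
    }
}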
WikipediaExport is used to generate the input data for LuceneBench, a benchmark program to compare the performance of Lucene (a search engine library written in Java, powering the search platforms Solr and Elasticsearch) and SeekStorm (a high-performance search platform written in C#, powering the SeekStorm Search as a Service).
WikipediaExport is contributed by SeekStorm, the high-performance Search as a Service & search API.