wolfgarbe/WikipediaExport

WikipediaExport
MIT License

Convert Wikipedia XML dump files to JSON or Text files

Text corpora are required for algorithm design and benchmarking in information retrieval, machine learning, and natural language processing.
Wikipedia data is ideal for this purpose because it is large (7 million documents in the English Wikipedia) and available in many languages.

Unfortunately, the XML format of the Wikipedia dump is idiosyncratic and awkward to work with directly. WikipediaExport solves this problem by converting the XML dump to plain text or JSON, two formats that many tools can consume easily.

Download Wikipedia dump files from:
http://dumps.wikimedia.org/enwiki/latest/
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
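The download is bzip2-compressed, while the usage examples in this README pass a path ending in .xml, so the dump presumably has to be decompressed before it is handed to WikipediaExport. A minimal Python sketch of that step follows; the function name decompress_dump and the file paths are illustrative assumptions and not part of WikipediaExport.

```python
import bz2
import shutil
from pathlib import Path


def decompress_dump(bz2_path: str, xml_path: str, chunk_size: int = 1024 * 1024) -> int:
    """Stream-decompress a .bz2 file to xml_path and return the number of bytes written.

    Hypothetical helper: copies in chunks so the dump never has to fit in memory.
    """
    with bz2.open(bz2_path, "rb") as src, open(xml_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return Path(xml_path).stat().st_size


if __name__ == "__main__":
    # Assumed local paths; adjust them to wherever the dump was saved.
    size = decompress_dump(
        "enwiki-latest-pages-articles.xml.bz2",
        "enwiki-latest-pages-articles.xml",
    )
    print(f"wrote {size} bytes")
```

Any standard bzip2 tool does the same job; the point is only that the decompressed file is very large, so it should be streamed rather than read into memory at once.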

Usage

Export to text file:
dotnet WikipediaExport.dll inputpath="C:\data\wikipedia\enwiki-latest-pages-articles.xml" format=text

Export to JSON file:
dotnet WikipediaExport.dll inputpath="C:\data\wikipedia\enwiki-latest-pages-articles.xml" format=json

Output file format

Text file

Five consecutive lines constitute a single document:
title
content
domain
url
docDate (Unix time: milliseconds since the beginning of 1970)
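The five-line layout above can be read back with a small streaming parser. The following Python sketch is one way to do it, under stated assumptions: the file is UTF-8 text, records follow each other with no separator lines, and docDate is written as a plain integer. The names WikiDoc and read_text_export are hypothetical and do not come from WikipediaExport.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from itertools import islice
from typing import Iterable, Iterator

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass
class WikiDoc:
    title: str
    content: str
    domain: str
    url: str
    doc_date: datetime  # docDate converted from milliseconds since 1970 (UTC)


def read_text_export(lines: Iterable[str]) -> Iterator[WikiDoc]:
    """Yield one WikiDoc per group of five consecutive lines."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\r\n") for line in islice(it, 5)]
        if len(record) < 5:
            return  # end of input, or an incomplete trailing record that is skipped
        title, content, domain, url, millis = record
        yield WikiDoc(title, content, domain, url,
                      EPOCH + timedelta(milliseconds=int(millis)))
```

Because the function accepts any iterable of lines, an open file object can be passed directly, for example `read_text_export(open(path, encoding="utf-8"))`, where `path` is the export file produced by the tool.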

JSON file

Each document contains the following fields:
title
content (all "\r" have been replaced with " ")
domain
url
docDate (Unix time: milliseconds since the beginning of 1970)
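A consumer of the JSON export will typically want to check that a record carries all five fields and turn docDate into a real timestamp. The Python sketch below does that for a single JSON object. It assumes that the field names listed above are used verbatim as JSON keys; the README does not say whether the file is one JSON array or one object per line, so the hypothetical helper parse_record works on one already isolated object.

```python
import json
from datetime import datetime, timedelta, timezone

FIELDS = ("title", "content", "domain", "url", "docDate")
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_record(raw: str) -> dict:
    """Decode one JSON object and convert docDate (ms since 1970) to a UTC datetime."""
    record = json.loads(raw)
    missing = [name for name in FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Assumed: docDate is an integer or an integer-valued string.
    record["docDate"] = EPOCH + timedelta(milliseconds=int(record["docDate"]))
    return record
```

Adding a timedelta to a fixed epoch avoids floating-point rounding and platform limits that can affect datetime.fromtimestamp for very old or very large values.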

Application

WikipediaExport is used to generate the input data for LuceneBench, a benchmark program that compares the performance of Lucene (a search engine library written in Java, powering the search platforms Solr and Elasticsearch) with that of SeekStorm (a high-performance search platform written in C#, powering the SeekStorm Search as a Service).


WikipediaExport is contributed by SeekStorm, the high-performance Search as a Service and search API.