svick edited this page Jan 29, 2012 · 6 revisions

Wikipedia SQL dump parser is a simple and (currently) far from complete library for reading SQL database dumps of Wikipedia (and other Wikimedia projects) from .NET 4.0, without the need to import them into a MySQL database.

Wikimedia offers database dumps of its projects (including Wikipedia) at http://download.wikimedia.org/backup-index.html. Some of them (e.g. the text of the articles) are available in XML, which is easy to read from .NET. Other dumps (such as category or image information) are in SQL format, which is meant to be imported into a MySQL database and queried there. Because I felt the round trip through the database was unnecessary, I created this library, which reads the dumps and makes them directly available to any .NET application. The library can download the needed dumps automatically or read ones you have already downloaded to disk.

Suppose you want to sum the sizes of all images on English Wikibooks using information from the 31 October 2010 dump. The following code does that:

using System;
using System.Linq;
using WpSqlDumpParser.EntityCollections;
using WpSqlDumpParser.IO;

namespace TotalImageSizeExample
{
	class Program
	{
		static void Main()
		{
			// set the directory where dumps will be downloaded to and/or read from
			// (if not set, the current directory will be used)
			CachingStream.CachePath = @"F:\Wikipedia dumps";

			// get the collection of all images in the dump
			// images is of type IEnumerable<Image> and is loaded lazily
			var images = Images.Instance.Get("enwikibooks", new DateTime(2010, 10, 31));

			// use LINQ to sum the size of all images
			var totalSize = images.Select(i => i.Size).Sum();

			// write the result to the console
			Console.WriteLine(totalSize);
		}
	}
}

Access to the dumps is provided through classes in the WpSqlDumpParser.EntityCollections namespace. Currently they are:

  • [[CategoryLinks|https://github.com/svick/Wikipedia-SQl-dump-parser/blob/master/Wikipedia%20SQL%20dump%20parser/EntityCollections/CategoryLinks.cs]]
  • [[ImageLinks|https://github.com/svick/Wikipedia-SQl-dump-parser/blob/master/Wikipedia%20SQL%20dump%20parser/EntityCollections/ImageLinks.cs]]
  • [[Images|https://github.com/svick/Wikipedia-SQl-dump-parser/blob/master/Wikipedia%20SQL%20dump%20parser/EntityCollections/Images.cs]]
  • [[Pages|https://github.com/svick/Wikipedia-SQl-dump-parser/blob/master/Wikipedia%20SQL%20dump%20parser/EntityCollections/Pages.cs]]
  • [[PageLinks|https://github.com/svick/Wikipedia-SQl-dump-parser/blob/master/Wikipedia%20SQL%20dump%20parser/EntityCollections/PageLinks.cs]]

They all derive from the [[Dump<T>|https://github.com/svick/Wikipedia-SQl-dump-parser/blob/master/Wikipedia%20SQL%20dump%20parser/EntityCollections/Dump.cs]] class, and you can use its IEnumerable<T> Get(string wiki, DateTime date) method to get the collection from a specific dump. For example, for images:

Images.Instance.Get("enwikibooks", new DateTime(2010, 10, 31))

To see which properties are available on each class, use Visual Studio's IntelliSense or have a look at the source code.
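
Because Get returns a lazily loaded IEnumerable<T>, it can be worth computing several statistics in a single pass rather than enumerating the collection more than once. The following is a minimal sketch of that idea, not part of the library itself. It uses only the Images.Instance.Get call and the Size property shown earlier. It assumes that Size is an integral byte count and that each new enumeration would read the dump again; neither assumption has been checked against the source code.

```csharp
using System;
using WpSqlDumpParser.EntityCollections;

namespace ImageStatsExample
{
	class Program
	{
		static void Main()
		{
			// same call as in the example above; the result is a lazy IEnumerable<Image>
			var images = Images.Instance.Get("enwikibooks", new DateTime(2010, 10, 31));

			// accumulate the count and the total size in one pass over the dump
			// (assumption: Size is an integral number of bytes;
			// Convert.ToInt64 widens it so the running total does not overflow an int)
			long count = 0;
			long totalSize = 0;
			foreach (var image in images)
			{
				count++;
				totalSize += Convert.ToInt64(image.Size);
			}

			// write the results to the console
			Console.WriteLine("Images: {0}", count);
			Console.WriteLine("Total size: {0} bytes", totalSize);
			if (count > 0)
				Console.WriteLine("Average size: {0} bytes", totalSize / count);
		}
	}
}
```

A plain foreach loop is used here instead of calling Count() and Sum() separately, since each of those LINQ calls would enumerate the collection on its own.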
