yuryshulaev/wiki-import

wiki-import — Wikipedia AST to LevelDB

Parses pages from a Wikipedia XML dump into JSON AST using wikiparse and saves them into a LevelDB database.

Installation

npm i -g wiki-import

Usage

Download and unpack a *-pages-articles.xml.bz2 archive of your choosing from Wikimedia Downloads. Make sure you have enough free space for the database and run:

wiki-to-leveldb <pages-articles.xml[.bz2]> <dbPath> [workerCount = cpuCount] [--with-source]

For example:

lbzip2 -kd simplewiki-20220101-pages-articles.xml.bz2
wiki-to-leveldb simplewiki-20220101-pages-articles.xml simplewiki

The first step is optional. To save storage space at the cost of time, you can pass the *.xml.bz2 archive directly to wiki-to-leveldb, which decompresses it as a stream. This is about 1.5 times slower than unpacking first with lbzip2, which is highly parallel.
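The two invocation styles (unpacked dump versus compressed archive) can be sketched as a small dry-run helper. Note that `import_cmd` is a hypothetical function and not part of wiki-import: it only prints the command it would run, following the synopsis above, and the worker count of 4 is an arbitrary example value.

```shell
# Hypothetical helper (not part of wiki-import): prints the wiki-to-leveldb
# command for a dump instead of running it, so it can be reviewed first.
# Arguments: <dump> <dbPath> [workerCount]
import_cmd() {
  dump=$1
  db=$2
  workers=${3:-}
  if [ -n "$workers" ]; then
    echo "wiki-to-leveldb $dump $db $workers"
  else
    # Per the synopsis, workerCount defaults to the CPU count when omitted.
    echo "wiki-to-leveldb $dump $db"
  fi
}

# Compressed archive passed directly (streaming decompression), 4 workers.
import_cmd simplewiki-20220101-pages-articles.xml.bz2 simplewiki 4

# Unpacked dump, default worker count.
import_cmd simplewiki-20220101-pages-articles.xml simplewiki
```

Once the printed command looks right, it can be pasted into the shell to run the actual import.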

It takes about 1 minute to import simplewiki (1 GB dump) on tmpfs on a 10-core CPU.
