/
README
34 lines (24 loc) · 2.33 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
The main purpose of BzReader is to allow the offline browsing of the Wikipedia.
BzReader project page is http://code.google.com/p/bzreader/
This fork adds the following functionalities:
1. Upgrade to Lucene.Net 2.9.2
2. Proper search for Hebrew dumps, using HebMorph (http://www.code972.com/blog/hebmorph)
Released under GPLv2
Compiling from source
---------------------
1. Make sure you have a local copy of Lucene.Net (DLL or sources). Their SVN repository is at https://svn.apache.org/repos/asf/lucene/lucene.net
2. If you are interested in the Hebrew integration, you may want to compile HebMorph from source as well instead of using the precompiled option: http://github.com/synhershko/HebMorph
Usage
-----
The typical usage would be:
1. Download one or several .xml.bz2 dumps of Wikipedia topics from http://download.wikipedia.org/backup-index.html
2. You will need the files named pages-articles.xml.bz2 for your language. Alternatively, you may use any database dumps which follow Wikipedia XML format - Wikibooks, Wikitravel etc.
3. Install BzReader
4. To enable proper Hebrew searches, point to a path where hspell files reside. Go to Options > Set hspell path, and select the hspell-data-files folder shipping with the BzReader. Without doing so, a simple Hebrew analyzer will be used instead of a morphological one.
5. Open the Wikipedia dump in BzReader (you can have multiple dumps opened at the same time, all of them will be searched together)
6. Let BzReader create the index for fast retrieval of topics. This might be quite lengthy for big dumps. It takes about 4 hours on Core 2 Duo to index the current dump of all English topics (enwiki-20080312-pages-articles.xml.bz2, 3.5 Gb file). The BzReader index for this dump takes about 550 Mb.
7. Browse!
8. Enter your search terms into the search window on the left, wait for the result(s) and click on the topic. Try also something like "music*" to find anything that starts with "music".
Known issues
------------
Chinese/Japanese/Korean users - please note that indexing the dumps in your languages is currently too slow to be of real use. This is due to the complexity of the actual languages. Unfortunately, I don't really know the appropriate languages to actually fix the tokenizing code. Just as a guide - 2.7 Mb compressed jawikinews dump takes about 1.5 hours to index on Core 2 Duo.