A first attempt at high-level documentation as requested by Brial.
Casey, you have the baton now.
1 parent 6a7c85a, commit d650124
ZCTextIndex
===========

This product is a replacement for the full text indexing facility of
ZCatalog. Specifically, it is an alternative to
PluginIndexes/TextIndex.

Advantages of using ZCTextIndex over TextIndex:

- A new query language, supporting both explicit and implicit Boolean
  operators, parentheses, globbing, and phrase searching. Apart from
  explicit operators and globbing, the syntax is roughly the same as
  that popularized by Google.

- A more refined scoring algorithm, resulting in better selectiveness:
  it's much more likely that you'll find the document you are looking
  for among the first few highest-ranked results.

- Actually, ZCTextIndex gives you a choice of two scoring algorithms
  from recent literature: the Cosine ranking from the Managing
  Gigabytes book, and Okapi from more recent research papers. Okapi
  usually does better, so it is the default (but your mileage may
  vary).

- A redesigned Lexicon, using a pipeline architecture to split the
  input text into words. This makes it possible to mix and match
  pipeline components, e.g. you can choose between an HTML-aware
  splitter and a plain text splitter, and additional components can be
  added to the pipeline for case folding, stopword removal, and other
  features. Enough example pipeline components are provided to get
  you started, and it is very easy to write new components.
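
The pipeline idea from the last item can be sketched in a few lines of
standalone Python. This is an illustration only, not the code in
Lexicon.py: the class names, the process() method, and the
run_pipeline() helper are assumptions made for this example.

```python
import re


class Splitter:
    """Hypothetical plain-text splitter: strings in, words out."""

    _word = re.compile(r"\w+")

    def process(self, items):
        words = []
        for text in items:
            words.extend(self._word.findall(text))
        return words


class CaseNormalizer:
    """Hypothetical case-folding element."""

    def process(self, items):
        return [word.lower() for word in items]


class StopWordRemover:
    """Hypothetical stopword element with a tiny built-in word list."""

    def __init__(self, stopwords=("a", "and", "of", "the", "to")):
        self._stop = frozenset(stopwords)

    def process(self, items):
        return [word for word in items if word not in self._stop]


def run_pipeline(text, elements):
    """Feed the text through each element in order."""
    items = [text]
    for element in elements:
        items = element.process(items)
    return items


words = run_pipeline(
    "The Art of Indexing",
    [Splitter(), CaseNormalizer(), StopWordRemover()],
)
print(words)  # ['art', 'indexing']
```

The order of the elements matters in this sketch: placing the
stopword remover before the case folder would let "The" through,
because the stopword list is lower case.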

Performance is roughly the same as for TextIndex, and we're expecting
to make tweaks to the code that will make it faster.

This code can be used outside of Zope too; all you need is a
standalone ZODB installation to make your index persistent. Several
functional test programs in the tests subdirectory show how to do
this, for example mhindex.py, mailtest.py, indexhtml.py, and
queryhtml.py.


How to use as a Zope Product
----------------------------

XXX Casey, please write this.


Code overview
-------------

ZMI interface:

__init__.py         ZMI publishing code
ZCTextIndex.py      pluggable index class
PipelineFactory.py  ZMI helper to configure the pipeline

Indexing:

BaseIndex.py        common code for Cosine and Okapi index
CosineIndex.py      Cosine index implementation
OkapiIndex.py       Okapi index implementation
okascore.c          C implementation of scoring loop
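
For orientation, the general shape of an Okapi-style score (BM25, as
it is usually presented) can be sketched as below. This is not the
code in OkapiIndex.py or okascore.c: the function names, the constants
K1 and B, and the particular inverse document frequency formula are
assumptions chosen for the example.

```python
import math

K1 = 1.2   # term-frequency saturation; a conventional value, assumed here
B = 0.75   # strength of document-length normalization; also assumed


def idf(num_docs, docs_with_term):
    """One common inverse document frequency variant (an assumption)."""
    return math.log(1.0 + num_docs / docs_with_term)


def bm25_score(query_terms, doc_terms, all_docs):
    """Score one document (a list of words) against the query terms."""
    num_docs = len(all_docs)
    avg_len = sum(len(doc) for doc in all_docs) / num_docs
    # Longer-than-average documents get a larger normalization term.
    norm = K1 * (1.0 - B + B * len(doc_terms) / avg_len)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        df = sum(1 for doc in all_docs if term in doc)
        score += idf(num_docs, df) * tf * (K1 + 1.0) / (tf + norm)
    return score


docs = [
    ["zope", "text", "index"],
    ["text", "search", "text", "ranking"],
    ["persistent", "object", "database"],
]
scores = [bm25_score(["text", "ranking"], doc, docs) for doc in docs]
print(scores)
```

In this toy collection the second document scores highest because it
contains both query terms, and the third scores zero because it
contains neither.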

Lexicon:

Lexicon.py          lexicon and sample pipeline elements
HTMLSplitter.py     HTML-aware splitter
StopDict.py         list of English stopwords
stopper.c           C implementation of stop word remover

Query parser:

QueryParser.py      parse a query into a parse tree
ParseTree.py        parse tree node classes and exceptions
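
To illustrate what "parse a query into a parse tree" means, here is a
small standalone sketch of a recursive-descent parser for explicit
AND/OR/NOT, implicit AND, and parentheses. It does not reproduce the
grammar in QueryParser.py or the node classes in ParseTree.py, and it
leaves out globbing and phrases; the tuple-based tree and the parse()
function are assumptions made for the example.

```python
import re

_TOKEN = re.compile(r"\(|\)|[^\s()]+")
_KEYWORDS = {"AND", "OR", "NOT"}  # only upper-case spellings are operators


def parse(query):
    """Return a nested-tuple parse tree for the query string."""
    tokens = _TOKEN.findall(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        nodes = [parse_and()]
        while peek() == "OR":
            take()
            nodes.append(parse_and())
        return nodes[0] if len(nodes) == 1 else ("OR", nodes)

    def parse_and():
        nodes = [parse_not()]
        # Adjacent terms without an operator are an implicit AND.
        while peek() is not None and peek() not in ("OR", ")"):
            if peek() == "AND":
                take()
            nodes.append(parse_not())
        return nodes[0] if len(nodes) == 1 else ("AND", nodes)

    def parse_not():
        if peek() == "NOT":
            take()
            return ("NOT", parse_atom())
        return parse_atom()

    def parse_atom():
        tok = peek()
        if tok is None or tok == ")" or tok in _KEYWORDS:
            raise ValueError("unexpected token: %r" % (tok,))
        take()
        if tok == "(":
            node = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return node
        return ("ATOM", tok.lower())

    tree = parse_or()
    if peek() is not None:
        raise ValueError("unexpected token: %r" % (peek(),))
    return tree


print(parse("foo OR (bar AND NOT baz)"))
```

With this sketch, "foo bar" and "foo AND bar" produce the same tree,
which is the sense in which the AND operator is implicit.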

Utilities:

NBest.py            find N best items in a list without sorting
SetOps.py           efficient weighted set operations
WidCode.py          list compression allowing phrase searches
RiceCode.py         list compression code (as yet unused)
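
The idea behind finding the N best items without sorting the whole
list can be sketched with a bounded heap from the standard library.
This is not the implementation in NBest.py; the n_best() function and
its tie-breaking rule are assumptions made for the example.

```python
import heapq


def n_best(scored_items, n):
    """Return the n highest-scoring (item, score) pairs, best first.

    Only a heap of at most n entries is kept, so the full input is
    never sorted. Ties go to the item that appeared first.
    """
    if n <= 0:
        return []
    heap = []  # min-heap; heap[0] is the weakest entry kept so far
    for seq, (item, score) in enumerate(scored_items):
        entry = (score, -seq, item)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)
    # Sorting here touches at most n entries, not the whole input.
    return [(item, score) for score, _, item in sorted(heap, reverse=True)]


results = [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.9)]
print(n_best(results, 2))  # [('b', 0.9), ('d', 0.9)]
```

For a result set of size M this does O(M log N) work, which is the
point when N is a small page of top hits and M is large.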

Interfaces (these speak for themselves):

IIndex.py
ILexicon.py
INBest.py
IPipelineElement.py
IPipelineElementFactory.py
IQueryParseTree.py
IQueryParser.py
ISplitter.py

Subdirectories:

tests               unit tests and some functional tests/examples
dtml                ZMI templates
www                 images used in the ZMI

Tests
-----

Functional tests and helpers:

hs-tool.py          helper to interpret hotshot profiler logs
indexhtml.py        index a collection of HTML files
mailtest.py         index and query a Unix mailbox file
mhindex.py          index and query a set of MH folders
python.txt          output from benchmark queries
queryhtml.py        query an index created by indexhtml.py
wordstats.py        dump statistics about each indexed word

Unit tests (these speak for themselves):

testIndex.py
testLexicon.py
testNBest.py
testPipelineFactory.py
testQueryEngine.py
testQueryParser.py
testSetOps.py
testStopper.py
testZCTextIndex.py