This repository has been archived by the owner on May 13, 2020. It is now read-only.

A first attempt at high-level documentation as requested by Brial.
Casey, you have the baton now.
gvanrossum committed Jun 4, 2002
1 parent 6a7c85a commit d650124
Showing 1 changed file with 124 additions and 0 deletions.
124 changes: 124 additions & 0 deletions README.txt
@@ -0,0 +1,124 @@
ZCTextIndex
===========

This product is a replacement for the full text indexing facility of
ZCatalog. Specifically, it is an alternative to
PluginIndexes/TextIndex.

Advantages of using ZCTextIndex over TextIndex:

- A new query language, supporting both explicit and implicit Boolean
operators, parentheses, globbing, and phrase searching. Apart from
explicit operators and globbing, the syntax is roughly the same as
that popularized by Google.

- A more refined scoring algorithm, resulting in better selectiveness:
it's much more likely that you'll find the document you are looking
for among the first few highest-ranked results.

- Actually, ZCTextIndex gives you a choice of two scoring algorithms
from recent literature: the Cosine ranking from the Managing
Gigabytes book, and Okapi from more recent research papers. Okapi
usually does better, so it is the default (but your mileage may
vary).

- A redesigned Lexicon, using a pipeline architecture to split the
input text into words. This makes it possible to mix and match
pipeline components, e.g. you can choose between an HTML-aware
splitter and a plain text splitter, and additional components can be
added to the pipeline for case folding, stopword removal, and other
features. Enough example pipeline components are provided to get
you started, and it is very easy to write new components.
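
As a rough illustration of the pipeline idea in the last item above,
here is a minimal, self-contained sketch. It does not import the real
classes from Lexicon.py; the class names, the process() method, the
stopword list, and the run_pipeline() helper are all assumptions made
for this example. Each element takes a list of strings and returns a
new list, so elements can be chained in any order.

```python
import re


class Splitter:
    """Hypothetical plain-text splitter: turns strings into word tokens."""

    _word = re.compile(r"\w+")

    def process(self, items):
        result = []
        for text in items:
            result.extend(self._word.findall(text))
        return result


class CaseNormalizer:
    """Hypothetical case-folding element: lowercases every token."""

    def process(self, items):
        return [word.lower() for word in items]


class StopWordRemover:
    """Hypothetical stopword element: drops tokens found in a small set."""

    def __init__(self, stopwords=("a", "an", "and", "of", "the")):
        self._stop = frozenset(stopwords)

    def process(self, items):
        return [word for word in items if word not in self._stop]


def run_pipeline(text, *elements):
    """Feed the text through each element in order (illustrative helper)."""
    items = [text]
    for element in elements:
        items = element.process(items)
    return items


words = run_pipeline("The Art of Indexing and Searching",
                     Splitter(), CaseNormalizer(), StopWordRemover())
print(words)
```

Swapping the splitter for an HTML-aware one, or dropping the stopword
element, only changes the arguments passed to run_pipeline().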

Performance is roughly the same as for TextIndex, and we're expecting
to make tweaks to the code that will make it faster.

This code can be used outside of Zope too; all you need is a
standalone ZODB installation to make your index persistent. Several
functional test programs in the tests subdirectory show how to do
this, for example mhindex.py, mailtest.py, indexhtml.py, and
queryhtml.py.


How to use as a Zope Product
----------------------------

XXX Casey, please write this.


Code overview
-------------

ZMI interface:

__init__.py         ZMI publishing code
ZCTextIndex.py      pluggable index class
PipelineFactory.py  ZMI helper to configure the pipeline

Indexing:

BaseIndex.py        common code for Cosine and Okapi index
CosineIndex.py      Cosine index implementation
OkapiIndex.py       Okapi index implementation
okascore.c          C implementation of scoring loop
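
To give a feel for the kind of per-term computation an Okapi scoring
loop performs, here is a sketch of one common textbook form of Okapi
BM25. It is not copied from OkapiIndex.py or okascore.c; the constants
K1 and B, the IDF variant, and the function names are assumptions
chosen for illustration, and the real code may differ.

```python
import math

K1 = 1.2   # assumed: controls how quickly term frequency saturates
B = 0.75   # assumed: strength of document-length normalization


def inverse_doc_frequency(num_docs, docs_with_term):
    """One common IDF variant; rarer terms get a larger weight."""
    return math.log(1.0 + num_docs / docs_with_term)


def bm25_term_score(tf, doc_len, mean_doc_len, idf):
    """Score contribution of one query term in one document.

    tf            occurrences of the term in the document
    doc_len       length of the document in words
    mean_doc_len  average document length in the collection
    idf           inverse document frequency of the term
    """
    length_norm = K1 * ((1.0 - B) + B * doc_len / mean_doc_len)
    return idf * tf * (K1 + 1.0) / (tf + length_norm)


idf = inverse_doc_frequency(num_docs=1000, docs_with_term=10)
short_doc = bm25_term_score(tf=3, doc_len=50, mean_doc_len=100, idf=idf)
long_doc = bm25_term_score(tf=3, doc_len=400, mean_doc_len=100, idf=idf)
print(short_doc > long_doc)
```

With the same term count, the shorter document scores higher, which is
the length normalization at work. A document's total score for a query
is the sum of these contributions over the query terms.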

Lexicon:

Lexicon.py          lexicon and sample pipeline elements
HTMLSplitter.py     HTML-aware splitter
StopDict.py         list of English stopwords
stopper.c           C implementation of stop word remover

Query parser:

QueryParser.py      parse a query into a parse tree
ParseTree.py        parse tree node classes and exceptions
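
The first step in parsing a query with Boolean operators, parentheses,
globs, and phrases is splitting it into tokens. The sketch below shows
one way to do that. It is not the QueryParser.py implementation: the
regular expression, the operator spellings (AND, OR, NOT, matched
case-insensitively), the glob characters, and the token kind names are
assumptions for this example.

```python
import re

# A quoted phrase, a single parenthesis, or a run of other characters.
_TOKEN = re.compile(r'"[^"]*"|[()]|[^\s()"]+')
_OPERATORS = {"AND", "OR", "NOT"}   # assumed operator spellings


def tokenize(query):
    """Split a query string into (kind, text) pairs."""
    tokens = []
    for text in _TOKEN.findall(query):
        if text in ("(", ")"):
            kind = "PAREN"
        elif text.upper() in _OPERATORS:
            kind = "OP"
        elif text.startswith('"'):
            kind, text = "PHRASE", text[1:-1]   # strip the quotes
        elif "*" in text or "?" in text:
            kind = "GLOB"
        else:
            kind = "TERM"
        tokens.append((kind, text))
    return tokens


for token in tokenize('zope AND (index* OR "full text")'):
    print(token)
```

A parser would then build a tree from these tokens, treating adjacent
terms with no operator between them as an implicit Boolean combination.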

Utilities:

NBest.py            find N best items in a list without sorting
SetOps.py           efficient weighted set operations
WidCode.py          list compression allowing phrase searches
RiceCode.py         list compression code (as yet unused)
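
The idea behind finding the N best items without sorting the whole
list can be sketched with the standard library's heapq module. This is
not the NBest.py API; the n_best() function and the (item, score) pair
layout are assumptions for illustration.

```python
import heapq


def n_best(scored_items, n):
    """Return the n highest-scoring (item, score) pairs, best first.

    heapq.nlargest keeps only n candidates at a time, so the full
    result list is never sorted.
    """
    return heapq.nlargest(n, scored_items, key=lambda pair: pair[1])


results = n_best(
    [("doc1", 0.2), ("doc2", 0.9), ("doc3", 0.5), ("doc4", 0.7)], 2)
print(results)
```

This matters for search results, where a query may match many
documents but only the first page of top-ranked hits is displayed.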

Interfaces (these speak for themselves):

IIndex.py
ILexicon.py
INBest.py
IPipelineElement.py
IPipelineElementFactory.py
IQueryParseTree.py
IQueryParser.py
ISplitter.py

Subdirectories:

tests               unit tests and some functional tests/examples
dtml                ZMI templates
www                 images used in the ZMI

Tests
-----

Functional tests and helpers:

hs-tool.py          helper to interpret hotshot profiler logs
indexhtml.py        index a collection of HTML files
mailtest.py         index and query a Unix mailbox file
mhindex.py          index and query a set of MH folders
python.txt          output from benchmark queries
queryhtml.py        query an index created by indexhtml.py
wordstats.py        dump statistics about each indexed word

Unit tests (these speak for themselves):

testIndex.py
testLexicon.py
testNBest.py
testPipelineFactory.py
testQueryEngine.py
testQueryParser.py
testSetOps.py
testStopper.py
testZCTextIndex.py
