A minimalist realtime full-text search index
C Ruby C++
Permalink
Failed to load latest commit information.
build
integration-tests ignore unknown output in integration eval script Jun 9, 2012
ruby bump version Jun 9, 2012
www bump version Jun 9, 2012
.gitignore update .gitignore Apr 6, 2012
COPYING add COPYING Mar 15, 2012
Makefile
README bump version Jun 9, 2012
RELEASE-SCRIPT
batch-run-queries.c initial checkin Feb 9, 2011
benchmark-queries.c add benchmark-queries binary Mar 30, 2012
defaults.h initial checkin Feb 9, 2011
dump.c wire in postings list headers Jun 9, 2012
entry.c ignore tokens larger than 50 characters Apr 11, 2012
entry.h move posarray into a generic resizable array Apr 6, 2012
error.c distinguish between different error types Jun 9, 2012
error.h
file-indexer.c initial checkin Feb 9, 2011
index.c dump more index info in dump.c Jun 10, 2012
index.h
interactive.c add snippeting to interactive.c Apr 11, 2012
khash.h
lock.c allow multiple concurrent writers Apr 1, 2012
lock.h minor fixup in lock.h Apr 6, 2012
make-queries.c initial checkin Feb 9, 2011
mbox-indexer.c improve mbox parsing Apr 11, 2012
mmap-obj.c fix up some debug macro type specifiers (sigh) Jun 9, 2012
mmap-obj.h check mmap_objs for size changes before all reads Mar 23, 2012
query-parser.c
query-parser.h initial checkin Feb 9, 2011
query-parser.lex
query-parser.y implement EVERY selector, specified as '*' in queries Jun 19, 2011
query.c bugfix: don't produce invalid substitutions for labels Mar 15, 2012
query.h
rarray.h
search.c wire in postings list headers Jun 9, 2012
search.h initial checkin Feb 9, 2011
segment.c dump segment version in dump.c Jun 10, 2012
segment.h use postings header counts for single-term count operations Jun 9, 2012
snippeter.c wire in postings list headers Jun 9, 2012
snippeter.h add snippeting Apr 6, 2012
stringmap.c breaking change to stringmap Mar 27, 2012
stringmap.h breaking change to stringmap Mar 27, 2012
stringpool.c make everything compile under os x Feb 9, 2011
stringpool.h initial checkin Feb 9, 2011
termhash.c wire in postings list headers Jun 9, 2012
termhash.h wire in postings list headers Jun 9, 2012
test-labels.c change TEST_ macros to be more verbose Jun 9, 2012
test-queries.c change TEST_ macros to be more verbose Jun 9, 2012
test-search.c change TEST_ macros to be more verbose Jun 9, 2012
test-segment.c change TEST_ macros to be more verbose Jun 9, 2012
test-snippets.c
test-stringmap.c change TEST_ macros to be more verbose Jun 9, 2012
test-stringpool.c change TEST_ macros to be more verbose Jun 9, 2012
test-termhash.c wire in postings list headers Jun 9, 2012
test-tokenizer.c
test.h change TEST_ macros to be more verbose Jun 9, 2012
timer.h
tokenizer.lex bugfix in tokenization: newlines in the middle of words Apr 9, 2012
whistlepig.h add snippeting Apr 6, 2012

README

= Whistlepig

Whistlepig is a minimalist realtime full-text search index. Its goal is
to be as small and maintainable as possible, while still remaining
useful, performant and scalable to large corpora. If you want realtime
full-text search without the frills, Whistlepig may be for you.

Whistlepig is written in ANSI C99. It currently provides a C API and Ruby
bindings.

Latest version: 0.12, released 2012-06-09.
        Status: beta
          News: http://all-thing.net/label/whistlepig/
      Homepage: http://masanjin.net/whistlepig/
   Bug reports: http://github.com/wmorgan/whistlepig/issues

= Getting it

       Tarball:  http://masanjin.net/whistlepig/whistlepig-0.12.tar.gz
       Rubygem:  gem install whistlepig
           Git:  git clone git://github.com/wmorgan/whistlepig.git

= Realtime search

Roughly speaking, realtime search means:
- documents are available to to queries immediately after indexing, without any
  reindexing or index merging;
- later documents are more important than earlier documents.

Whistlepig takes these principles at face value.
- It only returns documents in the reverse (LIFO) order to which they were
  added, and performs no ranking, reordering, or scoring.
- It only supports incremental indexing. There is no notion of batch indexing
  or index merging.
- It does not support document deletion or modification (except in the
  special case of labels; see below).
- It only supports in-memory indexes.

Features that Whistlepig does provide:
- Incremental indexing. Updates to the index are immediately available to
  readers.
- Fielded terms with arbitrary fields.
- A full query language and parser with conjunctions, disjunctions, phrases,
  negations, grouping, and nesting.
- Labels: arbitrary tokens which can be added to and removed from documents
  at any point, and incorporated into search queries.
- Early query termination and resumable queries.
- A tiny, < 3 KLOC ANSI C99 implementation.

== Benchmarks

On my not-particularly-new Linux desktop, I can index 8.5 MB/s of text
data per process, including some minor parsing.

Index sizes are roughly 50% of the original corpus size, e.g. the 1.4gb
Enron email corpus (http://cs.cmu.edu/~enron/) is 753mb in the index.

Query performance is entirely dependent on the queries and the index
size. Run the benchmark-queries to see some examples.

== Synopsis (using Ruby bindings)

  require 'rubygems'
  require 'whistlepig'

  include Whistlepig

  index = Index.new "index"

  entry1 = Entry.new
  entry1.add_string "body", "hello there bob"
  docid1 = index.add_entry entry1              # => 1

  entry2 = Entry.new
  entry2.add_string "body", "goodbye bob"
  docid2 = index.add_entry entry2              # => 2

  q1 = Query.new "body", "bob"
  results1 = index.search q1                   # => [2, 1]

  q2 = q1.and Query.new("body", "hello")
  results2 = index.search q2                   # => [1]

  index.add_label docid2, "funny"

  q3 = Query.new "body", "bob ~funny"
  results3 = index.search q3                   # => [2]

  entry3 = Entry.new
  entry3.add_string "body", "hello joe"
  entry3.add_string "subject", "what do you know?"
  docid3 = index.add_entry entry3              # => 3

  q4 = Query.new "body", "subject:know hello"
  results4 = index.search q4                   # => [3]

== Concurrency

Whistlepig supports multi-process concurrency. Multiple reader and
writer processes can access the same index without mangling data.

Internally, Whistlepig uses pthread read-write locks to synchronize
readers and writers. This allows multiple concurrent readers but only a
single writer.

While this locking approach guarantees index correctness, it decreases
read and write performance when one or more writers exist. Systems with
high write loads may benefit from sharding documents across independent
indexes rather than sending everything to the same index.

== Design tradeoffs

I have generally erred on the side of maintainable code at the expense
of speed. Simpler implementations have been preferred over more complex,
faster versions. If you ever have to modify Whistlepig to suit your
needs, you will appreciate this.

== Bug reports

Please file bugs here: https://github.com/wmorgan/whistlepig/issues
Please send comments to: wmorgan-whistlepig-readme@masanjin.net.