Issue tracking now available at lighthouseapp.
- Install Maven 2.
- Check out the repository.
- Type 'mvn'.
- Configure CouchDB (see below).
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
By default all attributes are indexed. You can customize this process by adding a design document at _design/lucene. You must supply an attribute called "transform" which takes and returns a document.
{ "transform":"function(doc) { return doc; }" }
function(doc) { return doc; }
function(doc) { return null; }
function(doc) { delete doc.social_security_number; delete doc.date_of_birth; return doc; }
function(doc) {
  function DumpObject(obj) {
    var result = "";
    for (var property in obj) {
      var value = obj[property];
      if (typeof value == 'object') {
        result += DumpObject(value) + " ";
      } else {
        result += value + " ";
      }
    }
    return result;
  }
  doc.all = DumpObject(doc);
  return doc;
}
The function is evaluated by Rhino. You may add, modify and remove any attributes. Additionally, returning null will exclude the document from indexing entirely.
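To try a transform before installing it, the sketch below compiles a "transform" string and applies it to a sample document. It assumes a plain JavaScript engine such as Node.js rather than Rhino. The names compileTransform and applyTransform and the sample document are hypothetical, and this is not how couchdb-lucene itself loads the function.

```javascript
// Hypothetical design document, shaped like the _design/lucene example above.
const designDoc = {
  transform:
    "function(doc) { delete doc.social_security_number; delete doc.date_of_birth; return doc; }"
};

// Compile the transform source into a callable function.
// Assumption: the string holds a single function expression.
function compileTransform(source) {
  return new Function("return (" + source + ");")();
}

// Apply the transform to a deep copy so the caller's document is untouched.
// Returns the transformed document, or null if it should not be indexed.
function applyTransform(design, doc) {
  const transform = compileTransform(design.transform);
  const copy = JSON.parse(JSON.stringify(doc));
  return transform(copy);
}

// Made-up sample data for illustration only.
const sample = {
  _id: "person-1",
  name: "Alice",
  social_security_number: "000-00-0000",
  date_of_birth: "1970-01-01"
};

const indexed = applyTransform(designDoc, sample);
console.log(JSON.stringify(indexed));
```

Working on a copy keeps the original document available for comparison, which makes it easy to see exactly which attributes the transform removed or added.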
couchdb-lucene uses Apache Tika to index attachments of the following types, provided the correct content_type is set in CouchDB:
- Excel spreadsheets (application/vnd.ms-excel)
- Word documents (application/msword)
- Powerpoint presentations (application/vnd.ms-powerpoint)
- Visio (application/vnd.visio)
- Outlook (application/vnd.ms-outlook)
- XML (application/xml)
- HTML (text/html)
- Images (image/*)
- Java class files
- Java jar archives
- MP3 (audio/mp3)
- OpenDocument (application/vnd.oasis.opendocument.*)
- Plain text (text/plain)
- PDF (application/pdf)
- RTF (application/rtf)
You can perform all types of queries using Lucene's default query syntax. The following parameters can be passed for more sophisticated searches:
- q: the query to run (e.g., subject:hello).
- sort: the comma-separated fields to sort on. Prefix a field with / for ascending order or \ for descending order (ascending is the default if not specified).
- limit: the maximum number of results to return.
- skip: the number of results to skip.
- include_docs: whether to include the source docs.
- stale=ok: if set, couchdb-lucene may skip refreshing the index before searching. Searches may be faster because Lucene caches important data (especially for sorting). A query without stale=ok uses the latest data committed to the index.
- debug: if false, a normal application/json response with results is returned; if true, a pretty-printed HTML blob is returned instead.
- rewrite: (EXPERT) if true, returns a JSON response with a rewritten query and term frequencies. This allows correct distributed scoring when combining the results from multiple nodes.
All parameters except 'q' are optional.
- _id: The _id of the document.
- _db: The source database of the document.
- _body: Any text extracted from any attachment.
All Dublin Core attributes are indexed and stored if detected in the attachment. Descriptions of the fields come from the Tika javadocs.
- dc.contributor: An entity responsible for making contributions to the content of the resource.
- dc.coverage: The extent or scope of the content of the resource.
- dc.creator: An entity primarily responsible for making the content of the resource.
- dc.date: A date associated with an event in the life cycle of the resource.
- dc.description: An account of the content of the resource.
- dc.format: Typically, Format may include the media-type or dimensions of the resource.
- dc.identifier: Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.
- dc.language: A language of the intellectual content of the resource.
- dc.modified: Date on which the resource was changed.
- dc.publisher: An entity responsible for making the resource available.
- dc.relation: A reference to a related resource.
- dc.rights: Information about rights held in and over the resource.
- dc.source: A reference to a resource from which the present resource is derived.
- dc.subject: The topic of the content of the resource.
- dc.title: A name given to the resource.
- dc.type: The nature or genre of the content of the resource.
http://localhost:5984/dbname/_fti?q=field_name:value
http://localhost:5984/dbname/_fti?q=field_name:value&sort=other_field
http://localhost:5984/dbname/_fti?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
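Queries containing spaces, brackets, or a \ sort prefix need percent-encoding when they are sent from a program rather than typed into a browser. Here is a minimal sketch of a URL builder for the _fti handler. The function name buildSearchUrl is hypothetical, and the sketch assumes the handler accepts standard percent-encoded query strings.

```javascript
// Build a URL for the _fti handler, percent-encoding each parameter value.
// baseUrl and dbName are supplied by the caller; params holds the query
// parameters (q, sort, limit, skip, include_docs, stale, debug, rewrite).
function buildSearchUrl(baseUrl, dbName, params) {
  if (!params || typeof params.q !== "string" || params.q === "") {
    throw new Error("the 'q' parameter is required");
  }
  const pairs = Object.keys(params)
    .filter((key) => params[key] !== undefined && params[key] !== null)
    .map((key) => encodeURIComponent(key) + "=" + encodeURIComponent(String(params[key])));
  // Strip trailing slashes from the base so the path is not doubled up.
  return baseUrl.replace(/\/+$/, "") + "/" + encodeURIComponent(dbName) + "/_fti?" + pairs.join("&");
}

const url = buildSearchUrl("http://localhost:5984", "dbname", {
  q: "body:document AND customer:[A TO C]",
  sort: "billing_size",
  debug: true
});
console.log(url);
```

Requiring q up front mirrors the rule that every other parameter is optional, so a missing query fails in the caller instead of producing a request with no search terms.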
Here's an example of a JSON response without sorting:
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "search_duration": 518,
  "fetch_duration": 4,
  "rows": [
    { "_id": "hain-m-all_documents-257.", "score": 1.601625680923462 },
    { "_id": "hain-m-notes_inbox-257.", "score": 1.601625680923462 }
  ]
}
And the same with sorting:
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "search_duration": 660,
  "fetch_duration": 4,
  "sort_order": [
    { "field": "source", "reverse": false, "type": "string" },
    { "reverse": false, "type": "doc" }
  ],
  "rows": [
    { "_id": "shankman-j-inbox-105.", "score": 0.6131107211112976, "sort_order": [ "enron", 6 ] },
    { "_id": "shankman-j-inbox-8.", "score": 0.7492915391921997, "sort_order": [ "enron", 7 ] },
    { "_id": "shankman-j-inbox-30.", "score": 0.507369875907898, "sort_order": [ "enron", 8 ] }
  ]
}
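The skip, limit and total_rows fields in these responses are enough to page through a large result set. The sketch below computes the skip value for the next request. It assumes total_rows is the total number of matches and that skip echoes the value sent with the request, which the field names suggest but the examples do not confirm. The function name nextSkip is hypothetical.

```javascript
// Given a parsed search response, return the skip value for the next page,
// or null when there are no further results to fetch.
function nextSkip(response) {
  const fetched = response.rows.length;
  const next = response.skip + fetched;
  if (fetched === 0 || next >= response.total_rows) {
    return null;
  }
  return next;
}

// Trimmed copy of the unsorted example response.
const firstPage = {
  skip: 0,
  limit: 2,
  total_rows: 176852,
  rows: [
    { _id: "hain-m-all_documents-257.", score: 1.601625680923462 },
    { _id: "hain-m-notes_inbox-257.", score: 1.601625680923462 }
  ]
};

console.log(nextSkip(firstPage)); // 2
```

Advancing by the number of rows actually returned, rather than by limit, keeps the paging correct even if a response contains fewer rows than requested.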
Calling couchdb-lucene without arguments returns a JSON object with information about the index.
http://127.0.0.1:5984/enron/_fti
returns:
{"doc_count":517350,"doc_del_count":1,"disk_size":318543045}
To develop "live", type "mvn dependency:unpack-dependencies" and change the external line to something like this:
fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
/path/to/couchdb-lucene/target/dependency com.github.rnewson.couchdb.lucene.Main
You will need to restart CouchDB whenever you change the couchdb-lucene source code, but this is very fast.
couchdb-lucene respects several system properties:
- couchdb.url: the URL to contact CouchDB with (default is "http://localhost:5984").
- couchdb.lucene.dir: the path to the Lucene indexes (by default, a directory called 'lucene' relative to CouchDB's current working directory).
You can override these properties like this:
fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
-cp /home/rnewson/Source/couchdb-lucene/target/classes:\
/home/rnewson/Source/couchdb-lucene/target/dependency \
com.github.rnewson.couchdb.lucene.Main
If you put couchdb behind an authenticating proxy you can still configure couchdb-lucene to pull from it by specifying additional system properties. Currently only Basic authentication is supported.
- couchdb.user: the user to authenticate as.
- couchdb.password: the password to authenticate with.
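For reference when configuring or debugging the proxy, Basic authentication sends the user and password as a base64-encoded "user:password" pair in the Authorization header. The sketch below shows the general scheme under Node.js. The function name basicAuthHeader is hypothetical, and it illustrates the standard header format rather than couchdb-lucene's own code.

```javascript
// Build the value of an HTTP Authorization header for Basic authentication.
// The credentials are joined with a colon and base64-encoded.
function basicAuthHeader(user, password) {
  const token = Buffer.from(user + ":" + password, "utf8").toString("base64");
  return "Basic " + token;
}

// Made-up credentials for illustration only.
console.log(basicAuthHeader("Aladdin", "open sesame"));
```

Because base64 is an encoding and not encryption, the credentials are readable by anyone who can see the request, so the connection to the proxy should be protected by other means such as TLS.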