<h1>Issue Tracking</h1>

Issue tracking at <a href="http://github.com/rnewson/couchdb-lucene/issues">github</a>.

<h1>System Requirements</h1>

Sun JDK 5 or higher is recommended.

Couchdb-lucene is known to be incompatible with some versions of OpenJDK because those versions include an earlier, incompatible version of the Rhino JavaScript library. The version in Ubuntu 8.10 (6b12-0ubuntu6.4) is known to work and it uses Rhino 1.7R1.

<h1>Build couchdb-lucene</h1>

<ol>
<li>Install Maven 2.
<li>Check out the repository.
<li>Type 'mvn'.
<li>Configure CouchDB (see below).
</ol>

<h1>Configure CouchDB</h1>

<pre>
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
</pre>

<h1>Indexing Strategy</h1>

<h2>Document Indexing</h2>

You must supply an index function to enable couchdb-lucene; by default, nothing is indexed.

You may add any number of index views in any number of design documents. All searches will be constrained to documents emitted by the index functions.

Here's a complete example of a design document with couchdb-lucene features:

<pre>
{
    "_id":"lucene",
    "views": {
        "normal_couch_view": {
            "map": "function(){}"
        }
    },
    "fulltext": {
        "by_subject": {
            "defaults": { "store":"yes" },
            "index":"function(doc) { var ret=new Document(); ret.add(doc.subject); return ret }"
        },
        "by_content": {
            "defaults": { "store":"no" },
            "index":"function(doc) { var ret=new Document(); ret.add(doc.content); return ret }"
        }
    }
}
</pre>

Here are some example URLs for the given design document:

<pre>
http://localhost:5984/database/_fti/lucene/by_subject?q=hello
http://localhost:5984/database/_fti/lucene/by_content?q=hello
</pre>

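The URL pattern above (database, then _fti, then the design document and index view names) can be assembled programmatically. The following is a minimal sketch, not part of couchdb-lucene; the helper name searchUrl and its argument order are assumptions made for illustration.

```javascript
// Hypothetical helper (not part of couchdb-lucene): builds a search URL of the
// form base/database/_fti/designDoc/view?query shown in the examples above.
function searchUrl(base, db, designDoc, view, params) {
  // Encode each path segment separately so names with special characters stay intact.
  var path = [db, "_fti", designDoc, view].map(function (segment) {
    return encodeURIComponent(segment);
  }).join("/");
  // Encode every query parameter name and value.
  var query = Object.keys(params).map(function (key) {
    return encodeURIComponent(key) + "=" + encodeURIComponent(params[key]);
  }).join("&");
  return base + "/" + path + "?" + query;
}

var url = searchUrl("http://localhost:5984", "database", "lucene", "by_subject", { q: "hello" });
console.log(url);
```

Encoding each part separately keeps queries that contain spaces or colons valid in a URL.
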
A fulltext object contains multiple index view declarations. An index view consists of:

<dl>
<dt>defaults</dt><dd>The default for numerous indexing options can be overridden here. A full list of options follows.</dd>
<dt>index</dt><dd>The indexing function itself, documented below.</dd>
</dl>

<h3>The Defaults Object</h3>

The following indexing options can be defaulted:

<table>
  <tr>
    <th>name</th>
    <th>description</th>
    <th>available options</th>
    <th>default</th>
  </tr>
  <tr>
    <th>field</th>
    <td>the field name to index under</td>
    <td>user-defined</td>
    <td>default</td>
  </tr>
  <tr>
    <th>store</th>
    <td>whether the data is stored. The value will be returned in the search result.</td>
    <td>yes, no</td>
    <td>no</td>
  </tr>
  <tr>
    <th>index</th>
    <td>whether (and how) the data is indexed</td>
    <td>analyzed, analyzed_no_norms, no, not_analyzed, not_analyzed_no_norms</td>
    <td>analyzed</td>
  </tr>
</table>

<h3>The Document class</h3>

You may construct a new Document instance with:

<pre>
var doc = new Document();
</pre>

Data may be added to this document with the add method, which takes an optional second object argument that can override any of the above default values.

The data is usually interpreted as a String, but couchdb-lucene provides special handling if a JavaScript Date object is passed. Specifically, the date is indexed as a numeric value, which allows correct sorting, and stored (if requested) in ISO 8601 format (with a timezone marker).

<pre>
// Add with all the defaults.
doc.add("value");

// Add a subject field.
doc.add("this is the subject line.", {"field":"subject"});

// Add but ensure it's stored.
doc.add("value", {"store":"yes"});

// Add but don't analyze.
doc.add("don't analyze me", {"index":"not_analyzed"});

// Extract text from the named attachment and index it (but not store it).
doc.attachment("attachment name", {"field":"attachments"});
</pre>

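To see how the optional second argument interacts with the defaults, it can help to experiment outside CouchDB. The sketch below is a hypothetical stand-in for the Document class, written only for local experiments. It is not couchdb-lucene's implementation; the option names and default values are taken from the table above, and everything else (the fields array, the merging logic) is assumed.

```javascript
// Hypothetical stand-in for couchdb-lucene's Document class, for local experiments only.
// Option names and default values mirror the defaults table; the rest is assumed.
var DEFAULTS = { field: "default", store: "no", index: "analyzed" };

function Document() {
  this.fields = [];
}

// Records the value together with the options that would apply to it:
// the defaults first, then anything overridden by the second argument.
Document.prototype.add = function (value, options) {
  var effective = {};
  var key;
  for (key in DEFAULTS) { effective[key] = DEFAULTS[key]; }
  for (key in (options || {})) { effective[key] = options[key]; }
  this.fields.push({ value: value, options: effective });
};

var doc = new Document();
doc.add("value");
doc.add("this is the subject line.", { field: "subject", store: "yes" });
console.log(JSON.stringify(doc.fields, null, 2));
```

With a stand-in like this, an index function can be called on a sample document and its output inspected before the design document is uploaded.
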
<h3>Example Transforms</h3>

<h4>Index Everything</h4>

<pre>
function(doc) {
    var ret = new Document();

    function idx(obj) {
        for (var key in obj) {
            switch (typeof obj[key]) {
            case 'object':
                idx(obj[key]);
                break;
            case 'function':
                break;
            default:
                ret.add(obj[key]);
                break;
            }
        }
    };

    idx(doc);

    if (doc._attachments) {
        for (var i in doc._attachments) {
            ret.attachment("attachment", i);
        }
    }

    return ret;
}
</pre>

<h4>Index Nothing</h4>

<pre>
function(doc) {
    return null;
}
</pre>

<h4>Index Select Fields</h4>

<pre>
function(doc) {
    var result = new Document();
    result.add(doc.subject, {"field":"subject", "store":"yes"});
    result.add(doc.content, {"field":"subject"});
    // Index the current time; Date values are indexed numerically.
    result.add(new Date(), {"field":"indexed_at"});
    return result;
}
</pre>

<h4>Index Attachments</h4>

<pre>
function(doc) {
    var result = new Document();
    for(var a in doc._attachments) {
        result.add_attachment(a, {"field":"attachment"});
    }
    return result;
}
</pre>

<h4>A More Complex Example</h4>

<pre>
function(doc) {
    // Build one Document per value, tagged with the group it belongs to.
    var mk = function(name, value, group) {
        var ret = new Document();
        ret.add(value, {"field": group, "store":"yes"});
        ret.add(group, {"field":"group", "store":"yes"});
        return ret;
    };
    var ret = [];
    if(doc.type != "reference") return null;
    for(var g in doc.groups) {
        ret.push(mk("library", doc.groups[g].library, g));
        ret.push(mk("method", doc.groups[g].method, g));
        ret.push(mk("target", doc.groups[g].target, g));
    }
    return ret;
}
</pre>

<h2>Attachment Indexing</h2>

Couchdb-lucene uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, assuming the correct content_type is set in couchdb:

<h3>Supported Formats</h3>

<ul>
<li>Excel spreadsheets (application/vnd.ms-excel)
<li>Word documents (application/msword)
<li>Powerpoint presentations (application/vnd.ms-powerpoint)
<li>Visio (application/vnd.visio)
<li>Outlook (application/vnd.ms-outlook)
<li>XML (application/xml)
<li>HTML (text/html)
<li>Images (image/*)
<li>Java class files
<li>Java jar archives
<li>MP3 (audio/mp3)
<li>OpenDocument (application/vnd.oasis.opendocument.*)
<li>Plain text (text/plain)
<li>PDF (application/pdf)
<li>RTF (application/rtf)
</ul>

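Because indexing depends on the content_type stored in couchdb, it can be useful to check a document's attachments against the list above before expecting them to be searchable. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, a trailing * is treated as a prefix wildcard, and the two Java entries are left out because the list gives no content type for them.

```javascript
// Content types copied from the list above; a trailing * is treated as a prefix wildcard.
var SUPPORTED_TYPES = [
  "application/vnd.ms-excel",
  "application/msword",
  "application/vnd.ms-powerpoint",
  "application/vnd.visio",
  "application/vnd.ms-outlook",
  "application/xml",
  "text/html",
  "image/*",
  "audio/mp3",
  "application/vnd.oasis.opendocument.*",
  "text/plain",
  "application/pdf",
  "application/rtf"
];

// Hypothetical helper: true if contentType matches an entry in the list.
function isListedType(contentType) {
  return SUPPORTED_TYPES.some(function (pattern) {
    if (pattern.charAt(pattern.length - 1) === "*") {
      return contentType.indexOf(pattern.slice(0, -1)) === 0;
    }
    return contentType === pattern;
  });
}

// Hypothetical helper: names of the attachments whose content_type is listed.
function listedAttachments(doc) {
  var attachments = doc._attachments || {};
  return Object.keys(attachments).filter(function (name) {
    return isListedType(attachments[name].content_type);
  });
}

var sample = {
  _id: "example",
  _attachments: {
    "report.pdf": { content_type: "application/pdf" },
    "photo.png": { content_type: "image/png" },
    "clip.mp4": { content_type: "video/mp4" }
  }
};
console.log(listedAttachments(sample));
```
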
<h1>Searching with couchdb-lucene</h1>

You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The _body field is searched by default, and it includes the extracted text from all attachments. The following parameters can be passed for more sophisticated searches:

<dl>
<dt>q</dt><dd>the query to run (e.g., subject:hello). If no field is specified, the default field is searched.</dd>
<dt>sort</dt><dd>the comma-separated fields to sort on. Prefix with / for ascending order and \ for descending order (ascending is the default if not specified).</dd>
<dt>limit</dt><dd>the maximum number of results to return</dd>
<dt>skip</dt><dd>the number of results to skip</dd>
<dt>include_docs</dt><dd>whether to include the source docs</dd>
<dt>stale=ok</dt><dd>If you set the <i>stale</i> option to <i>ok</i>, couchdb-lucene may not perform any refreshing on the index. Searches may be faster as Lucene caches important data (especially for sorting). A query without stale=ok will use the latest data committed to the index.</dd>
<dt>debug</dt><dd>if false, a normal application/json response with results is returned. If true, a pretty-printed HTML blob is returned instead.</dd>
<dt>rewrite</dt><dd>(EXPERT) if true, returns a json response with a rewritten query and term frequencies. This allows correct distributed scoring when combining the results from multiple nodes.</dd>
</dl>

<i>All parameters except 'q' are optional.</i>

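The parameters above combine into an ordinary query string. The sketch below shows one way to build it, including the / and \ sort prefixes. The helper names sortParam and queryString are hypothetical, and the sketch assumes that percent-encoded values are decoded normally on the server side.

```javascript
// Hypothetical helper: turns a list of sort keys into the value of the sort parameter.
// "/" marks ascending order and "\" marks descending order, as described above.
function sortParam(keys) {
  return keys.map(function (key) {
    return (key.descending ? "\\" : "/") + key.field;
  }).join(",");
}

// Hypothetical helper: builds a percent-encoded query string from a parameter object.
function queryString(params) {
  return Object.keys(params).map(function (name) {
    return encodeURIComponent(name) + "=" + encodeURIComponent(params[name]);
  }).join("&");
}

var qs = queryString({
  q: "subject:hello",
  sort: sortParam([{ field: "billing_size", descending: true }, { field: "customer" }]),
  limit: 10,
  skip: 20,
  include_docs: true
});
console.log(qs);
```

Writing the ascending prefix explicitly is optional, since ascending is the default, but it keeps the generated value unambiguous.
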
<h2>Special Fields</h2>

<dl>
<dt>_db</dt><dd>The source database of the document.</dd>
<dt>_id</dt><dd>The _id of the document.</dd>
</dl>

<h2>Dublin Core</h2>

All Dublin Core attributes are indexed and stored if detected in the attachment. Descriptions of the fields come from the Tika javadocs.

<dl>
<dt>_dc.contributor</dt><dd>An entity responsible for making contributions to the content of the resource.</dd>
<dt>_dc.coverage</dt><dd>The extent or scope of the content of the resource.</dd>
<dt>_dc.creator</dt><dd>An entity primarily responsible for making the content of the resource.</dd>
<dt>_dc.date</dt><dd>A date associated with an event in the life cycle of the resource.</dd>
<dt>_dc.description</dt><dd>An account of the content of the resource.</dd>
<dt>_dc.format</dt><dd>Typically, Format may include the media-type or dimensions of the resource.</dd>
<dt>_dc.identifier</dt><dd>Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.</dd>
<dt>_dc.language</dt><dd>A language of the intellectual content of the resource.</dd>
<dt>_dc.modified</dt><dd>Date on which the resource was changed.</dd>
<dt>_dc.publisher</dt><dd>An entity responsible for making the resource available.</dd>
<dt>_dc.relation</dt><dd>A reference to a related resource.</dd>
<dt>_dc.rights</dt><dd>Information about rights held in and over the resource.</dd>
<dt>_dc.source</dt><dd>A reference to a resource from which the present resource is derived.</dd>
<dt>_dc.subject</dt><dd>The topic of the content of the resource.</dd>
<dt>_dc.title</dt><dd>A name given to the resource.</dd>
<dt>_dc.type</dt><dd>The nature or genre of the content of the resource.</dd>
</dl>

<h2>Examples</h2>

<pre>
http://localhost:5984/dbname/_fti/design_doc/view_name?q=field_name:value
http://localhost:5984/dbname/_fti/design_doc/view_name?q=field_name:value&sort=other_field
http://localhost:5984/dbname/_fti/design_doc/view_name?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
</pre>

<h2>Search Results Format</h2>

The search result contains a number of fields at the top level, in addition to your search results.

<dl>
<dt>q</dt><dd>The query that was executed.</dd>
<dt>etag</dt><dd>An opaque token that reflects the current version of the index. This value is also returned in an ETag header to facilitate HTTP caching.</dd>
<dt>skip</dt><dd>The number of initial matches that were skipped.</dd>
<dt>limit</dt><dd>The maximum number of results that can appear.</dd>
<dt>total_rows</dt><dd>The total number of matches for this query.</dd>
<dt>search_duration</dt><dd>The number of milliseconds spent performing the search.</dd>
<dt>fetch_duration</dt><dd>The number of milliseconds spent retrieving the documents.</dd>
<dt>rows</dt><dd>The search results array, described below.</dd>
</dl>

<h2>The search results array</h2>

The search results array consists of zero or more objects with the following fields:

<dl>
<dt>id</dt><dd>The unique identifier for this match.</dd>
<dt>score</dt><dd>The normalized score (0.0-1.0, inclusive) for this match</dd>
<dt>fields</dt><dd>All the fields that were stored with this match</dd>
<dt>doc</dt><dd>The original document from couch, if requested with include_docs=true</dd>
</dl>

Here's an example of a JSON response without sorting:

<pre>
{
    "q": "+content:enron",
    "skip": 0,
    "limit": 2,
    "total_rows": 176852,
    "search_duration": 518,
    "fetch_duration": 4,
    "rows": [
        {
            "id": "hain-m-all_documents-257.",
            "score": 1.601625680923462
        },
        {
            "id": "hain-m-notes_inbox-257.",
            "score": 1.601625680923462
        }
    ]
}
</pre>

And the same with sorting:

<pre>
{
    "q": "+content:enron",
    "skip": 0,
    "limit": 3,
    "total_rows": 176852,
    "search_duration": 660,
    "fetch_duration": 4,
    "sort_order": [
        {
            "field": "source",
            "reverse": false,
            "type": "string"
        },
        {
            "reverse": false,
            "type": "doc"
        }
    ],
    "rows": [
        {
            "id": "shankman-j-inbox-105.",
            "score": 0.6131107211112976,
            "sort_order": [
                "enron",
                6
            ]
        },
        {
            "id": "shankman-j-inbox-8.",
            "score": 0.7492915391921997,
            "sort_order": [
                "enron",
                7
            ]
        },
        {
            "id": "shankman-j-inbox-30.",
            "score": 0.507369875907898,
            "sort_order": [
                "enron",
                8
            ]
        }
    ]
}
</pre>

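A client typically pulls the ids out of rows and uses skip, limit and total_rows to page through the matches. The following is a minimal sketch of that bookkeeping, using a trimmed copy of the first response above; the helper names matchIds and nextSkip are hypothetical.

```javascript
// A trimmed copy of the unsorted response shown above.
var response = {
  "q": "+content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "rows": [
    { "id": "hain-m-all_documents-257.", "score": 1.601625680923462 },
    { "id": "hain-m-notes_inbox-257.", "score": 1.601625680923462 }
  ]
};

// Hypothetical helper: the ids of the matches on this page.
function matchIds(result) {
  return result.rows.map(function (row) { return row.id; });
}

// Hypothetical helper: the skip value for the next page,
// or null when this page already reaches the last match.
function nextSkip(result) {
  var next = result.skip + result.rows.length;
  return next < result.total_rows ? next : null;
}

console.log(matchIds(response));
console.log(nextSkip(response));
```

Passing the value from nextSkip as the skip parameter of the next request, with the same q and limit, fetches the following page.
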
<h1>Fetching information about the index</h1>

Calling couchdb-lucene without arguments returns a JSON object with information about the <i>whole</i> index.

<pre>
http://127.0.0.1:5984/enron/_fti
</pre>

returns:

<pre>
{"doc_count":517350,"doc_del_count":1,"disk_size":318543045}
</pre>

<h1>Working With The Source</h1>

To develop "live", type "mvn dependency:unpack-dependencies" and change the external line to something like this:

<pre>
fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
/path/to/couchdb-lucene/target/dependency com.github.rnewson.couchdb.lucene.Main
</pre>

You will need to restart CouchDB if you change couchdb-lucene source code, but this is very fast.

<h1>Configuration</h1>

couchdb-lucene respects several system properties:

<dl>
<dt>couchdb.url</dt><dd>the URL to contact CouchDB with (default is "http://localhost:5984")</dd>
<dt>couchdb.lucene.dir</dt><dd>specify the path to the lucene indexes (the default is to make a directory called 'lucene' relative to couchdb's current working directory).</dd>
<dt>couchdb.log.dir</dt><dd>specify the directory of the log file (which is called couchdb-lucene.log), defaults to the platform-specific temp directory.</dd>
<dt>couchdb.lucene.operator</dt><dd>specify the default boolean operator for queries. If not specified, the default is "OR". You can specify either "OR" or "AND".</dd>
</dl>

You can override these properties like this:

<pre>
fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
-cp /home/rnewson/Source/couchdb-lucene/target/classes:\
/home/rnewson/Source/couchdb-lucene/target/dependency \
com.github.rnewson.couchdb.lucene.Main
</pre>

<h2>Basic Authentication</h2>

If you put couchdb behind an authenticating proxy, you can still configure couchdb-lucene to pull from it by specifying additional system properties. Currently only Basic authentication is supported.

<dl>
<dt>couchdb.user</dt><dd>the user to authenticate as.</dd>
<dt>couchdb.password</dt><dd>the password to authenticate with.</dd>
</dl>

<h2>IPv6</h2>

The default for couchdb.url is problematic on an IPv6 system. Specify -Dcouchdb.url=http://[::1]:5984 to resolve it.