<h1>News</h1>

The indexing API in 0.3 has changed since 0.2 to allow multiple design documents and "views" into Lucene. It also moves the Lucene-specific settings into an options object.

<h1>Issue Tracking</h1>
Issue tracking is now available at <a href="http://rnewson.lighthouseapp.com/projects/27420-couchdb-lucene">lighthouseapp</a>.

<h1>System Requirements</h1>

Sun JDK 5 or higher is required. Couchdb-lucene is known to be incompatible with OpenJDK, which bundles an earlier, incompatible version of the Rhino JavaScript library.

<h1>Build couchdb-lucene</h1>

<ol>
<li>Install Maven 2.
<li>Check out the repository.
<li>Type 'mvn'.
<li>Configure CouchDB (see below).
</ol>

<h1>Configure CouchDB</h1>

<pre>
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
</pre>

<h1>Indexing Strategy</h1>

<h2>Document Indexing</h2>

You must supply an index function to enable couchdb-lucene; by default, nothing is indexed.

You may add any number of index views in any number of design documents. All searches will be constrained to documents emitted by those view functions.

Declare your functions as follows:

<pre>
{
  "views": {
    <i>conventional view code goes here</i>
  },
  "fulltext": {
    "by_subject": {
      "defaults": { "store":"yes" },
      "index":"function(doc) { var ret=new Document(); ret.add(doc.subject); return ret }"
    },
    "french_documents": {
      "defaults": { "language":"fr" },
      "index":"function(doc) { if (doc.language != \"fr\") { return null;} var ret=new Document(); <i>etc</i> return ret; }"
    }
  }
}
</pre>
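
Because each index function is stored as a JSON string, any double quotes inside it have to be escaped by hand. One way to avoid that, sketched here as an assumption rather than anything couchdb-lucene provides, is to write the functions as ordinary JavaScript and let JSON.stringify do the quoting. The helper name buildDesignDoc and the id _design/lucene are hypothetical.

```javascript
// Hypothetical helper: turns real functions into the string form
// that a design document's "fulltext" section expects.
function buildDesignDoc(id, fulltext) {
  var ddoc = { "_id": id, "fulltext": {} };
  for (var name in fulltext) {
    ddoc.fulltext[name] = {
      "defaults": fulltext[name].defaults || {},
      // Function.prototype.toString() yields the function's source text.
      "index": fulltext[name].index.toString()
    };
  }
  return ddoc;
}

var ddoc = buildDesignDoc("_design/lucene", {
  "french_documents": {
    "defaults": { "language": "fr" },
    // Never called here; it is only converted to a string.
    "index": function(doc) {
      if (doc.language != "fr") { return null; }
      var ret = new Document();
      ret.add(doc.subject);
      return ret;
    }
  }
});

// JSON.stringify escapes the inner quotes around "fr".
var json = JSON.stringify(ddoc, null, 2);
console.log(json);
```

The resulting JSON can then be saved to CouchDB like any other design document.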

A fulltext object contains multiple index view declarations. An index view consists of:

<dl>
<dt>defaults</dt><dd>The defaults for numerous indexing options can be overridden here. A full list of options follows.</dd>
<dt>index</dt><dd>The indexing function itself, documented below.</dd>
</dl>

<h3>The Defaults Object</h3>

The following indexing options can be defaulted:

<table>
<tr>
<th>name</th>
<th>description</th>
<th>available options</th>
<th>default</th>
</tr>
<tr>
<th>field</th>
<td>the field name to index under</td>
<td>user-defined</td>
<td>default</td>
</tr>
<tr>
<th>type</th>
<td>the type of data, which may affect analysis</td>
<td>date, number, text</td>
<td>text</td>
</tr>
<tr>
<th>store</th>
<td>whether the data is stored</td>
<td>yes, no</td>
<td>no</td>
</tr>
<tr>
<th>index</th>
<td>whether (and how) the data is indexed</td>
<td>analyzed, analyzed_no_norms, no, not_analyzed, not_analyzed_no_norms</td>
<td>analyzed</td>
</tr>
<tr>
<th>analyzer</th>
<td>how the data is analyzed</td>
<td>simple, standard</td>
<td>standard</td>
</tr>
<tr>
<th>language</th>
<td>which language the data is in</td>
<td>auto, br, cjk, cn, cz, de, el, en, fr, nl, ru, th</td>
<td>en</td>
</tr>
</table>
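
One way to picture how a view's defaults object interacts with the built-in values in the table is as a simple overlay. The sketch below illustrates that reading only; it is not couchdb-lucene's implementation, and the names BUILT_IN_DEFAULTS and effectiveDefaults are hypothetical.

```javascript
// Built-in defaults, copied from the table above.
var BUILT_IN_DEFAULTS = {
  "field": "default",
  "type": "text",
  "store": "no",
  "index": "analyzed",
  "analyzer": "standard",
  "language": "en"
};

// Hypothetical helper: overlay a view's "defaults" object
// on top of the built-in values.
function effectiveDefaults(viewDefaults) {
  var result = {};
  var key;
  for (key in BUILT_IN_DEFAULTS) {
    result[key] = BUILT_IN_DEFAULTS[key];
  }
  for (key in viewDefaults) {
    result[key] = viewDefaults[key];
  }
  return result;
}

// The "french_documents" view only changes the language.
var french = effectiveDefaults({ "language": "fr" });
console.log(french.language + " " + french.store);
```

Every option the view leaves out keeps its value from the table, so only the language differs here.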

<h3>The Document class</h3>

You may construct a new Document instance with:

<pre>
var doc = new Document();
</pre>

Data may be added to this document with the add method, which takes an optional second argument: an object that can override any of the default values above.

<pre>
// Add with all the defaults.
doc.add("value");

// Add a subject field.
doc.add("this is the subject line.", {"field":"subject"});

// Add but ensure it's stored.
doc.add("value", {"store":"yes"});

// Add but don't analyze.
doc.add("don't analyze me", {"index":"not_analyzed"});

// Extract text from the named attachment and index it (but not store it).
doc.attachment("attachment name", {"field":"attachments"});

// Interpret "value" as a date using the default date formats.
doc.add("2009-01-01T00:00:00Z", {"type":"date"});

// Interpret "value" as a date using the supplied format string
// (see Java's SimpleDateFormat class for the syntax).
doc.add("2009-01-01", {"type":"date", "format":"yyyy-MM-dd"});

// Interpret "value" as a number.
doc.add("100", {"type":"number"});
</pre>
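
To try an index function outside couchdb-lucene, for example under Node.js, a small stand-in for the Document class can record what would be indexed. The stub below is an assumption made for illustration: it mimics only the add call shown above and does no real indexing.

```javascript
// Stand-in for couchdb-lucene's Document class (illustrative only).
function Document() {
  this.fields = [];
}

// Record the value together with any per-call options.
Document.prototype.add = function(value, options) {
  this.fields.push({ "value": value, "options": options || {} });
};

// An index function written as it would appear in a design document.
var index = function(doc) {
  var ret = new Document();
  ret.add(doc.subject, {"field":"subject", "store":"yes"});
  ret.add(doc.date, {"type":"date"});
  return ret;
};

// Dry run against a sample document.
var result = index({ "subject": "hello", "date": "2009-01-01T00:00:00Z" });
console.log(JSON.stringify(result.fields, null, 2));
```

A dry run like this catches syntax errors and missing attributes before the function is stored in a design document.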

<h3>Example Transforms</h3>

<h4>Index Everything</h4>

<pre>
function(doc) {
  var ret = new Document();

  function idx(obj) {
    for (var key in obj) {
      switch (typeof obj[key]) {
      case 'object':
        idx(obj[key]);
        break;
      case 'function':
        break;
      default:
        ret.add(obj[key], {"field": key});
        /*
         * Uncomment next line to include
         * all attributes into the default field.
         */
        // ret.add(obj[key]);
        break;
      }
    }
  }

  // Index all attributes
  idx(doc);

  // Index all attachments
  for(var a in doc._attachments) {
    ret.add_attachment(a, {"field": "attachments"});
  }

  return ret;
}
</pre>

<h4>Index Nothing</h4>

<pre>
function(doc) {
  return null;
}
</pre>

<h4>Index Select Fields</h4>

<pre>
function(doc) {
  var result = new Document();
  result.add(doc.subject, {"field":"subject", "store":"yes"});
  result.add(doc.content, {"field":"subject"});
  // Record the indexing time (the value is assumed to be the current date).
  result.add(new Date(), {"field":"indexed_at"});
  return result;
}
</pre>

<h4>Index Attachments</h4>

<pre>
function(doc) {
  var result = new Document();
  for(var a in doc._attachments) {
    result.add_attachment(a, {"field":"attachment"});
  }
  return result;
}
</pre>

<h4>A More Complex Example</h4>

<pre>
function(doc) {
  var mk = function(name, value, group) {
    var ret = new Document();
    ret.add(value, {"field":group, "store":"yes"}); // ERROR
    ret.add(group, {"field":"group", "store":"yes"});
    return ret;
  };
  var ret = [];
  if(doc.type != "reference") return null;
  for(var g in doc.groups) {
    // ret is a plain array here, so documents are appended with push.
    ret.push(mk("library", doc.groups[g].library, g));
    ret.push(mk("method", doc.groups[g].method, g));
    ret.push(mk("target", doc.groups[g].target, g));
  }
  return ret;
}
</pre>

<h2>Attachment Indexing</h2>

Couchdb-lucene uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, assuming the correct content_type is set in CouchDB:

<h3>Supported Formats</h3>

<ul>
<li>Excel spreadsheets (application/vnd.ms-excel)
<li>Word documents (application/msword)
<li>PowerPoint presentations (application/vnd.ms-powerpoint)
<li>Visio (application/vnd.visio)
<li>Outlook (application/vnd.ms-outlook)
<li>XML (application/xml)
<li>HTML (text/html)
<li>Images (image/*)
<li>Java class files
<li>Java jar archives
<li>MP3 (audio/mp3)
<li>OpenDocument (application/vnd.oasis.opendocument.*)
<li>Plain text (text/plain)
<li>PDF (application/pdf)
<li>RTF (application/rtf)
</ul>
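
Since extraction depends on the content_type stored with each attachment, an index function can check that value before asking for an attachment to be indexed. The sketch below assumes that each entry in doc._attachments carries a content_type attribute, as CouchDB attachment stubs do, and reuses the add_attachment call from the earlier examples. The SUPPORTED list is a hypothetical subset of the formats above, and the Document stub exists only so the sketch runs outside couchdb-lucene.

```javascript
// Minimal stand-in so the sketch runs outside couchdb-lucene.
function Document() {
  this.attachments = [];
}
Document.prototype.add_attachment = function(name, options) {
  this.attachments.push(name);
};

// Hypothetical subset of the content types listed above.
var SUPPORTED = ["application/pdf", "application/msword", "text/plain", "text/html"];

function isSupported(contentType) {
  if (!contentType) { return false; }
  // Any image/* type is accepted.
  if (contentType.indexOf("image/") === 0) { return true; }
  return SUPPORTED.indexOf(contentType) !== -1;
}

var index = function(doc) {
  var result = new Document();
  for (var a in doc._attachments) {
    if (isSupported(doc._attachments[a].content_type)) {
      result.add_attachment(a, {"field":"attachment"});
    }
  }
  return result;
};

var indexed = index({ "_attachments": {
  "report.pdf": { "content_type": "application/pdf" },
  "archive.zip": { "content_type": "application/zip" },
  "photo.png": { "content_type": "image/png" }
}});
console.log(indexed.attachments.join(", "));
```

Inside a design document the function is stored as a single string, so a helper such as isSupported would have to be defined within the function body.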

<h1>Searching with couchdb-lucene</h1>

You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The _body field is searched by default; it includes the extracted text from all attachments. The following parameters can be passed for more sophisticated searches:

<dl>
<dt>q</dt><dd>the query to run (e.g., subject:hello). If the query does not name a field, the default field is searched.</dd>
<dt>sort</dt><dd>the comma-separated fields to sort on. Prefix with / for ascending order and \ for descending order (ascending is the default if not specified).</dd>
<dt>limit</dt><dd>the maximum number of results to return</dd>
<dt>skip</dt><dd>the number of results to skip</dd>
<dt>include_docs</dt><dd>whether to include the source docs</dd>
<dt>stale=ok</dt><dd>If you set the <i>stale</i> option to <i>ok</i>, couchdb-lucene may not perform any refreshing on the index. Searches may be faster as Lucene caches important data (especially for sorting). A query without stale=ok will use the latest data committed to the index.</dd>
<dt>debug</dt><dd>if false, a normal application/json response with results is returned. If true, a pretty-printed HTML blob is returned instead.</dd>
<dt>rewrite</dt><dd>(EXPERT) if true, returns a JSON response with a rewritten query and term frequencies. This allows correct distributed scoring when combining the results from multiple nodes.</dd>
</dl>

<i>All parameters except 'q' are optional.</i>
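
Lucene queries often contain spaces, colons and brackets, so it helps to URL-encode each parameter when building a request. The following is a minimal sketch of that; the helper name searchUrl is hypothetical, and the host, database name and field names are placeholders.

```javascript
// Hypothetical helper: builds a _fti search URL from a parameter object.
function searchUrl(base, db, params) {
  var parts = [];
  for (var name in params) {
    // Encode both the parameter name and its value.
    parts.push(encodeURIComponent(name) + "=" + encodeURIComponent(params[name]));
  }
  return base + "/" + encodeURIComponent(db) + "/_fti?" + parts.join("&");
}

var url = searchUrl("http://localhost:5984", "dbname", {
  "q": "body:document AND customer:[A TO C]",
  "sort": "billing_size",
  "limit": 10,
  "stale": "ok"
});
console.log(url);
```

The same helper handles a descending sort such as \billing_size, because the backslash prefix is encoded along with the rest of the value.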

<h2>Special Fields</h2>

<dl>
<dt>_db</dt><dd>The source database of the document.</dd>
<dt>_id</dt><dd>The _id of the document.</dd>
</dl>

<h2>Dublin Core</h2>

All Dublin Core attributes are indexed and stored if detected in the attachment. Descriptions of the fields come from the Tika javadocs.

<dl>
<dt>dc.contributor</dt><dd>An entity responsible for making contributions to the content of the resource.</dd>
<dt>dc.coverage</dt><dd>The extent or scope of the content of the resource.</dd>
<dt>dc.creator</dt><dd>An entity primarily responsible for making the content of the resource.</dd>
<dt>dc.date</dt><dd>A date associated with an event in the life cycle of the resource.</dd>
<dt>dc.description</dt><dd>An account of the content of the resource.</dd>
<dt>dc.format</dt><dd>Typically, Format may include the media-type or dimensions of the resource.</dd>
<dt>dc.identifier</dt><dd>Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.</dd>
<dt>dc.language</dt><dd>A language of the intellectual content of the resource.</dd>
<dt>dc.modified</dt><dd>Date on which the resource was changed.</dd>
<dt>dc.publisher</dt><dd>An entity responsible for making the resource available.</dd>
<dt>dc.relation</dt><dd>A reference to a related resource.</dd>
<dt>dc.rights</dt><dd>Information about rights held in and over the resource.</dd>
<dt>dc.source</dt><dd>A reference to a resource from which the present resource is derived.</dd>
<dt>dc.subject</dt><dd>The topic of the content of the resource.</dd>
<dt>dc.title</dt><dd>A name given to the resource.</dd>
<dt>dc.type</dt><dd>The nature or genre of the content of the resource.</dd>
</dl>

<h2>Examples</h2>

<pre>
http://localhost:5984/dbname/_fti?q=field_name:value
http://localhost:5984/dbname/_fti?q=field_name:value&sort=other_field
http://localhost:5984/dbname/_fti?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
</pre>

<h2>Search Results Format</h2>

The search result contains a number of fields at the top level, in addition to your search results.

<dl>
<dt>q</dt><dd>The query that was executed.</dd>
<dt>etag</dt><dd>An opaque token that reflects the current version of the index. This value is also returned in an ETag header to facilitate HTTP caching.</dd>
<dt>skip</dt><dd>The number of initial matches that were skipped.</dd>
<dt>limit</dt><dd>The maximum number of results that can appear.</dd>
<dt>total_rows</dt><dd>The total number of matches for this query.</dd>
<dt>search_duration</dt><dd>The number of milliseconds spent performing the search.</dd>
<dt>fetch_duration</dt><dd>The number of milliseconds spent retrieving the documents.</dd>
<dt>rows</dt><dd>The search results, an array of the objects described below.</dd>
</dl>

<h2>The search results object</h2>

<dl>
<dt>id</dt><dd>The unique identifier for this match.</dd>
<dt>score</dt><dd>The normalized score (0.0-1.0, inclusive) for this match.</dd>
<dt>fields</dt><dd>All the fields that were stored with this match.</dd>
<dt>doc</dt><dd>The original document from CouchDB, if requested with include_docs=true.</dd>
</dl>

Here's an example of a JSON response without sorting:

<pre>
{
  "q": "+content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "search_duration": 518,
  "fetch_duration": 4,
  "rows": [
    {
      "id": "hain-m-all_documents-257.",
      "score": 1.601625680923462
    },
    {
      "id": "hain-m-notes_inbox-257.",
      "score": 1.601625680923462
    }
  ]
}
</pre>

And the same with sorting:

<pre>
{
  "q": "+content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "search_duration": 660,
  "fetch_duration": 4,
  "sort_order": [
    {
      "field": "source",
      "reverse": false,
      "type": "string"
    },
    {
      "reverse": false,
      "type": "doc"
    }
  ],
  "rows": [
    {
      "id": "shankman-j-inbox-105.",
      "score": 0.6131107211112976,
      "sort_order": [
        "enron",
        6
      ]
    },
    {
      "id": "shankman-j-inbox-8.",
      "score": 0.7492915391921997,
      "sort_order": [
        "enron",
        7
      ]
    },
    {
      "id": "shankman-j-inbox-30.",
      "score": 0.507369875907898,
      "sort_order": [
        "enron",
        8
      ]
    }
  ]
}
</pre>
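
A client will usually pull the ids out of rows and use skip and total_rows to page through the matches. The following is a minimal sketch of that, using a trimmed copy of the sorted response above (the duration and top-level sort_order fields are left out). The function names ids and nextSkip are hypothetical.

```javascript
// A trimmed copy of the sorted response shown above.
var response = {
  "q": "+content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "rows": [
    { "id": "shankman-j-inbox-105.", "score": 0.6131107211112976, "sort_order": ["enron", 6] },
    { "id": "shankman-j-inbox-8.", "score": 0.7492915391921997, "sort_order": ["enron", 7] },
    { "id": "shankman-j-inbox-30.", "score": 0.507369875907898, "sort_order": ["enron", 8] }
  ]
};

// Collect the document ids in result order.
function ids(res) {
  var out = [];
  for (var i = 0; i < res.rows.length; i++) {
    out.push(res.rows[i].id);
  }
  return out;
}

// Work out the "skip" value for the next page, or null when
// every match has already been returned.
function nextSkip(res) {
  var next = res.skip + res.rows.length;
  return next < res.total_rows ? next : null;
}

console.log(ids(response).join(", "));
console.log(nextSkip(response));
```

Passing the value from nextSkip as the skip parameter of the next request, with the same q, sort and limit, fetches the following page.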

<h1>Fetching information about the index</h1>

Calling couchdb-lucene without arguments returns a JSON object with information about the index.

<pre>
http://127.0.0.1:5984/enron/_fti
</pre>

returns:

<pre>
{"doc_count":517350,"doc_del_count":1,"disk_size":318543045}
</pre>

<h1>Working With The Source</h1>

To develop "live", type "mvn dependency:unpack-dependencies" and change the external line to something like this:

<pre>
fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
/path/to/couchdb-lucene/target/dependency com.github.rnewson.couchdb.lucene.Main
</pre>

You will need to restart CouchDB whenever you change the couchdb-lucene source code, but this is very fast.

<h1>Configuration</h1>

couchdb-lucene respects several system properties:

<dl>
<dt>couchdb.url</dt><dd>the URL to contact CouchDB with (default is "http://localhost:5984")</dd>
<dt>couchdb.lucene.dir</dt><dd>the path to the Lucene indexes (the default is to make a directory called 'lucene' relative to CouchDB's current working directory).</dd>
<dt>couchdb.log.dir</dt><dd>the directory of the log file (which is called couchdb-lucene.log); defaults to the platform-specific temp directory.</dd>
</dl>

You can override these properties like this:

<pre>
fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
-cp /home/rnewson/Source/couchdb-lucene/target/classes:\
/home/rnewson/Source/couchdb-lucene/target/dependency \
com.github.rnewson.couchdb.lucene.Main
</pre>

<h2>Basic Authentication</h2>

If you put CouchDB behind an authenticating proxy, you can still configure couchdb-lucene to pull from it by specifying additional system properties. Currently only Basic authentication is supported.

<dl>
<dt>couchdb.user</dt><dd>the user to authenticate as.</dd>
<dt>couchdb.password</dt><dd>the password to authenticate with.</dd>
</dl>

<h2>IPv6</h2>

The default for couchdb.url is problematic on an IPv6 system. Specify -Dcouchdb.url=http://[::1]:5984 to resolve it.
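
When several of these properties are set at once, the external line becomes long and easy to mistype. As an illustration only, the sketch below assembles it from an object of properties. The helper name javaCommand is hypothetical, the user name and password are placeholders, and values containing spaces would need quoting that this sketch does not attempt.

```javascript
// Hypothetical helper: renders system properties as -D flags
// on the java command line used in the CouchDB configuration.
function javaCommand(properties, jar, mode) {
  var flags = [];
  for (var name in properties) {
    flags.push("-D" + name + "=" + properties[name]);
  }
  return ["/usr/bin/java"].concat(flags, ["-jar", jar, mode]).join(" ");
}

var line = "fti=" + javaCommand({
  "couchdb.url": "http://[::1]:5984",
  "couchdb.user": "lucene",       // placeholder user name
  "couchdb.password": "secret"    // placeholder password
}, "/path/to/couchdb-lucene*-jar-with-dependencies.jar", "-search");
console.log(line);
```

The indexer line under [update_notification] would presumably take the same flags, with -index in place of -search.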