<h1>News</h1>

The indexing API in 0.3 has changed since 0.2 to allow multiple design documents and "views" into Lucene. It also moves the Lucene-specific stuff into an options object.

<h1>Issue Tracking</h1>

Issue tracking at <a href="http://github.com/rnewson/couchdb-lucene/issues">github</a>.

<h1>System Requirements</h1>

Sun JDK 5 or higher is recommended.

Couchdb-lucene is known to be incompatible with some versions of OpenJDK, which include an earlier, incompatible version of the Rhino JavaScript library. The OpenJDK version in Ubuntu 8.10 (6b12-0ubuntu6.4) uses Rhino 1.7R1 and is known to work.

<h1>Build couchdb-lucene</h1>

<ol>
<li>Install Maven 2.
<li>Check out the repository.
<li>Type 'mvn'.
<li>Configure CouchDB (see below).
</ol>

<h1>Configure CouchDB</h1>

<pre>
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
</pre>

<h1>Indexing Strategy</h1>

<h2>Document Indexing</h2>

You must supply an index function to enable couchdb-lucene; by default, nothing is indexed.

You may add any number of index views in any number of design documents. All searches will be constrained to documents emitted by the index functions.

Declare your design document as follows:

<pre>
{
    "fulltext": {
        "by_subject": {
            "defaults": { "store":"yes" },
            "index":"function(doc) { var ret=new Document(); ret.add(doc.subject); return ret }"
        },
        "by_content": {
            "defaults": { "store":"no" },
            "index":"function(doc) { var ret=new Document(); ret.add(doc.content); return ret }"
        }
    }
}
</pre>

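Because each index function is stored as a string inside JSON, it can be easier to write the functions as ordinary JavaScript and serialise them when building the design document. The following is a minimal sketch of that idea, assuming a modern runtime such as Node.js for the build step. The helper buildDesignDoc and the design document name "search" are hypothetical and are not part of couchdb-lucene.

```javascript
// Index functions written as ordinary JavaScript so they can be edited and linted.
// Document is supplied by couchdb-lucene at indexing time; these functions are
// only serialised here, never called.
var bySubject = function (doc) {
  var ret = new Document();
  ret.add(doc.subject);
  return ret;
};

var byContent = function (doc) {
  var ret = new Document();
  ret.add(doc.content);
  return ret;
};

// Hypothetical helper: builds a design document with a "fulltext" section,
// turning each index function into its source text.
function buildDesignDoc(name, views) {
  var fulltext = {};
  Object.keys(views).forEach(function (viewName) {
    fulltext[viewName] = {
      defaults: views[viewName].defaults,
      index: views[viewName].index.toString()
    };
  });
  return { _id: "_design/" + name, fulltext: fulltext };
}

var designDoc = buildDesignDoc("search", {
  by_subject: { defaults: { store: "yes" }, index: bySubject },
  by_content: { defaults: { store: "no" }, index: byContent }
});

// The printed JSON is what would be saved to CouchDB as the design document.
console.log(JSON.stringify(designDoc, null, 2));
```

The functions are kept anonymous so that the serialised text has the same function(doc) form as the hand-written example.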
A fulltext object contains multiple index view declarations. An index view consists of:

<dl>
<dt>defaults</dt><dd>The default for numerous indexing options can be overridden here. A full list of options follows.</dd>
<dt>index</dt><dd>The indexing function itself, documented below.</dd>
</dl>

<h3>The Defaults Object</h3>

The following indexing options can be defaulted:

<table>
  <tr>
    <th>name</th>
    <th>description</th>
    <th>available options</th>
    <th>default</th>
  </tr>
  <tr>
    <th>field</th>
    <td>the field name to index under</td>
    <td>user-defined</td>
    <td>default</td>
  </tr>
  <tr>
    <th>store</th>
    <td>whether the data is stored. The value will be returned in the search result.</td>
    <td>yes, no</td>
    <td>no</td>
  </tr>
  <tr>
    <th>index</th>
    <td>whether (and how) the data is indexed</td>
    <td>analyzed, analyzed_no_norms, no, not_analyzed, not_analyzed_no_norms</td>
    <td>analyzed</td>
  </tr>
</table>

<h3>The Document class</h3>

You may construct a new Document instance with:

<pre>
var doc = new Document();
</pre>

Data may be added to this document with the add method, which takes an optional second object argument that can override any of the above default values.

The data is usually interpreted as a String, but couchdb-lucene provides special handling if a JavaScript Date object is passed. Specifically, the date is indexed as a numeric value, which allows correct sorting, and stored (if requested) in ISO 8601 format (with a timezone marker).

<pre>
// Add with all the defaults.
doc.add("value");

// Add a subject field.
doc.add("this is the subject line.", {"field":"subject"});

// Add but ensure it's stored.
doc.add("value", {"store":"yes"});

// Add but don't analyze.
doc.add("don't analyze me", {"index":"not_analyzed"});

// Extract text from the named attachment and index it (but not store it).
doc.attachment("attachment name", {"field":"attachments"});
</pre>

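To try an index function outside couchdb-lucene, one option is a small stand-in for the Document class that only records what was added. The sketch below is an assumption-laden illustration for a modern runtime such as Node.js, not the real implementation. It assumes that options resolve in the order built-in defaults, then the view's defaults object, then the per-call options, as the text above describes. The names BUILT_IN_DEFAULTS, viewDefaults and the document attribute sent_ms are hypothetical.

```javascript
// Built-in defaults, copied from the table in "The Defaults Object".
var BUILT_IN_DEFAULTS = { field: "default", store: "no", index: "analyzed" };

// Hypothetical: the "defaults" object of the view being tested.
var viewDefaults = {};

// Stand-in for couchdb-lucene's Document. It records fields; it indexes nothing.
function Document() {
  this.fields = [];
}

Document.prototype.add = function (value, options) {
  // Later sources override earlier ones: built-in, then view, then per-call.
  var settings = Object.assign({}, BUILT_IN_DEFAULTS, viewDefaults, options || {});
  this.fields.push({
    value: value,
    isDate: value instanceof Date, // Dates get special handling in couchdb-lucene
    settings: settings
  });
};

// An index function written as it would appear in a design document.
// sent_ms is assumed to hold milliseconds since the epoch.
var indexFn = function (doc) {
  var ret = new Document();
  ret.add(doc.subject, { field: "subject" });
  ret.add(new Date(doc.sent_ms), { field: "sent", store: "yes" });
  return ret;
};

viewDefaults = { store: "no" };
var result = indexFn({ subject: "hello", sent_ms: 1235908800000 });

result.fields.forEach(function (f) {
  console.log(f.settings.field, f.settings.store, f.settings.index, f.isDate);
});
```

The Date is built from a numeric timestamp rather than an ISO string because older JavaScript engines may not parse ISO 8601 strings in the Date constructor.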
<h3>Example Transforms</h3>

<h4>Index Everything</h4>

<pre>
function(doc) {
    var ret = new Document();

    function idx(obj) {
        for (var key in obj) {
            switch (typeof obj[key]) {
            case 'object':
                idx(obj[key]);
                break;
            case 'function':
                break;
            default:
                ret.add(obj[key]);
                break;
            }
        }
    };

    idx(doc);

    if (doc._attachments) {
        for (var i in doc._attachments) {
            ret.attachment("attachment", i);
        }
    }

    return ret;
}
</pre>

<h4>Index Nothing</h4>

<pre>
function(doc) {
    return null;
}
</pre>

<h4>Index Select Fields</h4>

<pre>
function(doc) {
    var result = new Document();
    result.add(doc.subject, {"field":"subject", "store":"yes"});
    result.add(doc.content, {"field":"subject"});
    // Record the time of indexing; Dates are indexed as numeric values.
    result.add(new Date(), {"field":"indexed_at"});
    return result;
}
</pre>

<h4>Index Attachments</h4>

<pre>
function(doc) {
    var result = new Document();
    for(var a in doc._attachments) {
        result.add_attachment(a, {"field":"attachment"});
    }
    return result;
}
</pre>

<h4>A More Complex Example</h4>

<pre>
function(doc) {
    // Build one Document per value, filed under its group name.
    var mk = function(name, value, group) {
        var ret = new Document();
        ret.add(value, {"field": group, "store":"yes"});
        ret.add(group, {"field":"group", "store":"yes"});
        return ret;
    };
    var ret = [];
    if(doc.type != "reference") return null;
    for(var g in doc.groups) {
        ret.push(mk("library", doc.groups[g].library, g));
        ret.push(mk("method", doc.groups[g].method, g));
        ret.push(mk("target", doc.groups[g].target, g));
    }
    return ret;
}
</pre>

<h2>Attachment Indexing</h2>

Couchdb-lucene uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, assuming the correct content_type is set in CouchDB:

<h3>Supported Formats</h3>

<ul>
<li>Excel spreadsheets (application/vnd.ms-excel)
<li>Word documents (application/msword)
<li>Powerpoint presentations (application/vnd.ms-powerpoint)
<li>Visio (application/vnd.visio)
<li>Outlook (application/vnd.ms-outlook)
<li>XML (application/xml)
<li>HTML (text/html)
<li>Images (image/*)
<li>Java class files
<li>Java jar archives
<li>MP3 (audio/mp3)
<li>OpenDocument (application/vnd.oasis.opendocument.*)
<li>Plain text (text/plain)
<li>PDF (application/pdf)
<li>RTF (application/rtf)
</ul>

<h1>Searching with couchdb-lucene</h1>

You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The _body field is searched by default; it includes the extracted text from all attachments. The following parameters can be passed for more sophisticated searches:

<dl>
<dt>q</dt><dd>the query to run (e.g. subject:hello). If the query does not name a field, the default field is searched.</dd>
<dt>sort</dt><dd>the comma-separated fields to sort on. Prefix with / for ascending order and \ for descending order (ascending is the default if not specified).</dd>
<dt>limit</dt><dd>the maximum number of results to return</dd>
<dt>skip</dt><dd>the number of results to skip</dd>
<dt>include_docs</dt><dd>whether to include the source docs</dd>
<dt>stale=ok</dt><dd>If you set the <i>stale</i> option to <i>ok</i>, couchdb-lucene may not perform any refreshing on the index. Searches may be faster as Lucene caches important data (especially for sorting). A query without stale=ok will use the latest data committed to the index.</dd>
<dt>debug</dt><dd>if false, a normal application/json response with results is returned. If true, a pretty-printed HTML blob is returned instead.</dd>
<dt>rewrite</dt><dd>(EXPERT) if true, returns a JSON response with a rewritten query and term frequencies. This allows correct distributed scoring when combining the results from multiple nodes.</dd>
</dl>

<i>All parameters except 'q' are optional.</i>

<h2>Special Fields</h2>

<dl>
<dt>_db</dt><dd>The source database of the document.</dd>
<dt>_id</dt><dd>The _id of the document.</dd>
</dl>

<h2>Dublin Core</h2>

All Dublin Core attributes are indexed and stored if detected in the attachment. Descriptions of the fields come from the Tika javadocs.

<dl>
<dt>_dc.contributor</dt><dd>An entity responsible for making contributions to the content of the resource.</dd>
<dt>_dc.coverage</dt><dd>The extent or scope of the content of the resource.</dd>
<dt>_dc.creator</dt><dd>An entity primarily responsible for making the content of the resource.</dd>
<dt>_dc.date</dt><dd>A date associated with an event in the life cycle of the resource.</dd>
<dt>_dc.description</dt><dd>An account of the content of the resource.</dd>
<dt>_dc.format</dt><dd>Typically, Format may include the media-type or dimensions of the resource.</dd>
<dt>_dc.identifier</dt><dd>Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.</dd>
<dt>_dc.language</dt><dd>A language of the intellectual content of the resource.</dd>
<dt>_dc.modified</dt><dd>Date on which the resource was changed.</dd>
<dt>_dc.publisher</dt><dd>An entity responsible for making the resource available.</dd>
<dt>_dc.relation</dt><dd>A reference to a related resource.</dd>
<dt>_dc.rights</dt><dd>Information about rights held in and over the resource.</dd>
<dt>_dc.source</dt><dd>A reference to a resource from which the present resource is derived.</dd>
<dt>_dc.subject</dt><dd>The topic of the content of the resource.</dd>
<dt>_dc.title</dt><dd>A name given to the resource.</dd>
<dt>_dc.type</dt><dd>The nature or genre of the content of the resource.</dd>
</dl>

<h2>Examples</h2>

<pre>
http://localhost:5984/dbname/_fti/design_doc/view_name?q=field_name:value
http://localhost:5984/dbname/_fti/design_doc/view_name?q=field_name:value&sort=other_field
http://localhost:5984/dbname/_fti/design_doc/view_name?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
</pre>

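The last example URL contains spaces and brackets, and a descending sort uses a backslash prefix, so a client that builds URLs in code will usually want to percent-encode each parameter. The following is a minimal sketch of one way to do that. The helper buildSearchUrl is hypothetical and not part of couchdb-lucene; the database, design document and view names are the placeholders from the examples above.

```javascript
// Hypothetical helper: builds a search URL of the form
// base/dbname/_fti/design_doc/view_name?key=value&...
function buildSearchUrl(base, db, designDoc, view, params) {
  var path = [db, "_fti", designDoc, view].map(encodeURIComponent).join("/");
  var query = Object.keys(params)
    .map(function (key) {
      return encodeURIComponent(key) + "=" + encodeURIComponent(String(params[key]));
    })
    .join("&");
  return base + "/" + path + "?" + query;
}

var url = buildSearchUrl("http://localhost:5984", "dbname", "design_doc", "view_name", {
  q: "body:document AND customer:[A TO C]",
  sort: "\\billing_size", // a single backslash prefix: descending order
  limit: 10,
  stale: "ok"
});

console.log(url);
```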
<h2>Search Results Format</h2>

The search result contains a number of fields at the top level, in addition to your search results.

<dl>
<dt>q</dt><dd>The query that was executed.</dd>
<dt>etag</dt><dd>An opaque token that reflects the current version of the index. This value is also returned in an ETag header to facilitate HTTP caching.</dd>
<dt>skip</dt><dd>The number of initial matches that were skipped.</dd>
<dt>limit</dt><dd>The maximum number of results that can appear.</dd>
<dt>total_rows</dt><dd>The total number of matches for this query.</dd>
<dt>search_duration</dt><dd>The number of milliseconds spent performing the search.</dd>
<dt>fetch_duration</dt><dd>The number of milliseconds spent retrieving the documents.</dd>
<dt>rows</dt><dd>The search results array, described below.</dd>
</dl>

<h2>The search results array</h2>

The search results array consists of zero or more objects with the following fields:

<dl>
<dt>id</dt><dd>The unique identifier for this match.</dd>
<dt>score</dt><dd>The normalized score (0.0-1.0, inclusive) for this match</dd>
<dt>fields</dt><dd>All the fields that were stored with this match</dd>
<dt>doc</dt><dd>The original document from couch, if requested with include_docs=true</dd>
</dl>

Here's an example of a JSON response without sorting:

<pre>
{
  "q": "+content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "search_duration": 518,
  "fetch_duration": 4,
  "rows": [
    {
      "id": "hain-m-all_documents-257.",
      "score": 1.601625680923462
    },
    {
      "id": "hain-m-notes_inbox-257.",
      "score": 1.601625680923462
    }
  ]
}
</pre>

And the same with sorting:

<pre>
{
  "q": "+content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "search_duration": 660,
  "fetch_duration": 4,
  "sort_order": [
    {
      "field": "source",
      "reverse": false,
      "type": "string"
    },
    {
      "reverse": false,
      "type": "doc"
    }
  ],
  "rows": [
    {
      "id": "shankman-j-inbox-105.",
      "score": 0.6131107211112976,
      "sort_order": [
        "enron",
        6
      ]
    },
    {
      "id": "shankman-j-inbox-8.",
      "score": 0.7492915391921997,
      "sort_order": [
        "enron",
        7
      ]
    },
    {
      "id": "shankman-j-inbox-30.",
      "score": 0.507369875907898,
      "sort_order": [
        "enron",
        8
      ]
    }
  ]
}
</pre>

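A client that pages through results can combine skip, the number of rows returned, and total_rows to decide whether another request is needed. The sketch below illustrates this on an abbreviated copy of the unsorted response above. The helper summarise is hypothetical; a real client would obtain the response over HTTP rather than from a literal.

```javascript
// Abbreviated copy of the unsorted example response.
var response = {
  q: "+content:enron",
  skip: 0,
  limit: 2,
  total_rows: 176852,
  rows: [
    { id: "hain-m-all_documents-257.", score: 1.601625680923462 },
    { id: "hain-m-notes_inbox-257.", score: 1.601625680923462 }
  ]
};

// Hypothetical helper: collects the ids on this page and works out
// the skip value for the next page, if there is one.
function summarise(res) {
  var nextSkip = res.skip + res.rows.length;
  return {
    ids: res.rows.map(function (row) { return row.id; }),
    hasMore: nextSkip < res.total_rows,
    nextSkip: nextSkip
  };
}

var page = summarise(response);
console.log(page.ids.join(", "));
console.log("more results:", page.hasMore, "next skip:", page.nextSkip);
```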
<h1>Fetching information about the index</h1>

Calling couchdb-lucene without arguments returns a JSON object with information about the <i>whole</i> index.

<pre>
http://127.0.0.1:5984/enron/_fti
</pre>

returns:

<pre>
{"doc_count":517350,"doc_del_count":1,"disk_size":318543045}
</pre>

<h1>Working With The Source</h1>

To develop "live", type "mvn dependency:unpack-dependencies" and change the external line to something like this:

<pre>
fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
/path/to/couchdb-lucene/target/dependency com.github.rnewson.couchdb.lucene.Main
</pre>

You will need to restart CouchDB if you change the couchdb-lucene source code, but this is very fast.

<h1>Configuration</h1>

couchdb-lucene respects several system properties:

<dl>
<dt>couchdb.url</dt><dd>the URL to contact CouchDB with (default is "http://localhost:5984")</dd>
<dt>couchdb.lucene.dir</dt><dd>specify the path to the Lucene indexes (the default is to make a directory called 'lucene' relative to CouchDB's current working directory).</dd>
<dt>couchdb.log.dir</dt><dd>specify the directory of the log file (which is called couchdb-lucene.log); defaults to the platform-specific temp directory.</dd>
<dt>couchdb.lucene.operator</dt><dd>specify the default boolean operator for queries. If not specified, the default is "OR". You can specify either "OR" or "AND".</dd>
</dl>

You can override these properties like this:

<pre>
fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
-cp /home/rnewson/Source/couchdb-lucene/target/classes:\
/home/rnewson/Source/couchdb-lucene/target/dependency\
com.github.rnewson.couchdb.lucene.Main
</pre>

<h2>Basic Authentication</h2>

If you put couchdb behind an authenticating proxy you can still configure couchdb-lucene to pull from it by specifying additional system properties. Currently only Basic authentication is supported.

<dl>
<dt>couchdb.user</dt><dd>the user to authenticate as.</dd>
<dt>couchdb.password</dt><dd>the password to authenticate with.</dd>
</dl>

<h2>IPv6</h2>

The default for couchdb.url is problematic on an IPv6 system. Specify -Dcouchdb.url=http://[::1]:5984 to resolve it.