<h1>News</h1>

The indexing API in 0.3 will change once again to allow multiple design documents and "views" into Lucene. It will also move much of the Lucene-specific configuration into an options object. Please read the TODO for details.

The indexing API in 0.2 has completely changed. Please re-read this document and report any surprises or bugs to the bug tracker.

Issue tracking is now available at <a href="http://rnewson.lighthouseapp.com/projects/27420-couchdb-lucene">lighthouseapp</a>.

<h1>System Requirements</h1>

Sun JDK 5 or higher is required. Couchdb-lucene is known to be incompatible with OpenJDK, which includes an earlier, incompatible version of the Rhino JavaScript library.

<h1>Build couchdb-lucene</h1>

<ol>
<li>Install Maven 2.
<li>Check out the repository.
<li>Type 'mvn'.
<li>Configure CouchDB (see below).
</ol>

<h1>Configure CouchDB</h1>

<pre>
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
</pre>

<h1>Indexing Strategy</h1>

<h2>Document Indexing</h2>

You must supply an index function to enable couchdb-lucene; by default, nothing is indexed.

You may add any number of index views in any number of design documents. All searches will be constrained to documents emitted by those view functions.

Declare your functions as follows:

<pre>
{
  "views": {
    <i>conventional view code goes here</i>
  },
  "fulltext": {
    "by_subject": {
      "defaults": { "store":"yes" },
      "index":"function(doc) { var ret=new Document(); ret.add(doc.subject); return ret; }"
    },
    "french_documents": {
      "defaults": { "language":"fr" },
      "index":"function(doc) { if (doc.language != \"fr\") { return null; } var ret=new Document(); <i>etc</i> return ret; }"
    }
  }
}
</pre>
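
Because each index value is a JSON string, any double quotes inside the function body must be escaped. One way to avoid escaping by hand is sketched below. It assumes you assemble the design document in JavaScript before uploading it; the <code>_design/search</code> id and the variable names are illustrative and not part of couchdb-lucene.

```javascript
// Hypothetical approach: write the index function as ordinary JavaScript and
// serialize it, instead of hand-escaping quotes inside a JSON string.
// Document is supplied by couchdb-lucene at indexing time, so the function
// is only converted to text here and never called.
var frenchOnly = function(doc) {
  if (doc.language != "fr") { return null; }
  var ret = new Document();
  ret.add(doc.subject);
  return ret;
};

var designDoc = {
  _id: "_design/search", // illustrative id
  fulltext: {
    french_documents: {
      defaults: { language: "fr" },
      index: frenchOnly.toString() // source text of the function
    }
  }
};

// JSON.stringify escapes the inner double quotes for us.
var json = JSON.stringify(designDoc, null, 2);
console.log(json);
```

The printed JSON can then be saved as the design document by whatever means you normally use.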

A fulltext object contains multiple index view declarations. An index view consists of:

<dl>
<dt>defaults</dt><dd>The default for numerous indexing options can be overridden here. A full list of options follows.</dd>
<dt>index</dt><dd>The indexing function itself, documented below.</dd>
</dl>

<h3>The Defaults Object</h3>

The following indexing options can be defaulted:

<table>
  <tr>
    <th>name</th>
    <th>description</th>
    <th>available options</th>
    <th>default</th>
  </tr>
  <tr>
    <th>field</th>
    <td>the field name to index under</td>
    <td>user-defined</td>
    <td>default</td>
  </tr>
  <tr>
    <th>store</th>
    <td>whether the data is stored</td>
    <td>yes, no</td>
    <td>no</td>
  </tr>
  <tr>
    <th>index</th>
    <td>whether (and how) the data is indexed</td>
    <td>analyzed, analyzed_no_norms, no, not_analyzed, not_analyzed_no_norms</td>
    <td>analyzed</td>
  </tr>
  <tr>
    <th>analyzer</th>
    <td>how the data is analyzed</td>
    <td>simple, standard</td>
    <td>standard</td>
  </tr>
  <tr>
    <th>language</th>
    <td>which language the data is in</td>
    <td>br, cjk, cn, cz, de, el, en, fr, nl, ru, th</td>
    <td>en</td>
  </tr>
</table>

<h3>The Document class</h3>

You may construct a new Document instance with:

<pre>
var doc = new Document();
</pre>

Data may be added to this document with the add method, which takes an optional second object argument that can override any of the above default values.

<pre>
// Add with all the defaults.
doc.add("value");

// Add a subject field.
doc.add("this is the subject line.", {"field":"subject"});

// Add but ensure it's stored.
doc.add("value", {"store":"yes"});

// Add but don't analyze.
doc.add("don't analyze me", {"index":"not_analyzed"});

// Extract text from the named attachment and index it (but not store it).
doc.attachment("attachment name", {"field":"attachments"});

// Interpret "value" as a date using the default date formats.
doc.add("2009-01-01T00:00:00Z", {"type":"date"});

// Interpret "value" as a date using the supplied format string
// (see Java's SimpleDateFormat class for the syntax).
doc.add("2009-01-01", {"type":"date", "format":"yyyy-MM-dd"});

// Interpret "value" as a number.
doc.add("100", {"type":"number"});
</pre>
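
The precedence implied above (built-in defaults, then the view's defaults object, then the options passed to an individual add call) can be sketched as a simple merge. This is an illustration only: <code>resolveOptions</code> and <code>BUILT_IN</code> are hypothetical names, and couchdb-lucene's actual merging code is not shown in this document.

```javascript
// Illustrative only; not couchdb-lucene's implementation.
// BUILT_IN mirrors the "default" column of the options table above.
var BUILT_IN = {
  field: "default",
  store: "no",
  index: "analyzed",
  analyzer: "standard",
  language: "en"
};

// Later sources win: built-in values, then the view's "defaults" object,
// then the options passed to a single add() call.
function resolveOptions(viewDefaults, callOptions) {
  var resolved = {};
  [BUILT_IN, viewDefaults || {}, callOptions || {}].forEach(function (source) {
    Object.keys(source).forEach(function (key) {
      resolved[key] = source[key];
    });
  });
  return resolved;
}

var viewDefaults = { store: "yes" };
var subject = resolveOptions(viewDefaults, { field: "subject" });
var plain = resolveOptions(viewDefaults);
console.log(subject);
console.log(plain);
```

Here <code>subject</code> picks up the field name from the call and the store setting from the view, while everything else falls back to the built-in values.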

<h3>Example Transforms</h3>

<h4>Index Everything</h4>

<pre>
function(doc) {
  var ret = new Document();

  function idx(obj) {
    for (var key in obj) {
      switch (typeof obj[key]) {
      case 'object':
        idx(obj[key]);
        break;
      case 'function':
        break;
      default:
        ret.field(key, obj[key]);
        /* Uncomment next line to include
         * all attributes into a single field.
         */
        // ret.field("all", obj[key]);
        break;
      }
    }
  }

  // Index all attributes
  idx(doc);

  // Index all attachments
  for (var a in doc._attachments) {
    ret.attachment("attachment", a);
  }

  return ret;
}
</pre>

<h4>Index Nothing</h4>

<pre>
function(doc) {
  return null;
}
</pre>

<h4>Index Select Fields</h4>

<pre>
function(doc) {
  var result = new Document();
  result.field("subject", doc.subject, "yes");
  result.field("content", doc.content);
  result.date("indexed_at", new Date());
  return result;
}
</pre>

<h4>Index Attachments</h4>

<pre>
function(doc) {
  var result = new Document();
  for (var a in doc._attachments) {
    result.attachment("attachment", a);
  }
  return result;
}
</pre>

<h4>A More Complex Example</h4>

<pre>
function(doc) {
  var mk = function(name, value, group) {
    var ret = new Document(name, value, "yes");
    ret.field("group", group, "yes");
    return ret;
  };
  var ret = [];
  if (doc.type != "reference") return null;
  for (var g in doc.groups) {
    ret.push(mk("library", doc.groups[g].library, g));
    ret.push(mk("method", doc.groups[g].method, g));
    ret.push(mk("target", doc.groups[g].target, g));
  }
  return ret;
}
</pre>
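
To dry-run a transform outside couchdb-lucene, one option is a stand-in Document that merely records the calls made on it. The sketch below is an assumption-laden illustration: this <code>Document</code> is not the real class, it performs no indexing, and the third argument to <code>field</code> is assumed to be the store flag.

```javascript
// Stand-in for couchdb-lucene's Document, for dry runs only. It records the
// calls a transform makes; it does no indexing and is not the real class.
function Document() {
  this.calls = [];
}
// Third argument is assumed to be the store flag ("yes" or omitted).
Document.prototype.field = function (name, value, store) {
  this.calls.push({ method: "field", name: name, value: value, store: store });
};
Document.prototype.attachment = function (name, attachmentName) {
  this.calls.push({ method: "attachment", name: name, attachment: attachmentName });
};

// The "Index Select Fields" transform from above, minus the date call.
var transform = function (doc) {
  var result = new Document();
  result.field("subject", doc.subject, "yes");
  result.field("content", doc.content);
  return result;
};

var out = transform({ _id: "example", subject: "hello", content: "world" });
console.log(out.calls);
```

Inspecting <code>out.calls</code> shows which fields a given CouchDB document would contribute before you install the function in a design document.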

<h2>Attachment Indexing</h2>

Couchdb-lucene uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, assuming the correct content_type is set in CouchDB:

<h3>Supported Formats</h3>

<ul>
<li>Excel spreadsheets (application/vnd.ms-excel)
<li>Word documents (application/msword)
<li>PowerPoint presentations (application/vnd.ms-powerpoint)
<li>Visio (application/vnd.visio)
<li>Outlook (application/vnd.ms-outlook)
<li>XML (application/xml)
<li>HTML (text/html)
<li>Images (image/*)
<li>Java class files
<li>Java jar archives
<li>MP3 (audio/mp3)
<li>OpenDocument (application/vnd.oasis.opendocument.*)
<li>Plain text (text/plain)
<li>PDF (application/pdf)
<li>RTF (application/rtf)
</ul>

<h1>Searching with couchdb-lucene</h1>

You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The _body field is searched by default; it includes the extracted text from all attachments. The following parameters can be passed for more sophisticated searches:

<dl>
<dt>q</dt><dd>the query to run (e.g., subject:hello)</dd>
<dt>sort</dt><dd>the comma-separated fields to sort on. Prefix with / for ascending order and \ for descending order (ascending is the default if not specified).</dd>
<dt>limit</dt><dd>the maximum number of results to return</dd>
<dt>skip</dt><dd>the number of results to skip</dd>
<dt>include_docs</dt><dd>whether to include the source docs</dd>
<dt>stale=ok</dt><dd>If you set the <i>stale</i> option to <i>ok</i>, couchdb-lucene may not perform any refreshing on the index. Searches may be faster as Lucene caches important data (especially for sorting). A query without stale=ok will use the latest data committed to the index.</dd>
<dt>debug</dt><dd>If false, a normal application/json response with results is returned. If true, a pretty-printed HTML blob is returned instead.</dd>
<dt>rewrite</dt><dd>(EXPERT) If true, returns a JSON response with a rewritten query and term frequencies. This allows correct distributed scoring when combining the results from multiple nodes.</dd>
</dl>

<i>All parameters except 'q' are optional.</i>

<h2>Special Fields</h2>

<dl>
<dt>_db</dt><dd>The source database of the document.</dd>
<dt>_id</dt><dd>The _id of the document.</dd>
</dl>

<h2>Dublin Core</h2>

All Dublin Core attributes are indexed and stored if detected in the attachment. Descriptions of the fields come from the Tika javadocs.

<dl>
<dt>dc.contributor</dt><dd>An entity responsible for making contributions to the content of the resource.</dd>
<dt>dc.coverage</dt><dd>The extent or scope of the content of the resource.</dd>
<dt>dc.creator</dt><dd>An entity primarily responsible for making the content of the resource.</dd>
<dt>dc.date</dt><dd>A date associated with an event in the life cycle of the resource.</dd>
<dt>dc.description</dt><dd>An account of the content of the resource.</dd>
<dt>dc.format</dt><dd>Typically, Format may include the media-type or dimensions of the resource.</dd>
<dt>dc.identifier</dt><dd>Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.</dd>
<dt>dc.language</dt><dd>A language of the intellectual content of the resource.</dd>
<dt>dc.modified</dt><dd>Date on which the resource was changed.</dd>
<dt>dc.publisher</dt><dd>An entity responsible for making the resource available.</dd>
<dt>dc.relation</dt><dd>A reference to a related resource.</dd>
<dt>dc.rights</dt><dd>Information about rights held in and over the resource.</dd>
<dt>dc.source</dt><dd>A reference to a resource from which the present resource is derived.</dd>
<dt>dc.subject</dt><dd>The topic of the content of the resource.</dd>
<dt>dc.title</dt><dd>A name given to the resource.</dd>
<dt>dc.type</dt><dd>The nature or genre of the content of the resource.</dd>
</dl>

<h2>Examples</h2>

<pre>
http://localhost:5984/dbname/_fti?q=field_name:value
http://localhost:5984/dbname/_fti?q=field_name:value&sort=other_field
http://localhost:5984/dbname/_fti?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
</pre>
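
The last URL above contains spaces and brackets, which most HTTP clients will not send as-is. A minimal sketch of building the same request with percent-encoded values follows; <code>buildSearchUrl</code> is a hypothetical helper and not part of couchdb-lucene.

```javascript
// Hypothetical helper: build a search URL, percent-encoding each parameter
// name and value so spaces, colons and brackets survive transport.
function buildSearchUrl(base, db, params) {
  var query = Object.keys(params).map(function (name) {
    return encodeURIComponent(name) + "=" + encodeURIComponent(params[name]);
  }).join("&");
  return base + "/" + encodeURIComponent(db) + "/_fti?" + query;
}

var url = buildSearchUrl("http://localhost:5984", "dbname", {
  q: "body:document AND customer:[A TO C]",
  sort: "billing_size",
  debug: "true"
});
console.log(url);
```

Browsers usually encode the address bar for you, so this matters mainly when calling the handler from scripts.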

<h2>Search Results Format</h2>

Here's an example of a JSON response without sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "search_duration": 518,
  "fetch_duration": 4,
  "rows": [
    {
      "_id": "hain-m-all_documents-257.",
      "score": 1.601625680923462
    },
    {
      "_id": "hain-m-notes_inbox-257.",
      "score": 1.601625680923462
    }
  ]
}
</pre>

And the same with sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "search_duration": 660,
  "fetch_duration": 4,
  "sort_order": [
    {
      "field": "source",
      "reverse": false,
      "type": "string"
    },
    {
      "reverse": false,
      "type": "doc"
    }
  ],
  "rows": [
    {
      "_id": "shankman-j-inbox-105.",
      "score": 0.6131107211112976,
      "sort_order": [
        "enron",
        6
      ]
    },
    {
      "_id": "shankman-j-inbox-8.",
      "score": 0.7492915391921997,
      "sort_order": [
        "enron",
        7
      ]
    },
    {
      "_id": "shankman-j-inbox-30.",
      "score": 0.507369875907898,
      "sort_order": [
        "enron",
        8
      ]
    }
  ]
}
</pre>
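
A client might consume such a response as sketched below. The sample is the sorted response above with the timing and top-level sort_order fields omitted. <code>nextSkip</code> is a hypothetical helper, and it assumes that total_rows counts every match rather than only the rows returned.

```javascript
// Sample taken from the sorted response above (timing fields omitted).
var response = {
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "rows": [
    { "_id": "shankman-j-inbox-105.", "score": 0.6131107211112976, "sort_order": ["enron", 6] },
    { "_id": "shankman-j-inbox-8.", "score": 0.7492915391921997, "sort_order": ["enron", 7] },
    { "_id": "shankman-j-inbox-30.", "score": 0.507369875907898, "sort_order": ["enron", 8] }
  ]
};

// Collect the matching document ids in result order.
var ids = response.rows.map(function (row) { return row._id; });

// Hypothetical paging helper: the skip value for the next page, or null when
// no rows remain. Assumes total_rows is the total number of matches.
function nextSkip(res) {
  var next = res.skip + res.rows.length;
  return next < res.total_rows ? next : null;
}

console.log(ids);
console.log(nextSkip(response));
```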

<h1>Fetching information about the index</h1>

Calling couchdb-lucene without arguments returns a JSON object with information about the index.

<pre>
http://127.0.0.1:5984/enron/_fti
</pre>

returns:

<pre>
{"doc_count":517350,"doc_del_count":1,"disk_size":318543045}
</pre>

<h1>Working With The Source</h1>

To develop "live", type "mvn dependency:unpack-dependencies" and change the external line to something like this:

<pre>
fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
/path/to/couchdb-lucene/target/dependency com.github.rnewson.couchdb.lucene.Main
</pre>

You will need to restart CouchDB if you change the couchdb-lucene source code, but this is very fast.

<h1>Configuration</h1>

couchdb-lucene respects several system properties:

<dl>
<dt>couchdb.url</dt><dd>the URL to contact CouchDB with (default is "http://localhost:5984")</dd>
<dt>couchdb.lucene.dir</dt><dd>the path to the Lucene indexes (the default is to make a directory called 'lucene' relative to CouchDB's current working directory).</dd>
<dt>couchdb.log.dir</dt><dd>the directory of the log file (which is called couchdb-lucene.log); defaults to the platform-specific temp directory.</dd>
</dl>

You can override these properties like this:

<pre>
fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
-cp /home/rnewson/Source/couchdb-lucene/target/classes:\
/home/rnewson/Source/couchdb-lucene/target/dependency \
com.github.rnewson.couchdb.lucene.Main
</pre>

<h2>Basic Authentication</h2>

If you put CouchDB behind an authenticating proxy, you can still configure couchdb-lucene to pull from it by specifying additional system properties. Currently only Basic authentication is supported.

<dl>
<dt>couchdb.user</dt><dd>the user to authenticate as.</dd>
<dt>couchdb.password</dt><dd>the password to authenticate with.</dd>
</dl>

<h2>IPv6</h2>

The default for couchdb.url is problematic on an IPv6 system. Specify -Dcouchdb.url=http://[::1]:5984 to resolve it.