<h1>News</h1>

The indexing API in 0.3 will change once again to allow multiple design documents and "views" into Lucene. It will also move much of the Lucene-specific configuration into an options object. Please read the TODO for details.

The indexing API in 0.2 has completely changed. Please re-read this document and report any surprises or bugs to the bug tracker.

Issue tracking is now available at <a href="http://rnewson.lighthouseapp.com/projects/27420-couchdb-lucene">lighthouseapp</a>.

<h1>System Requirements</h1>

Sun JDK 5 or higher is required. Couchdb-lucene is known to be incompatible with OpenJDK, which bundles an earlier, incompatible version of the Rhino JavaScript library.

<h1>Build couchdb-lucene</h1>

<ol>
<li>Install Maven 2.
<li>Check out the repository.
<li>Run 'mvn'.
<li>Configure CouchDB (see below).
</ol>

<h1>Configure CouchDB</h1>

<pre>
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
</pre>

<h1>Indexing Strategy</h1>

<h2>Document Indexing</h2>

You must supply an index function to enable couchdb-lucene; by default, nothing is indexed.

You may add any number of index views in any number of design documents. All searches will be constrained to documents emitted by those view functions.

Declare your functions as follows:

<pre>
{
  "views": {
    <i>conventional view code goes here</i>
  },
  "fulltext": {
    "by_subject": {
      "defaults": { "store":"yes" },
      "index":"function(doc) { var ret=new Document(); ret.add(doc.subject); return ret }"
    },
    "french_documents": {
      "defaults": { "language":"fr" },
      "index":"function(doc) { if (doc.language != \"fr\") { return null;} var ret=new Document(); <i>etc</i> return ret; }"
    }
  }
}
</pre>

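Because each index function is stored as a JSON string, any double quotes inside it must be escaped. One way to avoid escaping by hand is to write the function as ordinary JavaScript and let <code>JSON.stringify</code> do the work. The following is only a sketch: the helper name <code>buildFulltext</code> and the design document id <code>_design/search</code> are assumptions made for illustration, not part of couchdb-lucene.

```javascript
// Hypothetical helper: convert real functions into the string form
// that the "fulltext" object of a design document expects.
function buildFulltext(views) {
  const fulltext = {};
  for (const [name, view] of Object.entries(views)) {
    fulltext[name] = {
      defaults: view.defaults || {},
      // Function.prototype.toString returns the function's source text.
      index: view.index.toString(),
    };
  }
  return fulltext;
}

const designDoc = {
  _id: "_design/search", // assumed id, pick your own
  fulltext: buildFulltext({
    french_documents: {
      defaults: { language: "fr" },
      // Document is supplied by couchdb-lucene when the function runs there;
      // it is never called in this script.
      index: function (doc) {
        if (doc.language != "fr") { return null; }
        var ret = new Document();
        ret.add(doc.subject);
        return ret;
      },
    },
  }),
};

// JSON.stringify escapes the inner quotes and newlines.
const json = JSON.stringify(designDoc, null, 2);
console.log(json);
```

The printed JSON can then be saved to CouchDB as the design document.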
A fulltext object contains multiple index view declarations. An index view consists of:

<dl>
<dt>defaults</dt><dd>Overrides the defaults for the indexing options. A full list of options follows.</dd>
<dt>index</dt><dd>The indexing function itself, documented below.</dd>
</dl>

<h3>The Defaults Object</h3>

Defaults can be set for the following indexing options:

<table>
<tr>
<th>name</th>
<th>description</th>
<th>available options</th>
<th>default</th>
</tr>
<tr>
<th>field</th>
<td>the field name to index under</td>
<td>user-defined</td>
<td>default</td>
</tr>
<tr>
<th>store</th>
<td>whether the data is stored</td>
<td>yes, no</td>
<td>no</td>
</tr>
<tr>
<th>index</th>
<td>whether (and how) the data is indexed</td>
<td>analyzed, analyzed_no_norms, no, not_analyzed, not_analyzed_no_norms</td>
<td>analyzed</td>
</tr>
<tr>
<th>analyzer</th>
<td>how the data is analyzed</td>
<td>simple, standard</td>
<td>standard</td>
</tr>
<tr>
<th>language</th>
<td>which language the data is in</td>
<td>br, cjk, cn, cz, de, el, en, fr, nl, ru, th</td>
<td>en</td>
</tr>
</table>

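A typo in a defaults object is easy to miss, so it can help to check one against the table before saving the design document. The sketch below is a hypothetical helper, not part of couchdb-lucene; its allowed values are transcribed from the table above, and it knows nothing about options that are not listed there.

```javascript
// Allowed values, transcribed from the options table.
// "field" is user-defined, so any string is accepted for it.
const ALLOWED = {
  store: ["yes", "no"],
  index: ["analyzed", "analyzed_no_norms", "no", "not_analyzed", "not_analyzed_no_norms"],
  analyzer: ["simple", "standard"],
  language: ["br", "cjk", "cn", "cz", "de", "el", "en", "fr", "nl", "ru", "th"],
};

// Hypothetical helper: returns a list of problems, empty if none were found.
function validateDefaults(defaults) {
  const problems = [];
  for (const [name, value] of Object.entries(defaults)) {
    if (name === "field") {
      if (typeof value !== "string") problems.push("field must be a string");
    } else if (!Object.prototype.hasOwnProperty.call(ALLOWED, name)) {
      problems.push("unknown option: " + name);
    } else if (!ALLOWED[name].includes(value)) {
      problems.push("invalid value for " + name + ": " + value);
    }
  }
  return problems;
}

console.log(validateDefaults({ store: "yes", language: "fr" }));
console.log(validateDefaults({ store: "maybe" }));
```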
<h3>The Document class</h3>

You may construct a new Document instance with:

<pre>
var doc = new Document();
</pre>

Data may be added to this document with the add method, which takes an optional second object argument that can override any of the above default values.

<pre>
// Add with all the defaults.
doc.add("value");

// Add a subject field.
doc.add("this is the subject line.", {"field":"subject"});

// Add but ensure it's stored.
doc.add("value", {"store":"yes"});

// Add but don't analyze.
doc.add("don't analyze me", {"index":"not_analyzed"});

// Extract text from the named attachment and index it (but not store it).
doc.attachment("attachment name", {"field":"attachments"});

// Interpret "value" as a date using the default date formats.
doc.add("2009-01-01T00:00:00Z", {"type":"date"});

// Interpret "value" as a date using the supplied format string
// (see Java's SimpleDateFormat class for the syntax).
doc.add("2009-01-01", {"type":"date", "format":"yyyy-MM-dd"});

// Interpret "value" as a number.
doc.add("100", {"type":"number"});
</pre>

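To try an index function outside CouchDB, a small stand-in for the Document class is enough to see which options each value would end up with. The sketch below is not couchdb-lucene's implementation. The factory <code>makeDocumentClass</code> is hypothetical, the built-in defaults are copied from the options table, and the precedence (per-call options over view defaults over built-in defaults) is an assumption based on the description above.

```javascript
// Built-in defaults, transcribed from the options table in this document.
const BUILT_IN_DEFAULTS = {
  field: "default",
  store: "no",
  index: "analyzed",
  analyzer: "standard",
  language: "en",
};

// Hypothetical factory: returns a stand-in Document class bound to one
// index view's "defaults" object. This is NOT couchdb-lucene's own class.
function makeDocumentClass(viewDefaults) {
  const defaults = Object.assign({}, BUILT_IN_DEFAULTS, viewDefaults);
  return class Document {
    constructor() {
      this.fields = [];
    }
    // Record the value with its effective options; per-call options win.
    add(value, options) {
      this.fields.push(Object.assign({ value: value }, defaults, options));
    }
  };
}

const Document = makeDocumentClass({ store: "yes" });

// An index function written exactly as it would appear in a design document.
const indexFn = function (doc) {
  var ret = new Document();
  ret.add(doc.subject);
  ret.add(doc.body, { field: "body", store: "no" });
  return ret;
};

const result = indexFn({ subject: "hello", body: "world" });
console.log(result.fields);
```

The first field inherits <code>store: "yes"</code> from the view defaults, while the second overrides it for that call only.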
<h3>Example Transforms</h3>

<h4>Index Everything</h4>

<pre>
function(doc) {
  var ret = new Document();

  function idx(obj) {
    for (var key in obj) {
      switch (typeof obj[key]) {
      case 'object':
        idx(obj[key]);
        break;
      case 'function':
        break;
      default:
        ret.field(key, obj[key]);
        /* Uncomment next line to include
         * all attributes into a single field.
         */
        // ret.field("all", obj[key]);
        break;
      }
    }
  }

  // Index all attributes
  idx(doc);

  // Index all attachments
  for(var a in doc._attachments) {
    ret.attachment("attachment", a);
  }

  return ret;
}
</pre>

<h4>Index Nothing</h4>

<pre>
function(doc) {
  return null;
}
</pre>

<h4>Index Select Fields</h4>

<pre>
function(doc) {
  var result = new Document();
  result.field("subject", doc.subject, "yes");
  result.field("content", doc.content);
  result.date("indexed_at", new Date());
  return result;
}
</pre>

<h4>Index Attachments</h4>

<pre>
function(doc) {
  var result = new Document();
  for(var a in doc._attachments) {
    result.attachment("attachment", a);
  }
  return result;
}
</pre>

<h4>A More Complex Example</h4>

<pre>
function(doc) {
  // Build one Document holding the given field plus its group key.
  var mk = function(name, value, group) {
    var ret = new Document(name, value, "yes");
    ret.field("group", group, "yes");
    return ret;
  };
  var ret = [];
  // Skip anything that is not a reference document.
  if(doc.type != "reference") return null;
  // Return an array of Documents: three per group.
  for(var g in doc.groups) {
    ret.push(mk("library", doc.groups[g].library, g));
    ret.push(mk("method", doc.groups[g].method, g));
    ret.push(mk("target", doc.groups[g].target, g));
  }
  return ret;
}
</pre>

<h2>Attachment Indexing</h2>

Couchdb-lucene uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, provided the correct content_type is set in CouchDB:

<h3>Supported Formats</h3>

<ul>
<li>Excel spreadsheets (application/vnd.ms-excel)
<li>Word documents (application/msword)
<li>PowerPoint presentations (application/vnd.ms-powerpoint)
<li>Visio (application/vnd.visio)
<li>Outlook (application/vnd.ms-outlook)
<li>XML (application/xml)
<li>HTML (text/html)
<li>Images (image/*)
<li>Java class files
<li>Java jar archives
<li>MP3 (audio/mp3)
<li>OpenDocument (application/vnd.oasis.opendocument.*)
<li>Plain text (text/plain)
<li>PDF (application/pdf)
<li>RTF (application/rtf)
</ul>

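Since indexing depends on the content_type stored with each attachment, it can be useful to check a type against the list above before uploading. The following sketch is a hypothetical checker, not the way couchdb-lucene or Tika select a parser. It covers only the entries that name a content type (Java class files and jar archives are left out because the list gives none), and stripping parameters such as a charset is an assumption of this sketch.

```javascript
// Content types transcribed from the list above; a trailing * is a wildcard.
const SUPPORTED = [
  "application/vnd.ms-excel",
  "application/msword",
  "application/vnd.ms-powerpoint",
  "application/vnd.visio",
  "application/vnd.ms-outlook",
  "application/xml",
  "text/html",
  "image/*",
  "audio/mp3",
  "application/vnd.oasis.opendocument.*",
  "text/plain",
  "application/pdf",
  "application/rtf",
];

// Hypothetical helper: true if the content type appears in the list.
function isListedContentType(contentType) {
  // Drop parameters such as "; charset=utf-8" and normalise case.
  const type = contentType.split(";")[0].trim().toLowerCase();
  return SUPPORTED.some(function (pattern) {
    return pattern.endsWith("*")
      ? type.startsWith(pattern.slice(0, -1))
      : type === pattern;
  });
}

console.log(isListedContentType("image/png"));
console.log(isListedContentType("application/zip"));
```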
<h1>Searching with couchdb-lucene</h1>

You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The _body field, which includes the extracted text from all attachments, is searched by default. The following parameters can be passed for more sophisticated searches:

<dl>
<dt>q</dt><dd>the query to run (e.g., subject:hello)</dd>
<dt>sort</dt><dd>the comma-separated fields to sort on. Prefix a field with / for ascending order or \ for descending order (ascending is the default if not specified).</dd>
<dt>limit</dt><dd>the maximum number of results to return</dd>
<dt>skip</dt><dd>the number of results to skip</dd>
<dt>include_docs</dt><dd>whether to include the source docs</dd>
<dt>stale=ok</dt><dd>If you set the <i>stale</i> option to <i>ok</i>, couchdb-lucene may skip refreshing the index. Searches may be faster as Lucene caches important data (especially for sorting). A query without stale=ok will use the latest data committed to the index.</dd>
<dt>debug</dt><dd>if false, a normal application/json response with results is returned. If true, a pretty-printed HTML blob is returned instead.</dd>
<dt>rewrite</dt><dd>(EXPERT) if true, returns a JSON response with a rewritten query and term frequencies. This allows correct distributed scoring when combining the results from multiple nodes.</dd>
</dl>

<i>All parameters except 'q' are optional.</i>

<h2>Special Fields</h2>

<dl>
<dt>_db</dt><dd>The source database of the document.</dd>
<dt>_id</dt><dd>The _id of the document.</dd>
</dl>

<h2>Dublin Core</h2>

All Dublin Core attributes are indexed and stored if detected in the attachment. Descriptions of the fields come from the Tika javadocs.

<dl>
<dt>dc.contributor</dt><dd>An entity responsible for making contributions to the content of the resource.</dd>
<dt>dc.coverage</dt><dd>The extent or scope of the content of the resource.</dd>
<dt>dc.creator</dt><dd>An entity primarily responsible for making the content of the resource.</dd>
<dt>dc.date</dt><dd>A date associated with an event in the life cycle of the resource.</dd>
<dt>dc.description</dt><dd>An account of the content of the resource.</dd>
<dt>dc.format</dt><dd>Typically, Format may include the media-type or dimensions of the resource.</dd>
<dt>dc.identifier</dt><dd>Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.</dd>
<dt>dc.language</dt><dd>A language of the intellectual content of the resource.</dd>
<dt>dc.modified</dt><dd>Date on which the resource was changed.</dd>
<dt>dc.publisher</dt><dd>An entity responsible for making the resource available.</dd>
<dt>dc.relation</dt><dd>A reference to a related resource.</dd>
<dt>dc.rights</dt><dd>Information about rights held in and over the resource.</dd>
<dt>dc.source</dt><dd>A reference to a resource from which the present resource is derived.</dd>
<dt>dc.subject</dt><dd>The topic of the content of the resource.</dd>
<dt>dc.title</dt><dd>A name given to the resource.</dd>
<dt>dc.type</dt><dd>The nature or genre of the content of the resource.</dd>
</dl>

<h2>Examples</h2>

<pre>
http://localhost:5984/dbname/_fti?q=field_name:value
http://localhost:5984/dbname/_fti?q=field_name:value&sort=other_field
http://localhost:5984/dbname/_fti?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
</pre>

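The last example URL contains spaces and square brackets, which many HTTP clients will not send as written. A minimal sketch of building such a URL with percent-encoding follows. The helper name <code>buildSearchUrl</code> is hypothetical, and the host, database and field names are the placeholders used in the examples above.

```javascript
// Hypothetical helper: build a couchdb-lucene search URL with every
// parameter name and value percent-encoded.
function buildSearchUrl(base, db, params) {
  const query = Object.keys(params)
    .map(function (key) {
      return encodeURIComponent(key) + "=" + encodeURIComponent(String(params[key]));
    })
    .join("&");
  return base + "/" + encodeURIComponent(db) + "/_fti?" + query;
}

const url = buildSearchUrl("http://localhost:5984", "dbname", {
  q: "body:document AND customer:[A TO C]",
  sort: "\\billing_size", // a single backslash prefix: descending order
  limit: 10,
});

console.log(url);
```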
<h2>Search Results Format</h2>

Here's an example of a JSON response without sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "search_duration": 518,
  "fetch_duration": 4,
  "rows": [
    {
      "_id": "hain-m-all_documents-257.",
      "score": 1.601625680923462
    },
    {
      "_id": "hain-m-notes_inbox-257.",
      "score": 1.601625680923462
    }
  ]
}
</pre>

And the same with sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "search_duration": 660,
  "fetch_duration": 4,
  "sort_order": [
    {
      "field": "source",
      "reverse": false,
      "type": "string"
    },
    {
      "reverse": false,
      "type": "doc"
    }
  ],
  "rows": [
    {
      "_id": "shankman-j-inbox-105.",
      "score": 0.6131107211112976,
      "sort_order": [
        "enron",
        6
      ]
    },
    {
      "_id": "shankman-j-inbox-8.",
      "score": 0.7492915391921997,
      "sort_order": [
        "enron",
        7
      ]
    },
    {
      "_id": "shankman-j-inbox-30.",
      "score": 0.507369875907898,
      "sort_order": [
        "enron",
        8
      ]
    }
  ]
}
</pre>

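A client typically wants the matching ids and a way to request the next page. The sketch below works on an abbreviated copy of the unsorted response shown above. The helper name <code>summarize</code> is hypothetical, and paging by adding the number of returned rows to <code>skip</code> is an assumption based on the skip and limit parameters.

```javascript
// Abbreviated copy of the unsorted example response.
const response = {
  q: "+_db:enron +content:enron",
  skip: 0,
  limit: 2,
  total_rows: 176852,
  rows: [
    { _id: "hain-m-all_documents-257.", score: 1.601625680923462 },
    { _id: "hain-m-notes_inbox-257.", score: 1.601625680923462 },
  ],
};

// Hypothetical helper: collect the ids and work out the skip value
// to send with the next request.
function summarize(res) {
  const nextSkip = res.skip + res.rows.length;
  return {
    ids: res.rows.map(function (row) { return row._id; }),
    nextSkip: nextSkip,
    hasMore: nextSkip < res.total_rows,
  };
}

const page = summarize(response);
console.log(page);
```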
<h1>Fetching information about the index</h1>

Calling couchdb-lucene without arguments returns a JSON object with information about the index.

<pre>
http://127.0.0.1:5984/enron/_fti
</pre>

returns:

<pre>
{"doc_count":517350,"doc_del_count":1,"disk_size":318543045}
</pre>

<h1>Working With The Source</h1>

To develop "live", run "mvn dependency:unpack-dependencies" and change the external line to something like this:

<pre>
fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
/path/to/couchdb-lucene/target/dependency com.github.rnewson.couchdb.lucene.Main
</pre>

You will need to restart CouchDB whenever you change the couchdb-lucene source code, but this is very fast.

<h1>Configuration</h1>

couchdb-lucene respects several system properties:

<dl>
<dt>couchdb.url</dt><dd>the URL used to contact CouchDB (default is "http://localhost:5984")</dd>
<dt>couchdb.lucene.dir</dt><dd>the path to the Lucene indexes (the default is to make a directory called 'lucene' relative to CouchDB's current working directory).</dd>
<dt>couchdb.log.dir</dt><dd>the directory of the log file (which is called couchdb-lucene.log); defaults to the platform-specific temp directory.</dd>
</dl>

You can override these properties like this:

<pre>
fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
-cp /home/rnewson/Source/couchdb-lucene/target/classes:\
/home/rnewson/Source/couchdb-lucene/target/dependency \
com.github.rnewson.couchdb.lucene.Main
</pre>

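When several properties are overridden at once, the command line gets long enough that generating it can be less error-prone than typing it. The following sketch uses a hypothetical helper, <code>javaCommand</code>, which is not part of couchdb-lucene. The log directory and the /path/to placeholders are illustrative, and values containing spaces are not quoted.

```javascript
// Hypothetical helper: render system properties as -D flags and assemble
// the java command line on a single line.
function javaCommand(properties, classpath, mainClass) {
  const flags = Object.keys(properties).map(function (name) {
    return "-D" + name + "=" + properties[name];
  });
  return ["/usr/bin/java"]
    .concat(flags, ["-cp", classpath.join(":"), mainClass])
    .join(" ");
}

const line = "fti=" + javaCommand(
  { "couchdb.lucene.dir": "/tmp", "couchdb.log.dir": "/var/log" },
  ["/path/to/couchdb-lucene/target/classes", "/path/to/couchdb-lucene/target/dependency"],
  "com.github.rnewson.couchdb.lucene.Main"
);

console.log(line);
```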
<h2>Basic Authentication</h2>

If you put CouchDB behind an authenticating proxy, you can still configure couchdb-lucene to pull from it by specifying additional system properties. Currently only Basic authentication is supported.

<dl>
<dt>couchdb.user</dt><dd>the user to authenticate as.</dd>
<dt>couchdb.password</dt><dd>the password to authenticate with.</dd>
</dl>

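When debugging the proxy, it helps to know what a Basic authentication header looks like on the wire. The sketch below illustrates the general HTTP Basic scheme (the user and password joined by a colon, then base64-encoded); it is not couchdb-lucene's own code, and the helper name and the sample credentials are only for illustration.

```javascript
// Illustrative helper: the Authorization header value that HTTP Basic
// authentication produces for a user and password.
function basicAuthHeader(user, password) {
  const token = Buffer.from(user + ":" + password, "utf8").toString("base64");
  return "Basic " + token;
}

console.log(basicAuthHeader("Aladdin", "open sesame"));
// prints "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="
```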
<h2>IPv6</h2>

The default for couchdb.url is problematic on an IPv6 system. Specify -Dcouchdb.url=http://[::1]:5984 to resolve it.