<h1>News</h1>

The indexing API in 0.3 will change once again to allow multiple design documents and "views" into Lucene. It will also move much of the Lucene-specific settings into an options object. Please read the TODO for details.

The indexing API in 0.2 has completely changed. Please re-read this document and report any surprises or bugs to the bug tracker.

Issue tracking is now available at <a href="http://rnewson.lighthouseapp.com/projects/27420-couchdb-lucene">lighthouseapp</a>.

<h1>System Requirements</h1>

Sun JDK 5 or higher is required. Couchdb-lucene is known to be incompatible with OpenJDK because OpenJDK includes an earlier, incompatible version of the Rhino JavaScript library.

<h1>Build couchdb-lucene</h1>

<ol>
<li>Install Maven 2.
<li>Check out the repository.
<li>Run 'mvn'.
<li>Configure CouchDB (see below).
</ol>

<h1>Configure CouchDB</h1>

<pre>
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
</pre>

<h1>Indexing Strategy</h1>

<h2>Document Indexing</h2>

You must supply an index function to enable couchdb-lucene; by default, nothing is indexed.

You may add any number of index views in any number of design documents. All searches are constrained to documents emitted by those view functions.

Declare your functions as follows:

<pre>
{
  "map": "<i>conventional view code goes here</i>",

  "fulltext": {
    "by_subject": {
      "defaults": { "store":"yes" },
      "index":"function(doc) { var ret=new Document(); ret.add(doc.subject); return ret }"
    },
    "french_documents": {
      "defaults": { "language":"fr" },
      "index":"function(doc) { if (doc.language != \"fr\") { return null;} var ret=new Document(); <i>etc</i> return ret; }"
    }
  }
}
</pre>

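Because each index function is stored as a string inside JSON, any double quotes in the function body must be escaped. One way to avoid escaping by hand is to write the function as ordinary JavaScript and let JSON.stringify serialize its source. The following is a minimal sketch; the view name, the function body and the field name are illustrative assumptions, and the function is only serialized here, never called, because Document exists only inside couchdb-lucene.

```javascript
// Hypothetical index function. Document is supplied by couchdb-lucene at
// indexing time; it is not defined in this script, so we never call this.
var frenchDocuments = function(doc) {
  if (doc.language != "fr") {
    return null;
  }
  var ret = new Document();
  ret.field("subject", doc.subject);
  return ret;
};

// Build the "fulltext" object, storing the function's source as a string.
var designDoc = {
  fulltext: {
    french_documents: {
      defaults: { language: "fr" },
      index: frenchDocuments.toString()
    }
  }
};

// JSON.stringify escapes the inner double quotes and newlines for us.
var json = JSON.stringify(designDoc, null, 2);
console.log(json);
```
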
A fulltext object contains one or more index view declarations. An index view consists of:

<dl>
<dt>defaults</dt><dd>The defaults for the indexing options can be overridden here. A full list of options follows.</dd>
<dt>index</dt><dd>The indexing function itself, documented below.</dd>
</dl>

<h3>The Defaults Object</h3>

The following indexing options can be defaulted:

<table>
  <tr>
    <th>name</th>
    <th>description</th>
    <th>available options</th>
    <th>default</th>
  </tr>
  <tr>
    <th>store</th>
    <td>whether the data is stored</td>
    <td>yes, no</td>
    <td>no</td>
  </tr>
  <tr>
    <th>index</th>
    <td>whether (and how) the data is indexed</td>
    <td>analyzed, analyzed_no_norms, no, not_analyzed, not_analyzed_no_norms</td>
    <td>analyzed</td>
  </tr>
  <tr>
    <th>analyzer</th>
    <td>how the data is analyzed</td>
    <td>simple, standard</td>
    <td>standard</td>
  </tr>
  <tr>
    <th>language</th>
    <td>which language the data is in</td>
    <td>br, cjk, cn, cz, de, el, en, fr, nl, ru, th</td>
    <td>en</td>
  </tr>
</table>

<h3>The Document class</h3>

You may construct a new Document instance with:

<pre>
var doc = new Document();
</pre>

Several functions are available to populate a Document.

<pre>
// Indexed, analyzed but not stored.
doc.field("name", "value");

// Indexed, analyzed and stored.
doc.field("name", "value", "yes");

// Indexed, stored but not analyzed.
doc.field("name", "value", "yes", "not_analyzed");

// Extract text from the named attachment and index it (but do not store it).
doc.attachment("name", "attachment name");

// Interpret "value" as a date using the default date formats.
doc.date("name", "value");

// Interpret "value" as a date using the supplied format string
// (see Java's SimpleDateFormat class for the syntax).
doc.date("name", "value", "format");
</pre>

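To check what an index function would emit before deploying it, you can dry-run it outside couchdb-lucene. The sketch below defines a stand-in Document that only records the calls made on it; this stand-in, the sample document and the index function are assumptions for local testing and are not part of couchdb-lucene.

```javascript
// Stand-in for couchdb-lucene's Document class (an assumption for local
// testing only): it records calls instead of building a Lucene document.
function Document() {
  this.calls = [];
}
Document.prototype.field = function(name, value, store, index) {
  this.calls.push({ type: "field", name: name, value: value, store: store, index: index });
};
Document.prototype.attachment = function(name, attachmentName) {
  this.calls.push({ type: "attachment", name: name, attachment: attachmentName });
};
Document.prototype.date = function(name, value, format) {
  this.calls.push({ type: "date", name: name, value: value, format: format });
};

// A hypothetical index function using the calls documented above.
var indexFn = function(doc) {
  var ret = new Document();
  ret.field("subject", doc.subject, "yes");
  for (var a in doc._attachments) {
    ret.attachment("attachment", a);
  }
  return ret;
};

// A made-up CouchDB document to feed through the function.
var sample = {
  _id: "example-1",
  subject: "hello",
  _attachments: { "notes.txt": { content_type: "text/plain" } }
};

var result = indexFn(sample);
console.log(JSON.stringify(result.calls, null, 2));
```

The recorded calls show which fields and attachments the function would hand to the indexer; they say nothing about how Lucene analyzes or stores them.
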
<h3>Example Transforms</h3>

<h4>Index Everything</h4>

<pre>
function(doc) {
  var ret = new Document();

  function idx(obj) {
    for (var key in obj) {
      switch (typeof obj[key]) {
      case 'object':
        idx(obj[key]);
        break;
      case 'function':
        break;
      default:
        ret.field(key, obj[key]);
        /* Uncomment next line to include
         * all attributes into a single field.
         */
        // ret.field("all", obj[key]);
        break;
      }
    }
  }

  // Index all attributes
  idx(doc);

  // Index all attachments
  for (var a in doc._attachments) {
    ret.attachment("attachment", a);
  }

  return ret;
}
</pre>

<h4>Index Nothing</h4>

<pre>
function(doc) {
  return null;
}
</pre>

<h4>Index Select Fields</h4>

<pre>
function(doc) {
  var result = new Document();
  result.field("subject", doc.subject, "yes");
  result.field("content", doc.content);
  result.date("indexed_at", new Date());
  return result;
}
</pre>

<h4>Index Attachments</h4>

<pre>
function(doc) {
  var result = new Document();
  for (var a in doc._attachments) {
    result.attachment("attachment", a);
  }
  return result;
}
</pre>

<h4>A More Complex Example</h4>

<pre>
function(doc) {
  // Build one Document holding a stored field plus its group.
  var mk = function(name, value, group) {
    var ret = new Document();
    ret.field(name, value, "yes");
    ret.field("group", group, "yes");
    return ret;
  };
  var ret = [];
  if (doc.type != "reference") return null;
  for (var g in doc.groups) {
    ret.push(mk("library", doc.groups[g].library, g));
    ret.push(mk("method", doc.groups[g].method, g));
    ret.push(mk("target", doc.groups[g].target, g));
  }
  return ret;
}
</pre>

<h2>Attachment Indexing</h2>

Couchdb-lucene uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, provided the correct content_type is set in CouchDB:

<h3>Supported Formats</h3>

<ul>
<li>Excel spreadsheets (application/vnd.ms-excel)
<li>Word documents (application/msword)
<li>PowerPoint presentations (application/vnd.ms-powerpoint)
<li>Visio (application/vnd.visio)
<li>Outlook (application/vnd.ms-outlook)
<li>XML (application/xml)
<li>HTML (text/html)
<li>Images (image/*)
<li>Java class files
<li>Java jar archives
<li>MP3 (audio/mp3)
<li>OpenDocument (application/vnd.oasis.opendocument.*)
<li>Plain text (text/plain)
<li>PDF (application/pdf)
<li>RTF (application/rtf)
</ul>

<h1>Searching with couchdb-lucene</h1>

You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The _body field is searched by default; it includes the extracted text from all attachments. The following parameters can be passed for more sophisticated searches:

<dl>
<dt>q</dt><dd>The query to run (e.g., subject:hello).</dd>
<dt>sort</dt><dd>The comma-separated fields to sort on. Prefix a field with / for ascending order or \ for descending order (ascending is the default if not specified).</dd>
<dt>limit</dt><dd>The maximum number of results to return.</dd>
<dt>skip</dt><dd>The number of results to skip.</dd>
<dt>include_docs</dt><dd>Whether to include the source docs.</dd>
<dt>stale=ok</dt><dd>If you set the <i>stale</i> option to <i>ok</i>, couchdb-lucene may skip refreshing the index. Searches may be faster because Lucene caches important data (especially for sorting). A query without stale=ok will use the latest data committed to the index.</dd>
<dt>debug</dt><dd>If false, a normal application/json response with results is returned. If true, a pretty-printed HTML blob is returned instead.</dd>
<dt>rewrite</dt><dd>(EXPERT) If true, returns a JSON response with a rewritten query and term frequencies. This allows correct distributed scoring when combining the results from multiple nodes.</dd>
</dl>

<i>All parameters except 'q' are optional.</i>

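Queries often contain spaces, colons, brackets and (for descending sorts) backslashes, all of which should be percent-encoded before they are placed in a URL. The sketch below assembles a search URL from the parameters listed above; buildSearchUrl is a hypothetical helper, and the host, database name and field names are placeholders.

```javascript
// Hypothetical helper: build a couchdb-lucene search URL from a base URL,
// a database name, and an object of query parameters.
function buildSearchUrl(base, db, params) {
  var pairs = Object.keys(params).map(function(key) {
    // encodeURIComponent escapes spaces, ':', '[', ']' and '\'.
    return encodeURIComponent(key) + "=" + encodeURIComponent(String(params[key]));
  });
  return base + "/" + encodeURIComponent(db) + "/_fti?" + pairs.join("&");
}

var searchUrl = buildSearchUrl("http://localhost:5984", "dbname", {
  q: "body:document AND customer:[A TO C]",
  sort: "\\billing_size", // a leading backslash requests descending order
  limit: 10,
  debug: true
});

console.log(searchUrl);
```
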
<h2>Special Fields</h2>

<dl>
<dt>_db</dt><dd>The source database of the document.</dd>
<dt>_id</dt><dd>The _id of the document.</dd>
</dl>

<h2>Dublin Core</h2>

All Dublin Core attributes are indexed and stored if detected in the attachment. Descriptions of the fields come from the Tika javadocs.

<dl>
<dt>dc.contributor</dt><dd>An entity responsible for making contributions to the content of the resource.</dd>
<dt>dc.coverage</dt><dd>The extent or scope of the content of the resource.</dd>
<dt>dc.creator</dt><dd>An entity primarily responsible for making the content of the resource.</dd>
<dt>dc.date</dt><dd>A date associated with an event in the life cycle of the resource.</dd>
<dt>dc.description</dt><dd>An account of the content of the resource.</dd>
<dt>dc.format</dt><dd>Typically, Format may include the media-type or dimensions of the resource.</dd>
<dt>dc.identifier</dt><dd>Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.</dd>
<dt>dc.language</dt><dd>A language of the intellectual content of the resource.</dd>
<dt>dc.modified</dt><dd>Date on which the resource was changed.</dd>
<dt>dc.publisher</dt><dd>An entity responsible for making the resource available.</dd>
<dt>dc.relation</dt><dd>A reference to a related resource.</dd>
<dt>dc.rights</dt><dd>Information about rights held in and over the resource.</dd>
<dt>dc.source</dt><dd>A reference to a resource from which the present resource is derived.</dd>
<dt>dc.subject</dt><dd>The topic of the content of the resource.</dd>
<dt>dc.title</dt><dd>A name given to the resource.</dd>
<dt>dc.type</dt><dd>The nature or genre of the content of the resource.</dd>
</dl>

<h2>Examples</h2>

<pre>
http://localhost:5984/dbname/_fti?q=field_name:value
http://localhost:5984/dbname/_fti?q=field_name:value&sort=other_field
http://localhost:5984/dbname/_fti?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
</pre>

<h2>Search Results Format</h2>

Here's an example of a JSON response without sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "search_duration": 518,
  "fetch_duration": 4,
  "rows": [
    {
      "_id": "hain-m-all_documents-257.",
      "score": 1.601625680923462
    },
    {
      "_id": "hain-m-notes_inbox-257.",
      "score": 1.601625680923462
    }
  ]
}
</pre>

And the same with sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "search_duration": 660,
  "fetch_duration": 4,
  "sort_order": [
    {
      "field": "source",
      "reverse": false,
      "type": "string"
    },
    {
      "reverse": false,
      "type": "doc"
    }
  ],
  "rows": [
    {
      "_id": "shankman-j-inbox-105.",
      "score": 0.6131107211112976,
      "sort_order": [
        "enron",
        6
      ]
    },
    {
      "_id": "shankman-j-inbox-8.",
      "score": 0.7492915391921997,
      "sort_order": [
        "enron",
        7
      ]
    },
    {
      "_id": "shankman-j-inbox-30.",
      "score": 0.507369875907898,
      "sort_order": [
        "enron",
        8
      ]
    }
  ]
}
</pre>

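A client typically needs only the ids, scores and sort values from such a response, plus the offset for the next page. The sketch below works on a trimmed copy of the sorted response shown above; summarize is a hypothetical helper, and the paging arithmetic assumes that total_rows counts every match rather than just the rows returned.

```javascript
// Trimmed copy of the sorted response shown above.
var response = {
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "rows": [
    { "_id": "shankman-j-inbox-105.", "score": 0.6131107211112976, "sort_order": ["enron", 6] },
    { "_id": "shankman-j-inbox-8.", "score": 0.7492915391921997, "sort_order": ["enron", 7] },
    { "_id": "shankman-j-inbox-30.", "score": 0.507369875907898, "sort_order": ["enron", 8] }
  ]
};

// Hypothetical helper: reduce each row to the values a client usually needs.
function summarize(res) {
  return res.rows.map(function(row) {
    return {
      id: row._id,
      score: row.score,
      // Rows carry sort_order only when a sort was requested.
      sortValues: row.sort_order || []
    };
  });
}

var summary = summarize(response);

// Value to pass as "skip" to fetch the next page (assumes total_rows
// is the total number of matches).
var nextSkip = response.skip + response.rows.length;
var hasMore = nextSkip < response.total_rows;

console.log(summary, nextSkip, hasMore);
```
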
<h1>Fetching information about the index</h1>

Calling couchdb-lucene without arguments returns a JSON object with information about the index.

<pre>
http://127.0.0.1:5984/enron/_fti
</pre>

returns:

<pre>
{"doc_count":517350,"doc_del_count":1,"disk_size":318543045}
</pre>

<h1>Working With The Source</h1>

To develop "live", run "mvn dependency:unpack-dependencies" and change the external line to something like this:

<pre>
fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
/path/to/couchdb-lucene/target/dependency com.github.rnewson.couchdb.lucene.Main
</pre>

You will need to restart CouchDB whenever you change the couchdb-lucene source code, but this is very fast.

<h1>Configuration</h1>

couchdb-lucene respects several system properties:

<dl>
<dt>couchdb.url</dt><dd>The URL used to contact CouchDB (default is "http://localhost:5984").</dd>
<dt>couchdb.lucene.dir</dt><dd>The path to the Lucene indexes (the default is to make a directory called 'lucene' relative to CouchDB's current working directory).</dd>
<dt>couchdb.log.dir</dt><dd>The directory of the log file (which is called couchdb-lucene.log); defaults to the platform-specific temp directory.</dd>
</dl>

You can override these properties like this:

<pre>
fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
-cp /home/rnewson/Source/couchdb-lucene/target/classes:\
/home/rnewson/Source/couchdb-lucene/target/dependency \
com.github.rnewson.couchdb.lucene.Main
</pre>

<h2>Basic Authentication</h2>

If you put CouchDB behind an authenticating proxy, you can still configure couchdb-lucene to pull from it by specifying additional system properties. Currently only Basic authentication is supported.

<dl>
<dt>couchdb.user</dt><dd>The user to authenticate as.</dd>
<dt>couchdb.password</dt><dd>The password to authenticate with.</dd>
</dl>

<h2>IPv6</h2>

The default for couchdb.url is problematic on an IPv6 system. Specify -Dcouchdb.url=http://[::1]:5984 to resolve it.