<h1>News</h1>

The indexing API in 0.3 will change once again to allow multiple design documents and "views" into Lucene. It will also move much of the Lucene-specific stuff into an options object. Please read the TODO for details.

The indexing API in 0.2 has completely changed. Please re-read this document and report any surprises or bugs to the bug tracker.

Issue tracking is now available at <a href="http://rnewson.lighthouseapp.com/projects/27420-couchdb-lucene">lighthouseapp</a>.

<h1>System Requirements</h1>

Sun JDK 5 or higher is necessary. Couchdb-lucene is known to be incompatible with OpenJDK as it includes an earlier, and incompatible, version of the Rhino JavaScript library.

<h1>Build couchdb-lucene</h1>

<ol>
<li>Install Maven 2.
<li>Check out the repository.
<li>Type 'mvn'.
<li>Configure CouchDB (see below).
</ol>

<h1>Configure CouchDB</h1>

<pre>
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
</pre>

<h1>Indexing Strategy</h1>

<h2>Document Indexing</h2>

You must supply an index function in order to enable couchdb-lucene as, by default, nothing will be indexed.

You may add any number of index views in any number of design documents. All searches will be constrained to documents emitted by those view functions.

Declare your functions as follows:

<pre>
{
  "map": "<i>conventional view code goes here</i>",

  "fulltext": {
    "by_subject": {
      "defaults": { "store":"yes" },
      "index":"function(doc) { var ret=new Document(); ret.add(doc.subject); return ret }"
    },
    "french_documents": {
      "defaults": { "language":"fr" },
      "index":"function(doc) { if (doc.language != \"fr\") { return null;} var ret=new Document(); <i>etc</i> return ret; }"
    }
  }
}
</pre>
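
Because index functions are stored as JSON strings, any double quotes inside them must be escaped. One way to avoid hand-escaping is to write the functions as ordinary JavaScript and serialize them. The following is a minimal sketch, not part of couchdb-lucene: the helper name <code>buildDesignDoc</code>, the design document id <code>_design/search</code> and the body of the second index function are assumptions made for illustration.

```javascript
// Hypothetical helper: turn real functions into the strings a design document stores.
function buildDesignDoc(id, views) {
  const fulltext = {};
  for (const [name, view] of Object.entries(views)) {
    fulltext[name] = {
      defaults: view.defaults || {},
      // Function.prototype.toString() yields the source text of the function.
      index: view.index.toString()
    };
  }
  return { _id: id, fulltext: fulltext };
}

// The index functions are never called here, only serialized, so the
// Document class does not need to exist in this script.
const designDoc = buildDesignDoc("_design/search", {
  by_subject: {
    defaults: { store: "yes" },
    index: function (doc) {
      var ret = new Document();
      ret.add(doc.subject);
      return ret;
    }
  },
  french_documents: {
    defaults: { language: "fr" },
    index: function (doc) {
      if (doc.language != "fr") { return null; }
      var ret = new Document();
      ret.add(doc.subject); // illustrative body only
      return ret;
    }
  }
});

// JSON.stringify escapes the inner double quotes around "fr" automatically.
const json = JSON.stringify(designDoc, null, 2);
console.log(json);
```

The printed JSON can then be saved as the design document by whatever means you normally use to update CouchDB documents.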

A fulltext object contains multiple index view declarations. An index view consists of:

<dl>
<dt>defaults</dt><dd>The default for numerous indexing options can be overridden here. A full list of options follows.</dd>
<dt>index</dt><dd>The indexing function itself, documented below.</dd>
</dl>

<h3>The Defaults Object</h3>

The following indexing options can be defaulted:

<table>
<tr>
<th>name</th>
<th>description</th>
<th>available options</th>
<th>default</th>
</tr>
<tr>
<th>field</th>
<td>the field name to index under</td>
<td>user-defined</td>
<td>default</td>
</tr>
<tr>
<th>store</th>
<td>whether the data is stored</td>
<td>yes, no</td>
<td>no</td>
</tr>
<tr>
<th>index</th>
<td>whether (and how) the data is indexed</td>
<td>analyzed, analyzed_no_norms, no, not_analyzed, not_analyzed_no_norms</td>
<td>analyzed</td>
</tr>
<tr>
<th>analyzer</th>
<td>how the data is analyzed</td>
<td>simple, standard</td>
<td>standard</td>
</tr>
<tr>
<th>language</th>
<td>which language the data is in</td>
<td>br, cjk, cn, cz, de, el, en, fr, nl, ru, th</td>
<td>en</td>
</tr>
</table>
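
The options in the table can be set at three levels: the built-in default, the view's defaults object, and the optional second argument to an individual add call. The sketch below illustrates the precedence this document implies (the most specific setting wins). It is an illustration only, not couchdb-lucene's implementation, and <code>resolveOptions</code> is a hypothetical name.

```javascript
// Built-in defaults, copied from the table above.
const BUILT_IN_DEFAULTS = {
  field: "default",
  store: "no",
  index: "analyzed",
  analyzer: "standard",
  language: "en"
};

// Hypothetical helper: later sources win over earlier ones, so per-call
// options override view defaults, which override the built-in defaults.
function resolveOptions(viewDefaults, callOptions) {
  return Object.assign({}, BUILT_IN_DEFAULTS, viewDefaults, callOptions);
}

// A view that stores everything, and one call that names its field.
const resolved = resolveOptions({ store: "yes" }, { field: "subject" });
console.log(resolved);
```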

<h3>The Document class</h3>

You may construct a new Document instance with:

<pre>
var doc = new Document();
</pre>

Data may be added to this document with the add method, which takes an optional second object argument that can override any of the above default values.

<pre>
// Add with all the defaults.
doc.add("value");

// Add a subject field.
doc.add("this is the subject line.", {"field":"subject"});

// Add but ensure it's stored.
doc.add("value", {"store":"yes"});

// Add but don't analyze.
doc.add("don't analyze me", {"index":"not_analyzed"});

// Extract text from the named attachment and index it (but not store it).
doc.attachment("attachment name", {"field":"attachments"});

// Interpret "value" as a date using the default date formats.
doc.add("2009-01-01T00:00:00Z", {"type":"date"});

// Interpret "value" as a date using the supplied format string
// (see Java's SimpleDateFormat class for the syntax).
doc.add("2009-01-01", {"type":"date", "date_format":"yyyy-MM-dd"});
</pre>
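
To try an index function outside couchdb-lucene, a stand-in for the Document class can be enough to see what the function would emit. The sketch below is only a test double that records calls. It is not the real class, performs no indexing, and the <code>entries</code> property and the sample index function are assumptions made for illustration.

```javascript
// Stand-in for the Document class, for local experiments only.
class Document {
  constructor() {
    this.entries = []; // hypothetical: a log of everything that was added
  }
  add(value, options) {
    this.entries.push({ kind: "value", value: value, options: options || {} });
  }
  attachment(name, options) {
    this.entries.push({ kind: "attachment", name: name, options: options || {} });
  }
}

// A sample index function written against the add API described above.
function indexFn(doc) {
  if (!doc.subject) { return null; } // returning null indexes nothing
  var ret = new Document();
  ret.add(doc.subject, { field: "subject", store: "yes" });
  ret.add(doc.created, { type: "date" });
  return ret;
}

const indexed = indexFn({ subject: "hello", created: "2009-01-01T00:00:00Z" });
const skipped = indexFn({ body: "no subject here" });
console.log(indexed.entries, skipped);
```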

<h3>Example Transforms</h3>

<h4>Index Everything</h4>

<pre>
function(doc) {
  var ret = new Document();

  function idx(obj) {
    for (var key in obj) {
      switch (typeof obj[key]) {
      case 'object':
        idx(obj[key]);
        break;
      case 'function':
        break;
      default:
        ret.field(key, obj[key]);
        /* Uncomment next line to include
         * all attributes into a single field.
         */
        // ret.field("all", obj[key]);
        break;
      }
    }
  }

  // Index all attributes
  idx(doc);

  // Index all attachments
  for(var a in doc._attachments) {
    ret.attachment("attachment", a);
  }

  return ret;
}
</pre>

<h4>Index Nothing</h4>

<pre>
function(doc) {
  return null;
}
</pre>

<h4>Index Select Fields</h4>

<pre>
function(doc) {
  var result = new Document();
  result.field("subject", doc.subject, "yes");
  result.field("content", doc.content);
  result.date("indexed_at", new Date());
  return result;
}
</pre>

<h4>Index Attachments</h4>

<pre>
function(doc) {
  var result = new Document();
  for(var a in doc._attachments) {
    result.attachment("attachment", a);
  }
  return result;
}
</pre>

<h4>A More Complex Example</h4>

<pre>
function(doc) {
  var mk = function(name, value, group) {
    var ret = new Document(name, value, "yes");
    ret.field("group", group, "yes");
    return ret;
  };
  var ret = [];
  if(doc.type != "reference") return null;
  for(var g in doc.groups) {
    ret.push(mk("library", doc.groups[g].library, g));
    ret.push(mk("method", doc.groups[g].method, g));
    ret.push(mk("target", doc.groups[g].target, g));
  }
  return ret;
}
</pre>

<h2>Attachment Indexing</h2>

Couchdb-lucene uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, assuming the correct content_type is set in couchdb:

<h3>Supported Formats</h3>

<ul>
<li>Excel spreadsheets (application/vnd.ms-excel)
<li>Word documents (application/msword)
<li>Powerpoint presentations (application/vnd.ms-powerpoint)
<li>Visio (application/vnd.visio)
<li>Outlook (application/vnd.ms-outlook)
<li>XML (application/xml)
<li>HTML (text/html)
<li>Images (image/*)
<li>Java class files
<li>Java jar archives
<li>MP3 (audio/mp3)
<li>OpenDocument (application/vnd.oasis.opendocument.*)
<li>Plain text (text/plain)
<li>PDF (application/pdf)
<li>RTF (application/rtf)
</ul>

<h1>Searching with couchdb-lucene</h1>

You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The _body field is searched by default, which will include the extracted text from all attachments. The following parameters can be passed for more sophisticated searches:

<dl>
<dt>q</dt><dd>the query to run (e.g., subject:hello)</dd>
<dt>sort</dt><dd>the comma-separated fields to sort on. Prefix with / for ascending order and \ for descending order (ascending is the default if not specified).</dd>
<dt>limit</dt><dd>the maximum number of results to return</dd>
<dt>skip</dt><dd>the number of results to skip</dd>
<dt>include_docs</dt><dd>whether to include the source docs</dd>
<dt>stale=ok</dt><dd>If you set the <i>stale</i> option to <i>ok</i>, couchdb-lucene may skip refreshing the index. Searches may be faster as Lucene caches important data (especially for sorting). A query without stale=ok will use the latest data committed to the index.</dd>
<dt>debug</dt><dd>if false, a normal application/json response with results appears. If true, a pretty-printed HTML blob is returned instead.</dd>
<dt>rewrite</dt><dd>(EXPERT) if true, returns a JSON response with a rewritten query and term frequencies. This allows correct distributed scoring when combining the results from multiple nodes.</dd>
</dl>

<i>All parameters except 'q' are optional.</i>
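
Lucene queries often contain spaces, colons and brackets, so it is safest to URL-encode each parameter when building a request. The following is a minimal sketch of assembling a search URL from the parameters listed above. The helper name <code>searchUrl</code> and the sample values are assumptions, and the host, port and <code>_fti</code> path simply follow the configuration shown earlier in this document.

```javascript
// Hypothetical helper: build a couchdb-lucene search URL from a parameter object.
function searchUrl(base, db, params) {
  const query = Object.keys(params)
    .map(function (key) {
      // Encode names and values so spaces, ':' and '[' survive the trip.
      return encodeURIComponent(key) + "=" + encodeURIComponent(String(params[key]));
    })
    .join("&");
  return base + "/" + encodeURIComponent(db) + "/_fti?" + query;
}

const url = searchUrl("http://localhost:5984", "dbname", {
  q: "body:document AND customer:[A TO C]",
  sort: "\\billing_size", // a single backslash prefix requests descending order
  limit: 10,
  include_docs: true
});
console.log(url);
```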

<h2>Special Fields</h2>

<dl>
<dt>_db</dt><dd>The source database of the document.</dd>
<dt>_id</dt><dd>The _id of the document.</dd>
</dl>

<h2>Dublin Core</h2>

All Dublin Core attributes are indexed and stored if detected in the attachment. Descriptions of the fields come from the Tika javadocs.

<dl>
<dt>dc.contributor</dt><dd>An entity responsible for making contributions to the content of the resource.</dd>
<dt>dc.coverage</dt><dd>The extent or scope of the content of the resource.</dd>
<dt>dc.creator</dt><dd>An entity primarily responsible for making the content of the resource.</dd>
<dt>dc.date</dt><dd>A date associated with an event in the life cycle of the resource.</dd>
<dt>dc.description</dt><dd>An account of the content of the resource.</dd>
<dt>dc.format</dt><dd>Typically, Format may include the media-type or dimensions of the resource.</dd>
<dt>dc.identifier</dt><dd>Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.</dd>
<dt>dc.language</dt><dd>A language of the intellectual content of the resource.</dd>
<dt>dc.modified</dt><dd>Date on which the resource was changed.</dd>
<dt>dc.publisher</dt><dd>An entity responsible for making the resource available.</dd>
<dt>dc.relation</dt><dd>A reference to a related resource.</dd>
<dt>dc.rights</dt><dd>Information about rights held in and over the resource.</dd>
<dt>dc.source</dt><dd>A reference to a resource from which the present resource is derived.</dd>
<dt>dc.subject</dt><dd>The topic of the content of the resource.</dd>
<dt>dc.title</dt><dd>A name given to the resource.</dd>
<dt>dc.type</dt><dd>The nature or genre of the content of the resource.</dd>
</dl>

<h2>Examples</h2>

<pre>
http://localhost:5984/dbname/_fti?q=field_name:value
http://localhost:5984/dbname/_fti?q=field_name:value&sort=other_field
http://localhost:5984/dbname/_fti?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
</pre>

<h2>Search Results Format</h2>

Here's an example of a JSON response without sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "search_duration": 518,
  "fetch_duration": 4,
  "rows": [
    {
      "_id": "hain-m-all_documents-257.",
      "score": 1.601625680923462
    },
    {
      "_id": "hain-m-notes_inbox-257.",
      "score": 1.601625680923462
    }
  ]
}
</pre>

And the same with sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "search_duration": 660,
  "fetch_duration": 4,
  "sort_order": [
    {
      "field": "source",
      "reverse": false,
      "type": "string"
    },
    {
      "reverse": false,
      "type": "doc"
    }
  ],
  "rows": [
    {
      "_id": "shankman-j-inbox-105.",
      "score": 0.6131107211112976,
      "sort_order": [
        "enron",
        6
      ]
    },
    {
      "_id": "shankman-j-inbox-8.",
      "score": 0.7492915391921997,
      "sort_order": [
        "enron",
        7
      ]
    },
    {
      "_id": "shankman-j-inbox-30.",
      "score": 0.507369875907898,
      "sort_order": [
        "enron",
        8
      ]
    }
  ]
}
</pre>
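
A client typically needs only the document ids and enough information to request the next page. The sketch below parses a response of the shape shown above. The sample data is abridged from the unsorted example, and the helper name <code>summarise</code> and its return shape are assumptions made for illustration.

```javascript
// Sample response, abridged from the unsorted example above.
const responseText = JSON.stringify({
  q: "+_db:enron +content:enron",
  skip: 0,
  limit: 2,
  total_rows: 176852,
  rows: [
    { _id: "hain-m-all_documents-257.", score: 1.601625680923462 },
    { _id: "hain-m-notes_inbox-257.", score: 1.601625680923462 }
  ]
});

// Hypothetical helper: summarise one page of results.
function summarise(text) {
  const result = JSON.parse(text);
  const ids = result.rows.map(function (row) { return row._id; });
  // The next page starts after the rows already received.
  const nextSkip = result.skip + result.rows.length;
  return { ids: ids, nextSkip: nextSkip, hasMore: nextSkip < result.total_rows };
}

const page = summarise(responseText);
console.log(page);
```

Passing <code>nextSkip</code> as the skip parameter of the following request, with the same limit, would fetch the next page.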

<h1>Fetching information about the index</h1>

Calling couchdb-lucene without arguments returns a JSON object with information about the index.

<pre>
http://127.0.0.1:5984/enron/_fti
</pre>

returns:

<pre>
{"doc_count":517350,"doc_del_count":1,"disk_size":318543045}
</pre>

<h1>Working With The Source</h1>

To develop "live", type "mvn dependency:unpack-dependencies" and change the external line to something like this:

<pre>
fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
/path/to/couchdb-lucene/target/dependency com.github.rnewson.couchdb.lucene.Main
</pre>

You will need to restart CouchDB if you change the couchdb-lucene source code, but this is very fast.

<h1>Configuration</h1>

couchdb-lucene respects several system properties:

<dl>
<dt>couchdb.url</dt><dd>the url to contact CouchDB with (default is "http://localhost:5984")</dd>
<dt>couchdb.lucene.dir</dt><dd>specify the path to the lucene indexes (the default is to make a directory called 'lucene' relative to couchdb's current working directory).</dd>
<dt>couchdb.log.dir</dt><dd>specify the directory of the log file (which is called couchdb-lucene.log); defaults to the platform-specific temp directory.</dd>
</dl>

You can override these properties like this:

<pre>
fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
-cp /home/rnewson/Source/couchdb-lucene/target/classes:\
/home/rnewson/Source/couchdb-lucene/target/dependency \
com.github.rnewson.couchdb.lucene.Main
</pre>

<h2>Basic Authentication</h2>

If you put couchdb behind an authenticating proxy you can still configure couchdb-lucene to pull from it by specifying additional system properties. Currently only Basic authentication is supported.

<dl>
<dt>couchdb.user</dt><dd>the user to authenticate as.</dd>
<dt>couchdb.password</dt><dd>the password to authenticate with.</dd>
</dl>
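
For reference, HTTP Basic authentication sends the user and password joined by a colon and base64-encoded in an Authorization header. The sketch below shows how that header value is formed. It illustrates the protocol only and is not couchdb-lucene's own code; the function name and the sample credentials are assumptions, and <code>Buffer</code> is specific to Node.js.

```javascript
// Build the Authorization header value that HTTP Basic authentication sends.
function basicAuthHeader(user, password) {
  const token = Buffer.from(user + ":" + password, "utf8").toString("base64");
  return "Basic " + token;
}

// Sample credentials for illustration only.
const header = basicAuthHeader("Aladdin", "open sesame");
console.log(header);
```

Base64 is an encoding, not encryption, so the password is readable by anyone who can observe the connection between couchdb-lucene and the proxy.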

<h2>IPv6</h2>

The default for couchdb.url is problematic on an IPv6 system. Specify -Dcouchdb.url=http://[::1]:5984 to resolve it.