<h1>News</h1>

The indexing API in 0.3 has changed since 0.2 to allow multiple design documents and "views" into Lucene. It also moves the Lucene-specific settings into an options object.

<h1>Issue Tracking</h1>
Issue tracking is now available at <a href="http://rnewson.lighthouseapp.com/projects/27420-couchdb-lucene">lighthouseapp</a>.

<h1>System Requirements</h1>

Sun JDK 5 or higher is required. Couchdb-lucene is known to be incompatible with OpenJDK, which includes an earlier, incompatible version of the Rhino JavaScript library.

<h1>Build couchdb-lucene</h1>

<ol>
<li>Install Maven 2.
<li>Check out the repository.
<li>Type 'mvn'.
<li>Configure CouchDB (see below).
</ol>

<h1>Configure CouchDB</h1>

<pre>
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
</pre>

<h1>Indexing Strategy</h1>

<h2>Document Indexing</h2>

You must supply an index function to enable couchdb-lucene; by default, nothing is indexed.

You may add any number of index views in any number of design documents. All searches will be constrained to documents emitted by those view functions.

Declare your functions as follows:

<pre>
{
  "views": {
    <i>conventional view code goes here</i>
  },
  "fulltext": {
    "by_subject": {
      "defaults": { "store":"yes" },
      "index":"function(doc) { var ret=new Document(); ret.add(doc.subject); return ret }"
    },
    "french_documents": {
      "defaults": { "language":"fr" },
      "index":"function(doc) { if (doc.language != \"fr\") { return null;} var ret=new Document(); <i>etc</i> return ret; }"
    }
  }
}
</pre>

A fulltext object contains multiple index view declarations. An index view consists of:

<dl>
<dt>defaults</dt><dd>The defaults for the indexing options can be overridden here. A full list of options follows.</dd>
<dt>index</dt><dd>The indexing function itself, documented below.</dd>
</dl>

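Because each index function is stored as a JSON string, any quotes inside it must be escaped. One way to avoid escaping by hand is to write the function as ordinary JavaScript and let JSON.stringify serialize it. The following is only a sketch: the helper name `buildDesignDoc` and the document id `_design/search` are assumptions made for illustration and are not part of couchdb-lucene.

```javascript
// Hypothetical helper: turns index functions into a "fulltext" section.
function buildDesignDoc(id, indexViews) {
  var fulltext = {};
  for (var name in indexViews) {
    fulltext[name] = {
      defaults: indexViews[name].defaults || {},
      // Function.prototype.toString yields the source text of the function.
      index: indexViews[name].index.toString()
    };
  }
  return { _id: id, fulltext: fulltext };
}

var designDoc = buildDesignDoc("_design/search", {
  french_documents: {
    defaults: { language: "fr" },
    // Never called here; Document only exists inside couchdb-lucene.
    index: function (doc) {
      if (doc.language != "fr") { return null; }
      var ret = new Document();
      ret.add(doc.subject);
      return ret;
    }
  }
});

// JSON.stringify escapes the embedded quotes ("fr" becomes \"fr\").
var json = JSON.stringify(designDoc, null, 2);
console.log(json);
```

The resulting JSON can be saved as a design document in the usual way.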
<h3>The Defaults Object</h3>

The following indexing options can be defaulted:

<table>
<tr>
<th>name</th>
<th>description</th>
<th>available options</th>
<th>default</th>
</tr>
<tr>
<th>field</th>
<td>the field name to index under</td>
<td>user-defined</td>
<td>default</td>
</tr>
<tr>
<th>type</th>
<td>the type of data, which may affect analysis</td>
<td>date, number, text</td>
<td>text</td>
</tr>
<tr>
<th>store</th>
<td>whether the data is stored</td>
<td>yes, no</td>
<td>no</td>
</tr>
<tr>
<th>index</th>
<td>whether (and how) the data is indexed</td>
<td>analyzed, analyzed_no_norms, no, not_analyzed, not_analyzed_no_norms</td>
<td>analyzed</td>
</tr>
<tr>
<th>analyzer</th>
<td>how the data is analyzed</td>
<td>simple, standard</td>
<td>standard</td>
</tr>
<tr>
<th>language</th>
<td>which language the data is in</td>
<td>auto, br, cjk, cn, cz, de, el, en, fr, nl, ru, th</td>
<td>en</td>
</tr>
</table>

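The table implies an order of precedence: the built-in defaults apply unless the index view's defaults object overrides them, and options passed when a value is added override both. The sketch below illustrates that precedence only; it is not couchdb-lucene's implementation. `BUILT_IN_DEFAULTS` and `resolveOptions` are assumed names, and the values are copied from the table.

```javascript
// Built-in defaults, taken from the table above.
var BUILT_IN_DEFAULTS = {
  field: "default",
  type: "text",
  store: "no",
  index: "analyzed",
  analyzer: "standard",
  language: "en"
};

// Hypothetical helper: later layers override earlier ones.
function resolveOptions(viewDefaults, callOptions) {
  var resolved = {};
  [BUILT_IN_DEFAULTS, viewDefaults || {}, callOptions || {}].forEach(function (layer) {
    for (var key in layer) {
      resolved[key] = layer[key];
    }
  });
  return resolved;
}

// View defaults say "store everything"; this call only names the field.
var options = resolveOptions({ store: "yes" }, { field: "subject" });
console.log(options.field, options.store, options.index);
// prints: subject yes analyzed
```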
<h3>The Document class</h3>

You may construct a new Document instance with:

<pre>
var doc = new Document();
</pre>

Data may be added to this document with the add method, which takes an optional second object argument that can override any of the default values above.

<pre>
// Add with all the defaults.
doc.add("value");

// Add a subject field.
doc.add("this is the subject line.", {"field":"subject"});

// Add but ensure it's stored.
doc.add("value", {"store":"yes"});

// Add but don't analyze.
doc.add("don't analyze me", {"index":"not_analyzed"});

// Extract text from the named attachment and index it (but not store it).
doc.attachment("attachment name", {"field":"attachments"});

// Interpret "value" as a date using the default date formats.
doc.add("2009-01-01T00:00:00Z", {"type":"date"});

// Interpret "value" as a date using the supplied format string
// (see Java's SimpleDateFormat class for the syntax).
doc.add("2009-01-01", {"type":"date", "format":"yyyy-MM-dd"});

// Interpret "value" as a number.
doc.add("100", {"type":"number"});
</pre>

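To try an index function outside couchdb-lucene, for example in a unit test, a small stand-in for Document can record what would have been indexed. The stub below is an assumption made purely for illustration: the real Document class is supplied by couchdb-lucene, and nothing here reflects its internals.

```javascript
// Stand-in for couchdb-lucene's Document; it only records calls to add.
function Document() {
  this.fields = [];
}
Document.prototype.add = function (value, options) {
  this.fields.push({ value: value, options: options || {} });
};

// An index function written in the style shown above.
var index = function (doc) {
  var ret = new Document();
  ret.add(doc.subject, { "field": "subject", "store": "yes" });
  ret.add(doc.created, { "type": "date", "format": "yyyy-MM-dd" });
  return ret;
};

var result = index({ subject: "hello", created: "2009-01-01" });
console.log(JSON.stringify(result.fields));
```

Inspecting `result.fields` shows the values and options the function passed, which is usually enough to catch a misspelled option name before deploying the design document.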
<h3>Example Transforms</h3>

<h4>Index Everything</h4>

<pre>
function(doc) {
  var ret = new Document();

  function idx(obj) {
    for (var key in obj) {
      switch (typeof obj[key]) {
      case 'object':
        idx(obj[key]);
        break;
      case 'function':
        break;
      default:
        ret.add(obj[key], {"field": key});
        /*
         * Uncomment next line to include
         * all attributes into the default field.
         */
        // ret.add(obj[key]);
        break;
      }
    }
  }

  // Index all attributes
  idx(doc);

  // Index all attachments
  for (var a in doc._attachments) {
    ret.add_attachment(a, {"field": "attachments"});
  }

  return ret;
}
</pre>

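To see which field names the recursive walk produces, the same traversal can be run without Lucene by collecting field/value pairs instead of adding them to a Document. This is a sketch, and `collectFields` is a hypothetical name. Note that nested keys are not prefixed with their parent's name, and array elements end up under their numeric position.

```javascript
// Same traversal as idx() above, but collecting pairs instead of indexing.
function collectFields(obj, out) {
  out = out || [];
  for (var key in obj) {
    switch (typeof obj[key]) {
      case "object":
        // Recurse into nested objects and arrays (null has no keys).
        collectFields(obj[key], out);
        break;
      case "function":
        break;
      default:
        out.push({ field: key, value: obj[key] });
        break;
    }
  }
  return out;
}

var fields = collectFields({
  subject: "hello",
  author: { name: "bob" },
  tags: ["a", "b"],
  missing: null
});
console.log(fields.map(function (f) { return f.field + "=" + f.value; }).join(", "));
// prints: subject=hello, name=bob, 0=a, 1=b
```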
<h4>Index Nothing</h4>

<pre>
function(doc) {
  return null;
}
</pre>

<h4>Index Select Fields</h4>

<pre>
function(doc) {
  var result = new Document();
  result.add(doc.subject, {"field":"subject", "store":"yes"});
  result.add(doc.content, {"field":"subject"});
  result.add(new Date(), {"field":"indexed_at"});
  return result;
}
</pre>

<h4>Index Attachments</h4>

<pre>
function(doc) {
  var result = new Document();
  for(var a in doc._attachments) {
    result.add_attachment(a, {"field":"attachment"});
  }
  return result;
}
</pre>

<h4>A More Complex Example</h4>

<pre>
function(doc) {
  var mk = function(name, value, group) {
    var ret = new Document();
    ret.add(value, {"field":group, "store":"yes"}); // ERROR
    ret.add(group, {"field":"group", "store":"yes"});
    return ret;
  };
  var ret = [];
  if(doc.type != "reference") return null;
  for(var g in doc.groups) {
    // ret is an array of Documents, so append with push.
    ret.push(mk("library", doc.groups[g].library, g));
    ret.push(mk("method", doc.groups[g].method, g));
    ret.push(mk("target", doc.groups[g].target, g));
  }
  return ret;
}
</pre>

<h2>Attachment Indexing</h2>

Couchdb-lucene uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, assuming the correct content_type is set in CouchDB:

<h3>Supported Formats</h3>

<ul>
<li>Excel spreadsheets (application/vnd.ms-excel)
<li>Word documents (application/msword)
<li>PowerPoint presentations (application/vnd.ms-powerpoint)
<li>Visio (application/vnd.visio)
<li>Outlook (application/vnd.ms-outlook)
<li>XML (application/xml)
<li>HTML (text/html)
<li>Images (image/*)
<li>Java class files
<li>Java jar archives
<li>MP3 (audio/mp3)
<li>OpenDocument (application/vnd.oasis.opendocument.*)
<li>Plain text (text/plain)
<li>PDF (application/pdf)
<li>RTF (application/rtf)
</ul>

<h1>Searching with couchdb-lucene</h1>

You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The _body field is searched by default; it includes the extracted text from all attachments. The following parameters can be passed for more sophisticated searches:

<dl>
<dt>q</dt><dd>the query to run (e.g., subject:hello). If no field is specified, the default field is searched.</dd>
<dt>sort</dt><dd>the comma-separated fields to sort on. Prefix a field with / for ascending order or \ for descending order (ascending is the default if not specified).</dd>
<dt>limit</dt><dd>the maximum number of results to return.</dd>
<dt>skip</dt><dd>the number of results to skip.</dd>
<dt>include_docs</dt><dd>whether to include the source docs.</dd>
<dt>stale=ok</dt><dd>If you set the <i>stale</i> option to <i>ok</i>, couchdb-lucene may not perform any refreshing on the index. Searches may be faster as Lucene caches important data (especially for sorting). A query without stale=ok will use the latest data committed to the index.</dd>
<dt>debug</dt><dd>if false, a normal application/json response with results is returned. If true, a pretty-printed HTML blob is returned instead.</dd>
<dt>rewrite</dt><dd>(EXPERT) if true, returns a JSON response with a rewritten query and term frequencies. This allows correct distributed scoring when combining the results from multiple nodes.</dd>
</dl>

<i>All parameters except 'q' are optional.</i>

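Lucene query syntax uses spaces, colons and brackets, so parameter values should be URL-encoded before they are sent. The sketch below builds a search URL from the parameters listed above. `buildSearchUrl` and its argument names are assumptions made for illustration, as is passing `true` for include_docs; only the parameter names themselves come from this document.

```javascript
// Hypothetical helper: assembles an _fti search URL with encoded parameters.
function buildSearchUrl(base, db, params) {
  var pairs = [];
  for (var name in params) {
    pairs.push(encodeURIComponent(name) + "=" + encodeURIComponent(params[name]));
  }
  return base + "/" + encodeURIComponent(db) + "/_fti?" + pairs.join("&");
}

var url = buildSearchUrl("http://localhost:5984", "dbname", {
  q: "body:document AND customer:[A TO C]",
  sort: "\\billing_size",  // a leading backslash requests descending order
  limit: 10,
  include_docs: true       // assumed value format
});
console.log(url);
```

Encoding each value separately keeps characters such as & or = inside a query from being mistaken for parameter separators.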
<h2>Special Fields</h2>

<dl>
<dt>_db</dt><dd>The source database of the document.</dd>
<dt>_id</dt><dd>The _id of the document.</dd>
</dl>

<h2>Dublin Core</h2>

All Dublin Core attributes are indexed and stored if detected in the attachment. Descriptions of the fields come from the Tika javadocs.

<dl>
<dt>dc.contributor</dt><dd>An entity responsible for making contributions to the content of the resource.</dd>
<dt>dc.coverage</dt><dd>The extent or scope of the content of the resource.</dd>
<dt>dc.creator</dt><dd>An entity primarily responsible for making the content of the resource.</dd>
<dt>dc.date</dt><dd>A date associated with an event in the life cycle of the resource.</dd>
<dt>dc.description</dt><dd>An account of the content of the resource.</dd>
<dt>dc.format</dt><dd>Typically, Format may include the media-type or dimensions of the resource.</dd>
<dt>dc.identifier</dt><dd>Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.</dd>
<dt>dc.language</dt><dd>A language of the intellectual content of the resource.</dd>
<dt>dc.modified</dt><dd>Date on which the resource was changed.</dd>
<dt>dc.publisher</dt><dd>An entity responsible for making the resource available.</dd>
<dt>dc.relation</dt><dd>A reference to a related resource.</dd>
<dt>dc.rights</dt><dd>Information about rights held in and over the resource.</dd>
<dt>dc.source</dt><dd>A reference to a resource from which the present resource is derived.</dd>
<dt>dc.subject</dt><dd>The topic of the content of the resource.</dd>
<dt>dc.title</dt><dd>A name given to the resource.</dd>
<dt>dc.type</dt><dd>The nature or genre of the content of the resource.</dd>
</dl>

<h2>Examples</h2>

<pre>
http://localhost:5984/dbname/_fti?q=field_name:value
http://localhost:5984/dbname/_fti?q=field_name:value&sort=other_field
http://localhost:5984/dbname/_fti?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
</pre>

<h2>Search Results Format</h2>

Here's an example of a JSON response without sorting:

<pre>
{
  "q": "+content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "search_duration": 518,
  "fetch_duration": 4,
  "rows": [
    {
      "_id": "hain-m-all_documents-257.",
      "score": 1.601625680923462
    },
    {
      "_id": "hain-m-notes_inbox-257.",
      "score": 1.601625680923462
    }
  ]
}
</pre>

And the same with sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "search_duration": 660,
  "fetch_duration": 4,
  "sort_order": [
    {
      "field": "source",
      "reverse": false,
      "type": "string"
    },
    {
      "reverse": false,
      "type": "doc"
    }
  ],
  "rows": [
    {
      "_id": "shankman-j-inbox-105.",
      "score": 0.6131107211112976,
      "sort_order": [
        "enron",
        6
      ]
    },
    {
      "_id": "shankman-j-inbox-8.",
      "score": 0.7492915391921997,
      "sort_order": [
        "enron",
        7
      ]
    },
    {
      "_id": "shankman-j-inbox-30.",
      "score": 0.507369875907898,
      "sort_order": [
        "enron",
        8
      ]
    }
  ]
}
</pre>

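A client can walk the rows array to collect document ids and use skip, limit and total_rows to page through the results. The sketch below works on a trimmed copy of the sorted response; `summarize` is a hypothetical helper, not part of couchdb-lucene.

```javascript
// Trimmed copy of the sorted response shown above.
var response = {
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "rows": [
    { "_id": "shankman-j-inbox-105.", "score": 0.6131107211112976, "sort_order": ["enron", 6] },
    { "_id": "shankman-j-inbox-8.", "score": 0.7492915391921997, "sort_order": ["enron", 7] }
  ]
};

// Hypothetical helper: extracts ids and works out the next page offset.
function summarize(res) {
  var seen = res.skip + res.rows.length;
  return {
    ids: res.rows.map(function (row) { return row._id; }),
    // True when more matches exist beyond this page.
    hasMore: seen < res.total_rows,
    nextSkip: seen
  };
}

var page = summarize(response);
console.log(page.ids.join(", "), page.hasMore, page.nextSkip);
```

Passing `page.nextSkip` as the skip parameter of the next request would fetch the following page.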
<h1>Fetching information about the index</h1>

Calling couchdb-lucene without arguments returns a JSON object with information about the index.

<pre>
http://127.0.0.1:5984/enron/_fti
</pre>

returns:

<pre>
{"doc_count":517350,"doc_del_count":1,"disk_size":318543045}
</pre>

<h1>Working With The Source</h1>

To develop "live", type "mvn dependency:unpack-dependencies" and change the external line to something like this:

<pre>
fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
/path/to/couchdb-lucene/target/dependency com.github.rnewson.couchdb.lucene.Main
</pre>

You will need to restart CouchDB if you change the couchdb-lucene source code, but this is very fast.

<h1>Configuration</h1>

couchdb-lucene respects several system properties:

<dl>
<dt>couchdb.url</dt><dd>the URL to contact CouchDB with (default is "http://localhost:5984").</dd>
<dt>couchdb.lucene.dir</dt><dd>the path to the Lucene indexes (the default is to make a directory called 'lucene' relative to CouchDB's current working directory).</dd>
<dt>couchdb.log.dir</dt><dd>the directory of the log file (which is called couchdb-lucene.log); defaults to the platform-specific temp directory.</dd>
</dl>

You can override these properties like this:

<pre>
fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
-cp /home/rnewson/Source/couchdb-lucene/target/classes:\
/home/rnewson/Source/couchdb-lucene/target/dependency \
com.github.rnewson.couchdb.lucene.Main
</pre>

<h2>Basic Authentication</h2>

If you put CouchDB behind an authenticating proxy, you can still configure couchdb-lucene to pull from it by specifying additional system properties. Currently only Basic authentication is supported.

<dl>
<dt>couchdb.user</dt><dd>the user to authenticate as.</dd>
<dt>couchdb.password</dt><dd>the password to authenticate with.</dd>
</dl>

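For reference, HTTP Basic authentication sends the user name and password, joined by a colon and base64-encoded, in an Authorization header. The sketch below shows what such a header looks like, which can help when checking proxy logs. It assumes Node.js (for `Buffer`), uses made-up credentials, and does not describe couchdb-lucene's internal code.

```javascript
// Builds the value of an HTTP Basic Authorization header (Node.js).
function basicAuthHeader(user, password) {
  var token = Buffer.from(user + ":" + password, "utf8").toString("base64");
  return "Basic " + token;
}

// Made-up credentials for illustration only.
var header = basicAuthHeader("alice", "secret");
console.log("Authorization: " + header);
```

Base64 is an encoding, not encryption, so the credentials are readable by anyone who can see the request.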
<h2>IPv6</h2>

The default for couchdb.url is problematic on an IPv6 system. Specify -Dcouchdb.url=http://[::1]:5984 to resolve it.