<h1>News</h1>

The indexing API in 0.3 will change once again to allow multiple design documents and "views" into Lucene. It will also move much of the Lucene-specific configuration into an options object. Please read the TODO for details.

The indexing API in 0.2 has completely changed. Please re-read this document and report any surprises or bugs to the bug tracker.

Issue tracking is now available at <a href="http://rnewson.lighthouseapp.com/projects/27420-couchdb-lucene">lighthouseapp</a>.

<h1>System Requirements</h1>

Sun JDK 5 or higher is required. Couchdb-lucene is known to be incompatible with OpenJDK, which includes an earlier, incompatible version of the Rhino JavaScript library.

<h1>Build couchdb-lucene</h1>

<ol>
<li>Install Maven 2.
<li>Check out the repository.
<li>Type 'mvn'.
<li>Configure CouchDB (see below).
</ol>

<h1>Configure CouchDB</h1>

<pre>
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
</pre>

<h1>Indexing Strategy</h1>

<h2>Document Indexing</h2>

You must supply an index function to enable couchdb-lucene; by default, nothing is indexed.

You may add any number of index views in any number of design documents. All searches will be constrained to documents emitted by those view functions.

Declare your functions as follows:

<pre>
{
  "map": "<i>conventional view code goes here</i>",

  "fulltext": {
    "by_subject": {
      "defaults": { "store":"yes" },
      "index":"function(doc) { var ret=new Document(); ret.add(doc.subject); return ret }"
    },
    "french_documents": {
      "defaults": { "language":"fr" },
      "index":"function(doc) { if (doc.language != \"fr\") { return null;} var ret=new Document(); <i>etc</i> return ret; }"
    }
  }
}
</pre>

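Hand-escaping quotes inside the "index" strings is error-prone. One way around it, sketched here under assumptions, is to write the index functions as ordinary JavaScript and let JSON.stringify do the escaping. The helper name buildFulltext is hypothetical and not part of couchdb-lucene; Document is supplied by couchdb-lucene at indexing time and is never called by this snippet. The "map" member is omitted for brevity.

```javascript
// Sketch: serialize index functions into the "fulltext" member of a design document.
// buildFulltext is a hypothetical helper, not part of couchdb-lucene.
function buildFulltext(views) {
  var fulltext = {};
  Object.keys(views).forEach(function (name) {
    fulltext[name] = {
      defaults: views[name].defaults || {},
      // Function.prototype.toString yields the function's source text;
      // JSON.stringify later escapes any quotes it contains.
      index: views[name].index.toString()
    };
  });
  return fulltext;
}

var designDoc = {
  fulltext: buildFulltext({
    by_subject: {
      defaults: { store: "yes" },
      index: function (doc) {
        var ret = new Document(); // Document only exists inside couchdb-lucene
        ret.add(doc.subject);
        return ret;
      }
    },
    french_documents: {
      defaults: { language: "fr" },
      index: function (doc) {
        if (doc.language != "fr") { return null; }
        var ret = new Document();
        ret.add(doc.subject);
        return ret;
      }
    }
  })
};

var json = JSON.stringify(designDoc, null, 2);
console.log(json);
```

The printed JSON can then be stored as the design document body; the quotes around "fr" arrive escaped without any manual editing.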
A fulltext object contains one or more index view declarations. An index view consists of:

<dl>
<dt>defaults</dt><dd>Overrides the default values of the indexing options for this view. A full list of options follows.</dd>
<dt>index</dt><dd>The indexing function itself, documented below.</dd>
</dl>

<h3>The Defaults Object</h3>

The following indexing options can be defaulted:

<table>
<tr>
<th>name</th>
<th>description</th>
<th>available options</th>
<th>default</th>
</tr>
<tr>
<th>field</th>
<td>the field name to index under</td>
<td>user-defined</td>
<td>default</td>
</tr>
<tr>
<th>store</th>
<td>whether the data is stored</td>
<td>yes, no</td>
<td>no</td>
</tr>
<tr>
<th>index</th>
<td>whether (and how) the data is indexed</td>
<td>analyzed, analyzed_no_norms, no, not_analyzed, not_analyzed_no_norms</td>
<td>analyzed</td>
</tr>
<tr>
<th>analyzer</th>
<td>how the data is analyzed</td>
<td>simple, standard</td>
<td>standard</td>
</tr>
<tr>
<th>language</th>
<td>which language the data is in</td>
<td>br, cjk, cn, cz, de, el, en, fr, nl, ru, th</td>
<td>en</td>
</tr>
</table>

<h3>The Document class</h3>

You may construct a new Document instance with:

<pre>
var doc = new Document();
</pre>

Data may be added to this document with the add method, which takes an optional second object argument that can override any of the above default values.

<pre>
// Add with all the defaults.
doc.add("value");

// Add a subject field.
doc.add("this is the subject line.", {"field":"subject"});

// Add but ensure it's stored.
doc.add("value", {"store":"yes"});

// Add but don't analyze.
doc.add("don't analyze me", {"index":"not_analyzed"});

// Extract text from the named attachment and index it (but not store it).
doc.attachment("attachment name", {"field":"attachments"});

// Interpret the value as a date using the default date formats.
doc.add("2009-01-01T00:00:00Z", {"type":"date"});

// Interpret the value as a date using the supplied format string
// (see Java's SimpleDateFormat class for the syntax).
doc.add("2009-01-01", {"type":"date", "date_format":"yyyy-MM-dd"});
</pre>

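The precedence described above (options passed to add override the view's defaults, which in turn override the built-in defaults from the table) can be pictured as a plain object merge. This is only an illustration of the described behaviour, not couchdb-lucene's implementation, and resolveOptions is a hypothetical name.

```javascript
// Built-in defaults, copied from the table of indexing options.
var BUILT_IN_DEFAULTS = {
  field: "default",
  store: "no",
  index: "analyzed",
  analyzer: "standard",
  language: "en"
};

// Hypothetical helper: later sources win over earlier ones.
function resolveOptions(viewDefaults, callOptions) {
  return Object.assign({}, BUILT_IN_DEFAULTS, viewDefaults || {}, callOptions || {});
}

// A view declared with "defaults": { "store": "yes" } and a call such as
// doc.add("...", {"field": "subject"}) would resolve to these options:
var resolved = resolveOptions({ store: "yes" }, { field: "subject" });
console.log(resolved);
```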
<h3>Example Transforms</h3>

<h4>Index Everything</h4>

<pre>
function(doc) {
  var ret = new Document();

  function idx(obj) {
    for (var key in obj) {
      switch (typeof obj[key]) {
      case 'object':
        idx(obj[key]);
        break;
      case 'function':
        break;
      default:
        ret.field(key, obj[key]);
        /* Uncomment next line to include
         * all attributes into a single field.
         */
        // ret.field("all", obj[key]);
        break;
      }
    }
  }

  // Index all attributes
  idx(doc);

  // Index all attachments
  for (var a in doc._attachments) {
    ret.attachment("attachment", a);
  }

  return ret;
}
</pre>

Robert Newson authored
187 <h4>Index Nothing</h4>
188
189 <pre>
190 function(doc) {
191 return null;
192 }
193 </pre>
194
c207a60 update README
Robert Newson authored
195 <h4>Index Select Fields</h4>
ccb81a8 add example transforms section.
Robert Newson authored
196
197 <pre>
198 function(doc) {
c207a60 update README
Robert Newson authored
199 var result = new Document();
f59999b improve examples
Robert Newson authored
200 result.field("subject", doc.subject, "yes");
201 result.field("content", doc.content);
5ff4cda add date example.
Robert Newson authored
202 result.date("indexed_at", new Date());
c207a60 update README
Robert Newson authored
203 return result;
ccb81a8 add example transforms section.
Robert Newson authored
204 }
205 </pre>
206
c207a60 update README
Robert Newson authored
207 <h4>Index Attachments</h4>
ccb81a8 add example transforms section.
Robert Newson authored
208
209 <pre>
210 function(doc) {
c207a60 update README
Robert Newson authored
211 var result = new Document();
212 for(var a in doc._attachments) {
213 result.attachment("attachment", a);
ccb81a8 add example transforms section.
Robert Newson authored
214 }
c207a60 update README
Robert Newson authored
215 return result;
216 }
217 </pre>
218
<h4>A More Complex Example</h4>

<pre>
function(doc) {
  var mk = function(name, value, group) {
    var ret = new Document(name, value, "yes");
    ret.field("group", group, "yes");
    return ret;
  };
  var ret = [];
  if (doc.type != "reference") return null;
  for (var g in doc.groups) {
    ret.push(mk("library", doc.groups[g].library, g));
    ret.push(mk("method", doc.groups[g].method, g));
    ret.push(mk("target", doc.groups[g].target, g));
  }
  return ret;
}
</pre>

<h2>Attachment Indexing</h2>

Couchdb-lucene uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, assuming the correct content_type is set in CouchDB.

<h3>Supported Formats</h3>

<ul>
<li>Excel spreadsheets (application/vnd.ms-excel)
<li>Word documents (application/msword)
<li>Powerpoint presentations (application/vnd.ms-powerpoint)
<li>Visio (application/vnd.visio)
<li>Outlook (application/vnd.ms-outlook)
<li>XML (application/xml)
<li>HTML (text/html)
<li>Images (image/*)
<li>Java class files
<li>Java jar archives
<li>MP3 (audio/mp3)
<li>OpenDocument (application/vnd.oasis.opendocument.*)
<li>Plain text (text/plain)
<li>PDF (application/pdf)
<li>RTF (application/rtf)
</ul>

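Because indexing depends on the content_type stored with each attachment, it can help to check a document before relying on its attachments being searchable. The following is a minimal sketch: isListedType and unlistedAttachments are hypothetical helpers, and the list only covers the entries above that name a content type (Java class files and jar archives are left out because the list gives none for them).

```javascript
// Content types copied from the list above; a trailing "*" marks a prefix match.
var SUPPORTED_TYPES = [
  "application/vnd.ms-excel",
  "application/msword",
  "application/vnd.ms-powerpoint",
  "application/vnd.visio",
  "application/vnd.ms-outlook",
  "application/xml",
  "text/html",
  "image/*",
  "audio/mp3",
  "application/vnd.oasis.opendocument.*",
  "text/plain",
  "application/pdf",
  "application/rtf"
];

// Hypothetical helper: true if contentType matches an entry in the list.
function isListedType(contentType) {
  // Drop parameters such as "; charset=utf-8" and normalize case.
  var type = String(contentType).split(";")[0].trim().toLowerCase();
  return SUPPORTED_TYPES.some(function (pattern) {
    if (pattern.charAt(pattern.length - 1) === "*") {
      return type.indexOf(pattern.slice(0, -1)) === 0;
    }
    return type === pattern;
  });
}

// Hypothetical helper: names of attachments whose content_type is not listed.
function unlistedAttachments(doc) {
  var attachments = doc._attachments || {};
  return Object.keys(attachments).filter(function (name) {
    return !isListedType(attachments[name].content_type);
  });
}
```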
<h1>Searching with couchdb-lucene</h1>

You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The _body field is searched by default, which includes the extracted text from all attachments. The following parameters can be passed for more sophisticated searches:

<dl>
<dt>q</dt><dd>the query to run (e.g. subject:hello)</dd>
<dt>sort</dt><dd>the comma-separated fields to sort on. Prefix with / for ascending order and \ for descending order (ascending is the default if not specified).</dd>
<dt>limit</dt><dd>the maximum number of results to return</dd>
<dt>skip</dt><dd>the number of results to skip</dd>
<dt>include_docs</dt><dd>whether to include the source docs</dd>
<dt>stale=ok</dt><dd>If you set the <i>stale</i> option to <i>ok</i>, couchdb-lucene may skip refreshing the index. Searches may be faster as Lucene caches important data (especially for sorting). A query without stale=ok will use the latest data committed to the index.</dd>
<dt>debug</dt><dd>if false, a normal application/json response with results is returned. If true, a pretty-printed HTML blob is returned instead.</dd>
<dt>rewrite</dt><dd>(EXPERT) if true, returns a JSON response with a rewritten query and term frequencies. This allows correct distributed scoring when combining the results from multiple nodes.</dd>
</dl>

<i>All parameters except 'q' are optional.</i>

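Queries often contain spaces, colons, brackets and the / and \ sort prefixes, all of which need percent-encoding in a URL. Here is a small sketch of assembling a request from the parameters above; buildSearchUrl is a hypothetical helper, and it assumes the server applies standard URL decoding to the query string.

```javascript
// Hypothetical helper: build a _fti search URL from the parameters above.
function buildSearchUrl(baseUrl, dbName, params) {
  var pairs = [];
  Object.keys(params).forEach(function (key) {
    var value = params[key];
    if (value === undefined || value === null) { return; }
    // The sort parameter is a comma-separated list of fields.
    if (Array.isArray(value)) { value = value.join(","); }
    pairs.push(encodeURIComponent(key) + "=" + encodeURIComponent(String(value)));
  });
  return baseUrl + "/" + encodeURIComponent(dbName) + "/_fti?" + pairs.join("&");
}

var url = buildSearchUrl("http://localhost:5984", "dbname", {
  q: "body:document AND customer:[A TO C]",
  sort: ["/billing_size", "\\customer"], // ascending, then descending
  limit: 10,
  stale: "ok"
});
console.log(url);
```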
<h2>Special Fields</h2>

<dl>
<dt>_db</dt><dd>The source database of the document.</dd>
<dt>_id</dt><dd>The _id of the document.</dd>
</dl>

<h2>Dublin Core</h2>

All Dublin Core attributes are indexed and stored if detected in the attachment. Descriptions of the fields come from the Tika javadocs.

<dl>
<dt>dc.contributor</dt><dd>An entity responsible for making contributions to the content of the resource.</dd>
<dt>dc.coverage</dt><dd>The extent or scope of the content of the resource.</dd>
<dt>dc.creator</dt><dd>An entity primarily responsible for making the content of the resource.</dd>
<dt>dc.date</dt><dd>A date associated with an event in the life cycle of the resource.</dd>
<dt>dc.description</dt><dd>An account of the content of the resource.</dd>
<dt>dc.format</dt><dd>Typically, Format may include the media-type or dimensions of the resource.</dd>
<dt>dc.identifier</dt><dd>Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.</dd>
<dt>dc.language</dt><dd>A language of the intellectual content of the resource.</dd>
<dt>dc.modified</dt><dd>Date on which the resource was changed.</dd>
<dt>dc.publisher</dt><dd>An entity responsible for making the resource available.</dd>
<dt>dc.relation</dt><dd>A reference to a related resource.</dd>
<dt>dc.rights</dt><dd>Information about rights held in and over the resource.</dd>
<dt>dc.source</dt><dd>A reference to a resource from which the present resource is derived.</dd>
<dt>dc.subject</dt><dd>The topic of the content of the resource.</dd>
<dt>dc.title</dt><dd>A name given to the resource.</dd>
<dt>dc.type</dt><dd>The nature or genre of the content of the resource.</dd>
</dl>

<h2>Examples</h2>

<pre>
http://localhost:5984/dbname/_fti?q=field_name:value
http://localhost:5984/dbname/_fti?q=field_name:value&sort=other_field
http://localhost:5984/dbname/_fti?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
</pre>

<h2>Search Results Format</h2>

Here's an example of a JSON response without sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "search_duration": 518,
  "fetch_duration": 4,
  "rows": [
    {
      "_id": "hain-m-all_documents-257.",
      "score": 1.601625680923462
    },
    {
      "_id": "hain-m-notes_inbox-257.",
      "score": 1.601625680923462
    }
  ]
}
</pre>

And the same with sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "search_duration": 660,
  "fetch_duration": 4,
  "sort_order": [
    {
      "field": "source",
      "reverse": false,
      "type": "string"
    },
    {
      "reverse": false,
      "type": "doc"
    }
  ],
  "rows": [
    {
      "_id": "shankman-j-inbox-105.",
      "score": 0.6131107211112976,
      "sort_order": [
        "enron",
        6
      ]
    },
    {
      "_id": "shankman-j-inbox-8.",
      "score": 0.7492915391921997,
      "sort_order": [
        "enron",
        7
      ]
    },
    {
      "_id": "shankman-j-inbox-30.",
      "score": 0.507369875907898,
      "sort_order": [
        "enron",
        8
      ]
    }
  ]
}
</pre>

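A client will typically pull the document ids out of rows and use skip, limit and total_rows to page through the results. The sketch below does this for an abbreviated copy of the unsorted response above; summarize is a hypothetical helper name.

```javascript
// Abbreviated copy of the unsorted response shown above.
var response = {
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "search_duration": 518,
  "fetch_duration": 4,
  "rows": [
    { "_id": "hain-m-all_documents-257.", "score": 1.601625680923462 },
    { "_id": "hain-m-notes_inbox-257.", "score": 1.601625680923462 }
  ]
};

// Hypothetical helper: summarize one page of results.
function summarize(res) {
  var nextSkip = res.skip + res.rows.length;
  return {
    ids: res.rows.map(function (row) { return row._id; }),
    // null once the last page has been reached
    nextSkip: nextSkip < res.total_rows ? nextSkip : null
  };
}

var page = summarize(response);
console.log(page.ids, page.nextSkip);
```

Passing page.nextSkip as the skip parameter of the next request would fetch the following page.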
<h1>Fetching information about the index</h1>

Calling couchdb-lucene without arguments returns a JSON object with information about the index.

<pre>
http://127.0.0.1:5984/enron/_fti
</pre>

returns:

<pre>
{"doc_count":517350,"doc_del_count":1,"disk_size":318543045}
</pre>

<h1>Working With The Source</h1>

To develop "live", type "mvn dependency:unpack-dependencies" and change the external line to something like this:

<pre>
fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
/path/to/couchdb-lucene/target/dependency com.github.rnewson.couchdb.lucene.Main
</pre>

You will need to restart CouchDB whenever you change the couchdb-lucene source code, but this is very fast.

<h1>Configuration</h1>

couchdb-lucene respects several system properties:

<dl>
<dt>couchdb.url</dt><dd>the URL to contact CouchDB with (default is "http://localhost:5984")</dd>
<dt>couchdb.lucene.dir</dt><dd>the path to the Lucene indexes (the default is to make a directory called 'lucene' relative to CouchDB's current working directory).</dd>
<dt>couchdb.log.dir</dt><dd>the directory of the log file (which is called couchdb-lucene.log); defaults to the platform-specific temp directory.</dd>
</dl>

You can override these properties like this:

<pre>
fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
-cp /home/rnewson/Source/couchdb-lucene/target/classes:\
/home/rnewson/Source/couchdb-lucene/target/dependency \
com.github.rnewson.couchdb.lucene.Main
</pre>

<h2>Basic Authentication</h2>

If you put CouchDB behind an authenticating proxy, you can still configure couchdb-lucene to pull from it by specifying additional system properties. Currently only Basic authentication is supported.

<dl>
<dt>couchdb.user</dt><dd>the user to authenticate as.</dd>
<dt>couchdb.password</dt><dd>the password to authenticate with.</dd>
</dl>

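For reference, HTTP Basic authentication sends the user name and password, joined by a colon and Base64-encoded, in an Authorization header. The sketch below shows the header a proxy would receive for a given pair of values, which can be useful when debugging the proxy side. It assumes Node.js (it relies on Buffer), and basicAuthHeader is a hypothetical name rather than couchdb-lucene code.

```javascript
// Build the Authorization header value used by HTTP Basic authentication.
function basicAuthHeader(user, password) {
  var token = Buffer.from(user + ":" + password, "utf8").toString("base64");
  return "Basic " + token;
}

console.log(basicAuthHeader("Aladdin", "open sesame"));
```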
<h2>IPv6</h2>

The default for couchdb.url is problematic on an IPv6 system. Specify -Dcouchdb.url=http://[::1]:5984 to resolve it.