<h1>News</h1>

The indexing API in 0.3 will change once again to allow multiple design documents and "views" into Lucene. It will also move much of the Lucene-specific configuration into an options object. Please read the TODO for details.

The indexing API in 0.2 has completely changed; please re-read this document and report any surprises or bugs to the bug tracker.

Issue tracking is now available at <a href="http://rnewson.lighthouseapp.com/projects/27420-couchdb-lucene">lighthouseapp</a>.

<h1>System Requirements</h1>

Sun JDK 5 or higher is required. Couchdb-lucene is known to be incompatible with OpenJDK, which includes an earlier, incompatible version of the Rhino JavaScript library.

<h1>Build couchdb-lucene</h1>

<ol>
<li>Install Maven 2.
<li>Check out the repository.
<li>Type 'mvn'.
<li>Configure CouchDB (see below).
</ol>

<h1>Configure CouchDB</h1>

<pre>
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
</pre>

<h1>Indexing Strategy</h1>

<h2>Document Indexing</h2>

You must supply a transform function in order to enable couchdb-lucene.

Add a design document called _design/lucene to your database with an attribute called "transform". The value of this attribute is a JavaScript function.

The transform function can return null, to prevent indexing, or either a single Document or an array of Documents.

The transform function is called for each document in the database. To pass information to Lucene, you must populate Document instances with data from the original CouchDB document.

<h3>The Document class</h3>

You may construct a new Document instance with:

<pre>
var doc = new Document();
</pre>

Several functions are available that populate a Document.

<pre>
// Indexed, analyzed but not stored.
doc.field("name", "value");

// Indexed, analyzed and stored.
doc.field("name", "value", "yes");

// Indexed, stored but not analyzed.
doc.field("name", "value", "yes", "not_analyzed");

// Extract text from the named attachment and index it (but not store it).
doc.attachment("name", "attachment name");

// Interpret "value" as a date using the default date formats.
doc.date("name", "value");

// Interpret "value" as a date using the supplied format string
// (see Java's SimpleDateFormat class for the syntax).
doc.date("name", "value", "format");
</pre>
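
Because a transform is an ordinary JavaScript function, it can be exercised outside couchdb-lucene before it goes into the design document. The sketch below assumes a stand-in Document that only records the calls made on it. It is not the real class, which lives inside couchdb-lucene and may behave differently. The sample transform and the test documents are likewise made up for illustration.

```javascript
// Stand-in for couchdb-lucene's Document class, for local testing only.
// It records each call instead of handing data to Lucene.
function Document() {
  this.calls = [];
}
["field", "attachment", "date"].forEach(function (method) {
  Document.prototype[method] = function () {
    // Store the method name followed by the arguments exactly as passed.
    this.calls.push([method].concat(Array.prototype.slice.call(arguments)));
  };
});

// A transform written the way it would appear in _design/lucene.
var transform = function (doc) {
  if (!doc.subject) return null; // returning null prevents indexing
  var result = new Document();
  result.field("subject", doc.subject, "yes");
  result.field("content", doc.content);
  return result;
};

// Hypothetical CouchDB documents used as test input.
var indexed = transform({ _id: "a", subject: "hello", content: "world" });
var skipped = transform({ _id: "b" });

console.log(JSON.stringify(indexed.calls));
console.log(skipped);
```

Inspecting the recorded calls shows which fields a given document would contribute, and a null result confirms that the document would be skipped.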

<h3>Example Transforms</h3>

<h4>Index Everything</h4>

<pre>
function(doc) {
  var ret = new Document();

  function idx(obj) {
    for (var key in obj) {
      switch (typeof obj[key]) {
      case 'object':
        idx(obj[key]);
        break;
      case 'function':
        break;
      default:
        ret.field(key, obj[key]);
        /* Uncomment next line to include
         * all attributes into a single field.
         */
        // ret.field("all", obj[key]);
        break;
      }
    }
  }

  // Index all attributes
  idx(doc);

  // Index all attachments
  for (var a in doc._attachments) {
    ret.attachment("attachment", a);
  }

  return ret;
}
</pre>

<h4>Index Nothing</h4>

<pre>
function(doc) {
  return null;
}
</pre>

<h4>Index Select Fields</h4>

<pre>
function(doc) {
  var result = new Document();
  result.field("subject", doc.subject, "yes");
  result.field("content", doc.content);
  result.date("indexed_at", new Date());
  return result;
}
</pre>

<h4>Index Attachments</h4>

<pre>
function(doc) {
  var result = new Document();
  for (var a in doc._attachments) {
    result.attachment("attachment", a);
  }
  return result;
}
</pre>

<h4>A More Complex Example</h4>

<pre>
function(doc) {
  var mk = function(name, value, group) {
    var ret = new Document(name, value, "yes");
    ret.field("group", group, "yes");
    return ret;
  };
  var ret = [];
  if (doc.type != "reference") return null;
  for (var g in doc.groups) {
    ret.push(mk("library", doc.groups[g].library, g));
    ret.push(mk("method", doc.groups[g].method, g));
    ret.push(mk("target", doc.groups[g].target, g));
  }
  return ret;
}
</pre>

<h2>Attachment Indexing</h2>

Couchdb-lucene uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, assuming the correct content_type is set in CouchDB:

<h3>Supported Formats</h3>

<ul>
<li>Excel spreadsheets (application/vnd.ms-excel)
<li>Word documents (application/msword)
<li>Powerpoint presentations (application/vnd.ms-powerpoint)
<li>Visio (application/vnd.visio)
<li>Outlook (application/vnd.ms-outlook)
<li>XML (application/xml)
<li>HTML (text/html)
<li>Images (image/*)
<li>Java class files
<li>Java jar archives
<li>MP3 (audio/mp3)
<li>OpenDocument (application/vnd.oasis.opendocument.*)
<li>Plain text (text/plain)
<li>PDF (application/pdf)
<li>RTF (application/rtf)
</ul>

<h1>Searching with couchdb-lucene</h1>

You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The _body field is searched by default, and it includes the extracted text from all attachments. The following parameters can be passed for more sophisticated searches:

<dl>
<dt>q</dt><dd>the query to run (e.g., subject:hello)</dd>
<dt>sort</dt><dd>the comma-separated fields to sort on. Prefix with / for ascending order and \ for descending order (ascending is the default if not specified).</dd>
<dt>limit</dt><dd>the maximum number of results to return</dd>
<dt>skip</dt><dd>the number of results to skip</dd>
<dt>include_docs</dt><dd>whether to include the source docs</dd>
<dt>stale=ok</dt><dd>If you set the <i>stale</i> option to <i>ok</i>, couchdb-lucene may skip refreshing the index. Searches may be faster as Lucene caches important data (especially for sorting). A query without stale=ok will use the latest data committed to the index.</dd>
<dt>debug</dt><dd>If false, a normal application/json response is returned. If true, a pretty-printed HTML blob is returned instead.</dd>
<dt>rewrite</dt><dd>(EXPERT) if true, returns a JSON response with a rewritten query and term frequencies. This allows correct distributed scoring when combining the results from multiple nodes.</dd>
</dl>

<i>All parameters except 'q' are optional.</i>
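
Queries often contain spaces, colons and brackets, so it can help to percent-encode each parameter when building a search URL in code. The following is a minimal sketch. The helper name buildSearchUrl and its arguments are illustrative and not part of couchdb-lucene, and the host and database name are placeholders.

```javascript
// Build a search URL for the _fti handler with every parameter percent-encoded.
// buildSearchUrl is a hypothetical helper, not something couchdb-lucene provides.
function buildSearchUrl(base, db, params) {
  if (!params || !params.q) {
    throw new Error("the 'q' parameter is required");
  }
  var pairs = [];
  for (var key in params) {
    if (Object.prototype.hasOwnProperty.call(params, key)) {
      pairs.push(encodeURIComponent(key) + "=" + encodeURIComponent(String(params[key])));
    }
  }
  return base + "/" + encodeURIComponent(db) + "/_fti?" + pairs.join("&");
}

var url = buildSearchUrl("http://localhost:5984", "dbname", {
  q: "body:document AND customer:[A TO C]",
  sort: "billing_size",
  limit: 10
});
console.log(url);
```

The same helper encodes the \ prefix used for descending sorts, so a sort value such as "\billing_size" passes through safely.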

<h2>Special Fields</h2>

<dl>
<dt>_db</dt><dd>The source database of the document.</dd>
<dt>_id</dt><dd>The _id of the document.</dd>
</dl>

<h2>Dublin Core</h2>

All Dublin Core attributes are indexed and stored if detected in the attachment. Descriptions of the fields come from the Tika javadocs.

<dl>
<dt>dc.contributor</dt><dd>An entity responsible for making contributions to the content of the resource.</dd>
<dt>dc.coverage</dt><dd>The extent or scope of the content of the resource.</dd>
<dt>dc.creator</dt><dd>An entity primarily responsible for making the content of the resource.</dd>
<dt>dc.date</dt><dd>A date associated with an event in the life cycle of the resource.</dd>
<dt>dc.description</dt><dd>An account of the content of the resource.</dd>
<dt>dc.format</dt><dd>Typically, Format may include the media-type or dimensions of the resource.</dd>
<dt>dc.identifier</dt><dd>Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.</dd>
<dt>dc.language</dt><dd>A language of the intellectual content of the resource.</dd>
<dt>dc.modified</dt><dd>Date on which the resource was changed.</dd>
<dt>dc.publisher</dt><dd>An entity responsible for making the resource available.</dd>
<dt>dc.relation</dt><dd>A reference to a related resource.</dd>
<dt>dc.rights</dt><dd>Information about rights held in and over the resource.</dd>
<dt>dc.source</dt><dd>A reference to a resource from which the present resource is derived.</dd>
<dt>dc.subject</dt><dd>The topic of the content of the resource.</dd>
<dt>dc.title</dt><dd>A name given to the resource.</dd>
<dt>dc.type</dt><dd>The nature or genre of the content of the resource.</dd>
</dl>

<h2>Examples</h2>

<pre>
http://localhost:5984/dbname/_fti?q=field_name:value
http://localhost:5984/dbname/_fti?q=field_name:value&sort=other_field
http://localhost:5984/dbname/_fti?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
</pre>

<h2>Search Results Format</h2>

Here's an example of a JSON response without sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "search_duration": 518,
  "fetch_duration": 4,
  "rows": [
    {
      "_id": "hain-m-all_documents-257.",
      "score": 1.601625680923462
    },
    {
      "_id": "hain-m-notes_inbox-257.",
      "score": 1.601625680923462
    }
  ]
}
</pre>

And the same with sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "search_duration": 660,
  "fetch_duration": 4,
  "sort_order": [
    {
      "field": "source",
      "reverse": false,
      "type": "string"
    },
    {
      "reverse": false,
      "type": "doc"
    }
  ],
  "rows": [
    {
      "_id": "shankman-j-inbox-105.",
      "score": 0.6131107211112976,
      "sort_order": [
        "enron",
        6
      ]
    },
    {
      "_id": "shankman-j-inbox-8.",
      "score": 0.7492915391921997,
      "sort_order": [
        "enron",
        7
      ]
    },
    {
      "_id": "shankman-j-inbox-30.",
      "score": 0.507369875907898,
      "sort_order": [
        "enron",
        8
      ]
    }
  ]
}
</pre>
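
A client can walk through a large result set by combining the skip, limit and total_rows values in a response. The sketch below assumes that total_rows is the total number of matches and that skip and limit behave as a page offset and page size. The nextPage helper is hypothetical and not part of couchdb-lucene. The response object is a trimmed copy of the unsorted example.

```javascript
// Given a search response, work out the skip/limit for the next page,
// or return null when the current page is the last one.
function nextPage(response) {
  var next = response.skip + response.rows.length;
  if (response.rows.length === 0 || next >= response.total_rows) {
    return null;
  }
  return { skip: next, limit: response.limit };
}

// Trimmed copy of the unsorted example response.
var response = {
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "rows": [
    { "_id": "hain-m-all_documents-257.", "score": 1.601625680923462 },
    { "_id": "hain-m-notes_inbox-257.", "score": 1.601625680923462 }
  ]
};

// Collect the document ids on this page.
var ids = response.rows.map(function (row) { return row._id; });
var page = nextPage(response);

console.log(ids);
console.log(page);
```

The returned skip and limit can be passed back as query parameters on the next request.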

<h1>Fetching information about the index</h1>

Calling couchdb-lucene without arguments returns a JSON object with information about the index.

<pre>
http://127.0.0.1:5984/enron/_fti
</pre>

returns:

<pre>
{"doc_count":517350,"doc_del_count":1,"disk_size":318543045}
</pre>

<h1>Working With The Source</h1>

To develop "live", type "mvn dependency:unpack-dependencies" and change the external line to something like this:

<pre>
fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
/path/to/couchdb-lucene/target/dependency com.github.rnewson.couchdb.lucene.Main
</pre>

You will need to restart CouchDB if you change the couchdb-lucene source code, but this is very fast.

<h1>Configuration</h1>

couchdb-lucene respects several system properties:

<dl>
<dt>couchdb.url</dt><dd>the URL to contact CouchDB with (default is "http://localhost:5984")</dd>
<dt>couchdb.lucene.dir</dt><dd>specify the path to the Lucene indexes (the default is to make a directory called 'lucene' relative to CouchDB's current working directory).</dd>
<dt>couchdb.log.dir</dt><dd>specify the directory of the log file (which is called couchdb-lucene.log); defaults to the platform-specific temp directory.</dd>
</dl>

You can override these properties like this:

<pre>
fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
-cp /home/rnewson/Source/couchdb-lucene/target/classes:\
/home/rnewson/Source/couchdb-lucene/target/dependency \
com.github.rnewson.couchdb.lucene.Main
</pre>

<h2>Basic Authentication</h2>

If you put CouchDB behind an authenticating proxy you can still configure couchdb-lucene to pull from it by specifying additional system properties. Currently only Basic authentication is supported.

<dl>
<dt>couchdb.user</dt><dd>the user to authenticate as.</dd>
<dt>couchdb.password</dt><dd>the password to authenticate with.</dd>
</dl>

<h2>IPv6</h2>

The default for couchdb.url is problematic on an IPv6 system. Specify -Dcouchdb.url=http://[::1]:5984 to resolve it.