<h1>News</h1>

The indexing API in 0.3 will change once again to allow multiple design documents and "views" into Lucene. It will also move much of the Lucene-specific configuration into an options object. Please read the TODO for details.

The indexing API in 0.2 has completely changed. Please re-read this document and report any surprises or bugs to the issue tracker.

Issue tracking is now available at <a href="http://rnewson.lighthouseapp.com/projects/27420-couchdb-lucene">lighthouseapp</a>.

<h1>System Requirements</h1>

Sun JDK 5 or higher is required. Couchdb-lucene is known to be incompatible with OpenJDK, which bundles an earlier, incompatible version of the Rhino Javascript library.

<h1>Build couchdb-lucene</h1>

<ol>
<li>Install Maven 2.
<li>Check out the repository.
<li>Type 'mvn'.
<li>Configure CouchDB (see below).
</ol>

<h1>Configure CouchDB</h1>

<pre>
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
</pre>

<h1>Indexing Strategy</h1>

<h2>Document Indexing</h2>

You must supply a transform function to enable couchdb-lucene.

Add a design document called _design/lucene to your database with an attribute called "transform". The value of this attribute is a Javascript function.

The transform function can return null to prevent indexing, a single Document, or an array of Documents.

The transform function is called for each document in the database. To pass information to Lucene, you must populate Document instances with data from the original CouchDB document.

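As a rough sketch of what such a design document might look like, the snippet below builds one in plain Javascript and serializes it to JSON. It assumes the transform is stored as a string, as CouchDB does for other design document functions; the field names `subject` and `content` are made up for illustration.

```javascript
// Hypothetical transform source. Document is supplied by couchdb-lucene
// at indexing time, so it is only referenced inside the string here.
var transformSource = [
  'function(doc) {',
  '  var result = new Document();',
  '  result.field("subject", doc.subject, "yes");',
  '  result.field("content", doc.content);',
  '  return result;',
  '}'
].join('\n');

// Assumption: the function is stored as a string under "transform".
var designDoc = {
  _id: '_design/lucene',
  transform: transformSource
};

// The JSON body you would save into the database.
var designDocJson = JSON.stringify(designDoc, null, 2);
console.log(designDocJson);
```

The resulting JSON can then be saved to the database like any other document.
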
<h3>The Document class</h3>

You may construct a new Document instance with:

<pre>
var doc = new Document();
</pre>

Several functions are available to populate a Document.

<pre>
// Indexed, analyzed but not stored.
doc.field("name", "value");

// Indexed, analyzed and stored.
doc.field("name", "value", "yes");

// Indexed, stored but not analyzed.
doc.field("name", "value", "yes", "not_analyzed");

// Extract text from the named attachment and index it (but not store it).
doc.attachment("name", "attachment name");

// Interpret "value" as a date using the default date formats.
doc.date("name", "value");

// Interpret "value" as a date using the supplied format string
// (see Java's SimpleDateFormat class for the syntax).
doc.date("name", "value", "format");
</pre>

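To experiment with a transform outside couchdb-lucene, one option is a small stand-in for Document that only records the calls made on it. `StubDocument` and `makeTransform` below are hypothetical and not part of couchdb-lucene; the stub mirrors the signatures listed above, performs no indexing, and its default values are assumptions.

```javascript
// Hypothetical stand-in for couchdb-lucene's Document; it only records calls.
function StubDocument() {
  this.calls = [];
}
StubDocument.prototype.field = function (name, value, store, index) {
  // Assumed defaults, chosen to mirror "indexed, analyzed but not stored".
  this.calls.push({ kind: 'field', name: name, value: value,
                    store: store || 'no', index: index || 'analyzed' });
};
StubDocument.prototype.attachment = function (name, attachmentName) {
  this.calls.push({ kind: 'attachment', name: name, attachment: attachmentName });
};
StubDocument.prototype.date = function (name, value, format) {
  this.calls.push({ kind: 'date', name: name, value: value, format: format || null });
};

// A transform written against the documented calls. Document is passed in
// so that the stub can take the place of the real class.
function makeTransform(Document) {
  return function (doc) {
    if (!doc.subject) return null; // returning null prevents indexing
    var result = new Document();
    result.field('subject', doc.subject, 'yes');
    result.date('created', doc.created, 'yyyy-MM-dd'); // SimpleDateFormat pattern
    return result;
  };
}

var transform = makeTransform(StubDocument);
var indexed = transform({ subject: 'hello', created: '2009-04-05' });
var skipped = transform({ body: 'no subject here' });
console.log(indexed.calls.length, skipped);
```

Once the recorded calls look right, the inner function body can be copied into the design document, where the real Document class is available.
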
<h3>Example Transforms</h3>

<h4>Index Everything</h4>

<pre>
function(doc) {
  var ret = new Document();

  function idx(obj) {
    for (var key in obj) {
      switch (typeof obj[key]) {
      case 'object':
        idx(obj[key]);
        break;
      case 'function':
        break;
      default:
        ret.field(key, obj[key]);
        /* Uncomment next line to include
         * all attributes into a single field.
         */
        // ret.field("all", obj[key]);
        break;
      }
    }
  }

  // Index all attributes
  idx(doc);

  // Index all attachments
  for (var a in doc._attachments) {
    ret.attachment("attachment", a);
  }

  return ret;
}
</pre>

<h4>Index Nothing</h4>

<pre>
function(doc) {
  return null;
}
</pre>

<h4>Index Select Fields</h4>

<pre>
function(doc) {
  var result = new Document();
  result.field("subject", doc.subject, "yes");
  result.field("content", doc.content);
  result.date("indexed_at", new Date());
  return result;
}
</pre>

<h4>Index Attachments</h4>

<pre>
function(doc) {
  var result = new Document();
  for (var a in doc._attachments) {
    result.attachment("attachment", a);
  }
  return result;
}
</pre>

<h4>A More Complex Example</h4>

<pre>
function(doc) {
  // Build one Document per (name, value) pair, tagged with its group.
  var mk = function(name, value, group) {
    var ret = new Document();
    ret.field(name, value, "yes");
    ret.field("group", group, "yes");
    return ret;
  };
  var ret = [];
  if (doc.type != "reference") return null;
  for (var g in doc.groups) {
    ret.push(mk("library", doc.groups[g].library, g));
    ret.push(mk("method", doc.groups[g].method, g));
    ret.push(mk("target", doc.groups[g].target, g));
  }
  return ret;
}
</pre>

<h2>Attachment Indexing</h2>

Couchdb-lucene uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, assuming the correct content_type is set in couchdb:

<h3>Supported Formats</h3>

<ul>
<li>Excel spreadsheets (application/vnd.ms-excel)
<li>Word documents (application/msword)
<li>Powerpoint presentations (application/vnd.ms-powerpoint)
<li>Visio (application/vnd.visio)
<li>Outlook (application/vnd.ms-outlook)
<li>XML (application/xml)
<li>HTML (text/html)
<li>Images (image/*)
<li>Java class files
<li>Java jar archives
<li>MP3 (audio/mp3)
<li>OpenDocument (application/vnd.oasis.opendocument.*)
<li>Plain text (text/plain)
<li>PDF (application/pdf)
<li>RTF (application/rtf)
</ul>

<h1>Searching with couchdb-lucene</h1>

You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The _body field is searched by default, which includes the extracted text from all attachments. The following parameters can be passed for more sophisticated searches:

<dl>
<dt>q</dt><dd>the query to run (e.g., subject:hello)</dd>
<dt>sort</dt><dd>the comma-separated fields to sort on. Prefix with / for ascending order and \ for descending order (ascending is the default if not specified).</dd>
<dt>limit</dt><dd>the maximum number of results to return</dd>
<dt>skip</dt><dd>the number of results to skip</dd>
<dt>include_docs</dt><dd>whether to include the source docs</dd>
<dt>stale=ok</dt><dd>If you set the <i>stale</i> option to <i>ok</i>, couchdb-lucene may skip refreshing the index. Searches may be faster, as Lucene caches important data (especially for sorting). A query without stale=ok will use the latest data committed to the index.</dd>
<dt>debug</dt><dd>if false, a normal application/json response with results is returned. If true, a pretty-printed HTML blob is returned instead.</dd>
<dt>rewrite</dt><dd>(EXPERT) if true, returns a json response with a rewritten query and term frequencies. This allows correct distributed scoring when combining the results from multiple nodes.</dd>
</dl>

<i>All parameters except 'q' are optional.</i>

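The parameters above can be combined into a request URL. The helper below is a minimal sketch: `buildSearchUrl` is a hypothetical name, the host and database are placeholders, and it assumes the server decodes percent-encoded values in the usual way (including the / and \ sort prefixes).

```javascript
// Build a _fti search URL from a map of query parameters.
function buildSearchUrl(base, db, params) {
  var parts = [];
  for (var key in params) {
    if (Object.prototype.hasOwnProperty.call(params, key)) {
      // encodeURIComponent escapes spaces, colons, brackets, commas, / and \.
      parts.push(encodeURIComponent(key) + '=' + encodeURIComponent(params[key]));
    }
  }
  return base + '/' + encodeURIComponent(db) + '/_fti?' + parts.join('&');
}

var url = buildSearchUrl('http://localhost:5984', 'dbname', {
  q: 'body:document AND customer:[A TO C]',
  sort: '\\billing_size,/customer', // descending billing_size, then ascending customer
  limit: 10,
  skip: 20,
  stale: 'ok'
});
console.log(url);
```

Encoding matters mostly for queries that contain spaces or range brackets, which are otherwise easy to mangle when pasted into a URL by hand.
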
<h2>Special Fields</h2>

<dl>
<dt>_db</dt><dd>The source database of the document.</dd>
<dt>_id</dt><dd>The _id of the document.</dd>
</dl>

<h2>Dublin Core</h2>

All Dublin Core attributes are indexed and stored if detected in the attachment. Descriptions of the fields come from the Tika javadocs.

<dl>
<dt>dc.contributor</dt><dd>An entity responsible for making contributions to the content of the resource.</dd>
<dt>dc.coverage</dt><dd>The extent or scope of the content of the resource.</dd>
<dt>dc.creator</dt><dd>An entity primarily responsible for making the content of the resource.</dd>
<dt>dc.date</dt><dd>A date associated with an event in the life cycle of the resource.</dd>
<dt>dc.description</dt><dd>An account of the content of the resource.</dd>
<dt>dc.format</dt><dd>Typically, Format may include the media-type or dimensions of the resource.</dd>
<dt>dc.identifier</dt><dd>Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.</dd>
<dt>dc.language</dt><dd>A language of the intellectual content of the resource.</dd>
<dt>dc.modified</dt><dd>Date on which the resource was changed.</dd>
<dt>dc.publisher</dt><dd>An entity responsible for making the resource available.</dd>
<dt>dc.relation</dt><dd>A reference to a related resource.</dd>
<dt>dc.rights</dt><dd>Information about rights held in and over the resource.</dd>
<dt>dc.source</dt><dd>A reference to a resource from which the present resource is derived.</dd>
<dt>dc.subject</dt><dd>The topic of the content of the resource.</dd>
<dt>dc.title</dt><dd>A name given to the resource.</dd>
<dt>dc.type</dt><dd>The nature or genre of the content of the resource.</dd>
</dl>

<h2>Examples</h2>

<pre>
http://localhost:5984/dbname/_fti?q=field_name:value
http://localhost:5984/dbname/_fti?q=field_name:value&sort=other_field
http://localhost:5984/dbname/_fti?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
</pre>

<h2>Search Results Format</h2>

Here's an example of a JSON response without sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "search_duration": 518,
  "fetch_duration": 4,
  "rows": [
    {
      "_id": "hain-m-all_documents-257.",
      "score": 1.601625680923462
    },
    {
      "_id": "hain-m-notes_inbox-257.",
      "score": 1.601625680923462
    }
  ]
}
</pre>

And the same with sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "search_duration": 660,
  "fetch_duration": 4,
  "sort_order": [
    {
      "field": "source",
      "reverse": false,
      "type": "string"
    },
    {
      "reverse": false,
      "type": "doc"
    }
  ],
  "rows": [
    {
      "_id": "shankman-j-inbox-105.",
      "score": 0.6131107211112976,
      "sort_order": [
        "enron",
        6
      ]
    },
    {
      "_id": "shankman-j-inbox-8.",
      "score": 0.7492915391921997,
      "sort_order": [
        "enron",
        7
      ]
    },
    {
      "_id": "shankman-j-inbox-30.",
      "score": 0.507369875907898,
      "sort_order": [
        "enron",
        8
      ]
    }
  ]
}
</pre>

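A client will typically pull the document ids out of a response like the ones above and work out whether another page exists. The sketch below does that against a trimmed copy of the sorted example; `summarize` is a hypothetical helper, and the paging logic assumes that `skip` plus the number of returned rows is the offset of the next page.

```javascript
// Trimmed copy of the sorted response shown above.
var responseText = JSON.stringify({
  q: '+_db:enron +content:enron',
  skip: 0,
  limit: 3,
  total_rows: 176852,
  rows: [
    { _id: 'shankman-j-inbox-105.', score: 0.6131107211112976, sort_order: ['enron', 6] },
    { _id: 'shankman-j-inbox-8.', score: 0.7492915391921997, sort_order: ['enron', 7] },
    { _id: 'shankman-j-inbox-30.', score: 0.507369875907898, sort_order: ['enron', 8] }
  ]
});

// Extract the ids and compute the skip value for the next page.
function summarize(text) {
  var res = JSON.parse(text);
  var ids = res.rows.map(function (row) { return row._id; });
  var nextSkip = res.skip + res.rows.length;
  return { ids: ids, nextSkip: nextSkip, hasMore: nextSkip < res.total_rows };
}

var summary = summarize(responseText);
console.log(summary.ids.join(', '), summary.nextSkip, summary.hasMore);
```
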
<h1>Fetching information about the index</h1>

Calling couchdb-lucene without arguments returns a JSON object with information about the index.

<pre>
http://127.0.0.1:5984/enron/_fti
</pre>

returns:

<pre>
{"doc_count":517350,"doc_del_count":1,"disk_size":318543045}
</pre>

<h1>Working With The Source</h1>

To develop "live", type "mvn dependency:unpack-dependencies" and change the external line to something like this:

<pre>
fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
/path/to/couchdb-lucene/target/dependency com.github.rnewson.couchdb.lucene.Main
</pre>

You will need to restart CouchDB after changing couchdb-lucene source code, but the restart is very fast.

<h1>Configuration</h1>

couchdb-lucene respects several system properties:

<dl>
<dt>couchdb.url</dt><dd>the url to contact CouchDB with (default is "http://localhost:5984")</dd>
<dt>couchdb.lucene.dir</dt><dd>specify the path to the lucene indexes (the default is to make a directory called 'lucene' relative to couchdb's current working directory).</dd>
<dt>couchdb.log.dir</dt><dd>specify the directory of the log file (which is called couchdb-lucene.log), defaults to the platform-specific temp directory.</dd>
</dl>

You can override these properties like this:

<pre>
fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
-cp /home/rnewson/Source/couchdb-lucene/target/classes:\
/home/rnewson/Source/couchdb-lucene/target/dependency \
com.github.rnewson.couchdb.lucene.Main
</pre>

<h2>Basic Authentication</h2>

If you put couchdb behind an authenticating proxy you can still configure couchdb-lucene to pull from it by specifying additional system properties. Currently only Basic authentication is supported.

<dl>
<dt>couchdb.user</dt><dd>the user to authenticate as.</dd>
<dt>couchdb.password</dt><dd>the password to authenticate with.</dd>
</dl>

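System properties are passed to the JVM as -D flags ahead of -jar. As an illustration only, the hypothetical helper below assembles an `fti=` line from a map of properties; the credentials are placeholders, and values containing spaces would need quoting that this sketch does not attempt.

```javascript
// Assemble an [external] fti line from a map of system properties.
function buildFtiLine(javaPath, jarPath, props) {
  var flags = [];
  for (var name in props) {
    if (Object.prototype.hasOwnProperty.call(props, name)) {
      flags.push('-D' + name + '=' + props[name]);
    }
  }
  // JVM options must come before -jar; -search is passed to couchdb-lucene.
  return 'fti=' + [javaPath].concat(flags, ['-jar', jarPath, '-search']).join(' ');
}

var line = buildFtiLine(
  '/usr/bin/java',
  '/path/to/couchdb-lucene*-jar-with-dependencies.jar',
  {
    'couchdb.url': 'http://localhost:5984',
    'couchdb.user': 'indexer',    // placeholder user
    'couchdb.password': 'secret'  // placeholder password
  }
);
console.log(line);
```

The same flags would presumably be needed on the indexer line as well, since both processes contact CouchDB.
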
<h2>IPv6</h2>

The default for couchdb.url is problematic on an IPv6 system. Specify -Dcouchdb.url=http://[::1]:5984 to resolve it.