<h1>News</h1>

The indexing API has completely changed. Please re-read this document and report any surprises or bugs to the bug tracker.

Issue tracking is now available at <a href="http://rnewson.lighthouseapp.com/projects/27420-couchdb-lucene">lighthouseapp</a>.

<h1>Build couchdb-lucene</h1>

<ol>
<li>Install Maven 2.
<li>Check out the repository.
<li>Type 'mvn'.
<li>Configure CouchDB (see below).
</ol>

<h1>Configure CouchDB</h1>

<pre>
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
</pre>

<h1>Indexing Strategy</h1>

<h2>Document Indexing</h2>

You must supply a transform function in order to enable couchdb-lucene.

Add a design document called _design/lucene in your database with an attribute called "transform". The value of this attribute is a JavaScript function.

The transform function can return null to prevent indexing, or it can return either a single Document or an array of Documents.

The transform function is called for each document in the database. To pass information to Lucene, you must populate Document instances with data from the original CouchDB document.
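
As a rough sketch of what such a design document could look like: design documents are JSON, so the function has to be stored as its source text. The <code>type</code> and <code>subject</code> properties below are hypothetical, and how you upload the result (Futon, curl, a client library) is left open.

```javascript
// Hypothetical transform: index the subject of documents whose type is "email".
// Document and field() are supplied by couchdb-lucene when it runs the function;
// they are not defined here because the function is only converted to a string.
var transform = function(doc) {
  if (doc.type != "email") return null; // returning null prevents indexing
  var result = new Document();
  result.field("subject", doc.subject);
  return result;
};

// Design documents are JSON, so the function is stored as its source text.
var designDoc = {
  _id: "_design/lucene",
  transform: transform.toString()
};

console.log(JSON.stringify(designDoc, null, 2));
```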

<h3>The Document class</h3>

You may construct a new Document instance with:

<pre>
var doc = new Document();
</pre>

Several functions are available that populate a Document.

<pre>
doc.field("name", "value"); // Indexed, analyzed but not stored.
doc.field("name", "value", "yes"); // Indexed, analyzed and stored.
doc.field("name", "value", "yes", "not_analyzed"); // Indexed, stored but not analyzed.
doc.attachment("name", "attachment name"); // Extract text from the named attachment and index it (but do not store it).
doc.date("name", "value"); // Interpret "value" as a date using the default date formats.
doc.date("name", "value", "format"); // Interpret "value" as a date using the supplied format string (see Java's SimpleDateFormat class for the syntax).
</pre>
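
To see how these calls combine, here is a minimal sketch that can be run outside CouchDB. The <code>Document</code> defined below is only a stand-in that records calls, not the real class, and the <code>title</code>, <code>tag</code> and <code>published</code> properties are hypothetical.

```javascript
// Minimal stand-in for couchdb-lucene's Document, for local experiments only.
// It merely records the calls made by a transform; it is NOT the real class.
function Document() {
  this.calls = [];
}
Document.prototype.field = function(name, value, store, index) {
  this.calls.push({ kind: "field", name: name, value: value, store: store, index: index });
};
Document.prototype.date = function(name, value, format) {
  this.calls.push({ kind: "date", name: name, value: value, format: format });
};

// Hypothetical transform for documents shaped like
// { title: "...", tag: "...", published: "2009-04-04" }.
var transform = function(doc) {
  if (!doc.title) return null;                            // nothing useful to index
  var result = new Document();
  result.field("title", doc.title, "yes");                // indexed, analyzed and stored
  result.field("tag", doc.tag, "yes", "not_analyzed");    // indexed, stored but not analyzed
  result.date("published", doc.published, "yyyy-MM-dd");  // SimpleDateFormat pattern
  return result;
};

var indexed = transform({ title: "Hello", tag: "news", published: "2009-04-04" });
console.log(indexed.calls);
```

Copy only the body of <code>transform</code> into the design document; the stand-in class exists purely so the sketch runs on its own.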

<h3>Example Transforms</h3>

<h4>Index Everything</h4>

<pre>
function(doc) {
  return new Document(doc);
}
</pre>

<h4>Index Nothing</h4>

<pre>
function(doc) {
  return null;
}
</pre>

<h4>Index Select Fields</h4>

<pre>
function(doc) {
  var result = new Document();
  result.field("subject", doc.subject);
  result.field("content", doc.content);
  return result;
}
</pre>

<h4>Index Attachments</h4>

<pre>
function(doc) {
  var result = new Document();
  for(var a in doc._attachments) {
    result.attachment("attachment", a);
  }
  return result;
}
</pre>

<h4>Multiple Documents</h4>

<pre>
function(doc) {
  // Return one Document per field, collected in an array.
  var subject = new Document();
  subject.field("subject", doc.subject);
  var content = new Document();
  content.field("content", doc.content);
  return [subject, content];
}
</pre>

<h4>A More Complex Example</h4>

<pre>
function(doc) {
  // Helper: create a Document for one name/value pair and record its group.
  var mk = function(name, value, group) {
    var ret = new Document(name, value, "yes");
    ret.field("group", group, "yes");
    return ret;
  };
  var ret = [];
  // Only index documents of type "reference".
  if(doc.type != "reference") return null;
  // Emit three Documents for every group.
  for(var g in doc.groups) {
    ret.push(mk("library", doc.groups[g].library, g));
    ret.push(mk("method", doc.groups[g].method, g));
    ret.push(mk("target", doc.groups[g].target, g));
  }
  return ret;
}
</pre>

<h2>Attachment Indexing</h2>

couchdb-lucene uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, assuming the correct content_type is set in CouchDB:

<h3>Supported Formats</h3>

<ul>
<li>Excel spreadsheets (application/vnd.ms-excel)
<li>Word documents (application/msword)
<li>PowerPoint presentations (application/vnd.ms-powerpoint)
<li>Visio (application/vnd.visio)
<li>Outlook (application/vnd.ms-outlook)
<li>XML (application/xml)
<li>HTML (text/html)
<li>Images (image/*)
<li>Java class files
<li>Java jar archives
<li>MP3 (audio/mp3)
<li>OpenDocument (application/vnd.oasis.opendocument.*)
<li>Plain text (text/plain)
<li>PDF (application/pdf)
<li>RTF (application/rtf)
</ul>

<h1>Searching with couchdb-lucene</h1>

You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The _body field is searched by default; it includes the extracted text from all attachments. The following parameters can be passed for more sophisticated searches:

<dl>
<dt>q</dt><dd>the query to run (e.g., subject:hello)</dd>
<dt>sort</dt><dd>the comma-separated fields to sort on. Prefix with / for ascending order and \ for descending order (ascending is the default if not specified).</dd>
<dt>limit</dt><dd>the maximum number of results to return</dd>
<dt>skip</dt><dd>the number of results to skip</dd>
<dt>include_docs</dt><dd>whether to include the source docs</dd>
<dt>stale=ok</dt><dd>If you set the <i>stale</i> option to <i>ok</i>, couchdb-lucene may skip refreshing the index. Searches may be faster as Lucene caches important data (especially for sorting). A query without stale=ok will use the latest data committed to the index.</dd>
<dt>debug</dt><dd>if false, a normal application/json response with results is returned. If true, a pretty-printed HTML blob is returned instead.</dd>
<dt>rewrite</dt><dd>(EXPERT) if true, returns a JSON response with a rewritten query and term frequencies. This allows correct distributed scoring when combining the results from multiple nodes.</dd>
</dl>

<i>All parameters except 'q' are optional.</i>
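
The sort prefixes can be awkward to type by hand. A small helper along these lines could build the value of the sort parameter; the function name and the field names are hypothetical.

```javascript
// Build the value of the "sort" parameter from a list of fields.
// "/" requests ascending order and "\" descending order, as described above.
function sortParam(fields) {
  return fields.map(function(f) {
    return (f.descending ? "\\" : "/") + f.name;
  }).join(",");
}

// Hypothetical field names.
var sort = sortParam([
  { name: "date", descending: true },
  { name: "subject" }
]);
console.log(sort); // \date,/subject
```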

<h2>Special Fields</h2>

<dl>
<dt>_db</dt><dd>The source database of the document.</dd>
<dt>_id</dt><dd>The _id of the document.</dd>
</dl>

<h2>Dublin Core</h2>

All Dublin Core attributes are indexed and stored if detected in the attachment. Descriptions of the fields come from the Tika javadocs.

<dl>
<dt>dc.contributor</dt><dd>An entity responsible for making contributions to the content of the resource.</dd>
<dt>dc.coverage</dt><dd>The extent or scope of the content of the resource.</dd>
<dt>dc.creator</dt><dd>An entity primarily responsible for making the content of the resource.</dd>
<dt>dc.date</dt><dd>A date associated with an event in the life cycle of the resource.</dd>
<dt>dc.description</dt><dd>An account of the content of the resource.</dd>
<dt>dc.format</dt><dd>Typically, Format may include the media-type or dimensions of the resource.</dd>
<dt>dc.identifier</dt><dd>Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.</dd>
<dt>dc.language</dt><dd>A language of the intellectual content of the resource.</dd>
<dt>dc.modified</dt><dd>Date on which the resource was changed.</dd>
<dt>dc.publisher</dt><dd>An entity responsible for making the resource available.</dd>
<dt>dc.relation</dt><dd>A reference to a related resource.</dd>
<dt>dc.rights</dt><dd>Information about rights held in and over the resource.</dd>
<dt>dc.source</dt><dd>A reference to a resource from which the present resource is derived.</dd>
<dt>dc.subject</dt><dd>The topic of the content of the resource.</dd>
<dt>dc.title</dt><dd>A name given to the resource.</dd>
<dt>dc.type</dt><dd>The nature or genre of the content of the resource.</dd>
</dl>

<h2>Examples</h2>

<pre>
http://localhost:5984/dbname/_fti?q=field_name:value
http://localhost:5984/dbname/_fti?q=field_name:value&sort=other_field
http://localhost:5984/dbname/_fti?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
</pre>
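
The last example contains spaces and brackets, which many HTTP clients will not send as written. A minimal sketch of assembling the same request with percent-encoded values follows; the helper name is hypothetical, and whether your client already encodes for you is worth checking first.

```javascript
// Assemble a search URL, percent-encoding every parameter name and value.
function searchUrl(base, db, params) {
  var pairs = Object.keys(params).map(function(key) {
    return encodeURIComponent(key) + "=" + encodeURIComponent(params[key]);
  });
  return base + "/" + encodeURIComponent(db) + "/_fti?" + pairs.join("&");
}

var url = searchUrl("http://localhost:5984", "dbname", {
  q: "body:document AND customer:[A TO C]",
  sort: "billing_size",
  debug: true
});
console.log(url);
```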

<h2>Search Results Format</h2>

Here's an example of a JSON response without sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "search_duration": 518,
  "fetch_duration": 4,
  "rows": [
    {
      "_id": "hain-m-all_documents-257.",
      "score": 1.601625680923462
    },
    {
      "_id": "hain-m-notes_inbox-257.",
      "score": 1.601625680923462
    }
  ]
}
</pre>

And the same with sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "search_duration": 660,
  "fetch_duration": 4,
  "sort_order": [
    {
      "field": "source",
      "reverse": false,
      "type": "string"
    },
    {
      "reverse": false,
      "type": "doc"
    }
  ],
  "rows": [
    {
      "_id": "shankman-j-inbox-105.",
      "score": 0.6131107211112976,
      "sort_order": [
        "enron",
        6
      ]
    },
    {
      "_id": "shankman-j-inbox-8.",
      "score": 0.7492915391921997,
      "sort_order": [
        "enron",
        7
      ]
    },
    {
      "_id": "shankman-j-inbox-30.",
      "score": 0.507369875907898,
      "sort_order": [
        "enron",
        8
      ]
    }
  ]
}
</pre>
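
A client might consume such a response roughly as follows. This is a sketch only: the body is a trimmed copy of the sorted response above, and it assumes that total_rows counts every match for the query rather than just the rows returned.

```javascript
// A trimmed copy of the sorted response shown above.
var body = '{"q":"+_db:enron +content:enron","skip":0,"limit":3,"total_rows":176852,' +
  '"rows":[{"_id":"shankman-j-inbox-105.","score":0.6131107211112976,"sort_order":["enron",6]},' +
  '{"_id":"shankman-j-inbox-8.","score":0.7492915391921997,"sort_order":["enron",7]}]}';

var response = JSON.parse(body);

// Collect the ids of the matching documents, in the order returned.
var ids = response.rows.map(function(row) { return row._id; });

// Work out the skip value for the next page and whether more results remain.
var nextSkip = response.skip + response.rows.length;
var hasMore = nextSkip < response.total_rows;

console.log(ids, nextSkip, hasMore);
```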

<h1>Fetching information about the index</h1>

Calling couchdb-lucene without arguments returns a JSON object with information about the index.

<pre>
http://127.0.0.1:5984/enron/_fti
</pre>

returns:

<pre>
{"doc_count":517350,"doc_del_count":1,"disk_size":318543045}
</pre>

<h1>Working With The Source</h1>

To develop "live", type "mvn dependency:unpack-dependencies" and change the external line to something like this:

<pre>
fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
/path/to/couchdb-lucene/target/dependency com.github.rnewson.couchdb.lucene.Main
</pre>

You will need to restart CouchDB if you change the couchdb-lucene source code, but this is very fast.

<h1>Configuration</h1>

couchdb-lucene respects several system properties:

<dl>
<dt>couchdb.url</dt><dd>the URL to contact CouchDB with (default is "http://localhost:5984")</dd>
<dt>couchdb.lucene.dir</dt><dd>specify the path to the Lucene indexes (the default is to make a directory called 'lucene' relative to CouchDB's current working directory).</dd>
</dl>

You can override these properties like this:

<pre>
fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
-cp /home/rnewson/Source/couchdb-lucene/target/classes:\
/home/rnewson/Source/couchdb-lucene/target/dependency \
com.github.rnewson.couchdb.lucene.Main
</pre>

<h2>Basic Authentication</h2>

If you put CouchDB behind an authenticating proxy, you can still configure couchdb-lucene to pull from it by specifying additional system properties. Currently only Basic authentication is supported.

<dl>
<dt>couchdb.user</dt><dd>the user to authenticate as.</dd>
<dt>couchdb.password</dt><dd>the password to authenticate with.</dd>
</dl>
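
Since these are ordinary system properties, they can presumably be passed with -D in the same way as couchdb.lucene.dir above. A sketch, with USERNAME and PASSWORD as placeholders; the indexer line under [update_notification] would likely need the same two properties.

```ini
[external]
fti=/usr/bin/java -Dcouchdb.user=USERNAME -Dcouchdb.password=PASSWORD -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search
```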