<h1>News</h1>

The indexing API has completely changed. Please re-read this document and report any surprises or bugs to the issue tracker.

Issue tracking is now available at <a href="http://rnewson.lighthouseapp.com/projects/27420-couchdb-lucene">lighthouseapp</a>.

<h1>Build couchdb-lucene</h1>

<ol>
<li>Install Maven 2.
<li>Check out the repository.
<li>Type 'mvn'.
<li>Configure CouchDB (see below).
</ol>

<h1>Configure CouchDB</h1>

<pre>
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
</pre>

<h1>Indexing Strategy</h1>

<h2>Document Indexing</h2>

You must supply a transform function to enable couchdb-lucene.

Add a design document called _design/lucene to your database with an attribute called "transform". The value of this attribute is a JavaScript function.

The transform function is called for each document in the database. It can return null to prevent indexing, a single Document, or an array of Documents. To pass information to Lucene, you must populate Document instances with data from the original CouchDB document.
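
As a rough sketch of what such a design document might look like, the snippet below builds one as a plain object and serializes it. It assumes the transform is stored as a string of JavaScript source, the way CouchDB stores view functions; the transform body is only a placeholder that indexes nothing.

```javascript
// Assumption: the transform is stored as a string of JavaScript source,
// the same way CouchDB stores view functions in design documents.
const transformSource = "function(doc) { return null; }";

const designDoc = {
  _id: "_design/lucene",
  transform: transformSource
};

// JSON body that could be saved as the _design/lucene document.
const designDocJson = JSON.stringify(designDoc, null, 2);
console.log(designDocJson);
```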

<h3>The Document class</h3>

You may construct a new Document instance with:

<pre>
var doc = new Document();
</pre>

Several functions are available to populate a Document:

<pre>
// Indexed, analyzed but not stored.
doc.field("name", "value");

// Indexed, analyzed and stored.
doc.field("name", "value", "yes");

// Indexed, stored but not analyzed.
doc.field("name", "value", "yes", "not_analyzed");

// Extract text from the named attachment and index it (but not store it).
doc.attachment("name", "attachment name");

// Interpret "value" as a date using the default date formats.
doc.date("name", "value");

// Interpret "value" as a date using the supplied format string
// (see Java's SimpleDateFormat class for the syntax).
doc.date("name", "value", "format");
</pre>

<h3>Example Transforms</h3>

<h4>Index Nothing</h4>

<pre>
function(doc) {
  return null;
}
</pre>

<h4>Index Select Fields</h4>

<pre>
function(doc) {
  var result = new Document();
  result.field("subject", doc.subject, "yes");
  result.field("content", doc.content);
  result.date("indexed_at", new Date());
  return result;
}
</pre>

<h4>Index Attachments</h4>

<pre>
function(doc) {
  var result = new Document();
  for (var a in doc._attachments) {
    result.attachment("attachment", a);
  }
  return result;
}
</pre>

<h4>A More Complex Example</h4>

<pre>
function(doc) {
  // Only index documents of type "reference".
  if (doc.type != "reference") return null;

  // Build one Document holding a stored field plus the name of its group.
  var mk = function(name, value, group) {
    var ret = new Document();
    ret.field(name, value, "yes");
    ret.field("group", group, "yes");
    return ret;
  };

  // Emit three Documents per group.
  var ret = [];
  for (var g in doc.groups) {
    ret.push(mk("library", doc.groups[g].library, g));
    ret.push(mk("method", doc.groups[g].method, g));
    ret.push(mk("target", doc.groups[g].target, g));
  }
  return ret;
}
</pre>
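
To try a transform before installing it, one option is to run it against a stand-in Document. The sketch below is not part of couchdb-lucene: the Document class here is a hypothetical stub that only records calls, and the "message" type, the "created" attribute and its "yyyy-MM-dd" format are made up for illustration.

```javascript
// Hypothetical stand-in for couchdb-lucene's Document class. It only records
// the calls a transform makes; it does no indexing.
class Document {
  constructor() {
    this.calls = [];
  }
  field(name, value, store, index) {
    this.calls.push({ kind: "field", name, value, store, index });
  }
  attachment(name, attachmentName) {
    this.calls.push({ kind: "attachment", name, attachmentName });
  }
  date(name, value, format) {
    this.calls.push({ kind: "date", name, value, format });
  }
}

// Transform under test: index two fields, plus a date if one is present.
const transform = function (doc) {
  if (doc.type !== "message") return null;
  const result = new Document();
  result.field("subject", doc.subject, "yes");
  result.field("content", doc.content);
  if (doc.created) {
    result.date("created", doc.created, "yyyy-MM-dd");
  }
  return result;
};

const indexed = transform({
  type: "message",
  subject: "hello",
  content: "hello world",
  created: "2009-04-04"
});
const skipped = transform({ type: "other" });

console.log(indexed.calls); // the three recorded calls
console.log(skipped);       // null, so nothing would be indexed
```

A stub like this only checks which calls the transform makes. How couchdb-lucene interprets them still has to be verified against a real index.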

<h2>Attachment Indexing</h2>

couchdb-lucene uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, assuming the correct content_type is set in CouchDB:

<h3>Supported Formats</h3>

<ul>
<li>Excel spreadsheets (application/vnd.ms-excel)
<li>Word documents (application/msword)
<li>Powerpoint presentations (application/vnd.ms-powerpoint)
<li>Visio (application/vnd.visio)
<li>Outlook (application/vnd.ms-outlook)
<li>XML (application/xml)
<li>HTML (text/html)
<li>Images (image/*)
<li>Java class files
<li>Java jar archives
<li>MP3 (audio/mp3)
<li>OpenDocument (application/vnd.oasis.opendocument.*)
<li>Plain text (text/plain)
<li>PDF (application/pdf)
<li>RTF (application/rtf)
</ul>

<h1>Searching with couchdb-lucene</h1>

You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The _body field is searched by default; it includes the extracted text from all attachments. The following parameters can be passed for more sophisticated searches:

<dl>
<dt>q</dt><dd>the query to run (e.g., subject:hello)</dd>
<dt>sort</dt><dd>the comma-separated fields to sort on. Prefix a field with / for ascending order or \ for descending order (ascending is the default if neither is specified).</dd>
<dt>limit</dt><dd>the maximum number of results to return</dd>
<dt>skip</dt><dd>the number of results to skip</dd>
<dt>include_docs</dt><dd>whether to include the source docs</dd>
<dt>stale=ok</dt><dd>If you set the <i>stale</i> option to <i>ok</i>, couchdb-lucene may skip refreshing the index. Searches may be faster because Lucene caches important data (especially for sorting). A query without stale=ok will use the latest data committed to the index.</dd>
<dt>debug</dt><dd>if false, a normal application/json response with results is returned. If true, a pretty-printed HTML blob is returned instead.</dd>
<dt>rewrite</dt><dd>(EXPERT) if true, returns a JSON response with a rewritten query and term frequencies. This allows correct distributed scoring when combining the results from multiple nodes.</dd>
</dl>

<i>All parameters except 'q' are optional.</i>

<h2>Special Fields</h2>

<dl>
<dt>_db</dt><dd>The source database of the document.</dd>
<dt>_id</dt><dd>The _id of the document.</dd>
</dl>

<h2>Dublin Core</h2>

All Dublin Core attributes are indexed and stored if detected in the attachment. Descriptions of the fields come from the Tika javadocs.

<dl>
<dt>dc.contributor</dt><dd>An entity responsible for making contributions to the content of the resource.</dd>
<dt>dc.coverage</dt><dd>The extent or scope of the content of the resource.</dd>
<dt>dc.creator</dt><dd>An entity primarily responsible for making the content of the resource.</dd>
<dt>dc.date</dt><dd>A date associated with an event in the life cycle of the resource.</dd>
<dt>dc.description</dt><dd>An account of the content of the resource.</dd>
<dt>dc.format</dt><dd>Typically, Format may include the media-type or dimensions of the resource.</dd>
<dt>dc.identifier</dt><dd>Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.</dd>
<dt>dc.language</dt><dd>A language of the intellectual content of the resource.</dd>
<dt>dc.modified</dt><dd>Date on which the resource was changed.</dd>
<dt>dc.publisher</dt><dd>An entity responsible for making the resource available.</dd>
<dt>dc.relation</dt><dd>A reference to a related resource.</dd>
<dt>dc.rights</dt><dd>Information about rights held in and over the resource.</dd>
<dt>dc.source</dt><dd>A reference to a resource from which the present resource is derived.</dd>
<dt>dc.subject</dt><dd>The topic of the content of the resource.</dd>
<dt>dc.title</dt><dd>A name given to the resource.</dd>
<dt>dc.type</dt><dd>The nature or genre of the content of the resource.</dd>
</dl>

<h2>Examples</h2>

<pre>
http://localhost:5984/dbname/_fti?q=field_name:value
http://localhost:5984/dbname/_fti?q=field_name:value&sort=other_field
http://localhost:5984/dbname/_fti?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
</pre>
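
The last URL above contains spaces and brackets, which most HTTP clients expect to be percent-encoded. A minimal sketch of building such a URL follows; buildSearchUrl is a hypothetical helper, not part of couchdb-lucene, and it assumes the server decodes percent-encoded parameter values in the usual way.

```javascript
// Hypothetical helper: builds a couchdb-lucene search URL and
// percent-encodes every parameter name and value.
function buildSearchUrl(baseUrl, dbName, params) {
  const query = Object.keys(params)
    .map((key) => encodeURIComponent(key) + "=" + encodeURIComponent(String(params[key])))
    .join("&");
  return baseUrl + "/" + encodeURIComponent(dbName) + "/_fti?" + query;
}

const url = buildSearchUrl("http://localhost:5984", "dbname", {
  q: "body:document AND customer:[A TO C]",
  sort: "\\billing_size", // a leading backslash requests descending order
  debug: true
});
console.log(url);
```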

<h2>Search Results Format</h2>

Here's an example of a JSON response without sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "search_duration": 518,
  "fetch_duration": 4,
  "rows": [
    {
      "_id": "hain-m-all_documents-257.",
      "score": 1.601625680923462
    },
    {
      "_id": "hain-m-notes_inbox-257.",
      "score": 1.601625680923462
    }
  ]
}
</pre>

And the same with sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "search_duration": 660,
  "fetch_duration": 4,
  "sort_order": [
    {
      "field": "source",
      "reverse": false,
      "type": "string"
    },
    {
      "reverse": false,
      "type": "doc"
    }
  ],
  "rows": [
    {
      "_id": "shankman-j-inbox-105.",
      "score": 0.6131107211112976,
      "sort_order": [
        "enron",
        6
      ]
    },
    {
      "_id": "shankman-j-inbox-8.",
      "score": 0.7492915391921997,
      "sort_order": [
        "enron",
        7
      ]
    },
    {
      "_id": "shankman-j-inbox-30.",
      "score": 0.507369875907898,
      "sort_order": [
        "enron",
        8
      ]
    }
  ]
}
</pre>
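
A client might walk through a large result set by feeding skip and limit back into the next request. The sketch below shows one way to do that from a parsed response; nextPage is a hypothetical helper, and it assumes total_rows is the total number of matches for the query rather than the number of rows returned.

```javascript
// Returns the skip/limit for the next page, or null when no rows remain.
// Assumption: total_rows counts every match for the query.
function nextPage(response) {
  const nextSkip = response.skip + response.rows.length;
  if (response.rows.length === 0 || nextSkip >= response.total_rows) {
    return null;
  }
  return { skip: nextSkip, limit: response.limit };
}

// Trimmed copy of the unsorted response shown earlier in this section.
const response = {
  q: "+_db:enron +content:enron",
  skip: 0,
  limit: 2,
  total_rows: 176852,
  rows: [
    { _id: "hain-m-all_documents-257.", score: 1.601625680923462 },
    { _id: "hain-m-notes_inbox-257.", score: 1.601625680923462 }
  ]
};

const ids = response.rows.map((row) => row._id);
console.log(ids);
console.log(nextPage(response)); // { skip: 2, limit: 2 }
```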

<h1>Fetching information about the index</h1>

Calling couchdb-lucene without arguments returns a JSON object with information about the index. For example,

<pre>
http://127.0.0.1:5984/enron/_fti
</pre>

returns:

<pre>
{"doc_count":517350,"doc_del_count":1,"disk_size":318543045}
</pre>

<h1>Working With The Source</h1>

To develop "live", type "mvn dependency:unpack-dependencies" and change the fti line in the [external] section to something like this:

<pre>
fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
/path/to/couchdb-lucene/target/dependency com.github.rnewson.couchdb.lucene.Main
</pre>

After changing the couchdb-lucene source code you will need to restart CouchDB, but this is very fast.

<h1>Configuration</h1>

couchdb-lucene respects several system properties:

<dl>
<dt>couchdb.url</dt><dd>the URL to contact CouchDB with (default is "http://localhost:5984")</dd>
<dt>couchdb.lucene.dir</dt><dd>the path to the Lucene indexes (the default is to make a directory called 'lucene' relative to CouchDB's current working directory).</dd>
</dl>

You can override these properties like this:

<pre>
fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
-cp /home/rnewson/Source/couchdb-lucene/target/classes:\
/home/rnewson/Source/couchdb-lucene/target/dependency \
com.github.rnewson.couchdb.lucene.Main
</pre>

<h2>Basic Authentication</h2>

If you put CouchDB behind an authenticating proxy, you can still configure couchdb-lucene to pull from it by specifying additional system properties. Currently only Basic authentication is supported.

<dl>
<dt>couchdb.user</dt><dd>the user to authenticate as.</dd>
<dt>couchdb.password</dt><dd>the password to authenticate with.</dd>
</dl>