<h1>News</h1>

The indexing API has completely changed. Please re-read this document and report any surprises or bugs to the issue tracker.

Issue tracking is now available at <a href="http://rnewson.lighthouseapp.com/projects/27420-couchdb-lucene">lighthouseapp</a>.

<h1>Build couchdb-lucene</h1>

<ol>
<li>Install Maven 2.
<li>Check out the repository.
<li>Type 'mvn'.
<li>Configure CouchDB (see below).
</ol>

<h1>Configure CouchDB</h1>

<pre>
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
</pre>

<h1>Indexing Strategy</h1>

<h2>Document Indexing</h2>

You must supply a transform function to enable couchdb-lucene.

Add a design document called _design/lucene to your database with an attribute called "transform". The value of this attribute is a JavaScript function.

The transform function returns either a single Document or an array of Documents. It can also return null to prevent the document from being indexed.

The transform function is called for each document in the database. To pass information to Lucene, you must populate Document instances with data from the original CouchDB document.

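As a rough sketch of what such a design document could look like, the snippet below builds one in plain JavaScript. JSON has no function type, so the transform is stored as its source text, which is how CouchDB design documents carry functions in general. The variable names and the choice of an index-everything transform are illustrative assumptions; how you upload the result is up to you.

```javascript
// Sketch: build the _design/lucene document described above.
// The transform body is only an example; Document is supplied by
// couchdb-lucene at index time and is not defined in this snippet.
var transform = function (doc) {
  return new Document(doc);
};

var designDoc = {
  _id: "_design/lucene",
  // JSON cannot hold a function, so store the function's source text.
  transform: transform.toString()
};

// Print the JSON you would save into the database.
console.log(JSON.stringify(designDoc, null, 2));
```
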
<h3>The Document class</h3>

You may construct a new Document instance with:

<pre>
var doc = new Document();
</pre>

Several functions are available to populate a Document:

<pre>
doc.field("name", "value"); // Indexed, analyzed but not stored.
doc.field("name", "value", "yes"); // Indexed, analyzed and stored.
doc.field("name", "value", "yes", "not_analyzed"); // Indexed, stored but not analyzed.
doc.attachment("name", "attachment name"); // Extract text from the named attachment and index it (but not store it).
doc.date("name", "value"); // Interpret "value" as a date using the default date formats.
doc.date("name", "value", "format"); // Interpret "value" as a date using the supplied format string (see Java's SimpleDateFormat class for the syntax).
</pre>

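To try a transform outside couchdb-lucene, one option is a small stand-in that records calls instead of feeding Lucene. The sketch below is a test double, not couchdb-lucene's implementation: FakeDocument, its calls array and runTransform are hypothetical names, and the stand-in only mirrors the field, attachment and date signatures listed above.

```javascript
// Hypothetical test double: records what a transform asks for.
// It mimics only the method signatures listed above, not real indexing.
function FakeDocument() {
  this.calls = [];
}
FakeDocument.prototype.field = function (name, value, store, index) {
  this.calls.push({
    kind: "field",
    name: name,
    value: value,
    stored: store === "yes",            // stored only when "yes" is passed
    analyzed: index !== "not_analyzed"  // analyzed unless "not_analyzed" is passed
  });
};
FakeDocument.prototype.attachment = function (name, attachmentName) {
  this.calls.push({ kind: "attachment", name: name, attachment: attachmentName });
};
FakeDocument.prototype.date = function (name, value, format) {
  this.calls.push({ kind: "date", name: name, value: value, format: format || null });
};

// Evaluate a transform's source text with Document bound to the stand-in,
// then call it with a CouchDB-style document.
function runTransform(transformSource, couchDoc) {
  var make = new Function("Document", "return (" + transformSource + ");");
  return make(FakeDocument)(couchDoc);
}

var source = 'function(doc) {' +
  ' var result = new Document();' +
  ' result.field("subject", doc.subject, "yes");' +
  ' result.field("content", doc.content);' +
  ' return result; }';

var out = runTransform(source, { subject: "hello", content: "world" });
console.log(out.calls);
```
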
<h3>Example Transforms</h3>

<h4>Index Everything</h4>

<pre>
function(doc) {
  return new Document(doc);
}
</pre>

<h4>Index Nothing</h4>

<pre>
function(doc) {
  return null;
}
</pre>

<h4>Index Select Fields</h4>

<pre>
function(doc) {
  var result = new Document();
  result.field("subject", doc.subject, "yes");
  result.field("content", doc.content);
  result.date("indexed_at", new Date());
  return result;
}
</pre>

<h4>Index Attachments</h4>

<pre>
function(doc) {
  var result = new Document();
  for(var a in doc._attachments) {
    result.attachment("attachment", a);
  }
  return result;
}
</pre>

<h4>A More Complex Example</h4>

<pre>
function(doc) {
  // Build a Document for one name/value pair and tag it with its group.
  var mk = function(name, value, group) {
    var ret = new Document(name, value, "yes");
    ret.field("group", group, "yes");
    return ret;
  };
  var ret = [];
  // Only index documents of type "reference".
  if(doc.type != "reference") return null;
  // Emit three Documents per group.
  for(var g in doc.groups) {
    ret.push(mk("library", doc.groups[g].library, g));
    ret.push(mk("method", doc.groups[g].method, g));
    ret.push(mk("target", doc.groups[g].target, g));
  }
  return ret;
}
</pre>

<h2>Attachment Indexing</h2>

couchdb-lucene uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, provided the correct content_type is set in CouchDB:

<h3>Supported Formats</h3>

<ul>
<li>Excel spreadsheets (application/vnd.ms-excel)
<li>Word documents (application/msword)
<li>PowerPoint presentations (application/vnd.ms-powerpoint)
<li>Visio (application/vnd.visio)
<li>Outlook (application/vnd.ms-outlook)
<li>XML (application/xml)
<li>HTML (text/html)
<li>Images (image/*)
<li>Java class files
<li>Java jar archives
<li>MP3 (audio/mp3)
<li>OpenDocument (application/vnd.oasis.opendocument.*)
<li>Plain text (text/plain)
<li>PDF (application/pdf)
<li>RTF (application/rtf)
</ul>

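Because indexing depends on the stored content_type, a transform may want to skip attachments that fall outside the list above. The following sketch filters on those MIME types before calling attachment(). The helper names are hypothetical, Java class files and jar archives are left out because the list gives no MIME type for them, and the code assumes each entry in doc._attachments carries a content_type property, as CouchDB attachment stubs do.

```javascript
// MIME types taken from the list above; entries ending in "*" match by prefix.
var SUPPORTED = [
  "application/vnd.ms-excel", "application/msword", "application/vnd.ms-powerpoint",
  "application/vnd.visio", "application/vnd.ms-outlook", "application/xml",
  "text/html", "image/*", "audio/mp3", "application/vnd.oasis.opendocument.*",
  "text/plain", "application/pdf", "application/rtf"
];

// Hypothetical helper: true when contentType matches a supported entry.
function isSupported(contentType) {
  // Drop parameters such as "; charset=utf-8" before comparing.
  var type = contentType.split(";")[0].trim();
  for (var i = 0; i < SUPPORTED.length; i++) {
    var pattern = SUPPORTED[i];
    if (pattern.charAt(pattern.length - 1) === "*") {
      if (type.indexOf(pattern.slice(0, -1)) === 0) return true;
    } else if (type === pattern) {
      return true;
    }
  }
  return false;
}

// Hypothetical helper: names of the attachments worth passing to attachment().
function indexableAttachments(doc) {
  var names = [];
  for (var name in doc._attachments) {
    var type = doc._attachments[name].content_type || "";
    if (isSupported(type)) names.push(name);
  }
  return names;
}

// Inside couchdb-lucene a transform could then look like this.
// Document is provided by couchdb-lucene and is not defined here.
var transform = function (doc) {
  var result = new Document();
  var names = indexableAttachments(doc);
  for (var i = 0; i < names.length; i++) {
    result.attachment("attachment", names[i]);
  }
  return result;
};
```

Only the transform's own source is stored in the design document, so in practice the helpers would presumably have to be declared inside the transform body.
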
<h1>Searching with couchdb-lucene</h1>

You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The _body field is searched by default; it includes the extracted text from all attachments. The following parameters can be passed for more sophisticated searches:

<dl>
<dt>q</dt><dd>the query to run (e.g., subject:hello)</dd>
<dt>sort</dt><dd>the comma-separated fields to sort on. Prefix a field with / for ascending order or \ for descending order (ascending is the default if not specified).</dd>
<dt>limit</dt><dd>the maximum number of results to return</dd>
<dt>skip</dt><dd>the number of results to skip</dd>
<dt>include_docs</dt><dd>whether to include the source docs</dd>
<dt>stale=ok</dt><dd>If you set the <i>stale</i> option to <i>ok</i>, couchdb-lucene may skip refreshing the index. Searches may be faster because Lucene caches important data (especially for sorting). A query without stale=ok will use the latest data committed to the index.</dd>
<dt>debug</dt><dd>if false, a normal application/json response with results is returned. If true, a pretty-printed HTML blob is returned instead.</dd>
<dt>rewrite</dt><dd>(EXPERT) if true, returns a JSON response with a rewritten query and term frequencies. This allows correct distributed scoring when combining the results from multiple nodes.</dd>
</dl>

<i>All parameters except 'q' are optional.</i>

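Queries that contain spaces, brackets or colons need URL-encoding before they are sent. A minimal sketch of assembling a search URL from the parameters above follows; buildSearchUrl and its argument names are hypothetical, and the host and database name are placeholders.

```javascript
// Hypothetical helper: build an _fti search URL for a database.
// base is the CouchDB URL, db the database name, params the options listed above.
function buildSearchUrl(base, db, params) {
  if (!params || !params.q) {
    throw new Error("the 'q' parameter is required");
  }
  var parts = [];
  for (var key in params) {
    if (params[key] === undefined || params[key] === null) continue;
    // Percent-encode both the name and the value.
    parts.push(encodeURIComponent(key) + "=" + encodeURIComponent(String(params[key])));
  }
  return base + "/" + encodeURIComponent(db) + "/_fti?" + parts.join("&");
}

var url = buildSearchUrl("http://localhost:5984", "dbname", {
  q: "body:document AND customer:[A TO C]",
  sort: "\\billing_size", // a leading backslash requests descending order
  limit: 10
});
console.log(url);
```
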
<h2>Special Fields</h2>

<dl>
<dt>_db</dt><dd>The source database of the document.</dd>
<dt>_id</dt><dd>The _id of the document.</dd>
</dl>

<h2>Dublin Core</h2>

All Dublin Core attributes are indexed and stored if detected in the attachment. Descriptions of the fields come from the Tika javadocs.

<dl>
<dt>dc.contributor</dt><dd>An entity responsible for making contributions to the content of the resource.</dd>
<dt>dc.coverage</dt><dd>The extent or scope of the content of the resource.</dd>
<dt>dc.creator</dt><dd>An entity primarily responsible for making the content of the resource.</dd>
<dt>dc.date</dt><dd>A date associated with an event in the life cycle of the resource.</dd>
<dt>dc.description</dt><dd>An account of the content of the resource.</dd>
<dt>dc.format</dt><dd>Typically, Format may include the media-type or dimensions of the resource.</dd>
<dt>dc.identifier</dt><dd>Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.</dd>
<dt>dc.language</dt><dd>A language of the intellectual content of the resource.</dd>
<dt>dc.modified</dt><dd>Date on which the resource was changed.</dd>
<dt>dc.publisher</dt><dd>An entity responsible for making the resource available.</dd>
<dt>dc.relation</dt><dd>A reference to a related resource.</dd>
<dt>dc.rights</dt><dd>Information about rights held in and over the resource.</dd>
<dt>dc.source</dt><dd>A reference to a resource from which the present resource is derived.</dd>
<dt>dc.subject</dt><dd>The topic of the content of the resource.</dd>
<dt>dc.title</dt><dd>A name given to the resource.</dd>
<dt>dc.type</dt><dd>The nature or genre of the content of the resource.</dd>
</dl>

<h2>Examples</h2>

<pre>
http://localhost:5984/dbname/_fti?q=field_name:value
http://localhost:5984/dbname/_fti?q=field_name:value&sort=other_field
http://localhost:5984/dbname/_fti?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
</pre>

<h2>Search Results Format</h2>

Here's an example of a JSON response without sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "search_duration": 518,
  "fetch_duration": 4,
  "rows": [
    {
      "_id": "hain-m-all_documents-257.",
      "score": 1.601625680923462
    },
    {
      "_id": "hain-m-notes_inbox-257.",
      "score": 1.601625680923462
    }
  ]
}
</pre>

And the same with sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "search_duration": 660,
  "fetch_duration": 4,
  "sort_order": [
    {
      "field": "source",
      "reverse": false,
      "type": "string"
    },
    {
      "reverse": false,
      "type": "doc"
    }
  ],
  "rows": [
    {
      "_id": "shankman-j-inbox-105.",
      "score": 0.6131107211112976,
      "sort_order": [
        "enron",
        6
      ]
    },
    {
      "_id": "shankman-j-inbox-8.",
      "score": 0.7492915391921997,
      "sort_order": [
        "enron",
        7
      ]
    },
    {
      "_id": "shankman-j-inbox-30.",
      "score": 0.507369875907898,
      "sort_order": [
        "enron",
        8
      ]
    }
  ]
}
</pre>

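A client usually only needs the rows array and the totals. As an illustration, the hypothetical helper below pulls the document ids and scores out of a parsed response shaped like the examples above. The function name and the returned structure are assumptions, and the hasMore check assumes total_rows counts every match rather than only the rows returned, as the samples suggest.

```javascript
// Hypothetical helper: summarize a parsed search response.
function summarize(response) {
  var hits = [];
  for (var i = 0; i < response.rows.length; i++) {
    var row = response.rows[i];
    hits.push({ id: row._id, score: row.score });
  }
  return {
    total: response.total_rows,
    returned: hits.length,
    // Assumption: more results remain while skip + returned rows is below the total.
    hasMore: response.skip + hits.length < response.total_rows,
    hits: hits
  };
}

// Sample copied from the unsorted response shown above.
var sample = JSON.parse('{"q":"+_db:enron +content:enron","skip":0,"limit":2,' +
  '"total_rows":176852,"search_duration":518,"fetch_duration":4,"rows":[' +
  '{"_id":"hain-m-all_documents-257.","score":1.601625680923462},' +
  '{"_id":"hain-m-notes_inbox-257.","score":1.601625680923462}]}');

var summary = summarize(sample);
console.log(summary.returned + " of " + summary.total + " hits");
```
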
<h1>Fetching information about the index</h1>

Calling couchdb-lucene without arguments returns a JSON object with information about the index.

<pre>
http://127.0.0.1:5984/enron/_fti
</pre>

returns:

<pre>
{"doc_count":517350,"doc_del_count":1,"disk_size":318543045}
</pre>

<h1>Working With The Source</h1>

To develop "live", type "mvn dependency:unpack-dependencies" and change the external line to something like this:

<pre>
fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
/path/to/couchdb-lucene/target/dependency com.github.rnewson.couchdb.lucene.Main
</pre>

You will need to restart CouchDB whenever you change the couchdb-lucene source code, but the restart is very fast.

<h1>Configuration</h1>

couchdb-lucene respects several system properties:

<dl>
<dt>couchdb.url</dt><dd>the URL to contact CouchDB with (default is "http://localhost:5984")</dd>
<dt>couchdb.lucene.dir</dt><dd>the path to the Lucene indexes (the default is to make a directory called 'lucene' relative to CouchDB's current working directory)</dd>
</dl>

You can override these properties like this:

<pre>
fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
-cp /home/rnewson/Source/couchdb-lucene/target/classes:\
/home/rnewson/Source/couchdb-lucene/target/dependency \
com.github.rnewson.couchdb.lucene.Main
</pre>

<h2>Basic Authentication</h2>

If you put CouchDB behind an authenticating proxy, you can still configure couchdb-lucene to pull from it by specifying additional system properties. Currently only Basic authentication is supported.

<dl>
<dt>couchdb.user</dt><dd>the user to authenticate as.</dd>
<dt>couchdb.password</dt><dd>the password to authenticate with.</dd>
</dl>