<h1>News</h1>

Issue tracking is now available at <a href="http://rnewson.lighthouseapp.com/projects/27420-couchdb-lucene">lighthouseapp</a>.

<h1>Build couchdb-lucene</h1>

<ol>
<li>Install Maven 2.</li>
<li>Check out the repository.</li>
<li>Type 'mvn'.</li>
<li>Configure CouchDB (see below).</li>
</ol>

<h1>Configure CouchDB</h1>

<pre>
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
</pre>

<h1>Indexing Strategy</h1>

<h2>Document Indexing</h2>

By default, all attributes are indexed. You can customize this by adding a design document at _design/lucene. It must supply an attribute called "transform" whose value is a JavaScript function, stored as a string, that takes a document and returns the document to index.

<pre>
{
  "transform":"function(doc) { return doc; }"
}
</pre>

<h3>Example Transforms</h3>

<h4>Index Everything (supplying no _design/lucene is equivalent and faster)</h4>

<pre>
function(doc) {
  return doc;
}
</pre>

<h4>Index Nothing</h4>

<pre>
function(doc) {
  return null;
}
</pre>

<h4>Don't Index Confidential Fields</h4>

<pre>
function(doc) {
  delete doc.social_security_number;
  delete doc.date_of_birth;
  return doc;
}
</pre>

<h4>Search Across All Properties</h4>

<pre>
function(doc) {
  function DumpObject(obj) {
    var result = "";
    for (var property in obj) {
      var value = obj[property];
      if (typeof value == 'object') {
        result += DumpObject(value) + " ";
      } else {
        result += value + " ";
      }
    }
    return result;
  }

  doc.all = DumpObject(doc);
  return doc;
}
</pre>

The function is evaluated by <a href="http://www.mozilla.org/rhino/">Rhino</a>. You may add, modify and remove any attributes. Additionally, returning null will exclude the document from indexing entirely.
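
The add, modify and remove operations can be combined in one transform. The following is a minimal sketch, not part of couchdb-lucene: it skips drafts and keeps only a whitelist of attributes. The field names (`subject`, `body`, `draft`) are hypothetical, and the function is named `transform` here only so it can be called directly; in the design document it would be stored as an anonymous function string, as in the examples above.

```javascript
function transform(doc) {
  // Hypothetical whitelist; these field names are for illustration only.
  var keep = { subject: true, body: true };

  // Exclude drafts from the index entirely ('draft' is an assumed attribute).
  if (doc.draft === true) {
    return null;
  }

  for (var property in doc) {
    // Leave CouchDB's own underscore-prefixed attributes (_id, _rev) alone.
    if (property.charAt(0) === "_") {
      continue;
    }
    // Remove every attribute that is not explicitly whitelisted.
    if (!Object.prototype.hasOwnProperty.call(keep, property)) {
      delete doc[property];
    }
  }
  return doc;
}
```

A whitelist is the safer default for sensitive data, since attributes added to documents later stay out of the index until they are listed.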

<h2>Attachment Indexing</h2>

couchdb-lucene uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, provided the correct content_type is set in CouchDB:

<h3>Supported Formats</h3>

<ul>
<li>Excel spreadsheets (application/vnd.ms-excel)</li>
<li>Word documents (application/msword)</li>
<li>PowerPoint presentations (application/vnd.ms-powerpoint)</li>
<li>Visio (application/vnd.visio)</li>
<li>Outlook (application/vnd.ms-outlook)</li>
<li>XML (application/xml)</li>
<li>HTML (text/html)</li>
<li>Images (image/*)</li>
<li>Java class files</li>
<li>Java jar archives</li>
<li>MP3 (audio/mp3)</li>
<li>OpenDocument (application/vnd.oasis.opendocument.*)</li>
<li>Plain text (text/plain)</li>
<li>PDF (application/pdf)</li>
<li>RTF (application/rtf)</li>
</ul>

<h1>Searching with couchdb-lucene</h1>

You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The following parameters can be passed for more sophisticated searches:

<dl>
<dt>q</dt><dd>the query to run (e.g., subject:hello)</dd>
<dt>sort</dt><dd>the comma-separated fields to sort on. Prefix with / for ascending order and \ for descending order (ascending is the default if not specified).</dd>
<dt>limit</dt><dd>the maximum number of results to return</dd>
<dt>skip</dt><dd>the number of results to skip</dd>
<dt>include_docs</dt><dd>whether to include the source docs</dd>
<dt>stale=ok</dt><dd>If you set the <i>stale</i> option to <i>ok</i>, couchdb-lucene may skip refreshing the index. Searches may be faster as Lucene caches important data (especially for sorting). A query without stale=ok will use the latest data committed to the index.</dd>
<dt>debug</dt><dd>if false, a normal application/json response with results is returned. If true, a pretty-printed HTML blob is returned instead.</dd>
<dt>rewrite</dt><dd>(EXPERT) if true, returns a JSON response with a rewritten query and term frequencies. This allows correct distributed scoring when combining the results from multiple nodes.</dd>
</dl>

<i>All parameters except 'q' are optional.</i>

<h2>Special Fields</h2>

<dl>
<dt>_id</dt><dd>The _id of the document.</dd>
<dt>_db</dt><dd>The source database of the document.</dd>
<dt>_body</dt><dd>Any text extracted from any attachment.</dd>
</dl>

<h2>Dublin Core</h2>

All Dublin Core attributes are indexed and stored if detected in the attachment. Descriptions of the fields come from the Tika javadocs.

<dl>
<dt>dc.contributor</dt><dd>An entity responsible for making contributions to the content of the resource.</dd>
<dt>dc.coverage</dt><dd>The extent or scope of the content of the resource.</dd>
<dt>dc.creator</dt><dd>An entity primarily responsible for making the content of the resource.</dd>
<dt>dc.date</dt><dd>A date associated with an event in the life cycle of the resource.</dd>
<dt>dc.description</dt><dd>An account of the content of the resource.</dd>
<dt>dc.format</dt><dd>Typically, Format may include the media-type or dimensions of the resource.</dd>
<dt>dc.identifier</dt><dd>Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.</dd>
<dt>dc.language</dt><dd>A language of the intellectual content of the resource.</dd>
<dt>dc.modified</dt><dd>Date on which the resource was changed.</dd>
<dt>dc.publisher</dt><dd>An entity responsible for making the resource available.</dd>
<dt>dc.relation</dt><dd>A reference to a related resource.</dd>
<dt>dc.rights</dt><dd>Information about rights held in and over the resource.</dd>
<dt>dc.source</dt><dd>A reference to a resource from which the present resource is derived.</dd>
<dt>dc.subject</dt><dd>The topic of the content of the resource.</dd>
<dt>dc.title</dt><dd>A name given to the resource.</dd>
<dt>dc.type</dt><dd>The nature or genre of the content of the resource.</dd>
</dl>

<h2>Examples</h2>

<pre>
http://localhost:5984/dbname/_fti?q=field_name:value
http://localhost:5984/dbname/_fti?q=field_name:value&sort=other_field
http://localhost:5984/dbname/_fti?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
</pre>
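
Queries such as the last one contain spaces, colons and brackets, which most HTTP clients expect to be percent-encoded. The sketch below builds a search URL from a plain object of parameters using the standard `encodeURIComponent`. The helper name `buildSearchUrl` is hypothetical, and it assumes CouchDB decodes the query string before passing it to couchdb-lucene.

```javascript
// Build a couchdb-lucene search URL with every parameter percent-encoded.
// 'base' is the CouchDB address and 'db' the database name.
function buildSearchUrl(base, db, params) {
  var parts = [];
  for (var name in params) {
    if (Object.prototype.hasOwnProperty.call(params, name)) {
      parts.push(
        encodeURIComponent(name) + "=" + encodeURIComponent(String(params[name]))
      );
    }
  }
  return base + "/" + encodeURIComponent(db) + "/_fti?" + parts.join("&");
}

// The third example above, expressed as parameters.
var url = buildSearchUrl("http://localhost:5984", "dbname", {
  q: "body:document AND customer:[A TO C]",
  sort: "billing_size",
  debug: true
});
```

The same helper covers descending sorts: a `sort` value of `\other_field` is sent as `%5Cother_field`.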

<h2>Search Results Format</h2>

Here's an example of a JSON response without sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "search_duration": 518,
  "fetch_duration": 4,
  "rows": [
    {
      "_id": "hain-m-all_documents-257.",
      "score": 1.601625680923462
    },
    {
      "_id": "hain-m-notes_inbox-257.",
      "score": 1.601625680923462
    }
  ]
}
</pre>

And the same with sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "search_duration": 660,
  "fetch_duration": 4,
  "sort_order": [
    {
      "field": "source",
      "reverse": false,
      "type": "string"
    },
    {
      "reverse": false,
      "type": "doc"
    }
  ],
  "rows": [
    {
      "_id": "shankman-j-inbox-105.",
      "score": 0.6131107211112976,
      "sort_order": [
        "enron",
        6
      ]
    },
    {
      "_id": "shankman-j-inbox-8.",
      "score": 0.7492915391921997,
      "sort_order": [
        "enron",
        7
      ]
    },
    {
      "_id": "shankman-j-inbox-30.",
      "score": 0.507369875907898,
      "sort_order": [
        "enron",
        8
      ]
    }
  ]
}
</pre>
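
A client typically needs two things from a response: the matching ids and whether another page exists. The sketch below derives both from the `skip`, `total_rows` and `rows` fields shown above. The function name `summarize` and the `nextSkip` field are hypothetical, and it assumes the next page is requested by passing `nextSkip` as the `skip` parameter.

```javascript
// Collect the ids from one page of results and compute where the next page starts.
function summarize(response) {
  var ids = [];
  for (var i = 0; i < response.rows.length; i++) {
    ids.push(response.rows[i]._id);
  }
  var nextSkip = response.skip + response.rows.length;
  return {
    ids: ids,
    // null signals that this page was the last one.
    nextSkip: nextSkip < response.total_rows ? nextSkip : null
  };
}

// The unsorted example response above, trimmed to the fields that are used.
var page = summarize({
  skip: 0,
  limit: 2,
  total_rows: 176852,
  rows: [
    { _id: "hain-m-all_documents-257.", score: 1.601625680923462 },
    { _id: "hain-m-notes_inbox-257.", score: 1.601625680923462 }
  ]
});
```

The row count is used instead of `limit` so that a short final page is handled correctly.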

<h1>Fetching information about the index</h1>

Calling couchdb-lucene without arguments returns a JSON object with information about the index.

<pre>
http://127.0.0.1:5984/enron/_fti
</pre>

returns:

<pre>
{"doc_count":517350,"doc_del_count":1,"disk_size":318543045}
</pre>

<h1>Working With The Source</h1>

To develop "live", type "mvn dependency:unpack-dependencies" and change the fti line in the [external] section to something like this:

<pre>
fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
/path/to/couchdb-lucene/target/dependency com.github.rnewson.couchdb.lucene.Main
</pre>

You will need to restart CouchDB if you change the couchdb-lucene source code, but this is very fast.

<h1>Configuration</h1>

couchdb-lucene respects several system properties:

<dl>
<dt>couchdb.url</dt><dd>the URL to contact CouchDB with (default is "http://localhost:5984")</dd>
<dt>couchdb.lucene.dir</dt><dd>the path to the Lucene indexes (the default is to make a directory called 'lucene' relative to CouchDB's current working directory)</dd>
</dl>

You can override these properties like this:

<pre>
fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
-cp /home/rnewson/Source/couchdb-lucene/target/classes:\
/home/rnewson/Source/couchdb-lucene/target/dependency \
com.github.rnewson.couchdb.lucene.Main
</pre>

<h2>Basic Authentication</h2>

If you put CouchDB behind an authenticating proxy, you can still configure couchdb-lucene to pull from it by specifying additional system properties. Currently only Basic authentication is supported.

<dl>
<dt>couchdb.user</dt><dd>the user to authenticate as.</dd>
<dt>couchdb.password</dt><dd>the password to authenticate with.</dd>
</dl>
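
When checking the proxy with the same credentials, it can help to know what a Basic authentication header looks like. The sketch below is only an illustration of the standard HTTP Basic scheme, not couchdb-lucene's own code. The helper name `basicAuthHeader` is hypothetical, and the use of `Buffer` assumes a Node.js environment.

```javascript
// Compute the value of the HTTP Authorization header for Basic authentication:
// the word "Basic" followed by base64("user:password").
function basicAuthHeader(user, password) {
  var credentials = user + ":" + password;
  return "Basic " + Buffer.from(credentials, "utf8").toString("base64");
}

// Placeholder credentials, for illustration only.
var header = basicAuthHeader("Aladdin", "open sesame");
```

Basic authentication only encodes the credentials and does not encrypt them, so the link between couchdb-lucene and the proxy should be trusted or protected by TLS.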