<h1>News</h1>

Issue tracking is now available at <a href="http://rnewson.lighthouseapp.com/projects/27420-couchdb-lucene">lighthouseapp</a>.

<h1>Build couchdb-lucene</h1>

<ol>
<li>Install Maven 2.
<li>Check out the repository.
<li>Type 'mvn'.
<li>Configure CouchDB (see below).
</ol>

<h1>Configure CouchDB</h1>

<pre>
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
</pre>

<h1>Indexing Strategy</h1>

<h2>Document Indexing</h2>

By default, all attributes are indexed. You can customize this process by adding a design document at _design/lucene. It must supply an attribute called "transform", a function that takes a document and returns a document. For example:

<pre>
{
  "transform":"function(doc) { return doc; }"
}
</pre>

The function is evaluated by <a href="http://www.mozilla.org/rhino/">Rhino</a>. It may add, modify, and remove any attributes. Returning null excludes the document from indexing entirely.
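
As an illustration of these rules, here is a minimal sketch of a transform that excludes some documents, removes one attribute, and adds another. The attribute names (type, draft, internal_notes, title, author, full_title) are hypothetical and not part of couchdb-lucene. The sketch also assumes the function is stored as an anonymous function string, as in the example above.

```javascript
// Hypothetical transform. The attribute names are assumptions about the
// document schema, not something couchdb-lucene defines.
var transform = function (doc) {
  // Returning null excludes the document from indexing entirely.
  if (doc.type !== "article" || doc.draft === true) {
    return null;
  }
  // Remove an attribute that should not be searchable.
  delete doc.internal_notes;
  // Add a derived attribute built from two existing ones.
  doc.full_title = doc.title + " (" + doc.author + ")";
  return doc;
};

// The design document stores the function source as a string.
var designDoc = {
  _id: "_design/lucene",
  transform: transform.toString()
};

console.log(JSON.stringify(designDoc, null, 2));
```

Writing the function as real code and serializing it with toString() makes it possible to test the transform locally before saving the design document.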

<h2>Attachment Indexing</h2>

couchdb-lucene uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, provided the correct content_type is set in CouchDB:

<h3>Supported Formats</h3>

<ul>
<li>Excel spreadsheets (application/vnd.ms-excel)
<li>Word documents (application/msword)
<li>PowerPoint presentations (application/vnd.ms-powerpoint)
<li>Visio (application/vnd.visio)
<li>Outlook (application/vnd.ms-outlook)
<li>XML (application/xml)
<li>HTML (text/html)
<li>Images (image/*)
<li>Java class files
<li>Java jar archives
<li>MP3 (audio/mp3)
<li>OpenDocument (application/vnd.oasis.opendocument.*)
<li>Plain text (text/plain)
<li>PDF (application/pdf)
<li>RTF (application/rtf)
</ul>

<h1>Searching with couchdb-lucene</h1>

You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The following parameters can be passed for more sophisticated searches:

<dl>
<dt>q<dd>the query to run (e.g., subject:hello)
<dt>sort<dd>the comma-separated fields to sort on. Prefix a field with / for ascending order or \ for descending order (ascending is the default if neither is given).
<dt>limit<dd>the maximum number of results to return
<dt>skip<dd>the number of results to skip
<dt>include_docs<dd>whether to include the source docs
<dt>stale=ok<dd>If you set the <i>stale</i> option to <i>ok</i>, couchdb-lucene may skip refreshing the index. Searches may be faster as Lucene caches important data (especially for sorting). A query without stale=ok will use the latest data committed to the index.
<dt>debug<dd>if false, a normal application/json response with results is returned. If true, a pretty-printed HTML blob is returned instead.
<dt>rewrite<dd>(EXPERT) if true, returns a JSON response with a rewritten query and term frequencies. This allows correct distributed scoring when combining the results from multiple nodes.
</dl>

<i>All parameters except 'q' are optional.</i>
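
The prefix rule for the sort parameter can be sketched as a small helper. The { name, descending } object shape and the field names are hypothetical conveniences for this example, not a couchdb-lucene API.

```javascript
// Build the value of the "sort" parameter from a list of fields.
// The { name, descending } shape is an assumption made for this sketch.
function sortParam(fields) {
  var parts = [];
  for (var i = 0; i < fields.length; i++) {
    // "/" marks ascending order, "\" marks descending order.
    var prefix = fields[i].descending ? "\\" : "/";
    parts.push(prefix + fields[i].name);
  }
  return parts.join(",");
}

var sort = sortParam([
  { name: "billing_size", descending: true },
  { name: "customer" }
]);
console.log(sort); // \billing_size,/customer
```

The resulting value still has to be percent-encoded (for example with encodeURIComponent) before it is placed in a URL.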

<h2>Special Fields</h2>

<dl>
<dt>_id<dd>The _id of the document.
<dt>_db<dd>The source database of the document.
<dt>_body<dd>Any text extracted from any attachment.
</dl>

<h2>Dublin Core</h2>

All Dublin Core attributes are indexed and stored if detected in the attachment. Descriptions of the fields come from the Tika javadocs.

<dl>
<dt>dc.contributor<dd>An entity responsible for making contributions to the content of the resource.
<dt>dc.coverage<dd>The extent or scope of the content of the resource.
<dt>dc.creator<dd>An entity primarily responsible for making the content of the resource.
<dt>dc.date<dd>A date associated with an event in the life cycle of the resource.
<dt>dc.description<dd>An account of the content of the resource.
<dt>dc.format<dd>Typically, Format may include the media-type or dimensions of the resource.
<dt>dc.identifier<dd>Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.
<dt>dc.language<dd>A language of the intellectual content of the resource.
<dt>dc.modified<dd>Date on which the resource was changed.
<dt>dc.publisher<dd>An entity responsible for making the resource available.
<dt>dc.relation<dd>A reference to a related resource.
<dt>dc.rights<dd>Information about rights held in and over the resource.
<dt>dc.source<dd>A reference to a resource from which the present resource is derived.
<dt>dc.subject<dd>The topic of the content of the resource.
<dt>dc.title<dd>A name given to the resource.
<dt>dc.type<dd>The nature or genre of the content of the resource.
</dl>

<h2>Examples</h2>

<pre>
http://localhost:5984/dbname/_fti?q=field_name:value
http://localhost:5984/dbname/_fti?q=field_name:value&sort=other_field
http://localhost:5984/dbname/_fti?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
</pre>
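
The last example contains spaces and brackets, which need percent-encoding in a real request. Here is a minimal sketch of building such a URL; buildSearchUrl is a hypothetical helper, and the host and database name are the placeholders used above.

```javascript
// Build a couchdb-lucene query URL with percent-encoded parameters.
// buildSearchUrl is a hypothetical helper written for this example.
function buildSearchUrl(base, db, params) {
  var parts = [];
  for (var name in params) {
    if (Object.prototype.hasOwnProperty.call(params, name)) {
      parts.push(
        encodeURIComponent(name) + "=" + encodeURIComponent(String(params[name]))
      );
    }
  }
  return base + "/" + encodeURIComponent(db) + "/_fti?" + parts.join("&");
}

var url = buildSearchUrl("http://localhost:5984", "dbname", {
  debug: true,
  sort: "billing_size",
  q: "body:document AND customer:[A TO C]"
});
console.log(url);
```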

<h2>Search Results Format</h2>

Here's an example of a JSON response without sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "search_duration": 518,
  "fetch_duration": 4,
  "rows": [
    {
      "_id": "hain-m-all_documents-257.",
      "score": 1.601625680923462
    },
    {
      "_id": "hain-m-notes_inbox-257.",
      "score": 1.601625680923462
    }
  ]
}
</pre>

And the same with sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "search_duration": 660,
  "fetch_duration": 4,
  "sort_order": [
    {
      "field": "source",
      "reverse": false,
      "type": "string"
    },
    {
      "reverse": false,
      "type": "doc"
    }
  ],
  "rows": [
    {
      "_id": "shankman-j-inbox-105.",
      "score": 0.6131107211112976,
      "sort_order": [
        "enron",
        6
      ]
    },
    {
      "_id": "shankman-j-inbox-8.",
      "score": 0.7492915391921997,
      "sort_order": [
        "enron",
        7
      ]
    },
    {
      "_id": "shankman-j-inbox-30.",
      "score": 0.507369875907898,
      "sort_order": [
        "enron",
        8
      ]
    }
  ]
}
</pre>
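
A client might read such a response along the following lines. This is a sketch only: resultIds and nextSkip are hypothetical helpers, and using skip plus the number of returned rows as the next page offset is an assumption based on the descriptions of skip and limit.

```javascript
// Abridged copy of the unsorted example response shown above.
var response = {
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "rows": [
    { "_id": "hain-m-all_documents-257.", "score": 1.601625680923462 },
    { "_id": "hain-m-notes_inbox-257.", "score": 1.601625680923462 }
  ]
};

// Collect the document ids from a response.
function resultIds(result) {
  var ids = [];
  for (var i = 0; i < result.rows.length; i++) {
    ids.push(result.rows[i]._id);
  }
  return ids;
}

// Return the "skip" value for the next page, or null when no rows remain.
function nextSkip(result) {
  var next = result.skip + result.rows.length;
  return next < result.total_rows ? next : null;
}

console.log(resultIds(response));
console.log(nextSkip(response)); // 2
```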

<h1>Fetching information about the index</h1>

Calling couchdb-lucene without arguments returns a JSON object with information about the index.

<pre>
http://127.0.0.1:5984/enron/_fti
</pre>

returns:

<pre>
{"doc_count":517350,"doc_del_count":1,"disk_size":318543045}
</pre>

<h1>Working With The Source</h1>

To develop "live", type "mvn dependency:unpack-dependencies" and change the external line to something like this:

<pre>
fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
/path/to/couchdb-lucene/target/dependency com.github.rnewson.couchdb.lucene.Main
</pre>

You will need to restart CouchDB whenever you change the couchdb-lucene source code, but this is very fast.

<h1>Configuration</h1>

couchdb-lucene respects several system properties:

<dl>
<dt>couchdb.url<dd>the URL to contact CouchDB with (default is "http://localhost:5984")
<dt>couchdb.lucene.dir<dd>the path to the Lucene indexes (the default is a directory called 'lucene' relative to CouchDB's current working directory)
</dl>

You can override these properties like this:

<pre>
fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
-cp /home/rnewson/Source/couchdb-lucene/target/classes:\
/home/rnewson/Source/couchdb-lucene/target/dependency \
com.github.rnewson.couchdb.lucene.Main
</pre>

<h2>Basic Authentication</h2>

If you put CouchDB behind an authenticating proxy, you can still configure couchdb-lucene to pull from it by specifying additional system properties. Currently only Basic authentication is supported.

<dl>
<dt>couchdb.user<dd>the user to authenticate as.
<dt>couchdb.password<dd>the password to authenticate with.
</dl>
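
For context, Basic authentication sends the user name and password, joined by a colon and base64-encoded, in the Authorization header of each request. The sketch below shows that encoding for values like those supplied in couchdb.user and couchdb.password. It is not taken from the couchdb-lucene source, basicAuthHeader is a hypothetical helper, and the credentials are placeholders.

```javascript
// Build the value of an HTTP Authorization header for Basic authentication.
// Buffer is a Node.js API; other runtimes need a different base64 encoder.
function basicAuthHeader(user, password) {
  var credentials = user + ":" + password;
  return "Basic " + Buffer.from(credentials, "utf8").toString("base64");
}

// Placeholder credentials for illustration only.
console.log(basicAuthHeader("Aladdin", "open sesame"));
```

This can be handy for checking that the proxy accepts the credentials before wiring them into the couchdb-lucene command line.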