5d4e56a update readme.
Robert Newson authored Mar 6, 2009
<h1>News</h1>

Issue tracking is now available at <a href="http://rnewson.lighthouseapp.com/projects/27420-couchdb-lucene">lighthouseapp</a>.

<h1>Build couchdb-lucene</h1>

<ol>
<li>Install Maven 2.
<li>Check out the repository.
<li>Type 'mvn'.
<li>Configure CouchDB (see below).
</ol>

<h1>Configure CouchDB</h1>

<pre>
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
</pre>

<h1>Indexing Strategy</h1>

<h2>Document Indexing</h2>

By default, all attributes are indexed. You can customize this process by adding a design document at _design/lucene. It must supply an attribute called "transform" whose value is a function that takes a document and returns a document. For example:

<pre>
{
  "transform":"function(doc) { return doc; }"
}
</pre>

The function is evaluated by <a href="http://www.mozilla.org/rhino/">Rhino</a>. You may add, modify and remove any attributes. Additionally, returning null will exclude the document from indexing entirely.
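As an illustration of the three behaviours described above (removing an attribute, adding one, and returning null), here is a minimal sketch of a less trivial transform. The attribute names "type", "password" and "title" are assumptions made up for this example, and the surrounding script simply shows one way to turn the function into the string form that the design document expects; it is meant to be run with Node.js, not by couchdb-lucene itself.

```javascript
// Hypothetical transform. The attribute names used here ("type",
// "password", "title") are illustrative assumptions only.
const transform = function (doc) {
  // Returning null excludes the document from indexing entirely.
  if (doc.type === "draft") {
    return null;
  }
  // Remove an attribute that should not be searchable.
  delete doc.password;
  // Add a derived attribute.
  if (typeof doc.title === "string") {
    doc.title_lower = doc.title.toLowerCase();
  }
  return doc;
};

// The design document stores the function as a string.
const designDoc = {
  _id: "_design/lucene",
  transform: transform.toString(),
};

console.log(JSON.stringify(designDoc, null, 2));
```

The printed JSON can then be saved to the database as the _design/lucene document.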

<h2>Attachment Indexing</h2>

couchdb-lucene uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, provided the correct content_type is set in CouchDB:

<h3>Supported Formats</h3>

<ul>
<li>Excel spreadsheets (application/vnd.ms-excel)
<li>Word documents (application/msword)
<li>PowerPoint presentations (application/vnd.ms-powerpoint)
<li>Visio (application/vnd.visio)
<li>Outlook (application/vnd.ms-outlook)
<li>XML (application/xml)
<li>HTML (text/html)
<li>Images (image/*)
<li>Java class files
<li>Java jar archives
<li>MP3 (audio/mp3)
<li>OpenDocument (application/vnd.oasis.opendocument.*)
<li>Plain text (text/plain)
<li>PDF (application/pdf)
<li>RTF (application/rtf)
</ul>

<h1>Searching with couchdb-lucene</h1>

You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The following parameters can be passed for more sophisticated searches:

<dl>
<dt>q<dd>the query to run (e.g., subject:hello)
<dt>sort<dd>the comma-separated fields to sort on. Prefix a field with / for ascending order or \ for descending order (ascending is the default if not specified).
<dt>limit<dd>the maximum number of results to return
<dt>skip<dd>the number of results to skip
<dt>include_docs<dd>whether to include the source docs
<dt>stale=ok<dd>If you set the <i>stale</i> option to <i>ok</i>, couchdb-lucene may skip refreshing the index. Searches may be faster because Lucene caches important data (especially for sorting). A query without stale=ok will use the latest data committed to the index.
<dt>debug<dd>if false, a normal application/json response with results is returned. If true, a pretty-printed HTML blob is returned instead.
</dl>

<i>All parameters except 'q' are optional.</i>
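Queries often contain characters such as spaces, colons, brackets and the \ sort prefix, which should be percent-encoded when they are placed in a URL. The following sketch shows one way to assemble a search URL from the parameters listed above. The helper name buildSearchUrl and the database name "dbname" are assumptions for illustration; only the parameter names come from this README.

```javascript
// Hypothetical helper: assembles a _fti search URL and percent-encodes
// every parameter value. Not part of couchdb-lucene.
function buildSearchUrl(base, db, params) {
  const query = Object.keys(params)
    .map(function (key) {
      return encodeURIComponent(key) + "=" + encodeURIComponent(String(params[key]));
    })
    .join("&");
  return base + "/" + encodeURIComponent(db) + "/_fti?" + query;
}

const url = buildSearchUrl("http://localhost:5984", "dbname", {
  q: "body:document AND customer:[A TO C]",
  sort: "\\billing_size", // a single backslash prefix: descending order
  limit: 10,
});

console.log(url);
```

Encoding each value separately keeps an ampersand or equals sign inside a query from being mistaken for a parameter separator.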

<h2>Special Fields</h2>

<dl>
<dt>_id<dd>The _id of the document.
<dt>_db<dd>The source database of the document.
<dt>_body<dd>Any text extracted from any attachment.
</dl>

<h2>Dublin Core</h2>

All Dublin Core attributes are indexed and stored if detected in the attachment. Descriptions of the fields come from the Tika javadocs.

<dl>
<dt>dc.contributor<dd>An entity responsible for making contributions to the content of the resource.
<dt>dc.coverage<dd>The extent or scope of the content of the resource.
<dt>dc.creator<dd>An entity primarily responsible for making the content of the resource.
<dt>dc.date<dd>A date associated with an event in the life cycle of the resource.
<dt>dc.description<dd>An account of the content of the resource.
<dt>dc.format<dd>Typically, Format may include the media-type or dimensions of the resource.
<dt>dc.identifier<dd>Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.
<dt>dc.language<dd>A language of the intellectual content of the resource.
<dt>dc.modified<dd>Date on which the resource was changed.
<dt>dc.publisher<dd>An entity responsible for making the resource available.
<dt>dc.relation<dd>A reference to a related resource.
<dt>dc.rights<dd>Information about rights held in and over the resource.
<dt>dc.source<dd>A reference to a resource from which the present resource is derived.
<dt>dc.subject<dd>The topic of the content of the resource.
<dt>dc.title<dd>A name given to the resource.
<dt>dc.type<dd>The nature or genre of the content of the resource.
</dl>

<h2>Examples</h2>

<pre>
http://localhost:5984/dbname/_fti?q=field_name:value
http://localhost:5984/dbname/_fti?q=field_name:value&sort=other_field
http://localhost:5984/dbname/_fti?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
</pre>

<h2>Search Results Format</h2>

Here's an example of a JSON response without sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "search_duration": 518,
  "fetch_duration": 4,
  "rows": [
    {
      "_id": "hain-m-all_documents-257.",
      "score": 1.601625680923462
    },
    {
      "_id": "hain-m-notes_inbox-257.",
      "score": 1.601625680923462
    }
  ]
}
</pre>

And the same with sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "search_duration": 660,
  "fetch_duration": 4,
  "sort_order": [
    {
      "field": "source",
      "reverse": false,
      "type": "string"
    },
    {
      "reverse": false,
      "type": "doc"
    }
  ],
  "rows": [
    {
      "_id": "shankman-j-inbox-105.",
      "score": 0.6131107211112976,
      "sort_order": [
        "enron",
        6
      ]
    },
    {
      "_id": "shankman-j-inbox-8.",
      "score": 0.7492915391921997,
      "sort_order": [
        "enron",
        7
      ]
    },
    {
      "_id": "shankman-j-inbox-30.",
      "score": 0.507369875907898,
      "sort_order": [
        "enron",
        8
      ]
    }
  ]
}
</pre>
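A client will usually want the document ids from "rows" and some way to page through the remaining results. The sketch below works on a trimmed copy of the sorted response above. The helper name summarize is hypothetical, and it assumes that total_rows is the total number of matches and that the next page starts at skip plus the number of rows returned; the README does not state either point explicitly.

```javascript
// Trimmed copy of the sorted example response.
const response = {
  q: "+_db:enron +content:enron",
  skip: 0,
  limit: 3,
  total_rows: 176852,
  rows: [
    { _id: "shankman-j-inbox-105.", score: 0.6131107211112976 },
    { _id: "shankman-j-inbox-8.", score: 0.7492915391921997 },
    { _id: "shankman-j-inbox-30.", score: 0.507369875907898 },
  ],
};

// Hypothetical helper: extracts ids and computes paging information.
// Assumes total_rows counts all matches, not just the rows returned.
function summarize(res) {
  const nextSkip = res.skip + res.rows.length;
  return {
    ids: res.rows.map(function (row) { return row._id; }),
    nextSkip: nextSkip,
    hasMore: nextSkip < res.total_rows,
  };
}

const page = summarize(response);
console.log(page.ids.join("\n"));
console.log("next skip:", page.nextSkip, "more results:", page.hasMore);
```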

<h1>Fetching information about the index</h1>

Calling couchdb-lucene without arguments returns a JSON object with information about the index.

<pre>
http://127.0.0.1:5984/enron/_fti
</pre>

returns:

<pre>
{"doc_count":517350,"doc_del_count":1,"disk_size":318543045}
</pre>
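These figures can be used for rough monitoring. The sketch below derives an average index size per document from the sample response above. It assumes that disk_size is measured in bytes, which the README does not state, and the helper name bytesPerDoc is made up for this example.

```javascript
// Sample values copied from the response above.
const info = { doc_count: 517350, doc_del_count: 1, disk_size: 318543045 };

// Hypothetical helper. Assumption: disk_size is reported in bytes.
function bytesPerDoc(indexInfo) {
  if (indexInfo.doc_count === 0) {
    return 0; // avoid dividing by zero on an empty index
  }
  return indexInfo.disk_size / indexInfo.doc_count;
}

console.log("approx. bytes per document:", Math.round(bytesPerDoc(info)));
```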

<h1>Working With The Source</h1>

To develop "live", type "mvn dependency:unpack-dependencies" and change the external line to something like this:

<pre>
fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
/path/to/couchdb-lucene/target/dependency org.apache.couchdb.lucene.Main
</pre>

You will need to restart CouchDB whenever you change the couchdb-lucene source code, but this is very fast.

<h1>Configuration</h1>

couchdb-lucene respects several system properties:

<dl>
<dt>couchdb.url<dd>the URL to contact CouchDB with (default is "http://localhost:5984")
<dt>couchdb.lucene.dir<dd>the path to the Lucene indexes (the default is to make a directory called 'lucene' relative to CouchDB's current working directory)
</dl>

You can override these properties like this:

<pre>
fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
-cp /home/rnewson/Source/couchdb-lucene/target/classes:\
/home/rnewson/Source/couchdb-lucene/target/dependency \
org.apache.couchdb.lucene.Main
</pre>

<h2>Basic Authentication</h2>

If you put CouchDB behind an authenticating proxy, you can still configure couchdb-lucene to pull from it by specifying additional system properties. Currently only Basic authentication is supported.

<dl>
<dt>couchdb.user<dd>the user to authenticate as.
<dt>couchdb.password<dd>the password to authenticate with.
</dl>
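When debugging a proxy setup it can help to know what a Basic authentication header looks like: the user and password are joined with a colon and base64-encoded. The sketch below illustrates the general HTTP scheme using Node.js; it is not couchdb-lucene's own code, and the helper name basicAuthHeader and the sample credentials are assumptions for illustration.

```javascript
// Hypothetical helper showing the standard HTTP Basic scheme:
// "Basic " followed by base64("user:password").
function basicAuthHeader(user, password) {
  const token = Buffer.from(user + ":" + password, "utf8").toString("base64");
  return "Basic " + token;
}

// Sample credentials, for illustration only.
console.log("Authorization: " + basicAuthHeader("Aladdin", "open sesame"));
```

Because base64 is an encoding rather than encryption, these credentials are only as private as the connection between couchdb-lucene and the proxy.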