<h1>News</h1>

I've merged the changes from the beta branch, which bring many improvements. Notably:

<ol>
<li>Indexing is now a separate process from searching and is triggered by update notifications.
<li>Rhino integration has landed; user customization of indexing is now possible.
</ol>

You are advised to delete indexes created prior to this update.

<h1>Build couchdb-lucene</h1>

<ol>
<li>Install Maven 2.
<li>Check out the repository.
<li>Type 'mvn'.
<li>Configure CouchDB (see below).
</ol>

<h1>Configure CouchDB</h1>

<pre>
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
</pre>

<h1>Indexing Strategy</h1>

<h2>Document Indexing</h2>

By default, all attributes are indexed. You can customize this process by adding a design document at _design/lucene. The design document must supply an attribute called "transform" whose value is the source of a function that takes a document and returns a document. For example:

<pre>
{
  "transform":"function(doc) { return doc; }"
}
</pre>

The function is evaluated by <a href="http://www.mozilla.org/rhino/">Rhino</a>. You may add, modify and remove any attributes. Additionally, returning null will exclude the document from indexing entirely.
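
As a minimal sketch of a less trivial transform, the function below removes one attribute and excludes one kind of document by returning null. The attribute names `password` and `type` and the value `"draft"` are hypothetical, chosen only for illustration. The wrapper that builds the design document is likewise an assumption about how you might generate it; only the "transform" attribute and the _design/lucene location come from the description above.

```javascript
// Hypothetical transform: "password", "type" and "draft" are illustrative
// assumptions, not names defined by couchdb-lucene.
const transform = function (doc) {
  // Returning null excludes the document from indexing entirely.
  if (doc.type === "draft") {
    return null;
  }
  // Remove an attribute that should not be searchable.
  delete doc.password;
  return doc;
};

// The design document stores the function source as a string.
const designDoc = {
  _id: "_design/lucene",
  transform: transform.toString(),
};

console.log(JSON.stringify(designDoc, null, 2));
```

Keeping the function as real code and serializing it with `toString()` lets you exercise it locally before saving the design document.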

<h2>Attachment Indexing</h2>

couchdb-lucene uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, provided the correct content_type is set in CouchDB:

<h3>Supported Formats</h3>

<ul>
<li>Excel spreadsheets (application/vnd.ms-excel)
<li>Word documents (application/msword)
<li>PowerPoint presentations (application/vnd.ms-powerpoint)
<li>Visio (application/vnd.visio)
<li>Outlook (application/vnd.ms-outlook)
<li>XML (application/xml)
<li>HTML (text/html)
<li>Images (image/*)
<li>Java class files
<li>Java jar archives
<li>MP3 (audio/mp3)
<li>OpenDocument (application/vnd.oasis.opendocument.*)
<li>Plain text (text/plain)
<li>PDF (application/pdf)
<li>RTF (application/rtf)
</ul>
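
Because indexing depends on the content_type stored with the attachment, it can help to check the type on the client before uploading. The sketch below is one way to do that; the helper name `isListedContentType` is hypothetical, and the wildcard handling is an assumption about how the `*` entries in the list should be read, not a description of how couchdb-lucene matches types internally.

```javascript
// Content types copied from the list above. Java class files and jar
// archives are omitted because the list gives no content type for them.
const LISTED_TYPES = [
  "application/vnd.ms-excel",
  "application/msword",
  "application/vnd.ms-powerpoint",
  "application/vnd.visio",
  "application/vnd.ms-outlook",
  "application/xml",
  "text/html",
  "image/*",
  "audio/mp3",
  "application/vnd.oasis.opendocument.*",
  "text/plain",
  "application/pdf",
  "application/rtf",
];

// Hypothetical helper: true if contentType matches an entry in the list,
// treating a trailing "*" as a prefix wildcard (an assumption).
function isListedContentType(contentType) {
  return LISTED_TYPES.some((pattern) =>
    pattern.endsWith("*")
      ? contentType.startsWith(pattern.slice(0, -1))
      : contentType === pattern
  );
}

console.log(isListedContentType("image/png")); // true
console.log(isListedContentType("application/zip")); // false
```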

<h1>Searching with couchdb-lucene</h1>

You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The following parameters can be passed for more sophisticated searches:

<dl>
<dt>q<dd>the query to run (e.g., subject:hello)
<dt>sort<dd>the comma-separated fields to sort on
<dt>asc<dd>sort ascending (true) or descending (false); applies only when sorting on a single field
<dt>limit<dd>the maximum number of results to return
<dt>skip<dd>the number of results to skip
<dt>include_docs<dd>whether to include the source docs
<dt>debug<dd>if false, a normal application/json response with results is returned; if true, a pretty-printed HTML blob is returned instead
</dl>

<i>All parameters except 'q' are optional.</i>
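
Lucene queries often contain spaces, colons and brackets, so it is safer to URL-encode each parameter when building a request. The following is a minimal sketch; the helper name `buildSearchUrl`, the host, the database name and the field names are all assumptions made for illustration. Only the parameter names and the `_fti` path come from this README.

```javascript
// Hypothetical helper: builds a couchdb-lucene search URL from a base URL,
// a database name and an object of the parameters listed above.
function buildSearchUrl(baseUrl, dbName, params) {
  // 'q' is the only required parameter.
  if (!params.q) {
    throw new Error("the 'q' parameter is required");
  }
  const query = Object.keys(params)
    .map((key) => encodeURIComponent(key) + "=" + encodeURIComponent(String(params[key])))
    .join("&");
  return baseUrl + "/" + encodeURIComponent(dbName) + "/_fti?" + query;
}

// Example values are assumptions, not part of any real database.
const searchUrl = buildSearchUrl("http://localhost:5984", "dbname", {
  q: "body:document AND customer:[A TO C]",
  sort: "billing_size",
  asc: false,
  limit: 10,
});

console.log(searchUrl);
// http://localhost:5984/dbname/_fti?q=body%3Adocument%20AND%20customer%3A%5BA%20TO%20C%5D&sort=billing_size&asc=false&limit=10
```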

<h2>Special Fields</h2>

<dl>
<dt>_id<dd>The _id of the document.
<dt>_rev<dd>The _rev of the document.
<dt>_db<dd>The source database of the document.
<dt>_body<dd>Any text extracted from any attachment.
</dl>

<h2>Dublin Core</h2>

All Dublin Core attributes are indexed and stored if detected in the attachment. Descriptions of the fields come from the Tika javadocs.

<dl>
<dt>dc.contributor<dd>An entity responsible for making contributions to the content of the resource.
<dt>dc.coverage<dd>The extent or scope of the content of the resource.
<dt>dc.creator<dd>An entity primarily responsible for making the content of the resource.
<dt>dc.date<dd>A date associated with an event in the life cycle of the resource.
<dt>dc.description<dd>An account of the content of the resource.
<dt>dc.format<dd>Typically, Format may include the media-type or dimensions of the resource.
<dt>dc.identifier<dd>Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.
<dt>dc.language<dd>A language of the intellectual content of the resource.
<dt>dc.modified<dd>Date on which the resource was changed.
<dt>dc.publisher<dd>An entity responsible for making the resource available.
<dt>dc.relation<dd>A reference to a related resource.
<dt>dc.rights<dd>Information about rights held in and over the resource.
<dt>dc.source<dd>A reference to a resource from which the present resource is derived.
<dt>dc.subject<dd>The topic of the content of the resource.
<dt>dc.title<dd>A name given to the resource.
<dt>dc.type<dd>The nature or genre of the content of the resource.
</dl>

<h2>Examples</h2>

<pre>
http://localhost:5984/dbname/_fti?q=field_name:value
http://localhost:5984/dbname/_fti?q=field_name:value&sort=other_field
http://localhost:5984/dbname/_fti?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
</pre>

<h2>Search Results Format</h2>

Here's an example of a JSON response without sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "search_duration": 518,
  "fetch_duration": 4,
  "rows": [
    {
      "_id": "hain-m-all_documents-257.",
      "_rev": "3750319208",
      "score": 1.601625680923462
    },
    {
      "_id": "hain-m-notes_inbox-257.",
      "_rev": "2603032545",
      "score": 1.601625680923462
    }
  ]
}
</pre>

And the same with sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "search_duration": 660,
  "fetch_duration": 4,
  "sort_order": [
    {
      "field": "source",
      "reverse": false,
      "type": "string"
    },
    {
      "reverse": false,
      "type": "doc"
    }
  ],
  "rows": [
    {
      "_id": "shankman-j-inbox-105.",
      "_rev": "4289412378",
      "score": 0.6131107211112976,
      "sort_order": [
        "enron",
        6
      ]
    },
    {
      "_id": "shankman-j-inbox-8.",
      "_rev": "1417542355",
      "score": 0.7492915391921997,
      "sort_order": [
        "enron",
        7
      ]
    },
    {
      "_id": "shankman-j-inbox-30.",
      "_rev": "951793815",
      "score": 0.507369875907898,
      "sort_order": [
        "enron",
        8
      ]
    }
  ]
}
</pre>
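
To illustrate how a client might consume such a response, the sketch below collects the document ids and works out the `skip` value for the next page. The helper name `nextSkip` is hypothetical, and the code assumes that `total_rows` is the total number of matches for the query, which the examples suggest but this README does not state outright.

```javascript
// Trimmed copy of the unsorted example response above.
const response = {
  q: "+_db:enron +content:enron",
  skip: 0,
  limit: 2,
  total_rows: 176852,
  rows: [
    { _id: "hain-m-all_documents-257.", _rev: "3750319208", score: 1.601625680923462 },
    { _id: "hain-m-notes_inbox-257.", _rev: "2603032545", score: 1.601625680923462 },
  ],
};

// Hypothetical helper: returns the 'skip' value for the next page, or null
// when no further rows remain. Assumes total_rows counts all matches.
function nextSkip(result) {
  const next = result.skip + result.rows.length;
  return result.rows.length > 0 && next < result.total_rows ? next : null;
}

const ids = response.rows.map((row) => row._id);
console.log(ids);
console.log(nextSkip(response)); // 2
```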

<h1>Fetching information about the index</h1>

Calling couchdb-lucene without arguments returns a JSON object with information about the index. For example:

<pre>
http://127.0.0.1:5984/enron/_fti
</pre>

returns:

<pre>
{"doc_count":517350,"doc_del_count":1,"disk_size":318543045}
</pre>

<h1>Working With The Source</h1>

To develop "live", type "mvn dependency:unpack-dependencies" and change the fti line in the [external] section to something like this:

<pre>
fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
/path/to/couchdb-lucene/target/dependency org.apache.couchdb.lucene.Main
</pre>

You will need to restart CouchDB whenever you change the couchdb-lucene source code, but this is very fast.

<h1>Configuration</h1>

couchdb-lucene respects the following system properties:

<dl>
<dt>couchdb.url<dd>the URL used to contact CouchDB (the default is "http://localhost:5984")
<dt>couchdb.lucene.dir<dd>the path to the Lucene indexes (the default is a directory called 'lucene' relative to CouchDB's current working directory)
</dl>

You can override these properties like this:

<pre>
fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
-cp /home/rnewson/Source/couchdb-lucene/target/classes:\
/home/rnewson/Source/couchdb-lucene/target/dependency \
org.apache.couchdb.lucene.Main
</pre>