<h1>News</h1>

I've merged the changes from the beta branch, which brings many improvements. Notably:

<ol>
<li>Indexing is now a separate process from searching and is triggered by update notifications.
<li>Rhino integration has landed; user customization of indexing is now possible.
</ol>

You are advised to delete indexes created prior to this update.

<h1>Build couchdb-lucene</h1>

<ol>
<li>Install Maven 2.
<li>Check out the repository.
<li>Type 'mvn'.
<li>Configure CouchDB (see below).
</ol>

<h1>Configure CouchDB</h1>

<pre>
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search

[update_notification]
indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
</pre>

<h1>Indexing Strategy</h1>

<h2>Document Indexing</h2>

By default, all attributes are indexed. You can customize this process by adding a design document at _design/lucene. It must supply an attribute called "transform" whose value is a function that takes a document and returns a document. For example:

<pre>
{
  "transform":"function(doc) { return doc; }"
}
</pre>

The function is evaluated by <a href="http://www.mozilla.org/rhino/">Rhino</a>. You may add, modify and remove any attributes. Additionally, returning null will exclude the document from indexing entirely.

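As a sketch of a less trivial transform, the function below skips draft documents and drops an attribute that should not be searchable. The attribute names `status` and `password` are hypothetical, and the snippet assumes the design document stores the function source as a string, as in the example above.

```javascript
// Hypothetical transform: the attribute names "status" and "password"
// are illustrative only and do not come from couchdb-lucene itself.
var transform = function (doc) {
  // Returning null excludes the document from indexing entirely.
  if (doc.status === "draft") {
    return null;
  }
  // Remove an attribute that should not be searchable.
  delete doc.password;
  return doc;
};

// The design document holds the function as a string, so serialize its source.
var designDoc = {
  _id: "_design/lucene",
  transform: transform.toString()
};

console.log(JSON.stringify(designDoc, null, 2));
```

Writing the function as real code and serializing it with `toString()` avoids hand-escaping quotes inside the JSON string.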
<h2>Attachment Indexing</h2>

couchdb-lucene uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, provided the correct content_type is set in CouchDB:

<h3>Supported Formats</h3>

<ul>
<li>Excel spreadsheets (application/vnd.ms-excel)
<li>Word documents (application/msword)
<li>PowerPoint presentations (application/vnd.ms-powerpoint)
<li>Visio (application/vnd.visio)
<li>Outlook (application/vnd.ms-outlook)
<li>XML (application/xml)
<li>HTML (text/html)
<li>Images (image/*)
<li>Java class files
<li>Java jar archives
<li>MP3 (audio/mp3)
<li>OpenDocument (application/vnd.oasis.opendocument.*)
<li>Plain text (text/plain)
<li>PDF (application/pdf)
<li>RTF (application/rtf)
</ul>

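Because the attachment type is taken from CouchDB's content_type, the attachment has to be stored with the right type in the first place. A minimal sketch, assuming Node.js and CouchDB's inline `_attachments` format; the document id, file name and text are made up for illustration:

```javascript
// Build a document with an inline attachment whose content_type is set
// explicitly. The id, file name and contents are hypothetical.
var text = "Quarterly report: revenue is up.";

var doc = {
  _id: "report-2009-q1",
  subject: "Quarterly report",
  _attachments: {
    "report.txt": {
      // Should be one of the supported types listed above.
      content_type: "text/plain",
      // CouchDB expects inline attachment data to be base64-encoded.
      data: Buffer.from(text, "utf8").toString("base64")
    }
  }
};

// This JSON would be the body of a PUT to /dbname/report-2009-q1.
console.log(JSON.stringify(doc, null, 2));
```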
<h1>Searching with couchdb-lucene</h1>

You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The following parameters can be passed for more sophisticated searches:

<dl>
<dt>q<dd>the query to run (e.g., subject:hello).
<dt>sort<dd>the comma-separated fields to sort on.
<dt>asc<dd>sort ascending (true) or descending (false); only applies when sorting on a single field.
<dt>limit<dd>the maximum number of results to return.
<dt>skip<dd>the number of results to skip.
<dt>include_docs<dd>whether to include the source docs.
<dt>debug<dd>if false, a normal application/json response with results is returned; if true, a pretty-printed HTML blob is returned instead.
</dl>

<i>All parameters except 'q' are optional.</i>

<h2>Special Fields</h2>

<dl>
<dt>_id<dd>The _id of the document.
<dt>_rev<dd>The _rev of the document.
<dt>_db<dd>The source database of the document.
<dt>_body<dd>Any text extracted from any attachment (name may change).
<dt>_author<dd>The author of any attachment (name may change).
<dt>_title<dd>The title of any attachment (name may change).
</dl>

<h2>Examples</h2>

<pre>
http://localhost:5984/dbname/_fti?q=field_name:value
http://localhost:5984/dbname/_fti?q=field_name:value&sort=other_field
http://localhost:5984/dbname/_fti?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
</pre>

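The last example URL contains spaces and brackets, which most HTTP clients need percent-encoded before the request is sent. A small sketch of building such URLs; `buildSearchUrl` is a hypothetical helper, not part of couchdb-lucene:

```javascript
// Hypothetical helper: builds an _fti URL from a database name and a map of
// search parameters (q, sort, asc, limit, skip, include_docs, debug).
function buildSearchUrl(baseUrl, dbName, params) {
  var pairs = Object.keys(params).map(function (name) {
    // encodeURIComponent escapes spaces, colons and brackets in the query.
    return encodeURIComponent(name) + "=" + encodeURIComponent(String(params[name]));
  });
  return baseUrl + "/" + encodeURIComponent(dbName) + "/_fti?" + pairs.join("&");
}

var url = buildSearchUrl("http://localhost:5984", "dbname", {
  q: "body:document AND customer:[A TO C]",
  sort: "billing_size",
  debug: true
});

console.log(url);
// http://localhost:5984/dbname/_fti?q=body%3Adocument%20AND%20customer%3A%5BA%20TO%20C%5D&sort=billing_size&debug=true
```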
<h2>Search Results Format</h2>

Here's an example of a JSON response without sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 2,
  "total_rows": 176852,
  "search_duration": 518,
  "fetch_duration": 4,
  "rows": [
    {
      "_id": "hain-m-all_documents-257.",
      "_rev": "3750319208",
      "score": 1.601625680923462
    },
    {
      "_id": "hain-m-notes_inbox-257.",
      "_rev": "2603032545",
      "score": 1.601625680923462
    }
  ]
}
</pre>

And the same with sorting:

<pre>
{
  "q": "+_db:enron +content:enron",
  "skip": 0,
  "limit": 3,
  "total_rows": 176852,
  "search_duration": 660,
  "fetch_duration": 4,
  "sort_order": [
    {
      "field": "source",
      "reverse": false,
      "type": "string"
    },
    {
      "reverse": false,
      "type": "doc"
    }
  ],
  "rows": [
    {
      "_id": "shankman-j-inbox-105.",
      "_rev": "4289412378",
      "score": 0.6131107211112976,
      "sort_order": [
        "enron",
        6
      ]
    },
    {
      "_id": "shankman-j-inbox-8.",
      "_rev": "1417542355",
      "score": 0.7492915391921997,
      "sort_order": [
        "enron",
        7
      ]
    },
    {
      "_id": "shankman-j-inbox-30.",
      "_rev": "951793815",
      "score": 0.507369875907898,
      "sort_order": [
        "enron",
        8
      ]
    }
  ]
}
</pre>

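The `skip`, `limit` and `total_rows` fields of a response are enough to page through a result set. A minimal sketch, assuming `skip` and `limit` echo the request parameters as they appear to in the responses above; `nextSkip` is a hypothetical helper:

```javascript
// Hypothetical helper: given a parsed search response, return the value of
// "skip" for the next page, or null when there are no more results.
function nextSkip(response) {
  var next = response.skip + response.limit;
  return next < response.total_rows ? next : null;
}

// Trimmed copy of the unsorted example response.
var response = {
  q: "+_db:enron +content:enron",
  skip: 0,
  limit: 2,
  total_rows: 176852,
  rows: [
    { _id: "hain-m-all_documents-257.", _rev: "3750319208", score: 1.601625680923462 },
    { _id: "hain-m-notes_inbox-257.", _rev: "2603032545", score: 1.601625680923462 }
  ]
};

console.log(nextSkip(response)); // 2
```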
<h1>Working With The Source</h1>

To develop "live", type "mvn dependency:unpack-dependencies" and change the external line to something like this:

<pre>
fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
/path/to/couchdb-lucene/target/dependency org.apache.couchdb.lucene.Main
</pre>

You will need to restart CouchDB whenever you change the couchdb-lucene source code, but this is very fast.

<h1>Configuration</h1>

couchdb-lucene respects several system properties:

<dl>
<dt>couchdb.url<dd>the URL used to contact CouchDB (the default is "http://localhost:5984").
<dt>couchdb.lucene.dir<dd>the path to the Lucene indexes (the default is a directory called 'lucene', relative to CouchDB's current working directory).
</dl>

You can override these properties like this:

<pre>
fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
-cp /home/rnewson/Source/couchdb-lucene/target/classes:\
/home/rnewson/Source/couchdb-lucene/target/dependency \
org.apache.couchdb.lucene.Main
</pre>