just redirect.

commit 1acef15cb9811aa73bb710fde3396e3a0050b416 1 parent 5584bfd
authored June 23, 2009

Showing 1 changed file with 2 additions and 518 deletions.

520  index.html
@@ -4,527 +4,11 @@
4 4
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
5 5
 <head>
6 6
 	<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
7  
-
  7
+	<meta http-equiv="refresh" content="1;url=http://github.com/rnewson/couchdb-lucene/"/>
8 8
 	<title>rnewson/couchdb-lucene @ GitHub</title>
9  
-    <style type="text/css">
10  
-      body {
11  
-      margin: 1em 5em;
12  
-      }
13  
-      h1 {
14  
-      font-size: 150%;
15  
-      }
16  
-      h2,h3 {
17  
-      font-size: 120%;
18  
-      }
19  
-    </style>	
20 9
 </head>
21 10
 
22 11
 <body>
23  
-  <a href="http://github.com/rnewson/couchdb-lucene"><img style="position: absolute; top: 0; right: 0; border: 0;" src="http://s3.amazonaws.com/github/ribbons/forkme_right_darkblue_121621.png" alt="Fork me on GitHub" /></a>
24  
-  
25  
-  <div id="container">
26  
-
27  
-    <h1><a href="http://github.com/rnewson/couchdb-lucene">couchdb-lucene</a> 
28  
-      <span class="small">by <a href="http://github.com/rnewson">rnewson</a></span></h1>
29  
-
30  
-    <div class="description">
31  
-      Enables full-text searching of CouchDB documents using Lucene
32  
-    </div>
33  
-
34  
-<h2>License</h2>
35  
-<p>Apache Software License v2</p>
36  
-<h2>Authors</h2>
37  
-<p>Robert Newson (robert.newson _at_ gmail.com)
38  
-<br/>
39  
-<br/>      </p>
40  
-<h2>Contact</h2>
41  
-<p>Robert Newson (robert.newson _at_ gmail.com)
42  
-<br/>      </p>
43  
-
44  
-    <h2>Download</h2>
45  
-    <p>
46  
-      You can download this project in either
47  
-      <a href="http://github.com/rnewson/couchdb-lucene/zipball/master">zip</a> or
48  
-      <a href="http://github.com/rnewson/couchdb-lucene/tarball/master">tar</a> formats.
49  
-    </p>
50  
-    <p>You can also clone the project with <a href="http://git-scm.com">Git</a>
51  
-      by running:
52  
-      <pre>$ git clone git://github.com/rnewson/couchdb-lucene</pre>
53  
-    </p>
54  
-      
55  
-    <div class="footer">
56  
-      get the source code on GitHub : <a href="http://github.com/rnewson/couchdb-lucene">rnewson/couchdb-lucene</a>
57  
-    </div>
58  
-
59  
-<h1>NOTES</h1>
60  
-
61  
-<p>This documentation is slightly ahead of the code; the "language" and "analyzer" options are not yet available.</p>
62  
-
63  
-<h1>News</h1>
64  
-
65  
-<p>The indexing API in 0.3 has changed since 0.2 to allow multiple design documents and "views" into Lucene. It also moves the Lucene-specific settings into an options object.</p>
66  
-
67  
-<h1>Issue Tracking</h1>
68  
-
69  
-<p>Issue tracking at <a href="http://github.com/rnewson/couchdb-lucene/issues">github</a>.</p>
70  
-
71  
-<h1>System Requirements</h1>
72  
-
73  
-<p>Sun JDK 5 or higher is recommended. </p>
74  
-
75  
-<p>couchdb-lucene is known to be incompatible with some versions of OpenJDK, which bundle an earlier, incompatible version of the Rhino JavaScript library. The OpenJDK version in Ubuntu 8.10 (6b12-0ubuntu6.4) is known to work; it uses Rhino 1.7R1.</p>
76  
-
77  
-<h1>Build couchdb-lucene</h1>
78  
-
79  
-<ol>
80  
-<li>Install Maven 2.</li>
81  
-<li>Check out the repository.</li>
82  
-<li>Run 'mvn'.</li>
83  
-<li>Configure CouchDB (see below).</li>
84  
-</ol>
85  
-
86  
-<h1>Configure CouchDB</h1>
87  
-
88  
-<pre>
89  
-[couchdb]
90  
-os_process_timeout=60000 ; increase the timeout from 5 seconds.
91  
-
92  
-[external]
93  
-fti=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -search
94  
-
95  
-[update_notification]
96  
-indexer=/usr/bin/java -jar /path/to/couchdb-lucene*-jar-with-dependencies.jar -index
97  
-
98  
-[httpd_db_handlers]
99  
-_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
100  
-</pre>
101  
-
102  
-<h1>Indexing Strategy</h1>
103  
-
104  
-<h2>Document Indexing</h2>
105  
-
106  
-<p>You must supply an index function to enable couchdb-lucene; by default, nothing is indexed.</p>
107  
-
108  
-<p>You may add any number of index views in any number of design documents. All searches will be constrained to documents emitted by the index functions.</p>
109  
-
110  
-<p>Declare your functions as follows;</p>
111  
-
112  
-<pre>
113  
-{
114  
-  "fulltext": {
115  
-    "by_subject": {
116  
-      "defaults": { "store":"yes" },
117  
-      "index":"function(doc) { var ret=new Document(); ret.add(doc.subject); return ret }"
118  
-    },
119  
-    "french_documents": {
120  
-      "defaults": { "language":"fr" },
121  
-      "index":"function(doc) { if (doc.language != \"fr\") { return null; } var ret=new Document(); <i>etc</i> return ret; }"
122  
-    }
123  
-  }
124  
-}
125  
-</pre>
126  
-
127  
-<p>A fulltext object contains multiple index view declarations. An index view consists of;</p>
128  
-
129  
-<dl>
130  
-<dt>defaults</dt><dd>The default for numerous indexing options can be overridden here. A full list of options follows.</dd>
131  
-<dt>index</dt><dd>The indexing function itself, documented below.</dd>
132  
-</dl>
133  
-<h3>The Defaults Object</h3>
134  
-
135  
-The following indexing options can be defaulted;
136  
-
137  
-<table>
138  
-  <tr>
139  
-    <th>name</th>
140  
-    <th>description</th>
141  
-    <th>available options</th>
142  
-    <th>default</th>
143  
-  </tr>
144  
-  <tr>
145  
-    <th>field</th>
146  
-    <td>the field name to index under</td>
147  
-    <td>user-defined</td>
148  
-    <td>default</td>
149  
-  </tr> 
150  
-  <tr>
151  
-    <th>store</th>
152  
-    <td>whether the data is stored. The value will be returned in the search result.</td>
153  
-    <td>yes, no</td>
154  
-    <td>no</td>
155  
-  </tr> 
156  
-  <tr>
157  
-    <th>index</th>
158  
-    <td>whether (and how) the data is indexed</td>
159  
-    <td>analyzed, analyzed_no_norms, no, not_analyzed, not_analyzed_no_norms</td>
160  
-    <td>analyzed</td>
161  
-  </tr> 
162  
-  <tr>
163  
-    <th>analyzer</th>
164  
-    <td>how the data is analyzed</td>
165  
-    <td>auto, simple, standard</td>
166  
-    <td>auto</td>
167  
-  </tr> 
168  
-  <tr>
169  
-    <th>language</th>
170  
-    <td>which language the data is in</td>
171  
-    <td>auto, br, cjk, cn, cz, de, el, en, fr, nl, ru, th</td>
172  
-    <td>en</td>
173  
-  </tr> 
174  
-</table>
175  
-
176  
-<h3>The Document class</h3>
177  
-
178  
-You may construct a new Document instance with;
179  
-
180  
-<pre>
181  
-var doc = new Document();
182  
-</pre>
183  
-
184  
-Data may be added to this document with the add method which takes an optional second object argument that can override any of the above default values.
185  
-
186  
-The data is usually interpreted as a String but couchdb-lucene provides special handling if a Javascript Date object is passed. Specifically, the date is indexed as a numeric value, which allows correct sorting, and stored (if requested) in ISO 8601 format (with a timezone marker).
187  
-
188  
-<pre>
189  
-// Add with all the defaults.
190  
-doc.add("value");
191  
-
192  
-// Add a subject field.
193  
-doc.add("this is the subject line.", {"field":"subject"});
194  
-
195  
-// Add but ensure it's stored.
196  
-doc.add("value", {"store":"yes"});
197  
-
198  
-// Add but don't analyze.
199  
-doc.add("don't analyze me", {"index":"not_analyzed"});
200  
-
201  
-// Extract text from the named attachment and index it (but not store it).
202  
-doc.attachment("attachment name", {"field":"attachments"});
203  
-</pre>
204  
-
205  
-<h3>Example Transforms</h3>
206  
-
207  
-<h4>Index Everything</h4>
208  
-
209  
-<pre>
210  
-function(doc) {
211  
-    var ret = new Document();
212  
-
213  
-    function idx(obj) {
214  
-    for (var key in obj) {
215  
-        switch (typeof obj[key]) {
216  
-        case 'object':
217  
-        idx(obj[key]);
218  
-        break;
219  
-        case 'function':
220  
-        break;
221  
-        default:
222  
-        ret.add(obj[key]);
223  
-        break;
224  
-        }
225  
-    }
226  
-    };
227  
-
228  
-    idx(doc);
229  
-
230  
-    if (doc._attachments) {
231  
-    for (var i in doc._attachments) {
232  
-        ret.attachment("attachment", i);
233  
-    }
234  
-    }
235  
-
236  
-    return ret;
237  
-}
238  
-</pre>
239  
-
240  
-<h4>Index Nothing</h4>
241  
-
242  
-<pre>
243  
-function(doc) {
244  
-  return null;
245  
-}
246  
-</pre>
247  
-
248  
-<h4>Index Select Fields</h4>
249  
-
250  
-<pre>
251  
-function(doc) {
252  
-  var result = new Document();
253  
-  result.add(doc.subject, {"field":"subject", "store":"yes"});
254  
-  result.add(doc.content, {"field":"subject"});
255  
-  result.add(new Date(), {"field":"indexed_at"});
256  
-  return result;
257  
-}
258  
-</pre>
259  
-
260  
-<h4>Index Attachments</h4>
261  
-
262  
-<pre>
263  
-function(doc) {
264  
-  var result = new Document();
265  
-  for(var a in doc._attachments) {
266  
-    result.add_attachment(a, {"field":"attachment"});
267  
-  }
268  
-  return result;
269  
-}
270  
-</pre>
271  
-
272  
-<h4>A More Complex Example</h4>
273  
-
274  
-<pre>
275  
-function(doc) {
276  
-    var mk = function(name, value, group) {
277  
-        var ret = new Document();
278  
-        ret.add(value, {"field": group, "store":"yes"});
279  
-        ret.add(group, {"field":"group", "store":"yes"});
280  
-        return ret;
281  
-    };
282  
-    var ret = [];
283  
-    if(doc.type != "reference") return null;
284  
-    for(var g in doc.groups) {
285  
-        ret.push(mk("library", doc.groups[g].library, g));
286  
-        ret.push(mk("method", doc.groups[g].method, g));
287  
-        ret.push(mk("target", doc.groups[g].target, g));
288  
-    }
289  
-    return ret;
290  
-}
291  
-</pre>
292  
-
293  
-<h2>Attachment Indexing</h2>
294  
-
295  
-Couchdb-lucene uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, assuming the correct content_type is set in couchdb;
296  
-
297  
-<h3>Supported Formats</h3>
298  
-
299  
-<ul>
300  
-<li>Excel spreadsheets (application/vnd.ms-excel)
301  
-<li>Word documents (application/msword)
302  
-<li>Powerpoint presentations (application/vnd.ms-powerpoint)
303  
-<li>Visio (application/vnd.visio)
304  
-<li>Outlook (application/vnd.ms-outlook)
305  
-<li>XML (application/xml)
306  
-<li>HTML (text/html)
307  
-<li>Images (image/*)
308  
-<li>Java class files
309  
-<li>Java jar archives
310  
-<li>MP3 (audio/mp3)
311  
-<li>OpenDocument (application/vnd.oasis.opendocument.*)
312  
-<li>Plain text (text/plain)
313  
-<li>PDF (application/pdf)
314  
-<li>RTF (application/rtf)
315  
-</ul>
316  
-
317  
-<h1>Searching with couchdb-lucene</h1>
318  
-
319  
-You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The _body field is searched by default which will include the extracted text from all attachments. The following parameters can be passed for more sophisticated searches;
320  
-
321  
-<dl>
322  
-<dt>q</dt><dd>the query to run (e.g., subject:hello). If the query does not name a field, the default field is searched.</dd>
323  
-<dt>lang</dt><dd>The language that the query parameter is in. Available options, and the default if not specified, are identical to the language option specified above.</dd>
324  
-<dt>sort</dt><dd>the comma-separated fields to sort on. Prefix with / for ascending order and \ for descending order (ascending is the default if not specified).</dd>
325  
-<dt>limit</dt><dd>the maximum number of results to return</dd>
326  
-<dt>skip</dt><dd>the number of results to skip</dd>
327  
-<dt>include_docs</dt><dd>whether to include the source docs</dd>
328  
-<dt>stale=ok</dt><dd>If you set the <i>stale</i> option to <i>ok</i>, couchdb-lucene may skip refreshing the index before searching. Searches may be faster as Lucene caches important data (especially for sorting). A query without stale=ok will use the latest data committed to the index.</dd>
329  
-<dt>debug</dt><dd>if false, a normal application/json response with results is returned. If true, a pretty-printed HTML blob is returned instead.</dd>
330  
-<dt>rewrite</dt><dd>(EXPERT) if true, returns a json response with a rewritten query and term frequencies. This allows correct distributed scoring when combining the results from multiple nodes.</dd>
331  
-</dl>
332  
-
333  
-<p><i>All parameters except 'q' are optional.</i></p>
334  
-
335  
-<h2>Special Fields</h2>
336  
-
337  
-<dl>
338  
-<dt>_db</dt><dd>The source database of the document.</dd>
339  
-<dt>_id</dt><dd>The _id of the document.</dd>
340  
-</dl>
341  
-
342  
-<h2>Dublin Core</h2>
343  
-
344  
-<p>All Dublin Core attributes are indexed and stored if detected in the attachment. Descriptions of the fields come from the Tika javadocs.</p>
345  
-
346  
-<dl>
347  
-<dt>_dc.contributor</dt><dd> An entity responsible for making contributions to the content of the resource.</dd>
348  
-<dt>_dc.coverage</dt><dd>The extent or scope of the content of the resource.</dd>
349  
-<dt>_dc.creator</dt><dd>An entity primarily responsible for making the content of the resource.</dd>
350  
-<dt>_dc.date</dt><dd>A date associated with an event in the life cycle of the resource.</dd>
351  
-<dt>_dc.description</dt><dd>An account of the content of the resource.</dd>
352  
-<dt>_dc.format</dt><dd>Typically, Format may include the media-type or dimensions of the resource.</dd>
353  
-<dt>_dc.identifier</dt><dd>Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.</dd>
354  
-<dt>_dc.language</dt><dd>A language of the intellectual content of the resource.</dd>
355  
-<dt>_dc.modified</dt><dd>Date on which the resource was changed.</dd>
356  
-<dt>_dc.publisher</dt><dd>An entity responsible for making the resource available.</dd>
357  
-<dt>_dc.relation</dt><dd>A reference to a related resource.</dd>
358  
-<dt>_dc.rights</dt><dd>Information about rights held in and over the resource.</dd>
359  
-<dt>_dc.source</dt><dd>A reference to a resource from which the present resource is derived.</dd>
360  
-<dt>_dc.subject</dt><dd>The topic of the content of the resource.</dd>
361  
-<dt>_dc.title</dt><dd>A name given to the resource.</dd>
362  
-<dt>_dc.type</dt><dd>The nature or genre of the content of the resource.</dd>
363  
-</dl>
364  
-
365  
-<h2>Examples</h2>
366  
-
367  
-<pre>
368  
-http://localhost:5984/dbname/_fti/design_doc/view_name?q=field_name:value
369  
-http://localhost:5984/dbname/_fti/design_doc/view_name?q=field_name:value&sort=other_field
370  
-http://localhost:5984/dbname/_fti/design_doc/view_name?debug=true&sort=billing_size&q=body:document AND customer:[A TO C]
371  
-</pre>
372  
-
373  
-<h2>Search Results Format</h2>
374  
-
375  
-<p>The search result contains a number of fields at the top level, in addition to your search results.</p>
376  
-
377  
-<dl>
378  
-<dt>q</dt><dd>The query that was executed.</dd>
379  
-<dt>etag</dt><dd>An opaque token that reflects the current version of the index. This value is also returned in an ETag header to facilitate HTTP caching.</dd>
380  
-<dt>skip</dt><dd>The number of initial matches that were skipped.</dd>
381  
-<dt>limit</dt><dd>The maximum number of results that can appear.</dd>
382  
-<dt>total_rows</dt><dd>The total number of matches for this query.</dd>
383  
-<dt>search_duration</dt><dd>The number of milliseconds spent performing the search.</dd>
384  
-<dt>fetch_duration</dt><dd>The number of milliseconds spent retrieving the documents.</dd>
385  
-<dt>rows</dt><dd>The search results array, described below.</dd>
386  
-</dl>
387  
-
388  
-<h2>The search results array</h2>
389  
-
390  
-<p>The search results array consists of zero or more objects with the following fields;</p>
391  
-
392  
-<dl>
393  
-<dt>id</dt><dd>The unique identifier for this match.</dd>
394  
-<dt>score</dt><dd>The normalized score (0.0-1.0, inclusive) for this match</dd>
395  
-<dt>fields</dt><dd>All the fields that were stored with this match</dd>
396  
-<dt>doc</dt><dd>The original document from couch, if requested with include_docs=true</dd>
397  
-</dl>
398  
-
399  
-<p>Here's an example of a JSON response without sorting;</p>
400  
-
401  
-<pre>
402  
-{
403  
-  "q": "+content:enron",
404  
-  "skip": 0,
405  
-  "limit": 2,
406  
-  "total_rows": 176852,
407  
-  "search_duration": 518,
408  
-  "fetch_duration": 4,
409  
-  "rows":   [
410  
-        {
411  
-      "id": "hain-m-all_documents-257.",
412  
-      "score": 1.601625680923462
413  
-    },
414  
-        {
415  
-      "id": "hain-m-notes_inbox-257.",
416  
-      "score": 1.601625680923462
417  
-    }
418  
-  ]
419  
-}
420  
-</pre>
421  
-
422  
-<p>And the same with sorting;</p>
423  
-
424  
-<pre>
425  
-{
426  
-  "q": "+content:enron",
427  
-  "skip": 0,
428  
-  "limit": 3,
429  
-  "total_rows": 176852,
430  
-  "search_duration": 660,
431  
-  "fetch_duration": 4,
432  
-  "sort_order":   [
433  
-        {
434  
-      "field": "source",
435  
-      "reverse": false,
436  
-      "type": "string"
437  
-    },
438  
-        {
439  
-      "reverse": false,
440  
-      "type": "doc"
441  
-    }
442  
-  ],
443  
-  "rows":   [
444  
-        {
445  
-      "id": "shankman-j-inbox-105.",
446  
-      "score": 0.6131107211112976,
447  
-      "sort_order":       [
448  
-        "enron",
449  
-        6
450  
-      ]
451  
-    },
452  
-        {
453  
-      "id": "shankman-j-inbox-8.",
454  
-      "score": 0.7492915391921997,
455  
-      "sort_order":       [
456  
-        "enron",
457  
-        7
458  
-      ]
459  
-    },
460  
-        {
461  
-      "id": "shankman-j-inbox-30.",
462  
-      "score": 0.507369875907898,
463  
-      "sort_order":       [
464  
-        "enron",
465  
-        8
466  
-      ]
467  
-    }
468  
-  ]
469  
-}
470  
-</pre>
471  
-
472  
-<h1>Fetching information about the index</h1>
473  
-
474  
-<p>Calling couchdb-lucene without arguments returns a JSON object with information about the <i>whole</i> index.</p>
475  
-
476  
-<pre>
477  
-http://127.0.0.1:5984/enron/_fti
478  
-</pre>
479  
-
480  
-<p>returns;</p>
481  
-
482  
-<pre>
483  
-{"doc_count":517350,"doc_del_count":1,"disk_size":318543045}
484  
-</pre>
485  
-
486  
-<h1>Working With The Source</h1>
487  
-
488  
-<p>To develop "live", type "mvn dependency:unpack-dependencies" and change the external line to something like this;</p>
489  
-
490  
-<pre>
491  
-fti=/usr/bin/java -cp /path/to/couchdb-lucene/target/classes:\
492  
-/path/to/couchdb-lucene/target/dependency com.github.rnewson.couchdb.lucene.Main
493  
-</pre>
494  
-
495  
-<p>You will need to restart CouchDB if you change couchdb-lucene source code but this is very fast.</p>
496  
-
497  
-<h1>Configuration</h1>
498  
-
499  
-<p>couchdb-lucene respects several system properties;</p>
500  
-
501  
-<dl>
502  
-<dt>couchdb.url</dt><dd>the url to contact CouchDB with (default is "http://localhost:5984")</dd>
503  
-<dt>couchdb.lucene.dir</dt><dd>specify the path to the lucene indexes (the default is to make a directory called 'lucene' relative to couchdb's current working directory).</dd>
504  
-<dt>couchdb.log.dir</dt><dd>specify the directory of the log file (which is called couchdb-lucene.log), defaults to the platform-specific temp directory.</dd>
505  
-</dl>
506  
-
507  
-<p>You can override these properties like this;</p>
508  
-
509  
-<pre>
510  
-fti=/usr/bin/java -Dcouchdb.lucene.dir=/tmp \
511  
--cp /home/rnewson/Source/couchdb-lucene/target/classes:\
512  
-/home/rnewson/Source/couchdb-lucene/target/dependency \
513  
-com.github.rnewson.couchdb.lucene.Main
514  
-</pre>
515  
-
516  
-<h2>Basic Authentication</h2>
517  
-
518  
-<p>If you put couchdb behind an authenticating proxy you can still configure couchdb-lucene to pull from it by specifying additional system properties. Currently only Basic authentication is supported.</p>
519  
-
520  
-<dl>
521  
-<dt>couchdb.user</dt><dd>the user to authenticate as.</dd>
522  
-<dt>couchdb.password</dt><dd>the password to authenticate with.</dd>
523  
-</dl>
524  
-
525  
-<h2>IPv6</h2>
526  
-
527  
-<p>The default for couchdb.url is problematic on an IPv6 system. Specify -Dcouchdb.url=http://[::1]:5984 to resolve it.</p>
528  
-  
  12
+  See <a href="http://github.com/rnewson/couchdb-lucene/">this page</a> for more details.
529 13
 </body>
530 14
 </html>
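
The search examples in the removed documentation contain spaces, colons and brackets, so a client would likely need to percent-encode them. The following is a minimal sketch of building such a URL, assuming the path layout and the q, sort and limit parameters described in the removed text. The helper buildSearchUrl and its argument names are hypothetical and are not part of couchdb-lucene.

```javascript
// Sketch only: buildSearchUrl is a hypothetical helper, not a couchdb-lucene API.
// It assumes the URL layout shown in the removed docs:
//   <base>/<db>/_fti/<design_doc>/<view_name>?q=...&sort=...
function buildSearchUrl(base, db, designDoc, view, params) {
  // Percent-encode every name and value so Lucene syntax (spaces, colons,
  // brackets, backslashes) survives inside the query string.
  const query = Object.keys(params)
    .map((name) => encodeURIComponent(name) + "=" + encodeURIComponent(String(params[name])))
    .join("&");
  // Encode each path segment separately, then join with slashes.
  const path = [db, "_fti", designDoc, view]
    .map((segment) => encodeURIComponent(segment))
    .join("/");
  // Drop any trailing slashes on the base to avoid a doubled separator.
  return base.replace(/\/+$/, "") + "/" + path + "?" + query;
}

const url = buildSearchUrl("http://localhost:5984", "dbname", "design_doc", "view_name", {
  q: "body:document AND customer:[A TO C]",
  sort: "\\billing_size", // a backslash prefix requests descending order
  limit: 10,
});
console.log(url);
```

Encoding each value separately keeps the range query intact; whether a given CouchDB version decodes every escape the same way is not confirmed by the source and should be checked against a running instance.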