use couchdb's content_type rather than auto-detect.

Commit 4a60080428527a134f77b7f62365f7245d60d80b (1 parent: 2a4e767). Robert Newson committed Feb 18, 2009.
Showing with 25 additions and 4 deletions.
  1. +24 −0 README.md
  2. +1 −4 src/main/java/org/apache/couchdb/lucene/Tika.java
@@ -19,8 +19,32 @@ _fti = {couch_httpd_external, handle_external_req, <<"fti">>}
<h1>Indexing Strategy</h1>
+<h2>Document Indexing</h2>
+
Currently all fields of all documents are indexed, javascript control coming soon.
+<h2>Attachment Indexing</h2>
+
+CouchDB uses <a href="http://lucene.apache.org/tika/">Apache Tika</a> to index attachments of the following types, assuming the correct content_type is set in couchdb;
+
+<ul>
+<li>Excel spreadsheets (application/vnd.ms-excel)
+<li>Word documents (application/msword)
+<li>Powerpoint presentations (application/vnd.ms-powerpoint)
+<li>Visio (application/vnd.visio)
+<li>Outlook (application/vnd.ms-outlook)
+<li>XML (application/xml)
+<li>HTML (text/html)
+<li>Images (image/*)
+<li>Java class files
+<li>Java jar archives
+<li>MP3 (audio/mp3)
+<li>OpenDocument (application/vnd.oasis.opendocument.*)
+<li>Plain text (text/plain)
+<li>PDF (application/pdf)
+<li>RTF (application/rtf)
+</ul>
+
<h1>Searching with couchdb-lucene</h1>
You can perform all types of queries using Lucene's default <a href="http://lucene.apache.org/java/2_4_0/queryparsersyntax.html">query syntax</a>. The following parameters can be passed for more sophisticated searches;
@@ -10,15 +10,14 @@
import org.apache.lucene.document.Document;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
-import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ParsingReader;
public final class Tika {
public void parse(final InputStream in, final String contentType, final Document doc) {
final AutoDetectParser parser = new AutoDetectParser();
final Metadata md = new Metadata();
-
+ md.set(Metadata.CONTENT_TYPE, contentType);
final Reader reader = new ParsingReader(parser, in, md);
final String body;
try {
@@ -31,8 +30,6 @@ public void parse(final InputStream in, final String contentType, final Document
return;
}
- System.err.printf("body: %s, md: %s\n", body, md);
-
doc.add(text(Config.BODY, body, false));
if (md.get(Metadata.TITLE) != null) {
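
Since this commit makes indexing depend on the content_type stored in CouchDB rather than on auto-detection, a caller might want to check that type against the README list before handing an attachment to the parser. The following is only a minimal sketch of such a check. The class `ContentTypeSupport` and its method `isIndexable` are hypothetical and are not part of couchdb-lucene or Tika. The patterns are copied from the README list in this commit, and the two Java entries (class files, jar archives) are left out because the README gives no MIME type for them.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of couchdb-lucene: checks a CouchDB
// content_type against the attachment types listed in the README.
final class ContentTypeSupport {

    // Patterns taken from the README list; a trailing '*' matches any suffix.
    private static final List<String> PATTERNS = Arrays.asList(
            "application/vnd.ms-excel",
            "application/msword",
            "application/vnd.ms-powerpoint",
            "application/vnd.visio",
            "application/vnd.ms-outlook",
            "application/xml",
            "text/html",
            "image/*",
            "audio/mp3",
            "application/vnd.oasis.opendocument.*",
            "text/plain",
            "application/pdf",
            "application/rtf");

    private ContentTypeSupport() {
    }

    static boolean isIndexable(final String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=utf-8" and normalise case.
        final int semi = contentType.indexOf(';');
        final String base = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        for (final String pattern : PATTERNS) {
            if (pattern.endsWith("*")) {
                // Prefix match for wildcard patterns such as "image/*".
                final String prefix = pattern.substring(0, pattern.length() - 1);
                if (base.startsWith(prefix)) {
                    return true;
                }
            } else if (base.equals(pattern)) {
                return true;
            }
        }
        return false;
    }

    public static void main(final String[] args) {
        System.out.println(isIndexable("application/pdf"));
        System.out.println(isIndexable("video/mp4"));
    }
}
```

A check like this would only be a pre-filter. The actual parsing still goes through `Tika.parse` with the content type set on the `Metadata`, as the diff above shows.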
