Example project to show using Tika with Lucene
Java Shell
Switch branches/tags
Nothing to show
Latest commit 7527248 Feb 24, 2012 @rickcrawford updated readme
Permalink
Failed to load latest commit information.
documents initial import Feb 24, 2012
script initial commit Feb 24, 2012
src/main/java updated changes Feb 24, 2012
.gitignore initial commit Feb 24, 2012
README.md updated readme Feb 24, 2012
pom.xml initial import Feb 24, 2012

README.md

Lucene Tika search example project

This is a sample project for loading full text of documents into lucene using Tika.

There are 2 classes, SearchIndex.java and WriteIndex.java. For this project I included a script to create a ramdisk for the index, which will disappear once you disconnect or restart. The INDEX_DIRECTORY constant has the location of the search index.

WriteIndex.java writes the index to the directory. Metadata stores the document metadata, ContentHandler stores the text content. The parser loads the file based on the content type. Tika understands how to pull the content form many different file types.

	Metadata metadata = new Metadata();
	ContentHandler handler = new BodyContentHandler();
	ParseContext context = new ParseContext();
	Parser parser = new AutoDetectParser();
	InputStream stream = new FileInputStream(file);
	try {
		parser.parse(stream, handler, metadata, context);
	}
	catch (TikaException e) {
		e.printStackTrace();
	} catch (SAXException e) {
		e.printStackTrace();
	}
	finally {
		stream.close();
	}

SearchIndex.java searches the index.

	File indexDir = new File(WriteIndex.INDEX_DIRECTORY);
	
	Directory index = FSDirectory.open(indexDir);
	
	// Build a Query object
	Query query;
	try {
		query = new QueryParser(Version.LUCENE_35, "text", new StandardAnalyzer(Version.LUCENE_35)).parse("rick crawford");
	} catch (ParseException e) {
		e.printStackTrace();
		return;
	}
	
	int hitsPerPage = 10;
	IndexReader reader = IndexReader.open(index);
	IndexSearcher searcher = new IndexSearcher(reader);
	TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
	searcher.search(query, collector);
	
	System.out.println("total hits: " + collector.getTotalHits());		
	
	ScoreDoc[] hits = collector.topDocs().scoreDocs;
	for (ScoreDoc hit : hits) {
		Document doc = reader.document(hit.doc);
		System.out.println(doc.get("file") + "  (" + hit.score + ")");
	}