
Adds lucene input / output formats and pig load / store funcs #276

Merged
merged 17 commits into from Dec 13, 2012

Conversation

isnotinvain
Contributor

This commit adds an abstract OutputFormat + StoreFunc for creating lucene indexes in HDFS and an abstract InputFormat + LoadFunc for searching them.

It also adds a tool for merging indexes in HDFS.

These base classes make it easy to index + search from an MR task or pig script.

A lot of this is code migration with some generalization and cleanup.

I've tried to do the Right Thing™ with the maven pom files but I'm new to maven so let me know if it could be better.

There are some TODOs left in the code; most are questions that I hope to get answered on this pull request and fix before merging.


walkPath(path, filter, conf, accumulateDirectories, new PathVisitor() {
  @Override
  public void visit(FileStatus fileStatus) {
Contributor

accumulateDirectories is not checked.

Contributor

oh, it is checked in walkPath

Contributor Author

I thought about removing that flag and leaving the option up to the path filter, but then the PathFilter would need to figure out whether a path is a directory, which requires a FileSystem and config.
On Nov 29, 2012 2:01 PM, "Raghu Angadi" notifications@github.com wrote:

In core/src/main/java/com/twitter/elephantbird/util/HdfsUtils.java:

    /**
     * Recursively walk a path, adding paths that are accepted by filter to accumulator
     * @param path root path to begin walking, will be added to accumulator
     * @param filter filter to determine which paths to accept
     * @param conf hadoop conf
     * @param accumulateDirectories whether or not to accumulate directories
     * @param accumulator all paths accepted will be added to accumulator
     * @throws IOException
     */
    public static void collectPaths(Path path, PathFilter filter, Configuration conf,
        boolean accumulateDirectories, final List<Path> accumulator) throws IOException {

      walkPath(path, filter, conf, accumulateDirectories, new PathVisitor() {
        @Override
        public void visit(FileStatus fileStatus) {
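For readers following along, here is a Hadoop-free sketch of the pattern under discussion. It uses java.io.File in place of Hadoop's FileSystem/Path, and the names (walkPath, collectPaths, PathVisitor) merely mirror the HdfsUtils code; the point is that the accumulateDirectories flag is checked inside the walk, so the filter never needs filesystem access:

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.List;

public class WalkSketch {
  interface PathVisitor {
    void visit(File file);
  }

  // Recursively walks 'root'; accumulateDirectories is checked here,
  // inside the walk, so the filter itself stays filesystem-agnostic.
  static void walkPath(File root, FileFilter filter,
      boolean accumulateDirectories, PathVisitor visitor) {
    boolean isDir = root.isDirectory();
    if (filter.accept(root) && (accumulateDirectories || !isDir)) {
      visitor.visit(root);
    }
    if (isDir) {
      File[] children = root.listFiles();
      if (children != null) {
        for (File child : children) {
          walkPath(child, filter, accumulateDirectories, visitor);
        }
      }
    }
  }

  // Collects every accepted path into a list via an anonymous visitor,
  // mirroring the collectPaths shown in the diff context above.
  static List<File> collectPaths(File root, FileFilter filter,
      boolean accumulateDirectories) {
    final List<File> accumulator = new ArrayList<File>();
    walkPath(root, filter, accumulateDirectories, new PathVisitor() {
      @Override
      public void visit(File file) {
        accumulator.add(file);
      }
    });
    return accumulator;
  }
}
```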

* @param conf to write to
* @throws IOException
*/
public static void writeObjectToConfig(String key, Object obj, Configuration conf) throws IOException {
Contributor

This method name and its reader should be more descriptive (e.g. writeObjectToConfAsBase64) to allow for other methods with different serialization schemes (e.g. writeObjectToConfAsJson).

Contributor Author

OK, makes sense.
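A minimal sketch of what the renamed helper might look like, assuming plain Java serialization; a Map<String, String> stands in for Hadoop's Configuration here so the example is self-contained:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;
import java.util.Map;

public class ConfSerDe {
  // Serializes obj with Java serialization and stores it under 'key'
  // as a Base64 string (the "AsBase64" suffix names the scheme).
  public static void writeObjectToConfAsBase64(String key, Serializable obj,
      Map<String, String> conf) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    out.writeObject(obj);
    out.close();
    conf.put(key, Base64.getEncoder().encodeToString(bytes.toByteArray()));
  }

  // The matching reader: decodes the Base64 string and deserializes it.
  public static Object readObjectFromConfAsBase64(String key,
      Map<String, String> conf) throws IOException, ClassNotFoundException {
    byte[] bytes = Base64.getDecoder().decode(conf.get(key));
    ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes));
    try {
      return in.readObject();
    } finally {
      in.close();
    }
  }
}
```

Pairing the scheme name into both the writer and reader leaves room for a JSON variant later without ambiguity.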

* @throws IOException
*/
public static long getDirectorySize(Path path, FileSystem fs) throws IOException {
return getDirectorySize(path, fs, PathFilters.ACCEPT_ALL_PATHS_FILTER);
Contributor

FYI: if we're counting everything, fs.getContentSummary(path) would be the most efficient.

Contributor Author

Thanks, I'll use that.
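The trade-off here: on HDFS, fs.getContentSummary(path).getLength() returns the total size of everything under a path in a single call, but only the filtered variant needs the recursive walk. A self-contained sketch of the filtered version, using java.io.File in place of Hadoop's FileSystem (names are illustrative, not the actual HdfsUtils code):

```java
import java.io.File;
import java.io.FileFilter;

public class DirSize {
  // Recursively sums the sizes of regular files under 'path' that the
  // filter accepts; directories themselves contribute 0 bytes, only
  // their contents count (the bug the commit above fixes).
  public static long getDirectorySize(File path, FileFilter filter) {
    if (path.isFile()) {
      return filter.accept(path) ? path.length() : 0L;
    }
    long size = 0L;
    File[] children = path.listFiles();
    if (children != null) {
      for (File child : children) {
        size += getDirectorySize(child, filter);
      }
    }
    return size;
  }
}
```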

@@ -0,0 +1,147 @@
package com.twitter.elephantbird.mapreduce.input;
Contributor

The package name is a bit odd. Should it be under '...elephantbird.lucene...'? 'mapreduce.input' almost always deals with MR InputFormats.

Contributor Author

We decided that since it's only used by an InputFormat, it can stay here.

…ug where directories' sizes were being counted
public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job) throws IOException {
FileOutputCommitter committer = (FileOutputCommitter) this.getOutputCommitter(job);

File tmpDirFile = Files.createTempDir();
Contributor Author

Should this be the current working directory for this task instead of a jvm temp dir?

Contributor

I was wondering the same thing. I thought you'd want to create a temp dir under the location "mapred.child.tmp" refers to, but it seems Hadoop sets java.io.tmpdir for the task, so you're OK as implemented.

http://lucene.472066.n3.nabble.com/Creating-and-working-with-temporary-file-in-a-map-function-td3893392.html

Contributor Author

Yes, and I confirmed it on our cluster so seems fine.
I had thought to use context.getWorkingDir() but that returns an HDFS path.

<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
<scope>test</scope>
Contributor

Just double checking: is this artifact commonly used only for testing?

For consistency, you may want to swap the ordering of the elements here.

Contributor Author

In general it's not a testing utility, but I'm using it in a test. We don't want to bundle it because people may not use it.
Am I missing something?
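For reference, the Maven POM schema documents the element order inside a dependency as groupId, artifactId, version, then scope; assuming the version is managed elsewhere (e.g. in dependencyManagement), the test-scoped block would conventionally read:

```xml
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-queryparser</artifactId>
  <scope>test</scope>
</dependency>
```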

rangadi added a commit that referenced this pull request Dec 13, 2012
Adds lucene input / output formats and pig load / store funcs
@rangadi rangadi merged commit be76534 into twitter:master Dec 13, 2012