Adds lucene input / output formats and pig load / store funcs #276
Conversation
    walkPath(path, filter, conf, accumulateDirectories, new PathVisitor() {
      @Override
      public void visit(FileStatus fileStatus) {
accumulateDirectories is not checked.
oh, it is checked in walkPath
I thought about removing that flag and leaving the option up to the path filter, but then the PathFilter needs to figure out whether a path is a directory, which requires a FileSystem and config.
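To make the point above concrete, here is a minimal sketch of the pattern being discussed, using java.io.File and hand-rolled PathVisitor/PathFilter interfaces as stand-ins for the Hadoop types (the real HdfsUtils operates on FileStatus via a FileSystem). The key detail is that the accumulateDirectories flag is honored inside walkPath, so the visitor passed by collectPaths never has to check it:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the HdfsUtils collectPaths/walkPath pattern under
// discussion. java.io.File stands in for Hadoop's FileSystem/FileStatus.
public class WalkSketch {
  public interface PathVisitor { void visit(File file); }
  public interface PathFilter { boolean accept(File file); }

  // Recursively walk a path; the accumulateDirectories check lives HERE,
  // not in the visitor, which is why the visitor in collectPaths is trivial.
  public static void walkPath(File path, PathFilter filter,
                              boolean accumulateDirectories, PathVisitor visitor) {
    if (filter.accept(path)) {
      // skip directories unless the caller asked to accumulate them
      if (!path.isDirectory() || accumulateDirectories) {
        visitor.visit(path);
      }
    }
    File[] children = path.listFiles();
    if (children != null) {
      for (File child : children) {
        walkPath(child, filter, accumulateDirectories, visitor);
      }
    }
  }

  // Adds every accepted path to a list by delegating to walkPath.
  public static List<File> collectPaths(File path, PathFilter filter,
                                        boolean accumulateDirectories) {
    final List<File> accumulator = new ArrayList<File>();
    walkPath(path, filter, accumulateDirectories, new PathVisitor() {
      @Override
      public void visit(File file) {
        accumulator.add(file);
      }
    });
    return accumulator;
  }
}
```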
     * @param conf to write to
     * @throws IOException
     */
    public static void writeObjectToConfig(String key, Object obj, Configuration conf) throws IOException {
This method name and its reader should be more descriptive (e.g. writeObjectToConfAsBase64) to allow for other methods with different serialization schemes (e.g. writeObjectToConfAsJson).
OK, makes sense.
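For illustration, a sketch of the suggested naming scheme, where the serialization format is part of the method name. A plain Map stands in for Hadoop's Configuration (both map strings to strings), and java.util.Base64 is used here for brevity; the actual elephant-bird code may use a different codec:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Base64;
import java.util.Map;

// Sketch: encode the serialization scheme in the method name so that
// alternatives (e.g. a hypothetical writeObjectToConfAsJson) can coexist.
public class ConfSerde {
  public static void writeObjectToConfAsBase64(String key, Object obj,
      Map<String, String> conf) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    out.writeObject(obj);                 // Java-serialize the object
    out.close();
    // store it under the key as a base64 string, the only form a
    // string-valued conf can hold
    conf.put(key, Base64.getEncoder().encodeToString(bytes.toByteArray()));
  }

  public static Object readObjectFromConfAsBase64(String key,
      Map<String, String> conf) throws IOException, ClassNotFoundException {
    byte[] bytes = Base64.getDecoder().decode(conf.get(key));
    ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes));
    try {
      return in.readObject();             // reverse of the writer above
    } finally {
      in.close();
    }
  }
}
```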
     * @throws IOException
     */
    public static long getDirectorySize(Path path, FileSystem fs) throws IOException {
      return getDirectorySize(path, fs, PathFilters.ACCEPT_ALL_PATHS_FILTER);
FYI: if we're counting everything, fs.getContentSummary(path) would be the most efficient.
Thanks, I'll use that.
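A sketch of the filtered fallback path, using java.io.File instead of Hadoop's FileSystem. When no filter is needed, Hadoop's single fs.getContentSummary(path).getLength() call replaces the whole recursion, as suggested above; note also that only regular files contribute bytes, so directory entries themselves are never counted:

```java
import java.io.File;

// Simplified sketch of getDirectorySize with a filter. In Hadoop, the
// unfiltered case is better served by fs.getContentSummary(path).getLength().
public class DirSize {
  public interface PathFilter { boolean accept(File f); }

  public static long getDirectorySize(File dir, PathFilter filter) {
    File[] children = dir.listFiles();
    if (children == null) {
      // dir is actually a plain file (or unreadable): count it if accepted
      return dir.isFile() && filter.accept(dir) ? dir.length() : 0L;
    }
    long size = 0;
    for (File child : children) {
      if (child.isDirectory()) {
        // recurse; the directory entry itself contributes no bytes
        size += getDirectorySize(child, filter);
      } else if (filter.accept(child)) {
        size += child.length();   // only accepted regular files are counted
      }
    }
    return size;
  }
}
```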
    @@ -0,0 +1,147 @@
    package com.twitter.elephantbird.mapreduce.input;
The package name is a bit odd. Should it be under '...elephantbird.lucene...'? 'mapreduce.input' almost always deals with MR InputFormats.
We decided that since it's only used by an InputFormat, it can stay here.
…ug where directories' sizes were being counted
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job) throws IOException {
      FileOutputCommitter committer = (FileOutputCommitter) this.getOutputCommitter(job);

      File tmpDirFile = Files.createTempDir();
Should this be the current working directory for this task instead of a jvm temp dir?
I was wondering the same thing. I thought you'd want to create a temp dir under the location "mapred.child.tmp" refers to, but it seems Hadoop sets java.io.tmpdir for the task so you're ok as implemented.
Yes, and I confirmed it on our cluster so seems fine.
I had thought to use context.getWorkingDir() but that returns an HDFS path.
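The point being confirmed above can be sketched as follows: directories created via the JVM temp-dir APIs (including Guava's Files.createTempDir()) land under whatever java.io.tmpdir points at, and since the Hadoop task runner repoints that property into the task's local working area, a JVM temp dir is effectively task-local scratch space. A runnable illustration using java.nio (the Guava call behaves the same way):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Demonstrates that JVM temp-dir APIs create directories under
// java.io.tmpdir, which Hadoop sets per-task (per the discussion above).
public class TmpDirDemo {
  public static File createScratchDir(String prefix) throws IOException {
    // same effect as Guava's Files.createTempDir(), but with a prefix
    File dir = Files.createTempDirectory(prefix).toFile();
    dir.deleteOnExit();
    return dir;
  }

  public static boolean isUnderJvmTmpDir(File dir) {
    // both paths derive from the same system property, so a plain
    // prefix check is sufficient for this illustration
    String tmp = new File(System.getProperty("java.io.tmpdir")).getAbsolutePath();
    return dir.getAbsolutePath().startsWith(tmp);
  }
}
```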
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-queryparser</artifactId>
      <scope>test</scope>
Just double checking: is this artifact commonly used only for testing?
For consistency you may want to swap the ordering of the <artifactId> and <scope> elements here.
In general it's not a testing utility, but I'm using it in a test. We don't want to bundle it because people may not use it.
Am I missing something?
Adds lucene input / output formats and pig load / store funcs
This commit adds an abstract OutputFormat + StoreFunc for creating Lucene indexes in HDFS, and an abstract InputFormat + LoadFunc for searching them.
It also adds a tool for merging indexes in HDFS.
These base classes make it easy to index + search from an MR task or pig script.
A lot of this is code migration with some generalization and cleanup.
I've tried to do the Right Thing™ with the Maven pom files, but I'm new to Maven, so let me know if it could be better.
There are some TODOs left in the code; most of them are questions that I hope to get answered on this pull request and fix before submitting.