
Adds lucene input / output formats and pig load / store funcs #276

Merged
merged 17 commits into from Dec 13, 2012

Conversation

isnotinvain
Contributor

This commit adds an abstract OutputFormat + StoreFunc for creating lucene indexes in HDFS and an abstract InputFormat + LoadFunc for searching them.

It also adds a tool for merging indexes in HDFS.

These base classes make it easy to index + search from an MR task or pig script.

A lot of this is code migration with some generalization and cleanup.

I've tried to do the Right Thing™ with the maven pom files but I'm new to maven so let me know if it could be better.

There are some TODOs left in the code; most are questions that I hope to get answered on this pull request and fix before merging.


walkPath(path, filter, conf, accumulateDirectories, new PathVisitor() {
  @Override
  public void visit(FileStatus fileStatus) {
Contributor

accumulateDirectories is not checked.

Contributor

oh, it is checked in walkPath

Contributor Author

I thought about removing that flag and leaving the option up to the path filter, but then the PathFilter would need to figure out whether a path is a directory, which requires a FileSystem and config.
On Nov 29, 2012 2:01 PM, "Raghu Angadi" notifications@github.com wrote:

In core/src/main/java/com/twitter/elephantbird/util/HdfsUtils.java:

    /**
     * Recursively walk a path, adding paths that are accepted by filter to accumulator
     * @param path root path to begin walking, will be added to accumulator
     * @param filter filter to determine which paths to accept
     * @param conf hadoop conf
     * @param accumulateDirectories whether or not to accumulate directories
     * @param accumulator all paths accepted will be added to accumulator
     * @throws IOException
     */
    public static void collectPaths(Path path, PathFilter filter, Configuration conf,
        boolean accumulateDirectories, final List<Path> accumulator) throws IOException {

      walkPath(path, filter, conf, accumulateDirectories, new PathVisitor() {
        @Override
        public void visit(FileStatus fileStatus) {
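For readers following along, here is a Hadoop-free sketch of the pattern under discussion. It uses java.io.File in place of Hadoop's FileSystem/Path, and the names (walkPath, collectPaths, PathVisitor) merely mirror the HdfsUtils code; the point is that the accumulateDirectories flag is checked inside the walk, so the filter never needs filesystem access:

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.List;

public class WalkSketch {
  interface PathVisitor {
    void visit(File file);
  }

  // Recursively walks 'root'; accumulateDirectories is checked here,
  // inside the walk, so the filter itself stays filesystem-agnostic.
  static void walkPath(File root, FileFilter filter,
      boolean accumulateDirectories, PathVisitor visitor) {
    boolean isDir = root.isDirectory();
    if (filter.accept(root) && (accumulateDirectories || !isDir)) {
      visitor.visit(root);
    }
    if (isDir) {
      File[] children = root.listFiles();
      if (children != null) {
        for (File child : children) {
          walkPath(child, filter, accumulateDirectories, visitor);
        }
      }
    }
  }

  // Collects every accepted path into a list via an anonymous visitor,
  // mirroring the collectPaths shown in the diff context above.
  static List<File> collectPaths(File root, FileFilter filter,
      boolean accumulateDirectories) {
    final List<File> accumulator = new ArrayList<File>();
    walkPath(root, filter, accumulateDirectories, new PathVisitor() {
      @Override
      public void visit(File file) {
        accumulator.add(file);
      }
    });
    return accumulator;
  }
}
```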

* @param conf to write to
* @throws IOException
*/
public static void writeObjectToConfig(String key, Object obj, Configuration conf) throws IOException {
Contributor

This method name and its reader should be more descriptive (e.g. writeObjectToConfAsBase64) to allow for other methods with different serialization schemes (e.g. writeObjectToConfAsJson).

Contributor Author

OK, makes sense.
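A minimal sketch of what the renamed helper might look like, assuming plain Java serialization; a Map<String, String> stands in for Hadoop's Configuration here so the example is self-contained:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;
import java.util.Map;

public class ConfSerDe {
  // Serializes obj with Java serialization and stores it under 'key'
  // as a Base64 string (the "AsBase64" suffix names the scheme).
  public static void writeObjectToConfAsBase64(String key, Serializable obj,
      Map<String, String> conf) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    out.writeObject(obj);
    out.close();
    conf.put(key, Base64.getEncoder().encodeToString(bytes.toByteArray()));
  }

  // The matching reader: decodes the Base64 string and deserializes it.
  public static Object readObjectFromConfAsBase64(String key,
      Map<String, String> conf) throws IOException, ClassNotFoundException {
    byte[] bytes = Base64.getDecoder().decode(conf.get(key));
    ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes));
    try {
      return in.readObject();
    } finally {
      in.close();
    }
  }
}
```

Pairing the scheme name into both the writer and reader leaves room for a JSON variant later without ambiguity.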

* @throws IOException
*/
public static long getDirectorySize(Path path, FileSystem fs) throws IOException {
return getDirectorySize(path, fs, PathFilters.ACCEPT_ALL_PATHS_FILTER);
Contributor

FYI: if we're counting everything, fs.getContentSummary(path) would be the most efficient.

Contributor Author

Thanks, I'll use that.
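The trade-off here: on HDFS, fs.getContentSummary(path).getLength() returns the total size of everything under a path in a single call, but only the filtered variant needs the recursive walk. A self-contained sketch of the filtered version, using java.io.File in place of Hadoop's FileSystem (names are illustrative, not the actual HdfsUtils code):

```java
import java.io.File;
import java.io.FileFilter;

public class DirSize {
  // Recursively sums the sizes of regular files under 'path' that the
  // filter accepts; directories themselves contribute 0 bytes, only
  // their contents count (the bug the commit above fixes).
  public static long getDirectorySize(File path, FileFilter filter) {
    if (path.isFile()) {
      return filter.accept(path) ? path.length() : 0L;
    }
    long size = 0L;
    File[] children = path.listFiles();
    if (children != null) {
      for (File child : children) {
        size += getDirectorySize(child, filter);
      }
    }
    return size;
  }
}
```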

@@ -0,0 +1,147 @@
package com.twitter.elephantbird.mapreduce.input;
Contributor

The package name is a bit odd. Should it be under '...elephantbird.lucene...'? 'mapreduce.input' almost always deals with MR InputFormats.

Contributor Author

We decided that since it's only used by an InputFormat, it can stay here.

…ug where directories' sizes were being counted
public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job) throws IOException {
FileOutputCommitter committer = (FileOutputCommitter) this.getOutputCommitter(job);

File tmpDirFile = Files.createTempDir();
Contributor Author

Should this be the current working directory for this task instead of a jvm temp dir?

Contributor

I was wondering the same thing. I thought you'd want to create a temp dir under the location "mapred.child.tmp" refers to, but it seems Hadoop sets java.io.tmpdir for the task, so you're OK as implemented.

http://lucene.472066.n3.nabble.com/Creating-and-working-with-temporary-file-in-a-map-function-td3893392.html

Contributor Author

Yes, and I confirmed it on our cluster so seems fine.
I had thought to use context.getWorkingDir() but that returns an HDFS path.

<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
<scope>test</scope>
Contributor

Just double checking: is this artifact commonly used only for testing?

For consistency, you may want to swap the ordering of the elements here.

Contributor Author

In general it's not a testing utility, but I'm using it in a test. We don't want to bundle it because people may not use it.
Am I missing something?
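For reference, the Maven POM schema documents the element order inside a dependency as groupId, artifactId, version, then scope; assuming the version is managed elsewhere (e.g. in dependencyManagement), the test-scoped block would conventionally read:

```xml
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-queryparser</artifactId>
  <scope>test</scope>
</dependency>
```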

rangadi added a commit that referenced this pull request Dec 13, 2012
Adds lucene input / output formats and pig load / store funcs
@rangadi rangadi merged commit be76534 into twitter:master Dec 13, 2012