Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split elephant-bird into modules #201

Closed
wants to merge 32 commits into from
Closed

Split elephant-bird into modules #201

wants to merge 32 commits into from

Conversation

johanoskarsson
Copy link
Contributor

I split the project into core, pig, hive, cascading2 and rcfile modules. This will make it easier to depend on only the parts you need without getting a metric ton of extra jar files. The rcfile module was only since it needs both hive and pig.

I switched the project to Maven in order to easier modularize it. I don't really care what is used though, this just seemed easier. Added two plugins for thrift and protobuf code gen from Maven.

This branch is almost fully baked, but there are a couple of nits. I probably won't have time to get this all the way to the finish line, but it's almost there. Hopefully someone else can help out with these.

  • testStableHashcodeAcrossJVMs in TestProtobufWritable is commented out. This was using some Ant dependent way of launching another JVM. Should be doable to make this work without Ant though.
  • Maven complains about directly using the artifacts in lib/, but most of those have been discussed in another issue (frankenjar etc). It seems to work anyway.
  • All tests pass, but I have not tested any of the artifacts with a real Hadoop job to see if it actually works

@sagemintblue
Copy link
Contributor

Raghu, the maven-release-plugin can take care of version ids in the parent and all sub modules if you go that route:

http://maven.apache.org/plugins/maven-release-plugin/prepare-mojo.html#autoVersionSubmodules

@rangadi
Copy link
Contributor

rangadi commented May 16, 2012

huge thanks Johan. will go through more thoroughly.
yeah, I think we should be able to take this further.. it might even be EB-3.0.0 (doesn't mean a big delay :).

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.twitter</groupId>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add parent info? We can configure dependency management in parent pom.xml to simplify dependency version management and other build plugins across all sub modules that way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very well versed in the Maven world. What exactly is the parent info you are referring to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think I found it. Adding now.

@traviscrawford
Copy link
Contributor

Can you update the readme with a section about building, testing and deploying to the in-git-repo repository? This is a pretty major build change and we should point users in the right direction.

<dependency>
<groupId>com.twitter</groupId>
<artifactId>elephant-bird-core</artifactId>
<version>2.2.3-SNAPSHOT</version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to set the project version in one place and have all sub projects use it? I could see these version strings getting out of date, not just for the EB release, but for stuff like protobuf. Can we reference properties or something instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can specify all dependency version info in parent pom.xml within dependencyManagement section. Also, if each sub module pom.xml contains parent section, we can leave off the sub module's groupId and version elements.

@johanoskarsson
Copy link
Contributor Author

I've updated the readme and fixed the broken test. I also added the two plugins to our maven repo so they won't disappear. Using the hadoop-lzo from the local dir until we can get that in a repo. I believe this covers all of the issues pointed out above.

<groupId>com.twitter</groupId>
<artifactId>elephant-bird</artifactId>
<version>2.2.3-SNAPSHOT</version>
</parent>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May also want element in here and in other submodule pom.xmls:

http://maven.apache.org/pom.html#Inheritance

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, not following unfortunately. What element? What does it do?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like github filtered out the html element from my earlier comment :(

I was referring to parent/relativePath element, which just points submodule to directory containing parent pom.xml if it's likely to be close by. Generally, this is just "..".

@sagemintblue
Copy link
Contributor

When I push the hcatalog dep config down into eb hive module, I get this error:

[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for com.twitter.elephantbird:elephant-bird-hive:jar:3.0.0-SNAPSHOT
[WARNING] 'dependencies.dependency.systemPath' for org.apache.hcatalog:hcatalog:jar should not point at files within the project directory, ${basedir}/../lib/hcatalog-0.5.0-dev.jar will be unresolvable by dependent projects @ line 33, column 19
[WARNING] 
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING] 
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.

Not sure if we should worry about this one, but we could alleviate it by publishing this jar into some publicly accessibly maven repo and updating repository config for EB hive module.

@dvryaboy
Copy link
Contributor

https://issues.apache.org/jira/browse/HCATALOG-132 -- looks like Travis got the pom generated, but auto-publishing isn't happening yet.

@johanoskarsson
Copy link
Contributor Author

Does the hcatalog jar contain modified code or can we simply change that to one of their released artifacts?

@johanoskarsson
Copy link
Contributor Author

If we just need one of their artifacts but it's not in a maven repo yet we can simply put it in maven.twittr.com till they do.

@dvryaboy
Copy link
Contributor

We need the 0.5 snapshot (which is a moving target, naturally). But afaik there are no custom patches. Travis is the authority on this one.

- Removes yamlbeans from dependency management as it's not used by any modules.
- Moves relative system path of hcatalog dep from parent pom to hive pom.
- Adds slf4j impl jar dep to submodules. This dep is optional by default so each submodule has to specify what slf4j impl jar they want to use.
- Forces junit47 provider within default surefire configuration. I tested parallel test execution, but some tests fail in that mode.
@sagemintblue
Copy link
Contributor

I've made a few small updates to the poms. Please see https://github.com/johanoskarsson/elephant-bird/pull/2

Question: The pig and mahout module tests produce a lot of logging output. We can add src/test/resources/log4j.properties configurations to reduce the console output to WARN / ERROR messages. Sound good, or should we just leave all output there?

@sagemintblue
Copy link
Contributor

Seems like there are a few places in core module which still use log4j api's directly, instead of slf4j. Anyone mind if I submit a pull req to migrate to slf4j completely?

@sagemintblue
Copy link
Contributor

Looks like there's also a path for deployment of 3rd-party artifacts to Sonatype-- another option for the hcatalog jar:

https://docs.sonatype.org/display/Repository/Uploading+3rd-party+Artifacts+to+The+Central+Repository

I've been working more on release process and configuration for deployment to Sonatype. Hoping to have another pull request ready later tonight.

@dvryaboy
Copy link
Contributor

Ok the latest seems to have worked for me ("seems" because I am now working through other issues. But at least EB was fetched and the compile is no longer complaining about transitive dependencies like Hadoop).

- Adds Johan to contributors section of Readme
- Adds developers section
- Adds parent section to parent pom which references Sonatype's root pom
- Removes conflicting configuration from pluginManagement section
@sagemintblue
Copy link
Contributor

Some small updates to support deployment to Sonatype OSS repos:

https://github.com/johanoskarsson/elephant-bird/pull/3

Updates parent pom.xml to include required Sonatype OSS config
@sagemintblue
Copy link
Contributor

Here's my +1 on this branch. The one remaining thing I'd wanted to test is full staging release to Sonatype repo, but this is awkward till after this branch has been merged.

I'd also like to revisit name of submodule directories, in case the maven release plugin is more forgiving of those now.. I prefer simpler names i.e. elephant-bird/elephant-bird-pig becomes elephant-bird/pig.

@rangadi
Copy link
Contributor

rangadi commented May 30, 2012

the output from tests does not go console using ant.
I prefer the simpler names for top level directories. I am not how much more maven work not use the release plugin..

@sagemintblue
Copy link
Contributor

@rangadi I had tested adding a test-specific log4j.properties configuration which pushes test logging output to a file. This is a viable solution, but we'll have to add it to path src/test/resources/ within every submodule.

In terms of release process, I'd like to try and stick with the release plugin if at all possible. The only issue with using submodule directory names which don't match the module names is that SCM paths must be explicitly defined within submodule poms. Otherwise, they default to SCM paths based on module names, and not the actual directory names.

@rangadi
Copy link
Contributor

rangadi commented May 30, 2012

explicitly defining paths does not seem that bad. it is done only once. The directory names are the ones we come across all the time.

adding log4j.properties to each directory is not bad either..

@sagemintblue
Copy link
Contributor

I'll make those two things happen and request another pull from Johan into this branch.

@rangadi
Copy link
Contributor

rangadi commented May 30, 2012

Thanks guys. I will make a branch-0.3 later today and you can even publish the jars. will merge into master in a few days..

@sagemintblue
Copy link
Contributor

branch-0.3 or branch-3.0?

@rangadi
Copy link
Contributor

rangadi commented May 30, 2012

3.0 of course, or we could merge directly the master. What do you guys think?
Only 'beta' part I see are the things like directory names etc.

sagemintblue and others added 3 commits May 30, 2012 14:53
Also updates .gitignore to reflect new module paths.
Renames all submodule directories, removing "elephant-bird-" prefix
@rangadi
Copy link
Contributor

rangadi commented May 31, 2012

thanks guys.
merged to https://github.com/kevinweil/elephant-bird/tree/branch-3.0

@rangadi rangadi closed this May 31, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants