Standardize on groupId/artifactId for Hadoop/Pig/Oozie #226

tucu00 opened this Issue Nov 24, 2010 · 4 comments



tucu00 commented Nov 24, 2010

Currently we are assuming that the JARs for Hadoop/Pig coming from Apache/Yahoo/Cloudera have different groupIds (for example, org.apache.* and com.cloudera.*).

Instead of using different groupIds, the different JAR providers (Apache, Yahoo, Cloudera, etc.) should use the same groupId and use the version to identify the JAR provider.

For example, under the proposed model the groupId for Hadoop JARs would be org.apache.hadoop, for Pig org.apache.pig, for Oozie

Then, the versions would indicate the origin if different than the original provider. For example, for Apache Hadoop a version would be 0.22.0 while for Yahoo the corresponding version would be y0.22.0.

The main reason for this standardization is to allow developers using these JARs to effectively manage exclusions. For example, today, somebody using a Pig JAR wanting to exclude the dependent Hadoop JARs must do:

dependency: ${pigGroupId}:pig:0.7.0

exclude: org.apache.hadoop:hadoop-core


NOTE: Oozie does this: the Pig groupId is parameterized, and hadoop-core must be excluded under each of the possible groupIds. Furthermore, Cloudera must add a 3rd exclusion to its POMs for com.cloudera.hadoop:hadoop-core.
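
For illustration, a sketch of how the exclusion might look in a Maven POM under the proposed model, where every provider publishes Hadoop under org.apache.hadoop. The coordinates follow the examples in this issue and the element layout is standard POM syntax, but the fragment is an assumption and is not taken from any actual Oozie POM.

```xml
<!-- Sketch only: one exclusion suffices once all providers share a groupId. -->
<dependency>
  <groupId>org.apache.pig</groupId>
  <artifactId>pig</artifactId>
  <version>0.7.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Under the current model, the same block would need one exclusion element per provider groupId.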

This does not only affect Oozie but anybody developing applications for Hadoop/Pig using Maven or Ivy.

Cloudera is in the process of normalizing all its groupIds to use the original ones.

Apache is not affected by this as they have the original groupIds for Hadoop/Pig.

Yahoo should change the groupIds for the Hadoop/Pig JARs they publish.

For Oozie we should keep

omalley commented Nov 24, 2010

It seems dangerous to have multiple orgs all pushing into the same namespace. Furthermore, I don't think most of the forges that sync to the central repository will let you submit for other groups.

tucu00 commented Nov 24, 2010

Well, it depends. The hadoop-core JARs produced by Apache, Yahoo, Cloudera, etc. are, after all, different versions of the same component.

And that is how Maven groupId:artifactId:version management is designed to work.

If central repositories don't accept a particular version, as a developer, you can always add the corresponding repository to your repository list.
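
As a rough sketch of that workaround, a POM can list an extra repository alongside the central one. The id and URL below are hypothetical placeholders, not a real vendor repository.

```xml
<!-- Sketch only: the id and URL are hypothetical placeholders. -->
<repositories>
  <repository>
    <id>vendor-hadoop-repo</id>
    <url>https://repo.example.com/maven2</url>
  </repository>
</repositories>
```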

Currently for Oozie (forget the Cloudera distribution) this is a pain because it has to deal with both Apache and Yahoo components for Hadoop and Pig, which significantly complicates the build process.

This complexity affects not only Oozie developers but Oozie users as well, as they have to add multiple exclusions for the 'same' component just because it may come from Apache or Yahoo.

Bottom line: Maven is designed to handle this efficiently via versions.

That is the reason for this request.

tucu00 commented Dec 1, 2010

Any further thoughts on this one?

cdouglas commented Dec 9, 2010

I didn't find much guidance in the Maven documentation on using versions to distinguish vendors. However, from docs on version syntax:

And a proposed (but old) extension to handle vendors, where the repo is cited as sufficient to distinguish sources:

It looks like it could be a good solution. Since the Apache project uses <major>.<minor>.<patch> version strings, other groups could use the optional <qualifier> to distinguish them, though neither y0.22.0 nor 0.22.0-y would work well with the range syntax, and I'm not sure whether the default version comparison will handle non-numeric characters in the major version.
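
To make the range concern concrete, here is a sketch of a dependency declared with Maven's range syntax. The coordinates and bounds are illustrative assumptions. Whether a vendor version such as y0.22.0 or 0.22.0-y would be ordered inside or outside this range depends on Maven's version comparison, which has not been verified here.

```xml
<!-- Sketch only: a half-open range over plain Apache-style versions. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>[0.22.0,0.23.0)</version>
</dependency>
```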

I'm unfamiliar with the forge case. Would it prevent jars from Yahoo or Cloudera from being aggregated (as the namespace wouldn't match the source), or are there other issues?
