Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-801398 Tidy up shaded jar #490

Merged
merged 4 commits into from
May 5, 2023
Merged

SNOW-801398 Tidy up shaded jar #490

merged 4 commits into from
May 5, 2023

Conversation

sfc-gh-lsembera
Copy link
Contributor

@sfc-gh-lsembera sfc-gh-lsembera commented May 3, 2023

This PR reorganizes our shaded JAR, so that the only top level packages are snowflake ones. All dependencies are shaded into the package net.snowflake.ingest.internal.

The script check_content.sh has been adapted accordingly, it now only ignores a limited set of required top-level files and directories.

Additionally, this PR removes some unnecessary transitive dependencies, particularly those pulled in by hadoop-common.

Tests executed:

  • dew tests (both Arrow and Parquet) have passed.
  • SDK integration tests have passed. The tests were extracted into a separate maven project and the SDK was added as a dependency, to make sure it tests the shaded JAR.

@sfc-gh-pbennes
Copy link

Ah nice, didn't know someone else was looking at #356. I opened a draft: #492 but given the discussion in #356 I'm more inclined to suggest removal of shading altogether.

@sfc-gh-lsembera sfc-gh-lsembera force-pushed the lsembera/shading branch 4 times, most recently from 2fc4e6e to b60d068 Compare May 4, 2023 12:49
@sfc-gh-lsembera
Copy link
Contributor Author

sfc-gh-lsembera commented May 4, 2023

@sfc-gh-pbennes I agree that unshaded JAR is a better option and maybe it should be our primary artifact (i.e. instead of having 1.1.4 and 1.1.4-unshaded, we could have 1.1.4 and 1.1.4-shaded). But I would argue against the removal of shading because it can help in situations when the client requires a specific version of a dependency, but we are shipping with a different, incompatible version. Some potential suspects would be Hadoop or Jackson.

@leszekgruchala
Copy link

@sfc-gh-pbennes That's fine, but the problem is that the shaded jars are not shaded correctly and are conflicting with same libraries in the class path #356

@codecov
Copy link

codecov bot commented May 4, 2023

Codecov Report

Merging #490 (268881b) into master (e00b2d3) will increase coverage by 0.11%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #490      +/-   ##
==========================================
+ Coverage   76.80%   76.91%   +0.11%     
==========================================
  Files          78       78              
  Lines        5229     5229              
  Branches      458      458              
==========================================
+ Hits         4016     4022       +6     
+ Misses       1021     1017       -4     
+ Partials      192      190       -2     

see 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@sfc-gh-pbennes
Copy link

@sfc-gh-pbennes I agree that unshaded JAR is a better option and maybe it should be our primary artifact (i.e. instead of having 1.1.4 and 1.1.4-unshaded, we could have 1.1.4 and 1.1.4-shaded). But I would argue against the removal of shading because it can help in situations when the client requires a specific version of a dependency, but we are shipping with a different, incompatible version. Some potential suspects would be Hadoop or Jackson.

Yeah if we need to shade one or two things that's fine but we probably don't need to shade everything.

@sfc-gh-lsembera sfc-gh-lsembera marked this pull request as ready for review May 4, 2023 16:37
@sfc-gh-lsembera sfc-gh-lsembera requested review from a team and sfc-gh-tzhang as code owners May 4, 2023 16:37
Copy link
Contributor

@sfc-gh-tjones sfc-gh-tjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with this but let's have Preston have the final say as I am not a Java wizard.

@sfc-gh-pbennes
Copy link

I'm good with this but let's have Preston have the final say as I am not a Java wizard.

The nice thing about having all the plugin checks is that as long as those pass it should provide a higher degree of confidence for merging stuff we might otherwise be unsure about. I'm gonna pull down the branch and take a quick look at the produced jar though.

Copy link

@sfc-gh-pbennes sfc-gh-pbennes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the version of check_content.sh from https://github.com/snowflakedb/snowflake-ingest-java/blob/f0137e6ef51a52a7ab90029fe8102043728a5a60/scripts/check_content.sh and noticed that unrelocated shaded snowflake-jdbc classes are present. snowflake-jdbc should either be relocated or excluded from shading:

[pbennes@SDP_DevVM-pbennes snowflake-ingest-java]$ scripts/check_content.sh | grep -v META-INF | head
[ERROR] Ingest SDK jar includes classes outside net.snowflake.ingest
Found classes in jar:
com/snowflake/client/jdbc/SnowflakeDriver.class
net/snowflake/client/core/AssertUtil.class
net/snowflake/client/core/BasicEvent$QueryState.class

@sfc-gh-japatel
Copy link
Collaborator

Ah nice, didn't know someone else was looking at #356. I opened a draft: #492 but given the discussion in #356 I'm more inclined to suggest removal of shading altogether.

Wouldnt removing shading altogether considered a behavior change? IIUC, this will force users of this jar to add more dependencies?

@sfc-gh-pbennes
Copy link

Ah nice, didn't know someone else was looking at #356. I opened a draft: #492 but given the discussion in #356 I'm more inclined to suggest removal of shading altogether.

Wouldnt removing shading altogether considered a behavior change? IIUC, this will force users of this jar to add more dependencies?

For relocated dependencies yes because there would now be an interaction for dependencies with multiple uses in a project consuming the ingest sdk. If any our first-order dependencies (ex. hadoop) require us to pull in ancient versions of commonly used dependencies (ex. protobuf-java) continuing to shade those may make sense, if we can't just outright exclude them, to avoid irreconcilable differences for people consuming snowflake-ingest-sdk.

For unrelocated shaded dependencies basically no because it's the same as if the dependency weren't shaded at all. The caveat here is that unrelocated shaded dependencies have the potential to be problematic due to the class conflict issue described in #492 where it's not going to be immediately obvious to someone consuming snowflake-ingest-sdk that they're going to get a version of those shaded unrelocated classes they may not expect depending on where snowflake-ingest-sdk appears in their pom.xml (assuming maven, IDK offhand how gradle etc. handles this problem).

Basically:

  • Anything shaded MUST be relocated
  • We should be highly selective about what gets shaded/relocated

Copy link

@sfc-gh-pbennes sfc-gh-pbennes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM with snowflake-jdbc excluded and the update to check_content.sh

@sfc-gh-lsembera
Copy link
Contributor Author

I discussed it with @sfc-gh-pbennes privately - my goal with the un-relocated snowflake-jdbc dependency was to avoid our users having it twice on classpath because the JAR is quite large (30MB). Not relocating the package would mean that the client app can use the JDBC driver directly from our jar. I just excluded snowflake-jdbc from shading altogether, which means that it will appear in the reduced pom.xml, so it will get pulled into the client application. We will get the reconciled version.

@sfc-gh-tzhang
Copy link
Contributor

@sfc-gh-pbennes I agree that unshaded JAR is a better option and maybe it should be our primary artifact (i.e. instead of having 1.1.4 and 1.1.4-unshaded, we could have 1.1.4 and 1.1.4-shaded). But I would argue against the removal of shading because it can help in situations when the client requires a specific version of a dependency, but we are shipping with a different, incompatible version. Some potential suspects would be Hadoop or Jackson.

I don't think now it's a good time to update the default behavior, this would be a big behavior change that needs to be communicated with customers carefully and we haven't heard many complains so far. I suggest we keep releasing 1.1.4 and 1.1.4-unshaded, customers that want the unshaded version could use that and then we could see the usage from internal telemetry to conclude on the direction we want to go.

Copy link
Contributor

@sfc-gh-tzhang sfc-gh-tzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @sfc-gh-lsembera !

pom.xml Outdated Show resolved Hide resolved
@sfc-gh-pbennes
Copy link

@sfc-gh-pbennes I agree that unshaded JAR is a better option and maybe it should be our primary artifact (i.e. instead of having 1.1.4 and 1.1.4-unshaded, we could have 1.1.4 and 1.1.4-shaded). But I would argue against the removal of shading because it can help in situations when the client requires a specific version of a dependency, but we are shipping with a different, incompatible version. Some potential suspects would be Hadoop or Jackson.

I don't think now it's a good time to update the default behavior, this would be a big behavior change that needs to be communicated with customers carefully and we haven't heard many complains so far. I suggest we keep releasing 1.1.4 and 1.1.4-unshaded, customers that want the unshaded version could use that and then we could see the usage from internal telemetry to conclude on the direction we want to go.

Agreed. My comments are more on the direction and long term goals and less on the immediate PR which is a step in the right direction.

This PR reorganizes our shaded JAR, so that the only top level packages
are snowflake ones. All dependencies are shaded in to the package
`net.snowflake.ingest.internal`.

The script `check_content.sh` has been adapted accordingly, it now only
ignores a limited set of top-level files.

Additionally, this PR removes some unnecessary transitive dependencies, particularly
those pulled in by hadoop-common.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants