
SNOW-818891 Remove Arrow and other unneeded dependencies #497

Merged
merged 3 commits into master from lsembera/dependency-review
May 31, 2023

Conversation

sfc-gh-lsembera
Contributor

@sfc-gh-lsembera sfc-gh-lsembera commented May 16, 2023

This PR removes Arrow and some transitive dependencies of hadoop-common: Apache Curator (a ZooKeeper client), commons-logging, and jsch (an SSH client). The JAR size shrank from ~49MB to ~41MB, and build runtime has been cut significantly.

Testing: Dew tests passed
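For readers who want to reproduce a similar cleanup, unneeded transitive dependencies of hadoop-common can be excluded in a Maven POM roughly like this (an illustrative sketch only; the exact coordinates and the changes actually made in this PR may differ):

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <exclusions>
    <!-- Apache Curator (ZooKeeper client), not needed at runtime here -->
    <exclusion>
      <groupId>org.apache.curator</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <!-- Legacy logging facade -->
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
    <!-- jsch SSH client, pulled in transitively -->
    <exclusion>
      <groupId>com.jcraft</groupId>
      <artifactId>jsch</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

After such a change, `mvn dependency:tree` can confirm the excluded artifacts no longer appear.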


codecov bot commented May 16, 2023

Codecov Report

Merging #497 (446e9ae) into master (5e54f7e) will decrease coverage by 2.13%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #497      +/-   ##
==========================================
- Coverage   76.97%   74.85%   -2.13%     
==========================================
  Files          79       77       -2     
  Lines        5273     4764     -509     
  Branches      460      419      -41     
==========================================
- Hits         4059     3566     -493     
+ Misses       1024     1017       -7     
+ Partials      190      181       -9     
Impacted Files Coverage Δ
...e/ingest/streaming/internal/AbstractRowBuffer.java 94.00% <ø> (+1.51%) ⬆️
...wflake/ingest/streaming/internal/BlobMetadata.java 100.00% <ø> (ø)
...wflake/ingest/streaming/internal/ChannelCache.java 87.50% <ø> (ø)
...owflake/ingest/streaming/internal/ChannelData.java 97.87% <ø> (ø)
...wflake/ingest/streaming/internal/FlushService.java 79.02% <ø> (+0.58%) ⬆️
...t/snowflake/ingest/streaming/internal/Flusher.java 100.00% <ø> (ø)
...ke/ingest/streaming/internal/ParquetRowBuffer.java 87.69% <ø> (ø)
...ake/ingest/streaming/internal/RegisterService.java 88.17% <ø> (ø)
...ternal/SnowflakeStreamingIngestClientInternal.java 86.53% <ø> (-2.11%) ⬇️
...ngest/streaming/internal/StreamingIngestStage.java 76.47% <ø> (ø)
... and 8 more


@sfc-gh-lsembera sfc-gh-lsembera force-pushed the lsembera/dependency-review branch 3 times, most recently from 67bbdc7 to e263fbb Compare May 24, 2023 12:36
@sfc-gh-lsembera sfc-gh-lsembera changed the title SNOW-818891 Remove unneeded dependencies SNOW-818891 Remove Arrow and other unneeded dependencies May 24, 2023
@sfc-gh-lsembera sfc-gh-lsembera force-pushed the lsembera/dependency-review branch 2 times, most recently from e5e6cce to 86cc28a Compare May 24, 2023 16:01
@sfc-gh-lsembera sfc-gh-lsembera marked this pull request as ready for review May 24, 2023 16:11
Collaborator

@sfc-gh-japatel sfc-gh-japatel left a comment


LGTM. Thanks for cleanup!

Contributor

@sfc-gh-tzhang sfc-gh-tzhang left a comment


Thanks @sfc-gh-lsembera ! This is huge! Please make sure to ask Andrey or Gloria to take a look

allocator);
}

private BufferAllocator createBufferAllocator() {
Contributor


So the memory allocation logic is inside the Parquet library? What if we want to use a different allocator per channel/table for usage tracking and better estimation?

Collaborator


So the memory allocation logic is inside the Parquet library?

Yes.

What if we want to use a different allocator per channel/table for usage tracking and better estimation?

That's a good point, we might need it.
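The per-channel allocator idea discussed above could be sketched roughly like this (hypothetical class and method names, not part of this PR or the SDK's actual API):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: tracks net bytes attributed to each channel so memory
// usage can still be estimated per channel/table even when the actual buffer
// allocation happens inside a library such as parquet-java.
public class ChannelMemoryTracker {
  private final ConcurrentMap<String, LongAdder> usage = new ConcurrentHashMap<>();

  // Record an allocation of `bytes` attributed to `channel`.
  public void allocated(String channel, long bytes) {
    usage.computeIfAbsent(channel, k -> new LongAdder()).add(bytes);
  }

  // Record a release of `bytes` previously attributed to `channel`.
  public void released(String channel, long bytes) {
    usage.computeIfAbsent(channel, k -> new LongAdder()).add(-bytes);
  }

  // Current net bytes attributed to `channel`; 0 if never seen.
  public long usedBytes(String channel) {
    LongAdder adder = usage.get(channel);
    return adder == null ? 0L : adder.sum();
  }
}
```

Hooks like `allocated`/`released` would need to be called wherever buffers are handed out, which is exactly the integration point the question raises.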

Contributor Author


If we need it, we can get it from git history, it is not a complicated code.

@sfc-gh-tzhang
Contributor

The code coverage decreases by 2.5%, could you make sure to add the tests for Parquet?

Comment on lines 69 to 70
// Unused (previously Arrow with per-column compression).
// TWO(2),
Collaborator


nit: we can remove this too.

Contributor Author


Fixed


{"Arrow", Constants.BdecVersion.ONE},
{"Parquet_w/o_optimization", Constants.BdecVersion.THREE}
});
new Object[][] {{"Parquet_w/o_optimization", Constants.BdecVersion.THREE}});
Collaborator


we can remove this parameter logic as a whole, since it is just testing parquet

Contributor Author


Fixed

Contributor

@sfc-gh-azagrebin sfc-gh-azagrebin left a comment


Thanks Lukas!

int compressedChunkLength = compressionResult.getSecond();
Pair<byte[], Integer> paddedChunk =
padChunk(chunkData, Constants.ENCRYPTION_ALGORITHM_BLOCK_SIZE_BYTES);
byte[] compressedAndPaddedChunkData = paddedChunk.getFirst();
Contributor


Suggested change
byte[] compressedAndPaddedChunkData = paddedChunk.getFirst();
byte[] paddedChunkData = paddedChunk.getFirst();

Contributor Author


Renamed

Pair<byte[], Integer> paddedChunk =
padChunk(chunkData, Constants.ENCRYPTION_ALGORITHM_BLOCK_SIZE_BYTES);
byte[] compressedAndPaddedChunkData = paddedChunk.getFirst();
int compressedChunkLength = paddedChunk.getSecond();
Contributor


Suggested change
int compressedChunkLength = paddedChunk.getSecond();
int paddedChunkLength = paddedChunk.getSecond();

Contributor Author


Renamed
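For context on the code being renamed here: padding a compressed chunk up to the cipher's block size works roughly like this (a simplified sketch under assumed names, not the SDK's exact implementation):

```java
import java.util.Arrays;

// Simplified sketch: pad `chunk` with trailing zero bytes so its length is a
// multiple of `blockSizeBytes` (e.g. 16 for AES), as block-cipher encryption
// of the chunk requires.
public class ChunkPadding {
  public static byte[] padToBlockSize(byte[] chunk, int blockSizeBytes) {
    int remainder = chunk.length % blockSizeBytes;
    if (remainder == 0) {
      return chunk; // already aligned, nothing to do
    }
    int paddedLength = chunk.length + (blockSizeBytes - remainder);
    return Arrays.copyOf(chunk, paddedLength); // zero-fills the tail
  }
}
```

This is why the reviewer's rename makes sense: after padding, the array holds the padded chunk and its padded length, not the raw compressed length.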

public enum BdecVersion {
/** Uses Arrow to generate BDEC chunks. */
ONE(1),
Contributor


I would consider commenting it out for history, just so we can see what the previous numbers were used for, like TWO(2).

Contributor Author


Commented out

@@ -1,30 +0,0 @@
JSch 0.0.* was released under the GNU LGPL license. Later, we have switched
Contributor


Is this Arrow-related? How can we tell?

Contributor Author

@sfc-gh-lsembera sfc-gh-lsembera May 31, 2023


This PR removes other dependencies, not just Arrow. This license file is removed because jsch (an SSH client pulled in by hadoop-common) is being removed.


@RunWith(Parameterized.class)
Contributor


why remove Parameterized only in this test? :)

Contributor Author


Removed from everywhere

@sfc-gh-lsembera
Contributor Author

Thanks @sfc-gh-lsembera ! This is huge! Please make sure to ask Andrey or Gloria to take a look

@sfc-gh-tzhang I reviewed the codecov report and there aren't any new holes in coverage. The overall percentage is lower only because some well-covered Arrow code was removed.

@sfc-gh-lsembera sfc-gh-lsembera merged commit 9d1ab3d into master May 31, 2023
10 of 12 checks passed
@sfc-gh-lsembera sfc-gh-lsembera deleted the lsembera/dependency-review branch May 31, 2023 11:51