Spark list hashing #11292

bdice · 2022-07-18T18:43:17Z

Closes #10378. This PR provides Spark-compliant hash values for list columns.

bdice

Self-review comments.

cpp/src/hash/spark_murmur_hash.cu

cpp/include/cudf/table/experimental/row_operators.cuh

cpp/src/hash/spark_murmur_hash.cu

ttnghia · 2022-07-26T23:31:41Z

cpp/include/cudf/detail/hashing.hpp

@@ -44,6 +44,12 @@ std::unique_ptr<column> murmur_hash3_32(
  rmm::cuda_stream_view stream        = cudf::default_stream_value,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

+std::unique_ptr<column> spark_murmur_hash3_32(


I wonder why the APIs here don't have doxygen?

They're detail APIs, which don't require docs. The public API is cudf::hash.

The build does not require detail APIs to have doxygen, but programmers would still appreciate the documentation.
For example, many detail functions are documented with @copydoc tags.
I can add some detail about this to https://github.com/rapidsai/cudf/blob/branch-22.08/cpp/docs/DOCUMENTATION.md

There will be some significant changes in this file with other planned work (#11296), so I'm going to defer this until I can do it for the whole file. I added a note to myself to improve this later for all hash functions: #10081 (comment)

ttnghia · 2022-07-26T23:35:01Z

cpp/src/hash/spark_murmur_hash.cu

+
+void check_hash_compatibility(table_view const& input)
+{
+  using column_checker_fn_t = std::function<void(column_view const&)>;


Why is a function wrapper used here? Why not just use a lambda?

Good question. I asked this on a previous PR that did something similar (I can't find the reference). The problem is that a lambda declared with `auto` cannot be called recursively, because its name is not usable inside its own body. https://stackoverflow.com/questions/2067988/recursive-lambda-functions-in-c11

You can see this in other places in libcudf:

cudf/cpp/src/table/row_operators.cu

Line 266 in e7e5f45

column_checker_fn_t check_column = [&](column_view const& c) {
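To make the point about recursion concrete, here is a minimal sketch that does not use libcudf at all. The `node` type and `count_nodes` function are hypothetical stand-ins for a nested column and a recursive column check; only the `std::function` pattern itself mirrors the snippet above.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for a nested column; this is not a libcudf type.
struct node {
  std::vector<node> children;
};

// Counts every node in the tree rooted at `root`, including `root` itself.
int count_nodes(node const& root)
{
  using visitor_fn_t = std::function<void(node const&)>;

  int count = 0;
  // Writing `auto visit = [&](node const& n) { ... visit(child); ... };`
  // does not compile: the type of `visit` is not deduced until its
  // initializer is complete, so the body cannot use it. Giving the variable
  // a concrete std::function type up front makes the recursive call legal.
  visitor_fn_t visit = [&](node const& n) {
    ++count;
    for (auto const& child : n.children) { visit(child); }
  };

  visit(root);
  assert(count >= 1);  // the root is always visited
  return count;
}
```

The type-erased wrapper adds some call overhead, which is probably acceptable for a host-side check that runs once per column rather than once per row.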

ttnghia

Great work. Handling the level of complexity of template+typename+template... in this PR is highly appreciated 😄

rwlee · 2022-07-27T01:52:45Z

cpp/include/cudf/table/experimental/row_operators.cuh

  Nullate const _check_nulls;
+  table_device_view const _table;


Just curious, is there a style guideline or other reasoning for the ordering change?

I'm guessing one of two possibilities:

1. He just chose to alphabetize.

2. The order in which members are initialized is based on the order that they are declared here, not the order that they appear in the initializer list (the part after the : in the constructor). Since check_nulls comes before t in the constructor signature, he may have reordered the initializer list to match, and then reordering this bit becomes necessary to avoid creating an asymmetry that could catch unwary developers off guard (there are subtle bugs that can come from the wrong initialization order if the constructor makes some invalid assumptions).

This is about initialization order matching member order. Compilers sometimes throw warnings about this, and it’s good practice to make the constructor argument order match the initialization order and member order when possible.
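As an illustration of why declaration order matters, here is a small sketch with a hypothetical `sized_buffer` class (the names are not from libcudf). One member is constructed from another, which is only safe because the member it depends on is declared first.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical class for illustration; the names are not from libcudf.
class sized_buffer {
 public:
  // The initializer list is written in the same order as the member
  // declarations below, which is the order the compiler actually uses.
  sized_buffer(std::size_t size, int fill) : _size(size), _data(_size, fill) {}

  std::size_t size() const
  {
    assert(_data.size() == _size);  // holds because _size was initialized first
    return _size;
  }

  // Precondition: the buffer is non-empty.
  int front() const { return _data.front(); }

 private:
  std::size_t _size;       // declared first, so initialized first
  std::vector<int> _data;  // safe to construct from _size
};
```

If the two member declarations were swapped, `_data` would be constructed from an uninitialized `_size` no matter how the initializer list is written. GCC and Clang can report a mismatch between the list order and the declaration order with `-Wreorder`.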

rwlee

Looks good, all the nits I had seen previously got cleaned up in other iterations. Really exciting nested type functionality!

I added Java tests for nested-structs, lists, and structs-of-lists for better coverage. We can expand this testing in the plugin and in the follow on for lists-of-structs.

Plugin-side testing has also shown good results with this solution, apart from the lists-of-structs case, which is currently caught and raised as an exception inside cudf C++. Because that error case is non-trivial to catch in the JNI layer, relying on the C++ exception is acceptable.

cpp/src/hash/spark_murmur_hash.cu

cpp/include/cudf/table/experimental/row_operators.cuh

vyasr

LGTM, thanks! I haven't looked at the Java test, but I assume it's an issue with not properly closing a resource.

bdice · 2022-07-28T03:19:29Z

@rwlee It appears there might be an issue with the Java tests leaking memory. I reverted those tests and have opened PR #11379 to fix this for branch-22.10. I don't think the leak is in libcudf - I see a few Java tests that manually close their column views, so I hope this is one of those cases (but I'm unsure what to do to fix it).

ai.rapids.cudf.RmmException: Could not shut down RMM there appear to be outstanding allocations

bdice · 2022-07-28T03:38:59Z

@gpucibot merge

@rwlee

This PR closes #11296. While implementing Spark list hashing in #11292, I noticed that `HASH_SERIAL_MURMUR3` does not appear to be used except in tests. It is not exposed in Python. While it is exposed in the JNI bindings, it is not used by spark-rapids. I discussed this with @rwlee and it seems that this feature was added only for parallel design with the Spark serial hash implementation in #6781, which is superseded by #11292. We do not need to keep this vestigial feature.

Authors:
- Bradley Dice (https://github.com/bdice)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- https://github.com/brandon-b-miller
- David Wendt (https://github.com/davidwendt)
- Jason Lowe (https://github.com/jlowe)

URL: #11383

This PR adds Java tests for the Spark list hashing feature added in #11292. Depends on #11292.

Authors:
- Bradley Dice (https://github.com/bdice)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Jason Lowe (https://github.com/jlowe)

URL: #11379

bdice added 4 commits June 28, 2022 14:01

Add test for Spark list-of-list hashing.

ebf079a

Improve test.

5632e7c

Merge remote-tracking branch 'upstream/branch-22.08' into spark-list-hashing

d721597

Merge remote-tracking branch 'upstream/branch-22.08' into spark-list-hashing

aee8fe0

github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jul 18, 2022

bdice self-assigned this Jul 18, 2022

bdice added the 2 - In Progress Currently a work in progress label Jul 18, 2022

bdice added this to PR-WIP in v22.08 Release via automation Jul 18, 2022

bdice added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Jul 18, 2022

bdice mentioned this pull request Jul 18, 2022

Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 #11296

Closed

bdice added 4 commits July 18, 2022 22:04

Copy experimental row hasher for modification.

6e55198

Make preprocessed table methods public.

95f25cb

Use structured binding to make names map more clearly to the objects.

ac21705

Add note about hash types, fix tests.

4106aef

github-actions bot added the CMake CMake build issue label Jul 19, 2022

bdice added 7 commits July 19, 2022 14:01

Align hashing behavior closer to Spark. Some tests passing, but not all.

ca8557d

Remove commented code, fix dispatch for decimal32 and decimal128.

703956f

Reorder members to match constructor signature.

a98c130

Update nested type hashing to act more like the previous serial hash code.

659bca9

Use seed 42 for consistency with Spark.

fcca18a

Clean up.

189f3bf

Fix bug in test (wrong index).

ee754a7

bdice commented Jul 20, 2022

cpp/src/hash/spark_murmur_hash.cu (outdated)

cpp/include/cudf/table/experimental/row_operators.cuh (outdated)

cpp/src/hash/spark_murmur_hash.cu (outdated)

bdice added 2 commits July 20, 2022 09:32

Template row_hasher on device_row_hasher class.

b163b49

Fix up friends / private methods.

387390a

bdice commented Jul 20, 2022

cpp/src/hash/spark_murmur_hash.cu (outdated)

bdice added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jul 20, 2022

bdice marked this pull request as ready for review July 20, 2022 19:08

ttnghia reviewed Jul 26, 2022

rwlee added 2 commits July 26, 2022 17:06

add list test

6475517

Merge remote-tracking branch 'pub/pull-request/11292' into bdice/list_hashing

0c0a1fb

ttnghia approved these changes Jul 27, 2022

nested struct java test

028b0bc

rwlee reviewed Jul 27, 2022

rwlee approved these changes Jul 27, 2022

robertmaynard requested changes Jul 27, 2022

cpp/src/hash/spark_murmur_hash.cu (outdated)

cpp/src/hash/spark_murmur_hash.cu (outdated)

GregoryKimball mentioned this pull request Jul 27, 2022

[FEA] Story - Supporting row operators on nested types #10186

Closed

vyasr reviewed Jul 27, 2022

cpp/include/cudf/table/experimental/row_operators.cuh (outdated)

bdice added 4 commits July 27, 2022 17:02

Remove deleted constructor.

5d05984

Require SparkMurmurHash3_32 in constructor.

9281aee

Remove deleted public constructor because the constructor is declared to be private.

fa50d2c

Template only the device_hasher method.

ef0ad3b

bdice requested review from vyasr and robertmaynard July 27, 2022 22:32

vyasr approved these changes Jul 28, 2022

robertmaynard approved these changes Jul 28, 2022

v22.08 Release automation moved this from PR-Needs review to PR-Reviewer approved Jul 28, 2022

Temporarily remove JNI tests due to an outstanding allocation bug.

aaadac2

bdice mentioned this pull request Jul 28, 2022

Add Spark list hashing Java tests #11379

Merged

3 tasks

rapids-bot bot merged commit 0891746 into rapidsai:branch-22.08 Jul 28, 2022

v22.08 Release automation moved this from PR-Reviewer approved to Done Jul 28, 2022

bdice mentioned this pull request Jul 28, 2022

Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 #11383

Merged

3 tasks

GregoryKimball mentioned this pull request Oct 3, 2022

[FEA] Add nested struct support in serial hash functions #9119

Closed
