[REVIEW] Serial murmur3 hash with configurable seed #6781

rwlee · 2020-11-17T00:47:30Z

Expand existing murmur3 hashing functionality to hash the row elements serially rather than using a merge function. Also enables configuring the hash seed and null hash value.
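As a rough illustration of what "serially" means here, the following host-side C++ sketch chains a MurmurHash3 (x86, 32-bit) step across the elements of one row, so that each element is hashed with the previous result as its seed. It is a minimal sketch under stated assumptions, not the libcudf kernel: `Cell`, `Row`, `murmur3_32`, and `serial_row_hash` are illustrative names, only 32-bit integer elements are handled, and a null is assumed to leave the running hash unchanged (consistent with the Java test in this thread, where an all-null row hashed with seed 42 is expected to yield 42).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Rotate left; r must be in 1..31 to avoid an undefined full-width shift.
inline std::uint32_t rotl32(std::uint32_t x, int r) {
  assert(r > 0 && r < 32);
  return (x << r) | (x >> (32 - r));
}

// MurmurHash3 x86_32 specialised for a single 4-byte key.
inline std::uint32_t murmur3_32(std::uint32_t key, std::uint32_t seed) {
  std::uint32_t k = key * 0xcc9e2d51u;
  k = rotl32(k, 15);
  k *= 0x1b873593u;

  std::uint32_t h = seed ^ k;
  h = rotl32(h, 13);
  h = h * 5u + 0xe6546b64u;

  h ^= 4u;  // key length in bytes
  h ^= h >> 16;
  h *= 0x85ebca6bu;
  h ^= h >> 13;
  h *= 0xc2b2ae35u;
  h ^= h >> 16;
  return h;
}

// Hypothetical nullable element: `valid == false` marks a null.
struct Cell {
  std::int32_t value;
  bool valid;
};
using Row = std::vector<Cell>;

// Serial hashing: the hash of element i becomes the seed for element i + 1.
// Assumption: a null element passes the running hash through unchanged.
inline std::uint32_t serial_row_hash(Row const& row, std::uint32_t seed) {
  std::uint32_t h = seed;
  for (Cell const& c : row) {
    if (c.valid) { h = murmur3_32(static_cast<std::uint32_t>(c.value), h); }
  }
  return h;
}
```

As a sanity check on the mixing steps, hashing the 4-byte key 0 with seed 0 yields `0x2362F9DE`, a commonly cited MurmurHash3 x86_32 test vector.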

GPUtester · 2020-11-17T00:48:02Z

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

jrhemstad · 2020-11-17T01:40:50Z

cpp/src/hash/hashing.cu

@@ -642,12 +642,14 @@ std::pair<std::unique_ptr<table>, std::vector<size_type>> hash_partition(
 std::unique_ptr<column> hash(table_view const& input,
                             hash_id hash_function,
                             std::vector<uint32_t> const& initial_hash,
+                             uint32_t seed,


Is the initial_hash not sufficient for the seed? Can it be made to be sufficient? I'd like to avoid having both initial_hash and seed as it is confusing.

Yeah, it can definitely substitute. I'll include an assertion that it should be a single value in the vector.

Oh, I see the difference. Previously, initial_hash required a value per column, whereas seed is just a single value. Hm, maybe both are okay then.

The code change is simple, so it's really a question of what is most intuitive to a user. I had originally confused initial hash with seed values -- which is why I split it off -- but I'm also worried about adding too many arguments to the hash function. I think an argument of a single seed value is generic enough to include though.
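To make the shape of the per-column parameter concrete, here is a hedged C++ sketch of a merge-style row hash in which every column is hashed on its own, optionally mixed with its own initial value, and then folded into the row result with a combiner. The Boost-style `hash_combine`, the placeholder `toy_hash`, and the exact combining order are assumptions for illustration and are not taken from the libcudf source. The point is only that `initial_hash` needs one entry per column, which is why it is not interchangeable with a single seed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Boost-style combiner, used here as a stand-in for the merge function.
inline std::uint32_t hash_combine(std::uint32_t lhs, std::uint32_t rhs) {
  return lhs ^ (rhs + 0x9e3779b9u + (lhs << 6) + (lhs >> 2));
}

// Placeholder element hash (NOT murmur3); any 32-bit mixer works for the illustration.
inline std::uint32_t toy_hash(std::uint32_t key) { return key * 0x9e3779b1u; }

// Merge style: columns are hashed independently and merged afterwards.
// `initial_hash` is either empty or holds exactly one value per column.
inline std::uint32_t merged_row_hash(std::vector<std::uint32_t> const& row,
                                     std::vector<std::uint32_t> const& initial_hash) {
  assert(initial_hash.empty() || initial_hash.size() == row.size());
  std::uint32_t result = 0;
  for (std::size_t i = 0; i < row.size(); ++i) {
    std::uint32_t h = toy_hash(row[i]);
    if (!initial_hash.empty()) { h = hash_combine(initial_hash[i], h); }
    result = hash_combine(result, h);
  }
  return result;
}
```

A single seed, by contrast, only has to start one chain, so it stays a scalar no matter how many columns the table has.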

codecov · 2020-11-17T13:51:16Z

Codecov Report

Merging #6781 (5593718) into branch-0.17 (e1e3047) will decrease coverage by 0.00%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##           branch-0.17    #6781      +/-   ##
===============================================
- Coverage        81.94%   81.94%   -0.01%     
===============================================
  Files               96       96              
  Lines            16164    16166       +2     
===============================================
+ Hits             13246    13247       +1     
- Misses            2918     2919       +1

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/cudf/cudf/core/column/datetime.py | `88.44% <0.00%> (-0.51%)` | ⬇️ |
| python/cudf/cudf/core/column/string.py | `86.64% <0.00%> (-0.18%)` | ⬇️ |
| python/cudf/cudf/core/tools/datetimes.py | `81.60% <0.00%> (-0.15%)` | ⬇️ |
| python/cudf/cudf/core/series.py | `91.29% <0.00%> (-0.08%)` | ⬇️ |
| python/cudf/cudf/core/column/timedelta.py | `89.45% <0.00%> (-0.05%)` | ⬇️ |
| python/cudf/cudf/core/dataframe.py | `90.99% <0.00%> (-0.02%)` | ⬇️ |
| python/cudf/cudf/core/column/numerical.py | `94.50% <0.00%> (ø)` | |
| python/cudf/cudf/core/frame.py | `90.06% <0.00%> (+0.12%)` | ⬆️ |
| python/cudf/cudf/core/index.py | `93.13% <0.00%> (+0.24%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Last update e1e3047...5593718.

harrism · 2020-11-18T01:58:35Z

Instead of using [WIP], please create a draft PR and set the "in progress" label. When it's ready for review, add the corresponding label, and click "ready for review". This way reviewers are not notified until you are ready.

revans2 · 2020-11-20T19:53:29Z

java/src/main/java/ai/rapids/cudf/ColumnVector.java

+   * @param columns array of columns to hash, must have identical number of rows.
+   * @return the new ColumnVector of 32 character hex strings representing each row's hash value.
+   */
+  public static ColumnVector serial32BitMurmurHash3(int seed, ColumnVector... columns) {


The input parameters need to be ColumnView instead of ColumnVector.

revans2 · 2020-11-20T19:54:33Z

java/src/main/java/ai/rapids/cudf/ColumnVector.java

+    return new ColumnVector(hash(columnViews, HashType.HASH_SERIAL_MURMUR3.getNativeId(), new int[0], seed));
+  }
+
+  public static ColumnVector serial32BitMurmurHash3(ColumnVector... columns) {


Could we get Javadocs for this? Also, just like for the function above, we need ColumnView instead of ColumnVector as the input.

revans2 · 2020-11-20T19:55:12Z

java/src/main/native/src/ColumnVectorJni.cpp

-                                                              jobject j_object,
-                                                              jlongArray column_handles,
-                                                              jint hash_function_id) {
+                                                                  jobject j_object,


nit: indentation appears to be off.

jrhemstad · 2020-11-20T20:00:30Z

cpp/src/hash/hashing.cu

+        for (int col_index = 0; col_index < device_input.num_columns(); col_index++) {
+          hash_result = cudf::type_dispatcher(
+            device_input.column(col_index).type(),
+            element_hasher_with_seed<MurmurHash3_32, true>{hash_result, hash_result},
+            device_input.column(col_index),
+            row_index);
+        }


This could be better done as a thrust::reduce (or transform_reduce) with a thrust::seq exec policy.

Restructured to use thrust::tabulate and a lambda with a thrust::reduce.
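A host-side analogue of that structure can be sketched in standard C++: an outer loop fills one output slot per row (the role `thrust::tabulate` plays), and `std::accumulate` folds over the columns of that row (the role of a sequential `thrust::reduce`). This is only an illustration under assumptions: `Column`, `toy_step`, and `hash_rows` are hypothetical names, and `toy_step` is a trivial mixer rather than murmur3. The fold has to run strictly left to right because each step feeds the next one its seed, which is why a sequential policy was suggested.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

using Column = std::vector<std::uint32_t>;

// Stand-in for the per-element hasher: mixes `key` into the running hash.
inline std::uint32_t toy_step(std::uint32_t running, std::uint32_t key) {
  return running * 31u + key;
}

// One hash per row; all columns must have the same number of rows.
inline std::vector<std::uint32_t> hash_rows(std::vector<Column> const& table,
                                            std::uint32_t seed) {
  std::size_t const num_rows = table.empty() ? 0 : table.front().size();
  for (Column const& col : table) {
    assert(col.size() == num_rows);
  }

  std::vector<std::uint32_t> out(num_rows);
  for (std::size_t row = 0; row < num_rows; ++row) {  // tabulate analogue
    // std::accumulate is a strict left fold over the columns of this row.
    out[row] = std::accumulate(
      table.begin(), table.end(), seed,
      [row](std::uint32_t running, Column const& col) { return toy_step(running, col[row]); });
  }
  return out;
}
```

With two columns `[1, 2]` and `[2, 1]` and seed 0, the rows `(1, 2)` and `(2, 1)` fold to 33 and 63 respectively, which shows that the result depends on column order.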

nvdbaranec · 2020-11-20T21:30:34Z

cpp/include/cudf/table/row_operators.cuh

+    if (has_nulls && col.is_null(row_index)) { return _null_hash; }
+
+    return hash_function<T>{_seed}(col.element<T>(row_index));
+  }


Just as a heads-up on handling nested types here. One way to add support without requiring recursive examination here would be:

1. Preprocess any nested type column into a uint32_t column of pre-hashed values.
2. Substitute that preprocessed column into the table view in place of the nested type. You'd also have to know not to invoke hash_function<> here and just return the value directly.

Depending on what it means to hash something like a List<List<Struct<int, float>, int, List>>>> etc., you could then potentially generate this preprocessed column using the standard nested type technique of processing each level of nesting as a separate chunk of GPU work, and then recursing on the CPU. A good example of this is something like:

cudf/cpp/src/structs/copying/concatenate.cu

Line 40 in 71d4c34

std::unique_ptr<column> concatenate(std::vector<column_view> const& columns,

The plausibility greatly depends on what it even means to hash data like this, of course.

Note: I'm not suggesting that this should go into this PR, just how it might work when we get to it.
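The two steps described above might look roughly like the following host-side C++ sketch. Everything in it is hypothetical: `ListColumn`, `FlatColumn`, `prehash`, `element_hash`, and the toy mixer are illustrative names, and hashing a list by chaining its elements is just one possible answer to the open question of what hashing nested data should mean.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy mixer standing in for a real hash function; NOT murmur3.
inline std::uint32_t toy_hash(std::uint32_t key, std::uint32_t seed) {
  return seed * 31u + key;
}

using ListColumn = std::vector<std::vector<std::uint32_t>>;

// Step 1: collapse each list row into a single pre-hashed uint32_t value.
inline std::vector<std::uint32_t> prehash(ListColumn const& lists) {
  std::vector<std::uint32_t> out;
  out.reserve(lists.size());
  for (auto const& list : lists) {
    std::uint32_t h = 0;
    for (std::uint32_t element : list) { h = toy_hash(element, h); }
    out.push_back(h);
  }
  return out;
}

// Step 2: a flat column that records whether its values are already hashes,
// so it can be substituted for the nested column in the table.
struct FlatColumn {
  std::vector<std::uint32_t> values;
  bool prehashed;
};

// Element hasher: pre-hashed values are returned directly, others are hashed.
inline std::uint32_t element_hash(FlatColumn const& col, std::size_t row, std::uint32_t seed) {
  assert(row < col.values.size());
  return col.prehashed ? col.values[row] : toy_hash(col.values[row], seed);
}
```

Deeper nesting would repeat step 1 once per level, from the innermost level outward, before the final substitution.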

harrism · 2020-12-02T09:19:23Z

cpp/src/hash/hashing.cu

+  auto output_view        = output->mutable_view();
+
+  if (has_nulls(input)) {
+    thrust::tabulate(rmm::exec_policy(stream)->on(stream.value()),


python/cudf/cudf/_lib/hash.pyx

Co-authored-by: GALI PREM SAGAR <sagarprem75@gmail.com>

harrism · 2020-12-03T11:10:38Z

@rapidsai/ops can you explain why the automerge didn't work here? Why is GitHub saying merging is blocked? I don't see any remaining required reviews.

ajschmidt8 · 2020-12-03T18:10:03Z

> @rapidsai/ops can you explain why the automerge didn't work here? Why is github saying merging is blocked? I don't see any remaining required reviews.

@harrism, the automerger did its job correctly here. It shouldn't merge any PRs that aren't all green.

It looks like the PR still requires a rapidsai/cudf-java-codeowners member review. It appears that @revans2 from that group has left a comment via a PR review, but didn't explicitly approve the PR yet. Once he approves the PR, this should merge assuming all other checks are passing.

razajafri · 2020-12-03T20:04:05Z

java/src/main/java/ai/rapids/cudf/ColumnVector.java

+   * @return the new ColumnVector of 32 character hex strings representing each row's hash value.
+   */
+  public static ColumnVector serial32BitMurmurHash3(ColumnView columns[]) {
+    return serial32BitMurmurHash3(0, columns);


Probably a stupid question, but why 0?

0 is the default seed that cuDF uses; it had previously been hard-coded a few layers deep with no configurability. I'm trying to retain and match cuDF's existing behavior as consistently as possible.

razajafri · 2020-12-03T20:05:47Z

java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java

+    try (ColumnVector v0 = ColumnVector.fromBoxedInts(0, 100, null, null, Integer.MIN_VALUE, null);
+         ColumnVector v1 = ColumnVector.fromBoxedInts(0, null, -100, null, null, Integer.MAX_VALUE);
+         ColumnVector result = ColumnVector.serial32BitMurmurHash3(42, new ColumnVector[]{v0, v1});
+         ColumnVector expected = ColumnVector.fromBoxedInts(59727262, 751823303, -1080202046, 42, 723455942, 133916647)) {


You might have a good reason for hard-coding these, and I am not familiar with hashing in Java, but can we generate the expected values instead of hard-coding them?

I could generate expected values, but the results should be static and never change. Also using org.apache.commons.codec.digest.MurmurHash3 would add other potential failure points (requires converting input to byte arrays, which would not necessarily be an apples to apples comparison).

razajafri · 2020-12-03T20:06:48Z

java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java

+          NEGATIVE_DOUBLE_NAN_UPPER_RANGE, NEGATIVE_DOUBLE_NAN_LOWER_RANGE,
+          Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY);
+         ColumnVector result = ColumnVector.serial32BitMurmurHash3(new ColumnVector[]{v});
+         ColumnVector expected = ColumnVector.fromBoxedInts(1669671676, 0, -544903190, -1831674681, 150502665, 474144502, 1428788237, 1428788237, 1428788237, 1428788237, 420913893, 1915664072)) {


Same comment about hard-coding the expected values.

razajafri · 2020-12-03T20:08:11Z

Others have approved the PR, and my comments are only questions and minor points, so please feel free to merge.

@rwlee

This PR closes #11296. While implementing Spark list hashing in #11292, I noticed that `HASH_SERIAL_MURMUR3` does not appear to be used except in tests. It is not exposed in Python. While it is exposed in the JNI bindings, it is not used by spark-rapids. I discussed this with @rwlee and it seems that this feature was added only for parallel design with the Spark serial hash implementation in #6781, which is superseded by #11292. We do not need to keep this vestigial feature.

Authors:
- Bradley Dice (https://github.com/bdice)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- https://github.com/brandon-b-miller
- David Wendt (https://github.com/davidwendt)
- Jason Lowe (https://github.com/jlowe)

URL: #11383

Serial murmur3 hash with configurable seed

ab25cc8

rwlee added the "libcudf" and "Spark" labels Nov 17, 2020

rwlee requested a review from a team as a code owner November 17, 2020 00:47

rwlee requested review from harrism and nvdbaranec November 17, 2020 00:47

jrhemstad reviewed Nov 17, 2020

rwlee marked this pull request as draft November 18, 2020 09:21

rwlee added the "2 - In Progress" label Nov 18, 2020

rwlee added 6 commits November 19, 2020 02:17

murmur3 testing, JNI, and kernel fix

999adcf

Merge remote-tracking branch 'pub/branch-0.17' into rwlee/sparkmurmur3

4f6b7ed

Update python API

13ed6ce

Fix md5 rebase error

da8fbb1

Fix python cudf hash function mapping

de90192

update changelog

66ca93d

rwlee marked this pull request as ready for review November 20, 2020 19:46

rwlee requested review from a team as code owners November 20, 2020 19:46

rwlee requested review from kkraus14 and brandon-b-miller November 20, 2020 19:46

rwlee added the "3 - Ready for Review" label Nov 20, 2020

rwlee changed the title ~~[WIP] Serial murmur3 hash with configurable seed~~ [REVIEW] Serial murmur3 hash with configurable seed Nov 20, 2020

revans2 reviewed Nov 20, 2020

jrhemstad reviewed Nov 20, 2020

rwlee and others added 2 commits November 20, 2020 12:00

Merge branch 'branch-0.17' into rwlee/sparkmurmur3

cbd7f3f

resolve rebase switch of stream and mr arg order

c8a724f

nvdbaranec requested changes Nov 20, 2020

Merge remote-tracking branch 'pub/branch-0.17' into rwlee/sparkmurmur3

638eda2

rwlee mentioned this pull request Nov 30, 2020

[FEA] Murmur3 that matches spark hashing for partitioning #6863

Closed

Reconfigure thrust calls

63e9eea

rwlee added the "improvement" and "non-breaking" labels and removed the "2 - In Progress" label Dec 1, 2020

nvdbaranec self-requested a review December 1, 2020 23:44

nvdbaranec approved these changes Dec 1, 2020

harrism reviewed Dec 2, 2020

harrism approved these changes Dec 2, 2020

galipremsagar requested changes Dec 2, 2020

python/cudf/cudf/_lib/hash.pyx (outdated, resolved)

Update python/cudf/cudf/_lib/hash.pyx

5593718

Co-authored-by: GALI PREM SAGAR <sagarprem75@gmail.com>

harrism requested a review from galipremsagar December 3, 2020 01:26

galipremsagar approved these changes Dec 3, 2020

harrism added the "5 - Ready to Merge" and "6 - Okay to Auto-Merge" labels and removed the "3 - Ready for Review" label Dec 3, 2020

revans2 approved these changes Dec 3, 2020

razajafri reviewed Dec 3, 2020

rapids-bot bot merged commit 73cca47 into rapidsai:branch-0.17 Dec 3, 2020

bdice mentioned this pull request Jul 18, 2022

Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 #11296

Closed

bdice mentioned this pull request Jul 28, 2022

Remove HASH_SERIAL_MURMUR3 / serial32BitMurmurHash3 #11383

Merged
