Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SHA-1 and SHA-2 hash functions. #14391

Merged
merged 21 commits into from
Jan 22, 2024
Merged

Conversation

bdice
Copy link
Contributor

@bdice bdice commented Nov 9, 2023

Description

This PR adds support for SHA-1 and SHA-2 (SHA-256, SHA-512, and truncated digests SHA-224, SHA-384) hash functions. Resolves #8641. Replaces #9215.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@bdice bdice requested review from a team as code owners November 9, 2023 22:07
@bdice bdice requested review from vyasr, mroeschke and nvdbaranec and removed request for a team November 9, 2023 22:07
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. CMake CMake build issue Java Affects Java cuDF API. labels Nov 9, 2023
@bdice bdice added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 9, 2023
Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to have some Java smoke tests for the new hash mappings, see HashTest.java for how the other hash functions are being tested.

java/src/main/java/ai/rapids/cudf/HashType.java Outdated Show resolved Hide resolved
@bdice
Copy link
Contributor Author

bdice commented Nov 13, 2023

@jlowe It looks like these tests might be in ColumnVectorTest.java. I didn't see HashTest.java. Do you want to add tests for every data type and every algorithm, to align with MD5 tests? It's pretty extensive so I want to check before doing that work (the file is already 7000+ lines).

Also note that this PR does not add list hashing support. Is that a requirement from the Spark side?

@jlowe
Copy link
Member

jlowe commented Nov 13, 2023

I didn't see HashTest.java.

Oh, sorry, my bad. HashTest.java is in NVIDIA/spark-rapids-jni which we use as a wrapper around cudf, didn't notice that file was in another project in the IDE.

Do you want to add tests for every data type and every algorithm

I don't think this is necessary, since IMO the Java test purpose is to make sure the bindings are accurate more than to test every single corner case of the underlying algorithm. Doing the latter adds a lot of redundancy with the C++ tests (unless that extensive testing isn't there or anywhere else). The Java bindings aren't doing anything specific to the type.

Regarding list support, it would be nice if it supported hashing a list of uint8, since that's how we represent Spark's BinaryType. BinaryType is the only type that Spark supports hashing with these algorithms. However we probably can cast it to a string column as a workaround, so it shouldn't be a blocker.

@jlowe
Copy link
Member

jlowe commented Nov 13, 2023

Looks like the raw hash function that takes a hash function ID parameter has not been exposed in the Java bindings. Instead there are wrappers around specific hash functions like MD5 and Murmur3. I'm fine if we want to punt filling out the Java bindings to access these new algorithms in a followup.

@bdice bdice self-assigned this Nov 14, 2023
Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend reverting the enum changes.

cpp/include/cudf/hashing.hpp Outdated Show resolved Hide resolved
rapids-bot bot pushed a commit that referenced this pull request Dec 12, 2023
… xxhash64 to cudf Python. (#14538)

This PR refactors the Python code for `IndexedFrame.hash_values` to use the newer named C++ functions from `cudf::hashing::*`. I also added bindings for xxhash64 and updated some tests.

Needed for #14391.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #14538
@bdice bdice removed the request for review from a team January 2, 2024 21:52
Copy link
Contributor

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python/cython changes look good

cpp/src/hash/sha_hash.cuh Show resolved Hide resolved
cpp/src/hash/sha_hash.cuh Outdated Show resolved Hide resolved
cpp/tests/hashing/sha384_test.cpp Outdated Show resolved Hide resolved
cpp/src/hash/sha_hash.cuh Show resolved Hide resolved
cpp/include/cudf/hashing/detail/hash_functions.cuh Outdated Show resolved Hide resolved
cpp/src/hash/sha_hash.cuh Show resolved Hide resolved
cpp/src/hash/sha512_hash.cu Outdated Show resolved Hide resolved
cpp/tests/hashing/sha1_test.cpp Show resolved Hide resolved
cpp/src/hash/sha_hash.cuh Show resolved Hide resolved
cpp/src/hash/sha_hash.cuh Show resolved Hide resolved
@bdice
Copy link
Contributor Author

bdice commented Jan 22, 2024

/merge

@rapids-bot rapids-bot bot merged commit 42d8d78 into rapidsai:branch-24.02 Jan 22, 2024
68 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[FEA] Add support for SHA256 and SHA512 to cudf::hash
6 participants