[FEA] Add support for SHA256 and SHA512 to cudf::hash #8641

Closed
MikeChenfu opened this issue Jul 1, 2021 · 8 comments · Fixed by #14391
Labels: feature request, libcudf, Python

Comments

@MikeChenfu

MikeChenfu commented Jul 1, 2021

Is your feature request related to a problem? Please describe.
Hello, I am wondering whether cuDF could support more hash functions, such as hashlib.sha256 and hashlib.sha512. Thanks for the consideration.

Describe the solution you'd like
df['sha256'] = df['h'].hash_func(method='sha256')

Describe alternatives you've considered

import hashlib
import pandas as pd
# CPU fallback: hash the string form of each value and upper-case the hex digest
df['sha256'] = df['h'].apply(lambda x: hashlib.sha256(str(x).encode()).hexdigest().upper())
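The workaround above can be sketched without a DataFrame, as a minimal, self-contained illustration. The helper name `sha256_hex_upper` and the sample values are hypothetical and only stand in for the `h` column; this runs on the CPU with `hashlib` and does not represent any cuDF API.

```python
import hashlib


def sha256_hex_upper(value):
    """Return the uppercase SHA-256 hex digest of str(value), UTF-8 encoded."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest().upper()


# Hypothetical sample values standing in for the 'h' column.
values = ["abc", 123, ""]
digests = [sha256_hex_upper(v) for v in values]

for v, d in zip(values, digests):
    print(repr(v), d)
```

Because every value goes through `str(x)` first, the digest depends on the string form of the value rather than its binary representation, so a GPU implementation that hashes raw column bytes would not necessarily produce the same output for numeric columns.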


@MikeChenfu added the Needs Triage and feature request labels on Jul 1, 2021
@beckernick
Member

Hi @MikeChenfu , thanks for the request. Does the existing Series.hash_values (murmur3) in Python work for your use case?

@beckernick added the 0 - Waiting on Author and Python labels and removed the Needs Triage label on Jul 2, 2021
@MikeChenfu
Author

Hi @beckernick, thanks for the reply. Currently, murmur3 is not an option for our use case. It would be great to have multiple choices for the hash function.

@shwina added the libcudf label and removed the 0 - Waiting on Author label on Jul 6, 2021
@shwina
Contributor

shwina commented Jul 6, 2021

Likely will need support for this at the C++ (libcudf) layer. cc: @jrhemstad @harrism

@jrhemstad
Contributor

We do have some hash functions other than murmur3 that aren't exposed to Python:

enum class hash_id {
  HASH_IDENTITY = 0,    ///< Identity hash function that simply returns the key to be hashed
  HASH_MURMUR3,         ///< Murmur3 hash function
  HASH_MD5,             ///< MD5 hash function
  HASH_SERIAL_MURMUR3,  ///< Serial Murmur3 hash function
  HASH_SPARK_MURMUR3    ///< Spark Murmur3 hash function
};

@MikeChenfu
Author

@shwina @jrhemstad Thanks for the information. Could sha256 be added to the cudf library?

@davidwendt
Contributor

Reference #6020

@shwina shwina self-assigned this Aug 26, 2021
@jrhemstad changed the title from "[FEA]Hash methods request" to "[FEA] Add support for SHA256 and SHA512 to cudf::hash" on Aug 30, 2021
@shwina shwina removed their assignment Aug 31, 2021
rapids-bot bot pushed a commit that referenced this issue Oct 12, 2021
This PR introduces a public API in cuDF for MD5 hashing, using the parameter `DataFrame.hash_columns(..., method="md5")` or `Series.hash_values(..., method="md5")`. The default hashing method is MurmurHash3 (`method="murmur3"`). I also changed the return value of `Series.hash_values` to be a `Series`, rather than a cupy array.

Related to #8641. SHA support will be added in a later PR.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - Ashwin Srinath (https://github.com/shwina)

URL: #9390
@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@bdice
Contributor

bdice commented Mar 18, 2022

Status update on this issue: I have been working on several refactors, fixes, and performance enhancements for libcudf's hashing functionality. I expect this feature to land in the 22.06 release with PR #9215.

@bdice bdice removed this from the Pandas API Alignment and Coverage milestone Mar 18, 2022
rapids-bot bot pushed a commit that referenced this issue Jan 22, 2024
This PR adds support for SHA-1 and SHA-2 (SHA-256, SHA-512, and truncated digests SHA-224, SHA-384) hash functions.  Resolves #8641. Replaces #9215.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Matthew Roeschke (https://github.com/mroeschke)
  - David Wendt (https://github.com/davidwendt)
  - https://github.com/nvdbaranec

URL: #14391
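For readers who want to compare the hash families named in that commit message, here is a small CPU-side sketch using Python's `hashlib`. It is not the libcudf implementation; the helper `hex_digest` and the `ALGORITHMS` list are hypothetical names introduced for illustration. It prints the digest width of each algorithm and shows that SHA-224 is not simply a prefix of the SHA-256 digest: the truncated variants also start from different initial hash values.

```python
import hashlib

# Algorithms named in the commit message, in hashlib's spelling.
ALGORITHMS = ["sha1", "sha224", "sha256", "sha384", "sha512"]


def hex_digest(data: bytes, method: str) -> str:
    """Return the lowercase hex digest of `data` for one of ALGORITHMS."""
    if method not in ALGORITHMS:
        raise ValueError(f"unsupported method: {method!r}")
    return hashlib.new(method, data).hexdigest()


message = b"abc"
for name in ALGORITHMS:
    digest = hex_digest(message, name)
    # Each hex character encodes 4 bits of the digest.
    print(f"{name}: {len(digest) * 4} bits  {digest}")

# SHA-224 output differs from the first 224 bits (56 hex chars) of SHA-256.
sha224_is_prefix = hex_digest(message, "sha224") == hex_digest(message, "sha256")[:56]
print("sha224 is a prefix of sha256:", sha224_is_prefix)
```

A reference like this can be handy for spot-checking string inputs against a GPU result, with the caveat that how a given column type is serialized before hashing is not covered by this sketch.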