Graph-based data hashing #8
Conversation
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Get and assign values directly from the dot notation
@samwaseda, at the meeting we discussed making the hash value independent of the labels since they are not "true information". I'd still like that before this is merged, yes?
…database is not on Conda, I copied and pasted them
If I'm reading …
…ata because it looks like it's not even necessary
Pull Request Overview
This PR introduces a graph-based data hashing system for workflow management that tracks data dependencies through their computational history rather than trying to hash potentially non-serializable intermediate data structures. The system uses NetworkX to build dependency graphs and generates SHA-256 hashes based on function metadata and input provenance.
Key changes:
- Implements graph-based workflow representation using NetworkX
- Adds data hashing functionality that tracks computational provenance
- Provides utility functions for nested dictionary operations
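The provenance-based hashing the overview describes can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper name `node_hash`, the edge attribute `ports`, and the exact hash payload are all assumptions; only the general mechanism (SHA-256 over function metadata plus upstream hashes, with NetworkX tracking the dependencies) comes from the PR description.

```python
import hashlib
import json

import networkx as nx


def node_hash(graph: nx.DiGraph, node: str, cache: dict) -> str:
    """Hash a node from its function metadata plus the hashes of its
    upstream nodes, never from the (possibly unhashable) data itself."""
    if node in cache:
        return cache[node]
    meta = graph.nodes[node]
    # Connected inputs are replaced by "<upstream hash>@<output port>",
    # so the payload depends only on provenance and on the user-supplied,
    # serializable inputs.
    connected = {
        in_port: f"{node_hash(graph, src, cache)}@{out_port}"
        for src, _, (out_port, in_port) in graph.in_edges(node, data="ports")
    }
    payload = {
        "module": meta["module"],
        "qualname": meta["qualname"],
        "inputs": {**meta.get("inputs", {}), **connected},
    }
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
    cache[node] = digest
    return digest


# Example wiring: add_0 = add(x=10, y=20); multiply_0 = multiply(x=add_0.output, y=20)
g = nx.DiGraph()
g.add_node("add_0", module="__main__", qualname="add", inputs={"x": 10, "y": 20})
g.add_node("multiply_0", module="__main__", qualname="multiply", inputs={"y": 20})
g.add_edge("add_0", "multiply_0", ports=("output", "x"))
print(node_hash(g, "multiply_0", {}))  # 64-char SHA-256 hex digest
```

Changing any human input upstream (e.g. `add_0`'s `x`) changes `add_0`'s digest and therefore, through the `"<hash>@output"` reference, `multiply_0`'s digest as well, without any intermediate data ever being hashed.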
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| flowrep/workflow.py | Core implementation of graph-based hashing with functions for workflow graph creation, node hashing, and utility operations |
| tests/unit/test_workflow.py | Comprehensive unit tests covering the new hashing functionality and edge cases |
Comments suppressed due to low confidence (1)
flowrep/workflow.py:1390
- [nitpick] The parameter name 'cls' suggests a class, but this function accepts any callable. Consider renaming to 'func' or 'callable_obj' for clarity.
def _get_function_metadata(cls: Callable) -> tuple[str, str, str]:
You might remember the discussion towards the end of last year about history-based hashing, which is based on the assumption that the human inputs are always serializable. Let's take the following example:
If we try to hash the data independently, we face the problem that `structure` might not be hashable. So instead of trying to do so for each node, we keep track of where `structure` came from, in this case from `create_structure`, whose input `Al` is hashable.

Now, I cannot remember where I saw an actual implementation (either Jörg's or Sebastian's), which I think was pretty neat, but I made another prototype here, because this new algorithm is fairly modular. What does that mean? Firstly, it uses NetworkX in the background to trace the history of the ports. Secondly, it runs on the `flowrep` dict, meaning in theory it's going to be applicable to any of the workflow managers in the future.

MWE:
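The MWE code itself didn't survive this page capture; judging from the output that follows, it was presumably something along these lines. The function bodies and node names are inferred, and the `flowrep` call that produces the workflow dict is omitted here because its API isn't shown on this page.

```python
def add(x, y):
    return x + y


def multiply(x, y):
    return x * y


# Inferred wiring (names are assumptions based on the output below):
#   add_0:      add(x=10, y=20)
#   multiply_0: multiply(x=add_0.output, y=20)
```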
Output:
```
{
    "nodes": {
        "module": "__main__",
        "qualname": "add",
        "version": "not_defined",
        "connected_inputs": [],
    },
    "inputs": {"x": 10, "y": 20},
    "outputs": ["output"],
}
```

For `multiply_0` it looks like this:

```
{
    "nodes": {
        "module": "__main__",
        "qualname": "multiply",
        "version": "not_defined",
        "connected_inputs": ["x"],
    },
    "inputs": {"x": "097c4e61c3d890eb4e2c6050f8d02d277ec1d9cd66f0616c16d7cb57a06ff18f@output", "y": 20},
    "outputs": ["output"],
}
```