Conversation

@samwaseda
Member

@samwaseda samwaseda commented Jul 25, 2025

You might remember the discussion towards the end of last year about history-based hashing, which rests on the assumption that the human inputs are always serializable. Let's take the following example:

def create_structure(element):
    ...
    return structure

def create_vacancy(structure):
    ...
    return vacancy_structure

structure = create_structure(element="Al")
vacancy_structure = create_vacancy(structure=structure)

If we try to hash the data of each node independently, we face the problem that structure might not be hashable. So instead of trying to do so for each node, we keep track of where structure came from, in this case from create_structure, whose input "Al" is hashable.
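
A minimal sketch of that idea, under stated assumptions: `Structure` is a hypothetical stand-in for a non-serializable object, `history_hash` is an illustrative helper and not part of flowrep, and JSON with sorted keys is an assumed serialization. Hashing the structure directly fails, whereas hashing the call that produced it only needs the function's name and the input "Al".

```python
import hashlib
import json


class Structure:
    """Hypothetical stand-in for an object that JSON cannot serialize."""


def create_structure(element):
    return Structure()


def create_vacancy(structure):
    return Structure()


def history_hash(func, **inputs):
    """Hash a call from the function's name and its inputs, not from its output."""
    payload = {"function": func.__qualname__, "inputs": inputs}
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


structure = create_structure(element="Al")

# Hashing the data itself fails because the object is not serializable.
try:
    json.dumps(structure)
    direct_hash_works = True
except TypeError:
    direct_hash_works = False

# Hashing the history works: the upstream hash stands in for the structure.
structure_hash = history_hash(create_structure, element="Al")
vacancy_hash = history_hash(create_vacancy, structure=structure_hash)
```

The vacancy hash then changes whenever the element changes, even though no structure object is ever serialized.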

I cannot remember where I saw an actual implementation (either Jörg's or Sebastian's), which I thought was pretty neat, but I made another prototype here because this new algorithm is fairly modular. What does that mean? Firstly, it uses NetworkX in the background to trace the history of the ports. Secondly, it runs on the flowrep dict, so in theory it should be applicable to any of the workflow managers in the future.
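
To illustrate what tracing the history means, here is a rough sketch that walks the edges of a small dependency graph backwards. It uses a plain dictionary as a stand-in for the NetworkX graph so that it runs without extra dependencies (on a `networkx.DiGraph`, `networkx.ancestors` returns the same kind of set). The edge list and node names are assumptions for illustration, not the actual flowrep graph layout.

```python
from collections import defaultdict

# Hypothetical edges (source, target) for a workflow that adds a and b,
# then multiplies the result by b.
edges = [
    ("a", "add_0"),
    ("b", "add_0"),
    ("add_0", "multiply_0"),
    ("b", "multiply_0"),
]

# Map every node to the set of nodes that feed directly into it.
predecessors = defaultdict(set)
for source, target in edges:
    predecessors[target].add(source)


def history(node):
    """Return every upstream node that `node` depends on, directly or indirectly."""
    seen = set()
    stack = [node]
    while stack:
        for parent in predecessors[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(history("multiply_0")))  # ['a', 'add_0', 'b']
```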

MWE:

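from flowrep import workflow as fwf  # assumed import path (the PR adds these functions to flowrep/workflow.py)

def add(x, y):
    return x + y

def multiply(x, y):
    return x * y

def workflow_with_data(a=10, b=20):
    x = add(a, b)
    y = multiply(x, b)
    return x, y

workflow_dict = fwf.get_workflow_dict(workflow_with_data)
graph = fwf.get_workflow_graph(workflow_dict)
data_dict = fwf.get_hashed_node_dict("add_0", graph, workflow_dict["nodes"])
print(data_dict)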

Output:

{
    "nodes": {
        "module": "__main__",
        "qualname": "add",
        "version": "not_defined",
        "connected_inputs": [],
    },
    "inputs": {"x": 10, "y": 20},
    "outputs": ["output"],
}

For multiply_0 it looks like this:

{
    "nodes": {
        "module": "__main__",
        "qualname": "multiply",
        "version": "not_defined",
        "connected_inputs": ["x"],
    },
    "inputs": {"x": "097c4e61c3d890eb4e2c6050f8d02d277ec1d9cd66f0616c16d7cb57a06ff18f@output", "y": 20},
    "outputs": ["output"],
}
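
The chaining between the two dictionaries can be sketched as follows. This is an illustration under assumptions: `digest` and `make_node_dict` are hypothetical helpers, and the serialization (JSON with sorted keys) is a guess, so the digest produced here is not expected to match the value shown above. What it shows is that the connected input `x` of multiply is replaced by the upstream hash plus the output port name.

```python
import hashlib
import json


def digest(node_dict):
    """SHA-256 over a canonical JSON dump (sorted keys) of a node dictionary."""
    encoded = json.dumps(node_dict, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def make_node_dict(qualname, inputs, connected_inputs):
    """Build a dictionary with the same shape as the output shown above."""
    return {
        "nodes": {
            "module": "__main__",
            "qualname": qualname,
            "version": "not_defined",
            "connected_inputs": connected_inputs,
        },
        "inputs": inputs,
        "outputs": ["output"],
    }


add_node = make_node_dict("add", {"x": 10, "y": 20}, [])

# The input fed by add is referenced as "<hash of add node>@<output port>".
upstream = f"{digest(add_node)}@output"
multiply_node = make_node_dict("multiply", {"x": upstream, "y": 20}, ["x"])

print(multiply_node["inputs"]["x"])
```

Because the upstream digest is embedded in the downstream inputs, changing `a` from 10 to any other value changes the hash of both nodes.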

@github-actions

Binder 👈 Launch a binder notebook on branch pyiron/flowrep/data

@samwaseda samwaseda requested review from Copilot and liamhuber July 25, 2025 14:10

This comment was marked as outdated.

@liamhuber
Member

@samwaseda, at the meeting we discussed making the hash value independent of the labels since they are not "true information". I'd still like that before this is merged, yes?

@samwaseda samwaseda marked this pull request as draft July 30, 2025 14:31
@samwaseda samwaseda marked this pull request as ready for review July 31, 2025 20:05
@samwaseda samwaseda requested review from liamhuber and removed request for liamhuber July 31, 2025 20:05
@samwaseda
Member Author

If I'm reading store_node_in_database from pyiron_database correctly, the functions that I just wrote can be straightforwardly used there, meaning we can get rid of the dependence on pyiron_workflow.

@samwaseda samwaseda requested a review from Copilot July 31, 2025 20:32
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces a graph-based data hashing system for workflow management that tracks data dependencies through their computational history rather than trying to hash potentially non-serializable intermediate data structures. The system uses NetworkX to build dependency graphs and generates SHA-256 hashes based on function metadata and input provenance.

Key changes:

  • Implements graph-based workflow representation using NetworkX
  • Adds data hashing functionality that tracks computational provenance
  • Provides utility functions for nested dictionary operations

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File | Description
flowrep/workflow.py | Core implementation of graph-based hashing with functions for workflow graph creation, node hashing, and utility operations
tests/unit/test_workflow.py | Comprehensive unit tests covering the new hashing functionality and edge cases

Comments suppressed due to low confidence (1)

flowrep/workflow.py:1390

  • [nitpick] The parameter name 'cls' suggests a class, but this function accepts any callable. Consider renaming to 'func' or 'callable_obj' for clarity.
def _get_function_metadata(cls: Callable) -> tuple[str, str, str]:

samwaseda and others added 5 commits July 31, 2025 22:33
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@samwaseda samwaseda merged commit ad3ab78 into main Aug 5, 2025
17 checks passed
@samwaseda samwaseda deleted the data branch August 5, 2025 06:18