Graph-based data hashing #8

samwaseda · 2025-07-25T13:45:53Z

You might remember the discussion towards the end of last year about history-based hashing, which rests on the assumption that human inputs are always serializable. Take the following example:

def create_structure(element):
    ...
    return structure

def create_vacancy(structure):
    ...
    return vacancy_structure

structure = create_structure(element="Al")
vacancy_structure = create_vacancy(structure=structure)

If we try to hash the data of each node independently, we run into the problem that structure might not be hashable. So instead of hashing the data at every node, we keep track of where structure came from: in this case create_structure, whose input "Al" is hashable.
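To illustrate the problem and the workaround, here is a minimal sketch, not the flowrep implementation. `Structure` is a hypothetical stand-in for a non-serializable object, and the layout of `record` is an assumption made only for this example.

```python
import hashlib
import json


class Structure:
    """Hypothetical stand-in for an intermediate object with no JSON form."""


structure = Structure()

# Hashing the object directly fails, because it cannot be serialized.
try:
    json.dumps(structure)
    direct_hash_possible = True
except TypeError:
    direct_hash_possible = False

# Provenance record (assumed layout): which function produced the object,
# and from which serializable input.
record = {"function": "create_structure", "inputs": {"element": "Al"}}
structure_hash = hashlib.sha256(
    json.dumps(record, sort_keys=True).encode("utf-8")
).hexdigest()
```

The object itself never enters the hash; only the serializable description of how it was produced does.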

I cannot remember where I saw an actual implementation (either Jörg's or Sebastian's), which I thought was pretty neat, but I made another prototype here because this new algorithm is fairly modular. What does that mean? Firstly, it uses NetworkX in the background to trace the history of the ports. Secondly, it runs on the flowrep dict, so in theory it will be applicable to any of the workflow managers in the future.
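As a rough sketch of what tracing port history with NetworkX can look like: the port naming scheme below (`<node>.inputs.<port>`, `<node>.outputs.<port>`) is an assumption for illustration and may differ from what flowrep actually builds.

```python
import networkx as nx

# Hypothetical graph: ports and nodes are vertices, data flow gives the edges.
graph = nx.DiGraph()
graph.add_edges_from(
    [
        ("inputs.element", "create_structure_0.inputs.element"),
        ("create_structure_0.inputs.element", "create_structure_0"),
        ("create_structure_0", "create_structure_0.outputs.structure"),
        ("create_structure_0.outputs.structure", "create_vacancy_0.inputs.structure"),
        ("create_vacancy_0.inputs.structure", "create_vacancy_0"),
    ]
)

# Everything the vacancy node depends on, directly or indirectly.
history = nx.ancestors(graph, "create_vacancy_0")

# The direct source of a single input port.
source = list(graph.predecessors("create_vacancy_0.inputs.structure"))
```

`nx.ancestors` returns the full upstream set, while `predecessors` gives only the immediate origin of a port; a history-based hash needs the former to reach the serializable human inputs.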

MWE:

def workflow_with_data(a=10, b=20):
    x = add(a, b)
    y = multiply(x, b)
    return x, y

workflow_dict = fwf.get_workflow_dict(workflow_with_data)
graph = fwf.get_workflow_graph(workflow_dict)
data_dict = fwf.get_hashed_node_dict("add_0", graph, workflow_dict["nodes"])
print(data_dict)

Output:

{
    "nodes": {
        "module": "__main__",
        "qualname": "add",
        "version": "not_defined",
        "connected_inputs": [],
    },
    "inputs": {"x": 10, "y": 20},
    "outputs": ["output"],
}

For multiply_0 it looks like this:

{
    "nodes": {
        "module": "__main__",
        "qualname": "multiply",
        "version": "not_defined",
        "connected_inputs": ["x"],
    },
    "inputs": {"x": "097c4e61c3d890eb4e2c6050f8d02d277ec1d9cd66f0616c16d7cb57a06ff18f@output", "y": 20},
    "outputs": ["output"],
}
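The `<hash>@output` reference in the multiply_0 inputs can be reproduced in spirit with the sketch below. The helper `hash_node` and its canonical JSON serialization are assumptions, so the digest it produces will not necessarily match the one printed above.

```python
import hashlib
import json


def hash_node(node_dict: dict) -> str:
    """SHA-256 of a sorted-key JSON dump of a node dictionary (assumed scheme)."""
    canonical = json.dumps(node_dict, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


add_dict = {
    "nodes": {
        "module": "__main__",
        "qualname": "add",
        "version": "not_defined",
        "connected_inputs": [],
    },
    "inputs": {"x": 10, "y": 20},
    "outputs": ["output"],
}

# The connected input "x" stores a reference to the upstream node's hash and
# output port instead of the value itself.
multiply_dict = {
    "nodes": {
        "module": "__main__",
        "qualname": "multiply",
        "version": "not_defined",
        "connected_inputs": ["x"],
    },
    "inputs": {"x": f"{hash_node(add_dict)}@output", "y": 20},
    "outputs": ["output"],
}

multiply_hash = hash_node(multiply_dict)
```

Because the upstream hash is embedded in the downstream inputs, changing `a` or `b` in the workflow changes the hash of every node that depends on them.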

github-actions · 2025-07-25T13:46:03Z

👈 Launch a binder notebook on branch pyiron/flowrep/data

liamhuber · 2025-07-30T14:30:38Z

@samwaseda, at the meeting we discussed making the hash value independent of the labels since they are not "true information". I'd still like that before this is merged, yes?

samwaseda · 2025-07-31T20:13:27Z

If I'm reading store_node_in_database from pyiron_database correctly, the functions that I just wrote can be straightforwardly used there, meaning we can get rid of the dependence on pyiron_workflow.

Copilot

Pull Request Overview

This PR introduces a graph-based data hashing system for workflow management that tracks data dependencies through their computational history rather than trying to hash potentially non-serializable intermediate data structures. The system uses NetworkX to build dependency graphs and generates SHA-256 hashes based on function metadata and input provenance.

Key changes:

Implements graph-based workflow representation using NetworkX
Adds data hashing functionality that tracks computational provenance
Provides utility functions for nested dictionary operations
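For the nested dictionary utilities, a minimal sketch of dot-notation access could look as follows. The names `get_entry` and `set_entry` and their signatures are hypothetical; the functions in the PR may differ.

```python
from functools import reduce


def get_entry(data: dict, path: str):
    """Return the value at a dot-separated path, e.g. "nodes.add_0.inputs"."""
    return reduce(lambda d, key: d[key], path.split("."), data)


def set_entry(data: dict, path: str, value) -> None:
    """Assign a value at a dot-separated path, creating intermediate dicts."""
    *parents, last = path.split(".")
    for key in parents:
        data = data.setdefault(key, {})
    data[last] = value


workflow = {"nodes": {"add_0": {"inputs": {"x": 10}}}}
set_entry(workflow, "nodes.add_0.inputs.y", 20)
inputs = get_entry(workflow, "nodes.add_0.inputs")
```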

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
flowrep/workflow.py	Core implementation of graph-based hashing with functions for workflow graph creation, node hashing, and utility operations
tests/unit/test_workflow.py	Comprehensive unit tests covering the new hashing functionality and edge cases

Comments suppressed due to low confidence (1)

flowrep/workflow.py:1390

[nitpick] The parameter name 'cls' suggests a class, but this function accepts any callable. Consider renaming to 'func' or 'callable_obj' for clarity.

def _get_function_metadata(cls: Callable) -> tuple[str, str, str]:
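With the suggested rename applied, such a helper might look like the sketch below. Only the signature comes from the PR; the body, and in particular the way the version is looked up, is an assumption.

```python
import json
import sys
from collections.abc import Callable


def get_function_metadata(func: Callable) -> tuple[str, str, str]:
    """Return (module, qualname, version) for a callable."""
    module = func.__module__
    qualname = func.__qualname__
    # Assumption: the version is taken from the top-level package, if it has one.
    root = sys.modules.get(module.split(".")[0])
    version = str(getattr(root, "__version__", "not_defined"))
    return module, qualname, version


module, qualname, version = get_function_metadata(json.dumps)
```

A function defined in a script or notebook has no package version, which is consistent with the "not_defined" entries in the output shown in the PR description.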

samwaseda added 3 commits July 25, 2025 15:21

Add separate_data

dd9927f

add tests

69be4b1

black

04108a6

typo

4433b1f

samwaseda requested review from Copilot and liamhuber July 25, 2025 14:10

samwaseda and others added 5 commits July 25, 2025 16:11

Update flowrep/workflow.py

6daf78a

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Convert to str

19c8111

forgot to convert

576cdb2

Implement get and set entry and add tests

bbadc0d

append nodes as prefix

229f176

samwaseda mentioned this pull request Jul 26, 2025

Get and assign values directly from the dot notation #9

Merged

samwaseda and others added 5 commits July 26, 2025 05:53

And of course I forgot black

67a3809

Merge pull request #9 from pyiron/dot

1adf1c0

Get and assign values directly from the dot notation

Add docstring

e4818fe

correct errors and add more tests

517ddcc

black

3896675

samwaseda marked this pull request as draft July 30, 2025 14:31

samwaseda added 5 commits July 31, 2025 21:08

Add get_type etc., which come from pyiron_database, but since pyiron_database is not on Conda, I copied and pasted them

ab8fa5f

Remove __class__ because it's not pyiron_workflow

f62a5a0

Add tests and remove unused functions

da27e66

Update hashing algorithm and add tests

0ede732

Remove json because it's not used

026009d

samwaseda marked this pull request as ready for review July 31, 2025 20:05

samwaseda requested review from liamhuber and removed request for liamhuber July 31, 2025 20:05

I'm hopeless

8e11b2c

samwaseda added 5 commits July 31, 2025 22:16

Make functions public for pyiron_database

8d889f4

Add tests

ebbc641

black

e0786c5

Docstring and type hints

8e2f68e

Remove unused workflow decorator and remove also tests for separate_data because it looks like it's not even necessary

ade1bf9

samwaseda requested a review from Copilot July 31, 2025 20:32

Copilot AI reviewed Jul 31, 2025

samwaseda and others added 5 commits July 31, 2025 22:33

Update flowrep/workflow.py

779381d

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update flowrep/workflow.py

647b9ab

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update flowrep/workflow.py

b9e89a0

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Add json in the end

eb70e3c

Black again

25a345d

samwaseda merged commit ad3ab78 into main Aug 5, 2025
17 checks passed

samwaseda deleted the data branch August 5, 2025 06:18

samwaseda mentioned this pull request Aug 7, 2025

Adjust function serialization #13

Merged
