Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hashing input to avoid running #169

Closed
liamhuber opened this issue Jan 21, 2024 · 2 comments · Fixed by #395
Closed

Hashing input to avoid running #169

liamhuber opened this issue Jan 21, 2024 · 2 comments · Fixed by #395

Comments

@liamhuber
Copy link
Member

Nodes should optionally hash their input on run calls, and if they have this option enabled and find a "last hash" value, compare the current input to that hash and simply return the same output instead of running. (It might be good/necessary to hash the output too).

This will let us re-"run" graphs quickly in conditions where the hashing is much faster than the actual node computation. The hashing still gives us a speed-limit, but it seems like a good improvement.

Combined with saving and loading in #160, this will let us build more expensive workflows, run them, save them, then come back another day to extend them without having to re-run anything expensive we did earlier.

When we introduce the ability to ship nodes off to a computation queue (e.g. slurm), if we do it in a way that the remotely executing job is serializing stuff to the right storage location, then combined with automatic loading in #160, this handles a big chunk of the work for fetching completed queue jobs too.

@jan-janssen
Copy link
Member

In pyiron_base we use the following hash function to hash the binary of the cloud pickled input plus function body pyiron/pyiron_base#1285 :

def get_hash(binary):
    # Remove specification of jupyter kernel from hash to be deterministic
    binary_no_ipykernel = re.sub(b"(?<=/ipykernel_)(.*)(?=/)", b"", binary)
    return str(hashlib.md5(binary_no_ipykernel).hexdigest())

The regex part is important to remove the jupyter notebook kernel which changes every time the jupyter notebook kernel is restarted.

@liamhuber
Copy link
Member Author

@jan-janssen, I really liked your comment in the meeting this morning that the hashing here should be kept distinct from hashing for the database; at the end of the day we may be able to reuse some functionality for actually generating the hashes, but I totally agree this should be an independent feature and can be developed separately from #126.

This issue is now very close to the top of my todo-list, and (mostly notes for myself here) the spec I have in mind is:

  • Opt-in at the node instance level with a property (+ability to set via input kwarg)
  • An easy method to clear an existing hash to force re-running (e.g. for when you've changed what the node functionality is)
  • Allow the behaviour for anything where the input values hash natively, but just allow it to fail cleanly if you try to do it on a node with unhashable input -- this is the user's problem
  • Maybe? Allow specifying the hashing property default at the class-level instead of directly in __init__, so that certain sub-classes wind up having opt-out instead of opt-in behaviour?
  • Implementation question: can this be handled already in Runnable instead of down in Node?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants