Hashing input to avoid running #169
In `get_hash`:

```python
import hashlib
import re


def get_hash(binary):
    # Remove specification of jupyter kernel from hash to be deterministic
    binary_no_ipykernel = re.sub(b"(?<=/ipykernel_)(.*)(?=/)", b"", binary)
    return str(hashlib.md5(binary_no_ipykernel).hexdigest())
```

The regex part is important: it strips the jupyter notebook kernel temp path, which changes every time the kernel is restarted.
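For illustration, the byte strings below are made-up examples of the kind of serialized input that embeds a kernel temp path; because the regex removes the varying kernel id, the hashes come out identical across restarts:

```python
# Hypothetical payloads differing only in the ipykernel temp directory
payload_restart_1 = b"/tmp/ipykernel_12345/some_code.py"
payload_restart_2 = b"/tmp/ipykernel_99999/some_code.py"

# Both reduce to b"/tmp/ipykernel_/some_code.py" before hashing
assert get_hash(payload_restart_1) == get_hash(payload_restart_2)
```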
@jan-janssen, I really liked your comment in the meeting this morning that the hashing here should be kept distinct from hashing for the database. At the end of the day we may be able to reuse some functionality for actually generating the hashes, but I totally agree this should be an independent feature that can be developed separately from #126. This issue is now very close to the top of my todo-list, and (mostly notes for myself here) the spec I have in mind is:
- Nodes should optionally hash their input on `run` calls; if they have this option enabled and find a "last hash" value, they compare the current input to that hash and simply return the same output instead of running (see the sketch after this list). It might be good, or even necessary, to hash the output too.
- This will let us re-"run" graphs quickly in conditions where the hashing is much faster than the actual node computation. The hashing still imposes a speed limit, but it seems like a good improvement.
- Combined with saving and loading in #160, this will let us build more expensive workflows, run them, save them, then come back another day to extend them without having to re-run anything expensive we did earlier.
- When we introduce the ability to ship nodes off to a computation queue (e.g. SLURM), if we do it in a way that the remotely executing job serializes its results to the right storage location, then, combined with automatic loading in #160, this handles a big chunk of the work for fetching completed queue jobs too.
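To make the first point concrete, here is a minimal sketch of the hash-and-skip pattern. The `Node` class, the `run` signature, and the `_last_hash` attribute are hypothetical stand-ins rather than the actual API, and `pickle`-based hashing is just one possible way to serialize the input:

```python
import hashlib
import pickle


class Node:
    """Hypothetical node illustrating opt-in input hashing on run."""

    def __init__(self, func, hash_input=True):
        self.func = func
        self.hash_input = hash_input  # opt-in, per the spec above
        self._last_hash = None
        self._last_output = None

    @staticmethod
    def _get_input_hash(inputs):
        # Serialize the inputs and hash the bytes; a real implementation
        # might reuse get_hash() above to stay kernel-independent
        return hashlib.md5(pickle.dumps(inputs)).hexdigest()

    def run(self, *args, **kwargs):
        if self.hash_input:
            current = self._get_input_hash((args, kwargs))
            if current == self._last_hash and self._last_output is not None:
                # Input unchanged since the last run: skip the computation
                # and return the cached output instead
                return self._last_output
            self._last_hash = current
        self._last_output = self.func(*args, **kwargs)
        return self._last_output
```

In this simplified form the second of two identical `run` calls returns the cached output without invoking `func`; a real version would also need to handle unpicklable inputs and, as noted above, possibly hash the output as well.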