Hashing input to avoid running #169
In `get_hash`:

```python
import hashlib
import re


def get_hash(binary):
    # Remove specification of jupyter kernel from hash to be deterministic
    binary_no_ipykernel = re.sub(b"(?<=/ipykernel_)(.*)(?=/)", b"", binary)
    return str(hashlib.md5(binary_no_ipykernel).hexdigest())
```

The regex part is important: it strips the jupyter notebook kernel temp path, which changes every time the kernel is restarted.
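For illustration, the byte strings below are made-up examples of the kind of serialized input that embeds a kernel temp path; because the regex removes the varying kernel id, the hashes come out identical across restarts:

```python
# Hypothetical payloads differing only in the ipykernel temp directory
payload_restart_1 = b"/tmp/ipykernel_12345/some_code.py"
payload_restart_2 = b"/tmp/ipykernel_99999/some_code.py"

# Both reduce to b"/tmp/ipykernel_/some_code.py" before hashing
assert get_hash(payload_restart_1) == get_hash(payload_restart_2)
```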
@jan-janssen, I really liked your comment in the meeting this morning that the hashing here should be kept distinct from hashing for the database. At the end of the day we may be able to reuse some functionality for actually generating the hashes, but I totally agree this should be an independent feature that can be developed separately from #126. This issue is now very close to the top of my todo-list, and (mostly notes for myself here) the spec I have in mind is:
- Nodes should optionally hash their input on `run` calls; if they have this option enabled and find a "last hash" value, they compare the current input to that hash and simply return the same output instead of running (see the sketch after this list). It might be good, or even necessary, to hash the output too.
- This will let us re-"run" graphs quickly in conditions where the hashing is much faster than the actual node computation. The hashing still imposes a speed limit, but it seems like a good improvement.
- Combined with saving and loading in #160, this will let us build more expensive workflows, run them, save them, then come back another day to extend them without having to re-run anything expensive we did earlier.
- When we introduce the ability to ship nodes off to a computation queue (e.g. SLURM), if we do it in a way that the remotely executing job serializes its results to the right storage location, then, combined with automatic loading in #160, this handles a big chunk of the work for fetching completed queue jobs too.
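To make the first point concrete, here is a minimal sketch of the hash-and-skip pattern. The `Node` class, the `run` signature, and the `_last_hash` attribute are hypothetical stand-ins rather than the actual API, and `pickle`-based hashing is just one possible way to serialize the input:

```python
import hashlib
import pickle


class Node:
    """Hypothetical node illustrating opt-in input hashing on run."""

    def __init__(self, func, hash_input=True):
        self.func = func
        self.hash_input = hash_input  # opt-in, per the spec above
        self._last_hash = None
        self._last_output = None

    @staticmethod
    def _get_input_hash(inputs):
        # Serialize the inputs and hash the bytes; a real implementation
        # might reuse get_hash() above to stay kernel-independent
        return hashlib.md5(pickle.dumps(inputs)).hexdigest()

    def run(self, *args, **kwargs):
        if self.hash_input:
            current = self._get_input_hash((args, kwargs))
            if current == self._last_hash and self._last_output is not None:
                # Input unchanged since the last run: skip the computation
                # and return the cached output instead
                return self._last_output
            self._last_hash = current
        self._last_output = self.func(*args, **kwargs)
        return self._last_output
```

In this simplified form the second of two identical `run` calls returns the cached output without invoking `func`; a real version would also need to handle unpicklable inputs and, as noted above, possibly hash the output as well.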