
Caching questions #674

@liamhuber

Description


Hi @jan-janssen,

Very nice improvements to executorlib since the last time I looked at it! I had no problem experimenting with SingleNodeExecutor, and I was able to uncover some useful information about the interaction of pyiron_workflow and executorlib. There are two things I think I understand, and I was hoping you could confirm or correct my understanding. EDIT: plus one idea for allowing extra input.

  1. The main thing I'm interested in is the persistence of data retrieval when the interpreter that spawned the executorlib executor is shut down before the task finishes. I can explore the hash-and-retrieve functionality with executorlib.SingleNodeExecutor, even from a new interpreter, but it's not exactly suited to my purpose because the task needs to finish first. Is it correct that something like the Slurm*Executor will give me the same hash-based data retrieval, but also lets me shut down the Python process as soon as the task has been submitted? That is, a new instance of the executor producing the same hash with the same cache_directory will locate the pending job, and since it's on the Slurm queue it will still (eventually) finish running? (A minimal sketch of the pattern I mean is in the P.S. at the bottom of this post.)

  2. My integration is that pyiron_workflow submits Node.on_run, *run_args, **run_kwargs to the executor. Doing this with executorlib.SingleNodeExecutor, I found that the cache key got a different hash every time I ran it. My understanding is that since these are what gets serialized into the binary that is hashed, and since my function is a bound method of some particular instance, the fact that I have a fresh instance after restarting my kernel means the "fn" part of the serialized-and-hashed data guarantees I'll never get a cache hit (I didn't have resources, so [Bug] Resources should not be included in the hash #672 doesn't impact me).

2a) Have I missed/gotten anything wrong?

2b) For pure function nodes, this is something where I could make modifications on the pyiron_workflow side. For macros... maybe? It would certainly be much harder to guarantee that the serialized data always produces the same hash. However, since my objects are all "lexical" and each function I'm submitting can be uniquely scoped inside its workflow, I already have a hash-equivalent on my end. Would you be open to providing an interface for me to explicitly provide the cache key at submit time? I'm having trouble tracing the intermediate connections, but I guess it would look something like adding cache_key: Optional[str] = None at various interface points and modifying executorlib.standalone.serialize like this (a usage sketch follows the code):

def serialize_funct_h5(
    fn: Callable,
    fn_args: Optional[list] = None,
    fn_kwargs: Optional[dict] = None,
    resource_dict: Optional[dict] = None,
    key: Optional[str] = None,
) -> tuple[str, dict]:
    """
    Serialize a function and its arguments and keyword arguments into an HDF5 file.

    Args:
        fn (Callable): The function to be serialized.
        fn_args (list): The arguments of the function.
        fn_kwargs (dict): The keyword arguments of the function.
        resource_dict (dict): resource dictionary, which defines the resources used for the execution of the function.
                              Example resource dictionary: {
                                  cores: 1,
                                  threads_per_core: 1,
                                  gpus_per_worker: 0,
                                  oversubscribe: False,
                                  cwd: None,
                                  executor: None,
                                  hostname_localhost: False,
                              }
        key (str, optional): Explicit task key; when provided it is used as-is instead of hashing the serialized payload.

    Returns:
        tuple[str, dict]: A tuple containing the task key and the serialized data.

    """
    if fn_args is None:
        fn_args = []
    if fn_kwargs is None:
        fn_kwargs = {}
    if resource_dict is None:
        resource_dict = {}
    if key is None:
        binary_all = cloudpickle.dumps(
            {"fn": fn, "args": fn_args, "kwargs": fn_kwargs, "resource_dict": resource_dict}
        )
        task_key = fn.__name__ + _get_hash(binary=binary_all)
    else:
        task_key = key
    data = {
        "fn": fn,
        "args": fn_args,
        "kwargs": fn_kwargs,
        "resource_dict": resource_dict,
    }
    return task_key, data
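
To make the intent concrete, here is roughly how I'd exercise it. WFNode and its attributes are invented for this sketch; they just stand in for a pyiron_workflow node with a stable lexical label plus whatever per-session state a real node carries, and I'm assuming the modified serialize_funct_h5 above is in scope:

class WFNode:
    def __init__(self, lexical_label, session_id):
        self.lexical_label = lexical_label  # stable across interpreter restarts
        self.session_id = session_id  # stands in for any per-session state a node carries

    def on_run(self, x):
        return x + 1


a = WFNode("wf/add", session_id="kernel-1")
b = WFNode("wf/add", session_id="kernel-2")  # "same" node, rebuilt after a kernel restart

# Status quo: the bound method drags the instance (and its per-session state) into the
# pickled payload, so the hashes, and therefore the cache keys, never match.
key_a, _ = serialize_funct_h5(a.on_run, fn_args=[1])
key_b, _ = serialize_funct_h5(b.on_run, fn_args=[1])
assert key_a != key_b

# With an explicit key, the stable lexical label alone decides the cache identity.
key_a, _ = serialize_funct_h5(a.on_run, fn_args=[1], key=a.lexical_label)
key_b, _ = serialize_funct_h5(b.on_run, fn_args=[1], key=b.lexical_label)
assert key_a == key_b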

Let me know if that might be a reasonable path!
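
P.S. For concreteness on point (1), here is the retrieval pattern I have in mind, written with SingleNodeExecutor since that's what I can test locally. I'm assuming cache_directory at executor construction is the right knob, and that a Slurm-backed executor behaves the same way except that the queued job survives shutting down the submitting interpreter:

from executorlib import SingleNodeExecutor


def add(x, y):
    return x + y


# Session 1: submit against an on-disk cache so the result outlives this interpreter.
with SingleNodeExecutor(cache_directory="./cache") as exe:
    print(exe.submit(add, 1, 2).result())  # 3, computed

# Session 2 (fresh interpreter): the same function, arguments, and cache_directory
# reproduce the hash, so this should come back from the cache instead of re-running.
with SingleNodeExecutor(cache_directory="./cache") as exe:
    print(exe.submit(add, 1, 2).result())  # 3, retrieved from ./cache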
