As I work more with @JNmpi's #33 and the `iter` (and dataclass) functionality therein, I've been collecting some ideas about how to modify the way IO is defined on a node. I'll break these ideas into separate issues (or link existing issues), but I also wanted a sort of high-level summary to help clarify the way the ideas support each other. Concretely, I'd like to make the following changes:

- Make `output_labels` strictly a class option, and not allow changing it on instances ([minor] Make `Function` and `Macro` definition functions available at the class level #265, [minor] Make `Function` IO info available at the class level #266, [minor] Make macro output labels a class attribute #269)
- Don't allow `inputs_map` or `outputs_map` on macros ([minor] Explicit macro io #276)
- (I feel less strongly about this) Don't allow `inputs_map` or `outputs_map` on workflows either, i.e. only allow access via the standard "scoped label" (`f"{target_channel.owner.label}__{target_channel.label}"`)
  - Present the IO maps in the pedagogical material, but as a sort of optional thing, and then mostly avoid using them ([patch] Canonical macro self-variable #283)
- Don't auto-populate macro IO from unconnected child IO, i.e. force the use of the function-like signature definition of IO ([minor] Explicit macro io #276)
- (Bonus) Make macros more efficient by only introducing UI nodes where necessary ([minor] Prune macro IO nodes that don't get forked #277)
- Don't support `self` for function nodes ([minor] Function selflessly #279)
- Scrape return labels from the `Macro.graph_creator` too, but strip `f"{first argument}."` from them, e.g. `return macro.n1, macro.n2` just gets output labels `n1` and `n2` (if the function is `def MyMacro(macro, ...)`) -- see the sketch after this list
  - Just raise an error if the scraped label still has a "." in it -- then you need to provide an output label
- Extract common "static IO" behaviour shared between `AbstractFunction` and `AbstractMacro` to a common parent class ([minor] Extract a parent class for pulling IO data from a class method #282)
- Make `wf` the canonical first argument for macro `graph_creator` functions in the notebooks, and make it `self` in the tests -- the former lets us tell a story, the latter makes sense because that's what it is! ([patch] Canonical macro self-variable #283)

-- BREAK (merge above, then proceed below, maybe after a pause for efficiency/scaling work) --

- Introduce a new public-facing node class (or classes), e.g. `DataNode` (per @JNmpi's suggestion), for dataframes and dataclasses ([patch] More transformers #306, [patch] Dataclass transformer #308)
- Make a new "for" loop interface that creates a for-loop class with IO channels (including which are scattered on and which get broadcast) defined at the class level, but which dynamically (re)creates children on each `run` call so instances can freely adapt to input of different lengths (partial progress in [minor] Explicit macro io #276, [minor] Introduce for-loop #309)
## `output_labels` modification exclusively on classes
This is fairly simple. Right now, you can modify the output labels on each instance, e.g.
```python
from pyiron_workflow import Workflow

renamed = Workflow.create.standard.UserInput(output_labels="my_label")
default = Workflow.create.standard.UserInput()
print(renamed.outputs.labels, default.outputs.labels)
>>> ['my_label'] ['user_input']
```
This has only changed the name of the output label for the instance `renamed` and hasn't changed the expected class IO signature for `UserInput` at all.
As of #266, for function nodes (children of `AbstractFunction`), it's no longer possible to do this -- `output_labels` is strictly available when defining the class, and this interface naming scheme is then static for all instances of that class. That means you can freely set them when using the `Workflow.wrap_as.function_node(*output_labels)` decorator or `Workflow.create.Function(..., output_labels=None)` class-creation interfaces, but then they're fixed.
The advantage to this is that we can already peek at the IO at the class level:
```python
from pyiron_workflow import Workflow

@Workflow.wrap_as.function_node("xplus1", "xminus1")
def PlusMinusBound0(x: int) -> tuple[int, int | None]:
    return x + 1, None if x - 1 < 0 else x - 1

print(PlusMinusBound0.preview_output_channels())
>>> {'xplus1': int, 'xminus1': int | None}

print(PlusMinusBound0.preview_input_channels())
>>> {'x': (<class 'int'>, NOT_DATA)}
```
This is critical for guided workflow design (à la `ironflow`), and it also helped to simplify some code under the hood.

I would like to make a similar change to `AbstractMacro`.
## No more maps
@samwaseda, when we talked after the pyiron meeting this week, I expressed my sadness at the unavoidability of the `inputs_map` and `outputs_map` for allowing power-users to modify existing macros. After giving it more thought, I'm pretty sure that we can get rid of them after all!
Since #265, `AbstractMacro.graph_creator` is a `@classmethod` (as is `AbstractFunction.node_function`). When combined with the idea above to guarantee that `output_labels` are strictly class and not instance features, that means that a power-user can modify an existing macro by defining a new macro class leveraging the base class's `.graph_creator`. Concretely, on #265 I can now do this:
```python
from pyiron_workflow import Workflow

@Workflow.wrap_as.macro_node("original")
def MyWorkflow(macro, x, y):
    macro.n1 = x + y
    macro.n2 = macro.n1 ** 2
    return macro.n2

@Workflow.wrap_as.macro_node("renamed", "new")
def ModifiedWorkflow(macro, x, y, z):
    # First, create the graph you already like
    MyWorkflow.graph_creator(macro, x, y)
    # Then modify it how you want
    macro.n1.disconnect_all()
    macro.remove_child(macro.n1)
    macro.n1 = x - y
    macro.n2.inputs.obj = macro.n1
    macro.n3 = macro.n2 * z
    return macro.n2, macro.n3

m = ModifiedWorkflow(x=1, y=2, z=3)
m()
>>> {'renamed': 1, 'new': 3}
```
This isn't quite ideal yet, but with a few more changes I am confident I can get it down to:

```python
@Workflow.wrap_as.macro_node("renamed", "new")
def ModifiedWorkflow(macro, x, y, z):
    MyWorkflow.graph_creator(macro, x, y)
    macro.replace_child(macro.n1, x - y)
    macro.n3 = macro.n2 * z
    return macro.n2, macro.n3
```
This doesn't offer identical functionality to being able to set `inputs_map` and `outputs_map`, but IMO it offers equivalent functionality in a more transparent and robust way.
## Get rid of the maps entirely
At the same time, I'd like to get rid of the maps completely by removing them from `Workflow` too! This just means that you can't define shortcuts to IO at the workflow level and always need to use the fully-scoped name, like `wf(some_child__some_channel=42)`, instead of adding a map with `wf.inputs_map = {"some_child__some_channel": "kurz"}; wf(kurz=42)`. This is a price I'm willing to pay to remove the complexity from both the code and the user's head, but I'm not married to this part of the idea.
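To make the trade-off concrete, here's a minimal sketch of the same call with and without a map, reusing the `standard.UserInput` node from the examples above (the map-based form is the one that would go away):

```python
from pyiron_workflow import Workflow

wf = Workflow("scoped_access")
wf.some_child = Workflow.create.standard.UserInput()

# Current behaviour (to be removed): define a shortcut, then use it
wf.inputs_map = {"some_child__user_input": "kurz"}
wf(kurz=42)

# Proposed behaviour: always use the fully-scoped name
wf(some_child__user_input=42)
```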
## Don't auto-populate macro IO
Finally, the default right now is that if you don't use the function-like definition or `output_labels` for your macro, you get IO based on the unconnected children, i.e.

```python
from pyiron_workflow import Workflow

@Workflow.wrap_as.macro_node()
def AutomaticMacro(macro):
    macro.n1 = Workflow.create.standard.UserInput(user_input=0)
    macro.n2 = Workflow.create.standard.UserInput(user_input=macro.n1)

auto = AutomaticMacro()
print(auto.inputs.labels, auto.outputs.labels)
>>> ['n1__user_input'] ['n2__user_input']
```

is equivalent to

```python
from pyiron_workflow import Workflow

@Workflow.wrap_as.macro_node("n2__user_input")
def ExplicitMacro(macro, n1__user_input=0):
    macro.n1 = Workflow.create.standard.UserInput(user_input=n1__user_input)
    macro.n2 = Workflow.create.standard.UserInput(user_input=macro.n1)
    return macro.n2

explicit = ExplicitMacro()
print(explicit.inputs.labels, explicit.outputs.labels)
>>> ['n1__user_input'] ['n2__user_input']
```
I'd like to stop auto-populating things and force the macro definition to be explicit.
Cons:
- Might be inconvenient sometimes
Pros:
- Zen of Python: "explicit is better than implicit"
- Reduces mental load by making macro definitions act more like function definitions
- Shrinks the codebase a little bit
## An aside on efficiency
Right now, when a macro has input arguments in its signature beyond the first `macro: AbstractMacro` item, when we build the graph we prepopulate it with `UserInput` nodes for each signature item. This works fine, and is necessary when that input is getting bifurcated to be used in multiple child nodes -- but if we require the function-signature approach to graph definition, there will be times when the input is being used in only one place and it's downright inefficient to stick an intermediate `UserInput` node in the way! The macro-level input can simply "value link" to the child node's input directly.
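To illustrate the two cases (a sketch only -- the macro bodies here are hypothetical stand-ins):

```python
@Workflow.wrap_as.macro_node("y")
def SingleUse(macro, x):
    # `x` feeds exactly one child, so the macro-level input could
    # "value link" straight to it with no intermediate UserInput node
    macro.n1 = x + 1
    return macro.n1

@Workflow.wrap_as.macro_node("y")
def Bifurcated(macro, x):
    # `x` feeds two children, so an intermediate UserInput node is genuinely
    # needed as the single upstream source both connections fork from
    macro.n1 = x + 1
    macro.n2 = x - 1
    macro.n3 = macro.n1 * macro.n2
    return macro.n3
```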
I already made a branch yesterday that takes care of this and purges such useless nodes at the end of graph creation, so there's no big concern about efficiency. Unfortunately, while it went 99% smoothly, this feature interacts poorly with the combination of input maps and storage, so a couple of tests fail where a workflow owns and reloads a macro. I am confident that adding this efficiency change back in will be possible after `output_labels` are class properties and `inputs_map` is gone.
## Stop supporting `self` for function nodes
@JNmpi, when we had to stop ourselves from hijacking the pyiron meeting on Monday to talk about `pyiron_workflow`, you seemed to me to be expressing the idea that function nodes should be really stateless, and if someone wants state they should just write a function node to handle it and put the whole thing in a macro. I am 100% on board with this perspective -- let's really encourage function nodes to be functional!

To do this, I'd like to just get rid of support for `self` showing up in `AbstractFunction.node_function` functions entirely. It already breaks in some places that we need to work around, so it will feel good to remove it.
From an ideological and UX perspective, I really like this, because at this point in the todo list function nodes are stateless and always wrap a function like `def foo(x, y, z)`, and macro nodes are stateful and always wrap a function that basically has `self` in it, like `def bar(macro, a, b, c)`.
## Data nodes
IMO, the one real downside to forcing users to explicitly define their node IO as part of the function signature/output labels is that it might get a bit verbose for nodes with lots of input -- this is especially true for macros.
@JNmpi in #33 has already been working with dataclasses to package together sets of related input. This allows sensible defaults to be provided, and lets developers build up input/output by composition using multiple inheritance. All great stuff. In the context of nodes, I see this making it more succinct to write the IO like this:
```python
@Workflow.wrap_as.macro_node("result")
def MyNewMacro(macro, input_collection: Workflow.create.some_package.SomeDataNode.dataclass):
    macro.n1 = Workflow.create.some_package.SomeNode(input=input_collection)
    macro.n2 = Workflow.create.some_package.SomeOtherNode(
        macro.n1.output_collection.foo  # Leveraging node injection to grab a property off the output class
    )
    macro.n3 = input_collection.bar - macro.n2
    # Again, `input_collection.bar` is actually the input node `input_collection`,
    # and then we use node-injection to grab its `bar` attribute
    return macro.n3
```
Then even if the dataclass has lots of fields, we don't need to write them all out all the time.
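For reference, here's what that composition-by-inheritance pattern looks like with plain `dataclasses` (the class names here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class StructureInput:
    species: str = "Al"
    cubic: bool = True

@dataclass
class EngineInput:
    engine: str = "lammps"

@dataclass
class StaticCalculationInput(EngineInput, StructureInput):
    # Multiple inheritance composes the parents' fields (and their defaults)
    pass

print(StaticCalculationInput())
>>> StaticCalculationInput(species='Al', cubic=True, engine='lammps')
```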
This idea is already discussed on #208.
## For loops
Ok, so with all this in place we can get to the actual meat of the project, which is facilitating clean, powerful, and robust for-loops. @JNmpi, you mentioned on Monday wanting to be able to pass nodes (or at least node classes) to other nodes, and I think it's the class-level availability of IO signatures that is critical for this. Once we can do this and have `SomeNodeClass.preview_input_channels()`/`.preview_output_channels()` available for macro nodes like they already are for function nodes, we'll be able to pass classes to a constructor that dynamically defines a new class with corresponding IO!
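As a hedged sketch of how such a constructor could exploit the class-level previews (the `for_node_factory` name and everything inside it are hypothetical):

```python
def for_node_factory(body_class, scatter_on: tuple[str, ...]):
    # Hypothetical sketch: use the class-level IO previews to work out the
    # IO signature of a new for-loop class before any instance exists
    inputs = body_class.preview_input_channels()    # e.g. {'x': (int, NOT_DATA)}
    outputs = body_class.preview_output_channels()  # e.g. {'y': int}
    scattered = {k: v for k, v in inputs.items() if k in scatter_on}
    broadcast = {k: v for k, v in inputs.items() if k not in scatter_on}
    # ...then dynamically define and return a new class whose channels
    # are derived from `scattered`, `broadcast`, and `outputs`
```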
The spec for a for-loop is then to have a creator like `Function` (NOT `AbstractFunction`) that creates a new child of `AbstractMacro` that...

- Takes a node class (or node instance) as input
  - Passing an instance is just a convenient polymorphism; we would leverage the instance's `.__class__` attribute to create the new for-macro class, and then use the instance's specific IO values and connections to update the IO of the new for-macro instance
- The for-macro class has the same IO signature as the input class(/instance)
- The creator requires specifying which of these inputs are going to be scattered to child nodes, and which are going to be broadcast to child nodes
  - Concretely, I imagine using the `ironflow` syntax of capitalization to combine this specification with values, like `ForLoop(Workflow.create.atomistics.BulkStructure, LATTICE=[3.9, 4, 4.1], species="Al")`, or simply specifying which ones are going to get scattered and delaying actual values until later, like `ForLoop(Workflow.create.atomistics.BulkStructure, scatter_on=("lattice",))`
- At this point, just as with `Function: AbstractFunction`, we have a `ForLoop` that is dynamically creating a new child class of some `AbstractForLoop` where the body node class and IO (including broadcast vs scattered) are all defined at the class level; then we immediately instantiate it and are free to populate this IO (possibly using some body class instance's values, if an instance was passed instead of a class)
- The form of the IO is fixed, but we want to be free to vary the length of the scattered input! This can be accomplished by modifying the run call so that at each call all the for-macro's children get removed (if there are any), and $N$ of them get re-instantiated and reconnected (re-linked) with the macro's input (whether it's being broadcast or scattered), and run again
  - This touches on a-priori and post-facto graph provenance! The a-priori provenance is now that for-loops have a particular structure, and after running there is exactly one child node for each item in the for loop! I hope something similar can be worked out to give while-loops the same level of post-facto provenance, but I haven't figured it out yet.
  - Remember that since macros are "walled gardens", no one outside the macro actually makes IO connections to the children's IO channels -- we can safely delete and re-instantiate them internally while the macro as a whole maintains all its data connections with its siblings
- The for-macro should have two outputs: `scattered_dataframe: pandas.DataFrame`, à la @JNmpi's `iter` method, that links the scattered input items to child node output, and `broadcast_input: DotDict`, which gives easy access to all the input that is identical across all rows of the dataframe. (In principle we could return just the dataframe, but duplicating the non-scattered input like that seems needlessly memory-inefficient...)
  - Returning a dataframe differs from the existing `For` meta-node, which has list-like channels for each of the child node's inputs. That's because my `For` meta-node handles only a single for-loop, while @JNmpi's `iter` method handles looping over multiple values in nested for-loops. His way is better.
- As syntactic sugar, I'd like to provide both `ForZipped` and `ForNested` interfaces, where the nested version is like the current `iter` on Joerg's lammps nodes #33 and the zipped version zips instead of nesting the scattered input (see the plain-Python illustration after this list)
- Finally, expose a shortcut to creating such a node on the `Node.iter` method (well, `iter_zipped` and `iter_nested`, probably)
- These `iter` methods will need to work a little differently on `Workflow`, which is a parent-most object and not able to pass itself in as the relevant class for the `ForLoop` creator class, but this is an edge case that can be handled in the `Workflow` class itself by overriding the methods
- At the instance level, we will want some convenience methods/features for distributing the same executor to all the children at once (again, similar to Joerg's lammps nodes #33 and its `max_workers`)
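As a plain-Python illustration of how I'd expect the two scattering strategies to differ (not package code, just the combinatorics):

```python
species = ["Al", "Cu"]
lattice = [3.9, 4.0]

# ForNested: one child per point in the outer product -- 4 children
nested = [(s, a) for s in species for a in lattice]
print(nested)
>>> [('Al', 3.9), ('Al', 4.0), ('Cu', 3.9), ('Cu', 4.0)]

# ForZipped: one child per zipped pair -- 2 children
zipped = list(zip(species, lattice))
print(zipped)
>>> [('Al', 3.9), ('Cu', 4.0)]
```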
That's a monstrous wall of text, so let me see if I can end with a for-loop syntax example:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

from pyiron_workflow import Workflow

Workflow.register("some.atomistics.module", "atoms")

@Workflow.wrap_as.macro_node("energy")
def BulkLammps(macro, species, lattice, cubic):
    macro.bulk = Workflow.create.atoms.Bulk(species=species, lattice=lattice, cubic=cubic)
    macro.engine = Workflow.create.atoms.Lammps()
    macro.static = Workflow.create.atoms.Static(
        structure=macro.bulk,
        engine=macro.engine,
    )
    energy = macro.static.energy_pot
    # Here I imagine that `Static` is returning an instance of some `StaticAtomisticsOutput`
    # and that it's a single-value node, so above we're actually leveraging node injection
    # to say `energy = Workflow.create.standard.GetAttr(macro.static.outputs.output, "energy_pot")`
    return energy

wf = Workflow("my_for_loop_example")
wf.calculation = Workflow.create.standard.ForNested(
    BulkLammps,
    SPECIES=["Al", "Cu"],  # Scattered
    LATTICE=np.linspace(2, 6, 100),  # Scattered
    cubic=True,  # Broadcast
    child_executor=ThreadPoolExecutor(max_workers=2),
)
# Then exploit node injection to operate on the for-loop's dataframe
wf.plot_Al = wf.create.plotting.Plot(
    x=wf.calculation[wf.calculation["species"] == "Al"]["lattice"].values,
    y=wf.calculation[wf.calculation["species"] == "Al"]["energy"].values,
)
wf.plot_Cu = wf.create.plotting.Plot(
    x=wf.calculation[wf.calculation["species"] == "Cu"]["lattice"].values,
    y=wf.calculation[wf.calculation["species"] == "Cu"]["energy"].values,
)
wf()  # Run it

# Or again with more data and more power
wf.calculation.set_child_executor(ThreadPoolExecutor(max_workers=20))
wf(calculation__lattice=np.linspace(2, 6, 10000))
```
Or if we don't care about the workflow but just want to get a dataframe quickly, we could use the shortcut to say something like:
```python
m = BulkLammps()
df = m.iter_nested(
    SPECIES=["Al", "Cu"],  # Scattered
    LATTICE=np.linspace(2, 6, 100),  # Scattered
    cubic=True,  # Broadcast
    child_executor=ThreadPoolExecutor(max_workers=2),
)  # Makes the new for-node, runs it, and returns the dataframe
```
Linked issue #72
## Conclusion
I've already started down this path with #265, #266, and the un-pushed work pruning off the unused macro IO nodes. I like this direction and will just keep hacking away at it until `main` and #33 are totally compliant with each other. I am very happy for any feedback on these topics!