As I work more with @JNmpi's #33 and the `iter` (and dataclass) functionality therein, I've been collecting some ideas about how to modify the way IO is defined on a node. I'll break these ideas into separate issues (or link existing issues), but I also wanted a sort of high-level summary to help clarify the way the ideas support each other. Concretely, I'd like to make the following changes:

- Make `output_labels` strictly a class option, and not allow changing it on instances ([minor] Make `Function` and `Macro` definition functions available at the class level #265, [minor] Make `Function` IO info available at the class level #266, [minor] Make macro output labels a class attribute #269)
- Don't allow `inputs_map` or `outputs_map` on macros ([minor] Explicit macro io #276)
- (I feel less strongly about this) Don't allow `inputs_map` or `outputs_map` on workflows either, i.e. only allow access via the standard "scoped label" (`f"{target_channel.owner.label}__{target_channel.label}"`)
  - Present the IO maps in the pedagogical material, but as a sort of optional thing, and then mostly avoid using them ([patch] Canonical macro self-variable #283)
- Don't auto-populate macro IO from unconnected child IO, i.e. force the use of the function-like signature definition of IO ([minor] Explicit macro io #276)
- (Bonus) Make macros more efficient by only introducing UI nodes where necessary ([minor] Prune macro IO nodes that don't get forked #277)
- Don't support `self` for function nodes ([minor] Function selflessly #279)
- Scrape return labels from the `Macro.graph_creator` too, but strip `f"{first argument}."` from them, e.g. `return macro.n1, macro.n2` just gets output labels `n1` and `n2` (if the function is `def MyMacro(macro, ...)`) -- see the sketch after this list
  - Just raise an error if the scraped label still has a "." in it -- then you need to provide an output label
- Extract common "static IO" behaviour shared between `AbstractFunction` and `AbstractMacro` to a common parent class ([minor] Extract a parent class for pulling IO data from a class method #282)
- Make `wf` the canonical first argument for macro `graph_creator` functions in the notebooks, and make it `self` in the tests -- the former lets us tell a story, the latter makes sense because that's what it is! ([patch] Canonical macro self-variable #283)

-- BREAK (merge above, then proceed below, maybe after a pause for efficiency/scaling work) --

- Introduce a new public-facing node class (or classes), e.g. `DataNode` (per @JNmpi's suggestion), for dataframes and dataclasses ([patch] More transformers #306, [patch] Dataclass transformer #308)
- Make a new "for" loop interface that creates a for-loop class with IO channels (including which are scattered on and which get broadcast) defined at the class level, but which dynamically (re)creates children on each `run` call so instances can freely adapt to input of different lengths (partial progress in [minor] Explicit macro io #276, [minor] Introduce for-loop #309)
## `output_labels` modification exclusively on classes
This is fairly simple. Right now, you can modify the output labels on each instance, e.g.
```python
from pyiron_workflow import Workflow

renamed = Workflow.create.standard.UserInput(output_labels="my_label")
default = Workflow.create.standard.UserInput()
print(renamed.outputs.labels, default.outputs.labels)
>>> ['my_label'] ['user_input']
```
This has only changed the name of the output label for the instance `renamed` and hasn't changed the expected class IO signature for `UserInput` at all.
As of #266, for function nodes (children of `AbstractFunction`), it's no longer possible to do this -- `output_labels` is strictly available when defining the class, and this interface naming scheme is then static for all instances of that class. That means you can freely set them when using the `Workflow.wrap_as.function_node(*output_labels)` decorator or `Workflow.create.Function(..., output_labels=None)` class-creation interfaces, but then they're fixed.
The advantage to this is that we can already peek at the IO at the class level:
```python
from pyiron_workflow import Workflow

@Workflow.wrap_as.function_node("xplus1", "xminus1")
def PlusMinusBound0(x: int) -> tuple[int, int | None]:
    return x + 1, None if x - 1 < 0 else x - 1

print(PlusMinusBound0.preview_output_channels())
>>> {'xplus1': int, 'xminus1': int | None}

print(PlusMinusBound0.preview_input_channels())
>>> {'x': (<class 'int'>, NOT_DATA)}
```
This is critical for guided workflow design (à la `ironflow`), and it also helped to simplify some code under the hood.

I would like to make a similar change to `AbstractMacro`.
## No more maps
@samwaseda, when we talked after the pyiron meeting this week, I expressed my sadness at the unavoidability of the `inputs_map` and `outputs_map` for allowing power-users to modify existing macros. After giving it more thought, I'm pretty sure that we can get rid of them after all!
Since #265, `AbstractMacro.graph_creator` is a `@classmethod` (as is `AbstractFunction.node_function`). When combined with the idea above to guarantee that `output_labels` are strictly class and not instance features, that means that a power-user can modify an existing macro by defining a new macro class leveraging the base class's `.graph_creator`. Concretely, on #265 I can now do this:
```python
from pyiron_workflow import Workflow

@Workflow.wrap_as.macro_node("original")
def MyWorkflow(macro, x, y):
    macro.n1 = x + y
    macro.n2 = macro.n1 ** 2
    return macro.n2

@Workflow.wrap_as.macro_node("renamed", "new")
def ModifiedWorkflow(macro, x, y, z):
    # First, create the graph you already like
    MyWorkflow.graph_creator(macro, x, y)
    # Then modify it how you want
    macro.n1.disconnect_all()
    macro.remove_child(macro.n1)
    macro.n1 = x - y
    macro.n2.inputs.obj = macro.n1
    macro.n3 = macro.n2 * z
    return macro.n2, macro.n3

m = ModifiedWorkflow(x=1, y=2, z=3)
m()
>>> {'renamed': 1, 'new': 3}
```
This isn't quite ideal yet, but with a few more changes I am confident I can get it down to:

```python
@Workflow.wrap_as.macro_node("renamed", "new")
def ModifiedWorkflow(macro, x, y, z):
    MyWorkflow.graph_creator(macro, x, y)
    macro.replace_child(macro.n1, x - y)
    macro.n3 = macro.n2 * z
    return macro.n2, macro.n3
```
This doesn't offer identical functionality to being able to set `inputs_map` and `outputs_map`, but IMO it offers equivalent functionality in a more transparent and robust way.
## Get rid of the maps entirely
At the same time, I'd like to get rid of the maps completely by removing them from `Workflow` too! This just means that you can't define shortcuts to IO at the workflow level and always need to use the fully-scoped name, like `wf(some_child__some_channel=42)`, instead of adding a map with `wf.inputs_map = {"some_child__some_channel": "kurz"}; wf(kurz=42)`. This is a price I'm willing to pay to remove the complexity from both the code and the user's head, but I'm not married to this part of the idea.
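To make the trade-off concrete, here's a minimal sketch of the same call with and without a map, reusing the `standard.UserInput` node from the examples above (the map-based form is the one that would go away):

```python
from pyiron_workflow import Workflow

wf = Workflow("scoped_access")
wf.some_child = Workflow.create.standard.UserInput()

# Current behaviour (to be removed): define a shortcut, then use it
wf.inputs_map = {"some_child__user_input": "kurz"}
wf(kurz=42)

# Proposed behaviour: always use the fully-scoped name
wf(some_child__user_input=42)
```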
## Don't auto-populate macro IO
Finally, the default right now is that if you don't use the function-like definition or `output_labels` for your macro, you get IO based on the unconnected children, i.e.

```python
from pyiron_workflow import Workflow

@Workflow.wrap_as.macro_node()
def AutomaticMacro(macro):
    macro.n1 = Workflow.create.standard.UserInput(user_input=0)
    macro.n2 = Workflow.create.standard.UserInput(user_input=macro.n1)

auto = AutomaticMacro()
print(auto.inputs.labels, auto.outputs.labels)
>>> ['n1__user_input'] ['n2__user_input']
```

is equivalent to

```python
from pyiron_workflow import Workflow

@Workflow.wrap_as.macro_node("n2__user_input")
def ExplicitMacro(macro, n1__user_input=0):
    macro.n1 = Workflow.create.standard.UserInput(user_input=n1__user_input)
    macro.n2 = Workflow.create.standard.UserInput(user_input=macro.n1)
    return macro.n2

explicit = ExplicitMacro()
print(explicit.inputs.labels, explicit.outputs.labels)
>>> ['n1__user_input'] ['n2__user_input']
```
I'd like to stop auto-populating things and force the macro definition to be explicit.
Cons:
- Might be inconvenient sometimes
Pros:
- Zen of Python: "explicit is better than implicit"
- Reduces mental load by making macro definitions act more like function definitions
- Shrinks the codebase a little bit
## An aside on efficiency
Right now, when a macro has input arguments in its signature beyond the first `macro: AbstractMacro` item, when we build the graph we prepopulate it with `UserInput` nodes for each signature item. This works fine, and is necessary when that input is getting bifurcated to be used in multiple child nodes -- but if we require the function-signature approach to graph definition, there will be times when the input is being used in only one place and it's downright inefficient to stick an intermediate `UserInput` node in the way! The macro-level input can simply "value link" to the child node's input directly.
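To illustrate the two cases (a sketch only -- the macro bodies here are hypothetical stand-ins):

```python
@Workflow.wrap_as.macro_node("y")
def SingleUse(macro, x):
    # `x` feeds exactly one child, so the macro-level input could
    # "value link" straight to it with no intermediate UserInput node
    macro.n1 = x + 1
    return macro.n1

@Workflow.wrap_as.macro_node("y")
def Bifurcated(macro, x):
    # `x` feeds two children, so an intermediate UserInput node is genuinely
    # needed as the single upstream source both connections fork from
    macro.n1 = x + 1
    macro.n2 = x - 1
    macro.n3 = macro.n1 * macro.n2
    return macro.n3
```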
I already made a branch yesterday that takes care of this and purges such useless nodes at the end of graph creation, so there's no big concern about efficiency. Unfortunately, while it went 99% smoothly, this feature interacts poorly with the combination of input maps and storage, so a couple of tests fail where a workflow owns and reloads a macro. I am confident that adding this efficiency change back in will be possible after `output_labels` are class properties and `inputs_map` is gone.
## Stop supporting `self` for function nodes
@JNmpi, when we had to stop ourselves from hijacking the pyiron meeting on Monday to talk about `pyiron_workflow`, you seemed to me to be expressing the idea that function nodes should be really stateless, and if someone wants state they should just write a function node to handle it and put the whole thing in a macro. I am 100% on board with this perspective -- let's really encourage function nodes to be functional!

To do this, I'd like to just get rid of support for `self` showing up in `AbstractFunction.node_function` functions entirely. It already breaks in some places that we need to work around, so it will feel good to remove it.
From an ideological and UX perspective, I really like this, because at this point in the todo list function nodes are stateless and always wrap a function like `def foo(x, y, z)`, and macro nodes are stateful and always wrap a function that basically has `self` in it, like `def bar(macro, a, b, c)`.
## Data nodes
IMO, the one real downside to forcing users to explicitly define their node IO as part of the function signature/output labels is that it might get a bit verbose for nodes with lots of input -- this is especially true for macros.
@JNmpi in #33 has already been working with dataclasses to package together sets of related input. This allows sensible defaults to be provided, and lets developers build up input/output by composition using multiple inheritance. All great stuff. In the context of nodes, I see this making it more succinct to write the IO like this:
```python
@Workflow.wrap_as.macro_node("result")
def MyNewMacro(macro, input_collection: Workflow.create.some_package.SomeDataNode.dataclass):
    macro.n1 = Workflow.create.some_package.SomeNode(input=input_collection)
    macro.n2 = Workflow.create.some_package.SomeOtherNode(
        macro.n1.output_collection.foo  # Leveraging node injection to grab a property off the output class
    )
    macro.n3 = input_collection.bar - macro.n2
    # Again, `input_collection.bar` is actually the input node `input_collection`,
    # and then we use node-injection to grab its `bar` attribute
    return macro.n3
```
Then even if the dataclass has lots of fields, we don't need to write them all out all the time.
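For reference, here's what that composition-by-inheritance pattern looks like with plain `dataclasses` (the class names here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class StructureInput:
    species: str = "Al"
    cubic: bool = True

@dataclass
class EngineInput:
    engine: str = "lammps"

@dataclass
class StaticCalculationInput(EngineInput, StructureInput):
    # Multiple inheritance composes the parents' fields (and their defaults)
    pass

print(StaticCalculationInput())
>>> StaticCalculationInput(species='Al', cubic=True, engine='lammps')
```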
This idea is already discussed on #208.
## For loops
Ok, so with all this in place we can get to the actual meat of the project, which is facilitating clean, powerful, and robust for-loops. @JNmpi, you mentioned on Monday wanting to be able to pass nodes (or at least node classes) to other nodes, and I think it's the class-level availability of IO signatures that is critical for this. Once we can do this and have `SomeNodeClass.preview_input_channels()`/`.preview_output_channels()` available for macro nodes like they already are for function nodes, we'll be able to pass classes to a constructor that dynamically defines a new class with corresponding IO!
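As a hedged sketch of how such a constructor could exploit the class-level previews (the `for_node_factory` name and everything inside it are hypothetical):

```python
def for_node_factory(body_class, scatter_on: tuple[str, ...]):
    # Hypothetical sketch: use the class-level IO previews to work out the
    # IO signature of a new for-loop class before any instance exists
    inputs = body_class.preview_input_channels()    # e.g. {'x': (int, NOT_DATA)}
    outputs = body_class.preview_output_channels()  # e.g. {'y': int}
    scattered = {k: v for k, v in inputs.items() if k in scatter_on}
    broadcast = {k: v for k, v in inputs.items() if k not in scatter_on}
    # ...then dynamically define and return a new class whose channels
    # are derived from `scattered`, `broadcast`, and `outputs`
```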
The spec for a for-loop is then to have a creator like `Function` (NOT `AbstractFunction`) that creates a new child of `AbstractMacro` that...

- Takes a node class (or node instance) as input
  - Passing an instance is just a convenient polymorphism; we would leverage the instance's `.__class__` attribute to create the new for-macro class, and then use the instance's specific IO values and connections to update the IO of the new for-macro instance
- The for-macro class has the same IO signature as the input class(/instance)
- The creator requires specifying which of these inputs are going to be scattered to child nodes, and which are going to be broadcast to child nodes
  - Concretely, I imagine using the `ironflow` syntax of capitalization to combine this specification with values, like `ForLoop(Workflow.create.atomistics.BulkStructure, LATTICE=[3.9, 4, 4.1], species="Al")`, or simply specifying which ones are going to get scattered and delaying actual values until later, like `ForLoop(Workflow.create.atomistics.BulkStructure, scatter_on=("lattice",))`
- At this point, just as with `Function: AbstractFunction`, we have a `ForLoop` that is dynamically creating a new child class of some `AbstractForLoop` where the body node class and IO (including broadcast vs scattered) are all defined at the class level; then we immediately instantiate it and are free to populate this IO (possibly using some body class instance's values, if an instance was passed instead of a class)
- The form of the IO is fixed, but we want to be free to vary the length of the scattered input! This can be accomplished by modifying the run call so that at each call all the for-macro's children get removed (if there are any), and $N$ of them get re-instantiated and reconnected (re-linked) with the macro's input (whether it's being broadcast or scattered), and run again
  - This touches on a-priori and post-facto graph provenance! The a-priori provenance is now that for-loops have a particular structure, and after running there is exactly one child node for each item in the for loop! I hope something similar can be worked out to give while-loops the same level of post-facto provenance, but I haven't figured it out yet.
  - Remember that since macros are "walled gardens", no one outside the macro actually makes IO connections to the children's IO channels -- we can safely delete and re-instantiate them internally while the macro as a whole maintains all its data connections with its siblings
- The for-macro should have two outputs: `scattered_dataframe: pandas.DataFrame`, à la @JNmpi's `iter` method, that links the scattered input items to child node output, and `broadcast_input: DotDict`, which gives easy access to all the input that is identical across all rows of the dataframe. (In principle we could return just the dataframe, but duplicating the non-scattered input like that seems needlessly memory-inefficient...)
  - Returning a dataframe differs from the existing `For` meta-node, which has list-like channels for each of the child node's inputs. That's because my `For` meta-node handles only a single for-loop, while @JNmpi's `iter` method handles looping over multiple values in nested for-loops. His way is better.
- As syntactic sugar, I'd like to provide both `ForZipped` and `ForNested` interfaces, where the nested version is like the current `iter` on Joerg's lammps nodes #33 and the zipped version zips instead of nesting the scattered input (see the plain-Python illustration after this list)
- Finally, expose a shortcut to creating such a node on the `Node.iter` method (well, `iter_zipped` and `iter_nested`, probably)
- These `iter` methods will need to work a little differently on `Workflow`, which is a parent-most object and not able to pass itself in as the relevant class for the `ForLoop` creator class, but this is an edge case that can be handled in the `Workflow` class itself by overriding the methods
- At the instance level, we will want some convenience methods/features for distributing the same executor to all the children at once (again, similar to Joerg's lammps nodes #33 and its `max_workers`)
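As a plain-Python illustration of how I'd expect the two scattering strategies to differ (not package code, just the combinatorics):

```python
species = ["Al", "Cu"]
lattice = [3.9, 4.0]

# ForNested: one child per point in the outer product -- 4 children
nested = [(s, a) for s in species for a in lattice]
print(nested)
>>> [('Al', 3.9), ('Al', 4.0), ('Cu', 3.9), ('Cu', 4.0)]

# ForZipped: one child per zipped pair -- 2 children
zipped = list(zip(species, lattice))
print(zipped)
>>> [('Al', 3.9), ('Cu', 4.0)]
```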
That's a monstrous wall of text, so let me see if I can end with a for-loop syntax example:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

from pyiron_workflow import Workflow

Workflow.register("some.atomistics.module", "atoms")

@Workflow.wrap_as.macro_node("energy")
def BulkLammps(macro, species, lattice, cubic):
    macro.bulk = Workflow.create.atoms.Bulk(species=species, lattice=lattice, cubic=cubic)
    macro.engine = Workflow.create.atoms.Lammps()
    macro.static = Workflow.create.atoms.Static(
        structure=macro.bulk,
        engine=macro.engine,
    )
    energy = macro.static.energy_pot
    # Here I imagine that `Static` is returning an instance of some `StaticAtomisticsOutput`
    # and that it's a single-value node, so above we're actually leveraging node injection
    # to say `energy = Workflow.create.standard.GetAttr(macro.static.outputs.output, "energy_pot")`
    return energy

wf = Workflow("my_for_loop_example")
wf.calculation = Workflow.create.standard.ForNested(
    BulkLammps,
    SPECIES=["Al", "Cu"],  # Scattered
    LATTICE=np.linspace(2, 6, 100),  # Scattered
    cubic=True,  # Broadcast
    child_executor=ThreadPoolExecutor(max_workers=2),
)
# Then exploit node injection to operate on the for-loop's dataframe
wf.plot_Al = wf.create.plotting.Plot(
    x=wf.calculation[wf.calculation["species"] == "Al"]["lattice"].values,
    y=wf.calculation[wf.calculation["species"] == "Al"]["energy"].values,
)
wf.plot_Cu = wf.create.plotting.Plot(
    x=wf.calculation[wf.calculation["species"] == "Cu"]["lattice"].values,
    y=wf.calculation[wf.calculation["species"] == "Cu"]["energy"].values,
)
wf()  # Run it

# Or again with more data and more power
wf.calculation.set_child_executor(ThreadPoolExecutor(max_workers=20))
wf(calculation__lattice=np.linspace(2, 6, 10000))
```
Or if we don't care about the workflow but just want to get a dataframe quickly, we could use the shortcut to say something like:
```python
m = BulkLammps()
df = m.iter_nested(
    SPECIES=["Al", "Cu"],  # Scattered
    LATTICE=np.linspace(2, 6, 100),  # Scattered
    cubic=True,  # Broadcast
    child_executor=ThreadPoolExecutor(max_workers=2),
)  # Makes the new for-node, runs it, and returns the dataframe
```
Linked issue #72
## Conclusion
I've already started down this path with #265, #266, and the un-pushed work pruning off the unused macro IO nodes. I like this direction and will just keep hacking away at it until `main` and #33 are totally compliant with each other. I am very happy for any feedback on these topics!