Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add compute_mapped #170

Merged
merged 17 commits into from
Jun 21, 2024
Merged

Add compute_mapped #170

merged 17 commits into from
Jun 21, 2024

Conversation

SimonHeybrock
Copy link
Member

This fixes a big UX hurdle when computing results that use mapped nodes.

We will later on consider if this should be integrated more tightly with Pipeline.compute, for now this should serve as a solution that is 80% there an can be used to gather more insights/experience.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider how you could integrate this functionality into compute? E.g., check whether a given key is mapped and then do the equivalent of compute_series?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like that, yes.

Copy link
Member

@jl-wynen jl-wynen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok for a prototype. But in the long run, I would prefer merging this with compute. And I would also prefer not using pandas if possible because users might not use pandas either.

@SimonHeybrock
Copy link
Member Author

SimonHeybrock commented Jun 19, 2024

Ok for a prototype.

This is not meant to be a prototype.

And I would also prefer not using pandas if possible because users might not use pandas either.

We could use a plain dict, but then we do not have index names (aka dims), so one would need to implement a custom data structure for this, including ultimately something like a multi-index (or, simpler something N-D like Scipp, but that won't work any more as soon as Cyclebane supports groupby). I don't think users would prefer using a custom class from an unknown library over a very well known and package in widespread use. Do you have something else in mind?

But in the long run, I would prefer merging this with compute.

It is not yet clear to me that this is a good idea, so I refrained from expressing a preference at this point. Pipeline.compute can compute multiple keys (returning a dict, instead of a single value). So we would need to return a dict containing one or more pandas.Series (instead of a single Series). Would it be too confusing to get a dict, a Series, or a dict containing Series, depending on the args? It is starting too look like a bad interface. Maybe the methods for one target (or a single mapped target) should be separate from the one computing multiple targets? Edit: Furthermore, I had to add an optional index_names argument, in case there are multiple mapped nodes with the same name, which would be cumbersome to integrate into compute with multiple targets.

@SimonHeybrock SimonHeybrock changed the title Add compute_series Add compute_mapped Jun 19, 2024
"\n",
"**Note**\n",
"\n",
"[compute_mapped](https://scipp.github.io/sciline/generated/functions/sciline.compute_mapped.html) depends on Pandas, which is not a dependency of Sciline and must be installed separately, e.g., using pip:\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This link and the link below should be relative links.

src/sciline/pipeline.py Outdated Show resolved Hide resolved

candidates = [
node
for node in graph._cbgraph.graph.nodes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like that this uses a protected attribute of graph. Should get_mapped_node_names be a method of Pipeline?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely something that we need to consider for a lot of other functionality (in particular around planner graph operations) that we intend to implement. Either we need to make those properties accessible on the public interface, or potentially add a lot more methods to Pipeline. Right now I cannot say which is better? So unless you can clearly say which solution should be chosen, I'd like to keep this as it is for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what operations you have in mind that need direct access to the graph. But I would say that we should expose a minimal set of primitive operations that can be composed in, e.g., methods of workflow classes.



def get_mapped_node_names(
graph: Pipeline, key: type, index_names: Sequence[Hashable] | None = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a graph or a pipeline? We should be consistent with names.

@@ -194,3 +199,93 @@ def bind_and_call(
def _repr_html_(self) -> str:
nodes = ((key, data) for key, data in self._graph.nodes.items())
return pipeline_html_repr(nodes)


def get_mapped_node_names(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'name' or 'key'? What is the difference?

node for node in candidates if set(node.indices) == set(index_names)
]
if len(candidates) == 0:
raise ValueError(f"'{key}' is not a mapped node.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this check be before filtering by index names? It seems that key refers to a mapped node but that is filtered out by the index names.

@SimonHeybrock SimonHeybrock merged commit bd8e6f3 into main Jun 21, 2024
5 checks passed
@SimonHeybrock SimonHeybrock deleted the compute-series branch June 21, 2024 08:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants