
The "Node" concept #19

Closed
bilderbuchi opened this issue Jan 22, 2023 · 20 comments · Fixed by #24
Labels: discussion-needed (A solution still needs to be determined), distributed_ops (Aspects of a distributed operation, networked or on a node), messages (Concerns the message format)

Comments

@bilderbuchi (Member)

bilderbuchi commented Jan 22, 2023

For describing the details of how we approach distributed operations, I think it could be useful to introduce the concept of a "Node" to express a locally encapsulated (subset of an) ECP deployment.

This could also be useful for what was proposed in the discussion on reliability (#4): to have an alternative (less complicated, more reliable) message transport mode for "local" (i.e. limited to one Node) applications, e.g. using queues between threads instead of conventional zmq.
In the same breath, it makes sense to also introduce the term "message transport" (name TBC, I'm open to improvement suggestions).
Already refined formulation:
Already refined formulation:

Node
A Node is a local context in which (part of) an ECP deployment runs. This may be a single application using one or more threads or processes. An ECP network has one or more Nodes. If it has a single Node, its Components may use the Local Message Transport ("local mode"). If it has multiple Nodes, they must use the Distributed Message Transport ("distributed mode").
(TBC) Optional feature "Bridging Coordinator": In a DMT network, if the Components of a Node use local mode, only a Coordinator in that Node may use DMT to bridge messages to/from other Nodes. Put differently, Components in local mode may only communicate to outside their Node via a Coordinator in their Node.

Message Transport (Distributed/Local)
The communication layer that transports ECP messages between Components. The Local Message Transport (default TBD) only works within a Node. The Distributed Message Transport (zeromq tcp protocol) also works across Nodes. Local Message Transport options include zeromq inproc, zeromq IPC, and queues between threads/processes (TBC).
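A minimal sketch of how the mode choice could map to zmq endpoint strings (Python; all names, hosts, and ports here are hypothetical illustrations, not part of the spec):

```python
# Sketch only: how a Component might pick its zmq endpoint from a
# Message Transport mode. Names, hosts, and ports are hypothetical.

def transport_endpoint(mode: str, name: str = "ecp", host: str = "localhost",
                       port: int = 12300) -> str:
    """Return a zmq endpoint string for the chosen Message Transport."""
    if mode == "distributed":   # DMT: works across Nodes
        return f"tcp://{host}:{port}"
    if mode == "inproc":        # LMT option: same process only
        return f"inproc://{name}"
    if mode == "ipc":           # LMT option: same machine (POSIX)
        return f"ipc:///tmp/{name}.sock"
    raise ValueError(f"unknown transport mode: {mode}")
```

Note that for zmq-based LMT options, only the endpoint string changes; the socket code stays the same.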

Open question: In some contexts, "node" means a (physical) computer. Does it make sense to preserve/integrate that interpretation, or does it not make sense as for zmq it doesn't really matter if sources/sinks are on the same PC or somewhere else over the network? (or does it?) Let's scratch that.

Thoughts?

bilderbuchi added the distributed_ops, discussion-needed, and messages labels on Jan 22, 2023
@BenediktBurger (Member)

I propose the following definition: a Node is a single participant in the ECP network communication. It might be an encapsulation of different software parts which do not participate in the network themselves.

That should be clear enough, and it shows that a Device connected via Ethernet to a Driver is not a Node (only the corresponding Actor is).

@bilderbuchi (Member, Author)

That is already a "Component", at least as per the current status of the glossary PR. I should have explicitly called that out.

@BenediktBurger (Member)

A Driver is a component (part of ECP), but not a node (does not participate in the network). Or am I wrong?

@bilderbuchi (Member, Author)

bilderbuchi commented Jan 23, 2023

In the current formulation (mine), yes, because Driver is equivalent to the new "Actor". I have not yet updated the PR because the issue you opened is not resolved yet.

In the updated formulation, I plan that a Driver will be "filed" under Actor (only an Actor has a Driver), so not a separate Component.

With the Node concept I want to find a way to express "locality" of Components in an ECP network, to enable us to use something simpler than zmq in some situations, as someone expressed a desire for that in #4.

@BenediktBurger (Member)

Now I get it: a Node is a collection of Components which share additional communication possibilities with each other, such that message transfer could happen via means other than zmq.
For example, the Components on a single computer could use queues to exchange data.

(To clarify my usage in the other issue: formerly I used "node" for every "communication endpoint", i.e. every communicating Component.)

Zmq offers different protocols. For tcp it does not matter if the communication partner is on the same computer or elsewhere. The inproc (within a process) and IPC (between processes) protocols require "locality".

@BenediktBurger (Member)

A question regarding the intention of the Node definition: do you intend to make a transparent transition between the Components of one Node and the zmq network, or do we distinguish between a single-node mode (similar to current pymeasure) and a multi-node mode, which requires zmq for all communication?

The combination of some Components in one (or several) Nodes with other ones directly in the network seems quite complex to me (at least if any combination is possible; certain combinations are easier, like current pymeasure, which publishes the data).

@bilderbuchi (Member, Author)

Now I get it: a Node is a collection of Components which share additional communication possibilities with each other, such that message transfer could happen via means other than zmq.
For example, the Components on a single computer could use queues to exchange data.

Yes! Although the "collection of components" is imo still a "network", and the Node provides the context they are operating in (local/distributed).

Zmq offers different protocols. For tcp it does not matter if the communication partner is on the same computer or elsewhere. The inproc (within a process) and IPC (between processes) protocols require "locality".

That is good to know, so Distributed vs. Local Message Transport is not a question purely of zmq vs. something else, but there could be "local" modes using zmq, too. I'll edit above.

A question regarding the intention of the Node definition: do you intend to make a transparent transition between the Components of one Node and the zmq network, or do we distinguish between a single-node mode (similar to current pymeasure) and a multi-node mode, which requires zmq for all communication?

The combination of some Components in one (or several) Nodes with other ones directly in the network seems quite complex to me (at least if any combination is possible; certain combinations are easier, like current pymeasure, which publishes the data).

I wrote that "Components in a Node must communicate with other Nodes' Components via the Distributed Message Transport", so I imagine that as soon as a Component wants to communicate outside its Node, it needs to use the Distributed Message Transport.
For simplicity it indeed seems best (at least for now) to use a single-node or multi-node mode.

@BenediktBurger (Member)

I missed that apostrophe and (wrongly) understood it as communication with Components of the same Node...

Thinking while riding, I figured that it is not difficult to have some Components in a Node. At least in my test implementation, it would be feasible (a kind of Coordinator-Coordinator connection; similar to NAT, the Node's endpoint is a Coordinator translating between the internal and external network).

@bilderbuchi (Member, Author)

Thinking while riding, I figured that it is not difficult to have some Components in a Node. At least in my test implementation, it would be feasible (a kind of Coordinator-Coordinator connection; similar to NAT, the Node's endpoint is a Coordinator translating between the internal and external network).

Yes, that sounds feasible -- so LMT inside a Node, and if you want to cross to another Node, you have to go via a Coordinator and DMT. That would nicely limit the impact/additional complexity, as only the Coordinator can reach out of a LMT Node.
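A minimal sketch of that bridging idea, with plain queues standing in for both transports (in a real implementation the external side would be a zmq socket; all names here are hypothetical):

```python
# Sketch of the "Bridging Coordinator" idea: a Coordinator forwards
# messages between the Node-local transport and the distributed one.
# Queues stand in for both sides here; in a real implementation the
# external side would be a zmq socket. All names are hypothetical.
import queue

def bridge_once(local_q: queue.Queue, external_q: queue.Queue) -> None:
    """Move one message (a list of byte frames) from the local side out."""
    frames = local_q.get()   # message from a local-mode Component
    external_q.put(frames)   # forward unchanged over the DMT side

local_q, external_q = queue.Queue(), queue.Queue()
local_q.put([b"COORDINATOR", b"some payload"])
bridge_once(local_q, external_q)
```

Because the message frames pass through unchanged, only the Coordinator needs to know about both transports.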

@bilderbuchi (Member, Author)

I added a formulation for the "Bridging Coordinator" feature (optional for now, to be refined/confirmed later).

@BenediktBurger (Member)

Another implementation idea: within a Node, we could use "virtual sockets". They behave like sockets but push/pull the messages through a pipe or similar.
That way the code does not notice that it talks differently.
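One way such a "virtual socket" could look, assuming a zmq-like multipart interface backed by thread queues (a sketch; the names and interface are hypothetical, not from the spec):

```python
# Sketch of a "virtual socket": an object with a zmq-like multipart
# send/receive interface, backed by thread queues instead of a real
# socket. Names and interface are hypothetical.
import queue

class VirtualSocket:
    def __init__(self, outbox: queue.Queue, inbox: queue.Queue):
        self._outbox, self._inbox = outbox, inbox

    def send_multipart(self, frames: list[bytes]) -> None:
        self._outbox.put(frames)

    def recv_multipart(self) -> list[bytes]:
        return self._inbox.get()

def virtual_pair() -> tuple[VirtualSocket, VirtualSocket]:
    """Two connected virtual sockets, like the ends of a pipe."""
    a_to_b, b_to_a = queue.Queue(), queue.Queue()
    return VirtualSocket(a_to_b, b_to_a), VirtualSocket(b_to_a, a_to_b)

a, b = virtual_pair()
a.send_multipart([b"header", b"payload"])
```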

@bilderbuchi (Member, Author)

Is that maybe similar to what zmq inproc does internally? 🤷‍♂️

@bklebel (Collaborator)

bklebel commented Jan 28, 2023

I am really not quite sure how we can make the ECP agnostic to programming languages, and then introduce something which will necessarily depend heavily on implementation details in the respective language. zmq can easily talk between programs written in different languages (as far as I understand it), but I doubt that e.g. python queues will work in a similar way with C++ or LabView. It is therefore difficult for me to imagine how we can make a protocol definition for that.
I am aware of the concerns about reliability in #4, and I think there is a valid point about the switch from zmq to queues, since there can be instances where "zmq will drop messages (more or less) silently (can be asserted for)", and of course even if we use tcp on one machine, the OS might screw this up (although very unlikely), and if we are on distributed machines, the routers/switches/cables might get screwed up, and then messages are dropped.

I am not sure how well we can separate these things, how well we can say "let's do it in LMT", and then some additional machine needs to be added and we'll say "let's stay with LMT within this one node, and do DMT to the other machine". I think interconnecting LMT and DMT like this would be an unnecessarily large piece of work, and not necessarily necessary. Once multiple machines are involved, DMT is necessary (easiest for all Components to be involved on equal footing, without bridging LMT and DMT), and if not, then LMT is an option. Once we have something go over DMT, we anyway lose the possible reliability advantage (ok, sure, not within the node, but still), and then we can just go full DMT anyway.

Regarding inproc and IPC: IPC only works on Linux, so I would like to avoid it. Otherwise, inproc is no different from tcp in terms of reliability from the zmq point of view (if we consider the OS and the network infrastructure not to give us any trouble): messages dropped over zmq.tcp will also be dropped over zmq.inproc. inproc is faster and does not need tcp sockets/ports; it can have quite arbitrary names for sockets, but it only works within the same process, i.e. in the same zmq context afaik.

If we can define the whole protocol in a way that we can seamlessly switch between zmq and python queues, that would sound fine, then we can have a simple keyword-switch on instantiating Components to say whether it now should work LMT or DMT. However, I am afraid that the implementation in zmq and queues would differ quite a bit, and I am not sure whether one protocol would fit both ways. On the side of the implementation, we can surely do both, but much of what is needed for the DMT mode and zmq (starting with knowing IP addresses and selecting ports, and going towards protocol definitions of zmq header frames) becomes irrelevant with LMT (queues, or similar, NOT zmq inproc if that is what was desired).

@BenediktBurger (Member)

Even with queues we have to distribute messages; therefore, the header frame can stay the same.
In the end, we need objects which send and receive a series of bytes objects. That can be a zmq socket or a queue. So it is easy to make a Component use queues instead of zmq sockets (similar to using a manually created Serial Adapter vs. giving a VISA resource string to an instrument in pymeasure).
We will lose some interoperability in a local environment, but I don't see that as a problem.
Also, interoperability can be achieved easily with a Coordinator, especially as we have solved how Coordinators talk to each other.

I don't see it as a large problem.

We can try it out, once we have the distributed mode.
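A sketch of that substitution idea: a Component only needs some object that sends/receives lists of byte frames, so a queue-backed stand-in can take the place of a zmq socket at instantiation time (all names here are hypothetical):

```python
# Sketch only: a Component that accepts any transport object with a
# zmq-like multipart interface. A queue-backed stand-in is injected
# here; a real zmq socket could be passed instead. Names hypothetical.
import queue

class QueueTransport:
    """Queue-backed stand-in with a zmq-like multipart interface."""
    def __init__(self):
        self._q = queue.Queue()

    def send_multipart(self, frames):
        self._q.put(frames)

    def recv_multipart(self):
        return self._q.get()

class Component:
    def __init__(self, transport):  # could equally be a zmq socket
        self.transport = transport

    def ping(self):
        self.transport.send_multipart([b"PING"])

c = Component(QueueTransport())
c.ping()
```

This mirrors the pymeasure analogy above: the choice of transport is a constructor argument, not something the Component logic has to know about.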

@BenediktBurger (Member)

Regarding the connection types I second @bklebel.

@bilderbuchi (Member, Author)

Thanks for raising the point of LMT between languages; that is an issue. We could place additional restrictions as needed on the various LMT connection options (e.g. if you want to use thread-queue-based LMT, you need to stay in one language/one program).

Regarding LMT in one Node and DMT elsewhere: So far we have considered inter-Node messaging to only happen via Coordinators, so I'm not concerned about that.
However, this raises an important thing that I think we haven't touched yet:

Do we allow direct Component-to-Component connections without a Coordinator as intermediary?
If no, LMT/DMT routing will be easier, but we effectively always need a Coordinator in the network.
If yes, small networks will be simpler, but our Components potentially need to maintain multiple connections, and we generate a couple more questions: what about Processors? What about PUBlishing? How do Components discover other Components (only via the user?, ...)

@BenediktBurger (Member)

Leaving out a Coordinator is difficult from another point of view, too: One socket of a connection binds to some address (host and port), while the other one connects to that address (all socket types can do either one).
Right now, only Coordinators bind and everybody else connects. If we change it, it gets complicated.

What we could do is create (in the reference implementation) a Director which internally contains a Coordinator.

@bilderbuchi (Member, Author)

That's a good datapoint in support of always having a Coordinator. It's fine by me, we just have to decide that.

So probably, the smallest network needs 1 Coordinator, 1 Director, 1 Actor. Maybe a very simple Director that is basically our command&control interface to the ECP.

What we could do is create (in the reference implementation) a Director which internally contains a Coordinator.

I wouldn't hide too much of the structure. I would have the Director create a Coordinator, not contain it. Or we have the starter bring up everything.

@BenediktBurger (Member)

I was sloppy in my expression; I meant one piece of code which contains a Director and a Coordinator (already pre-configured to work together) for a faster start.

@bilderbuchi (Member, Author)

I pushed an update to #24, requiring exactly one Coordinator per Node.
