Component-Coordinator Transport Layer Protocol #32

Closed
2 tasks
BenediktBurger opened this issue Jan 30, 2023 · 15 comments · Fixed by #38
Labels
distributed_ops Aspects of a distributed operation, networked or on a node documentation Improvements or additions to documentation messages Concerns the message format

Comments

@BenediktBurger
Member

BenediktBurger commented Jan 30, 2023

As mermaid diagrams are not rendered in a PR, I collect the protocol definitions here.

The Message Layer will define how the commands are encoded; here they are given in plain English.
How the Header is formatted will be defined in #33.

General notes:

  • S means Sender.
  • R means Recipient.
  • The namespace of the first "Coordinator" is always "Co1".
  • Message IDs are not shown in the communication.
  • Conversation IDs (reply-reference) are not shown, unless necessary to indicate it specifically.
  • The communication shows the messages handed over to the sending socket, including the identity frame of the ROUTER socket (which will not arrive at the destination)
  • "|" indicates frame barrier. "||" at beginning/end indicates that it starts/ends with an empty frame.
  • Empty frames separate routing information from the payload.
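The frame notation above can be made concrete with a small sketch. It is not part of the protocol definition: the helper names `to_frames` and `split_routing` are made up for illustration, and it assumes that every frame is a bytes object, as in a zmq multipart message.

```python
from typing import List, Tuple


def to_frames(notation: str) -> List[bytes]:
    """Turn the "|" notation used in this issue into a list of frames.

    "|" is a frame barrier; "||" stands for one empty frame.
    """
    parts = notation.split("|")
    if notation.startswith("||"):
        parts = parts[1:]   # a leading "||" means one empty frame, not two
    if notation.endswith("||"):
        parts = parts[:-1]  # same for a trailing "||"
    return [part.strip().encode() for part in parts]


def split_routing(frames: List[bytes]) -> Tuple[List[bytes], List[bytes]]:
    """Split at the first empty frame into (routing information, rest)."""
    index = frames.index(b"")  # raises ValueError if there is no empty frame
    return frames[:index], frames[index + 1:]


frames = to_frames("Co2|CB||Some message for someone else.")
routing, rest = split_routing(frames)
print(routing, rest)
```

Here `routing` holds the namespace and the recipient name, and `rest` holds the payload frame.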

Connection

erDiagram
    Component }|--|| Coordinator : "DEALER connects to ROUTER"
    Coordinator {
        string address
        string namespace
    }
    Component {
        string ID
    }

address is for example protocol, host, and port.

Basic communication

basic communication (connect/disconnect, heartbeat)

Successful communication

sequenceDiagram
    Note over CA,Co1: Initial communication
    CA ->> Co1: || I connect
    Note right of Co1: Stores CA's address in its list
    Co1 ->> CA: CA||Welcome to namespace "Co1" and here are relevant infos
    Note left of CA: Stores "Co1" as its namespace.
    Note over CA,Co1: Some time later, a heartbeat
    CA ->> Co1: ||ping
    Note right of Co1: Updates heartbeat time.
    Co1 ->> CA: CA||pong
    Note left of CA: Updates heartbeat time.
    Note over CA,Co1: Some communication
    CA ->> Co1: Co2|CB||Some message for someone else.
    Note right of Co1: Updates heartbeat time.
    Co1 ->> CA: CA||pong
    Note right of Co1: Sends message to CB via Co2
    Note left of CA: Updates heartbeat time
    Note over CA,Co1: End of communication
    CA ->> Co1: || I disconnect from you.
    Co1 ->> CA: CA|| Acknowledge.
    Note right of Co1: Deletes CA from address list.

Notes:

  • Every message serves as a heartbeat. Heartbeat actions are not shown in the following diagrams anymore.
  • A disconnect message removes a Component from the address list.
  • Any message serves as a "connect" message
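As a rough illustration of the notes above, the Coordinator's bookkeeping could look like the following sketch. The class `AddressBook` and its method names are hypothetical, and the sketch assumes the draft rule that any message both registers the sender and counts as a heartbeat.

```python
import time
from typing import Dict, Optional


class AddressBook:
    """Hypothetical Coordinator-side store: Component name -> last-seen time."""

    def __init__(self) -> None:
        self._last_seen: Dict[str, float] = {}

    def message_received(self, name: str, now: Optional[float] = None) -> None:
        """Any message (ping or not) registers the sender and refreshes its heartbeat."""
        self._last_seen[name] = time.monotonic() if now is None else now

    def disconnect(self, name: str) -> None:
        """An explicit disconnect removes the Component from the list."""
        self._last_seen.pop(name, None)

    def last_seen(self, name: str) -> Optional[float]:
        """Return the last-seen time, or None for an unknown Component."""
        return self._last_seen.get(name)


book = AddressBook()
book.message_received("CA", now=0.0)  # "I connect"
book.message_received("CA", now=5.0)  # "ping", or any other message
book.disconnect("CA")                 # "I disconnect from you."
```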

Different unsuccessful communication parts

sequenceDiagram
    Note over CA,Co1: Name already used: zmq.connect raises error
    Note over CA,Co1: The CA was known, but did not send a message in a long time
    Co1 ->> CA: CA|| Are you still alive?
    Note left of CA: Does not respond.
    Note right of Co1: Deletes CA from address list.
    Note over CA,Co1: TBD: The CA was known, but did not send a message in a long time
    Note right of Co1: Deletes "CA" from address list.
    CA ->> Co1: R:"C1.CA2". S:"C1.CA". Some communication for someone else.
    Note right of Co1: Stores "CA" in its address list.
    Co1 ->> CA: R:"C1.CA". S:"C1.Co1". Acknowledge. The namespace is "C1".
    Note right of Co1: Handles the communication to CA2
    Note left of CA: Updates "C1" as its namespace.
    Note over CA,Co1: Unknown recipient
    CA ->> Co1: Co1|CA3|| Some message.
    Note right of Co1: Does not know CA3.
    Co1 ->> CA: CA|| Error: I do not know "CA3".

Components should request a heartbeat (by sending one themselves) before the time expires.
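The timeout handling in the diagram (first ask "Are you still alive?", then delete) might be sketched as follows. The function name `classify_peers` and the two thresholds `ping_after` and `drop_after` are assumptions for illustration; this issue does not fix any heartbeat times.

```python
from typing import Dict, List, Tuple


def classify_peers(
    last_seen: Dict[str, float],
    now: float,
    ping_after: float,
    drop_after: float,
) -> Tuple[List[str], List[str]]:
    """Return (Components to ask "Are you still alive?", Components to delete)."""
    to_ping: List[str] = []
    to_drop: List[str] = []
    for name, seen in last_seen.items():
        silent = now - seen
        if silent >= drop_after:
            to_drop.append(name)   # stayed silent even after being asked
        elif silent >= ping_after:
            to_ping.append(name)   # overdue, gets one more chance
    return to_ping, to_drop


to_ping, to_drop = classify_peers(
    {"CA": 0.0, "CB": 8.0, "CC": 4.0}, now=10.0, ping_after=5.0, drop_after=10.0
)
print(to_ping, to_drop)
```

A Component that wants to stay in the list would send its own heartbeat well before `ping_after` has elapsed.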

Message exchange

Message exchange in one Coordinator

sequenceDiagram
    CA ->> Co1: Co1|CB|| Give me property A.
    Co1 ->> CA: CA|| Acknowledge.
    Co1 ->> CB: CB|| Give me property A. ||Co1|CA
    CB ->> Co1: Co1|CA|| Property A has value 5.
    Co1 ->> CB: CB|| Acknowledge.
    Co1 ->> CA: CA|| Property A has value 5. ||Co1|CB
    Note over CA,Co1: The first message would work equally with a local namespace:
    CA ->> Co1: CB|| Give me property A.

Notes:

  • During the whole exchange, the conversation ID is the same.
  • The message is not modified by the Coordinator.

Questions:

  • Should we allow "local mode", i.e. without specifying any namespace (recipient only, not sender!)? See Component IDs #27
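The exchange inside one Coordinator can be sketched as a pure function that returns the frame lists the Coordinator would hand to its ROUTER socket. This is only an illustration under assumptions: `route_local` is a made-up name, the socket identity is taken to be equal to the Component name (as in the diagram), and the Acknowledge and error texts are plain-English placeholders.

```python
from typing import List, Set


def route_local(
    sender: bytes, frames: List[bytes], namespace: bytes, known: Set[bytes]
) -> List[List[bytes]]:
    """Handle frames sent by `sender`, e.g. [b"Co1", b"CB", b"", payload].

    The namespace frame may be missing ("local mode").
    """
    separator = frames.index(b"")
    routing, payload = frames[:separator], frames[separator + 1:]
    recipient = routing[-1]
    if recipient not in known:
        return [[sender, b"", b'Error: I do not know "' + recipient + b'".']]
    acknowledge = [sender, b"", b"Acknowledge."]
    # The payload is forwarded unmodified; the sender is appended after an empty frame.
    forwarded = [recipient, b"", *payload, b"", namespace, sender]
    return [acknowledge, forwarded]


messages = route_local(
    b"CA", [b"Co1", b"CB", b"", b"Give me property A."], b"Co1", {b"CA", b"CB"}
)
print(messages)
```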

Message exchange with two Coordinators.

sequenceDiagram
    CA ->> Co1: Co2|CB|| Give me property A.
    Co1 ->> CA: CA|| Acknowledge.
    Co1 ->> Co2: Co2|CB|| Give me property A.||Co1|CA
    Co2 ->> CB: CB|| Give me property A.||Co1|CA
    CB ->> Co2: Co1|CA|| Property A has value 5
    Co2 ->> CB: CB|| Acknowledge.
    Co2 ->> Co1: Co1|CA|| Property A has value 5||Co2|CB
    Co1 ->> CA: CA|| Property A has value 5||Co2|CB

During the whole exchange, the conversation ID is the same.

  • Should Coordinators acknowledge to each other the reception of a message (reliability, see Reliability #4)?
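With two Coordinators, each one has to decide whether a message stays on its own Node or goes to the other Coordinator. A minimal sketch of that decision, assuming the routing frames are either [namespace, name] or just [name] (the hypothetical "local mode"); `next_hop` is an illustrative name:

```python
from typing import List


def next_hop(routing: List[bytes], own_namespace: bytes) -> bytes:
    """Return the peer to which a Coordinator passes the message next."""
    if len(routing) == 1:
        return routing[0]  # no namespace given: deliver locally
    namespace, name = routing
    # Own namespace: deliver to the Component. Otherwise: pass to that Coordinator.
    return name if namespace == own_namespace else namespace


print(next_hop([b"Co2", b"CB"], b"Co1"))  # Co1 forwards to Co2
print(next_hop([b"Co2", b"CB"], b"Co2"))  # Co2 delivers to CB
```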
@BenediktBurger BenediktBurger added documentation Improvements or additions to documentation distributed_ops Aspects of a distributed operation, networked or on a node messages Concerns the message format labels Jan 30, 2023
@bilderbuchi
Member

As mermaid diagrams are not rendered in a PR, I collect the protocol definitions here.

When we set up CI (#13) we can enable that RTD renders docs for PRs, too (https://docs.readthedocs.io/en/stable/pull-requests.html). I think that should make it possible to at least inspect the results.

@bilderbuchi
Member

bilderbuchi commented Jan 30, 2023

Don't you want to abbreviate, e.g. Coordinator Co1, Co2, and Components C1, C2,... (or CA, CB,...) -- saves a lot of typing and space in the diagrams?

Successful communication: Some communication.

The message "R:"C1.Component". S:"C1.Coordinator". Acknowledge.", I think, should not be an ACK, but the reply from C1.Component2. Otherwise, this communication is now over without C1 getting the reply it is actually interested in?

Also, is the namespace of a Coordinator not the same as its name? Or do you want to treat the namespaces differently?

A disconnect message has the same consequences as no message during heartbeat time. However, a disconnect makes the name available again for another Component.

I think the Coordinator should at least ask with GET_STATUS once, before it disconnects.

  • Note: we need to specify the allowed heartbeat/"last seen" time, and what fraction of that time Components should wait before sending a heartbeat.

Any message serves as a "connect" message

I disagree. The first message has to be a CONNECT. Otherwise, we end up mixing commands and their replies, and/or the protocol just gets needlessly complicated. Also, a Component will not know the namespace yet. Also, a Component that just connected does not even know which other Components are available, as it did not receive the address list yet. etc etc

@bilderbuchi
Member

bilderbuchi commented Jan 30, 2023

Name already used

That one has a funny hole. In the reply, we are using R:".Component", but this Component already exists, so this message will go to the wrong Component! I guess we need a different flow for establishing a connection.

The Component was known, but did not send a message in a long time

I already mentioned the problem with implicitly establishing connections.

Components should request a heartbeat (by sending one themselves) before the time expires.

I'm not sure. I'm OK with sending heartbeats out regularly, but I don't think one should get a reply back. We should check how other protocols handle this.
If you want to know if someone is alive (but are not sure) you should ask for GET_STATUS or a separate GET_ALIVE. The latter has the added benefit that a Component can now realise that its heartbeats have not been heard in time, and tweak the interval.

@bilderbuchi
Member

bilderbuchi commented Jan 30, 2023

Message exchange in one Coordinator
The first message would work equally with a local namespace (.):

You mean the first message in the shown exchange? Or the first after connection? (I assume the former)

We already talked about the additional ACKs, and message symmetry elsewhere, but I'm not through with my notifications, yet.

Should we allow "local mode", i.e. without specifying any namespace (recipient only, not sender!)?

This feels attractive for single Node setups, to not needlessly prefix the coordinator namespace all the time. Have we discarded the notion that a Coordinator strips its name from a namespace when sending locally?

If we allow local namespace-less addresses, we should be consistent:

  • Addresses without namespace are Node-local (also in the Address book the Coordinator sends to its own Components). Therefore, all Node-local comms are without namespace.
  • A Coordinator adds its namespace to the Sender address of messages going to other Nodes
  • A Coordinator strips its namespace from the Recipient address of messages coming from other Nodes
  • A Component puts its own name (w/o namespace) as the Sender address
  • A Component knows the necessary address from the Coordinator. ComponentB is local, Coordinator2.ComponentB is another, remote Component

I think this should then be transparent and consistent for single node setups, even ones that grow into multinode later.

  • Do we have to use the leading period in the address (even without namespace)? It feels weird/ugly, and doesn't help with zmq topic filtering iiuc.
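The proposed rules for namespace-less, Node-local addresses could be sketched like this. It assumes dotted addresses such as Coordinator2.ComponentB without a leading period; the function names are made up, and none of this is decided yet.

```python
def add_namespace(address: str, namespace: str) -> str:
    """Qualify a Node-local address before the message leaves the Node."""
    return address if "." in address else f"{namespace}.{address}"


def strip_namespace(address: str, namespace: str) -> str:
    """Make an address Node-local again if it carries this Node's namespace."""
    prefix = namespace + "."
    return address[len(prefix):] if address.startswith(prefix) else address


# Outgoing: the Coordinator qualifies the Sender address.
print(add_namespace("ComponentA", "Coordinator1"))
# Incoming: the Coordinator strips its own namespace from the Recipient address.
print(strip_namespace("Coordinator1.ComponentA", "Coordinator1"))
```

Addresses that already name another Node pass through both functions unchanged.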

@BenediktBurger
Member Author

That one has a funny hole. In the reply, we are using R:".Component", but this Component already exists, so this message will go to the wrong Component! I guess we need a different flow for establishing a connection.

In the example communication, I did show only the frames actually sent. But that is not the whole truth:
A ROUTER socket (used in the Coordinator) prepends every received message with an address (some bytes value, say "vioasdf"). If you send a message with the ROUTER socket, you have to prepend the data you want to send with that address, so that zmq knows to whom to send the data. So actually you call send_multipart(["vioasdf", data_frame0, data_frame1...]).
The Coordinators keep a list of known Sender names and the corresponding addresses (the local part of the "address book").
Therefore, you can always respond to any connected peer if you know its address. In particular, the Coordinator can tell a Component usurping the name "Component" that the name is already taken, because it knows the usurper's address (from the message it received).
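A minimal sketch of that idea, without any actual sockets: the Coordinator maps each name to the address frame it received and always answers to the address, not to the name. The function `register` and the reply texts are assumptions for illustration; the returned list is what would be passed to send_multipart.

```python
from typing import Dict, List


def register(book: Dict[bytes, bytes], address: bytes, name: bytes) -> List[bytes]:
    """Return the reply frames for a peer at `address` claiming `name`."""
    owner = book.get(name)
    if owner is not None and owner != address:
        # The reply goes to the usurper's address, so the rightful owner never sees it.
        return [address, b"", b"Error: name already taken."]
    book[name] = address
    return [address, b"", b"Acknowledge."]


book: Dict[bytes, bytes] = {}
first = register(book, b"vioasdf", b"Component")
second = register(book, b"qwertz", b"Component")
print(first, second)
```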

@bilderbuchi
Member

Message exchange with two Coordinators.

Should Coordinators acknowledge to each other the reception of a message

IMO, no, ACKs should only (primarily?) be for messages that would otherwise not get a reply. The reception of the reply is the acknowledgement. If no reply comes, you know something went wrong, and can retry and/or notify upstream Components. E.g.

sequenceDiagram
    CA ->> Coord1: R:"C2.CB". S:"C1.CA". Give me property A.
    Coord1 ->> Coord2: R:"C2.CB". S:"C1.CA". Give me property A.
    Coord2 ->> CB: R:"C2.CB". S:"C1.CA". Give me property A.
    Note over CB: No response/timeout
    CB -->> Coord2: <missing message>
    Coord2 ->> Coord1: R:"C1.CA". S:"C2.CB". Error: C2.CB did not respond
    Coord1 ->> CA: R:"C1.CA". S:"C2.CB". Error: C2.CB did not respond

@bilderbuchi
Member

bilderbuchi commented Jan 30, 2023

In the example communication, I did show only the frames actually sent. But that is not the whole truth:

Ah, devil's in the details! All clear!
The usurper could then react by reporting with another, mutated, name. To avoid a back and forth with _1, _2, _3 suffixes, the Coordinator could even reply with a suggestion it knows is still free: Why don't you call yourself "Component_42", instead?
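The name suggestion could be as simple as the following sketch. The `_1`, `_2`, ... suffix scheme follows the example above, while the function name `suggest_free_name` is made up.

```python
import itertools
from typing import Set


def suggest_free_name(name: str, taken: Set[str]) -> str:
    """Return `name` if it is free, else the first free `name_<number>`."""
    if name not in taken:
        return name
    for number in itertools.count(1):
        candidate = f"{name}_{number}"
        if candidate not in taken:
            return candidate
    raise AssertionError("unreachable: itertools.count never ends")


print(suggest_free_name("Component", {"Component", "Component_1"}))
```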

@BenediktBurger
Member Author

Also, is the namespace of a Coordinator not the same as its name? Or do you want to treat the namespaces differently?

I thought that we could name a Coordinator just "Coordinator", as it is unique in its namespace. That way you can always address your own Coordinator without supplying any namespace, regardless of what the namespace is.

I think the Coordinator should at least ask with GET_STATUS once, before it disconnects.

You mean, instead of dropping a name, it sends an "are you still alive?" message, and if no reply arrives, the name is removed from the list? Good idea. So you give the code a chance to respond if it forgot its heartbeat.

I disagree. The first message has to be a CONNECT. Otherwise, we end up mixing commands and their replies, and/or the protocol just gets needlessly complicated.

Due to heartbeats and incoming messages (which you cannot control), you always risk receiving some other message between sending a request and receiving its reply.

Also, a Component will not know the namespace yet.

Yes, but you can already send local messages.

Also, a Component that just connected does not even know which other Components are available, as it did not receive the address list yet. etc etc

But the user might know the name of the Component they want to connect to.

Another question:

  • If a "connect" is necessary, how do we deal with a dying (and restarting) Coordinator? Without the "connect" message, everything would continue as usual.

The Component was known, but did not send a message in a long time

With that sentence, I meant that the Component did not send any heartbeat for some time.

You mean the first message in the shown exchange? Or the first after connection? (I assume the former)

I wanted to give an example of "local" communication without specifying the namespace.

@BenediktBurger
Member Author

Do we have to use the leading period in the address (even without namespace)? It feels weird/ugly, and doesn't help with zmq topic filtering iiuc.

No. The leading period is not necessary, we could decide to drop it altogether. For the data protocol (topic filtering) we should use the full name.

The usurper could then react by reporting with another, mutated, name. To avoid a back and forth with _1, _2, _3 suffixes, the Coordinator could even reply with a suggestion it knows is still free: Why don't you call yourself "Component_42", instead?

I did not think about that, as I assumed that humans give the names, but that is an idea.

@BenediktBurger
Member Author

Have we discarded the notion that a Coordinator strips its name from a namespace when sending locally?

I started an issue regarding that in #27. Based on the considerations given there, I prefer to always use the full name, and I used it in the examples, but that is not yet decided.

@BenediktBurger
Member Author

If no reply comes, you know something went wrong, and can retry and/or notify upstream Components. E.g

I would not put the burden of checking for an answer onto the Coordinator, as it does not know whether an answer is required.

@bilderbuchi
Member

You mean, instead of dropping a name, it sends an "are you still alive?" message, and if no reply arrives, the name is removed from the list? Good idea. So you give the code a chance to respond if it forgot its heartbeat.

Exactly.

But the user might know the name of the Component they want to connect to.

We are trying to specify the protocol, though, with as little as possible relying on user capability. ;-)

Due to heartbeats and incoming messages (which you cannot control), you always risk receiving some other message between sending a request and receiving its reply.

Yeah, but then you have different conversation IDs for different "topics", and a GET is a different thing from a CONNECT, why mix it up. Also, your protocol state machine gets easier if it starts with one option, a CONNECT, not any message?!

Yes, but you can already send local messages.

I think we need to decide if we always use the full addresses or not, for this.

If a "connect" is necessary, how do we deal with a dying (and restarting) Coordinator? Without the "connect" message, everything would continue as usual.

I fear I don't understand. If the coordinator is "dead", how can everything continue as usual? Aren't all the connections dead? It did not send heartbeats. How does the CONNECT message from a Component come into play here?
Also, if a Coordinator dies, we are in deep shit already, anyway, no? :D

I did not think about that, as I assumed that humans give the names, but that is an idea.

Thanks. Sure humans can do that, but thinking of pymeasure, people also leave their instrument names alone most of the time, and it will be nice if we automatically disambiguate.

No. The leading period is not necessary, we could decide to drop it altogether. For the data protocol (topic filtering) we should use the full name.

👍

I would not put the burden of checking for an answer onto the Coordinator, as it does not know whether an answer is required.

Elsewhere we talked about that a message always requires a reply (even if it is null) - I thought that to be the original purpose of the ACK - a reply in case no data/content is expected.

@bilderbuchi
Member

oh man, multi-parallel processing of discussion points 😓 time for dinner soon 😁

@BenediktBurger
Member Author

I fear I don't understand. If the coordinator is "dead", how can everything continue as usual?

If a Coordinator is restarted (due to being an OS service etc.), all the Components reconnect automatically (in zmq), without knowing that they reconnected.
Due to constant heartbeats, the Coordinator rebuilds its address book quickly and can route messages again. Maybe a few messages will get rejected, but not all.

If we require a new "connect" message, all Components have to take an action.

@bilderbuchi
Member

If a Coordinator is restarted (due to being an OS service etc.), all the Components reconnect automatically (in zmq), without knowing that they reconnected. Due to constant heartbeats, the Coordinator rebuilds its address book quickly and can route messages again. Maybe a few messages will get rejected, but not all.

If we require a new "connect" message, all Components have to take an action.

OK, I think we are maybe talking about two different "connect" events. You are talking (afaict) about the zmq connection, which automatically gets reconnected.
I was talking about exchanging the necessary info for a Component to interoperate with a Coordinator in LECO -- the address book, avro schemas, the Node's namespace, handshake stuff, whatever might come later.

If the Component does not even realise that the connection was gone for a while, indeed, why would it need a new CONNECT?
However, at the first time it connects (also after it restarts), it needs some info (currently, mainly the address book and avro handshake), and that I would like to handle in a separate message exchange, not interspersed with regular control messages.
