Reliability #4

Open
bilderbuchi opened this issue Jan 2, 2023 · 25 comments

@bilderbuchi
Member

@mcdo0486 raises the importance of protocol reliability:

However, I think we need to make sure that a data protocol doesn’t replace the current threaded pymeasure architecture. This discussion is on a proposed data protocol, not implementation, but I think the protocol design can easily creep into fundamental architecture changes.

The most important thing with an experiment design and operation framework is reliable recording of data as fast as possible. I can see synchronization and data loss being potential problems if message passing was moved entirely over to zmq.

Pymeasure used to use zmq for message passing but moved to thread queues instead. If you look through the commits, there are comments like this ominous one related to removing zmq in favour of thread queues: “[listens through a thread queue] to ensure no messages are lost”
pymeasure/pymeasure@390abfd

Listeners and workers were moved to a threaded approach. While the Worker will set up a zmq publisher and emit data over it, it doesn’t emit data to anything by default. You can rip out all the zmq logic in the Worker class and your procedure would run just fine.

That isn’t to say we can’t do things better now - we can - but the most important thing is fast and reliable data recording, as mentioned.

So in sum, I don’t think there should be a “new measurement paradigm” but an “additional measurement paradigm” that stays compatible with current workflow with threads.

@bilderbuchi
Member Author

Benedikt:
I did not want to supersede the current style, but to supplement it.
Obviously there are advantages if you do it in one program.
Actually, I don't care if some data points are lost every once in a while; often I just need some information, and it is more important to be able to put the parts together easily in a different way.
I just use a single computer, but I like Zmq, as it gives a lot of flexibility in the setup without writing more than the variables to measure in a text field.
However, I value your input and it is good to keep that use case (fast and very reliable) in mind.

Separate networked (zmq) and local (threaded) modes have been proposed.

Christoph:
It seems that zmq can be "hardened" with some better handshaking/synchronisation, nodrop flags, etc. I would think the "inproc" connection variant should not be that different from thread queues.
We could add the option of a "single-node mode", so people can use that if they want.
If the reliability is as good as plain threads, but we can avoid duplicating business logic, that would be a win.
In my limited experience, most of the problems I encountered regarding lost messages were due to synchronisation issues at the beginning.

Benjamin:
Yes, I guess if zmq were inherently flawed in dropping messages unnoticed, it would not have found adoption by anyone. The point of zmq is just that much needs to be done by oneself which might already be present in a bigger framework - which in turn creates a higher entry threshold. Looking at pymeasure/pymeasure@390abfd, the zmq implementation which waited for messages was rather simple; things can surely be done better than I did them, but even I was able to make it reliable, although it's not always super quick.
One issue with publishers and subscribers is indeed that things are weird in the first second or two, but if one puts a time.sleep(2) there, I think it works just fine, at least in this configuration. I am not sure whether this is what was referred to here.
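
A minimal pyzmq sketch of that slow-joiner workaround; the port, the sleep duration, and the message are placeholders, not part of any proposal:

```python
import time
import zmq

context = zmq.Context.instance()

publisher = context.socket(zmq.PUB)
publisher.bind("tcp://127.0.0.1:5556")

subscriber = context.socket(zmq.SUB)
subscriber.connect("tcp://127.0.0.1:5556")
subscriber.setsockopt_string(zmq.SUBSCRIBE, "")  # subscribe to all topics

# Without this pause, the first messages are often lost because the
# subscription has not yet propagated to the publisher ("slow joiner").
time.sleep(2)

publisher.send_string("data point 1")
print(subscriber.recv_string())
```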

Regarding the inproc mode, I am not too sure whether that is necessary if we already have all the ports set up properly. I don't think we need the performance gain it might bring that badly, and building a full parallel comms network with inproc just to substitute the queues, so that everything runs over zmq, might not be worth it. Of course, future contributors would only need to become familiar with one of the two if we only have one of the two, but I guess substituting the queues with a hardened version of zmq inproc comms would then be at the back of the roadmap.

@BenediktBurger BenediktBurger mentioned this issue Jan 9, 2023
@BenediktBurger BenediktBurger added enhancement New feature or request discussion-needed A solution still needs to be determined labels Jan 19, 2023
@bklebel
Collaborator

bklebel commented Jan 28, 2023

I think one part of the reliability problems the pymeasure developers encountered with zmq might be about timing: PUB-SUB connections need a moment to connect to each other properly, but a sleep of about a second at startup should solve that.
There are, however, more cases in which messages are dropped silently.

One case in which messages are dropped is when a ROUTER socket is supposed to send a message to an address which it does not know (because it does not exist). This case can be handled on the side of the ROUTER by checking that a message which should be sent was actually sent. For the Coordinators, this would mean that when a request for Component C1 arrives from Component C2, but C1 just died, the Coordinator can catch that (check the return value of send_multipart() on the ROUTER socket in question) and tell C2 that C1 no longer seems to be available.
When we use a Coordinator to reach Actors to SET or GET parameters, and we do not get a reply quickly, we can always consider the message dropped and ask again. If the first answer comes in later, we can discard it by its conversation ID (or reply reference, whatever we use for it in the end), as long as we keep this kind of information in the protocol. If it were only two Components talking to each other, without a Coordinator, we could just use a counter as conversation ID, and if a later answer arrives first, we can put it aside and pick it up in the proper order later. However, since we are looking at a multitude of Components, we need something a little more unique, and UUIDs worked well for me. I haven't tried to gauge the raw performance of my implementation of this communication, as typically my Actors are busy communicating with the Devices most of the time and only "interrupt" this for zmq talk every once in a while (done much better by @bmoneke), but maybe we can do that, to see whether this degree of reliability impairs the speed.
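
A hypothetical sketch of that retry-and-discard idea, with a UUID as conversation ID; the two-frame layout, the timeout, and the retry count are assumptions for illustration, not the protocol under discussion:

```python
import uuid
import zmq

context = zmq.Context.instance()
socket = context.socket(zmq.DEALER)
socket.connect("tcp://127.0.0.1:5555")  # e.g. a Coordinator's command port

poller = zmq.Poller()
poller.register(socket, zmq.POLLIN)


def request(payload: bytes, timeout_ms: int = 500, retries: int = 3) -> bytes:
    """Send a request; re-send on timeout and drop answers to stale requests."""
    for _ in range(retries):
        conversation_id = uuid.uuid4().bytes
        socket.send_multipart([conversation_id, payload])
        while poller.poll(timeout_ms):
            cid, reply = socket.recv_multipart()
            if cid == conversation_id:
                return reply
            # A late answer to an earlier, already re-sent request: discard it.
    raise TimeoutError("no reply received; consider the peer unavailable")
```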

Another thing in zmq regarding dropped messages is the high-water mark (HWM), which individual sockets (ROUTERs, afaik) use to avoid congestion (and, essentially, a possible memory leak). If a Component is very slow in accepting messages from a Coordinator, but the Coordinator continues to receive commands for this Component, it stores a certain number of those messages, up to the HWM, after which it starts dropping new messages it should actually send, as otherwise it would need to store ever-increasing numbers of messages waiting to be sent to the Component. I guess we should check whether messages have been dropped because the HWM was hit (especially in the command-protocol Coordinators, which have the ROUTER sockets as of the current discussion), and think about a contingency, or at least make it understandable under what circumstances messages might get lost, and definitely warn about it once it starts to occur.
We don't like silently dropped messages, but if messages are dropped because of something like the HWM, we might want to know about it anyways, since it tells us about a problem in our design.
For example, if we have a case where we have a continuous buildup of unsent messages because the rate at which a Component consumes messages is always lower than the rate with which we send messages to it (i.e. the Component cannot keep up), we should know about it, and change our setup.
However, the HWM is manually settable.
If we have a case where a Component becomes busy/blocked every once in a while for a certain duration and new messages aren't consumed, but we know that after this blocked duration the backlog can be worked off, we can easily increase the HWM (if necessary) to allow for this congestion to occur without losing messages in the meantime. Possibly, we could have the HWM increased the first two times it gets hit while publishing warning logs, after which we start publishing error logs without increasing it further, to avoid the mentioned memory leak.
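
For reference, a small pyzmq sketch of adjusting the HWM per socket; the value 4000 is arbitrary and only illustrates that the limit can be raised before bind/connect:

```python
import zmq

context = zmq.Context.instance()
router = context.socket(zmq.ROUTER)

# Convenience setter that adjusts both send and receive limits at once.
router.set_hwm(4000)
# Or set the directions individually:
router.setsockopt(zmq.SNDHWM, 4000)  # outgoing queue limit (per peer)
router.setsockopt(zmq.RCVHWM, 4000)  # incoming queue limit (per peer)

router.bind("tcp://127.0.0.1:5555")
```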

As far as I understand it, the reliability benefit of using inproc versus tcp (the comms must stay within a single process for inproc to work) is cutting the round trip across the OS network infrastructure; the above-described cases of "silently" dropped messages would occur in both modes, I think. The great benefit of inproc is speed, cutting away quite a bit of overhead.

@BenediktBurger
Member

It is great to have you on board, @bklebel. Your additional insight into zmq helps a lot.

I like the idea of checking for dropped messages, I did not think about that part.
Also the idea of checking the address is great. Thanks.
Btw, every socket has a high-water mark, and that message buffer exists on both sides of a connection. By default, the value is 100, I think. So you have 100 messages in the sending socket and 100 in the receiving one as a backlog before the sockets start to drop messages.

Some sockets block when reaching the HWM, others drop messages. I have to look it up.

Reliability

  • Check whether a message is sent
  • Check whether the high-water mark is reached
  • Maybe check for late answers

@bklebel
Collaborator

bklebel commented Jan 28, 2023

I just looked the HWM up again; I was very close. The zmq guide says

In ZeroMQ v2.x, the HWM was infinite by default. This was easy but also typically fatal for high-volume publishers. In ZeroMQ v3.x, it’s set to 1,000 by default, which is more sensible.

and

Lastly, the HWMs are not exact; while you may get up to 1,000 messages by default, the real buffer size may be much lower (as little as half), due to the way libzmq implements its queues.

But which number to put in exactly is then an implementation detail.

@bklebel
Collaborator

bklebel commented Jan 28, 2023

Thanks for your praise! I really enjoy working on this, even though I have a hard time keeping up with all the different conversations here.

Btw, every socket has a high water mark

Okay, true, so we should keep all channels in mind, not just the control one.
From the zmq guide again:

PUB and ROUTER sockets will drop data if they reach their HWM, while other socket types will block

So, generally, the receiving sockets in our current idea of the architecture block, while sending sockets drop messages. Blocking sockets are not a problem in terms of reliability; dropping sockets are. But if we think that simply dropping messages in this case is not the most sensible behaviour, because we are concerned with reliability, we could, whenever we see that a message was not sent because the HWM was reached, just schedule it to be sent again.
We use PUB sockets e.g. in Actors on the data channel, and ROUTER sockets in the Coordinators. For the Coordinators I am not sure how to solve that, but for the Actors it could be as simple as giving one Actor an increasing number of PUB sockets for the data channel to choose from, as we allow multiple PUB sockets in that direction anyways. I am not sure whether we actually want to go so far as to implement a load-balancing scheme for the different PUB sockets INSIDE the Actors, but that would become reliable in the sense of not dropping messages, I think. However, I have a hard time imagining what will happen if we have multiple Actors which push out data so fast that they need multiple concurrent PUB sockets to fan out the data - in the end, all of them send it to the same one Coordinator, which will get quite a lot of traffic to handle...
But maybe we want to cross that bridge when we come to it, as the next step would be to have multiple parallel proxies for the data-channel Coordinator, with again some kind of load-balancing scheme to decide which proxy an Actor should send their data to... I am really unsure whether this is within our scope. Pushing out warnings and errors once messages are dropped needs to be okay at some point; if someone has this kind of need for speed, they can then contribute a solution - zmq is quite fast in the first place, after all.

@BenediktBurger
Member

When does a PUB socket drop messages? If the recipient cannot keep up. That will happen if the recipient stalled or is really slow; both are problems of the recipient.
I'd log the dropping of messages.
The recipient should take action once it has a full buffer.

In the command protocol, a recipient (probably an Actor) might have a high backlog due to slow device communication. In that case, the Actor might send an error "Device busy" to the sender of the request.
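
A hypothetical sketch of that behaviour: the Actor answers immediately with an error instead of queueing more work once its backlog is too long. The frame layout, the MAX_BACKLOG value, and the "ERROR:Device busy" convention are made up for illustration:

```python
import zmq

MAX_BACKLOG = 10
pending_commands: list[bytes] = []

context = zmq.Context.instance()
socket = context.socket(zmq.DEALER)
socket.connect("tcp://127.0.0.1:5555")  # Coordinator address (example)

while True:
    conversation_id, command = socket.recv_multipart()
    if len(pending_commands) >= MAX_BACKLOG:
        # Reject instead of silently piling up work behind a slow device.
        socket.send_multipart([conversation_id, b"ERROR:Device busy"])
        continue
    pending_commands.append(command)
    # ... between messages, work off pending_commands by talking to the device
    #     and reply with the results ...
```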

@bilderbuchi
Member Author

You surely are aware already, but the zmq guide has a whole chapter on reliable REQ-REP patterns: https://zguide.zeromq.org/docs/chapter4/ - probably something can be picked up from there?

Also, we should probably decide what kind of reliability we want for message receipt - at-most-once, exactly-once, at-least-once? AFAIK, these all have different trade-offs. If we have a message-id field, we can probably easily handle at-most-once processing by discarding already-seen messages on the receiving side.

Also, should we pause/postpone the design-for-reliability until more of the protocol design has been done (because we know more about the trade-offs etc), or do you think it will be important to some central design questions?

@BenediktBurger
Member

BenediktBurger commented Jan 29, 2023

Thanks for linking that chapter.

Reading the part regarding heartbeats, I got the impression that it would be good if the Coordinator acknowledged every message received (this serves as a Coordinator heartbeat, and the Component knows that its message is on its way).

I think zmq ensures that a message is received exactly once or dropped.
So if we always require a response (at least an acknowledgment), an application can determine whether its message arrived.
And we can introduce checks that no messages are dropped due to buffer overflow (HWM), such that messages are rejected instead of dropped.

EDIT: Mermaid diagram of the message flow

```mermaid
sequenceDiagram
    Component1 ->> Coordinator: "To:Component2,From:Component1. Give me property A"
    Coordinator ->> Component1: "ACK: I got your message"
    Coordinator ->> Component2: "To:Component2,From:Component1. Give me property A"
    Component2 ->> Coordinator: "To:Component1,From:Component2. Property A has value 5"
    Coordinator ->> Component2: "ACK: I got your message"
    Coordinator ->> Component1: "To:Component1,From:Component2. Property A has value 5"
```

The basic ideas of reliability should enter this discourse, as they might influence the protocol definition.

@BenediktBurger
Member

Reading the zmq guide (parts of it) again: we do not need a checksum, as zmq ensures that the whole message (even a multipart message) arrives in one piece.

@BenediktBurger
Member

Should we use a heartbeat pattern (as proposed in the zmq guide) and respond to every message, either with content or with an empty message (i.e. a heartbeat)?

@bilderbuchi
Member Author

bilderbuchi commented Jan 30, 2023

I guess it will be hard to know when to stop ACKing. Every message having a reply sounds nicely symmetric, though! (And it makes it much easier to reason about nested/recursive message flows.)

W.r.t. your diagram above, I did not expect the ACK from Coordinator to Component2 -- the message with the value was already the reply. I would expect the Coordinator to wait for a reply, and if none is coming, to ask again.

Same with the first ACK from Coordinator to C1 -- the reply should be the value message. If there's just an "ACK", what does the Component do with that? It is not the reply that was requested; does that mean something happened?

Note: We might (optionally/later) want to have a separate "WAIT/ACK" exchange to deal with expected long delays, but otherwise I would keep the request-response pattern direct.

@BenediktBurger
Member

W.r.t. your diagram above, I did not expect the ACK from Coordinator to Component2 -- the message with the value was already the reply. I would expect the Coordinator to wait for a reply, and if none is coming, to ask again.

I reasoned that the Coordinator does not know whether any message will go back to Component2, so it just acknowledges that it received a message and hands it on.

Same with the first ACK from Coordinator to C1 -- the reply should be the value message. If there's just an "ACK", what does the Component do with that? It is not the reply that was requested; does that mean something happened?

That ACK is a heartbeat, stating: I'm still alive, your message is on its way.

I guess it will be hard to know when to stop ACKing. Every message having a reply sounds nicely symmetric, though! (And it makes it much easier to reason about nested/recursive message flows.)

In the Message format issue I formulated the idea: each message with content is acknowledged with an empty message. That prevents infinite ACKing.

@bilderbuchi
Member Author

bilderbuchi commented Jan 30, 2023

That ACK is a heartbeat, stating: I'm still alive, your message is on its way.

Yeah, but do we need/want that? What happens in zmq if the endpoint/recipient of your message is not alive? Do you notice? Do you not get an error back?
That's not like UDP, is it?

@BenediktBurger
Member

What happens in zmq if the endpoint/recipient of your message is not alive?

The message waits happily in the outbound buffer until the endpoint comes back online, then the message is sent. You do not get an error.

@BenediktBurger
Member

That is the reason I went for the ping pong heartbeat: https://zguide.zeromq.org/docs/chapter4/#Heartbeating-for-Paranoid-Pirate

We could (to reduce data transfer) make these heartbeats without any frames (even without names!).
Or we just send heartbeats, if explicitly requested.
So an Actor which has not gotten any message in some time contacts its Coordinator, asking whether it is still alive.
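
A sketch of that on-request heartbeat, assuming a DEALER connection to the Coordinator; the b"PING"/b"PONG" frames and the timeout are illustrative, not part of the protocol yet:

```python
import zmq

HEARTBEAT_TIMEOUT_MS = 1000  # how long to wait for the Coordinator's answer

context = zmq.Context.instance()
socket = context.socket(zmq.DEALER)
socket.connect("tcp://127.0.0.1:5555")

poller = zmq.Poller()
poller.register(socket, zmq.POLLIN)


def coordinator_alive() -> bool:
    """Ping the Coordinator after a period of silence and wait briefly for a pong."""
    socket.send(b"PING")
    if poller.poll(HEARTBEAT_TIMEOUT_MS):
        return socket.recv() == b"PONG"
    return False  # no answer in time: assume the Coordinator is gone
```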

@bilderbuchi
Member Author

What happens in zmq if the endpoint/recipient of your message is not alive?

The message waits happily in the outbound buffer until the endpoint comes back online, then the message is sent. You do not get an error.

That is the reason I went for the ping pong heartbeat: https://zguide.zeromq.org/docs/chapter4/#Heartbeating-for-Paranoid-Pirate

That, of course, is very valuable context! I'll have to think...

@bklebel
Collaborator

bklebel commented Jan 30, 2023

What happens in zmq if the endpoint/recipient of your message is not alive?

The message waits happily in the outbound buffer until the endpoint comes back online, then the message is sent. You do not get an error.

This depends on the socket, I would think: a ROUTER which wants to send something to a dead connection might drop the message (silently, if not checked for internally by looking at the return value of the sending function) - or do I misunderstand something? What if the recipient is dead but the connection is "more or less still there"? I think a DEALER would store the message in its outbound buffer, but a ROUTER would drop it.

@BenediktBurger
Member

Messages just get dropped if the buffer overflows.

@bklebel
Collaborator

bklebel commented Jan 30, 2023

Messages just get dropped if the buffer overflows.

Mmmmm no, I don't think so: as per the zmq guide, undeliverable messages will get dropped by a ROUTER. This we should definitely catch, however.
I am not entirely sure what makes a message undeliverable - whether it becomes undeliverable if the peer dies, or only if the ROUTER has no idea of such a peer at all - and whether/when the ROUTER socket might erase a dead peer from its own list of known peers.

Also, for reliability we might want to take a closer look at this flowchart, which funnily does not make it into the "Reliable messaging patterns" chapter of the guide - I think because it is more about flaws in the implementation than about catching problems which occur "in the wild". It is still a very valuable resource.

@BenediktBurger
Member

BenediktBurger commented Jan 31, 2023

It says in the guide that we could (and, I think, should) catch non-routable messages instead of dropping them.

Set ROUTER socket option ZMQ_ROUTER_MANDATORY to True.

"Since ZeroMQ v3.2 there’s a socket option you can set to catch this error: ZMQ_ROUTER_MANDATORY. Set that on the ROUTER socket and then when you provide an unroutable identity on a send call, the socket will signal an EHOSTUNREACH error."

Python code (if r is the socket): r.ROUTER_MANDATORY = True (FAIL_UNROUTABLE is the same option.)
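
A sketch, assuming pyzmq, of how catching this could look in practice; the peer identity b"component_c1" is made up:

```python
import zmq

context = zmq.Context.instance()
router = context.socket(zmq.ROUTER)
router.setsockopt(zmq.ROUTER_MANDATORY, 1)  # raise instead of silently dropping
router.bind("tcp://127.0.0.1:5555")

try:
    router.send_multipart([b"component_c1", b"", b"payload"])
except zmq.ZMQError as exc:
    if exc.errno == zmq.EHOSTUNREACH:
        # Peer identity unknown to the ROUTER: report it instead of losing the message.
        print("could not route message to component_c1")
    else:
        raise
```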

@LongnoseRob

It says in the guide that we could (and, I think, should) catch non-routable messages instead of dropping them.

We should also think about whether - besides logging - dropped packages over a certain time could be used to trigger follow-up actions.
If an Actor cannot get its packages out or does not receive updates from "upstream", should it go to a failed state and trigger its device to return to a safe state? Like disabling the voltage source...
For me this is also an aspect of reliability.

@bilderbuchi
Member Author

bilderbuchi commented Jan 31, 2023

We should also think about whether - besides logging - dropped packages over a certain time could be used to trigger follow-up actions. If an Actor cannot get its packages out or does not receive updates from "upstream", should it go to a failed state and trigger its device to return to a safe state? Like disabling the voltage source... For me this is also an aspect of reliability.

I agree that it's important to enable that handling, but I'm not sure how much behaviour we should specify in the protocol. I think we should maintain a Status/error register or some such for Components (all_good, connection_congested, connection_dead, device_errored, ...), and set this accordingly, but then let the implementations decide how to react to that. (Haven't thought deeply, though)
Let's work out the Status details/flow in another issue.

@LongnoseRob

... and set this accordingly, but then let the implementations decide how to react to that. (Haven't thought deeply, though)
Let's work out the Status details/flow in another issue.

Yes, this is really more of an implementation topic; only the status definition(s) should be part of the protocol description.

@bklebel
Collaborator

bklebel commented Jan 31, 2023

Thank you @LongnoseRob, this is an important aspect! I too think that the status definition is good and useful and should be part of the protocol, but what Actors and Directors do with the respective statuses should be up to the implementation. For example, for me it would rather mean that if an Actor does not get anything anymore from upstream, it continues to do whatever it has been told, as I would like most of an experiment to be able to live on and generate data if parts of the network fail. E.g. if the responsible control Coordinator fails shortly after I started a temperature sweep, most of the time I would like both the sweep and whatever is being measured to continue to work. This way, the sweep in one direction can be fully recorded, although maybe the other sweep direction is then blocked because the Director cannot tell the Actor to start the second sweep - and by the time the first sweep finishes, I might have checked on the system and restarted whatever link broke in the middle.
For me, in this case, reliability is that if an insignificant part of the setup crashes, the rest can keep calm and carry on, instead of being dragged into a crash/block too.

@LongnoseRob

Yes, what you describe @bklebel is also a good approach.
Maybe later in the implementation we can think more deeply about how to combine safety and resilience in a meaningful way.
