Enable concurrent processing of incoming and outgoing MQTT messages #2327
Conversation
Codecov Report
Robot Results
@jarhodes314 Running the same benchmarks with this PR fixes the bug. I'll spend a bit of time to see if I can add a reasonable system test to the PR, though as you already mentioned, it might be tricky to write a stable benchmark that will run on every machine.
The proposal to asynchronously process the MQTT messages makes sense. My only concern is about the relevance of the `MqttMessageBox` struct having both the `send` and `recv` aspects together, as the new design uses those aspects separately from two different contexts anyway. So, instead of an `MqttMessageBox`, the actor could just keep one `SimpleMessageBox` to receive the peer messages and another wrapper over the `peer_senders`, with that specialized `send()` method from the existing `MqttMessageBox` impl, as part of its state. But that refactoring feels a bit out of scope for this PR, hence approving. I've left some minor suggestions/queries, though.
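For readers following along, here is a minimal sketch of the shape the reviewer is proposing, using hypothetical simplified stand-ins (plain `tokio::sync::mpsc` channels and `String` messages) for the real `TopicFilter`, `LoggingSender`, and `MqttMessage` types from the tedge actor crates:

```rust
// Hypothetical, simplified stand-ins; the real TopicFilter, MqttMessage
// and sender types live in the thin-edge actor crates.
use tokio::sync::mpsc;

type MqttMessage = String;

/// Stand-in for `TopicFilter`: here just a topic prefix to match on.
struct TopicFilter(String);

impl TopicFilter {
    fn accept(&self, message: &MqttMessage) -> bool {
        message.starts_with(&self.0)
    }
}

/// The proposed shape: receiving stays in a plain message box, while
/// sending lives in its own wrapper around `peer_senders`, so each
/// aspect can be held and used from a different context.
struct PeerSenders {
    peer_senders: Vec<(TopicFilter, mpsc::Sender<MqttMessage>)>,
}

impl PeerSenders {
    /// Specialized send: forward the message to every peer whose
    /// topic filter accepts it.
    async fn send(&mut self, message: MqttMessage) {
        for (filter, sender) in &self.peer_senders {
            if filter.accept(&message) {
                let _ = sender.send(message.clone()).await;
            }
        }
    }
}
```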
I agree that refactoring that away …
I've now done an internal refactoring that includes splitting the `MqttMessageBox`.
Approved. I ran the new benchmark system tests 5 times and they seem to be ok.
Signed-off-by: James Rhodes <jarhodes314@gmail.com>
Signed-off-by: Reuben Miller <reuben.d.miller@gmail.com>
Force-pushed from 7f94262 to 7509cbf
@@ -126,6 +126,14 @@ pub struct MqttMessageBox {
    peer_senders: Vec<(TopicFilter, LoggingSender<MqttMessage>)>,
}

pub struct FromPeers<'a> {
    input_receiver: &'a mut LoggingReceiver<MqttMessage>,
Hope we can keep owned copies of these once `run` starts taking `self` instead of `&mut self`.
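To illustrate that point: a rough sketch, again with hypothetical simplified types, of how a `run(self)` that consumes the actor lets each half be moved into its own task by value, so no borrowed `&'a mut` fields (and no lifetime parameters) are needed:

```rust
// Hypothetical simplified types standing in for the tedge actor types.
use tokio::sync::mpsc;

struct FromPeers {
    input_receiver: mpsc::Receiver<String>,
}

struct ToPeers {
    peer_senders: Vec<mpsc::Sender<String>>,
}

struct MqttActor {
    from_peers: FromPeers,
    to_peers: ToPeers,
}

impl MqttActor {
    async fn run(self) {
        // `self` is consumed, so both halves are owned values.
        let ToPeers { peer_senders } = self.to_peers;
        tokio::spawn(async move {
            // Outgoing side: fan messages out over `peer_senders` here.
            for sender in &peer_senders {
                let _ = sender.send("outgoing".to_string()).await;
            }
        });

        // Incoming side: runs concurrently with the task spawned above.
        let mut input_receiver = self.from_peers.input_receiver;
        while let Some(_message) = input_receiver.recv().await {
            // Process each incoming message here.
        }
    }
}
```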
}

pub struct ToPeers<'a> {
    peer_senders: &'a mut Vec<(TopicFilter, LoggingSender<MqttMessage>)>,
We could have just cloned that `Vec` to avoid these lifetime parameters, but it's okay, as it is consistent with its counterpart.
I don't think the lifetimes are that complicated, and I'll modify the `run(self)` PR to remove these once this is merged, as they won't be necessary after that.
@@ -137,6 +145,43 @@ impl MqttMessageBox {
    }
}

fn split(&mut self) -> (FromPeers<'_>, ToPeers<'_>) {
This `split` method, as well as the whole struct `MqttMessageBox`, can be removed, as they are used only by the `MqttActor`.
Proposed changes
If tedge-mapper is overwhelmed with messages, it deadlocks, preventing any further processing of MQTT messages until the mapper is restarted. The root cause of the deadlock is detailed in #2326 (comment), but the pertinent information here is that the MQTT actor blocks itself: while stuck in the middle of processing an incoming message, it waits for outgoing messages to be processed before it can process further incoming ones.
This PR spawns a task to process outgoing messages, which allows the mapper to process incoming and outgoing messages concurrently. This avoids the deadlock scenario that is the root cause of the bug.
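As a minimal, runnable sketch of this pattern (not the actual tedge code, which wires up `LoggingSender`/`LoggingReceiver` from the actor crates), assuming tokio and plain `String` messages:

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (outgoing_tx, mut outgoing_rx) = mpsc::channel::<String>(16);
    let (incoming_tx, mut incoming_rx) = mpsc::channel::<String>(16);

    // The fix: outgoing messages are drained by a dedicated task, so a
    // full outgoing channel can no longer block the incoming loop.
    let publisher = tokio::spawn(async move {
        while let Some(message) = outgoing_rx.recv().await {
            println!("publish to MQTT broker: {message}");
        }
    });

    // Simulate a burst of incoming MQTT messages.
    for i in 0..5 {
        incoming_tx.send(format!("te/device/main///m/{i}")).await.unwrap();
    }
    drop(incoming_tx);

    // The incoming loop only ever awaits the outgoing *channel*, never
    // the broker itself, so incoming and outgoing proceed concurrently.
    while let Some(message) = incoming_rx.recv().await {
        outgoing_tx.send(format!("mapped: {message}")).await.unwrap();
    }
    drop(outgoing_tx);
    publisher.await.unwrap();
}
```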
Running this change locally, I can reliably send 20,000 messages to the mapper running on my laptop using the simulate.py script (see the issue description) without dropping any messages (previously, sending just 500 messages would reliably fail). Sending significantly more (e.g. 100,000 messages) will cause messages to be dropped, but the mapper does continue to run as normal once the pressure is relieved.

I haven't added any tests for this change. Verifying we don't deadlock depends on knowing the capacity of the machine we're running the tests on (and it also involves hammering Cumulocity to some extent). But the existing tests should verify that the changes don't break anything (given that all that's changed is where existing code runs, I would expect any bugs introduced by the change to be really obvious).
Types of changes
Paste Link to the issue
#2326
Checklist
cargo fmt as mentioned in CODING_GUIDELINES
cargo clippy as mentioned in CODING_GUIDELINES