Infinite retry loop with heavy CPU load #184

Closed · ColinFinck opened this issue Apr 17, 2023 · 5 comments · Fixed by #186

Comments

@ColinFinck
Collaborator

I have tested tokio-modbus on a misbehaving Modbus RTU bus that currently responds to all read requests with an infinite stream of zeros. This is not a theoretical situation; it can happen when a single device on the bus is misconfigured.
I don't expect tokio-modbus to recover from such a situation, which is why I wrap the read_holding_registers call in tokio::time::timeout and drop the client session after every error or timeout.

However, tokio-modbus currently tries to recover on its own and thereby enters an infinite retry loop without any delay, let alone exponential backoff. My timeout kills the read operation after a few seconds without data, but until then tokio-modbus retries every millisecond and causes heavy CPU load.
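
For reference, the calling side looks roughly like this. This is a simplified sketch against the client API at the time of writing, not my actual code; the register address, quantity, and timeout are placeholders:

use std::time::Duration;
use tokio_modbus::client::{Context, Reader};

// Simplified sketch: every read goes through a timeout, and on any error or
// timeout the caller drops the whole client context and reconnects.
// Register address, quantity, and timeout duration are placeholders.
async fn read_with_timeout(ctx: &mut Context) -> Option<Vec<u16>> {
    let read = ctx.read_holding_registers(0x0000, 8);
    match tokio::time::timeout(Duration::from_secs(3), read).await {
        Ok(Ok(registers)) => Some(registers),
        Ok(Err(err)) => {
            log::warn!("Modbus read failed: {err}");
            None // drop the session and reconnect
        }
        Err(_elapsed) => {
            log::warn!("Modbus read timed out");
            None // drop the session and reconnect
        }
    }
}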

In particular, I'm talking about this loop:

loop {
    let mut retry = false;
    let res = get_pdu_len(buf)
        .and_then(|pdu_len| {
            debug_assert!(!retry);
            if let Some(pdu_len) = pdu_len {
                frame_decoder.decode(buf, pdu_len)
            } else {
                // Incomplete frame
                Ok(None)
            }
        })
        .or_else(|err| {
            log::warn!("Failed to decode {} frame: {}", pdu_type, err);
            frame_decoder.recover_on_error(buf);
            retry = true;
            Ok(None)
        });
    if !retry {
        return res;
    }
}

get_pdu_len reads a zero byte here and therefore returns an Err, which is caught by the or_else branch. That branch calls recover_on_error to clear that zero byte, sets retry to true, and the loop immediately tries again.

I'm open to any solution that adds a short delay here: either a retry_delay parameter that is passed to tokio::time::sleep before every retry, or a user-supplied function that is called before every retry (which would allow the user to implement exponential backoff).
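
To make the second variant a little more concrete, here is a completely standalone sketch; none of these names exist in tokio-modbus, and inside the codec the recovery loop would simply call the supplied function before each retry:

use std::time::Duration;

// Standalone sketch only: the caller supplies a function that maps the retry
// count to a delay, so exponential backoff becomes a one-line policy. The
// loop still retries indefinitely, but it pauses instead of burning CPU.
async fn retry_with_delay<T, E: std::fmt::Display>(
    mut attempt: impl FnMut() -> Result<T, E>,
    delay_for: impl Fn(u32) -> Duration,
) -> T {
    let mut retries = 0;
    loop {
        match attempt() {
            Ok(value) => return value,
            Err(err) => {
                retries += 1;
                log::warn!("attempt {retries} failed: {err}");
                // Back off before the next attempt instead of spinning.
                tokio::time::sleep(delay_for(retries)).await;
            }
        }
    }
}

An exponential policy would then just be a closure like |n| Duration::from_millis(1u64 << n.min(10)).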

CC @flosse @uklotzde

@uklotzde
Member

This edge case was probably not considered back then. The loop should definitely terminate if no progress is made and recovery fails. I would prefer exiting the loop and passing control back to the client application, which could then handle the retry depending on the context.

@uklotzde
Member

I remember that we needed this recovery loop to handle spurious failures from a device that was connected via a Serial-over-USB adapter. The code is from the pre-async Tokio 0.3 era, when recovering from errors in async code was much more complicated.

Making the library code less stateful and opinionated should be the goal. If you see a chance to simplify the code by removing this poor man's approach to error recovery, don't hesitate to do it.

@ColinFinck
Collaborator Author

> I remember that we needed this recovery loop to handle spurious failures from a device that was connected via a Serial-over-USB adapter.

Recovering from spurious failures that actually happen in practice sounds like a good reason to keep this code instead of giving up immediately and passing control back to the caller.

Instead of adding a short delay, another option would be to define a maximum number of retries.
The retry boolean would be replaced by a counter that is incremented on every retry and reset to zero on every success. When the maximum number of retries (e.g. 20) is reached without success, we could fail and pass control back to the caller. What do you think about that?
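
Roughly sketched against the snippet from above (the limit of 20, the error kind, and the message are just placeholders; I'm also assuming the surrounding decode function returns a std::io::Result):

// Placeholder limit; this could also be made configurable later.
const MAX_RETRIES: usize = 20;

let mut retries = 0;
loop {
    let mut retry = false;
    let res = get_pdu_len(buf)
        .and_then(|pdu_len| {
            debug_assert!(!retry);
            if let Some(pdu_len) = pdu_len {
                frame_decoder.decode(buf, pdu_len)
            } else {
                // Incomplete frame
                Ok(None)
            }
        })
        .or_else(|err| {
            log::warn!("Failed to decode {} frame: {}", pdu_type, err);
            frame_decoder.recover_on_error(buf);
            retry = true;
            Ok(None)
        });
    if !retry {
        // Success or an incomplete frame: leave the loop, which also
        // discards the retry counter.
        return res;
    }
    retries += 1;
    if retries >= MAX_RETRIES {
        return Err(std::io::Error::new(
            std::io::ErrorKind::InvalidData,
            "giving up after repeated frame decoding errors",
        ));
    }
}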

@uklotzde
Member

Sure, a retry limit would also be reasonable. Ideally configurable, but a constant is probably sufficient for now.

ColinFinck mentioned this issue on Apr 21, 2023
@ColinFinck
Collaborator Author

I've implemented the retry limit in PR #186
