Infinite retry loop with heavy CPU load #184

Closed · ColinFinck opened this issue Apr 17, 2023 · 5 comments · Fixed by #186

Comments

@ColinFinck
Collaborator

I have tested tokio-modbus on a misbehaving Modbus RTU bus that currently responds to all read requests with an infinite stream of zeros. This is not a theoretical situation; it can happen when a single device on the bus is misconfigured.
I don't expect tokio-modbus to recover from such a situation, which is why I wrap the read_holding_registers call in tokio::time::timeout and drop the client session after every error or timeout.

However, tokio-modbus currently tries to recover on its own and thereby enters an infinite retry loop without any delay, let alone exponential backoff. My timeout kills the read operation after a few seconds without data, but until then tokio-modbus retries every millisecond and causes heavy CPU load.
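
For reference, the calling side looks roughly like this. This is a simplified sketch against the client API at the time of writing, not my actual code; the register address, quantity, and timeout are placeholders:

use std::time::Duration;
use tokio_modbus::client::{Context, Reader};

// Simplified sketch: every read goes through a timeout, and on any error or
// timeout the caller drops the whole client context and reconnects.
// Register address, quantity, and timeout duration are placeholders.
async fn read_with_timeout(ctx: &mut Context) -> Option<Vec<u16>> {
    let read = ctx.read_holding_registers(0x0000, 8);
    match tokio::time::timeout(Duration::from_secs(3), read).await {
        Ok(Ok(registers)) => Some(registers),
        Ok(Err(err)) => {
            log::warn!("Modbus read failed: {err}");
            None // drop the session and reconnect
        }
        Err(_elapsed) => {
            log::warn!("Modbus read timed out");
            None // drop the session and reconnect
        }
    }
}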

In particular, I'm talking about this loop:

loop {
    let mut retry = false;
    let res = get_pdu_len(buf)
        .and_then(|pdu_len| {
            debug_assert!(!retry);
            if let Some(pdu_len) = pdu_len {
                frame_decoder.decode(buf, pdu_len)
            } else {
                // Incomplete frame
                Ok(None)
            }
        })
        .or_else(|err| {
            log::warn!("Failed to decode {} frame: {}", pdu_type, err);
            frame_decoder.recover_on_error(buf);
            retry = true;
            Ok(None)
        });
    if !retry {
        return res;
    }
}

get_pdu_len reads a zero byte here and therefore returns an Err, which is caught by the or_else branch. That branch calls recover_on_error to clear that zero byte, sets retry to true, and the loop immediately tries again.

I'm open to any solution that adds a short delay here: either a retry_delay parameter that is passed to tokio::time::sleep before every retry, or a user-supplied function that is called before every retry (which would allow the user to implement exponential backoff).
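
To make the second variant a little more concrete, here is a completely standalone sketch; none of these names exist in tokio-modbus, and inside the codec the recovery loop would simply call the supplied function before each retry:

use std::time::Duration;

// Standalone sketch only: the caller supplies a function that maps the retry
// count to a delay, so exponential backoff becomes a one-line policy. The
// loop still retries indefinitely, but it pauses instead of burning CPU.
async fn retry_with_delay<T, E: std::fmt::Display>(
    mut attempt: impl FnMut() -> Result<T, E>,
    delay_for: impl Fn(u32) -> Duration,
) -> T {
    let mut retries = 0;
    loop {
        match attempt() {
            Ok(value) => return value,
            Err(err) => {
                retries += 1;
                log::warn!("attempt {retries} failed: {err}");
                // Back off before the next attempt instead of spinning.
                tokio::time::sleep(delay_for(retries)).await;
            }
        }
    }
}

An exponential policy would then just be a closure like |n| Duration::from_millis(1u64 << n.min(10)).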

CC @flosse @uklotzde

@uklotzde
Member

This edge case was probably not considered back then. The loop should definitely terminate if no progress is made and recovery fails. I would prefer exiting the loop and passing control back to the client application, which could then handle the retry depending on the context.

@uklotzde
Member

I remember that we needed this recovery loop to handle spurious failures from a device that was connected via a Serial-over-USB adapter. The code is from the pre-async Tokio 0.3 era, when recovering from errors in async code was much more complicated.

Making the library code less stateful and opinionated should be the goal. If you see a chance to simplify the code by removing this poor man's approach to error recovery, don't hesitate to do it.

@ColinFinck
Collaborator Author

> I remember that we needed this recovery loop to handle spurious failures from a device that was connected via a Serial-over-USB adapter.

Recovering from spurious failures that actually happen in practice sounds like a good reason to keep this code instead of giving up immediately and passing control back to the caller.

Instead of adding a short delay, another option would be to define a maximum number of retries.
The retry boolean would be replaced by a counter that is incremented on every retry and reset to zero on every success. When the maximum number of retries (e.g. 20) is reached without success, we could fail and pass control back to the caller. What do you think about that?
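
Roughly sketched against the snippet from above (the limit of 20, the error kind, and the message are just placeholders; I'm also assuming the surrounding decode function returns a std::io::Result):

// Placeholder limit; this could also be made configurable later.
const MAX_RETRIES: usize = 20;

let mut retries = 0;
loop {
    let mut retry = false;
    let res = get_pdu_len(buf)
        .and_then(|pdu_len| {
            debug_assert!(!retry);
            if let Some(pdu_len) = pdu_len {
                frame_decoder.decode(buf, pdu_len)
            } else {
                // Incomplete frame
                Ok(None)
            }
        })
        .or_else(|err| {
            log::warn!("Failed to decode {} frame: {}", pdu_type, err);
            frame_decoder.recover_on_error(buf);
            retry = true;
            Ok(None)
        });
    if !retry {
        // Success or an incomplete frame: leave the loop, which also
        // discards the retry counter.
        return res;
    }
    retries += 1;
    if retries >= MAX_RETRIES {
        return Err(std::io::Error::new(
            std::io::ErrorKind::InvalidData,
            "giving up after repeated frame decoding errors",
        ));
    }
}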

@uklotzde
Member

Sure, a retry limit would also be reasonable. Ideally configurable, but a constant is probably sufficient for now.

ColinFinck mentioned this issue on Apr 21, 2023
@ColinFinck
Collaborator Author

I've implemented the retry limit in PR #186
