New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stop returning tonic errors from polls instead log warnings and retry. #202
Stop returning tonic errors from polls instead log warnings and retry. #202
Conversation
6bf9d5b
to
0f77240
Compare
I think I'm actually going to undo half of this. Keep the ability to return tonic errors because some things really are fatal (like bad auth). Instead I need to use a custom retryer for polls. |
return RetryPolicy::ForwardError(e); | ||
if self.call_type == CallType::Normal { | ||
return RetryPolicy::ForwardError(e); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pedantic/subjective: no else needed
src/pollers/retry.rs
Outdated
if RETRYABLE_ERROR_CODES.contains(&e.code()) { | ||
|
||
// Handle edge cases where we would retry normally not-retryable error codes | ||
let mut special_retry = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pedantic/subjective: let special_retry = self.call_type == CallType::LongPoll && e.code() == Code::NotFound;
/// The server returned a malformed polling response. Either we aren't handling a valid form, | ||
/// or the server is bugging out. Likely fatal. | ||
#[error("Poll workflow response from server was malformed: {0:?}")] | ||
BadPollResponseFromServer(Box<PollWorkflowTaskQueueResponse>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't there ever bad poll responses that should be sent to lang and considered fatal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will you do with that information? Nothing interesting to be done. I'll send you an eviction if it matters
cfg.max_elapsed_time = None; | ||
} | ||
Self { | ||
max_attempts: cfg.max_retries, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
max_attempts
is not the same as max_retries
.
e.g: 3 attempts = 2 retries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bit of logic is unchanged. Attempt starts 0 indexed so it works out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is misleading though, I'd change it (doesn't have to be in this PR).
cfg.max_elapsed_time = None; | ||
} | ||
Self { | ||
max_attempts: cfg.max_retries, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The long poll handler shouldn't have max attempts set.
Maybe change the type to Option<usize>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need some way to decide when to start logging warnings if we've been retrying forever. Normal max retries provides a decent signal.
TonicErrorHandler { | ||
backoff, | ||
max_attempts, | ||
fn new(mut cfg: RetryConfig, call_type: CallType) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe instead of introducing this CallType
enum just pass a different config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same reason about needing to log a warning
src/worker/mod.rs
Outdated
Ok(pr) => pr, | ||
Err(resp) => { | ||
warn!(resp=?resp, | ||
"Server returned a poll response that could not be validated"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give an example of when this would happen?
Preferably in docstring form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the docstring would go on ValidPollWFTQResponse
which is very straightforward about what would be invalid. Missing fields in the returned protobuf that need to be filled for the response to make any sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a fatal error to me, maybe server / client version mismatch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The server should explicitly respond with an error if our version is outside of the range it supports, but yeah that's reasonable. We don't really need a specific error for it though. I'll just change it to a tonic error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see some things done differently for better DX but we can merge this for now.
cfg.max_elapsed_time = None; | ||
} | ||
Self { | ||
max_attempts: cfg.max_retries, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is misleading though, I'd change it (doesn't have to be in this PR).
057e414
to
accf52b
Compare
What was changed
Polling now never returns network errors
Why?
There was no reason to. Lang could never do anything sensible about it anyway. Ok, it failed, so what? Now the errors are logged as warnings.
Checklist
Closes [Feature Request] Reconnecting to the server instead of crashing sdk-typescript#290
How was this tested:
Added UT. Verified manually by shutting server on and off.
Any docs updates needed?