Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop returning tonic errors from polls instead log warnings and retry. #202

Merged

Conversation

Sushisource
Copy link
Member

What was changed

Polling now never returns network errors

Why?

There was no reason to. Lang could never do anything sensible about it anyway. Ok, it failed, so what? Now the errors are logged as warnings.

Checklist

  1. Closes [Feature Request] Reconnecting to the server instead of crashing sdk-typescript#290

  2. How was this tested:
    Added UT. Verified manually by shutting server on and off.

  3. Any docs updates needed?

@Sushisource
Copy link
Member Author

I think I'm actually going to undo half of this. Keep the ability to return tonic errors because some things really are fatal (like bad auth). Instead I need to use a custom retryer for polls.

@Sushisource Sushisource marked this pull request as draft October 11, 2021 23:38
@Sushisource Sushisource marked this pull request as ready for review October 12, 2021 00:25
return RetryPolicy::ForwardError(e);
if self.call_type == CallType::Normal {
return RetryPolicy::ForwardError(e);
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pedantic/subjective: no else needed

if RETRYABLE_ERROR_CODES.contains(&e.code()) {

// Handle edge cases where we would retry normally not-retryable error codes
let mut special_retry = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pedantic/subjective: let special_retry = self.call_type == CallType::LongPoll && e.code() == Code::NotFound;

/// The server returned a malformed polling response. Either we aren't handling a valid form,
/// or the server is bugging out. Likely fatal.
#[error("Poll workflow response from server was malformed: {0:?}")]
BadPollResponseFromServer(Box<PollWorkflowTaskQueueResponse>),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't there ever bad poll responses that should be sent to lang and considered fatal?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will you do with that information? Nothing interesting to be done. I'll send you an eviction if it matters

cfg.max_elapsed_time = None;
}
Self {
max_attempts: cfg.max_retries,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max_attempts is not the same as max_retries.
e.g: 3 attempts = 2 retries.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bit of logic is unchanged. Attempt starts 0 indexed so it works out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is misleading though, I'd change it (doesn't have to be in this PR).

cfg.max_elapsed_time = None;
}
Self {
max_attempts: cfg.max_retries,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The long poll handler shouldn't have max attempts set.
Maybe change the type to Option<usize>?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need some way to decide when to start logging warnings if we've been retrying forever. Normal max retries provides a decent signal.

TonicErrorHandler {
backoff,
max_attempts,
fn new(mut cfg: RetryConfig, call_type: CallType) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe instead of introducing this CallType enum just pass a different config?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reason about needing to log a warning

src/pollers/retry.rs Outdated Show resolved Hide resolved
Ok(pr) => pr,
Err(resp) => {
warn!(resp=?resp,
"Server returned a poll response that could not be validated");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give an example of when this would happen?
Preferably in docstring form.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the docstring would go on ValidPollWFTQResponse which is very straightforward about what would be invalid. Missing fields in the returned protobuf that need to be filled for the response to make any sense

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a fatal error to me, maybe server / client version mismatch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The server should explicitly respond with an error if our version is outside of the range it supports, but yeah that's reasonable. We don't really need a specific error for it though. I'll just change it to a tonic error

Copy link
Member

@bergundy bergundy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see some things done differently for better DX but we can merge this for now.

cfg.max_elapsed_time = None;
}
Self {
max_attempts: cfg.max_retries,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is misleading though, I'd change it (doesn't have to be in this PR).

src/pollers/retry.rs Outdated Show resolved Hide resolved
@Sushisource Sushisource merged commit ed28b73 into temporalio:master Oct 12, 2021
@Sushisource Sushisource deleted the fix-dying-on-lost-connection branch October 12, 2021 19:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request] Reconnecting to the server instead of crashing
3 participants