Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Allow configuration of RetryClassifier on the client #2417

Closed
DanielBauman88 opened this issue Feb 24, 2023 · 12 comments
Closed

Comments

@DanielBauman88
Copy link
Contributor

DanielBauman88 commented Feb 24, 2023

This lack of functionality is a significant regression over clients in other languages.
The default retry classifier in smithy is the AwsResponseRetryClassifier.

Smithy allows for the retry config to be modified (max attempts and backoff params) but it doesn't allow for any way to inject customer retry classification. It's possible (but not super ergonomic) to inject one's own retry policy but then one loses out on being able to use the supplied configuration for max attempts and backoff and the good default classification which one might want to just add a few cases to.

The generated code does create a customize method on the client but it is parametrized to the specific AwsResponseRetryClassifier so even when using customize on the request level a custom classifier cannot be supplied. Even if it could, doing this at the request level might be needed in some cases but the client should allow a client level configuration of a retry classifier.

The locked down cuztomize method:

        pub async fn customize(
            self,
        ) -> std::result::Result<
            crate::operation::customize::CustomizableOperation<
                crate::operation::SOME-OPERATION,
                aws_http::retry::AwsResponseRetryClassifier,
            >,
            aws_smithy_http::result::SdkError<crate::error::SOME-ERROR>,

TLDR, support a function on the client builder to set a retry classifier. This will allow consumers of smithy clients to use the AwsResponseRetryClassifier and based on its result to modify the classification however they choose. This can be done with java and go, and is functionality that a client to be used in production by customers should support.

This more specific issue was created after some digging - resolved the old issue.

@jdisanti
Copy link
Collaborator

What is your use-case for needing to do this?

@DanielBauman88
Copy link
Contributor Author

I may want to retry a wider set of exceptions that the default retrclassifier.

For example - I'm interacting with DDB and my transactions are written in such a way that I want to retry any TransactionCanceledException that happens because of a TransactionConflict reason.

Or if I'm using a service that throws ConflictExceptions and for my application I always want to retry these.

Or if I'm using an eventually consistent service and I always want to retry the not-found errors that might come after fresh creates.

@DanielBauman88
Copy link
Contributor Author

One more thing to tag onto this issue - the smithy rs client is in general extremely restrictive.

In other languages its easy to keep the default behaviour but plug into the logic to write some logs or post a metric when retries happen.

The rust sdk makes that impossible. Either an entire retry config has to be written which cannot benefit from the default logic, or no additional behaviour can be customized.

I've self-served some of this in a hacky way by depending on some presumably unstable interfaces for the way the smithy clients use tracing, but this should be supported in a well-defined way.

@Velfi
Copy link
Contributor

Velfi commented Feb 28, 2023

I want to first say that I understand your frustration. I often feel that the user experience of the SDK is lacking, and it's not easy to extend the client with custom behavior outside of things like operation customization.

In other languages its easy to keep the default behaviour but plug into the logic to write some logs or post a metric when retries happen.

I agree that this should not be so difficult. However, it helps to remember that we're maintaining a client that is used to interact with 300+ services and we're still in developer preview. We've had to prioritize what features we work on and that means that some features don't get much polish, if any.

The rust sdk makes that impossible. Either an entire retry config has to be written which cannot benefit from the default logic, or no additional behaviour can be customized.

Rust is a statically typed language, which means that any time we want something to be fully customizable by users, we need to rely on traits. However, as we add more traits, our code becomes more and more difficult to maintain. One need only look at the trait bounds on our middleware to see the challenges we're facing. In many cases, we believe that the right thing to do is to design less, and design simpler. While this prevents users like yourself from fully customizing the client, it makes the client easier to maintain for our team, and easier to use for customers with simpler requirements.

I've self-served some of this in a hacky way by depending on some presumably unstable interfaces for the way the smithy clients use tracing, but this should be supported in a well-defined way.

I agree that this should be supported in a well defined way that's pleasant to use and I'm confident that it will be by the time we release our 1.0 version. I am currently working on a re-write of our client which will allow users more fine-grained control over what happens during the sending of a request. Additionally, the rewrite is based on a new internal standard for how AWS SDKs should work, so it will also bring our SDK closer to the design of other SDKs, meaning less surprises for people like yourself that use several languages to interact with AWS.

I appreciate your passion for this issue, and I hope that you'll find our SDK to be more useful in the future. In the meantime, I hope that our default tracing logs are helpful to you. If you're looking for some specific improvements to our current logging, then let's get a list of possible improvements started. We can then discuss the ideas on a point-by-point basis. It's most helpful to us when you can describe what you hope to accomplish, rather than how you'd like to accomplish it.

As a stopgap, I'd suggest running your request in a loop with your own classification method:

let sdk_config = aws_config::load_from_env().await;
let client = aws_sdk_dynamodb::Client::new(&sdk_config);

let mut attempts = 0;
let max_attempts = 3;
loop {
    if attempts >= max_attempts {
        break;
    }
    match client.your_operation.send().await {
        Ok(res) => {
            /* it worked, so we can stop */
            break;
        }
        Err(err) => {
            let service_err = err.into_service_error();
            let cancellation_reasons = service_err.cancellation_reason().unwrap_or_default();
            if cancellation_reason
                .flat_map(|reason| reason.code())
                .find(|code| code == "TransactionConflict")
            {
                // it failed due to a transaction conflict, we'll try again
                attempts += 1;
                continue;
            } else {
                // else, it's some error we don't want to retry
                break;
            }
        }
    }
}

I wrote that code in GitHub, so it may not be quite right, but I hope it gets you closer to your goal.

@DanielBauman88
Copy link
Contributor Author

DanielBauman88 commented Feb 28, 2023

Thanks for the comprehensive response @Velfi.

I understand that the client is in developer preview, I'm not making this issue to say that there's a bug, but to make sure this is tracked so that it can be put on the radar for implementation in future. I appreciate all the work that's been done so far.

it makes the client easier to maintain for our team, and easier to use for customers with simpler requirements

I understand this desire, but I think for the tooling that aws customers are going to use this functionality has to be there. It can be hidden from the simple customers, but the functionality should not be blocked. We (aws customers) like you go oncall, and have to investigate errors and latency issues in our systems, many of these will be related to our dependencies. We need to have clear insight into cases were retries are happening, what the failures are and how long each attempt takes so that we can have good metrics, and good logs for debugging specific cases where (as one example) we have a request that took very long and we want to find out why.

I agree that this should be supported in a well defined way that's pleasant to use and I'm confident that it will be by the time we release our 1.0 version. I am currently working on a re-write of our client which will allow users more fine-grained control over what happens during the sending of a request. Additionally, the rewrite is based on a new internal standard for how AWS SDKs should work, so it will also bring our SDK closer to the design of other SDKs, meaning less surprises for people like yourself that use several languages to interact with AWS.

I see that we are in agreement :) Looking forward to seeing the next version!

I definitely do understand the challenges though! I am constantly spending much longer than expected digging into rust code and jumping around trying to find the bounds on the mysterious type parameters on the functions I care about.

I am planning to implement a similar stopgap to what you suggest, the unfortunate part is that we make many different calls to deps our service and so need to add this layer at all those call sites instead of being able to configure it once. I will most likely create my own client that wraps the smithy one, implements all the methods we use, and routes to the smithy client wrapped with our retry logic. Thanks for the code snippet.

@DanielBauman88
Copy link
Contributor Author

To phrase what I am looking for without going into implementation details

  • Ability to record information for debugging my services using aws. When I make a request to an aws service I want to be able to discover how many retries took place, why each attempt failed and how long it took, how long the final success took (if there was a success at the end of the attempts)
  • Ability to use the default retry policy but to modify it by being able to add additional cases of errors to retry. Ideally this code would know how many attempts have been made already, and be able to set how long to wait before the next attempt.

Tangentially related to this, I'd love to have support for https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/metrics.html - this doesn't give all the retry insights I'm talking about but it gives a lot of other useful operational metrics for free.

@DanielBauman88
Copy link
Contributor Author

DanielBauman88 commented Mar 1, 2023

There's one more thing that came up when I bit the bullet to go ahead and implement this retry at the application layer.

It would be nice to have this functionality wrapped up somewhere so we don't have to add this logic all-over our code wherever we make a service call.

One idea to do this would be to make my own FooServiceRetryingClient that has it's own FooServiceClient and to then implement all the service functions with the retry logic added in.

Unfortunately in this case - that doesn't work either. Because the rust client only exposes calls through the fluent builders and send() and none of these are traits. This means there's no good way to create your own client that wraps the existing client and adds some functionality. You have to rather add your logic all over the code whenever .send() is called.

EG: if the client had functions like foo(i: FooInput) -> Result<FooOutput, SdkError> this would at least allow for constructing a client that wraps each call with some generic retry logic.

Here's the most generic way I could come up with - it ain't super pretty and it still has to be called at every .send() location but at least the main logic can be re-used. Fortunately status code works for me, because the typed rust enum ErrorKinds for each operation are different and so generic logic can't be implemented for those.

pub async fn retry_conflict<
    R,
    T: Future + Future<Output = Result<R, SdkError<E>>>,
    F,
    E,
>(
    mut f: F,
) -> Result<R, SdkError<E>>
where
    F: FnMut() -> T,
{
    let mut num_attempts = 0;
    loop {
        let res = f().await;
        match &res {
            Ok(_) => return res,
            Err(e) => match e {
                ServiceError(r) => {
                    if r.raw().http().status().as_u16() != 409 {
                        return res;
                    }
                }
                _ => {
                    return res;
                }
            },
        }
        tracing::info!(
            "Retrying attempt {} which failed on resource conflict",
            num_attempts + 1
        );
        num_attempts += 1;
        if num_attempts >= 2 {
            return res;
        }
    }
}

Used like

    let req = client.foo().some_param("bar");
    let response = retry_conflict(|| req.clone().send()).await;

@Velfi
Copy link
Contributor

Velfi commented Mar 22, 2023

I definitely do understand the challenges though! I am constantly spending much longer than expected digging into rust code and jumping around trying to find the bounds on the mysterious type parameters on the functions I care about.

We're working on a replacement for the current smithy client that will remove all (or nearly all) mysterious type parameters needed to send a request. Additionally, it will make granular logging of the request/response lifecycle a simple manner. I'm very excited to get it in your hands and see if we're still not meeting all of your needs. I'll reach out to you once it's ready for testing. You can view the preliminary version of the change in these PRs:

Thanks for your additional comments. Regarding your "retry resource conflicts", I'll be sure to consider that while I integrate our retry handler into the new client. I'm imagining an API similar to:

let res = client.foo()
  .some_param("bar")
  .with_runtime_plugin(
    RetryOrchestrator::default()
      .also_retry_err(aws_sdk_blah::Error::ResourceConflict)
      .build()
  )
  .send()
  .await?;

@DanielBauman88
Copy link
Contributor Author

DanielBauman88 commented May 15, 2023

The changes you made look really cool. Thanks for sharing! I only saw this comment now or I would have looked at it sooner!

The API you suggest with also_retry_err looks great! I think it would also be cool to support something more generic (eg: it's possible some downstream service doesn't nicely model two cases that a caller cares about as different error types and so there could be a use case to inject customer logic to parse a message for deciding about a retry). This could be a closure which takes in the error and outputs a true/false.

The RFC and PRs you submitted also address some observability issues we are having with the rust SDK. Rust (and many non java languages) don't have the awesome sdk metrics that java has https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/metrics.html

We however do want metrics for our dependency calls and in addition to knowing how long a whole depdendency interaction takes we also want metrics for how many retries there were and how long each individual attempt took.

We currently time the whole interaction by making all our SDK calls in a closure that we pass to an instrumentation function which adds the metric.
However, the retry and individual attempt metrics are missing! I think we should be able to use your changes with the appropriate hooks like read_before_execution and after or read_before_attempt and after and some state tracking on our side to get the metrics we want, especially if the context in those methods lets us extract simple info like the service and operation name. It would be amazing if the context would let us define a generic 'state' type to help us record things we care about at the request level - or even just a string map that we could throw data into that would live through the request and let us have mutable access in the hooks.

This would let us to get all the info we want (including the internal retry / attempt_duration stuff) by instrumenting the client and we'd be able to stop having to use our instrumentation function each time we make a service call with the SDK.

We have currently self-serviced some of this by hooking into the spans that the SDK uses, but we understand that this is not a public interface and isn't ideal for us to base our metrics on.

@Velfi
Copy link
Contributor

Velfi commented May 16, 2023

Retry classifier customization in my current prototype works like this:

  • There is a list of classifier functions which can be customized in codegen at the service level or at the operation level. When it comes time to decide if we need to retry, we run the classifiers in order.
  • Classifiers take an Option<Result<SomeOpOutput, SomeOpError>> and an Option<HttpResponse>. Even though they take a result, the orchestrator itself will always skip classification if the deserializer deserializes the response and it's Ok.
  • Classifiers can either return Some(ShouldAttemptRetry::Yes), Some(ShouldAttemptRetry::YesAfterDuration), or None.
  • We return early if any classifier says "yes". If all classifiers say "no", then the error response is unretryable.

Users will be able to configure this by "mapping" the classifiers with a function that will probably look just like this:

let res = client.get_object()
  <normal get object params>
  .customize()
  .map_retry_classifier(|classifiers, output_or_error, response| -> Option<ShouldAttempt> {
   let should_attempt = classifiers.run(output_or_error, response);
    // do whatever you want to classify retries
   should_attempt
  })
  .send()
  .await;

This way you can front-run the classifier, second-guess it, or ignore it altogether.

github-merge-queue bot pushed a commit that referenced this issue Oct 10, 2023
[Read the RFC here](#3018)

## Motivation and Context
<!--- Why is this change required? What problem does it solve? -->
<!--- If it fixes an open issue, please link to the issue here -->
#2417

## Description
<!--- Describe your changes in detail -->
Exactly what it says on the tin. I have a related RFC to publish that
goes into more depth.

## Testing
<!--- Please describe in detail how you tested your changes -->
<!--- Include details of your testing environment, and the tests you ran
to -->
<!--- see how your change affects other areas of the code, etc. -->
I wrote an integration test that ensures a custom retry classifier can
be set and is called.

## Checklist
<!--- If a checkbox below is not applicable, then please DELETE it
rather than leaving it unchecked -->
- [x] I have updated `CHANGELOG.next.toml` if I made changes to the
smithy-rs codegen or runtime crates
- [x] I have updated `CHANGELOG.next.toml` if I made changes to the AWS
SDK, generated SDK code, or SDK runtime crates

----

_By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice._
@Velfi
Copy link
Contributor

Velfi commented Oct 10, 2023

#3050

We're about to release a new version with the capability to set classifiers. See the above link for more details.

@jdisanti
Copy link
Collaborator

jdisanti commented Apr 5, 2024

I believe this was released a while ago.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants