
Zuul won't stop retry after hystrix is timed out #1772

Closed
ts-huyao opened this issue Mar 13, 2017 · 43 comments

@ts-huyao

ts-huyao commented Mar 13, 2017

I'm on spring cloud dependencies Camden.SR6, and a setup of zuul with the following config

ribbon:
    MaxAutoRetriesNextServer: 0
    MaxAutoRetries: 2
    ReadTimeout: 3000
    OkToRetryOnAllOperations: true #false to only allow get to retry

hystrix:
  command:
    sc-test-node-0:
      execution:
        isolation:
          strategy: THREAD
          thread:
            timeoutInMilliseconds: 5000

and I put a simple resource behind zuul which sleeps 4000ms when invoked.

    @GET
    @Path("getthensleep/{sleepms}")
    @Produces(MediaType.APPLICATION_JSON)
    public Response getThenSleep(@PathParam("sleepms") int sleepms){
        LOGGER.info("get then sleep {} ms", sleepms);
        try {
            Thread.sleep(sleepms);
        } catch (InterruptedException e) {
            LOGGER.error("Sleep error", e);
        }
        return Response.ok().build();
    }

and I use Postman to call the getthensleep resource so that it sleeps 4000 ms. I expect Zuul's RibbonRoutingFilter to time out after 3000 ms and retry until Hystrix times out, so two separate calls should be made from Zuul (with a 3000 ms ReadTimeout, the second attempt is still in flight when the 5000 ms Hystrix timeout fires). However, after Hystrix times out, Zuul returns code 500 and yet another call is made to the getthensleep resource, so in total three calls are made by Zuul. I think Zuul should stop retrying when Hystrix times out; it shouldn't make another attempt after the Hystrix timeout.

@ryanjbaxter
Contributor

I found this issue in the Hystrix project that is particularly interesting.

It seems like the application thread should try to respond to a thread interruption (when using an isolation strategy of thread).

Might need to dig into this a little bit more to understand it better.

@spencergibb
Member

Because of the way the RibbonRoutingFilter is coded, the default way of specifying THREAD isolation doesn't work. Try zuul.ribbonIsolationStrategy=THREAD. You are probably still using semaphore.

@ts-huyao
Author

ts-huyao commented Mar 14, 2017

@spencergibb thank you for pointing out that it's a Hystrix isolation issue. I tried to set

zuul:
  ignored-patterns: /**/health/**,/**/internal/**
  routes:
    test-node-0:
      path: /node0/**
      serviceId: sc-test-node-0
      sensitiveHeaders: Cookie,Set-Cookie
      retryable: true
  ribbon-isolation-strategy: THREAD

but it seems not working.

@spencergibb
Member

What does "but it seems not working." mean?

@rterentiev
Contributor

rterentiev commented Mar 16, 2017

Hystrix uses RxJava, and when a command should fail due to a timeout, Hystrix unsubscribes from the command's observable (see com.netflix.hystrix.AbstractCommand.HystrixObservableTimeoutOperator).

If the THREAD isolation level is used, the command thread can be interrupted. But @ryanjbaxter found an issue with that.

With the SEMAPHORE isolation level, which is the default for Zuul, the command thread is not interrupted.

Now we have Spring's AbstractRibbonCommand. It extends HystrixCommand and implements the run method. We could rewrite AbstractRibbonCommand to extend HystrixObservableCommand and implement a construct method that returns an observable. That would propagate the unsubscription.

I tried to do this, see my commit. I am using Ribbon's LoadBalancerCommand, which also returns an observable.

But I ran into another problem: the retryable clients. When a retryable client is used, Ribbon's retry logic is ignored and the retry is performed in the client. I don't know what the best way is to make this work with the retryable clients. Add something like a toObservable method to the client and implement execution reactively?

Anyway, see my branch. It has tests for aborting retries. If you revert all the changes and leave only the tests, they will fail.
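
A rough sketch of the core idea (this is not the exact code from my branch; ObservableRibbonCommand is a made-up name, and the ClientHttpResponse generic just mirrors Spring's AbstractRibbonCommand):

import com.netflix.hystrix.HystrixObservableCommand;
import com.netflix.loadbalancer.reactive.LoadBalancerCommand;
import com.netflix.loadbalancer.reactive.ServerOperation;
import org.springframework.http.client.ClientHttpResponse;
import rx.Observable;

public class ObservableRibbonCommand extends HystrixObservableCommand<ClientHttpResponse> {

	private final LoadBalancerCommand<ClientHttpResponse> loadBalancerCommand;
	private final ServerOperation<ClientHttpResponse> operation;

	protected ObservableRibbonCommand(Setter setter,
			LoadBalancerCommand<ClientHttpResponse> loadBalancerCommand,
			ServerOperation<ClientHttpResponse> operation) {
		super(setter);
		this.loadBalancerCommand = loadBalancerCommand;
		this.operation = operation;
	}

	@Override
	protected Observable<ClientHttpResponse> construct() {
		// LoadBalancerCommand.submit returns an Observable; when Hystrix times out it
		// unsubscribes, and the unsubscription propagates into the retry chain instead
		// of letting further attempts run to completion
		return loadBalancerCommand.submit(operation);
	}
}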

@ryanjbaxter
Contributor

My gut tells me that we can probably do something in RibbonLoadBalancedRetryPolicyFactory so that when we check if we can retry we return false because of the timeout.

@rterentiev
Contributor

rterentiev commented Mar 17, 2017

I think what I did in my branch is wrong. The retry logic should only be in the client.

My gut tells me that we can probably do something in RibbonLoadBalancedRetryPolicyFactory so that when we check if we can retry we return false because of the timeout.

Yeah, but you would need some way to ask Hystrix whether the command has timed out.

I have another question. What was the point of adding those retryable clients? To be able to write a custom LoadBalancedRetryPolicy?

@ryanjbaxter
Contributor

What was the point of adding those retryable clients? To be able to write a custom LoadBalancedRetryPolicy?

Beginning in Camden we no longer use the Netflix RestClient, since it was deprecated. The new HTTP clients we use did not have the retry logic that the Netflix client did, so the retryable clients were made to fill that gap.

Can you perhaps submit a PR with your changes so we can take an easier look at the code changes?

@rterentiev
Contributor

Can you perhaps submit a PR with your changes so we can take an easier look at the code changes?

#1793

@rterentiev
Contributor

Beginning in Camden we no longer use the Netflix RestClient, since it was deprecated. The new HTTP clients we use did not have the retry logic that the Netflix client did, so the retryable clients were made to fill that gap.

The deprecated RestClient extends AbstractLoadBalancerAwareClient, whose executeWithLoadBalancer method has retry logic, while its execute method does not retry requests.

The new retryable clients also extend AbstractLoadBalancerAwareClient, but override getRequestSpecificRetryHandler to disable the retry logic in executeWithLoadBalancer; the retry logic is implemented in execute instead.

Comparing those implementations:

method                  | RestClient     | New retryable client
execute                 | does not retry | does retry
executeWithLoadBalancer | does retry     | does retry

What is the contract for a client? Should the execute method retry failed requests?

@ryanjbaxter
Contributor

ryanjbaxter commented Mar 20, 2017

Yes, execute should retry the failed requests; executeWithLoadBalancer calls execute.
We couldn't change the retry logic in AbstractLoadBalancerAwareClient since it is upstream, so we just disable it and move the retry logic to the client in Spring Cloud Netflix.

@ryanjbaxter
Contributor

So I did some investigation into how Hystrix communicates a timeout to the wrapped operation, and it looks like it just interrupts the thread, so all we should have to do is check Thread.interrupted(). Here is a quick sample I put together that seems to work.

@Service
public class HelloService {
	@HystrixCommand(fallbackMethod = "reliable")
	public String getHelloString() throws Exception {
		MySimplePolicy policy = new MySimplePolicy();
		RetryTemplate template = new RetryTemplate();
		template.setRetryPolicy(policy);
		return template.execute(new RetryCallback<String, Exception>() {
			@Override
			public String doWithRetry(RetryContext context) throws Exception {
				RestTemplate rest = new RestTemplate();
				((SimpleClientHttpRequestFactory)rest.getRequestFactory()).setReadTimeout(100);
				return rest.getForObject("http://localhost:8080/hello", String.class);
			}
		});

	}
	
	public String reliable() {
		return "fallback";
	}
}

class MySimplePolicy extends SimpleRetryPolicy {
	@Override
	public boolean canRetry(RetryContext context) {
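		// Hystrix signals a timeout under THREAD isolation by interrupting the command thread;
		// once that happens, mark the retry context exhausted so no further attempts are made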
		if(Thread.interrupted()) {
			context.setExhaustedOnly();
			return false;
		}
		return super.canRetry(context);
	}
}

I think you should be able to apply this to the retryable clients as well.
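
For example, something along these lines might work (just a sketch; InterruptAwareRetryPolicy is a hypothetical wrapper over the LoadBalancedRetryPolicy contract from spring-cloud-commons, and like the sample above it only helps with THREAD isolation, where the command thread actually gets interrupted):

import org.springframework.cloud.client.loadbalancer.LoadBalancedRetryContext;
import org.springframework.cloud.client.loadbalancer.LoadBalancedRetryPolicy;

public class InterruptAwareRetryPolicy implements LoadBalancedRetryPolicy {

	private final LoadBalancedRetryPolicy delegate;

	public InterruptAwareRetryPolicy(LoadBalancedRetryPolicy delegate) {
		this.delegate = delegate;
	}

	// use isInterrupted() rather than Thread.interrupted() so the flag stays set for the caller
	private boolean hystrixTimedOut() {
		return Thread.currentThread().isInterrupted();
	}

	@Override
	public boolean canRetrySameServer(LoadBalancedRetryContext context) {
		return !hystrixTimedOut() && delegate.canRetrySameServer(context);
	}

	@Override
	public boolean canRetryNextServer(LoadBalancedRetryContext context) {
		return !hystrixTimedOut() && delegate.canRetryNextServer(context);
	}

	@Override
	public void close(LoadBalancedRetryContext context) {
		delegate.close(context);
	}

	@Override
	public void registerThrowable(LoadBalancedRetryContext context, Throwable throwable) {
		delegate.registerThrowable(context, throwable);
	}
}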

@rterentiev
Contributor

rterentiev commented Mar 22, 2017

Thank you for the clarifications!

Unfortunately Thread.interrupted() will not fix the case where the Hystrix isolation level is SEMAPHORE. This is critical when you use Zuul, which uses the SEMAPHORE isolation level by default. Switching Zuul to the THREAD isolation level might affect performance: you would have twice as many busy threads, one servlet container thread plus one Hystrix thread per request.

I am trying to refactor the clients right now to make a more general fix. Long story short, the idea is to add an AbstractLoadBalancingClient.getExecutionObservable method and use it everywhere. Here is an example:

@Override
public T execute(S request, IClientConfig clientConfig) throws Exception {
	return executeWithLoadBalancer(request, clientConfig);
}

@Override
public T executeWithLoadBalancer(S request, IClientConfig clientConfig) throws ClientException {
	try {
		RequestSpecificRetryHandler handler = getRequestSpecificRetryHandler(request, clientConfig);
		return getExecutionObservable(request, clientConfig, handler)
				.toBlocking()
				.single();
	} catch (Exception e) {
		Throwable t = e.getCause();
		if (t instanceof ClientException) {
			throw (ClientException) t;
		} else {
			throw new ClientException(e);
		}
	}
}

public Observable<T> getExecutionObservable(final S request, final IClientConfig requestConfig, RetryHandler handler) {
	LoadBalancerCommand<T> command = LoadBalancerCommand.<T>builder()
			.withLoadBalancerContext(this)
			.withRetryHandler(handler)
			.withLoadBalancerURI(request.getUri())
			.build();

	try {
		return command.submit(new ServerOperation<T>() {
			@Override
			public Observable<T> call(Server server) {
				URI finalUri = reconstructURIWithServer(server, request.getUri());
				S requestForServer = (S) request.replaceUri(finalUri);

				try {
					T response = AbstractLoadBalancingClient.this.executeInternal(requestForServer, requestConfig);
					return Observable.just(response);
				} catch (Exception e) {
					return Observable.error(e);
				}
			}
		});
	} catch (Exception e) {
		return Observable.error(e);
	}
}

protected abstract T executeInternal(S request, IClientConfig requestConfig) throws Exception;

@ryanjbaxter
Contributor

I meant to add a note to my previous comment stating that it would only work for thread isolation, sorry about that.

I think the Observable approach will work for both thread and semaphore isolation. I went back and read this comment, which seems to confirm that approach. Thanks for digging into this!

@rterentiev
Contributor

rterentiev commented Mar 23, 2017

I updated my pull request #1793 with the latest changes. OkHttpLoadBalancingClient and RibbonLoadBalancingHttpClient now start retrying failed requests on execute.

Is it OK to remove RetryableRibbonLoadBalancingHttpClient and RetryableOkHttpLoadBalancingClient? I should have asked about this before, but I decided to remove them because OkHttpLoadBalancingClient and RibbonLoadBalancingHttpClient now do the same thing.

@ryanjbaxter
Contributor

No, it is not OK. We want those clients so we can use Spring Retry to retry the failed requests and not rely on the deprecated Netflix client to do so.

@rterentiev
Contributor

rterentiev commented Mar 23, 2017

Well, I was confused. Both the old and new retryable clients extend Netflix's AbstractLoadBalancerAwareClient. Netflix's client does its retrying via Ribbon, and the new retryable clients are also configured via Ribbon's configuration. So it looks like those retryable clients just replace Ribbon's retry implementation with a custom one using Spring Retry, and make the execute method retryable.
Probably I just missed something, or don't know about some future refactoring you have in mind.

Anyway, see my updated pull request. All clients now behave the same as before; basically, I just fixed the issue.

@spencergibb
Member

I'd be very uncomfortable merging the changes you have made.

@rterentiev
Contributor

No problem. At least I've shown one possible way to fix the issue. Thanks.

@ryanjbaxter
Contributor

I am going to close this issue for now. If we find there are others who want retries to stop after a Hystrix timeout, we can come back and address this.

@Aloren
Contributor

Aloren commented Mar 27, 2017

I was following your discussion as this issue is actually happening for our service in production. As soon as the services behind Zuul begin to respond more slowly, Zuul retries requests until some of our nodes die. :(
Since this is actually hard to fix, what should be done next? Or will Spring Cloud Zuul live with this bug?

Thanks

@ryanjbaxter
Contributor

Are you saying the retries that don't stop from Zuul are causing your apps to crash?

@Aloren
Contributor

Aloren commented Mar 28, 2017

It caused Zuul to crash. We had network issues and a downstream service was unreachable.
Even though the call was cancelled by Hystrix, Ribbon was still retrying (Ribbon has a default timeout of 5 seconds). This led to 10-second response times and a huge spike of busy threads on Tomcat, which in turn froze the JVM and made the node non-operational.
[screenshot attachment: screen shot 2017-03-28 at 11 17 04 am]

@ts-huyao
Author

@spencergibb even when ribbon-isolation-strategy is set to THREAD, the retry can't be interrupted by the Hystrix timeout.

@ryanjbaxter
Contributor

@ts-huyao yes, because we are not checking whether the thread has been interrupted.

@ryanjbaxter
Contributor

@Aloren and @rterentiev we talked about this issue on our team call today and came to the consensus that the proper way to configure the Hystrix timeout is to make sure you account for the total time of all retries. This means, for example, if you have the Ribbon timeout configured to be 1 second and you are going to retry the request 3 times, the Hystrix timeout should be at least 3 seconds. Is there a reason why you have not configured your Hystrix timeout this way?

@spencergibb
Member

If I recall correctly, the hystrix timeout should be slightly greater than the max expected duration of the underlying command. So if the http client is configured with a 1s connect+read timeout and 2 potential retries, the hystrix timeout should be slightly more than 3s (to account for jvm overhead). I'm trying to find where I read that.
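
As a rough illustration (numbers are only an example, not a recommendation):

ribbon:
  ConnectTimeout: 500
  ReadTimeout: 500
  MaxAutoRetries: 2
  MaxAutoRetriesNextServer: 0

hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            # worst case: (500 + 500) * (2 + 1) * (0 + 1) = 3000ms, plus some headroom
            timeoutInMilliseconds: 4000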

@vasilievip

vasilievip commented Mar 30, 2017

@ryanjbaxter, then the defaults in the Spring Cloud configs need to be adjusted as well; the 5-second Hystrix timeout for Zuul was set to override the 1-second default. WDYT?

@ryanjbaxter
Contributor

If that is the case then yes, we should make them match; however, as soon as a second instance of a service is added, that timeout needs to be adjusted.

I wonder if we could automatically calculate the hystrix timeout based on the number of instances and the configured ribbon timeout....

@spencergibb
Member

We don't currently set the defaults for hystrix in zuul except for the max number of semaphores.

@vasilievip

Here is the list of settings that need to match:

hystrix.command.default.execution.isolation.thread.timeoutInMilliseconds
ribbon.ReadTimeout
ribbon.ConnectTimeout
ribbon.MaxHttpConnectionsPerHost
ribbon.MaxTotalConnections
ribbon.MaxAutoRetriesNextServer
ribbon.MaxAutoRetries

By default the read timeout is 5000 ms and auto retry is 1...

@spencergibb
Member

I'm not sure we should be setting those values automatically.

@rterentiev
Contributor

There is no point in a Hystrix timeout if the command can't be interrupted. I would just disable the Hystrix timeout by default, and maybe log a warning message if it is enabled.
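
For reference, disabling it globally can be done with the standard Hystrix property:

hystrix:
  command:
    default:
      execution:
        timeout:
          # disables the Hystrix execution timeout for all commands
          enabled: false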

@ryanjbaxter
Contributor

@rterentiev why do you want Hystrix to interrupt the command if it hasn't finished retrying the requests?

@rterentiev
Contributor

rterentiev commented Mar 31, 2017

Because Hystrix will return a timeout error even if the command eventually executed successfully after retrying. In high-load applications this also adds extra load and causes the problems described by @Aloren.

I am not pushing you to fix this issue. I was able to work around it by implementing a custom command and command factory.

But if you are not going to fix it for now, I would disable the Hystrix timeout by default. It just makes things more consistent: the Hystrix timeout doesn't work, so it is disabled by default.

@spencergibb
Member

@rterentiev you wouldn't have those problems if the hystrix timeout were > the combined potential timeouts of all retries, correct?

@ryanjbaxter
Contributor

Because Hystrix will return a timeout error even if the command eventually executed successfully after retrying.

Right, so why not configure Hystrix so it at least waits until all retries complete?

In high-load applications this also adds extra load and causes the problems described by @Aloren.

This could potentially be dealt with by implementing some kind of backoff policy; it would be an enhancement.

@vasilievip

@spencergibb, correct, but the default setup is broken. There is a 1-second Hystrix timeout and a 5-second Ribbon timeout.
So by default the Hystrix timeout could be disabled to avoid the broken behaviour (or the Ribbon defaults need to fit within the 1-second Hystrix timeout).

@spencergibb
Member

I've opened #1827 to see about adjusting the ribbon timeouts for the default apache httpclient and the okhttp client.

I've opened #1828 to document hystrix/retry/timeout relationships.

BTW, I've chatted with a netflix engineer and I am correct in those relationships. As a side note, even they struggle with this some.

@rterentiev
Contributor

rterentiev commented Apr 3, 2017

Right, so why not configure Hystrix so it at least waits until all retries complete?

@rterentiev you wouldn't have those problems if the hystrix timeout were > the combined potential timeouts of all retries, correct?

You need to keep this problem in mind: each time you update the Ribbon configuration, you have to recalculate the timeout value for Hystrix. It's much simpler to just disable the Hystrix timeout. The behavior is the same whether the Hystrix timeout is disabled or set greater than the combined potential timeouts of all retries.

@spencergibb
Member

Disabling the hystrix timeout seems like a bad idea. May as well remove hystrix at that point.

@matt-shaw

Hi,

I've been following this thread but am slightly confused about which settings do and don't work with Zuul, Ribbon and Hystrix.

We are using the Spring Cloud Edgeware release with all the default values. We are connecting Zuul to our services using Ribbon backed by Eureka. Some of our services are a little slow, ~10 seconds or so. How do I increase the various timeouts being triggered so that this normal use case does not time out?

I've set these:

ribbon:
  ReadTimeout: 10000
  ConnectTimeout: 10000

which stopped some of the timeouts but I'm still left with the Hystrix timeouts.

Many thanks

Matt

@ryanjbaxter
Contributor

@matt-shaw please don't ask the same question on multiple issues.
