
Zuul won't stop retry after hystrix is timed out #1772

Closed
ts-huyao opened this issue Mar 13, 2017 · 43 comments

@ts-huyao

ts-huyao commented Mar 13, 2017

I'm on spring cloud dependencies Camden.SR6, and a setup of zuul with the following config

ribbon:
    MaxAutoRetriesNextServer: 0
    MaxAutoRetries: 2
    ReadTimeout: 3000
    OkToRetryOnAllOperations: true #false to only allow get to retry

hystrix:
  command:
    sc-test-node-0:
      execution:
        isolation:
          strategy: THREAD
          thread:
            timeoutInMilliseconds: 5000

and I put a simple resource behind zuul which sleeps 4000ms when invoked.

    @GET
    @Path("getthensleep/{sleepms}")
    @Produces(MediaType.APPLICATION_JSON)
    public Response getThenSleep(@PathParam("sleepms") int sleepms){
        LOGGER.info("get then sleep {} ms", sleepms);
        try {
            Thread.sleep(sleepms);
        } catch (InterruptedException e) {
            LOGGER.error("Sleep error", e);
        }
        return Response.ok().build();
    }

and I use Postman to call the getthensleep resource so that it sleeps 4000 ms. I expect Zuul's RibbonRoutingFilter to time out after 3000 ms and retry until Hystrix times out, so two separate calls should be made from Zuul (with a 3000 ms ReadTimeout, the second attempt is still in flight when the 5000 ms Hystrix timeout fires). However, after Hystrix times out, Zuul returns code 500 and yet another call is made to the getthensleep resource, so in total three calls are made by Zuul. I think Zuul should stop retrying when Hystrix times out; it shouldn't make another attempt after the Hystrix timeout.

@ryanjbaxter
Contributor

I found this issue in the Hystrix project that is particularly interesting.

It seems like the application thread should try to respond to a thread interruption (when using an isolation strategy of thread).

Might need to dig into this a little bit more to understand it better.

@spencergibb
Member

Because of the way the RibbonRoutingFilter is coded, the default way of specifying THREAD isolation doesn't work. Try zuul.ribbonIsolationStrategy=THREAD. You are probably still using semaphore.

@ts-huyao
Author

ts-huyao commented Mar 14, 2017

@spencergibb thank you for pointing out that it's a Hystrix isolation issue. I tried to set

zuul:
  ignored-patterns: /**/health/**,/**/internal/**
  routes:
    test-node-0:
      path: /node0/**
      serviceId: sc-test-node-0
      sensitiveHeaders: Cookie,Set-Cookie
      retryable: true
  ribbon-isolation-strategy: THREAD

but it seems not working.

@spencergibb
Member

What does "but it seems not working." mean?

@rterentiev
Contributor

rterentiev commented Mar 16, 2017

Hystrix uses RxJava, and when a command should fail due to a timeout, Hystrix unsubscribes from the command's observable (see com.netflix.hystrix.AbstractCommand.HystrixObservableTimeoutOperator).

If the THREAD isolation level is used, the command thread can be interrupted. But @ryanjbaxter found an issue with that.

With the SEMAPHORE isolation level, which is the default for Zuul, the command thread is not interrupted.

Now we have Spring's AbstractRibbonCommand. It extends HystrixCommand and implements the run method. We could rewrite AbstractRibbonCommand to extend HystrixObservableCommand and implement a construct method that returns an observable. That would propagate the unsubscription.

I tried to do this, see my commit. I am using Ribbon's LoadBalancerCommand, which also returns an observable.

But I ran into another problem: the retryable clients. When a retryable client is used, Ribbon's retry logic is ignored and the retry is performed in the client. I don't know what the best way is to make this work with the retryable clients. Add something like a toObservable method to the client and implement execution reactively?

Anyway, see my branch. It has tests for aborting retries. If you revert all the changes and leave only the tests, they will fail.
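
A rough sketch of the core idea (this is not the exact code from my branch; ObservableRibbonCommand is a made-up name, and the ClientHttpResponse generic just mirrors Spring's AbstractRibbonCommand):

import com.netflix.hystrix.HystrixObservableCommand;
import com.netflix.loadbalancer.reactive.LoadBalancerCommand;
import com.netflix.loadbalancer.reactive.ServerOperation;
import org.springframework.http.client.ClientHttpResponse;
import rx.Observable;

public class ObservableRibbonCommand extends HystrixObservableCommand<ClientHttpResponse> {

	private final LoadBalancerCommand<ClientHttpResponse> loadBalancerCommand;
	private final ServerOperation<ClientHttpResponse> operation;

	protected ObservableRibbonCommand(Setter setter,
			LoadBalancerCommand<ClientHttpResponse> loadBalancerCommand,
			ServerOperation<ClientHttpResponse> operation) {
		super(setter);
		this.loadBalancerCommand = loadBalancerCommand;
		this.operation = operation;
	}

	@Override
	protected Observable<ClientHttpResponse> construct() {
		// LoadBalancerCommand.submit returns an Observable; when Hystrix times out it
		// unsubscribes, and the unsubscription propagates into the retry chain instead
		// of letting further attempts run to completion
		return loadBalancerCommand.submit(operation);
	}
}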

@ryanjbaxter
Contributor

My gut tells me that we can probably do something in RibbonLoadBalancedRetryPolicyFactory so that when we check if we can retry we return false because of the timeout.

@rterentiev
Contributor

rterentiev commented Mar 17, 2017

I think what I did in my branch is wrong. The retry logic should only be in the client.

My gut tells me that we can probably do something in RibbonLoadBalancedRetryPolicyFactory so that when we check if we can retry we return false because of the timeout.

Yeah, but you would need some way to ask Hystrix whether the command has timed out.

I have another question. What was the point of adding those retryable clients? To be able to write a custom LoadBalancedRetryPolicy?

@ryanjbaxter
Contributor

What was the point of adding those retryable clients? To be able to write a custom LoadBalancedRetryPolicy?

Beginning in Camden we no longer use the Netflix RestClient, since it was deprecated. The new HTTP clients we use did not have the retry logic that the Netflix client did, so the retryable clients were made to fill that gap.

Can you perhaps submit a PR with your changes so we can take an easier look at the code changes?

@rterentiev
Contributor

Can you perhaps submit a PR with your changes so we can take an easier look at the code changes?

#1793

@rterentiev
Contributor

Beginning in Camden we no longer use the Netflix RestClient, since it was deprecated. The new HTTP clients we use did not have the retry logic that the Netflix client did, so the retryable clients were made to fill that gap.

The deprecated RestClient extends AbstractLoadBalancerAwareClient, whose executeWithLoadBalancer method has retry logic, while its execute method does not retry requests.

The new retryable clients also extend AbstractLoadBalancerAwareClient, but override getRequestSpecificRetryHandler to disable the retry logic in executeWithLoadBalancer; the retry logic is implemented in execute instead.

Comparing those implementations:

method                  | RestClient     | New retryable client
execute                 | does not retry | does retry
executeWithLoadBalancer | does retry     | does retry

What is the contract for a client? Should the execute method retry failed requests?

@ryanjbaxter
Contributor

ryanjbaxter commented Mar 20, 2017

Yes, execute should retry the failed requests; executeWithLoadBalancer calls execute.
We couldn't change the retry logic in AbstractLoadBalancerAwareClient since it is upstream, so we just disable it and move the retry logic to the client in Spring Cloud Netflix.

@ryanjbaxter
Contributor

So I did some investigation into how Hystrix communicates a timeout to the wrapped operation, and it looks like it just interrupts the thread, so all we should have to do is check Thread.interrupted(). Here is a quick sample I put together that seems to work.

@Service
public class HelloService {
	@HystrixCommand(fallbackMethod = "reliable")
	public String getHelloString() throws Exception {
		MySimplePolicy policy = new MySimplePolicy();
		RetryTemplate template = new RetryTemplate();
		template.setRetryPolicy(policy);
		return template.execute(new RetryCallback<String, Exception>() {
			@Override
			public String doWithRetry(RetryContext context) throws Exception {
				RestTemplate rest = new RestTemplate();
				((SimpleClientHttpRequestFactory)rest.getRequestFactory()).setReadTimeout(100);
				return rest.getForObject("http://localhost:8080/hello", String.class);
			}
		});

	}
	
	public String reliable() {
		return "fallback";
	}
}

class MySimplePolicy extends SimpleRetryPolicy {
	@Override
	public boolean canRetry(RetryContext context) {
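		// Hystrix signals a timeout under THREAD isolation by interrupting the command thread;
		// once that happens, mark the retry context exhausted so no further attempts are made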
		if(Thread.interrupted()) {
			context.setExhaustedOnly();
			return false;
		}
		return super.canRetry(context);
	}
}

I think you should be able to apply this to the retryable clients as well.
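
For example, something along these lines might work (just a sketch; InterruptAwareRetryPolicy is a hypothetical wrapper over the LoadBalancedRetryPolicy contract from spring-cloud-commons, and like the sample above it only helps with THREAD isolation, where the command thread actually gets interrupted):

import org.springframework.cloud.client.loadbalancer.LoadBalancedRetryContext;
import org.springframework.cloud.client.loadbalancer.LoadBalancedRetryPolicy;

public class InterruptAwareRetryPolicy implements LoadBalancedRetryPolicy {

	private final LoadBalancedRetryPolicy delegate;

	public InterruptAwareRetryPolicy(LoadBalancedRetryPolicy delegate) {
		this.delegate = delegate;
	}

	// use isInterrupted() rather than Thread.interrupted() so the flag stays set for the caller
	private boolean hystrixTimedOut() {
		return Thread.currentThread().isInterrupted();
	}

	@Override
	public boolean canRetrySameServer(LoadBalancedRetryContext context) {
		return !hystrixTimedOut() && delegate.canRetrySameServer(context);
	}

	@Override
	public boolean canRetryNextServer(LoadBalancedRetryContext context) {
		return !hystrixTimedOut() && delegate.canRetryNextServer(context);
	}

	@Override
	public void close(LoadBalancedRetryContext context) {
		delegate.close(context);
	}

	@Override
	public void registerThrowable(LoadBalancedRetryContext context, Throwable throwable) {
		delegate.registerThrowable(context, throwable);
	}
}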

@rterentiev
Contributor

rterentiev commented Mar 22, 2017

Thank you for the clarifications!

Unfortunately Thread.interrupted() will not fix the case where the Hystrix isolation level is SEMAPHORE. This is critical when you use Zuul, which uses the SEMAPHORE isolation level by default. Switching Zuul to the THREAD isolation level might affect performance: you would have twice as many busy threads, one servlet container thread plus one Hystrix thread per request.

I am trying to refactor the clients right now to make a more general fix. Long story short, the idea is to add an AbstractLoadBalancingClient.getExecutionObservable method and use it everywhere. Here is an example:

@Override
public T execute(S request, IClientConfig clientConfig) throws Exception {
	return executeWithLoadBalancer(request, clientConfig);
}

@Override
public T executeWithLoadBalancer(S request, IClientConfig clientConfig) throws ClientException {
	try {
		RequestSpecificRetryHandler handler = getRequestSpecificRetryHandler(request, clientConfig);
		return getExecutionObservable(request, clientConfig, handler)
				.toBlocking()
				.single();
	} catch (Exception e) {
		Throwable t = e.getCause();
		if (t instanceof ClientException) {
			throw (ClientException) t;
		} else {
			throw new ClientException(e);
		}
	}
}

public Observable<T> getExecutionObservable(final S request, final IClientConfig requestConfig, RetryHandler handler) {
	LoadBalancerCommand<T> command = LoadBalancerCommand.<T>builder()
			.withLoadBalancerContext(this)
			.withRetryHandler(handler)
			.withLoadBalancerURI(request.getUri())
			.build();

	try {
		return command.submit(new ServerOperation<T>() {
			@Override
			public Observable<T> call(Server server) {
				URI finalUri = reconstructURIWithServer(server, request.getUri());
				S requestForServer = (S) request.replaceUri(finalUri);

				try {
					T response = AbstractLoadBalancingClient.this.executeInternal(requestForServer, requestConfig);
					return Observable.just(response);
				} catch (Exception e) {
					return Observable.error(e);
				}
			}
		});
	} catch (Exception e) {
		return Observable.error(e);
	}
}

protected abstract T executeInternal(S request, IClientConfig requestConfig) throws Exception;

@ryanjbaxter
Contributor

I meant to add a note to my previous comment stating that it would only work for thread isolation, sorry about that.

I think the Observable approach will work for both thread and semaphore isolation. I went back and read this comment, which seems to confirm that approach. Thanks for digging into this!

@rterentiev
Contributor

rterentiev commented Mar 23, 2017

I updated my pull request #1793 with the latest changes. OkHttpLoadBalancingClient and RibbonLoadBalancingHttpClient now start retrying failed requests on execute.

Is it OK to remove RetryableRibbonLoadBalancingHttpClient and RetryableOkHttpLoadBalancingClient? I should have asked about this before, but I decided to remove them because OkHttpLoadBalancingClient and RibbonLoadBalancingHttpClient now do the same thing.

@ryanjbaxter
Contributor

No, it is not OK. We want those clients so we can use Spring Retry to retry the failed requests and not rely on the deprecated Netflix client to do so.

@rterentiev
Contributor

rterentiev commented Mar 23, 2017

Well, I was confused. Both the old and new retryable clients extend Netflix's AbstractLoadBalancerAwareClient. Netflix's client does its retrying via Ribbon, and the new retryable clients are also configured via Ribbon's configuration. So it looks like those retryable clients just replace Ribbon's retry implementation with a custom one using Spring Retry, and make the execute method retryable.
Probably I just missed something, or don't know about some future refactoring you have in mind.

Anyway, see my updated pull request. All clients now behave the same as before; basically, I just fixed the issue.

@spencergibb
Member

I'd be very uncomfortable merging the changes you have made.

@rterentiev
Contributor

No problem. At least I've shown one possible way to fix the issue. Thanks.

@ryanjbaxter
Contributor

I am going to close this issue for now. If we find there are others who want retries to stop after a Hystrix timeout, we can come back and address this.

@Aloren
Contributor

Aloren commented Mar 27, 2017

I was following your discussion as this issue is actually happening for our service in production. As soon as the services behind Zuul begin to respond more slowly, Zuul retries requests until some of our nodes die. :(
Since this is actually hard to fix, what should be done next? Or will Spring Cloud Zuul live with this bug?

Thanks

@ryanjbaxter
Contributor

Are you saying the retries that don't stop from Zuul are causing your apps to crash?

@Aloren
Contributor

Aloren commented Mar 28, 2017

It caused Zuul to crash. We had network issues and a downstream service was unreachable.
Even though the call was cancelled by Hystrix, Ribbon was still retrying (Ribbon has a default timeout of 5 seconds). This led to 10-second response times and a huge spike of busy threads on Tomcat, which in turn froze the JVM and made the node non-operational.
[screenshot attachment: screen shot 2017-03-28 at 11 17 04 am]

@ts-huyao
Author

@spencergibb even when ribbon-isolation-strategy is set to THREAD, the retry can't be interrupted by the Hystrix timeout.

@ryanjbaxter
Contributor

@ts-huyao yes, because we are not checking whether the thread has been interrupted.

@ryanjbaxter
Contributor

@Aloren and @rterentiev we talked about this issue on our team call today and came to the consensus that the proper way to configure the Hystrix timeout is to make sure you account for the total time of all retries. This means, for example, if you have the Ribbon timeout configured to be 1 second and you are going to retry the request 3 times, the Hystrix timeout should be at least 3 seconds. Is there a reason why you have not configured your Hystrix timeout this way?

@spencergibb
Member

If I recall correctly, the hystrix timeout should be slightly greater than the max expected duration of the underlying command. So if the http client is configured with a 1s connect+read timeout and 2 potential retries, the hystrix timeout should be slightly more than 3s (to account for jvm overhead). I'm trying to find where I read that.
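
As a rough illustration (numbers are only an example, not a recommendation):

ribbon:
  ConnectTimeout: 500
  ReadTimeout: 500
  MaxAutoRetries: 2
  MaxAutoRetriesNextServer: 0

hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            # worst case: (500 + 500) * (2 + 1) * (0 + 1) = 3000ms, plus some headroom
            timeoutInMilliseconds: 4000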

@vasilievip

vasilievip commented Mar 30, 2017

@ryanjbaxter, then the defaults in the Spring Cloud configs need to be adjusted as well; the 5-second Hystrix timeout for Zuul was set to override the 1-second default. WDYT?

@ryanjbaxter
Contributor

If that is the case then yes, we should make them match; however, as soon as a second instance of a service is added, that timeout needs to be adjusted.

I wonder if we could automatically calculate the hystrix timeout based on the number of instances and the configured ribbon timeout....

@spencergibb
Member

We don't currently set the defaults for hystrix in zuul except for the max number of semaphores.

@vasilievip

Here is the list of settings that need to match:

hystrix.command.default.execution.isolation.thread.timeoutInMilliseconds
ribbon.ReadTimeout
ribbon.ConnectTimeout
ribbon.MaxHttpConnectionsPerHost
ribbon.MaxTotalConnections
ribbon.MaxAutoRetriesNextServer
ribbon.MaxAutoRetries

By default the read timeout is 5000 ms and auto retry is 1...

@spencergibb
Member

I'm not sure we should be setting those values automatically.

@rterentiev
Contributor

There is no point in a Hystrix timeout if the command can't be interrupted. I would just disable the Hystrix timeout by default, and maybe log a warning message if it is enabled.
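
For reference, disabling it globally can be done with the standard Hystrix property:

hystrix:
  command:
    default:
      execution:
        timeout:
          # disables the Hystrix execution timeout for all commands
          enabled: false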

@ryanjbaxter
Contributor

@rterentiev why do you want Hystrix to interrupt the command if it hasn't finished retrying the requests?

@rterentiev
Contributor

rterentiev commented Mar 31, 2017

Because Hystrix will return a timeout error even if the command eventually executed successfully after retrying. In high-load applications this also adds extra load and causes the problems described by @Aloren.

I am not pushing you to fix this issue. I was able to work around it by implementing a custom command and command factory.

But if you are not going to fix it for now, I would disable the Hystrix timeout by default. It just makes things more consistent: the Hystrix timeout doesn't work, so it is disabled by default.

@spencergibb
Member

@rterentiev you wouldn't have those problems if the hystrix timeout were > the combined potential timeouts of all retries, correct?

@ryanjbaxter
Contributor

Because Hystrix will return a timeout error even if the command eventually executed successfully after retrying.

Right, so why not configure Hystrix so it at least waits until all retries complete?

In high-load applications this also adds extra load and causes the problems described by @Aloren.

This could potentially be dealt with by implementing some kind of backoff policy; it would be an enhancement.

@vasilievip

@spencergibb, correct, but the default setup is broken. There is a 1-second Hystrix timeout and a 5-second Ribbon timeout.
So by default the Hystrix timeout could be disabled to avoid the broken behaviour (or the Ribbon defaults need to fit within the 1-second Hystrix timeout).

@spencergibb
Member

I've opened #1827 to see about adjusting the ribbon timeouts for the default apache httpclient and the okhttp client.

I've opened #1828 to document hystrix/retry/timeout relationships.

BTW, I've chatted with a netflix engineer and I am correct in those relationships. As a side note, even they struggle with this some.

@rterentiev
Contributor

rterentiev commented Apr 3, 2017

Right, so why not configure Hystrix so it at least waits until all retries complete?

@rterentiev you wouldn't have those problems if the hystrix timeout were > the combined potential timeouts of all retries, correct?

You need to keep this problem in mind: each time you update the Ribbon configuration, you have to recalculate the timeout value for Hystrix. It's much simpler to just disable the Hystrix timeout. The behavior is the same whether the Hystrix timeout is disabled or set greater than the combined potential timeouts of all retries.

@spencergibb
Member

Disabling the hystrix timeout seems like a bad idea. May as well remove hystrix at that point.

@matt-shaw

Hi,

I've been following this thread but am slightly confused about which settings do and don't work with Zuul, Ribbon and Hystrix.

We are using the Spring Cloud Edgeware release with all the default values. We are connecting Zuul to our services using Ribbon backed by Eureka. Some of our services are a little slow, ~10 seconds or so. How do I increase the various timeouts being triggered so that this normal use case does not time out?

I've set these:

ribbon:
  ReadTimeout: 10000
  ConnectTimeout: 10000

which stopped some of the timeouts but I'm still left with the Hystrix timeouts.

Many thanks

Matt

@ryanjbaxter
Contributor

@matt-shaw please don't ask the same question on multiple issues.
