New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zuul/Hystrix: HTTP connection leak in case of timeout #327

Closed
brenuart opened this Issue Apr 30, 2015 · 8 comments

Comments

Projects
None yet
2 participants
@brenuart
Contributor

brenuart commented Apr 30, 2015

We came into scenarios where Zuul is leaking http connections (i.e. not returned to the pool) after execution. After a while, the pool's maximum limit is hit and Zuul can't provide any service anymore.

In short, it turns out that Zuul leaks HTTP connections if a service takes longer than the Hystrix execution timeout but before the ReadTimeout occurs.

We could reproduce the case using a Zuul instance left to its default settings except for the http pool size which is set to 1 to rapidly reproduce the problem:

  • Hystrix timeout = 1000ms
  • ribbon.ReadTimeout = 5000ms (default value)
  • ribbon.MaxTotalHttpConnections = 1
  • ribbon.MaxHttpConnectionsPerHost = 1
  • either use Eureka to discover the service or register it statically as explained in the documentation

Our test service exposes a simple REST API with a single argument to tell it how much time it should wait before returning a result. The service is proxied by Zuul such that a request for http://zuul/service/hello?delay=500 is forwarded to the actual service instance. The delayparameter tells the service to wait 500ms before returning. Fairly simple.

Scenario 1

Invoke the service with a delay of 500ms. This delay (simulated execution time) is lower than all Zuul timeout (Hystrix and Ribbon ReadTimeout).
Everything is working fine.

Scenario 2

Invoke the service with a delay of 1500ms.
This should trigger the Hystrix timeout but is still within the ReadTimeout limit. A timeout error is returned to the client, but only after the service returns (because of the SEMAPHORE isolation), i.e. after 1500ms.
In this case, the HTTP connection is NEVER RETURNED TO THE POOL.

From the logs:

17:17:13.283 DEBUG --- [RibbonCommand-2] com.netflix.niws.client.http.RestClient  : RestClient sending new Request(GET: ) http://localhost:8085/hello
17:17:13.283 DEBUG --- [RibbonCommand-2] c.n.http4.MonitoredConnectionManager     : Get connection: {}->http://localhost:8085, timeout = 2000
17:17:13.283 DEBUG --- [RibbonCommand-2] com.netflix.http4.NamedConnectionPool    : [{}->http://localhost:8085] total kept alive: 1, total issued: 0, total allocated: 1 out of 1
17:17:13.283 DEBUG --- [RibbonCommand-2] com.netflix.http4.NamedConnectionPool    : Getting free connection [{}->http://localhost:8085][null]
17:17:13.284 DEBUG --- [RibbonCommand-2] com.netflix.http4.NFHttpClient           : Stale connection check
17:17:13.285 DEBUG --- [RibbonCommand-2] com.netflix.http4.NFHttpClient           : Attempt 1 to execute request

--> Hystrix timeout after 1000ms <--
17:17:14.289 DEBUG --- [ HystrixTimer-1] com.netflix.hystrix.AbstractCommand      : No fallback for HystrixCommand. 

java.lang.UnsupportedOperationException: No fallback available.
    at com.netflix.hystrix.HystrixCommand.getFallback(HystrixCommand.java:284)
    at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:311)
    at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:306)
    at rx.Observable$1.call(Observable.java:144)
...
Caused by: com.netflix.hystrix.exception.HystrixRuntimeException: helloserviceRibbonCommand timed-out and no fallback available.
    at com.netflix.hystrix.AbstractCommand$16.call(AbstractCommand.java:782)
    at com.netflix.hystrix.AbstractCommand$16.call(AbstractCommand.java:769)
...
Caused by: java.util.concurrent.TimeoutException: null
    at com.netflix.hystrix.AbstractCommand$9.call(AbstractCommand.java:589)
    at com.netflix.hystrix.AbstractCommand$9.call(AbstractCommand.java:570)

--> 1500ms later, the hello service sent its response
17:17:14.789 DEBUG --- [RibbonCommand-2] com.netflix.http4.NFHttpClient           : Connection can be kept alive indefinitely

Strange enough, the log says 'Connection can be kept alive indefinitely' - so it seems to care about the connection, but it is never returned to the pool.
To make sure, simply issue a second call with a delay of 500ms (as in scenario 1). This time it will fail because of a timeout waiting for a connection from the pool:

17:29:46.733 DEBUG --- [RibbonCommand-3] com.netflix.niws.client.http.RestClient  : RestClient sending new Request(GET: ) http://localhost:8085/hello
17:29:46.734 DEBUG --- [RibbonCommand-3] c.n.http4.MonitoredConnectionManager     : Get connection: {}->http://localhost:8085, timeout = 2000
17:29:46.734 DEBUG --- [RibbonCommand-3] com.netflix.http4.NamedConnectionPool    : [{}->http://localhost:8085] total kept alive: 0, total issued: 1, total allocated: 1 out of 1
17:29:46.734 DEBUG --- [RibbonCommand-3] com.netflix.http4.NamedConnectionPool    : No free connections [{}->http://localhost:8085][null]
17:29:46.734 DEBUG --- [RibbonCommand-3] com.netflix.http4.NamedConnectionPool    : Available capacity: 0 out of 1 [{}->http://localhost:8085][null]
17:29:46.734 DEBUG --- [RibbonCommand-3] com.netflix.http4.NamedConnectionPool    : Need to wait for connection [{}->http://localhost:8085][null]
17:29:47.733 DEBUG --- [ HystrixTimer-2] com.netflix.hystrix.AbstractCommand      : No fallback for HystrixCommand. 

java.lang.UnsupportedOperationException: No fallback available.
    at com.netflix.hystrix.HystrixCommand.getFallback(HystrixCommand.java:284)
    at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:311)
    at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:306)
...
17:29:47.734 DEBUG --- [RibbonCommand-3] c.n.l.reactive.LoadBalancerCommand       : Got error com.sun.jersey.api.client.ClientHandlerException: java.io.InterruptedIOException when executed on server localhost:8085

Look at the NamedConnectionPool traces: it basically says it cannot create a new connection because the capacity limit is reached and has to wait instead for one to be returned to the pool.

Scenario 3

Invoke the service with a delay of 6000ms
The Hystrix timeout (1000ms) AND the ReadTimeout (5000ms) should trigger. A timeout error is returned to the client (as for scenario 2), but once the ReadTimeout occurs. So the service response is not read.
In this case, everything is working fine: the http connection is properly returned to the pool.

Scenario 4

Change the configuration to have the Hystrix timeout higher than the Ribbon ReadTimeout:

  • hystrix.command.default.execution.isolation.thread.timeoutInMilliseconds: 6000

Scenario 4.1

A request with a delay of 5500ms (readTimeout < service exec time < hystrixTimeout)
No problem, works as expected, no http connection leaked (client received a timeout answer as expected).

Scenario 4.2

A request with a delay of 6500ms (readTimeout < hystrixTimeout < service exec time)
PROBLEM, http connection leaked.

Conclusion

It turns out that Zuul leaks HTTP connections if a service takes longer than the Hystrix execution timeout but less than the ReadTimeout. To summarise:

  • execTime < hystrixTimeout < readTimeout --> OK (no leak)
  • hystrixTimeout < execTime < readTimeout --> LEAK
  • hystrixTimeout < readTimeout < execTime --> OK (no leak)
  • readTimeout < execTime < hystrixTimeout --> OK (no leak)
  • readTimeout < hystrixTimeout < execTime --> LEAK

Changing the Hystrix isolation from SEMAPHORE to THREAD doesn't solve the problem (hystrix.command.helloserviceRibbonCommand.execution.isolation.strategy: THREAD)

The problem can be reproduced with the above scenarios. However, I have no idea of where to look in the code to address the issue.

This issue look similar to:

@spencergibb

This comment has been minimized.

Member

spencergibb commented Apr 30, 2015

@brenuart thanks for the detailed report. Another scenario is to have the Hystrix timeout be greater than the ribbon timeout. Does that avoid the problem as well?

@spencergibb

This comment has been minimized.

Member

spencergibb commented Apr 30, 2015

The description of the problem seems to lead me to believe that the Hystrix timeout short circuits the Ribbon/apache mechanism to clean up after itself.

@brenuart

This comment has been minimized.

Contributor

brenuart commented Apr 30, 2015

I have added scenarios 4.1 & 4.2 to cover the cases where the Hystrix timeout is set higher than the ReadTimeout. See the Conclusion section as well for a summary of the different cases.

@brenuart

This comment has been minimized.

Contributor

brenuart commented Apr 30, 2015

To be honest, I could not even find where the HTTP connection is supposed to be closed and returned to the pool. The reactive code style of Hystrix is so hard to debug.. :-(
As far as I can tell, the connection should be released once the HttpResponse is fully read - which probably never happens when the Hystrix timeout fires. Not sure though...

@brenuart

This comment has been minimized.

Contributor

brenuart commented May 6, 2015

Under normal circumstances, the RibbonCommand returns an HttpResponse instance that somehow wraps the input stream of the HTTP connection used to contact the origin server. The HTTP connection stays open until that stream is fully read and is automatically closed when the stream reaches EOF. This allows for streaming content from the origin directly into the client's output stream (via ZUUL).

The RibbonRoutingFilter ("route" phase) is responsible for executing the RibbonCommand and storing the returned HttpResponse into Zuul's RequestContext. It is then picked-up by the SendResponseFilter in the "post" phase and its content is sent to the caller.

However, because the RibbonCommand is also an HystrixCommand its execution won't return anything in case of an Hystrix timeout but an HystrixRuntimeException will be thrown instead. Unfortunately, the RibbonCommand may have executed successfully (but not fast enough to satisfy Hystrix). An HttpResponse is produced but never returned to the caller. It is lost, never consumed and therefore never closed. The underlying HTTP connection is therefore never returned to the pool.

To fix the issue temporarily, I made changes to the RibbonCommand.forward() method to explicitly close the HttpResponse before returning if the command is declared as "timed out" by Hystrix. Here is what it looks like:

private HttpResponse forward() throws Exception {
...
   HttpResponse response = this.restClient.executeWithLoadBalancer(httpClientRequest);
   context.set("ribbonResponse", response);

   // close the response if the command timed out
   if( this.isResponseTimedOut() ) {
      if( response!= null ) {
         response.close();
      }
   }
   return response;
}

This is more a hack than a real fix for the issue - but it works and illustrates clearly the problem. A cleaner solution requires a better understanding of Hystrix and/or rxjava...

I forgot to mention that the proposed solution works for both the SEMAPHORE and THREAD execution strategies, and whether or not the executing thread is interrupted in case of timeout.

What do you think?

@brenuart

This comment has been minimized.

Contributor

brenuart commented Jun 10, 2015

Any news on this?

@spencergibb

This comment has been minimized.

Member

spencergibb commented Jun 10, 2015

@brenuart how about a PR with your fix. I know you think it's a hack, but you know more about it than us. Didn't I see an issue or PR on the eureka side as well?

@brenuart

This comment has been minimized.

Contributor

brenuart commented Jun 12, 2015

@spencergibb Can we close this issue since you applied the PR (2ba5c4b) ?
Or maybe you prefer to have some news about it from Netflix?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment