High percent of failures (timeout) under load from server-side WebClient requests [SPR-15584] #20143

Closed
spring-issuemaster opened this issue May 24, 2017 · 18 comments

@spring-issuemaster
Collaborator

commented May 24, 2017

Jakub Rutkowski opened SPR-15584 and commented

I have just tested some blocking and non-blocking solutions in a simple, common scenario, using a sample PoC project.

Scenario:

  • There is a blocking REST endpoint which is quite slow: each request takes 200 ms.
  • There is another application, the client, which calls this slow endpoint.
    I have tested the current (blocking) Spring Boot client (Tomcat), Spring Boot 2.0 (Netty) with the WebFlux WebClient, Ratpack and Lagom. In each case I stressed the client application with a simple Gatling test scenario (100-1000 users / second).

I tested Ratpack and Lagom as reference non-blocking I/O servers, to compare their results with Spring Boot (blocking and non-blocking).

In all cases I got the results I expected, except for the Spring Boot 2.0 test. It works only at low load levels, and even then with high latency. As the load level rises, all requests time out.
(see attachments)

WebClient usage :

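import static org.springframework.http.MediaType.TEXT_PLAIN;

import java.time.Duration;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

@RestController
public class NonBlockingClientController {
    // Shared client pointing at the slow blocking service
    private final WebClient client = WebClient.create("http://localhost:9000");

    @GetMapping("/client")
    public Mono<String> getData() {
        return client.get()
                .uri("/routing")
                .accept(TEXT_PLAIN)
                .exchange()
                .timeout(Duration.ofSeconds(30)) // fail if no response arrives within 30 s
                .flatMap(clientResponse -> clientResponse.bodyToMono(String.class));
    }
}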

I have no idea what is going wrong, or whether the current M1 version simply behaves this way.

All sources published at https://github.com/rutkowskij/blocking-non-blocking-poc

blocking-service - slow blocking endpoint
non-blocking-client - Spring Boot 2.0M1 and WebClient based client

I have asked for this problem on


Affects: 5.0 RC1, 5.0 RC2, 5.0 RC3

Attachments:

Issue Links:

  • #20338 Spring webflux app consumes more resources than non-reactive equivalent app implementation

0 votes, 5 watchers

@spring-issuemaster

commented Jun 19, 2017

Jakub Rutkowski commented

I've tested on Spring Boot 2.0.0.M2 (spring-webflux, spring-core 5.0.0.RC2) and the issue still exists (there is minimal progress, but requests are still failing); see attachment.

@spring-issuemaster

commented Jul 24, 2017

Rossen Stoyanchev commented

I've updated the title (originally "Spring WebFlux WebClient resilience and performance") to reflect the concrete issue to investigate.

The larger question of resilience and performance is valid too but we can't discuss much until we figure out the cause for the high failure count.

@spring-issuemaster

commented Jul 24, 2017

Rossen Stoyanchev commented

Can you please provide basic instructions for your test repository? Also the sample is currently at Boot 2.0 M1 (RC1) and a lot has happened since (we're RC3 as of today).

@spring-issuemaster

commented Jul 24, 2017

Jakub Rutkowski commented

I have updated the dependencies to Boot 2.0 M2 and pushed the change.

Steps to reproduce:

  1. Run blocking-service/BlockingServiceApplication (it exposes the http://localhost:9000/routing endpoint, which sleeps 200 ms on each request)
  2. Run non-blocking-client/NonBlockingClientApplication (it exposes the http://localhost:8000/client endpoint, which calls the blocking service above)
  3. Run the Gatling test: mvn gatling:test in gatling-load-tests

After the test scenario finishes (~2 min), a test report is generated:

Please open the following file: PATH_TO_REPORT
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------

@spring-issuemaster

commented Jul 24, 2017

Jakub Rutkowski commented

I saw that the new RC3 was released today and wanted to test against it, but I need to wait for Boot 2.0 M3 (which I guess will use Spring Framework RC3).

@spring-issuemaster

commented Jul 24, 2017

Jakub Rutkowski commented

I have just tested on Boot 2.0.0-SNAPSHOT, which uses RC3, and it still happens.
There are many TimeoutExceptions in the logs, and some log entries contain a reference to reactor/reactor-netty#138

@spring-issuemaster

commented Jul 25, 2017

Jakub Rutkowski commented

I've attached a screenshot of resource utilization during the test. In my opinion it looks like a fixed thread pool is causing starvation, as in the classic blocking approach.

@spring-issuemaster

commented Jul 25, 2017

Rossen Stoyanchev commented

Sorry but what are you basing that opinion on?

@spring-issuemaster

commented Jul 25, 2017

Rossen Stoyanchev commented

There are many TimeoutExceptions in the logs, and some log entries contain a reference to reactor/reactor-netty#138

Okay so that is a more likely explanation for the error count. We need to have that fixed first.

@spring-issuemaster

commented Jul 25, 2017

Jakub Rutkowski commented

Sorry but what are you basing that opinion on?

  • It looks like the default Netty thread pool contains approximately 10 threads
  • It works fine for single requests and low load
  • There is no high CPU load
  • There are many timeouts when the load rises. The results are what you would expect if you ran the same test on, for example, Tomcat with a pool of 10 threads
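
The comparison with a 10-thread Tomcat pool can be made concrete with a little arithmetic. The sketch below is only a back-of-the-envelope model of a blocking, thread-per-request server, not a description of how Netty or WebClient behaves. The class and method names are made up for illustration, the 200 ms delay is the one used by the slow endpoint in this issue, and the pool size of 10 is the approximate figure from the list above.

```java
import java.time.Duration;

/** Rough capacity estimate for a blocking, thread-per-request server (hypothetical helper). */
final class BlockingCapacity {

    /** Upper bound on requests per second when each request holds one thread for the whole latency. */
    static double maxRequestsPerSecond(int threads, Duration latency) {
        long millis = latency.toMillis();
        if (threads <= 0 || millis <= 0) {
            throw new IllegalArgumentException("threads and latency must be positive");
        }
        return threads * 1000.0 / millis;
    }

    /** Requests added to the backlog each second at the given arrival rate (never negative). */
    static double backlogGrowthPerSecond(double arrivalsPerSecond, int threads, Duration latency) {
        return Math.max(0.0, arrivalsPerSecond - maxRequestsPerSecond(threads, latency));
    }

    public static void main(String[] args) {
        Duration latency = Duration.ofMillis(200); // delay of the slow endpoint in this issue
        int threads = 10;                          // assumed pool size, taken from the comment

        System.out.println(maxRequestsPerSecond(threads, latency));         // prints 50.0
        System.out.println(backlogGrowthPerSecond(100, threads, latency));  // prints 50.0
        System.out.println(backlogGrowthPerSecond(1000, threads, latency)); // prints 950.0
    }
}
```

Under this model, ten blocked threads serve at most 50 requests per second, so at 100 to 1000 users per second the backlog keeps growing until requests hit the timeout. Whether the same model applies to a non-blocking client is the open question in this thread.
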
@spring-issuemaster

commented Jul 31, 2017

Rossen Stoyanchev commented

The problem is not with the thread pool size, and with non-blocking code there is no need for extra threads to handle concurrency. There is something else at play here.
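
To illustrate why extra threads are not needed when waiting is non-blocking, here is a minimal JDK-only sketch. It is an analogy, not Reactor Netty code: a single scheduler thread stands in for an event loop, a timer stands in for the 200 ms remote call, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Many concurrent "requests" waiting on one thread (illustrative analogy only). */
final class NonBlockingWaitDemo {

    /**
     * Starts the given number of simulated calls, each completing after delayMillis,
     * all driven by a single scheduler thread. Returns the elapsed wall-clock time in ms.
     */
    static long runAll(int requests, long delayMillis) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        try {
            long start = System.nanoTime();
            List<CompletableFuture<String>> pending = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                CompletableFuture<String> response = new CompletableFuture<>();
                // The wait is a timer registration, not a sleeping thread.
                timer.schedule(() -> { response.complete("ok"); }, delayMillis, TimeUnit.MILLISECONDS);
                pending.add(response);
            }
            // Only the caller of this demo blocks, and only to measure the total time.
            CompletableFuture.allOf(pending.toArray(new CompletableFuture<?>[0])).join();
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } finally {
            timer.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("1000 waits of 200 ms on one thread took " + runAll(1000, 200) + " ms");
    }
}
```

Because no thread sleeps per request, the total time should stay close to a single 200 ms delay rather than growing with the number of requests, which is the behaviour a blocking pool of the same size could not match.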

I've been able to reproduce the problem when the load goes up to 1000 concurrent users (works with 500). When the load goes high enough, initially I see a few "Connection reset by peer" exceptions, then a few seconds later a flood of timeouts. Superficial observation is that something goes wrong and then all requests begin to time out. I can also confirm that with httpClientOptions.disablePool() it runs successfully at half the throughput and that a Servlet / Spring MVC server runs without any issues.

We'll probably need to wait for the investigation of reactor/reactor-netty#138. Either way the fact that disabling the connection pool makes a difference points strongly to an issue at the level of the Reactor Netty client (/cc smaldini, Violeta Georgieva).

Note also that testing a scenario like this with 3 tiers on a single machine is likely to lead to strange issues. That said, there is likely something more going on here, so I'm scheduling this for resolution one way or another.

@spring-issuemaster

commented Sep 4, 2017

Brian Clozel commented

Hello Jakub Rutkowski

Violeta Georgieva has changed a few things around the connection pool configuration, and some related issues are gone.
Could you rerun your benchmark to compare?

For that, you'll need to be on 100% SNAPSHOTs (living dangerously):

  • use Spring Boot 2.0.0.BUILD-SNAPSHOT
  • in your pom.xml, override two maven properties with <spring.version>5.0.0.BUILD-SNAPSHOT</spring.version> and <reactor.version>Bismuth.BUILD-SNAPSHOT</reactor.version>
  • in case your app is reporting strange ClassNotFoundExceptions, don't hesitate to clean your snapshots with mvn dependency:purge-local-repository

Let us know if you don't have time - the next Framework Milestone is around the corner.

Thanks!

@spring-issuemaster

commented Sep 5, 2017

Jakub Rutkowski commented

Hi Brian
I've run the test again on the snapshot version, but the results are the same as before.
There are still a lot of exceptions with a reference to reactor/reactor-netty#138.

snapshot libraries:
spring-boot-2.0.0.BUILD-20170905.071138-928.jar
spring-context-5.0.0.BUILD-20170904.142806-524.jar

@spring-issuemaster

commented Sep 5, 2017

Brian Clozel commented

Thanks a lot Jakub Rutkowski, this really helps.

@spring-issuemaster

commented Sep 5, 2017

Rossen Stoyanchev commented

Note that we ran into some issues with reactor-netty not being fully up-to-date with the latest reactor-core. This was just fixed and it might have impacted the testing. We can also give it another try on our side as well.

@spring-issuemaster

commented Sep 6, 2017

Rossen Stoyanchev commented

Using the latest snapshot in non-blocking-client and running the performance test, I no longer get any errors.

@spring-issuemaster

commented Sep 8, 2017

Jakub Rutkowski commented

I can confirm that everything looks OK now (see attachment).

@spring-issuemaster

commented Sep 8, 2017

Brian Clozel commented

Nice! I'm closing this issue now - we can still improve performance overall, but this particular problem is now gone.

Don't hesitate to keep an eye on your benchmarks and let us know - this is really useful.
Thanks Jakub Rutkowski for all the hard work.
