How to improve gateway performance? #301

Closed
maoyunfei opened this issue Apr 25, 2018 · 25 comments

@maoyunfei

maoyunfei commented Apr 25, 2018

I have read the performance comparison from spring-cloud-gateway-bench.

Proxy     Avg Latency   Avg Req/Sec/Thread
gateway   6.61ms        3.24k
linkerd   7.62ms        2.82k
zuul      12.56ms       2.09k
none      2.09ms        11.77k

According to the comparison, gateway performs best next to linkerd and zuul, but its throughput reaches only about 1/4 of the no-proxy baseline.
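
To make the "about 1/4" figure concrete, the ratios can be recomputed from the table. The following is only a quick sketch: the numbers are copied from the table above, and the dictionary and function names are illustrative, not part of spring-cloud-gateway-bench.

```python
# Avg Req/Sec/Thread figures copied from the comparison table (in thousands).
req_per_sec_k = {"gateway": 3.24, "linkerd": 2.82, "zuul": 2.09, "none": 11.77}

baseline = req_per_sec_k["none"]


def fraction_of_baseline(name: str) -> float:
    """Throughput of a proxy as a fraction of the no-proxy baseline."""
    return req_per_sec_k[name] / baseline


for name in ("gateway", "linkerd", "zuul"):
    print(f"{name}: {fraction_of_baseline(name):.1%} of no-proxy throughput")
```

On these figures gateway lands at roughly 27.5% of the no-proxy throughput, which is where the "about 1/4" comes from.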

I ran some tests with wrk under different conditions, varying the response time and the response body size. Gateway's performance seems to be influenced only by response body size: once the body is larger than about 10 KB, performance drops rapidly.

So how can gateway be optimized to do better?

@re6exp
Contributor

re6exp commented Apr 25, 2018

Use third-party solutions. For example, https://varnish-cache.org.

maoyunfei reopened this Apr 27, 2018
@maoyunfei
Author

maoyunfei commented Apr 27, 2018

@re6exp Thank you for your comment, but in my business case I can't use a cache.

@lhotari

lhotari commented May 1, 2018

@maoyunfei Do you use https in your use case?
For TLS/https connections, using netty-tcnative improves performance (lower latency & higher throughput). See reactor/reactor-netty#344 for details.

@maoyunfei
Author

@lhotari
I just used http.

@blockmar

blockmar commented Jun 19, 2018

Documentation request.

Is it even possible to use netty-tcnative with the Gateway starter in some way? It is not documented anywhere in the docs and returns zero usable results on Google. I think this needs to be added to the documentation.

"Just use http" is not a way forward for us, and right now the performance of non-native TLS is holding us back from a production deploy.

Maybe it already works as described in reactor/reactor-netty#344, but I see no logs either confirming or denying that.

I ran basic benchmarks (using the very low-tech ab) and saw no difference from just including the netty-tcnative uber jar.

@spencergibb
Member

@maoyunfei any test run on a single machine will have contention problems. Can you provide a complete, minimal, verifiable sample that reproduces the slowdowns with increased response size? It should be available as a GitHub (or similar) project or attached to this issue as a zip file.

@maoyunfei
Author

maoyunfei commented Jul 10, 2018

@spencergibb I created a demo project on GitHub that reproduces the issue. Please take a look at gateway-performance-test!

@thirunar
Contributor

Is this still valid? We are evaluating zuul and gateway. Our case is similar: the response size will be around 1-2 MB.

@maoyunfei
Author

@thirunar It's still valid. By the way, we eventually switched to zuul2.

@dalegaspi

dalegaspi commented Oct 24, 2018

i would take this benchmark exercise with a huge grain of salt. the origin, test harness and proxy are all on the same box? that's nowhere close to how you're going to deploy it in prod, so why would you load test in such a manner? the test harness should be on one box, the proxy(ies) on another, and the origin on yet another box.

i'm using spring boot 2.x with zuul, with dynamic content and without any caching. the responses are 32k on average. the only difference is that i am using the undertow container with okhttp, and we have a custom pre filter that validates a JWT from Redis, so there is overhead. even with that, i can get close to 70% of the TPS of going directly to origin, using wrk -t 10 -c 200 -d 30s, similar to what is being done in the benchmark github project. i also ran tests with Apache Bench and JMeter, with comparable results.

i dunno guys. Netflix was using Zuul 1 for their services at one point, and they're dealing with video. they even admit that Zuul 2 has a 25% net effect on throughput. not sure how one can claim that it's not good enough for whatever it's going to be used for.

@spencergibb
Member

I'm going to close this.

First of all, I agree that benchmarking on one machine is a problem.

Second, running the benchmark with the latest releases (Finchley and Greenwich) does not yield the large drops in performance mentioned in the sample (which was running an RC).

I also threw in zuul1 on port 8083

# direct to app, 30000 char responses
$ wrk -t16 -c200 -d30s "http://localhost:8081/demo?delay=50&length=large"      
Running 30s test @ http://localhost:8081/demo?delay=50&length=large
  16 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    61.43ms   11.91ms 135.87ms   82.32%
    Req/Sec   195.54     36.09   274.00     61.71%
  93722 requests in 30.08s, 2.63GB read
Requests/sec:   3115.74
Transfer/sec:     89.49MB

# gateway, 30000 char responses
$ wrk -t16 -c200 -d30s "http://localhost:8082/proxy/demo?delay=50&length=large"
Running 30s test @ http://localhost:8082/proxy/demo?delay=50&length=large
  16 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    72.03ms   21.23ms 189.44ms   82.60%
    Req/Sec   167.31     42.80   242.00     60.98%
  80076 requests in 30.07s, 2.25GB read
Requests/sec:   2663.33
Transfer/sec:     76.50MB

# zuul, 30000 char responses
$ wrk -t16 -c200 -d30s "http://localhost:8083/proxy/demo?delay=50&length=large"
Running 30s test @ http://localhost:8083/proxy/demo?delay=50&length=large
  16 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   505.51ms  415.11ms   2.00s    70.26%
    Req/Sec    24.78     13.43    90.00     52.88%
  11310 requests in 30.09s, 325.30MB read
  Socket errors: connect 0, read 0, write 0, timeout 205
Requests/sec:    375.88
Transfer/sec:     10.81MB

# direct to app, 50000 char responses
$ wrk -t16 -c200 -d30s "http://localhost:8081/demo?delay=50&length=xlarge"     
Running 30s test @ http://localhost:8081/demo?delay=50&length=xlarge
  16 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    64.59ms   16.27ms 172.98ms   82.39%
    Req/Sec   185.65     42.94   265.00     53.17%
  89115 requests in 30.10s, 4.16GB read
Requests/sec:   2960.95
Transfer/sec:    141.52MB

# gateway, 50000 char responses
$ wrk -t16 -c200 -d30s "http://localhost:8082/proxy/demo?delay=50&length=xlarge"
Running 30s test @ http://localhost:8082/proxy/demo?delay=50&length=xlarge
  16 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    72.50ms   21.83ms 236.20ms   84.15%
    Req/Sec   166.38     43.79   242.00     62.42%
  79536 requests in 30.09s, 3.71GB read
Requests/sec:   2643.68
Transfer/sec:    126.36MB

# zuul, 50000 char responses
$ wrk -t16 -c200 -d30s "http://localhost:8083/proxy/demo?delay=50&length=xlarge"
Running 30s test @ http://localhost:8083/proxy/demo?delay=50&length=xlarge
  16 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   504.15ms  408.22ms   2.00s    70.64%
    Req/Sec    24.71     13.14    90.00     54.27%
  11434 requests in 30.10s, 547.20MB read
  Socket errors: connect 0, read 0, write 0, timeout 191
Requests/sec:    379.89
Transfer/sec:     18.18MB
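
One way to compare these runs is to express each proxy's Requests/sec as a fraction of the direct run. The following is a small sketch rather than part of the benchmark project: the summary lines are copied from the 30000 char runs above, and the helper name `requests_per_sec` is made up for illustration.

```python
import re


def requests_per_sec(wrk_output: str) -> float:
    """Extract the 'Requests/sec' figure from wrk's summary output."""
    match = re.search(r"Requests/sec:\s+([\d.]+)", wrk_output)
    if match is None:
        raise ValueError("no Requests/sec line found")
    return float(match.group(1))


# Summary lines copied from the 30000 char runs above.
runs = {
    "direct": "Requests/sec:   3115.74\nTransfer/sec:     89.49MB",
    "gateway": "Requests/sec:   2663.33\nTransfer/sec:     76.50MB",
    "zuul": "Requests/sec:    375.88\nTransfer/sec:     10.81MB",
}

direct = requests_per_sec(runs["direct"])
relative = {name: requests_per_sec(out) / direct for name, out in runs.items()}

for name, fraction in relative.items():
    print(f"{name}: {fraction:.1%} of direct")
```

On these numbers gateway retains roughly 85% of the direct throughput, while the default zuul1 app retains roughly 12%.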

@dalegaspi

testing localhost aside...

i can't comment on the cloud gateway but there is something seriously wrong with your zuul1 setup. 80% drop? ok...

@spencergibb
Member

There was no setup: it is a default zuul app from start.spring.io using tomcat with a single route.

@dalegaspi

look, i'm not here to knock spring-cloud-gateway. i have no sound opinion of it because i've never used it. i'm sure it's great.

i find it suspicious, however, when someone tries to knock zuul 1 and i came to this thread to point out that this is not my experience at all. i'm compelled to write my findings here so people considering spring cloud 2.x + zuul wouldn't just rule it out completely upon reading this thread.

again, zuul 1 is from Netflix and they have used it in production; i highly doubt they would have even promoted it, let alone open sourced it, if there were an 80% drop in throughput when using it. there are a number of online blog posts and articles pitting zuul 1 against nginx, and it holds its own, if not performs better in some scenarios. that's all i'm saying.

@spencergibb
Member

I don't disagree. We went with zuul 1 for that reason. The vanilla experience is NOT optimized.

@VinodKandula

@spencergibb @dalegaspi @maoyunfei We had similar performance issues with Spring Cloud Gateway (Finchley.SR1). Please find the comparison metrics below.

Metric                       Spring Boot 1.5.4 + Zuul Gateway   Spring Boot 2.0.4 + Spring Cloud Gateway (Finchley.SR1)
Throughput (Req/Sec)         460                                152
Average Response Time (ms)   107                                323

Test Server Configuration: m4.xlarge AWS instance (4-core CPU, 16 GB of memory)

It is unclear to us whether Spring Cloud Gateway can be used in prod. Please provide your comments.

@spencergibb
Member

@VinodKandula can you share more than just metrics? What do the individual apps look like, how were they configured, how did you test them?

@VinodKandula

@spencergibb
The overall system looks like the following

  1. Config Server
  2. Discovery Server (Eureka)
  3. Zuul/Spring Gateway Server
  4. Spring Data (JPA) Rest Repositories(CRUD) Service

All REST endpoints are called through Zuul/Gateway, which uses the discovery server to get the list of available instances.
JMeter is used for the performance tests. The difference in performance between Zuul and Spring Cloud Gateway is very straightforward to see.

@kimmking

@spencergibb hi Spencer, can you rerun your wrk test command with the --latency option? I find that the 99th percentile latency is 2-3 times that of direct access with -c200.
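
The point about the 99th percentile is that an average can hide a slow tail. A minimal Python sketch of the idea follows, using a nearest-rank percentile; the latency values are hypothetical, made up purely for illustration, and are not measurements from this thread.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms (NOT measured): 98 fast requests and 2 slow ones.
latencies_ms = [60.0] * 98 + [200.0] * 2

mean = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"mean={mean:.1f}ms p50={p50}ms p99={p99}ms")
```

In this made-up sample the mean stays close to the median while the 99th percentile is more than three times higher, which is why a latency distribution is more informative than the Avg column alone.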

@atverma91

The performance of both spring cloud gateway and zuul 1 is very low.
How can we increase spring cloud gateway performance?

@dalegaspi

dalegaspi commented Jan 30, 2019

we are using Spring Boot 2.x and Zuul 1.x and we found the performance really good. however, the default settings are just severely under-optimized. after considerable experimentation and research, i came up with these settings.

our use case has the following:

  • 3 custom pre-filters: one retrieves the JWT session from Redis and validates it, one adds GET parameters and headers, and another asynchronously sends messages via kafka (the 3rd filter is only applied to about 50% of traffic)
  • we are using sleuth (zipkin) that's configured to record 10% of traffic
  • the app is running in docker containers (ECS) fronted by an ALB
  • our (response) payloads are about ~30K on average

with the optimized settings, and with those 3 filters and sleuth enabled, running in ECS with an ALB, there is a less than 10% drop in throughput across several load tests (with only 1 instance of the app running in ECS)... honestly, this shouldn't really come as a surprise since Zuul 1.x is battle-tested and performs really well if configured correctly.

@atverma91

Thanks, dalegaspi. As you mentioned, you ran load tests with zuul 1.x and spring boot 2. Can you explain up to what TPS you load tested, and out of how many requests the 10% drop was measured?

@dalegaspi

@atverma91 actually the recent tests show there is no perceptible drop in throughput; if we turn on compression in Zuul we even perform better. These are the results of our latest test with compression enabled in Zuul.

type                         avg latency   throughput   avg bytes   error
direct                       610.11 ms     608.6/sec    28637.5     0.01%
zuul 1.x + spring boot 2.x   461.09 ms     746.2/sec    4225.5      0.03%

this is with JMeter, 50 threads and 15 minutes continuous run.
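
To read the table, it may help to compute how much smaller the compressed responses are and how throughput changes. The following is only a sketch: the figures are copied from the table above, and the variable names are illustrative.

```python
# Figures copied from the JMeter results table above.
direct = {"latency_ms": 610.11, "throughput": 608.6, "avg_bytes": 28637.5}
zuul = {"latency_ms": 461.09, "throughput": 746.2, "avg_bytes": 4225.5}

# Average response size through zuul (compression on) relative to direct.
size_ratio = zuul["avg_bytes"] / direct["avg_bytes"]

# Relative change in throughput when going through zuul.
throughput_gain = zuul["throughput"] / direct["throughput"] - 1

print(f"response size: {size_ratio:.1%} of direct")
print(f"throughput change: {throughput_gain:+.1%}")
```

On these figures the zuul responses average roughly 15% of the direct payload size and throughput is roughly 23% higher, so the two rows compare different amounts of data on the wire.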

it's key that when you perform load tests, the JVM is warmed up and the client is not on the same box as the service. clients are not going to be running on the same box as your service, so it baffles me why some load tests insist on having the service and the benchmarking app on one box.

@spencergibb
Member

This isn't the place to discuss zuul performance without gateway. Please take the conversation offline.

spring-cloud locked as off-topic and limited conversation to collaborators Aug 18, 2019