
Can't get high throughput #60

Closed
djannot opened this issue Feb 28, 2016 · 10 comments

djannot commented Feb 28, 2016

I have a very simple program:

package main

import (
        "flag"
        "github.com/davecheney/profile"
        "github.com/valyala/fasthttp"
        "github.com/valyala/fasthttp/reuseport"
        "log"
        "time"
)

var (
        addr = flag.String("addr", ":10000", "TCP address to listen to")
        c    = &fasthttp.HostClient{
                Addr:            "192.168.1.1:80",
                ReadTimeout:     30 * time.Second,
                WriteTimeout:    30 * time.Second,
                ReadBufferSize:  64 * 1024,
                WriteBufferSize: 64 * 1024,
        }
)

func main() {
        defer profile.Start(profile.CPUProfile).Stop()
        flag.Parse()

        listener, err := reuseport.Listen("tcp4", *addr)
        if err != nil {
                panic(err)
        }
        defer listener.Close()

        if err := fasthttp.Serve(listener, requestHandler); err != nil {
                log.Fatalf("Error in ListenAndServe: %s", err)
        }
}

func requestHandler(ctx *fasthttp.RequestCtx) {
        err := c.Do(&ctx.Request, &ctx.Response)
        if err != nil {
                log.Printf("Error: %s", err)
        }
        ctx.Response.Header.DisableNormalizing()
        etag := string(ctx.Response.Header.Peek("Etag"))
        ctx.Response.Header.Del("Etag")
        ctx.Response.Header.Set("ETag", etag)
}

I can't get more than 100 MB/s, but if I run the same benchmark using 192.168.1.1:80 directly, I get more than twice this throughput.

Here is the profile output:

Entering interactive mode (type "help" for commands)
(pprof) top10
68.64s of 71.57s total (95.91%)
Dropped 207 nodes (cum <= 0.36s)
Showing top 10 nodes out of 54 (cum >= 54.31s)
      flat  flat%   sum%        cum   cum%
    38.15s 53.30% 53.30%     38.36s 53.60%  syscall.Syscall
    28.83s 40.28% 93.59%     28.83s 40.28%  runtime.memclr
     0.85s  1.19% 94.77%      0.85s  1.19%  runtime.memmove
     0.37s  0.52% 95.29%      0.37s  0.52%  runtime.futex
     0.16s  0.22% 95.51%     25.67s 35.87%  net.(*netFD).Read
     0.07s 0.098% 95.61%     25.78s 36.02%  bufio.(*Reader).Read
     0.06s 0.084% 95.70%     25.73s 35.95%  net.(*conn).Read
     0.06s 0.084% 95.78%      0.78s  1.09%  runtime.(*mspan).sweep
     0.05s  0.07% 95.85%      0.44s  0.61%  runtime.findrunnable
     0.04s 0.056% 95.91%     54.31s 75.88%  github.com/valyala/fasthttp.appendBodyFixedSize

valyala commented Feb 29, 2016

Below are a few recommendations regarding the code:

  • Use gofmt for formatting Go source code.
  • Import github.com/pkg/profile instead of github.com/davecheney/profile - see http://dave.cheney.net/2014/10/22/simple-profiling-package-moved-updated for the reasoning.
  • There are no benefits to using a reuseport listener in a single-process setup.
  • Prefer a memory profile over a CPU profile and analyze it with go tool pprof --alloc_objects (see the note after the final code below).
  • The CPU profile you provided differs significantly from mine. Make sure you passed the correct executable to go tool pprof. Below is my CPU profile after removing the []byte -> string conversion for the etag (see below for details):
(pprof) top
10600ms of 17580ms total (60.30%)
Dropped 203 nodes (cum <= 87.90ms)
Showing top 10 nodes out of 116 (cum >= 270ms)
      flat  flat%   sum%        cum   cum%
    8160ms 46.42% 46.42%     8560ms 48.69%  syscall.Syscall
     410ms  2.33% 48.75%      660ms  3.75%  github.com/valyala/fasthttp.(*ResponseHeader).parseHeaders
     380ms  2.16% 50.91%      380ms  2.16%  runtime.epollwait
     290ms  1.65% 52.56%    16120ms 91.70%  github.com/valyala/fasthttp.(*Server).serveConn
     270ms  1.54% 54.10%      270ms  1.54%  runtime.memmove
     230ms  1.31% 55.40%      230ms  1.31%  runtime/internal/atomic.Cas
     220ms  1.25% 56.66%     2880ms 16.38%  net.(*netFD).Read
     220ms  1.25% 57.91%      280ms  1.59%  runtime.deferreturn
     210ms  1.19% 59.10%      210ms  1.19%  runtime.indexbytebody
     210ms  1.19% 60.30%      270ms  1.54%  runtime.netpollblock

This profile shows that more than 46% of the total time is spent in system calls. The peek command shows that two syscalls were used, read and write:

(pprof) peek syscall.Syscall
15.46s of 17.58s total (87.94%)
Dropped 203 nodes (cum <= 0.09s)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context          
----------------------------------------------------------+-------------
                                             6.50s 77.20% |   syscall.write
                                             1.92s 22.80% |   syscall.read
     8.16s 46.42% 46.42%      8.56s 48.69%                | syscall.Syscall
                                             0.31s 77.50% |   runtime.entersyscall
                                             0.09s 22.50% |   runtime.exitsyscall
----------------------------------------------------------+-------------

It looks like there are no significant bottlenecks in the code. It could be optimized further by pipelining buffered requests to the server in order to minimize the number of read and write syscalls. Currently the fasthttp client doesn't provide request pipelining, though it is on the TODO list. So the only option at the moment is to implement it yourself on top of the Request and Response objects, along the lines of the sketch below.
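
For illustration, a minimal sketch of such manual pipelining, assuming the upstream address from the code above and an illustrative batch size; a real proxy would pool connections and handle partial failures:

package main

import (
        "bufio"
        "fmt"
        "log"
        "net"

        "github.com/valyala/fasthttp"
)

// pipelineBatch writes the whole batch of requests before reading any
// response, so the batch costs roughly one write and one read syscall
// instead of one pair per request.
func pipelineBatch(addr string, reqs []*fasthttp.Request, resps []*fasthttp.Response) error {
        conn, err := net.Dial("tcp", addr)
        if err != nil {
                return err
        }
        defer conn.Close()

        bw := bufio.NewWriter(conn)
        for _, req := range reqs {
                if err := req.Write(bw); err != nil {
                        return err
                }
        }
        if err := bw.Flush(); err != nil {
                return err
        }

        // HTTP/1.1 pipelining returns responses in request order.
        br := bufio.NewReader(conn)
        for _, resp := range resps {
                if err := resp.Read(br); err != nil {
                        return err
                }
        }
        return nil
}

func main() {
        const n = 10
        reqs := make([]*fasthttp.Request, n)
        resps := make([]*fasthttp.Response, n)
        for i := 0; i < n; i++ {
                reqs[i] = &fasthttp.Request{}
                reqs[i].SetRequestURI("http://192.168.1.1:80/")
                resps[i] = &fasthttp.Response{}
        }
        if err := pipelineBatch("192.168.1.1:80", reqs, resps); err != nil {
                log.Fatalf("pipeline error: %s", err)
        }
        fmt.Printf("first status: %d\n", resps[0].StatusCode())
}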

  • 40% of CPU time in runtime.memclr in your CPU profile may indicate that you proxy quite big responses. Currently the fasthttp client isn't optimized for big responses, since it reads the whole response body into memory before passing it to the caller. The better solution is to stream big responses directly to the client (see the streaming sketch after this list).
  • The following code may lead to unnecessary memory allocation and copy during []byte -> string conversion:
  etag := string(ctx.Response.Header.Peek("Etag"))
  ctx.Response.Header.Del("Etag")
  ctx.Response.Header.Set("ETag", etag)

So it would be better to rewrite it in zero-alloc fashion:

        h.SetBytesV("ETag", h.Peek("Etag"))
        h.Del("Etag")
  • Make sure you send requests to the proxy from a dedicated set of machines. If you run load tests on the same machine where the proxy is located, your results will be skewed, since the load tests may eat a significant share of CPU time.
  • Make sure you have enough network bandwidth for the proxy. It would be better to have two distinct physical network interfaces on the proxy machine: the first for incoming requests to the proxy and the second for outgoing requests to the server. If you have only a single network interface on the proxy, results may be skewed, since a proxy roughly doubles the load on the network, so the network may become a bottleneck.
  • A proxy isn't free: it always consumes CPU and network resources.
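
For illustration, a minimal sketch of such streaming in a fasthttp handler, assuming the upstream address from the code above; it dials per request for brevity, handles only fixed-size (Content-Length) bodies, and relies on fasthttp calling Close on a body stream that implements io.Closer:

package main

import (
        "bufio"
        "log"
        "net"

        "github.com/valyala/fasthttp"
)

// upstreamBody streams the response body straight from the upstream
// connection. fasthttp calls Close once the whole body has been sent
// to the client.
type upstreamBody struct {
        br   *bufio.Reader
        conn net.Conn
}

func (u *upstreamBody) Read(p []byte) (int, error) { return u.br.Read(p) }
func (u *upstreamBody) Close() error               { return u.conn.Close() }

func streamHandler(ctx *fasthttp.RequestCtx) {
        conn, err := net.Dial("tcp", "192.168.1.1:80")
        if err != nil {
                ctx.Error("upstream unavailable", fasthttp.StatusBadGateway)
                return
        }

        bw := bufio.NewWriter(conn)
        err = ctx.Request.Write(bw)
        if err == nil {
                err = bw.Flush()
        }
        if err != nil {
                conn.Close()
                ctx.Error("upstream write failed", fasthttp.StatusBadGateway)
                return
        }

        // Read only the response headers into memory; the body stays on
        // the wire and is streamed to the client chunk by chunk.
        br := bufio.NewReader(conn)
        if err := ctx.Response.Header.Read(br); err != nil {
                conn.Close()
                ctx.Error("upstream read failed", fasthttp.StatusBadGateway)
                return
        }
        ctx.Response.SetBodyStream(&upstreamBody{br: br, conn: conn}, ctx.Response.Header.ContentLength())
}

func main() {
        if err := fasthttp.ListenAndServe(":10000", streamHandler); err != nil {
                log.Fatalf("Error in ListenAndServe: %s", err)
        }
}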

The final code I profiled above:

package main

import (
        "flag"
        "github.com/pkg/profile"
        "github.com/valyala/fasthttp"
        "log"
        "time"
)

var (
        addr = flag.String("addr", ":10000", "TCP address to listen to")
        c    = &fasthttp.HostClient{
                Addr:            "127.0.0.1:80",
                ReadTimeout:     30 * time.Second,
                WriteTimeout:    30 * time.Second,
                ReadBufferSize:  64 * 1024,
                WriteBufferSize: 64 * 1024,
        }
)

func main() {
        flag.Parse()
        defer profile.Start(profile.CPUProfile).Stop()

        s := &fasthttp.Server{
                Handler: requestHandler,
                DisableHeaderNamesNormalizing: true,
        }
        if err := s.ListenAndServe(*addr); err != nil {
                log.Fatalf("Error in ListenAndServe: %s", err)
        }
}

func requestHandler(ctx *fasthttp.RequestCtx) {
        err := c.Do(&ctx.Request, &ctx.Response)
        if err != nil {
                log.Printf("Error: %s", err)
        }
        h := &ctx.Response.Header
        h.SetBytesV("ETag", h.Peek("Etag"))
        h.Del("Etag")
}
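
As for the memory profile recommended above, the same profile package can capture it; a fragment, assuming the pkg/profile defaults (the output path is printed on shutdown):

        // Capture a memory profile instead of a CPU profile, then inspect
        // allocations with: go tool pprof --alloc_objects ./proxy mem.pprof
        defer profile.Start(profile.MemProfile).Stop()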

valyala commented Feb 29, 2016

@djannot, FYI, I fixed a problem in fasthttp that could reduce its throughput when working with big request and/or response bodies.

valyala added the bug label Feb 29, 2016

valyala commented Feb 29, 2016

Try verifying the proxy throughput now.

djannot commented Feb 29, 2016

Thanks. I'll check it and let you know

valyala added a commit that referenced this issue Mar 1, 2016
…out performance when dealing with big request and/or response bodies

djannot commented Mar 1, 2016

I've checked and only got a slight improvement.
I'm trying to build a reverse proxy, and I'll have to handle requests with both small and large bodies.
Do you plan to implement pipelining soon?

valyala commented Mar 1, 2016

Do you plan to implement pipelining soon?

I have no near-term plans regarding request pipelining. I actually tried implementing it in an internal project, but the results weren't very good, because of the following problems:

  • Certain servers don't support pipelined requests.
  • Pipelined requests usually have higher response times because of head-of-line blocking, so they must be used with caution if response latency is a priority.

valyala commented Mar 2, 2016

@djannot, I'd recommend starting with nginx or haproxy and measuring their throughput in proxy mode for your case. Since both apps are highly optimized at the lowest possible level, it is unlikely that fasthttp will beat them without request pipelining. Moreover, haproxy may skip request and response parsing and just proxy HTTP connections to the upstream server. The results collected from these apps will show the maximum throughput possible in your setup; then compare them to fasthttp.

While haproxy and nginx usually outperform fasthttp in proxy mode, fasthttp allows implementing arbitrary custom logic in Go. This is much easier than customizing the low-level C inside the event loops and state machines of nginx and haproxy.

valyala commented Mar 15, 2016

Closing this issue. Feel free to open a new one if throughput problems related to fasthttp occur again.

valyala closed this as completed Mar 15, 2016

valyala commented Apr 15, 2016

@djannot, just FYI, fasthttp now supports pipelined requests via PipelineClient; a usage sketch is below.
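
For illustration, a minimal sketch of the proxy on top of PipelineClient; MaxPendingRequests and the buffer sizes are illustrative assumptions rather than tuned values:

package main

import (
        "log"
        "time"

        "github.com/valyala/fasthttp"
)

var c = &fasthttp.PipelineClient{
        Addr:               "192.168.1.1:80",
        MaxPendingRequests: 1024,
        ReadBufferSize:     64 * 1024,
        WriteBufferSize:    64 * 1024,
}

func requestHandler(ctx *fasthttp.RequestCtx) {
        // PipelineClient batches concurrent requests over a small number
        // of connections, reducing read and write syscalls per request.
        if err := c.DoTimeout(&ctx.Request, &ctx.Response, 30*time.Second); err != nil {
                log.Printf("Error: %s", err)
        }
}

func main() {
        if err := fasthttp.ListenAndServe(":10000", requestHandler); err != nil {
                log.Fatalf("Error in ListenAndServe: %s", err)
        }
}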

djannot commented Apr 17, 2016

@valyala Awesome. Thanks
