
Can't get high throughput #60

Closed
djannot opened this issue Feb 28, 2016 · 10 comments

djannot commented Feb 28, 2016

I have a very simple program:

package main

import (
        "flag"
        "github.com/davecheney/profile"
        "github.com/valyala/fasthttp"
        "github.com/valyala/fasthttp/reuseport"
        "log"
        "time"
)

var (
        addr = flag.String("addr", ":10000", "TCP address to listen to")
        c    = &fasthttp.HostClient{
                Addr:            "192.168.1.1:80",
                ReadTimeout:     30 * time.Second,
                WriteTimeout:    30 * time.Second,
                ReadBufferSize:  64 * 1024,
                WriteBufferSize: 64 * 1024,
        }
)

func main() {
        defer profile.Start(profile.CPUProfile).Stop()
        flag.Parse()

        listener, err := reuseport.Listen("tcp4", *addr)
        if err != nil {
                panic(err)
        }
        defer listener.Close()

        if err := fasthttp.Serve(listener, requestHandler); err != nil {
                log.Fatalf("Error in ListenAndServe: %s", err)
        }
}

func requestHandler(ctx *fasthttp.RequestCtx) {
        err := c.Do(&ctx.Request, &ctx.Response)
        if err != nil {
                log.Printf("Error: %s", err)
        }
        ctx.Response.Header.DisableNormalizing()
        etag := string(ctx.Response.Header.Peek("Etag"))
        ctx.Response.Header.Del("Etag")
        ctx.Response.Header.Set("ETag", etag)
}

I can't get more than 100 MB/s, but if I run the same benchmark using 192.168.1.1:80 directly, I get more than twice this throughput.

Here is the profile output:

Entering interactive mode (type "help" for commands)
(pprof) top10
68.64s of 71.57s total (95.91%)
Dropped 207 nodes (cum <= 0.36s)
Showing top 10 nodes out of 54 (cum >= 54.31s)
      flat  flat%   sum%        cum   cum%
    38.15s 53.30% 53.30%     38.36s 53.60%  syscall.Syscall
    28.83s 40.28% 93.59%     28.83s 40.28%  runtime.memclr
     0.85s  1.19% 94.77%      0.85s  1.19%  runtime.memmove
     0.37s  0.52% 95.29%      0.37s  0.52%  runtime.futex
     0.16s  0.22% 95.51%     25.67s 35.87%  net.(*netFD).Read
     0.07s 0.098% 95.61%     25.78s 36.02%  bufio.(*Reader).Read
     0.06s 0.084% 95.70%     25.73s 35.95%  net.(*conn).Read
     0.06s 0.084% 95.78%      0.78s  1.09%  runtime.(*mspan).sweep
     0.05s  0.07% 95.85%      0.44s  0.61%  runtime.findrunnable
     0.04s 0.056% 95.91%     54.31s 75.88%  github.com/valyala/fasthttp.appendBodyFixedSize

valyala commented Feb 29, 2016

Below are a few recommendations regarding the code:

  • Use gofmt for formatting Go source code.
  • Import github.com/pkg/profile instead of github.com/davecheney/profile - see http://dave.cheney.net/2014/10/22/simple-profiling-package-moved-updated for the reasoning.
  • There are no benefits to using a reuseport listener in a single-process setup.
  • Prefer a memory profile over a CPU profile and analyze it with go tool pprof --alloc_objects (see the note after the final code below).
  • The CPU profile you provided differs significantly from mine. Make sure you passed the correct executable to go tool pprof. Below is my CPU profile after removing the []byte -> string conversion for the etag (see below for details):
(pprof) top
10600ms of 17580ms total (60.30%)
Dropped 203 nodes (cum <= 87.90ms)
Showing top 10 nodes out of 116 (cum >= 270ms)
      flat  flat%   sum%        cum   cum%
    8160ms 46.42% 46.42%     8560ms 48.69%  syscall.Syscall
     410ms  2.33% 48.75%      660ms  3.75%  github.com/valyala/fasthttp.(*ResponseHeader).parseHeaders
     380ms  2.16% 50.91%      380ms  2.16%  runtime.epollwait
     290ms  1.65% 52.56%    16120ms 91.70%  github.com/valyala/fasthttp.(*Server).serveConn
     270ms  1.54% 54.10%      270ms  1.54%  runtime.memmove
     230ms  1.31% 55.40%      230ms  1.31%  runtime/internal/atomic.Cas
     220ms  1.25% 56.66%     2880ms 16.38%  net.(*netFD).Read
     220ms  1.25% 57.91%      280ms  1.59%  runtime.deferreturn
     210ms  1.19% 59.10%      210ms  1.19%  runtime.indexbytebody
     210ms  1.19% 60.30%      270ms  1.54%  runtime.netpollblock

This profile shows that more than 46% of the total time is spent in system calls. The peek command shows that two syscalls were used, read and write:

(pprof) peek syscall.Syscall
15.46s of 17.58s total (87.94%)
Dropped 203 nodes (cum <= 0.09s)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context          
----------------------------------------------------------+-------------
                                             6.50s 77.20% |   syscall.write
                                             1.92s 22.80% |   syscall.read
     8.16s 46.42% 46.42%      8.56s 48.69%                | syscall.Syscall
                                             0.31s 77.50% |   runtime.entersyscall
                                             0.09s 22.50% |   runtime.exitsyscall
----------------------------------------------------------+-------------

It looks like there are no significant bottlenecks in the code. It could be optimized further by pipelining buffered requests to the server in order to minimize the number of read and write syscalls. Currently the fasthttp client doesn't provide request pipelining, though it is on the TODO list. So the only option at the moment is to implement it yourself on top of the Request and Response objects, along the lines of the sketch below.
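
For illustration, a minimal sketch of such manual pipelining, assuming the upstream address from the code above and an illustrative batch size; a real proxy would pool connections and handle partial failures:

package main

import (
        "bufio"
        "fmt"
        "log"
        "net"

        "github.com/valyala/fasthttp"
)

// pipelineBatch writes the whole batch of requests before reading any
// response, so the batch costs roughly one write and one read syscall
// instead of one pair per request.
func pipelineBatch(addr string, reqs []*fasthttp.Request, resps []*fasthttp.Response) error {
        conn, err := net.Dial("tcp", addr)
        if err != nil {
                return err
        }
        defer conn.Close()

        bw := bufio.NewWriter(conn)
        for _, req := range reqs {
                if err := req.Write(bw); err != nil {
                        return err
                }
        }
        if err := bw.Flush(); err != nil {
                return err
        }

        // HTTP/1.1 pipelining returns responses in request order.
        br := bufio.NewReader(conn)
        for _, resp := range resps {
                if err := resp.Read(br); err != nil {
                        return err
                }
        }
        return nil
}

func main() {
        const n = 10
        reqs := make([]*fasthttp.Request, n)
        resps := make([]*fasthttp.Response, n)
        for i := 0; i < n; i++ {
                reqs[i] = &fasthttp.Request{}
                reqs[i].SetRequestURI("http://192.168.1.1:80/")
                resps[i] = &fasthttp.Response{}
        }
        if err := pipelineBatch("192.168.1.1:80", reqs, resps); err != nil {
                log.Fatalf("pipeline error: %s", err)
        }
        fmt.Printf("first status: %d\n", resps[0].StatusCode())
}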

  • 40% of CPU time in runtime.memclr in your CPU profile may indicate that you proxy quite big responses. Currently the fasthttp client isn't optimized for big responses, since it reads the whole response body into memory before passing it to the caller. The better solution is to stream big responses directly to the client (see the streaming sketch after this list).
  • The following code may lead to unnecessary memory allocation and copy during []byte -> string conversion:
  etag := string(ctx.Response.Header.Peek("Etag"))
  ctx.Response.Header.Del("Etag")
  ctx.Response.Header.Set("ETag", etag)

So it would be better to rewrite it in zero-alloc fashion:

        h.SetBytesV("ETag", h.Peek("Etag"))
        h.Del("Etag")
  • Make sure you send requests to the proxy from a dedicated set of machines. If you run load tests on the same machine where the proxy is located, your results will be skewed, since the load tests may eat a significant share of CPU time.
  • Make sure you have enough network bandwidth for the proxy. It would be better to have two distinct physical network interfaces on the proxy machine: the first for incoming requests to the proxy and the second for outgoing requests to the server. If you have only a single network interface on the proxy, results may be skewed, since a proxy roughly doubles the load on the network, so the network may become a bottleneck.
  • A proxy isn't free: it always consumes CPU and network resources.
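
For illustration, a minimal sketch of such streaming in a fasthttp handler, assuming the upstream address from the code above; it dials per request for brevity, handles only fixed-size (Content-Length) bodies, and relies on fasthttp calling Close on a body stream that implements io.Closer:

package main

import (
        "bufio"
        "log"
        "net"

        "github.com/valyala/fasthttp"
)

// upstreamBody streams the response body straight from the upstream
// connection. fasthttp calls Close once the whole body has been sent
// to the client.
type upstreamBody struct {
        br   *bufio.Reader
        conn net.Conn
}

func (u *upstreamBody) Read(p []byte) (int, error) { return u.br.Read(p) }
func (u *upstreamBody) Close() error               { return u.conn.Close() }

func streamHandler(ctx *fasthttp.RequestCtx) {
        conn, err := net.Dial("tcp", "192.168.1.1:80")
        if err != nil {
                ctx.Error("upstream unavailable", fasthttp.StatusBadGateway)
                return
        }

        bw := bufio.NewWriter(conn)
        err = ctx.Request.Write(bw)
        if err == nil {
                err = bw.Flush()
        }
        if err != nil {
                conn.Close()
                ctx.Error("upstream write failed", fasthttp.StatusBadGateway)
                return
        }

        // Read only the response headers into memory; the body stays on
        // the wire and is streamed to the client chunk by chunk.
        br := bufio.NewReader(conn)
        if err := ctx.Response.Header.Read(br); err != nil {
                conn.Close()
                ctx.Error("upstream read failed", fasthttp.StatusBadGateway)
                return
        }
        ctx.Response.SetBodyStream(&upstreamBody{br: br, conn: conn}, ctx.Response.Header.ContentLength())
}

func main() {
        if err := fasthttp.ListenAndServe(":10000", streamHandler); err != nil {
                log.Fatalf("Error in ListenAndServe: %s", err)
        }
}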

The final code I profiled above:

package main

import (
        "flag"
        "github.com/pkg/profile"
        "github.com/valyala/fasthttp"
        "log"
        "time"
)

var (
        addr = flag.String("addr", ":10000", "TCP address to listen to")
        c    = &fasthttp.HostClient{
                Addr:            "127.0.0.1:80",
                ReadTimeout:     30 * time.Second,
                WriteTimeout:    30 * time.Second,
                ReadBufferSize:  64 * 1024,
                WriteBufferSize: 64 * 1024,
        }
)

func main() {
        flag.Parse()
        defer profile.Start(profile.CPUProfile).Stop()

        s := &fasthttp.Server{
                Handler: requestHandler,
                DisableHeaderNamesNormalizing: true,
        }
        if err := s.ListenAndServe(*addr); err != nil {
                log.Fatalf("Error in ListenAndServe: %s", err)
        }
}

func requestHandler(ctx *fasthttp.RequestCtx) {
        err := c.Do(&ctx.Request, &ctx.Response)
        if err != nil {
                log.Printf("Error: %s", err)
        }
        h := &ctx.Response.Header
        h.SetBytesV("ETag", h.Peek("Etag"))
        h.Del("Etag")
}
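
As for the memory profile recommended above, the same profile package can capture it; a fragment, assuming the pkg/profile defaults (the output path is printed on shutdown):

        // Capture a memory profile instead of a CPU profile, then inspect
        // allocations with: go tool pprof --alloc_objects ./proxy mem.pprof
        defer profile.Start(profile.MemProfile).Stop()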

valyala commented Feb 29, 2016

@djannot, FYI, I fixed a problem in fasthttp that could reduce its throughput when working with big request and/or response bodies.

valyala added the bug label Feb 29, 2016

valyala commented Feb 29, 2016

Try verifying the proxy throughput now.

djannot commented Feb 29, 2016

Thanks. I'll check it and let you know

valyala added a commit that referenced this issue Mar 1, 2016
…out performance when dealing with big request and/or response bodies

djannot commented Mar 1, 2016

I've checked and only got a slight improvement.
I'm trying to build a reverse proxy, and I'll have to handle requests with both small and large bodies.
Do you plan to implement pipelining soon?

valyala commented Mar 1, 2016

Do you plan to implement pipelining soon?

I have no near-term plans regarding request pipelining. I actually tried implementing it in an internal project, but the results weren't very good, because of the following problems:

  • Certain servers don't support pipelined requests.
  • Pipelined requests usually have higher response times because of head-of-line blocking, so they must be used with caution if response latency is a priority.

valyala commented Mar 2, 2016

@djannot, I'd recommend starting with nginx or haproxy and measuring their throughput in proxy mode for your case. Since both apps are highly optimized at the lowest possible level, it is unlikely that fasthttp will beat them without request pipelining. Moreover, haproxy may skip request and response parsing and just proxy HTTP connections to the upstream server. The results collected from these apps will show the maximum throughput possible in your setup; then compare them to fasthttp.

While haproxy and nginx usually outperform fasthttp in proxy mode, fasthttp allows implementing arbitrary custom logic in Go. This is much easier than customizing the low-level C inside the event loops and state machines of nginx and haproxy.

valyala commented Mar 15, 2016

Closing this issue. Feel free to open a new one if throughput problems related to fasthttp occur again.

valyala closed this as completed Mar 15, 2016

valyala commented Apr 15, 2016

@djannot, just FYI, fasthttp now supports pipelined requests via PipelineClient; a usage sketch is below.
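
For illustration, a minimal sketch of the proxy on top of PipelineClient; MaxPendingRequests and the buffer sizes are illustrative assumptions rather than tuned values:

package main

import (
        "log"
        "time"

        "github.com/valyala/fasthttp"
)

var c = &fasthttp.PipelineClient{
        Addr:               "192.168.1.1:80",
        MaxPendingRequests: 1024,
        ReadBufferSize:     64 * 1024,
        WriteBufferSize:    64 * 1024,
}

func requestHandler(ctx *fasthttp.RequestCtx) {
        // PipelineClient batches concurrent requests over a small number
        // of connections, reducing read and write syscalls per request.
        if err := c.DoTimeout(&ctx.Request, &ctx.Response, 30*time.Second); err != nil {
                log.Printf("Error: %s", err)
        }
}

func main() {
        if err := fasthttp.ListenAndServe(":10000", requestHandler); err != nil {
                log.Fatalf("Error in ListenAndServe: %s", err)
        }
}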

djannot commented Apr 17, 2016

@valyala Awesome. Thanks
