Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

After 1.29 graceful shutdown is broken #1258

Closed
rupor-github opened this issue Mar 24, 2022 · 8 comments · Fixed by #1260
Closed

After 1.29 graceful shutdown is broken #1258

rupor-github opened this issue Mar 24, 2022 · 8 comments · Fixed by #1260

Comments

@rupor-github
Copy link

rupor-github commented Mar 24, 2022

We are observing panic reports during service shutdown:

panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
goroutine 30732118 [running]:
github.com/valyala/fasthttp.(*Server).serveConn(0xc00022c240, {0xd46f50, 0xc354f
         github.com/valyala/fasthttp@v1.33.0/server.go:2353 +0x2099
github.com/valyala/fasthttp.(*workerPool).workerFunc(0xc00027c000, 0xc24ef082a0)
         github.com/valyala/fasthttp@v1.33.0/workerpool.go:223 +0xa9
github.com/valyala/fasthttp.(*workerPool).getCh.func1()
         github.com/valyala/fasthttp@v1.33.0/workerpool.go:195 +0x38
created by github.com/valyala/fasthttp.(*workerPool).getCh
         github.com/valyala/fasthttp@v1.33.0/workerpool.go:194 +0x1b5

This seems to be introduced by #1081 - all versions before that seems to be fine. This cannot be intercepted using recover() because it happens in goroutine started by workpool:

		vch := wp.workerChanPool.Get()
		ch = vch.(*workerChan)
		go func() {
			wp.workerFunc(ch)
			wp.workerChanPool.Put(vch)
		}()

Reliably reproducible and quite annoying.

@erikdubbelboer
Copy link
Collaborator

Can you try updating to the latest version, v1.34.0 and see if it still happens?

Seems like it's this line causing the panic:

if err := c.SetWriteDeadline(time.Now().Add(writeTimeout)); err != nil {

But that's during the handling of a request, unless you are closing the connection in your handler I don't see how that would be possible. Can you share your handler and how you call Shutdown?

@rupor-github
Copy link
Author

I do not see enough difference in 1.34.0 to warrant any changes - but I will try. Unfortunately it will have to wait for several days as we are in a freeze period.

Here is log extract from a single box I used for experimenting over past several days:

Mar 16 15:22:52 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 16 15:22:52 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 17 15:40:34 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 21 13:39:17 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 21 13:39:17 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 21 13:55:44 panic: BUG: error in SetReadDeadline(30s): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 21 14:51:08 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 21 14:51:08 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 21 19:17:03 panic: BUG: error in SetReadDeadline(30s): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 22 19:49:23 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 22 19:49:23 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 22 21:01:30 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 22 21:05:07 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 22 21:44:58 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 22 21:44:58 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 22 21:48:14 panic: BUG: error in SetReadDeadline(30s): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 22 22:22:37 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 22 22:23:25 panic: BUG: error in SetReadDeadline(30s): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 22 23:14:56 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 22 23:15:47 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 22 23:34:08 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 22 23:34:08 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 22 23:54:57 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 22 23:57:13 panic: BUG: error in SetReadDeadline(1s): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 23 11:10:19 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 23 11:10:20 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 23 11:17:38 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 23 11:33:33 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 23 11:39:41 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 23 11:39:41 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 23 12:39:40 panic: BUG: error in SetReadDeadline(30s): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 23 12:43:32 panic: BUG: error in SetReadDeadline(30s): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 23 13:13:08 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 23 13:18:34 panic: BUG: error in SetReadDeadline(30s): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 23 14:01:49 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 23 14:11:37 panic: BUG: error in SetReadDeadline(30s): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 23 14:43:48 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 23 14:43:48 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 23 15:34:05 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 23 15:34:05 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 23 16:43:15 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection
Mar 24 17:03:48 panic: BUG: error in SetWriteDeadline(5ms): set tcp XXX.XXX.XXX.XXX:15100: use of closed network connection

Each time it happens diring service shutdown (restart).

Be assured that handler does not call the Shutdown() - after all this would break old versions of fastHttp too - and we have this code working in production under pretty heavy load for years.

Here is the sequence of events on service shutdown (simplified) with relevant (I hope) code fragments:

from utils:

type FastHTTPGraceful struct {
	Name     string
	Servers  []*fasthttp.Server
	Addrs    []string
	Notifier func(error)
	log      *zap.Logger
	done     int32
}

// Close implements io.Closer.
func (g *FastHTTPGraceful) Close() error {

	// Close() may be called from multiple go-routines several times - do it only once
	if atomic.AddInt32(&g.done, 1) == 1 {
		g.log.Info("Shutting down", zap.String("interface", g.Name))
		for i, srv := range g.Servers {
			// Note: at the moment fasthttp shutdown does not accept context and cannot be interrupted, so...
			end := make(chan struct{}, 1)
			go func() {
				// here we may spend really long time
				if err := srv.Shutdown(); err != nil {
					if g.Notifier != nil {
						g.Notifier(fmt.Errorf("%s (%s): %w", g.Name, g.Addrs[i], err))
					}
				} else {
					if g.Notifier != nil {
						g.Notifier(fmt.Errorf("%s %s is down", g.Name, g.Addrs[i]))
					}
				}
				end <- struct{}{}
			}()
			timer := time.NewTimer(time.Second * 2)
			defer timer.Stop()
			select {
			case <-end:
			case <-timer.C:
				return fmt.Errorf("%s %s fasthttp interface incomplete shutdown", g.Name, g.Addrs[i])
			}
		}
		return fmt.Errorf("%s fasthttp interface shutdown", g.Name)
	}
	return nil
}

from service itself

func startAPIInterfaces(_ *cli.Context, env *state.LocalEnv) (io.Closer, error) {
	g := util.NewFastHTTPGraceful("api", len(cps), env.Done, env.Log)
	for i, cp := range cps {
		h := apiHandler(env, timeout)
		g.Addrs[i] = cp.AddrString()
		g.Servers[i] = &fasthttp.Server{
			Name:               "Server",
			ReadTimeout:        cp.ReadTimeout,
			WriteTimeout:       cp.WriteTimeout,
			IdleTimeout:        cp.IdleTimeout,
			MaxConnsPerIP:      cp.MaxConnsPerIP,
			MaxRequestsPerConn: cp.MaxRequestsPerConn,
			Concurrency:        cp.Concurrency,
			Logger:             util.NewFastHTTPLog(env.Log),
			Handler: func(ctx *fasthttp.RequestCtx) {
				switch string(ctx.Path()) {
				case "/status":
					_, _ = ctx.WriteString(fmt.Sprintf("ready: %d\n", env.Health.Value()))
				case "/api/read":
					h(ctx)
				}
			},
		}
		go func(addr string, srv *fasthttp.Server) {
			l, err := cfg.NewListener("tcp4", addr)
			if err != nil {
				env.Log.Error("Failed to start API interface with custom listener", zap.Error(err))
				return
			}
			if err := srv.Serve(l); err != nil {
				env.Done(err)
			} else {
				env.Done(errors.New("fasthttp: Server closed"))
			}
		}(g.Addrs[i], g.Servers[i])
	}
        return g, nil
}

// Run is "service" command main execution body.
func Run(ctx *cli.Context) error {

        env := state.NewLocalEnv(cfg, al, log)

	var api io.Closer
	if api, err = startAPIInterfaces(ctx, env); err != nil {
		return cli.NewExitError(fmt.Errorf("%s: %w", errPrefix, err), errCode)
	}
	defer api.Close()

	go func() {
		c := make(chan os.Signal, 1)
		signal.Notify(c, syscall.SIGINT, syscall.SIGTERM)
		// Initiating shutdown
		env.Done(fmt.Errorf("service Run() %s", <-c))
		env.Done(api.Close())
	}()

       // Waits for shutdown initiation indefinetly. 
       // After shutdown was initiated loops through Done() events and exits when no new events are available during timeout
       env.WaitForShutdown() 
}

@erikdubbelboer
Copy link
Collaborator

Which version of fasthttp are you using?

Just to be sure, your handler doesn't close the connection itself?

@rupor-github
Copy link
Author

1.33.0 - does not work as you could see from original message: github.com/valyala/fasthttp@v1.33.0/
1.15.1 - works

As I wrote - our handler does not close connections itself

@erikdubbelboer
Copy link
Collaborator

I have made a branch with a change that could potentially fix the issue. Can you try this branch? You can either go get github.com/valyala/fasthttp@4175878b13bc or add this to your go.mod:

replace (
    github.com/valyala/fasthttp => github.com/valyala/fasthttp v1.34.1-0.20220326112101-4175878b13bc
)

@rupor-github
Copy link
Author

rupor-github commented Mar 29, 2022

@erikdubbelboer Unfortunately proposed fix does not work:

Mar 29 14:51:46 panic: BUG: error in SetReadDeadline(30s): set tcp 10.13.71.29:15100: use of closed network connection
Mar 29 14:51:46 goroutine 1674013 [running]:
Mar 29 14:51:46 github.com/valyala/fasthttp.(*Server).serveConn(0xc000430240, {0xd48150, 0xc2f52d17c8})
Mar 29 14:51:46          github.com/valyala/fasthttp@v1.34.1-0.20220326112101-4175878b13bc/server.go:2185 +0x21ec
Mar 29 14:51:46 github.com/valyala/fasthttp.(*workerPool).workerFunc(0xc0003860a0, 0xc245526960)
Mar 29 14:51:46          github.com/valyala/fasthttp@v1.34.1-0.20220326112101-4175878b13bc/workerpool.go:224 +0xa9
Mar 29 14:51:46 github.com/valyala/fasthttp.(*workerPool).getCh.func1()
Mar 29 14:51:46          github.com/valyala/fasthttp@v1.34.1-0.20220326112101-4175878b13bc/workerpool.go:196 +0x38
Mar 29 14:51:46 created by github.com/valyala/fasthttp.(*workerPool).getCh
Mar 29 14:51:46          github.com/valyala/fasthttp@v1.34.1-0.20220326112101-4175878b13bc/workerpool.go:195 +0x1b5

@erikdubbelboer
Copy link
Collaborator

I replaced the panic with a break now. I see it's still possible in some cases for the code to reach the setReadDeadline after an idle connection is closed. I think just doing break to stop handling the connection is the best solution in that case.

Can you please try v1.34.1-0.20220329185012-6d94120e500f?

@rupor-github
Copy link
Author

Confirming - looks like shutdown with v1.34.1-0.20220329185012-6d94120e500f behaves identically to 1.15.1 we used before.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants