FreeBSD >= 12.2, < 13.0: Go runtime deadlocks and/or panics on multicore systems #411
FYI, I still run into hangs on my RELENG_12 server. However, I can
drastically reduce the instances by pinning the process to just 2 CPUs
via `cpuset -l 0,1 -p <pid>`
…On 1/8/2021 6:02 PM, Christian Schwarz wrote:
Happened 2 weeks after upgrading my main server to FreeBSD 12.2.
Made a goroutine dump of the receiving side.
Most interesting stack:
```
1 @ 0x4397a5 0x44a1e5 0x44a1ce 0x46a6e5 0x4796c5 0xa6b4f6 0xa6f205 0xa71f3e 0x46e681
#	0x46a6e4	sync.runtime_Semacquire+0x44	/usr/local/go/src/runtime/sema.go:56
#	0x4796c4	sync.(*WaitGroup).Wait+0x64	/usr/local/go/src/sync/waitgroup.go:130
#	0xa6b4f5	github.com/zrepl/zrepl/rpc/dataconn/stream.(*Conn).ReadStreamedMessage+0x3b5	/home/circleci/project/rpc/dataconn/stream/stream_conn.go:109
#	0xa6f204	github.com/zrepl/zrepl/rpc/dataconn.(*Client).recv+0x204	/home/circleci/project/rpc/dataconn/dataconn_client.go:87
#	0xa71f3d	github.com/zrepl/zrepl/rpc/dataconn.(*Client).ReqRecv.func1+0x7d	/home/circleci/project/rpc/dataconn/dataconn_client.go:162
```
But:

- no `zfs` process on the receiving side
- other stacks show that the connection was probably still active (heartbeat goroutine existed)
- sadly didn't capture `sockstat` output
Maybe I'm hitting the same bug that @mdtancsa had/has on CURRENT.
Let's see whether it happens again at some point.
This machine has hyperthreading disabled but does the (experimental)
parallel replication.
------------------------------------------------------------------------
```
goroutine profile: total 27
1 @ 0x40c2d4 0x46aefd 0xa9ae25 0x46e681
#	0x46aefc	os/signal.signal_recv+0x9c	/usr/local/go/src/runtime/sigqueue.go:147
#	0xa9ae24	os/signal.loop+0x24	/usr/local/go/src/os/signal/signal_unix.go:23

1 @ 0x4397a5 0x4065af 0x4061eb 0xa56178 0xa939ff 0xa93366 0xae22c2 0x46e681
#	0xa56177	github.com/zrepl/zrepl/replication/driver.Do.func2+0x77	/home/circleci/project/replication/driver/replication_driver.go:280
#	0xa939fe	github.com/zrepl/zrepl/daemon/job.(*ActiveSide).do+0x37e	/home/circleci/project/daemon/job/active.go:467
#	0xa93365	github.com/zrepl/zrepl/daemon/job.(*ActiveSide).Run+0x425	/home/circleci/project/daemon/job/active.go:424
#	0xae22c1	github.com/zrepl/zrepl/daemon.(*jobs).start.func1+0x141	/home/circleci/project/daemon/daemon.go:248

1 @ 0x4397a5 0x4065af 0x4061eb 0xa57888 0x46e681
#	0xa57887	github.com/zrepl/zrepl/replication/driver.(*stepQueue).Start.func1+0x47	/home/circleci/project/replication/driver/replication_stepqueue.go:85

1 @ 0x4397a5 0x4065af 0x4061eb 0xa7ccd8 0x46e681
#	0xa7ccd7	github.com/zrepl/zrepl/rpc.NewClient.func1.1+0x37	/home/circleci/project/rpc/rpc_client.go:60

1 @ 0x4397a5 0x4065af 0x4061eb 0xae1d14 0x46e681
#	0xae1d13	github.com/zrepl/zrepl/daemon.Run.func1+0x33	/home/circleci/project/daemon/daemon.go:37

1 @ 0x4397a5 0x43257b 0x468cf5 0x4d71a5 0x4d81e5 0x4d81c3 0x53e80f 0x55214e 0x70b2a2 0x4f6f51 0x70b4f3 0x708315 0x70e5df 0x70e5ea 0x56a922 0x4d3627 0x8fb1c9 0x8fb17a 0x8fba45 0x918f52 0x46e681
#	0x468cf4	internal/poll.runtime_pollWait+0x54	/usr/local/go/src/runtime/netpoll.go:220
#	0x4d71a4	internal/poll.(*pollDesc).wait+0x44	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87
#	0x4d81e4	internal/poll.(*pollDesc).waitRead+0x1a4	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
#	0x4d81c2	internal/poll.(*FD).Read+0x182	/usr/local/go/src/internal/poll/fd_unix.go:159
#	0x53e80e	net.(*netFD).Read+0x4e	/usr/local/go/src/net/fd_posix.go:55
#	0x55214d	net.(*conn).Read+0x8d	/usr/local/go/src/net/net.go:182
#	0x70b2a1	crypto/tls.(*atLeastReader).Read+0x61	/usr/local/go/src/crypto/tls/conn.go:779
#	0x4f6f50	bytes.(*Buffer).ReadFrom+0xb0	/usr/local/go/src/bytes/buffer.go:204
#	0x70b4f2	crypto/tls.(*Conn).readFromUntil+0xf2	/usr/local/go/src/crypto/tls/conn.go:801
#	0x708314	crypto/tls.(*Conn).readRecordOrCCS+0x114	/usr/local/go/src/crypto/tls/conn.go:608
#	0x70e5de	crypto/tls.(*Conn).readRecord+0x15e	/usr/local/go/src/crypto/tls/conn.go:576
#	0x70e5e9	crypto/tls.(*Conn).Read+0x169	/usr/local/go/src/crypto/tls/conn.go:1252
#	0x56a921	bufio.(*Reader).Read+0x221	/usr/local/go/src/bufio/bufio.go:227
#	0x4d3626	io.ReadAtLeast+0x86	/usr/local/go/src/io/io.go:314
#	0x8fb1c8	io.ReadFull+0x88	/usr/local/go/src/io/io.go:333
#	0x8fb179	golang.org/x/net/http2.readFrameHeader+0x39	***@***.***/http2/frame.go:237
#	0x8fba44	golang.org/x/net/http2.(*Framer).ReadFrame+0xa4	***@***.***/http2/frame.go:492
#	0x918f51	google.golang.org/grpc/internal/transport.(*http2Client).reader+0x171	***@***.***/internal/transport/http2_client.go:1218

1 @ 0x4397a5 0x43257b 0x468cf5 0x4d71a5 0x4d81e5 0x4d81c3 0x53e80f 0x55214e 0x7c7638 0x46e681
#	0x468cf4	internal/poll.runtime_pollWait+0x54	/usr/local/go/src/runtime/netpoll.go:220
#	0x4d71a4	internal/poll.(*pollDesc).wait+0x44	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87
#	0x4d81e4	internal/poll.(*pollDesc).waitRead+0x1a4	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
#	0x4d81c2	internal/poll.(*FD).Read+0x182	/usr/local/go/src/internal/poll/fd_unix.go:159
#	0x53e80e	net.(*netFD).Read+0x4e	/usr/local/go/src/net/fd_posix.go:55
#	0x55214d	net.(*conn).Read+0x8d	/usr/local/go/src/net/net.go:182
#	0x7c7637	net/http.(*connReader).backgroundRead+0x57	/usr/local/go/src/net/http/server.go:690

1 @ 0x4397a5 0x43257b 0x468cf5 0x4d71a5 0x4d9d9c 0x4d9d7e 0x53fd85 0x55c132 0x55af65 0x7d2466 0xae23cc 0xae2362 0x46e681
#	0x468cf4	internal/poll.runtime_pollWait+0x54	/usr/local/go/src/runtime/netpoll.go:220
#	0x4d71a4	internal/poll.(*pollDesc).wait+0x44	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87
#	0x4d9d9b	internal/poll.(*pollDesc).waitRead+0x1fb	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
#	0x4d9d7d	internal/poll.(*FD).Accept+0x1dd	/usr/local/go/src/internal/poll/fd_unix.go:394
#	0x53fd84	net.(*netFD).accept+0x44	/usr/local/go/src/net/fd_unix.go:172
#	0x55c131	net.(*TCPListener).accept+0x31	/usr/local/go/src/net/tcpsock_posix.go:139
#	0x55af64	net.(*TCPListener).Accept+0x64	/usr/local/go/src/net/tcpsock.go:261
#	0x7d2465	net/http.(*Server).Serve+0x265	/usr/local/go/src/net/http/server.go:2937
#	0xae23cb	net/http.Serve+0x8b	/usr/local/go/src/net/http/server.go:2498
#	0xae2361	github.com/zrepl/zrepl/daemon.(*pprofServer).controlLoop.func1+0x21	/home/circleci/project/daemon/pprof.go:74

1 @ 0x4397a5 0x43257b 0x468cf5 0x4d71a5 0x4d9d9c 0x4d9d7e 0x53fd85 0x562572 0x560805 0x7d2466 0xae19c5 0x46e681
#	0x468cf4	internal/poll.runtime_pollWait+0x54	/usr/local/go/src/runtime/netpoll.go:220
#	0x4d71a4	internal/poll.(*pollDesc).wait+0x44	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87
#	0x4d9d9b	internal/poll.(*pollDesc).waitRead+0x1fb	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
#	0x4d9d7d	internal/poll.(*FD).Accept+0x1dd	/usr/local/go/src/internal/poll/fd_unix.go:394
#	0x53fd84	net.(*netFD).accept+0x44	/usr/local/go/src/net/fd_unix.go:172
#	0x562571	net.(*UnixListener).accept+0x31	/usr/local/go/src/net/unixsock_posix.go:162
#	0x560804	net.(*UnixListener).Accept+0x64	/usr/local/go/src/net/unixsock.go:260
#	0x7d2465	net/http.(*Server).Serve+0x265	/usr/local/go/src/net/http/server.go:2937
#	0xae19c4	github.com/zrepl/zrepl/daemon.(*controlJob).Run.func5+0x44	/home/circleci/project/daemon/control.go:168

1 @ 0x4397a5 0x44982f 0x838e07 0x46e681
#	0x838e06	github.com/zrepl/zrepl/daemon/logging/trace.init.2.func1+0x3c6	/home/circleci/project/daemon/logging/trace/trace_chrometrace.go:146

1 @ 0x4397a5 0x44982f 0x90971b 0x909f53 0x92999b 0x46e681
#	0x90971a	google.golang.org/grpc/internal/transport.(*controlBuffer).get+0x11a	***@***.***/internal/transport/controlbuf.go:317
#	0x909f52	google.golang.org/grpc/internal/transport.(*loopyWriter).run+0x1d2	***@***.***/internal/transport/controlbuf.go:435
#	0x92999a	google.golang.org/grpc/internal/transport.newHTTP2Client.func3+0x7a	***@***.***/internal/transport/http2_client.go:328

1 @ 0x4397a5 0x44982f 0x919885 0x46e681
#	0x919884	google.golang.org/grpc/internal/transport.(*http2Client).keepalive+0x2e4	***@***.***/internal/transport/http2_client.go:1299

1 @ 0x4397a5 0x44982f 0x95e319 0x46e681
#	0x95e318	google.golang.org/grpc.(*ccBalancerWrapper).watcher+0x118	***@***.***/balancer_conn_wrappers.go:122

1 @ 0x4397a5 0x44982f 0x9619e5 0xa7ce8d 0x46e681
#	0x9619e4	google.golang.org/grpc.(*ClientConn).WaitForStateChange+0x104	***@***.***/clientconn.go:417
#	0xa7ce8c	github.com/zrepl/zrepl/rpc.NewClient.func1+0x18c	/home/circleci/project/rpc/rpc_client.go:67

1 @ 0x4397a5 0x44982f 0xa49ee9 0xa46f8d 0xa90d3f 0xa90cfe 0x46e681
#	0xa49ee8	github.com/zrepl/zrepl/daemon/snapper.wait+0x1c8	/home/circleci/project/daemon/snapper/snapper.go:376
#	0xa46f8c	github.com/zrepl/zrepl/daemon/snapper.(*Snapper).Run+0x24c	/home/circleci/project/daemon/snapper/snapper.go:158
#	0xa90d3e	github.com/zrepl/zrepl/daemon/snapper.(*PeriodicOrManual).Run+0x5e	/home/circleci/project/daemon/snapper/snapper_all.go:23
#	0xa90cfd	github.com/zrepl/zrepl/daemon/job.(*modePush).RunPeriodic+0x1d	/home/circleci/project/daemon/job/active.go:132

1 @ 0x4397a5 0x44982f 0xa57c3b 0x46e681
#	0xa57c3a	github.com/zrepl/zrepl/replication/driver.(*stepQueue).Start.func2+0x15a	/home/circleci/project/replication/driver/replication_stepqueue.go:92

1 @ 0x4397a5 0x44982f 0xa6864e 0x46e681
#	0xa6864d	github.com/zrepl/zrepl/rpc/dataconn/heartbeatconn.(*Conn).sendHeartbeats+0x1ad	/home/circleci/project/rpc/dataconn/heartbeatconn/heartbeatconn.go:84

1 @ 0x4397a5 0x44982f 0xa6fdfd 0xa7ae11 0xa5e0c3 0xa5a85f 0xa57554 0x82683d 0xa521ce 0xa567bc 0x46e681
#	0xa6fdfc	github.com/zrepl/zrepl/rpc/dataconn.(*Client).ReqRecv+0x35c	/home/circleci/project/rpc/dataconn/dataconn_client.go:183
#	0xa7ae10	github.com/zrepl/zrepl/rpc.(*Client).Receive+0xd0	/home/circleci/project/rpc/rpc_client.go:108
#	0xa5e0c2	github.com/zrepl/zrepl/replication/logic.(*Step).doReplication+0x462	/home/circleci/project/replication/logic/replication_logic.go:620
#	0xa5a85e	github.com/zrepl/zrepl/replication/logic.(*Step).Step+0x3e	/home/circleci/project/replication/logic/replication_logic.go:187
#	0xa57553	github.com/zrepl/zrepl/replication/driver.(*fs).do.func5+0x1f3	/home/circleci/project/replication/driver/replication_driver.go:641
#	0x82683c	github.com/zrepl/zrepl/util/chainlock.(*L).DropWhile+0x5c	/home/circleci/project/util/chainlock/chainlock.go:41
#	0xa521cd	github.com/zrepl/zrepl/replication/driver.(*fs).do+0x94d	/home/circleci/project/replication/driver/replication_driver.go:634
#	0xa567bb	github.com/zrepl/zrepl/replication/driver.(*attempt).doFilesystems.func1+0x13b	/home/circleci/project/replication/driver/replication_driver.go:433

1 @ 0x4397a5 0x44982f 0xa98809 0x46e681
#	0xa98808	github.com/zrepl/zrepl/daemon/job.(*ActiveSide).do.func1+0x108	/home/circleci/project/daemon/job/active.go:438

1 @ 0x4397a5 0x44982f 0xadd465 0xae22c2 0x46e681
#	0xadd464	github.com/zrepl/zrepl/daemon.(*controlJob).Run+0x744	/home/circleci/project/daemon/control.go:172
#	0xae22c1	github.com/zrepl/zrepl/daemon.(*jobs).start.func1+0x141	/home/circleci/project/daemon/daemon.go:248

1 @ 0x4397a5 0x44982f 0xadf032 0xae1429 0x874fa2 0x5ce502 0x5cf11e 0x8754cd 0x8754be 0xb23bc5 0x4393a9 0x46e681
#	0xadf031	github.com/zrepl/zrepl/daemon.Run+0xb11	/home/circleci/project/daemon/daemon.go:111
#	0xae1428	github.com/zrepl/zrepl/daemon.glob..func1+0x48	/home/circleci/project/daemon/main.go:16
#	0x874fa1	github.com/zrepl/zrepl/cli.(*Subcommand).run+0xe1	/home/circleci/project/cli/cli.go:105
#	0x5ce501	github.com/spf13/cobra.(*Command).execute+0x2c1	***@***.***/command.go:760
#	0x5cf11d	github.com/spf13/cobra.(*Command).ExecuteC+0x2fd	***@***.***/command.go:846
#	0x8754cc	github.com/spf13/cobra.(*Command).Execute+0x2c	***@***.***/command.go:794
#	0x8754bd	github.com/zrepl/zrepl/cli.Run+0x1d	/home/circleci/project/cli/cli.go:152
#	0xb23bc4	main.main+0x24	/home/circleci/project/main.go:24
#	0x4393a8	runtime.main+0x208	/usr/local/go/src/runtime/proc.go:204

1 @ 0x4397a5 0x44982f 0xae0788 0x46e681
#	0xae0787	github.com/zrepl/zrepl/daemon.(*pprofServer).controlLoop+0xe7	/home/circleci/project/daemon/pprof.go:45

1 @ 0x4397a5 0x44a1e5 0x44a1ce 0x46a6e5 0x4796c5 0xa5684a 0x82683d 0xa513f2 0xa4f7ba 0xa55108 0x82683d 0xa5564f 0x46e681
#	0x46a6e4	sync.runtime_Semacquire+0x44	/usr/local/go/src/runtime/sema.go:56
#	0x4796c4	sync.(*WaitGroup).Wait+0x64	/usr/local/go/src/sync/waitgroup.go:130
#	0xa56849	github.com/zrepl/zrepl/replication/driver.(*attempt).doFilesystems.func2+0x29	/home/circleci/project/replication/driver/replication_driver.go:437
#	0x82683c	github.com/zrepl/zrepl/util/chainlock.(*L).DropWhile+0x5c	/home/circleci/project/util/chainlock/chainlock.go:41
#	0xa513f1	github.com/zrepl/zrepl/replication/driver.(*attempt).doFilesystems+0x311	/home/circleci/project/replication/driver/replication_driver.go:436
#	0xa4f7b9	github.com/zrepl/zrepl/replication/driver.(*attempt).do+0x79	/home/circleci/project/replication/driver/replication_driver.go:301
#	0xa55107	github.com/zrepl/zrepl/replication/driver.Do.func1.1+0x47	/home/circleci/project/replication/driver/replication_driver.go:219
#	0x82683c	github.com/zrepl/zrepl/util/chainlock.(*L).DropWhile+0x5c	/home/circleci/project/util/chainlock/chainlock.go:41
#	0xa5564e	github.com/zrepl/zrepl/replication/driver.Do.func1+0x3ce	/home/circleci/project/replication/driver/replication_driver.go:218

1 @ 0x4397a5 0x44a1e5 0x44a1ce 0x46a6e5 0x4796c5 0xa6b4f6 0xa6f205 0xa71f3e 0x46e681
#	0x46a6e4	sync.runtime_Semacquire+0x44	/usr/local/go/src/runtime/sema.go:56
#	0x4796c4	sync.(*WaitGroup).Wait+0x64	/usr/local/go/src/sync/waitgroup.go:130
#	0xa6b4f5	github.com/zrepl/zrepl/rpc/dataconn/stream.(*Conn).ReadStreamedMessage+0x3b5	/home/circleci/project/rpc/dataconn/stream/stream_conn.go:109
#	0xa6f204	github.com/zrepl/zrepl/rpc/dataconn.(*Client).recv+0x204	/home/circleci/project/rpc/dataconn/dataconn_client.go:87
#	0xa71f3d	github.com/zrepl/zrepl/rpc/dataconn.(*Client).ReqRecv.func1+0x7d	/home/circleci/project/rpc/dataconn/dataconn_client.go:162

1 @ 0x4397a5 0x44a1e5 0x44a1ce 0x46a6e5 0x4796c5 0xae206d 0x46e681
#	0x46a6e4	sync.runtime_Semacquire+0x44	/usr/local/go/src/runtime/sema.go:56
#	0x4796c4	sync.(*WaitGroup).Wait+0x64	/usr/local/go/src/sync/waitgroup.go:130
#	0xae206c	github.com/zrepl/zrepl/daemon.(*jobs).wait.func1+0x2c	/home/circleci/project/daemon/daemon.go:144

1 @ 0x4397a5 0x46a998 0x46a96e 0x475edd 0xa57de8 0x46e681
#	0x46a96d	sync.runtime_notifyListWait+0xcd	/usr/local/go/src/runtime/sema.go:513
#	0x475edc	sync.(*Cond).Wait+0x9c	/usr/local/go/src/sync/cond.go:56
#	0xa57de7	github.com/zrepl/zrepl/replication/driver.(*stepQueue).Start.func3+0x67	/home/circleci/project/replication/driver/replication_stepqueue.go:121

1 @ 0x4688dd 0xabc582 0xabc345 0xab8f32 0xac6925 0xac8205 0x7ceaa4 0x7d09cd 0x7d20a3 0x7cd8ad 0x46e681
#	0x4688dc	runtime/pprof.runtime_goroutineProfileWithLabels+0x5c	/usr/local/go/src/runtime/mprof.go:716
#	0xabc581	runtime/pprof.writeRuntimeProfile+0xe1	/usr/local/go/src/runtime/pprof/pprof.go:724
#	0xabc344	runtime/pprof.writeGoroutine+0xa4	/usr/local/go/src/runtime/pprof/pprof.go:684
#	0xab8f31	runtime/pprof.(*Profile).WriteTo+0x3f1	/usr/local/go/src/runtime/pprof/pprof.go:331
#	0xac6924	net/http/pprof.handler.ServeHTTP+0x384	/usr/local/go/src/net/http/pprof/pprof.go:256
#	0xac8204	net/http/pprof.Index+0x944	/usr/local/go/src/net/http/pprof/pprof.go:367
#	0x7ceaa3	net/http.HandlerFunc.ServeHTTP+0x43	/usr/local/go/src/net/http/server.go:2042
#	0x7d09cc	net/http.(*ServeMux).ServeHTTP+0x1ac	/usr/local/go/src/net/http/server.go:2417
#	0x7d20a2	net/http.serverHandler.ServeHTTP+0xa2	/usr/local/go/src/net/http/server.go:2843
#	0x7cd8ac	net/http.(*conn).serve+0x8ac	/usr/local/go/src/net/http/server.go:1925
```
sink side
push side
Happened again:
=> only two connections on receiver !?
And no zfs processes on the receiver. Goroutine dump:
Looking into this later today
@mdtancsa next time it hangs up, could you run `ps aux | grep defunct` on the sender before restarting?
Will do. It doesn't happen too much on the sender side. The recv side is
pretty easy to trigger if I let the process use all available CPUs.
---Mike
…On 1/10/2021 10:20 AM, Christian Schwarz wrote:
@mdtancsa next time it hangs up, could you run `ps aux | grep defunct` on the sender before restarting?
Another (maybe interesting) observation: the 'CLOSED' sockets on the sender still have some bytes in their Recv-Q. Not sure whether that's normal.
Investigation with tcpdump shows that there was periodic heartbeat connectivity on the wire. The far more interesting aspect that I hadn't noticed initially was that there were two zrepl processes running on the sender. Apparently, something got messed up with my internal deployment.
Interesting, another lockup. Going to do one more restart and see if it locks up again.
That's the same behaviour I see. I need to do a `kill -9` of the process, as `zrepl status` fails to attach.
Yes, needed to do the same.
push side
Another lockup much like the first one:
Stacktrace
@mdtancsa is that equivalent to 12 + what becomes the next 12.X, or is it whatever the latest released 12.X is?
@problame yes, RELENG_12 is a point-in-time snapshot of the src for what eventually becomes future 12.x releases: 12.1R, 12.2R, 12.3R, etc. IIRC, I started to see the hangs when I upgraded my server from 11.x to 12. It is also referred to as 12-STABLE, and is also considered "stable" in that it's generally bug fixes and minor enhancements to the code that will not break the branch's ABI. Generally, the commits are very conservative and are definitely not experimental. Development is always done in the branch number above, also referred to as HEAD. So right now FreeBSD 13 (also HEAD or '.') is the development branch, and any bug fixes discovered there get "MFC'd", or "Merged From Current".
Interesting, I was always confused between RELENG_X and X-STABLE. Assuming it's a FreeBSD bug, it will have wandered through STABLE into RELENG into 12.2. I solicited help on Twitter yesterday.
On 1/13/2021 3:32 AM, Christian Schwarz wrote:
Interesting, I was always confused between RELENG_X and X-STABLE.
Assuming it's a FreeBSD bug, it will have wandered through STABLE into
RELENG into 12.2.
And your upgrade from 11 to 12 STABLE might have already included the
regression.
I solicited help on Twitter yesterday.
If you can find the from- and to-revisions of the update that caused
you problems, please post them here (sorry if I asked for that months
ago, I cannot find the issue comment).
I originally posted in the STABLE mailing list, but I didn't get any
traction. At the time, I upgraded the hardware and memory of the server,
so it's possible those changes exposed an existing bug that was masked by
memory pressure and slower cores. Hard to tell. But RELENG_12 from
before June 2020 *seemed* to be OK on slower hardware.
https://lists.freebsd.org/pipermail/freebsd-stable/2020-July/092477.html
At the time, I scanned through the commit logs and nothing obvious
jumped out, I don't think.
https://lists.freebsd.org/pipermail/svn-src-stable-12/2020-June/thread.html
---Mike
Thanks for following up, maybe this information is useful for a FreeBSD dev to do a bisect.
FWIW I used
@mdtancsa could you post the output of these commands as well?
On 1/18/2021 5:43 AM, Christian Schwarz wrote:
@mdtancsa could you post the output of these commands as well?
```
$ sysctl hw.model
hw.model: Intel(R) Xeon(R) E-2236 CPU @ 3.40GHz

$ grep cpu_microcode /boot/loader.conf
cpu_microcode_load="YES"
cpu_microcode_name="/boot/firmware/intel-ucode.bin"

$ pkg info devcpu-data
devcpu-data-1.37
Name           : devcpu-data
Version        : 1.37
Installed on   : Tue Dec  1 15:14:59 2020 EST
Origin         : sysutils/devcpu-data
Architecture   : FreeBSD:12:*
Prefix         : /usr/local
Categories     : sysutils
Licenses       : EULA
Maintainer     : sbruno@FreeBSD.org
WWW            : UNKNOWN
Comment        : Intel and AMD CPUs microcode updates
Annotations    :
Flat size      : 6.92MiB
Description    :
This port supplies microcode updates for use with cpuctl(4) microcode
update facility. These could be used to keep your processor's firmware
up-to-date.
```
However, the microcode in the CPU does not get updated, as it's newer than any patches provided:
```
CPU: Intel(R) Xeon(R) E-2236 CPU @ 3.40GHz (3408.25-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x906ea  Family=0x6  Model=0x9e  Stepping=10
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7ffafbff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x121<LAHF,ABM,Prefetch>
  Structured Extended Features=0x29c6fbb<FSGSBASE,TSCADJ,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,NFPUSG,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PROCTRACE>
  Structured Extended Features3=0x9c002600<MCUOPT,MD_CLEAR,TSXFA,IBPB,STIBP,L1DFL,SSBD>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 68719476736 (65536 MB)
avail memory = 66750193664 (63657 MB)
CPU microcode: no matching update found
```
This is a newer, faster board than the one I had back in June. It was a
slower Xeon as well. We upgraded the server in early December, so I
don't think it's a microcode issue, although it's possible.
---Mike
I'm having the same (or a similar) issue on Linux. That is, both clients are running Debian Buster (sender amd64 and receiver arm64) and the sender becomes stuck during replication. What happens:
I've captured the following goroutine stacks on the sender via delve: `(dlv) goroutines -t` and `(dlv) goroutines -s`.
If you believe this is a separate issue, I'll open a new one.
So the lockup in the sender goes away when the sender wakes up because the connection times out?
It never times out unless I stop zrepl on the receiver, when I do stop it, the lockup goes away immediately. That makes me think the connection closure is the cause for the wakeup.
I seem to be able to reproduce it quite often when there is a backlog of snapshots (don't know the exact value, but let's say 50+ / dataset, 23 datasets). Just now I finished working through a backlog of 800+ snapshots/dataset (backlog was due to a lockup), during that time I restarted zrepl (due to new lockups) probably around 10 times to work through it all.
It can work fine for weeks on end or lock up almost immediately, the lockups in normal operation seem a bit sporadic, but they do seem to happen often when the receiver (machine) has been restarted, but occasionally without restarts.
I'll try to reproduce and get back to you on this; if my memory serves from inspecting it at one point, most were listed as established, some possibly waiting.
@mafredri thanks for the detailed response. Since I have seen similar stack traces on FreeBSD, we might be in luck in the sense that it's not an issue in FreeBSD or Linux. I noticed yesterday evening that my upgrade to FreeBSD 12.2 might have coincided with deployment of a zrepl build built by Go 1.15. Previous builds were Go 1.14. I'll deploy a 1.14-based build today and see whether there are any new lockups.
Good lead, I'll be sure to give the new build a try. I also managed to reproduce this again to get the
So before deploying the 1.14-based build, I wanted to see if I could provoke the lockup on FreeBSD for the 1.15-based build again by expanding the CPU mask to all available CPUs. It took two days, but it locked up today.
I think I remember that when I tried to reproduce @mdtancsa's problems on FreeBSD, the stack trace that got me thinking it's a Go runtime / FreeBSD issue was exactly one of these. So there is most likely a Go runtime / FreeBSD bug, because being blocked forever on a mutex somewhere in Go's malloc can't be zrepl's fault. The other question is whether the Linux lockups are due to the same bug. I don't think they are. But let's keep them in this thread for now until we are certain what causes the FreeBSD bug.
Not yet. But probably in a month or so. I do have one 13R box running
and it's fine. However, it was fine too running RELENG_12 :) It's just my
one busy backup server that has this issue. There was a new golang
version that came out a few days ago. I recompiled the same source code,
but it still crashes at about the same frequency.
---Mike
…On 4/17/2021 3:56 PM, Christian Schwarz wrote:
@mdtancsa have you upgraded to FreeBSD 13.0 yet? I'd be interested in whether the bug is still present there.
push side
I just set up zrepl last night with a very simple snapshot and sink to a second local pool and found my `zrepl status` this morning was complaining it couldn't connect to the daemon. Looking around, I found zrepl had crashed:
I know nothing of Go, so not sure what else to provide here. This is FreeBSD 12.2-p6.
TBH, I think using cpuset to limit the number of CPUs might cause other problems with clients giving up if you have a lot of clients (I have 444 datasets from 20 different servers). The way I use it on the server side is to run the program in a shell script that logs the failure of the daemon and just restarts it. About ~5% of the time the daemon deadlocks, and I kill and restart it when it's in that condition. I check its status by connecting to the prometheus port to see if it's responding. If it's not, I kill it and restart the daemon.
FWIW, recently I "found" a regression in stable/12 when hunting for a memory corruption that seemed to be related to copy-on-write after a fork. I put "found" in quotes, because the regression was introduced and fixed a long time ago, but while the regression was merged to stable/12, the fix was not.
Interesting, is the proper fix in RELENG_12 now? At the time of the bug, we did migrate from RELENG_11 to RELENG_12, but it was hard to tell, as the bug started to manifest well after the migration and we also started to add more clients, etc. If there is a patch to test, I would be happy to give it a spin.
The problem isn't fixed yet, but hopefully will be very soon: https://reviews.freebsd.org/D33413
OK, thanks! Is it in RELENG_13 as well? We were slotted to update the box soon, but it would be nice to try this fix first to see if this is indeed the cause of the problems.
The problem was fixed in CURRENT months before 13 was branched off and 13.0 was released.
@avg-I is this review the same thing that @emaste is talking about in this thread on the Go issue tracker? golang/go#46272 (comment)
@problame no - @avg-I's review is an issue that was fixed before 13.0 released: freebsd/freebsd-src@3fd989da. The fix is now brought back to the stable/12 branch: freebsd/freebsd-src@1820ca2 and will be in 12.4. I expect we will have an errata update (in early January, after the holidays) for 12.3. If the change fixes this #411 issue, it's FreeBSD >=12.2 <13.0 that's affected. The issue I mention in golang/go#46272 was just fixed in FreeBSD main freebsd/freebsd-src@73b357b and hasn't yet been cherry-picked anywhere else. It should make it to the FreeBSD stable branches in the next day or two, and be included in the same errata update as the above.
I have been running the commit for the past 5 days and no deadlocks or crashes. Over the weekend, the server in question runs quite a busy load and normally would result in at least 3-4 crashes a day. I would say this fixed the issue for me at this point.
Thanks for checking @mdtancsa.
Since I upgraded to 13 a few weeks ago, I have run zrepl without errors and without a trimmed-down CPU set. Given @mdtancsa's positive experience on 12.3, I'll close this issue.
Errata updates with the fix are now available via