Deadlock if error while downsampling #4960

Closed
GiedriusS opened this issue Dec 16, 2021 · 1 comment · Fixed by #4962

GiedriusS commented Dec 16, 2021

Object Storage Provider:

S3

What happened:

An error occurred while downsampling, but Thanos Compactor got stuck instead of stopping.

What you expected to happen:

For Thanos Compactor to get into a "halt" state

How to reproduce it (as minimally and precisely as possible):

It seems to be enough to have multiple blocks waiting to be downsampled and for an error to occur while downsampling one of them.

Full logs to relevant components:

Goroutine dump:

goroutine profile: total 24
4 @ 0x4380f6 0x4309fe 0x463049 0x4d5532 0x4d689a 0x4d6888 0x56c9c9 0x57e605 0x76fd2e 0x5519c3 0x551b1d 0x770aec 0x468981
#	0x463048	internal/poll.runtime_pollWait+0x88		/usr/lib/go-1.17/src/runtime/netpoll.go:229
#	0x4d5531	internal/poll.(*pollDesc).wait+0x31		/usr/lib/go-1.17/src/internal/poll/fd_poll_runtime.go:84
#	0x4d6899	internal/poll.(*pollDesc).waitRead+0x259	/usr/lib/go-1.17/src/internal/poll/fd_poll_runtime.go:89
#	0x4d6887	internal/poll.(*FD).Read+0x247			/usr/lib/go-1.17/src/internal/poll/fd_unix.go:167
#	0x56c9c8	net.(*netFD).Read+0x28				/usr/lib/go-1.17/src/net/fd_posix.go:56
#	0x57e604	net.(*conn).Read+0x44				/usr/lib/go-1.17/src/net/net.go:183
#	0x76fd2d	net/http.(*persistConn).Read+0x4d		/usr/lib/go-1.17/src/net/http/transport.go:1926
#	0x5519c2	bufio.(*Reader).fill+0x102			/usr/lib/go-1.17/src/bufio/bufio.go:101
#	0x551b1c	bufio.(*Reader).Peek+0x5c			/usr/lib/go-1.17/src/bufio/bufio.go:139
#	0x770aeb	net/http.(*persistConn).readLoop+0x1ab		/usr/lib/go-1.17/src/net/http/transport.go:2087

4 @ 0x4380f6 0x447ed2 0x7727bb 0x468981
#	0x7727ba	net/http.(*persistConn).writeLoop+0xfa	/usr/lib/go-1.17/src/net/http/transport.go:2386

2 @ 0x4380f6 0x4309fe 0x463049 0x4d5532 0x4d689a 0x4d6888 0x56c9c9 0x57e605 0x750e4d 0x5519c3 0x55258f 0x5527e7 0x6df1b9 0x74c359 0x74c35a 0x752205 0x756545 0x468981
#	0x463048	internal/poll.runtime_pollWait+0x88		/usr/lib/go-1.17/src/runtime/netpoll.go:229
#	0x4d5531	internal/poll.(*pollDesc).wait+0x31		/usr/lib/go-1.17/src/internal/poll/fd_poll_runtime.go:84
#	0x4d6899	internal/poll.(*pollDesc).waitRead+0x259	/usr/lib/go-1.17/src/internal/poll/fd_poll_runtime.go:89
#	0x4d6887	internal/poll.(*FD).Read+0x247			/usr/lib/go-1.17/src/internal/poll/fd_unix.go:167
#	0x56c9c8	net.(*netFD).Read+0x28				/usr/lib/go-1.17/src/net/fd_posix.go:56
#	0x57e604	net.(*conn).Read+0x44				/usr/lib/go-1.17/src/net/net.go:183
#	0x750e4c	net/http.(*connReader).Read+0x16c		/usr/lib/go-1.17/src/net/http/server.go:780
#	0x5519c2	bufio.(*Reader).fill+0x102			/usr/lib/go-1.17/src/bufio/bufio.go:101
#	0x55258e	bufio.(*Reader).ReadSlice+0x2e			/usr/lib/go-1.17/src/bufio/bufio.go:360
#	0x5527e6	bufio.(*Reader).ReadLine+0x26			/usr/lib/go-1.17/src/bufio/bufio.go:389
#	0x6df1b8	net/textproto.(*Reader).readLineSlice+0x98	/usr/lib/go-1.17/src/net/textproto/reader.go:57
#	0x74c358	net/textproto.(*Reader).ReadLine+0x78		/usr/lib/go-1.17/src/net/textproto/reader.go:38
#	0x74c359	net/http.readRequest+0x79			/usr/lib/go-1.17/src/net/http/request.go:1029
#	0x752204	net/http.(*conn).readRequest+0x224		/usr/lib/go-1.17/src/net/http/server.go:966
#	0x756544	net/http.(*conn).serve+0x864			/usr/lib/go-1.17/src/net/http/server.go:1855

1 @ 0x40b8f4 0x464f18 0x5fb759 0x468981
#	0x464f17	os/signal.signal_recv+0x97	/usr/lib/go-1.17/src/runtime/sigqueue.go:169
#	0x5fb758	os/signal.loop+0x18		/usr/lib/go-1.17/src/os/signal/signal_unix.go:24

1 @ 0x4380f6 0x40640c 0x405e38 0x1694cb3 0x5fbc2f 0x468981
#	0x1694cb2	main.main.func2+0x32				/home/giedrius/dev/thanos/cmd/thanos/main.go:115
#	0x5fbc2e	github.com/oklog/run.(*Group).Run.func1+0x2e	/home/giedrius/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38

1 @ 0x4380f6 0x40640c 0x405e38 0x5fb99c 0x16946ba 0x437d27 0x468981
#	0x5fb99b	github.com/oklog/run.(*Group).Run+0x7b	/home/giedrius/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:43
#	0x16946b9	main.main+0x15b9			/home/giedrius/dev/thanos/cmd/thanos/main.go:155
#	0x437d26	runtime.main+0x226			/usr/lib/go-1.17/src/runtime/proc.go:255

1 @ 0x4380f6 0x40640c 0x405e78 0xe2cd25 0x468981
#	0xe2cd24	github.com/baidubce/bce-sdk-go/util/log.NewLogger.func1+0x64	/home/giedrius/go/pkg/mod/github.com/baidubce/bce-sdk-go@v0.9.81/util/log/logger.go:362

1 @ 0x4380f6 0x4309fe 0x463049 0x4d5532 0x4d888c 0x4d8879 0x56e175 0x587e28 0x586ffd 0x75b1f4 0xbc78df 0xbc7679 0xcb1fe5 0x168a3b5 0x5fbc2f 0x468981
#	0x463048	internal/poll.runtime_pollWait+0x88						/usr/lib/go-1.17/src/runtime/netpoll.go:229
#	0x4d5531	internal/poll.(*pollDesc).wait+0x31						/usr/lib/go-1.17/src/internal/poll/fd_poll_runtime.go:84
#	0x4d888b	internal/poll.(*pollDesc).waitRead+0x22b					/usr/lib/go-1.17/src/internal/poll/fd_poll_runtime.go:89
#	0x4d8878	internal/poll.(*FD).Accept+0x218						/usr/lib/go-1.17/src/internal/poll/fd_unix.go:402
#	0x56e174	net.(*netFD).accept+0x34							/usr/lib/go-1.17/src/net/fd_unix.go:173
#	0x587e27	net.(*TCPListener).accept+0x27							/usr/lib/go-1.17/src/net/tcpsock_posix.go:140
#	0x586ffc	net.(*TCPListener).Accept+0x3c							/usr/lib/go-1.17/src/net/tcpsock.go:262
#	0x75b1f3	net/http.(*Server).Serve+0x393							/usr/lib/go-1.17/src/net/http/server.go:3001
#	0xbc78de	github.com/prometheus/exporter-toolkit/web.Serve+0x17e				/home/giedrius/go/pkg/mod/github.com/prometheus/exporter-toolkit@v0.6.1/web/tls_config.go:192
#	0xbc7678	github.com/prometheus/exporter-toolkit/web.ListenAndServe+0xf8			/home/giedrius/go/pkg/mod/github.com/prometheus/exporter-toolkit@v0.6.1/web/tls_config.go:184
#	0xcb1fe4	github.com/thanos-io/thanos/pkg/server/http.(*Server).ListenAndServe+0x1a4	/home/giedrius/dev/thanos/pkg/server/http/http.go:68
#	0x168a3b4	main.runCompact.func1+0x34							/home/giedrius/dev/thanos/cmd/thanos/compact.go:190
#	0x5fbc2e	github.com/oklog/run.(*Group).Run.func1+0x2e					/home/giedrius/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38

1 @ 0x4380f6 0x447ed2 0x1690fbb 0x8d4387 0x468981
#	0x1690fba	main.downsampleBucket.func4+0x25a			/home/giedrius/dev/thanos/cmd/thanos/downsample.go:303
#	0x8d4386	golang.org/x/sync/errgroup.(*Group).Go.func1+0x66	/home/giedrius/go/pkg/mod/golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57

1 @ 0x4380f6 0x447ed2 0x169506e 0x1694ac5 0x5fbc2f 0x468981
#	0x169506d	main.interrupt+0x10d				/home/giedrius/dev/thanos/cmd/thanos/main.go:166
#	0x1694ac4	main.main.func4+0x24				/home/giedrius/dev/thanos/cmd/thanos/main.go:139
#	0x5fbc2e	github.com/oklog/run.(*Group).Run.func1+0x2e	/home/giedrius/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38

1 @ 0x4380f6 0x447ed2 0x1695365 0x1694a49 0x5fbc2f 0x468981
#	0x1695364	main.reload+0x104				/home/giedrius/dev/thanos/cmd/thanos/main.go:179
#	0x1694a48	main.main.func6+0x28				/home/giedrius/dev/thanos/cmd/thanos/main.go:149
#	0x5fbc2e	github.com/oklog/run.(*Group).Run.func1+0x2e	/home/giedrius/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38

1 @ 0x4380f6 0x447ed2 0xcc1405 0x1688805 0x5fbc2f 0x468981
#	0xcc1404	github.com/thanos-io/thanos/pkg/runutil.Repeat+0xe4	/home/giedrius/dev/thanos/pkg/runutil/runutil.go:78
#	0x1688804	main.runCompact.func14+0x124				/home/giedrius/dev/thanos/cmd/thanos/compact.go:545
#	0x5fbc2e	github.com/oklog/run.(*Group).Run.func1+0x2e		/home/giedrius/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38

1 @ 0x4380f6 0x447ed2 0xcc1405 0x1688a70 0x5fbc2f 0x468981
#	0xcc1404	github.com/thanos-io/thanos/pkg/runutil.Repeat+0xe4	/home/giedrius/dev/thanos/pkg/runutil/runutil.go:78
#	0x1688a6f	main.runCompact.func12+0x4f				/home/giedrius/dev/thanos/cmd/thanos/compact.go:533
#	0x5fbc2e	github.com/oklog/run.(*Group).Run.func1+0x2e		/home/giedrius/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38

1 @ 0x4380f6 0x447ed2 0xebbef9 0x468981
#	0xebbef8	go.opencensus.io/stats/view.(*worker).start+0xb8	/home/giedrius/go/pkg/mod/go.opencensus.io@v0.23.0/stats/view/worker.go:276

1 @ 0x4380f6 0x448fcc 0x448fa6 0x464745 0x474151 0x8d4207 0x1690b18 0x16899a8 0x1688ef3 0xcc13b0 0x1688ded 0x5fbc2f 0x468981
#	0x464744	sync.runtime_Semacquire+0x24				/usr/lib/go-1.17/src/runtime/sema.go:56
#	0x474150	sync.(*WaitGroup).Wait+0x70				/usr/lib/go-1.17/src/sync/waitgroup.go:130
#	0x8d4206	golang.org/x/sync/errgroup.(*Group).Wait+0x26		/home/giedrius/go/pkg/mod/golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:40
#	0x1690b17	main.downsampleBucket+0xc37				/home/giedrius/dev/thanos/cmd/thanos/downsample.go:312
#	0x16899a7	main.runCompact.func7+0x707				/home/giedrius/dev/thanos/cmd/thanos/compact.go:441
#	0x1688ef2	main.runCompact.func8.1+0x52				/home/giedrius/dev/thanos/cmd/thanos/compact.go:470
#	0xcc13af	github.com/thanos-io/thanos/pkg/runutil.Repeat+0x8f	/home/giedrius/dev/thanos/pkg/runutil/runutil.go:75
#	0x1688dec	main.runCompact.func8+0x1cc				/home/giedrius/dev/thanos/cmd/thanos/compact.go:469
#	0x5fbc2e	github.com/oklog/run.(*Group).Run.func1+0x2e		/home/giedrius/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38

1 @ 0x462b65 0xb9bef5 0xb9bd0d 0xb98e8b 0xba7bfa 0xba87ae 0x75770f 0x759009 0x75ac7b 0x7567e8 0x468981
#	0x462b64	runtime/pprof.runtime_goroutineProfileWithLabels+0x24	/usr/lib/go-1.17/src/runtime/mprof.go:746
#	0xb9bef4	runtime/pprof.writeRuntimeProfile+0xb4			/usr/lib/go-1.17/src/runtime/pprof/pprof.go:724
#	0xb9bd0c	runtime/pprof.writeGoroutine+0x4c			/usr/lib/go-1.17/src/runtime/pprof/pprof.go:684
#	0xb98e8a	runtime/pprof.(*Profile).WriteTo+0x14a			/usr/lib/go-1.17/src/runtime/pprof/pprof.go:331
#	0xba7bf9	net/http/pprof.handler.ServeHTTP+0x499			/usr/lib/go-1.17/src/net/http/pprof/pprof.go:253
#	0xba87ad	net/http/pprof.Index+0x12d				/usr/lib/go-1.17/src/net/http/pprof/pprof.go:371
#	0x75770e	net/http.HandlerFunc.ServeHTTP+0x2e			/usr/lib/go-1.17/src/net/http/server.go:2046
#	0x759008	net/http.(*ServeMux).ServeHTTP+0x148			/usr/lib/go-1.17/src/net/http/server.go:2424
#	0x75ac7a	net/http.serverHandler.ServeHTTP+0x43a			/usr/lib/go-1.17/src/net/http/server.go:2878
#	0x7567e7	net/http.(*conn).serve+0xb07				/usr/lib/go-1.17/src/net/http/server.go:1929

1 @ 0x468981

In the metrics you can see the failure being counted:

thanos_compact_downsample_failures_total{group="0@18435695797974204449"} 1

Anything else we need to know:

Reproduced on 0.23.1.

GiedriusS (Member, Author) commented:

If an error happens here:

				if err := processDownsampling(ctx, logger, bkt, m, dir, resolution, hashFunc, metrics); err != nil {
					metrics.downsampleFailures.WithLabelValues(compact.DefaultGroupKey(m.Thanos)).Inc()
					return errors.Wrap(err, errMsg)
				}

Then:

			select {
			case <-ctx.Done():
				return ctx.Err()
---->			case ch <- m:
			}

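The marked send (ch <- m) will never execute, presumably because the worker that hit the error has already returned and no longer receives from ch. ctx will never be done either, because we pass context.Background().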
