query: Hanging calls to store API #5079

Closed

FUSAKLA opened this issue Jan 19, 2022 · 9 comments

@FUSAKLA
Member

FUSAKLA commented Jan 19, 2022

Hi, we bumped into an issue where calls from query to the store API hang after the connection gets interrupted
(the load balancer node handling the connection got shut down, probably without closing the TCP connection gracefully).

[screenshot]

The queries got stuck, and the query gate filled up with hanging queries that never went away until the instance was restarted:

[screenshot]

I suspect the request hangs indefinitely (or with a really long timeout), blocking the query.

It's running in the official Docker container in Kubernetes, and the system settings for keepalive are:

# sysctl -a | grep tcp_keepalive
sysctl: reading key "kernel.unprivileged_userns_apparmor_policy"
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

which means the first keepalive probe is sent only after 2 hours of idle time (and a dead peer is declared only after 7200 + 9 × 75 ≈ 7875 s, roughly 2 h 11 min).

However, looking at the default net.Dialer, the gRPC client should use a 15s TCP keepalive by default.
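
As a side note, gRPC also has its own HTTP/2-level keepalive pings that detect a dead peer independently of the kernel (or net.Dialer) TCP keepalive. A minimal sketch of enabling them on the client, assuming plain google.golang.org/grpc dial options rather than any existing Thanos flag:

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// keepaliveDialOptions is a hypothetical helper, not existing Thanos code.
// NB: the server must permit pings this frequent (keepalive.EnforcementPolicy),
// otherwise it closes the connection with a "too_many_pings" GOAWAY.
func keepaliveDialOptions() []grpc.DialOption {
	return []grpc.DialOption{
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second, // ping the server after this much idle time
			Timeout:             5 * time.Second,  // drop the connection if the ping is not acked in time
			PermitWithoutStream: true,             // ping even when no RPCs are active
		}),
	}
}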

Anyway, checking with tcpdump it looks like keepalive pings are happening

# tcpdump -i eth0 -nnn "dst yy.yy.y.yyy and tcp[tcpflags] == tcp-ack"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
18:12:09.229890 IP xx.xx.xxx.xxx.60986 > yy.yy.y.yyy.10901: Flags [.], ack 2254374078, win 170, options [nop,nop,TS val 3230079859 ecr 3089291048], length 0
18:12:09.229916 IP xx.xx.xxx.xxx.60986 > yy.yy.y.yyy.10901: Flags [.], ack 188, win 172, options [nop,nop,TS val 3230079859 ecr 3089291048], length 0
18:12:09.230650 IP xx.xx.xxx.xxx.60986 > yy.yy.y.yyy.10901: Flags [.], ack 205, win 172, options [nop,nop,TS val 3230079859 ecr 3089291049], length 0
18:12:14.230030 IP xx.xx.xxx.xxx.60986 > yy.yy.y.yyy.10901: Flags [.], ack 422, win 174, options [nop,nop,TS val 3230084859 ecr 3089296048], length 0
18:12:14.231268 IP xx.xx.xxx.xxx.60986 > yy.yy.y.yyy.10901: Flags [.], ack 439, win 174, options [nop,nop,TS val 3230084860 ecr 3089296050], length 0
18:12:19.229681 IP xx.xx.xxx.xxx.60986 > yy.yy.y.yyy.10901: Flags [.], ack 469, win 174, options [nop,nop,TS val 3230089858 ecr 3089301048], length 0
18:12:19.229782 IP xx.xx.xxx.xxx.60986 > yy.yy.y.yyy.10901: Flags [.], ack 656, win 176, options [nop,nop,TS val 3230089859 ecr 3089301048], length 0
18:12:19.230539 IP xx.xx.xxx.xxx.60986 > yy.yy.y.yyy.10901: Flags [.], ack 673, win 176, options [nop,nop,TS val 3230089859 ecr 3089301049], length 0

Combined with the context timeout set for RPC calls over the stream, I'd expect the established connection to fail and trigger a reconnect. This led me to the place where the stream is created, where no context timeout is specified. Might this be the issue?

sc, err := st.Series(seriesCtx, r)
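
For illustration, the kind of change I have in mind is to bound the lifetime of the whole stream with a per-call timeout. A rough sketch, where seriesCallTimeout is a hypothetical value I made up, not an existing Thanos option:

	// Hypothetical: bound the whole Series stream, not only individual receives,
	// so a silently dead peer cannot occupy the query gate forever.
	seriesCtx, cancel := context.WithTimeout(seriesCtx, seriesCallTimeout)
	defer cancel()

	sc, err := st.Series(seriesCtx, r)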

I'd be happy to send a PR, just checking whether I'm on the right track.

Thanos, Prometheus and Golang version used:
Thanos: 0.24.0
Prometheus: 2.32.1

@GiedriusS
Member

It would be interesting to try this:

func TestProxyStore_SeriesSlowStores(t *testing.T) {
	enable := os.Getenv("THANOS_ENABLE_STORE_READ_TIMEOUT_TESTS")
	if enable == "" {
		t.Skip("enable THANOS_ENABLE_STORE_READ_TIMEOUT_TESTS to run store-read-timeout tests")
	}
	...

And see whether it passes. Plus, maybe we could add a test case there to see if it helps? 🤔
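
To run it locally, something like this should work (assuming the test lives under pkg/store):

THANOS_ENABLE_STORE_READ_TIMEOUT_TESTS=1 go test -run TestProxyStore_SeriesSlowStores ./pkg/store/...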

@trevorriles

We are hitting the same issue. Any updates or workarounds you've found?

@FUSAKLA
Member Author

FUSAKLA commented Feb 3, 2022

Hi @trevorriles, unfortunately I haven't had enough time to dig much deeper into this.

The dumb solution is to restart Thanos Query. We also have an alert to detect this exact state; give it a longer evaluation window (possibly using avg_over_time) to avoid false positives on really overloaded query nodes:

thanos_query_concurrent_gate_queries_in_flight{app="thanos-query"}
  >=
thanos_query_concurrent_gate_queries_max{app="thanos-query"}
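
For example, with some smoothing as mentioned above (the 15m window is an arbitrary choice):

avg_over_time(thanos_query_concurrent_gate_queries_in_flight{app="thanos-query"}[15m])
  >=
thanos_query_concurrent_gate_queries_max{app="thanos-query"}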

@FUSAKLA
Member Author

FUSAKLA commented Feb 3, 2022

The tricky thing is that it depends heavily on the system setup of the kernel TCP stack

Just out of curiosity @trevorriles what are your keepalive TCP kernel settings?

In our case

$ sysctl -a | grep "net.ipv4.tcp_keepalive"
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

@FUSAKLA
Member Author

FUSAKLA commented Feb 3, 2022

@GiedriusS That test passes for me.

I tried, but had no luck reproducing the issue. I only tested it locally though, which may be system dependent...

I'll try to simulate it first so I can narrow down what the exact problem is:

func Test_ConnectionTimeout(t *testing.T) {

	addr := "localhost:33908"
	lis, err := net.Listen("tcp", addr)
	if err != nil {
		t.Fatalf("Error while listening. Err: %v", err)
	}
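	// Fake peer: accept a single TCP connection, read one HTTP/2 frame,
	// then go silent for 30 seconds before closing, simulating a server
	// that stops responding.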
	go func() {
		conn, err := lis.Accept()
		if err != nil {
			t.Errorf("Error while accepting. Err: %v", err)
			return
		}
		framer := http2.NewFramer(conn, conn)
		frame, err := framer.ReadFrame()
		if err != nil {
			t.Error(err) // t.Fatal must not be called from a goroutine other than the test's
			return
		}
		fmt.Println(frame.Header())
		//if err := framer.WriteSettings(http2.Setting{}); err != nil {
		//	t.Errorf("Error while writing settings. Err: %v", err)
		//	return
		//}
		//conn, err = lis.Accept()
		time.Sleep(time.Second * 30)
		conn.Close()
		lis.Close()
	}()

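	// Build a local Store backed by JSON test data, plus a gRPC server
	// exposing the Store and Info APIs on the same addr (which the fake
	// listener above already holds).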
	logger := level.NewFilter(log.NewLogfmtLogger(os.Stderr), level.AllowDebug())
	bs, err := store.NewLocalStoreFromJSONMmappableFile(logger, component.Debug, []labels.Label{{Name: "foo", Value: ""}}, "./testdata/issue2401-seriesresponses.json", store.ScanGRPCCurlProtoStreamMessages)
	if err != nil {
		t.Fatal(err)
	}
	fmt.Println("bucket initiated")
	infoSrv := info.NewInfoServer(
		component.Store.String(),
		info.WithLabelSetFunc(func() []labelpb.ZLabelSet {
			return []labelpb.ZLabelSet{{Labels: []labelpb.ZLabel{{Name: "foo", Value: ""}}}}
		}),
		info.WithStoreInfoFunc(func() *infopb.StoreInfo {
			return &infopb.StoreInfo{}
		}),
	)
	fmt.Println("infoSrv initiated")
	s := grpcserver.New(logger, prometheus.NewRegistry(), nil, []grpc_logging.Option{}, []tags.Option{}, component.Store, prober.NewGRPC(),
		grpcserver.WithServer(store.RegisterStoreServer(bs)),
		grpcserver.WithServer(info.RegisterInfoServer(infoSrv)),
		grpcserver.WithListen(addr),
	)
	fmt.Println("grpcserver initiated")
	go func() {
		fmt.Println("starting server on: ", addr)
		fmt.Println(s.ListenAndServe())
	}()

	time.Sleep(time.Second * 2)

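	// Point an EndpointSet at addr and run a Series call through a
	// ProxyStore with a 5s timeout.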
	e := NewEndpointSet(
		logger,
		nil,
		func() (specs []*GRPCEndpointSpec) {
			return []*GRPCEndpointSpec{NewGRPCEndpointSpec(addr, true)}
		},
		[]grpc.DialOption{grpc.WithInsecure()},
		time.Second*10,
	)
	e.Update(context.Background())
	fmt.Println(e.endpoints)
	p := store.NewProxyStore(logger, nil, e.GetStoreClients, component.Store, []labels.Label{{Name: "foo", Value: ""}}, time.Second*5)
	resp := &seriesServer{ctx: context.Background()}
	err = p.Series(
		&storepb.SeriesRequest{
			Matchers: []storepb.LabelMatcher{{Name: "foo", Value: "", Type: storepb.LabelMatcher_EQ}},
		},
		resp,
	)
	if err != nil {
		t.Fatal(err)
	}
	if len(resp.warnings) > 0 {
		t.Fatal(resp.warnings)
	}
	for i := range resp.seriesSet {
		fmt.Println(i)
	}
}

@trevorriles

> The tricky thing is that it depends heavily on the system setup of the kernel TCP stack
>
> Just out of curiosity @trevorriles what are your keepalive TCP kernel settings?
>
> In our case
>
> $ sysctl -a | grep "net.ipv4.tcp_keepalive"
> net.ipv4.tcp_keepalive_intvl = 75
> net.ipv4.tcp_keepalive_probes = 9
> net.ipv4.tcp_keepalive_time = 7200

We've got the same settings.

@trevorriles

trevorriles commented Feb 3, 2022

Hi @FUSAKLA,

I made a bit of a mistake and didn't realize we were still on an older version. We're seeing improvement since upgrading from v0.23.1 to v0.24.0, and I believe this PR fixed our issue: #4795. I'll report back if we see any other blocking issues.

@FUSAKLA
Member Author

FUSAKLA commented Feb 3, 2022

Right, that's a bit of a different issue. We bumped into that one before as well.

Actually, we have been seeing this issue since the upgrade to 0.24, but I'm certain it's connected to connection disruptions caused by our load balancer, and I'm not sure if those were happening before, so I cannot tell whether the upgrade could have been the cause.

@FUSAKLA
Member Author

FUSAKLA commented Feb 8, 2022

OK, I think I'm just chasing ghosts caused by misconfigured timeouts in our setup.
I'll close this for now until proven otherwise.
