query: Hanging calls to store API #5079

Closed

FUSAKLA opened this issue Jan 19, 2022 · 9 comments

@FUSAKLA
Member

FUSAKLA commented Jan 19, 2022

Hi, we bumped into an issue where calls from query to the store API hang after the connection gets interrupted
(the load balancer node handling the connection got shut down, probably without closing the TCP connection gracefully).

[screenshot]

The queries got stuck, and the query gate filled up with hanging queries that never went away until the instance was restarted:

[screenshot]

I suspect the request hangs indefinitely (or with a really long timeout), blocking the query.

It's running in the official Docker container in Kubernetes, and the system settings for keepalive are:

# sysctl -a | grep tcp_keepalive
sysctl: reading key "kernel.unprivileged_userns_apparmor_policy"
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

which means the first keepalive probe is sent only after 2 hours of idle time (and a dead peer is declared only after 7200 + 9 × 75 ≈ 7875 s, roughly 2 h 11 min).

However, looking at the default net.Dialer, the gRPC client should use a 15s TCP keepalive by default.
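
As a side note, gRPC also has its own HTTP/2-level keepalive pings that detect a dead peer independently of the kernel (or net.Dialer) TCP keepalive. A minimal sketch of enabling them on the client, assuming plain google.golang.org/grpc dial options rather than any existing Thanos flag:

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// keepaliveDialOptions is a hypothetical helper, not existing Thanos code.
// NB: the server must permit pings this frequent (keepalive.EnforcementPolicy),
// otherwise it closes the connection with a "too_many_pings" GOAWAY.
func keepaliveDialOptions() []grpc.DialOption {
	return []grpc.DialOption{
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second, // ping the server after this much idle time
			Timeout:             5 * time.Second,  // drop the connection if the ping is not acked in time
			PermitWithoutStream: true,             // ping even when no RPCs are active
		}),
	}
}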

Anyway, checking with tcpdump it looks like keepalive pings are happening

# tcpdump -i eth0 -nnn "dst yy.yy.y.yyy and tcp[tcpflags] == tcp-ack"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
18:12:09.229890 IP xx.xx.xxx.xxx.60986 > yy.yy.y.yyy.10901: Flags [.], ack 2254374078, win 170, options [nop,nop,TS val 3230079859 ecr 3089291048], length 0
18:12:09.229916 IP xx.xx.xxx.xxx.60986 > yy.yy.y.yyy.10901: Flags [.], ack 188, win 172, options [nop,nop,TS val 3230079859 ecr 3089291048], length 0
18:12:09.230650 IP xx.xx.xxx.xxx.60986 > yy.yy.y.yyy.10901: Flags [.], ack 205, win 172, options [nop,nop,TS val 3230079859 ecr 3089291049], length 0
18:12:14.230030 IP xx.xx.xxx.xxx.60986 > yy.yy.y.yyy.10901: Flags [.], ack 422, win 174, options [nop,nop,TS val 3230084859 ecr 3089296048], length 0
18:12:14.231268 IP xx.xx.xxx.xxx.60986 > yy.yy.y.yyy.10901: Flags [.], ack 439, win 174, options [nop,nop,TS val 3230084860 ecr 3089296050], length 0
18:12:19.229681 IP xx.xx.xxx.xxx.60986 > yy.yy.y.yyy.10901: Flags [.], ack 469, win 174, options [nop,nop,TS val 3230089858 ecr 3089301048], length 0
18:12:19.229782 IP xx.xx.xxx.xxx.60986 > yy.yy.y.yyy.10901: Flags [.], ack 656, win 176, options [nop,nop,TS val 3230089859 ecr 3089301048], length 0
18:12:19.230539 IP xx.xx.xxx.xxx.60986 > yy.yy.y.yyy.10901: Flags [.], ack 673, win 176, options [nop,nop,TS val 3230089859 ecr 3089301049], length 0

Combined with the context timeout set for RPC calls over the stream, I'd expect the established connection to fail and trigger a reconnect. This led me to the place where the stream is created, where no context timeout is specified. Might this be the issue?

sc, err := st.Series(seriesCtx, r)
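
For illustration, the kind of change I have in mind is to bound the lifetime of the whole stream with a per-call timeout. A rough sketch, where seriesCallTimeout is a hypothetical value I made up, not an existing Thanos option:

	// Hypothetical: bound the whole Series stream, not only individual receives,
	// so a silently dead peer cannot occupy the query gate forever.
	seriesCtx, cancel := context.WithTimeout(seriesCtx, seriesCallTimeout)
	defer cancel()

	sc, err := st.Series(seriesCtx, r)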

I'd be happy to send a PR, just checking whether I'm on the right track.

Thanos, Prometheus and Golang version used:
Thanos: 0.24.0
Prometheus: 2.32.1

@GiedriusS
Member

It would be interesting to try this:

func TestProxyStore_SeriesSlowStores(t *testing.T) {
	enable := os.Getenv("THANOS_ENABLE_STORE_READ_TIMEOUT_TESTS")
	if enable == "" {
		t.Skip("enable THANOS_ENABLE_STORE_READ_TIMEOUT_TESTS to run store-read-timeout tests")
	}
	...

And see whether it passes. Plus, maybe we could add a test case there to see if it helps? 🤔
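
To run it locally, something like this should work (assuming the test lives under pkg/store):

THANOS_ENABLE_STORE_READ_TIMEOUT_TESTS=1 go test -run TestProxyStore_SeriesSlowStores ./pkg/store/...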

@trevorriles

We are hitting the same issue. Any updates or workarounds you've found?

@FUSAKLA
Member Author

FUSAKLA commented Feb 3, 2022

Hi @trevorriles, unfortunately I haven't had enough time to dig much deeper into this.

The dumb solution is to restart Thanos Query. We also have an alert to detect this exact state; give it a longer evaluation window (possibly using avg_over_time) to avoid false positives on really overloaded query nodes:

thanos_query_concurrent_gate_queries_in_flight{app="thanos-query"}
  >=
thanos_query_concurrent_gate_queries_max{app="thanos-query"}
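
For example, with some smoothing as mentioned above (the 15m window is an arbitrary choice):

avg_over_time(thanos_query_concurrent_gate_queries_in_flight{app="thanos-query"}[15m])
  >=
thanos_query_concurrent_gate_queries_max{app="thanos-query"}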

@FUSAKLA
Member Author

FUSAKLA commented Feb 3, 2022

The tricky thing is that it depends heavily on the system setup of the kernel TCP stack

Just out of curiosity @trevorriles what are your keepalive TCP kernel settings?

In our case

$ sysctl -a | grep "net.ipv4.tcp_keepalive"
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

@FUSAKLA
Member Author

FUSAKLA commented Feb 3, 2022

@GiedriusS That test passes for me.

I tried, but had no luck reproducing the issue. I only tested it locally though, which may be system dependent...

I'll try to simulate it first so I can narrow down what the exact problem is:

func Test_ConnectionTimeout(t *testing.T) {

	addr := "localhost:33908"
	lis, err := net.Listen("tcp", addr)
	if err != nil {
		t.Fatalf("Error while listening. Err: %v", err)
	}
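	// Fake peer: accept a single TCP connection, read one HTTP/2 frame,
	// then go silent for 30 seconds before closing, simulating a server
	// that stops responding.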
	go func() {
		conn, err := lis.Accept()
		if err != nil {
			t.Errorf("Error while accepting. Err: %v", err)
			return
		}
		framer := http2.NewFramer(conn, conn)
		frame, err := framer.ReadFrame()
		if err != nil {
			t.Error(err) // t.Fatal must not be called from a goroutine other than the test's
			return
		}
		fmt.Println(frame.Header())
		//if err := framer.WriteSettings(http2.Setting{}); err != nil {
		//	t.Errorf("Error while writing settings. Err: %v", err)
		//	return
		//}
		//conn, err = lis.Accept()
		time.Sleep(time.Second * 30)
		conn.Close()
		lis.Close()
	}()

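	// Build a local Store backed by JSON test data, plus a gRPC server
	// exposing the Store and Info APIs on the same addr (which the fake
	// listener above already holds).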
	logger := level.NewFilter(log.NewLogfmtLogger(os.Stderr), level.AllowDebug())
	bs, err := store.NewLocalStoreFromJSONMmappableFile(logger, component.Debug, []labels.Label{{Name: "foo", Value: ""}}, "./testdata/issue2401-seriesresponses.json", store.ScanGRPCCurlProtoStreamMessages)
	if err != nil {
		t.Fatal(err)
	}
	fmt.Println("bucket initiated")
	infoSrv := info.NewInfoServer(
		component.Store.String(),
		info.WithLabelSetFunc(func() []labelpb.ZLabelSet {
			return []labelpb.ZLabelSet{{Labels: []labelpb.ZLabel{{Name: "foo", Value: ""}}}}
		}),
		info.WithStoreInfoFunc(func() *infopb.StoreInfo {
			return &infopb.StoreInfo{}
		}),
	)
	fmt.Println("infoSrv initiated")
	s := grpcserver.New(logger, prometheus.NewRegistry(), nil, []grpc_logging.Option{}, []tags.Option{}, component.Store, prober.NewGRPC(),
		grpcserver.WithServer(store.RegisterStoreServer(bs)),
		grpcserver.WithServer(info.RegisterInfoServer(infoSrv)),
		grpcserver.WithListen(addr),
	)
	fmt.Println("grpcserver initiated")
	go func() {
		fmt.Println("starting server on: ", addr)
		fmt.Println(s.ListenAndServe())
	}()

	time.Sleep(time.Second * 2)

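	// Point an EndpointSet at addr and run a Series call through a
	// ProxyStore with a 5s timeout.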
	e := NewEndpointSet(
		logger,
		nil,
		func() (specs []*GRPCEndpointSpec) {
			return []*GRPCEndpointSpec{NewGRPCEndpointSpec(addr, true)}
		},
		[]grpc.DialOption{grpc.WithInsecure()},
		time.Second*10,
	)
	e.Update(context.Background())
	fmt.Println(e.endpoints)
	p := store.NewProxyStore(logger, nil, e.GetStoreClients, component.Store, []labels.Label{{Name: "foo", Value: ""}}, time.Second*5)
	resp := &seriesServer{ctx: context.Background()}
	err = p.Series(
		&storepb.SeriesRequest{
			Matchers: []storepb.LabelMatcher{{Name: "foo", Value: "", Type: storepb.LabelMatcher_EQ}},
		},
		resp,
	)
	if err != nil {
		t.Fatal(err)
	}
	if len(resp.warnings) > 0 {
		t.Fatal(resp.warnings)
	}
	for i := range resp.seriesSet {
		fmt.Println(i)
	}
}

@trevorriles

> The tricky thing is that it depends heavily on the system setup of the kernel TCP stack
>
> Just out of curiosity @trevorriles what are your keepalive TCP kernel settings?
>
> In our case
>
> $ sysctl -a | grep "net.ipv4.tcp_keepalive"
> net.ipv4.tcp_keepalive_intvl = 75
> net.ipv4.tcp_keepalive_probes = 9
> net.ipv4.tcp_keepalive_time = 7200

We've got the same settings.

@trevorriles

trevorriles commented Feb 3, 2022

Hi @FUSAKLA,

I made a bit of a mistake and didn't realize we were still on an older version. We're seeing improvement since upgrading from v0.23.1 to v0.24.0, and I believe this PR fixed our issue: #4795. I'll report back if we see any other blocking issues.

@FUSAKLA
Member Author

FUSAKLA commented Feb 3, 2022

Right, that's a bit of a different issue. We bumped into that one before as well.

Actually, we have been seeing this issue since the upgrade to 0.24, but I'm certain it's connected to connection disruptions caused by our load balancer, and I'm not sure if those were happening before, so I cannot tell whether the upgrade could have been the cause.

@FUSAKLA
Member Author

FUSAKLA commented Feb 8, 2022

OK, I think I'm just chasing ghosts caused by misconfigured timeouts in our setup.
I'll close this for now until proven otherwise.
