query: Hanging calls to store API #5079
Comments
It would be interesting to try this:

```go
func TestProxyStore_SeriesSlowStores(t *testing.T) {
	enable := os.Getenv("THANOS_ENABLE_STORE_READ_TIMEOUT_TESTS")
	if enable == "" {
		t.Skip("enable THANOS_ENABLE_STORE_READ_TIMEOUT_TESTS to run store-read-timeout tests")
	}
	...
```

... and see whether it passes. Plus, maybe we could add a test case there to see if it helps? 🤔
We are hitting the same issue. Any updates or workarounds you've found?
Hi @trevorriles, unfortunately I haven't had enough time to dig much deeper into this. The dumb solution is to restart Thanos Query; we have an alert to detect this exact state.
The tricky thing is that it depends heavily on how the kernel TCP stack is set up on the system. Just out of curiosity, @trevorriles, what are your TCP keepalive kernel settings? In our case ...
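As an aside (my addition, not from the thread): liveness can also be enforced at the gRPC layer instead of relying on kernel TCP keepalives, via HTTP/2 keepalive pings on the client connection. A minimal sketch, with illustrative values:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// keepaliveDialOpts makes the gRPC client send HTTP/2 PING frames and drop
// the connection if they go unacknowledged, independently of kernel TCP
// keepalives. The intervals below are illustrative, not recommendations.
func keepaliveDialOpts() []grpc.DialOption {
	return []grpc.DialOption{
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second, // ping after this much idle time
			Timeout:             5 * time.Second,  // close the connection if the ping isn't acked in time
			PermitWithoutStream: true,             // ping even when no RPCs are in flight
		}),
	}
}
```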
@GiedriusS That test passes for me. I tried, but had no luck reproducing the issue. I'll try to simulate it first, so I can narrow down what the exact problem is:

```go
// Meant to live in the Thanos repo's pkg/query package, next to the
// seriesServer test helper; import paths are approximate for the v0.24 tree.
import (
	"context"
	"fmt"
	"net"
	"os"
	"testing"
	"time"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
	grpc_logging "github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors/logging"
	"github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors/tags"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/prometheus/model/labels"
	"golang.org/x/net/http2"
	"google.golang.org/grpc"

	"github.com/thanos-io/thanos/pkg/component"
	"github.com/thanos-io/thanos/pkg/info"
	"github.com/thanos-io/thanos/pkg/info/infopb"
	"github.com/thanos-io/thanos/pkg/prober"
	grpcserver "github.com/thanos-io/thanos/pkg/server/grpc"
	"github.com/thanos-io/thanos/pkg/store"
	"github.com/thanos-io/thanos/pkg/store/labelpb"
	"github.com/thanos-io/thanos/pkg/store/storepb"
)

func Test_ConnectionTimeout(t *testing.T) {
	addr := "localhost:33908"
	lis, err := net.Listen("tcp", addr)
	if err != nil {
		t.Fatalf("Error while listening. Err: %v", err)
	}
	// Fake "load balancer": accept the TCP connection, read a single HTTP/2
	// frame, then go silent for 30s before dropping the socket.
	go func() {
		conn, err := lis.Accept()
		if err != nil {
			t.Errorf("Error while accepting. Err: %v", err)
			return
		}
		framer := http2.NewFramer(conn, conn)
		frame, err := framer.ReadFrame()
		if err != nil {
			t.Error(err) // t.Fatal must not be called from a non-test goroutine
			return
		}
		fmt.Println(frame.Header())
		//if err := framer.WriteSettings(http2.Setting{}); err != nil {
		//	t.Errorf("Error while writing settings. Err: %v", err)
		//	return
		//}
		//conn, err = lis.Accept()
		time.Sleep(time.Second * 30)
		conn.Close()
		lis.Close()
	}()

	logger := level.NewFilter(log.NewLogfmtLogger(os.Stderr), level.AllowDebug())
	bs, err := store.NewLocalStoreFromJSONMmappableFile(logger, component.Debug, []labels.Label{{Name: "foo", Value: ""}}, "./testdata/issue2401-seriesresponses.json", store.ScanGRPCCurlProtoStreamMessages)
	if err != nil {
		t.Fatal(err)
	}
	fmt.Println("bucket initiated")

	infoSrv := info.NewInfoServer(
		component.Store.String(),
		info.WithLabelSetFunc(func() []labelpb.ZLabelSet {
			return []labelpb.ZLabelSet{{Labels: []labelpb.ZLabel{{Name: "foo", Value: ""}}}}
		}),
		info.WithStoreInfoFunc(func() *infopb.StoreInfo {
			return &infopb.StoreInfo{}
		}),
	)
	fmt.Println("infoSrv initiated")

	// Note: lis above is already bound to addr, so ListenAndServe below will
	// likely fail to listen; the EndpointSet then ends up talking to the
	// silent fake server instead of a real gRPC store.
	s := grpcserver.New(logger, prometheus.NewRegistry(), nil, []grpc_logging.Option{}, []tags.Option{}, component.Store, prober.NewGRPC(),
		grpcserver.WithServer(store.RegisterStoreServer(bs)),
		grpcserver.WithServer(info.RegisterInfoServer(infoSrv)),
		grpcserver.WithListen(addr),
	)
	fmt.Println("grpcserver initiated")
	go func() {
		fmt.Println("starting server on: ", addr)
		fmt.Println(s.ListenAndServe())
	}()
	time.Sleep(time.Second * 2)

	e := NewEndpointSet(
		logger,
		nil,
		func() (specs []*GRPCEndpointSpec) {
			return []*GRPCEndpointSpec{NewGRPCEndpointSpec(addr, true)}
		},
		[]grpc.DialOption{grpc.WithInsecure()},
		time.Second*10,
	)
	e.Update(context.Background())
	fmt.Println(e.endpoints)

	p := store.NewProxyStore(logger, nil, e.GetStoreClients, component.Store, []labels.Label{{Name: "foo", Value: ""}}, time.Second*5)
	resp := &seriesServer{ctx: context.Background()}
	err = p.Series(
		&storepb.SeriesRequest{
			Matchers: []storepb.LabelMatcher{{Name: "foo", Value: "", Type: storepb.LabelMatcher_EQ}},
		},
		resp,
	)
	if err != nil {
		t.Fatal(err)
	}
	if len(resp.warnings) > 0 {
		t.Fatal(resp.warnings)
	}
	for _, series := range resp.seriesSet {
		fmt.Println(series)
	}
}
```
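For context, the `time.Second*5` passed to `NewProxyStore` above is the per-response timeout. A minimal sketch of the general pattern such a timeout enforces (my own illustration, not the actual Thanos implementation):

```go
// recvWithTimeout bounds each blocking Recv() on a series stream so a dead
// peer cannot stall the reader forever. A real implementation must also
// cancel the stream's context on timeout, or the Recv goroutine leaks.
func recvWithTimeout(cl storepb.Store_SeriesClient, timeout time.Duration) (*storepb.SeriesResponse, error) {
	type result struct {
		resp *storepb.SeriesResponse
		err  error
	}
	ch := make(chan result, 1)
	go func() {
		r, err := cl.Recv()
		ch <- result{r, err}
	}()
	select {
	case res := <-ch:
		return res.resp, res.err
	case <-time.After(timeout):
		return nil, fmt.Errorf("failed to receive any data in %s", timeout)
	}
}
```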
We've got the same settings.
Right, that's a bit of a different issue; we bumped into it before as well. We've actually been seeing this issue since the upgrade to 0.24, but I'm certain it's connected to connection disruptions caused by our load balancer, and I'm not sure whether those were happening before, so I can't tell if the upgrade was the cause.
OK, I think I'm just chasing ghosts with misconfigured timeouts in our setup.
Hi, we bumped into an issue where calls from Query to the Store API hang after the connection gets interrupted
(the load balancer node handling the connection was shut down, probably without terminating the TCP connections gracefully).
The queries got stuck, and the query gate filled up with hanging queries that never went away until the instance was restarted.
I suspect the request hangs indefinitely (or with a really long timeout) and blocks the query.
It's running in the official Docker container on Kubernetes, and the system keepalive settings are the kernel defaults, which means 2h before the first keepalive packet.
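For reference (my addition, not from the original report), the 2h figure is the Linux default for the delay before the first keepalive probe; a quick way to check it from Go:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Prints the kernel's delay before the first TCP keepalive probe is sent
// (the Linux default is 7200 seconds, i.e. 2h).
func main() {
	b, err := os.ReadFile("/proc/sys/net/ipv4/tcp_keepalive_time")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("tcp_keepalive_time = %ss\n", strings.TrimSpace(string(b)))
}
```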
But looking at the default net.Dialer, keepalive should default to 15s for the gRPC client. Anyway, checking with tcpdump, it looks like keepalive pings are happening.
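For illustration (the function name is mine), this is what that Go-level default looks like when spelled out explicitly:

```go
package main

import (
	"context"
	"net"
	"time"
)

// dialStore shows the net.Dialer behavior referenced above: with KeepAlive
// left at zero, TCP keepalive probes are enabled at Go's documented 15s
// default interval; a negative value disables them entirely.
func dialStore(ctx context.Context, addr string) (net.Conn, error) {
	d := &net.Dialer{
		Timeout:   5 * time.Second,
		KeepAlive: 15 * time.Second, // explicit; zero would mean "use the 15s default"
	}
	return d.DialContext(ctx, "tcp", addr)
}
```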
Combined with the context timeout set for RPC calls over the stream, I'd expect the established connection to fail and trigger a reconnect. This leads me to the part where the stream is created, where no context timeout is specified. Might this be the issue?
thanos/pkg/store/proxy.go, line 314 in 78d250b
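To illustrate the idea (assumed names, not the actual Thanos code), opening the stream under a deadline would turn an endless hang into a visible error:

```go
// openSeriesStream opens the Series stream under a deadline so that a
// half-open TCP connection surfaces as context.DeadlineExceeded instead of
// blocking forever. Caveat: a deadline on the stream context bounds the
// whole stream lifetime, not just its creation, so a real fix would likely
// need to be finer-grained than this.
func openSeriesStream(
	ctx context.Context,
	client storepb.StoreClient,
	req *storepb.SeriesRequest,
	openTimeout time.Duration, // assumed knob, not an existing Thanos option
) (storepb.Store_SeriesClient, context.CancelFunc, error) {
	streamCtx, cancel := context.WithTimeout(ctx, openTimeout)
	stream, err := client.Series(streamCtx, req)
	if err != nil {
		cancel()
		return nil, nil, err
	}
	return stream, cancel, nil
}
```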
I'd be happy to send a PR; just checking whether I'm on the right track.
Thanos, Prometheus and Golang version used:
Thanos: 0.24.0
Prometheus: 2.32.1