Scylla manager backups intermittently fails #3659

shantanugithub · 2023-12-07T18:57:46Z

Backups intermittently show a non-zero Failed column on a small set of nodes, seemingly at random. Please let me know what the possible root cause could be. Backups are a critical part of any prod setup, and these intermittent failures have become an issue for us.

Using AWS Scylla AMIs
Scylla version : 5.2.7
Scylla manager version : 3.2.3

Agent logs:

Dec 07 23:23:27 ip-172-31-129-136 scylla-manager-agent[19262]: {"L":"ERROR","T":"2023-12-07T23:23:27.268+0530","N":"http","M":"GET /storage_service/scylla_release_version","from":"172.31.129.6:46824","status":502,"bytes":0,"duration":"737ms","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/go-log@v0.0.7/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/go-log@v0.0.7/logger.go:84\nmain.(*logEntry).Write\n\tgithub.com/scylladb/scylla-manager/v3/pkg/cmd/agent/log.go:53\ngithub.com/go-chi/chi/v5/middleware.RequestLogger.func1.1.1\n\tgithub.com/go-chi/chi/v5@v5.0.0/middleware/logger.go:54\ngithub.com/go-chi/chi/v5/middleware.RequestLogger.func1.1\n\tgithub.com/go-chi/chi/v5@v5.0.0/middleware/logger.go:58\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2122\ngithub.com/go-chi/chi/v5.(*Mux).ServeHTTP\n\tgithub.com/go-chi/chi/v5@v5.0.0/mux.go:87\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2936\nnet/http.(*conn).serve\n\tnet/http/server.go:1995"}

Dec 07 23:23:26 ip-172-31-129-136 scylla-manager-agent[19262]: {"L":"INFO","T":"2023-12-07T23:23:26.752+0530","M":"http: TLS handshake error from 172.31.129.6:47344: EOF"}
Dec 07 23:23:27 ip-172-31-129-136 scylla-manager-agent[19262]: {"L":"INFO","T":"2023-12-07T23:23:27.267+0530","M":"http: proxy error: context canceled"}

Manager logs:

Dec 08 00:23:29 ip-172-31-129-6 scylla-manager[1273690]: {"L":"INFO","T":"2023-12-08T00:23:29.660+0530","N":"cluster.client","M":"HTTP retry backoff","operation":"NodeInfo","wait":"1s","error":"after 30s: context deadline exceeded","_trace_id":"s80LqIBaRr-UGizsm1BH0Q"}

Health checks for nodes also frequently time out on the manager:

Dec 08 00:25:19 ip-172-31-129-6 scylla-manager[1273690]: {"L":"INFO","T":"2023-12-08T00:25:19.803+0530","N":"cluster.client","M":"HTTP retry backoff","operation":"JobProgress","wait":"4.284564696s","error":"net/http: TLS handshake timeout","_trace_id":"tzFrgc8ZTeu3XhASku_E2A"}
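
To see which operations are retried most often, the JSON payloads in these journal lines can be tallied. Below is a minimal sketch, assuming each line has a journald prefix followed by a single JSON object with the `L`, `M` and `operation` keys seen in the logs above. The function names `parse_journal_line` and `summarize` are hypothetical helpers for illustration, not part of Scylla Manager.

```python
import json
from collections import Counter


def parse_journal_line(line):
    """Return the JSON payload embedded in a journald line, or None."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        # Not a structured log line (or truncated); skip it.
        return None


def summarize(lines):
    """Count entries by (level, operation), falling back to the message."""
    counts = Counter()
    for line in lines:
        entry = parse_journal_line(line)
        if entry is None:
            continue
        key = (entry.get("L"), entry.get("operation") or entry.get("M"))
        counts[key] += 1
    return counts


# Two manager log lines copied from this report.
sample = [
    'Dec 08 00:23:29 ip-172-31-129-6 scylla-manager[1273690]: {"L":"INFO","T":"2023-12-08T00:23:29.660+0530","N":"cluster.client","M":"HTTP retry backoff","operation":"NodeInfo","wait":"1s","error":"after 30s: context deadline exceeded","_trace_id":"s80LqIBaRr-UGizsm1BH0Q"}',
    'Dec 08 00:25:19 ip-172-31-129-6 scylla-manager[1273690]: {"L":"INFO","T":"2023-12-08T00:25:19.803+0530","N":"cluster.client","M":"HTTP retry backoff","operation":"JobProgress","wait":"4.284564696s","error":"net/http: TLS handshake timeout","_trace_id":"tzFrgc8ZTeu3XhASku_E2A"}',
]

for (level, name), count in summarize(sample).items():
    print(level, name, count)
```

In practice the input could be the saved output of `journalctl` for the manager or agent unit, read line by line and passed to `summarize`.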


Michal-Leszczynski · 2023-12-13T11:36:07Z

This may be connected to #3298 (the agent runs out of memory when performing a big backup), which was fixed in the SM 3.2.5 release.
Could you upgrade to SM 3.2.5 and verify that this problem is solved?

shantanugithub · 2023-12-20T15:22:33Z

Yes, after the upgrade the errors have reduced a lot, although they are not completely gone. I will keep monitoring this.

Michal-Leszczynski · 2024-03-01T11:07:31Z

@shantanugithub can we close this issue or are the backups still failing for you?

shantanugithub · 2024-03-01T13:37:11Z

Yes, you can close it now. I will let you know if it needs to be reopened.

Michal-Leszczynski self-assigned this Dec 13, 2023

Michal-Leszczynski closed this as completed Mar 1, 2024

mykaul closed this as not planned Aug 8, 2024
