Scylla manager backups intermittently fails #3659

Closed
shantanugithub opened this issue Dec 7, 2023 · 4 comments

@shantanugithub

Backups intermittently show a non-zero value in the Failed column on a seemingly random subset of nodes. Please let me know what the possible root cause could be. Backups are a critical part of any production setup, and these intermittent failures have become an issue for us.

Using AWS Scylla AMIs
Scylla version : 5.2.7
Scylla manager version : 3.2.3

Below are agent logs --

Dec 07 23:23:27 ip-172-31-129-136 scylla-manager-agent[19262]: {"L":"ERROR","T":"2023-12-07T23:23:27.268+0530","N":"http","M":"GET /storage_service/scylla_release_version","from":"172.31.129.6:46824","status":502,"bytes":0,"duration":"737ms","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/go-log@v0.0.7/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/go-log@v0.0.7/logger.go:84\nmain.(*logEntry).Write\n\tgithub.com/scylladb/scylla-manager/v3/pkg/cmd/agent/log.go:53\ngithub.com/go-chi/chi/v5/middleware.RequestLogger.func1.1.1\n\tgithub.com/go-chi/chi/v5@v5.0.0/middleware/logger.go:54\ngithub.com/go-chi/chi/v5/middleware.RequestLogger.func1.1\n\tgithub.com/go-chi/chi/v5@v5.0.0/middleware/logger.go:58\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2122\ngithub.com/go-chi/chi/v5.(*Mux).ServeHTTP\n\tgithub.com/go-chi/chi/v5@v5.0.0/mux.go:87\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2936\nnet/http.(*conn).serve\n\tnet/http/server.go:1995"}

Dec 07 23:23:26 ip-172-31-129-136 scylla-manager-agent[19262]: {"L":"INFO","T":"2023-12-07T23:23:26.752+0530","M":"http: TLS handshake error from 172.31.129.6:47344: EOF"}
Dec 07 23:23:27 ip-172-31-129-136 scylla-manager-agent[19262]: {"L":"INFO","T":"2023-12-07T23:23:27.267+0530","M":"http: proxy error: context canceled"}

Manager logs :

Dec 08 00:23:29 ip-172-31-129-6 scylla-manager[1273690]: {"L":"INFO","T":"2023-12-08T00:23:29.660+0530","N":"cluster.client","M":"HTTP retry backoff","operation":"NodeInfo","wait":"1s","error":"after 30s: context deadline exceeded","_trace_id":"s80LqIBaRr-UGizsm1BH0Q"}

Health checks for nodes also frequently time out on the manager --

Dec 08 00:25:19 ip-172-31-129-6 scylla-manager[1273690]: {"L":"INFO","T":"2023-12-08T00:25:19.803+0530","N":"cluster.client","M":"HTTP retry backoff","operation":"JobProgress","wait":"4.284564696s","error":"net/http: TLS handshake timeout","_trace_id":"tzFrgc8ZTeu3XhASku_E2A"}
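
To see which errors dominate, log lines like the ones above can be tallied. The following is only a minimal sketch: it assumes every line has the shape shown here (a journald prefix followed by one JSON object with `L`, `M`, and optionally `error` or `status` fields), and the helper names `parse_journal_line` and `summarize` are made up for illustration, not part of Scylla Manager.

```python
import json
import sys
from collections import Counter


def parse_journal_line(line):
    """Return the JSON payload of a journald-prefixed log line, or None.

    Assumption: the first '{' on the line starts the JSON object, as in
    the agent and manager log excerpts quoted in this issue.
    """
    start = line.find("{")
    if start == -1:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None


def summarize(lines):
    """Count entries by (level, message, error text or HTTP status)."""
    counts = Counter()
    for line in lines:
        entry = parse_journal_line(line)
        if entry is None:
            continue  # skip anything that is not a JSON log entry
        detail = entry.get("error", entry.get("status", ""))
        key = (entry.get("L", "?"), entry.get("M", "?"), str(detail))
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Read log lines from stdin and print the most frequent entries first.
    for (level, msg, detail), n in summarize(sys.stdin).most_common():
        print(f"{n:5d}  {level:5s}  {msg}  {detail}")
```

It could be fed from the journal, for example `journalctl -u scylla-manager-agent | python3 summarize_logs.py`, where the script file name is hypothetical and the unit name is assumed to match the log identifier shown above.
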
@Michal-Leszczynski
Collaborator

This may be connected to #3298 (the agent runs out of memory when performing a big backup), which was fixed in the SM 3.2.5 release.
Could you upgrade to SM 3.2.5 and verify whether the problem is solved?

@Michal-Leszczynski self-assigned this Dec 13, 2023
@shantanugithub
Author

Yes, after the upgrade the errors have dropped significantly, although they are not completely gone. I will keep monitoring this.

@Michal-Leszczynski
Collaborator

@shantanugithub can we close this issue or are the backups still failing for you?

@shantanugithub
Author

Yes, this can be closed now. I will let you know if it needs to be reopened.

@mykaul closed this as not planned Aug 8, 2024