-
Notifications
You must be signed in to change notification settings - Fork 387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Downgrade "context canceled" errors to Info instead of Error #6166
Conversation
"context canceled" is a normal state of operation, and results in log spam.
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have the users @Mircoxi on file. Once you have signed the CLA, please let us know, so we can manually review and add you to the approved contributors list. |
Signed CLA a few minutes prior to committing! |
…ontext canceled" too. "context canceled" is a normal state of operation, and results in log spam.
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have the users @Mircoxi on file. Once you have signed the CLA, please let us know, so we can manually review and add you to the approved contributors list. |
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have the users @Mircoxi on file. Once you have signed the CLA, please let us know, so we can manually review and add you to the approved contributors list. |
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have the users @Mircoxi on file. Once you have signed the CLA, please let us know, so we can manually review and add you to the approved contributors list. |
Checking the failing tests and the history available to me on the Jenkins log, it seems commit 03c52f1 introduced the bug that's causing test failures in this PR (it seems to be a schema change, which is consistent with what the errors are showing - perhaps the tests haven't been updated to account for it yet?), but it does appear that the tests for the parts I changed are passing. I'd appreciate some guidance on how to proceed given the failing tests. Also, I resubmitted the CLA form since I changed my email address - but I'm confirming that I've signed it. I'll hold off on any more merging updates to the code on my end to keep the bot quiet. |
This pull request has been mentioned on Storj Community Forum (official). There might be relevant details there: |
Sorry for the bot spam- someone has to manually check the signature entry and add you to a list so that the CLA bot will accept your contribution. I'm doing that now. |
storagenode/piecestore/endpoint.go
Outdated
endpoint.log.Error("upload failed", zap.Stringer("Piece ID", limit.PieceId), zap.Stringer("Satellite ID", limit.SatelliteId), zap.Stringer("Action", limit.Action), zap.Error(err), zap.Int64("Size", uploadSize), remoteAddrLogField) | ||
if errors.Is(err, context.Canceled) { | ||
// Context cancellation is common in normal operation, and shouldn't throw a full error. | ||
endpoint.log.Info("upload failed", zap.Stringer("Piece ID", limit.PieceId), zap.Stringer("Satellite ID", limit.SatelliteId), zap.Stringer("Action", limit.Action), zap.Error(err), zap.Int64("Size", uploadSize), remoteAddrLogField) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest going even farther and taking out the zap.Error(err)
parameter. You could add "(context canceled)"
to the log message itself instead. Our errors tend to be very long-winded and the stack trace is unlikely to be important.
Maybe we could log the error and stack trace in a Debug call below as well, just in case it ever is important.
Same recommendation for the Download case.
The build problem has been fixed in main, so it should go away if you rebase or merge in main. |
…bose messages at Info level to explain what happened to the user.
Good call on the shorter messages! I changed the originals to debug level, and then took the liberty of making the version logged at Info leave out everything but the piece ID and a short message telling the user that they lost the race, since I've seen a fair bit of confusion and worry that "context canceled" is a more serious error than it is. |
storagenode/piecestore/endpoint.go
Outdated
@@ -643,7 +645,8 @@ func (endpoint *Endpoint) Download(stream pb.DRPCPiecestore_DownloadStream) (err | |||
mon.IntVal("download_failure_duration_ns", actionSeriesTag).Observe(downloadDuration) | |||
mon.FloatVal("download_failure_rate_bytes_per_sec", actionSeriesTag).Observe(downloadRate) | |||
if errors.Is(err, context.Canceled) { | |||
endpoint.log.Info("download failed", zap.Stringer("Piece ID", limit.PieceId), zap.Stringer("Satellite ID", limit.SatelliteId), zap.Stringer("Action", limit.Action), zap.Int64("Offset", chunk.Offset), zap.Int64("Size", downloadSize), zap.String("Remote Address", remoteAddr), zap.Error(err)) | |||
endpoint.log.Info("download canceled (context canceled)", zap.Stringer("Piece ID", limit.PieceId), zap.String("Reason", "Someone else downloaded the required piece faster.")) | |||
endpoint.log.Debug("download canceled (context canceled)", zap.Stringer("Piece ID", limit.PieceId), zap.Stringer("Satellite ID", limit.SatelliteId), zap.Stringer("Action", limit.Action), zap.Int64("Offset", chunk.Offset), zap.Int64("Size", downloadSize), zap.String("Remote Address", remoteAddr), zap.Error(err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably don't need the " (context canceled)
" hint on this line, since the err is still being provided.
storagenode/piecestore/endpoint.go
Outdated
@@ -361,7 +361,9 @@ func (endpoint *Endpoint) Upload(stream pb.DRPCPiecestore_UploadStream) (err err | |||
mon.FloatVal("upload_failure_rate_bytes_per_sec").Observe(uploadRate) | |||
if errors.Is(err, context.Canceled) { | |||
// Context cancellation is common in normal operation, and shouldn't throw a full error. | |||
endpoint.log.Info("upload failed", zap.Stringer("Piece ID", limit.PieceId), zap.Stringer("Satellite ID", limit.SatelliteId), zap.Stringer("Action", limit.Action), zap.Error(err), zap.Int64("Size", uploadSize), remoteAddrLogField) | |||
endpoint.log.Info("upload canceled (context canceled)", zap.Stringer("Piece ID", limit.PieceId), zap.String("Reason", "Someone else uploaded the required piece faster.")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also possible to get context.Canceled when the node is shutting down, so those log lines could be confusing when a bunch of them show up at the end of a log. And unfortunately it's not trivial to distinguish those cases. What do you think about this-
endpoint.log.Info("upload canceled (race lost or node shutdown)", zap.Stringer("Piece ID", limit.PieceId))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad there! I didn't find any documentation about that possibility, I've gone ahead and corrected it.
What:
This PR simply downgrades the "context canceled" errors as part of piece uploads from Error to Info.
Why:
"context canceled" is a normal state of operation during uploads, and results in log spam, even when a user has elevated their default logging preference. Since it's ostensibly not an error, it shouldn't be marked as such in the logs. This would also close #5799, as well as making many people on the forums happy!
There's an argument to be made that this should be in the section above instead, since technically it's a cancellation rather than a failure, but I assumed that if it wasn't in that section already, it was for a reason. I'm happy to adjust that if preferred.
Please describe the tests:
None added, since this is a simple logging level change.
Please describe the performance impact:
Almost zero.
Code Review Checklist (to be filled out by reviewer)