Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Downgrade "context canceled" errors to Info instead of Error #6166

Merged
merged 11 commits into from
Aug 10, 2023

Conversation

Mircoxi
Copy link
Contributor

@Mircoxi Mircoxi commented Aug 4, 2023

What:

This PR simply downgrades the "context canceled" errors as part of piece uploads from Error to Info.

Why:

"context canceled" is a normal state of operation during uploads, and results in log spam, even when a user has elevated their default logging preference. Since it's ostensibly not an error, it shouldn't be marked as such in the logs. This would also close #5799, as well as making many people on the forums happy!

There's an argument to be made that this should be in the section above instead, since technically it's a cancellation rather than a failure, but I assumed that if it wasn't in that section already, it was for a reason. I'm happy to adjust that if preferred.

Please describe the tests:

None added, since this is a simple logging level change.

Please describe the performance impact:

Almost zero.

Code Review Checklist (to be filled out by reviewer)

  • NEW: Are there any Satellite database migrations? Are they forwards and backwards compatible?
  • Does the PR describe what changes are being made?
  • Does the PR describe why the changes are being made?
  • Does the code follow our style guide?
  • Does the code follow our testing guide?
  • Is the PR appropriately sized? (If it could be broken into smaller PRs it should be)
  • Does the new code have enough tests? (every PR should have tests or justification otherwise. Bug-fix PRs especially)
  • Does the new code have enough documentation that answers "how do I use it?" and "what does it do?"? (both source documentation and higher level, diagrams?)
  • Does any documentation need updating?
  • Do the database access patterns make sense?

"context canceled" is a normal state of operation, and results in log spam.
@Mircoxi Mircoxi requested a review from profclems as a code owner August 4, 2023 21:18
@cla-bot
Copy link

cla-bot bot commented Aug 4, 2023

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have the users @Mircoxi on file. Once you have signed the CLA, please let us know, so we can manually review and add you to the approved contributors list.

@Mircoxi
Copy link
Contributor Author

Mircoxi commented Aug 4, 2023

Signed CLA a few minutes prior to committing!

…ontext canceled" too.

"context canceled" is a normal state of operation, and results in log spam.
@cla-bot
Copy link

cla-bot bot commented Aug 5, 2023

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have the users @Mircoxi on file. Once you have signed the CLA, please let us know, so we can manually review and add you to the approved contributors list.

@cla-bot
Copy link

cla-bot bot commented Aug 7, 2023

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have the users @Mircoxi on file. Once you have signed the CLA, please let us know, so we can manually review and add you to the approved contributors list.

@cla-bot
Copy link

cla-bot bot commented Aug 8, 2023

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have the users @Mircoxi on file. Once you have signed the CLA, please let us know, so we can manually review and add you to the approved contributors list.

@Mircoxi
Copy link
Contributor Author

Mircoxi commented Aug 8, 2023

Checking the failing tests and the history available to me on the Jenkins log, it seems commit 03c52f1 introduced the bug that's causing test failures in this PR (it seems to be a schema change, which is consistent with what the errors are showing - perhaps the tests haven't been updated to account for it yet?), but it does appear that the tests for the parts I changed are passing. I'd appreciate some guidance on how to proceed given the failing tests.

Also, I resubmitted the CLA form since I changed my email address - but I'm confirming that I've signed it. I'll hold off on any more merging updates to the code on my end to keep the bot quiet.

@storjrobot
Copy link

This pull request has been mentioned on Storj Community Forum (official). There might be relevant details there:

https://forum.storj.io/t/log-size-and-rotating/23421/35

@thepaul
Copy link
Contributor

thepaul commented Aug 8, 2023

Sorry for the bot spam- someone has to manually check the signature entry and add you to a list so that the CLA bot will accept your contribution. I'm doing that now.

endpoint.log.Error("upload failed", zap.Stringer("Piece ID", limit.PieceId), zap.Stringer("Satellite ID", limit.SatelliteId), zap.Stringer("Action", limit.Action), zap.Error(err), zap.Int64("Size", uploadSize), remoteAddrLogField)
if errors.Is(err, context.Canceled) {
// Context cancellation is common in normal operation, and shouldn't throw a full error.
endpoint.log.Info("upload failed", zap.Stringer("Piece ID", limit.PieceId), zap.Stringer("Satellite ID", limit.SatelliteId), zap.Stringer("Action", limit.Action), zap.Error(err), zap.Int64("Size", uploadSize), remoteAddrLogField)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest going even farther and taking out the zap.Error(err) parameter. You could add "(context canceled)" to the log message itself instead. Our errors tend to be very long-winded and the stack trace is unlikely to be important.

Maybe we could log the error and stack trace in a Debug call below as well, just in case it ever is important.

Same recommendation for the Download case.

@thepaul
Copy link
Contributor

thepaul commented Aug 8, 2023

The build problem has been fixed in main, so it should go away if you rebase or merge in main.

Mircoxi and others added 2 commits August 8, 2023 20:53
@cla-bot cla-bot bot added the cla-signed label Aug 8, 2023
@Mircoxi
Copy link
Contributor Author

Mircoxi commented Aug 8, 2023

Good call on the shorter messages! I changed the originals to debug level, and then took the liberty of making the version logged at Info leave out everything but the piece ID and a short message telling the user that they lost the race, since I've seen a fair bit of confusion and worry that "context canceled" is a more serious error than it is.

@@ -643,7 +645,8 @@ func (endpoint *Endpoint) Download(stream pb.DRPCPiecestore_DownloadStream) (err
mon.IntVal("download_failure_duration_ns", actionSeriesTag).Observe(downloadDuration)
mon.FloatVal("download_failure_rate_bytes_per_sec", actionSeriesTag).Observe(downloadRate)
if errors.Is(err, context.Canceled) {
endpoint.log.Info("download failed", zap.Stringer("Piece ID", limit.PieceId), zap.Stringer("Satellite ID", limit.SatelliteId), zap.Stringer("Action", limit.Action), zap.Int64("Offset", chunk.Offset), zap.Int64("Size", downloadSize), zap.String("Remote Address", remoteAddr), zap.Error(err))
endpoint.log.Info("download canceled (context canceled)", zap.Stringer("Piece ID", limit.PieceId), zap.String("Reason", "Someone else downloaded the required piece faster."))
endpoint.log.Debug("download canceled (context canceled)", zap.Stringer("Piece ID", limit.PieceId), zap.Stringer("Satellite ID", limit.SatelliteId), zap.Stringer("Action", limit.Action), zap.Int64("Offset", chunk.Offset), zap.Int64("Size", downloadSize), zap.String("Remote Address", remoteAddr), zap.Error(err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably don't need the " (context canceled)" hint on this line, since the err is still being provided.

@@ -361,7 +361,9 @@ func (endpoint *Endpoint) Upload(stream pb.DRPCPiecestore_UploadStream) (err err
mon.FloatVal("upload_failure_rate_bytes_per_sec").Observe(uploadRate)
if errors.Is(err, context.Canceled) {
// Context cancellation is common in normal operation, and shouldn't throw a full error.
endpoint.log.Info("upload failed", zap.Stringer("Piece ID", limit.PieceId), zap.Stringer("Satellite ID", limit.SatelliteId), zap.Stringer("Action", limit.Action), zap.Error(err), zap.Int64("Size", uploadSize), remoteAddrLogField)
endpoint.log.Info("upload canceled (context canceled)", zap.Stringer("Piece ID", limit.PieceId), zap.String("Reason", "Someone else uploaded the required piece faster."))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also possible to get context.Canceled when the node is shutting down, so those log lines could be confusing when a bunch of them show up at the end of a log. And unfortunately it's not trivial to distinguish those cases. What do you think about this-

endpoint.log.Info("upload canceled (race lost or node shutdown)", zap.Stringer("Piece ID", limit.PieceId))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad there! I didn't find any documentation about that possibility, I've gone ahead and corrected it.

@mniewrzal mniewrzal merged commit 6f01a81 into storj:main Aug 10, 2023
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

remove from "error" log events with: context canceled
5 participants