add timeout to dynamic Reconcile; switch from CloseSend to CloseRecv in streamLogs; add goroutine dump on timeouts/deadlocks #725
Conversation
OK, 4cc4cab passed the e2e's. Next, in a separate commit, I will attempt a more expansive use of a context with timeout around all gRPC calls from the dynamic reconciler.
OK @sayan-biswas, I have two forms of a fix and/or workaround for the results watcher hang we saw in our prod env today. The first commit just covers updating logs to the results API server; the second commit handles all gRPC calls to the API server coming out of the dynamic reconciler. After some code review, it centers on the notion that we were blocked on a gRPC call to the API server, where the API server may have been in a bad state, given I noticed an API server restart around the time I recycled the watcher pod. So basically:
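For reference, a minimal sketch of that pattern (illustrative names only, not the actual watcher code): every call issued from the Reconcile path runs under a child context with a deadline that is released on exit, so a wedged API server cannot block the reconciler thread forever.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callAPIServer stands in for any gRPC call the dynamic reconciler makes
// against the results API server (CreateRecord, UpdateLog, ...).
func callAPIServer(ctx context.Context) error {
	select {
	case <-time.After(10 * time.Second): // simulate a hung API server
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// reconcileWithTimeout shows the shape of the workaround: the RPC runs under
// a child context with a deadline, and the child context is always cancelled
// when the method returns.
func reconcileWithTimeout(parent context.Context, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel() // always release the child context on method exit

	if err := callAPIServer(ctx); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			return fmt.Errorf("API server call timed out after %s: %w", timeout, err)
		}
		return err
	}
	return nil
}

func main() {
	fmt.Println(reconcileWithTimeout(context.Background(), 2*time.Second))
}
```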
	}

	return nil
}

func (r *Reconciler) streamLogs(ctx context.Context, o results.Object, logType, logName string) error {
	logger := logging.FromContext(ctx)
	logsClient, err := r.resultsClient.UpdateLog(ctx)
	streamCtx, streamCancel := context.WithTimeout(ctx, 5*time.Minute)
If we go with the broader use of a context timeout within the dynamic reconciler loop, we would get rid of this particular child context with timeout (illustrated below).
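A small illustration of why the child context would become redundant: context.WithTimeout can only tighten a deadline, never extend one, so once Reconcile itself carries a timeout, the 5-minute streamCtx adds nothing. The durations here are made up for the demo.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// If the reconcile context already carries a deadline, a nested
// context.WithTimeout never pushes it later; the effective deadline is the
// earlier of the two.
func main() {
	parent, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	streamCtx, streamCancel := context.WithTimeout(parent, 5*time.Minute)
	defer streamCancel()

	pd, _ := parent.Deadline()
	sd, _ := streamCtx.Deadline()
	fmt.Println("parent deadline:", pd)
	fmt.Println("stream deadline:", sd) // same as the parent's; the 5m timeout cannot extend it
}
```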
force-pushed from ff9ad16 to 2acd8a5
force-pushed from 2acd8a5 to 663a88f
force-pushed from 663a88f to 0e6750b
OK @sayan-biswas, I now have a third commit that determines whether the context or any specific gRPC call exited because of deadline exceeded, and if so, dumps a goroutine list with stack traces to stdout for analysis (a sketch of the idea follows). So, with our hang on Friday, either the tkn client threading ended in a deadlock, or, more likely, we had a gRPC call to the API server hang (perhaps because the API server became messed up, which might have occurred since it restarted on its own). In either case, we should get a thread dump in the log that we can diagnose.
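Rough sketch of the diagnostic idea (not the exact code in the commit): check whether the context ended with DeadlineExceeded rather than an explicit cancel, and if so, write all goroutine stacks to stdout for later analysis.

```go
package main

import (
	"context"
	"errors"
	"os"
	"runtime/pprof"
	"time"
)

// dumpGoroutinesOnTimeout dumps every goroutine's stack trace if the context
// ended because its deadline was exceeded (as opposed to a normal cancel).
func dumpGoroutinesOnTimeout(ctx context.Context) {
	if errors.Is(ctx.Err(), context.DeadlineExceeded) {
		// debug=2 prints full stack traces for every goroutine.
		_ = pprof.Lookup("goroutine").WriteTo(os.Stdout, 2)
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	<-ctx.Done()
	dumpGoroutinesOnTimeout(ctx)
}
```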
force-pushed from 0e6750b to ad4fd1a
force-pushed from ad4fd1a to 422f1de
force-pushed from 422f1de to db9eb3e
e2e's are green with the timeout fully propagated and no go func for sendLogs. Will update the configuration of the timeout now.
@sayan-biswas @khrm @avinal @enarha this PR is ready for review, iterate, merge.
	// context with timeout does not work with the partial end to end flow that exists with unit tests;
	// this field will always be set for real
	if r.cfg != nil && r.cfg.UpdateLogTimeout != nil {
		dynamicContext, dynamicCancel = context.WithTimeout(ctx, *r.cfg.UpdateLogTimeout)
Suggested change:
-		dynamicContext, dynamicCancel = context.WithTimeout(ctx, *r.cfg.UpdateLogTimeout)
+		ctx, dynamicCancel = context.WithTimeout(ctx, *r.cfg.UpdateLogTimeout)
Yeah, I'm not a fan of using the generic variable name for this ... I was intentional in this name change. Can you elaborate @khrm on why you want the name to be ctx?
Someone modifying the function later on can easily use ctx by mistake, since ctx is still available.
ok you've swayed me ... will adjust
I've pushed as a separate commit for now @khrm to facilitate your review .... I can squash once we are sync'ed on the change being ready
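For illustration, a hypothetical reconcile showing the effect of the suggested rename: reassigning the name ctx means any later code that reaches for "ctx" automatically gets the timeout-bound context, with no second variable left around to use by mistake. Names and the config field are stand-ins, not the watcher code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// reconcile rebinds ctx to the timeout-bound context when a timeout is
// configured, so there is no separate dynamicContext variable that a future
// change could bypass accidentally.
func reconcile(ctx context.Context, updateLogTimeout *time.Duration) {
	cancel := context.CancelFunc(func() {})
	if updateLogTimeout != nil {
		ctx, cancel = context.WithTimeout(ctx, *updateLogTimeout)
	}
	defer cancel()

	// From here on, "ctx" is the only context in scope.
	deadline, ok := ctx.Deadline()
	fmt.Println("deadline set:", ok, deadline)
}

func main() {
	d := 2 * time.Second
	reconcile(context.Background(), &d)
	reconcile(context.Background(), nil)
}
```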
@@ -333,22 +414,21 @@ func (r *Reconciler) sendLog(ctx context.Context, o results.Object) error {
		zap.String("name", o.GetName()),
	)

	go func() {
Why don't we pass the context here and use select and ctx.Done()?
Because the end result is the same; we block the reconcile thread until the operation is complete. No reason for the added complexity.
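For comparison, a minimal sketch of the select/ctx.Done() alternative raised here (hypothetical sendLog, not the PR's code): the reconcile thread still ends up waiting for one of the two cases either way, which is why the PR simply calls sendLog inline.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sendLog pretends to stream a log, honoring context cancellation.
func sendLog(ctx context.Context) error {
	select {
	case <-time.After(50 * time.Millisecond): // simulate streaming work
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// The goroutine + select variant: the caller still blocks until either
	// sendLog finishes or the context is released.
	errCh := make(chan error, 1)
	go func() { errCh <- sendLog(ctx) }()

	select {
	case err := <-errCh:
		fmt.Println("sendLog finished:", err)
	case <-ctx.Done():
		fmt.Println("context released first:", ctx.Err())
	}
}
```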
@@ -396,6 +499,13 @@ func (r *Reconciler) streamLogs(ctx context.Context, o results.Object, logType,
		Err: inMemWriteBufferStderr,
	}, logChan, errChan)

	// pull the first error that occurred and return on that; reminder - per https://golang.org/ref/spec#Channel_types
	// channels act as FIFO queues
	chanErr, ok := <-errChan
Since @khrm's last review, I've added here a simplified version of the error handling that @sayan-biswas used to do prior to PR #712.
Let's just return the first error seen; from what I see in the tkn code, multiple errors are not collected (see the sketch below).
With this, though, @sayan-biswas, plus the prior changes made with #712 and moving the sendLogs call back onto the reconciler thread, we now address one of the shortcomings of the old implementation, namely ignoring errors and preventing retries on reconciliation.
Reminder: with the threadiness arg, users can adjust the number of threads when they enable log support.
I will add a separate PR to add the k8s client tuning shortly.
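Roughly, the simplified handling looks like this (illustrative only, not the actual streamLogs code): surface the first error off the channel, and let normal reconciliation retries handle the rest.

```go
package main

import (
	"errors"
	"fmt"
)

// firstError returns the first error delivered on the channel, if any.
// Channels are FIFO, so this is the earliest failure the log fetcher saw;
// any later errors are intentionally not collected.
func firstError(errChan <-chan error) error {
	if err, ok := <-errChan; ok && err != nil {
		return err
	}
	return nil
}

func main() {
	errChan := make(chan error, 2)
	errChan <- errors.New("stream interrupted")
	errChan <- errors.New("second failure, intentionally ignored")
	close(errChan)

	fmt.Println(firstError(errChan)) // prints only the first error
}
```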
#744 is the add tuning options PR
@khrm @sayan-biswas this is ready for review again ... once we are good, I can squash as needed. Thanks.
/approve
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: khrm, sayan-biswas. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Changes are fine.
…next with timeout; switch from CloseSend to CloseRecv in streamLogs; add analysis, possible goroutine dump, to the context with timeout in dynamic reconciler

So we ran with the change from tektoncd#712 in our production system, and while we confirmed the results logging memory leak was addressed, after about 12 to 13 hours our reconciler threads systematically became deadlocked and our watcher quit processing events. We have not yet been able to get a goroutine dump with stack traces when this problem occurs, so we are unclear whether the tektoncd#712 fixes were a direct cause of the deadlock, or whether another issue was encountered. Among other things, our API server container restarted during the watcher deadlock, where the previous pod logs gave no clear indication as to why. This change pulls in a couple of potentially helpful bits to help either diagnose or work around the deadlock:
1) we have added a configurable timeout to the context used in the dynamic Reconcile method, in case a blockage in any RPC call using a context was somehow causing the problem
2) we also cancel the context on method exit, to again unblock things, and switch to CloseAndRecv instead of CloseSend to confirm the UpdateLog finished, so that our cancelling of the streamLogs context does not intermittently cancel an UpdateLog call that would otherwise have succeeded
3) we analyze how a context is released, and if it is from a timeout and not a cancel, we initiate a goroutine dump with stack traces
4) using a context with timeout that is cancelled on exit from the Reconcile method requires no longer running sendLogs on a separate goroutine, otherwise we would re-introduce intermittent cancelling of UpdateLog processing before it could complete
5) we now log the deadlines for UpdateLog on the API server side
6) we are back to pulling errors off of the tkn client error channel
force-pushed from c8e684c to 2ab9f4e
Thanks @sayan-biswas @khrm, commits are squashed.
/lgtm
Changes
So we ran with the change from #712 in our production system, and while we confirmed the results logging memory leak was addressed, after about 12 to 13 hours our reconciler threads systematically became deadlocked and our watcher quit processing events.
We have not yet been able to get a goroutine dump with stack traces when this problem occurs, so we are unclear whether the #712 fixes were a direct cause of the deadlock, or whether another issue was encountered. Among other things, our API server container restarted during the watcher deadlock, where the previous pod logs gave no clear indication as to why.
This change pulls in a couple of potentially helpful bits to help either diagnose or work around the deadlock:
1) we have added a timeout to the context used in the dynamic Reconcile method, in case a blockage in any RPC call using a context was somehow causing the problem
2) we also cancel the context on method exit, to again unblock things, and switch to CloseAndRecv instead of CloseSend to confirm the UpdateLog finished, so that our cancelling of the streamLogs context does not intermittently cancel an UpdateLog call that would otherwise have succeeded (see the sketch below)
3) we analyze how a context is released, and if it is from a timeout and not a cancel, we initiate a goroutine dump with stack traces
4) using a context with timeout that is cancelled on exit from the Reconcile method requires no longer running sendLogs on a separate goroutine, otherwise we would re-introduce intermittent cancelling of UpdateLog processing before it could complete
5) we now log the deadlines for UpdateLog on the API server side
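A sketch of the CloseSend vs. CloseAndRecv distinction, using a stand-in for the generated client-streaming stub (not the real tektoncd/results stub, whose CloseAndRecv returns a log summary message): CloseSend only half-closes the stream and returns immediately, while CloseAndRecv blocks until the server acknowledges, so cancelling the context after it cannot kill an upload that was still being persisted.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// logStream mimics the shape of a gRPC client-streaming stub.
type logStream interface {
	Send(chunk []byte) error
	CloseSend() error
	CloseAndRecv() (string, error)
}

type fakeStream struct{ done chan string }

func (f *fakeStream) Send(chunk []byte) error { return nil }
func (f *fakeStream) CloseSend() error        { return nil } // returns immediately, no confirmation
func (f *fakeStream) CloseAndRecv() (string, error) {
	return <-f.done, nil // waits for the server-side acknowledgement
}

// flushLogs waits for the server to confirm the upload before the deferred
// cancel releases the stream's context.
func flushLogs(ctx context.Context, s logStream, cancel context.CancelFunc) error {
	defer cancel()
	if err := s.Send([]byte("log chunk")); err != nil {
		return err
	}
	summary, err := s.CloseAndRecv()
	if err != nil {
		return err
	}
	fmt.Println("server confirmed:", summary)
	return nil
}

func main() {
	done := make(chan string, 1)
	go func() { time.Sleep(50 * time.Millisecond); done <- "all chunks stored" }()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	fmt.Println(flushLogs(ctx, &fakeStream{done: done}, cancel))
}
```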
rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
/kind bug
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you review them:
/kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
Release Notes