employ some channel best practices to avoid hanging goroutines #712
Conversation
/retest
hey @sayan-biswas @avinal @ramessesii2 @enarha I'm going to need your help with the error I'm seeing in the integration tests here. Turns out the
and with my latest commit, I also get
Now, I first thought it was related to the child context we originally put in the fix, but I've made a couple of successive commits, first removing the call to the cancel function and then removing the use of the child context, and neither has had any effect. I still get this. I did add debug output in the goroutine and confirmed that
WDYT? thanks
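For readers following along, here is a minimal sketch (with hypothetical names, not code from this PR) of the child-context pattern being ruled out above: cancelling a child context while a downstream call is still using it surfaces as a "context canceled" error in that call.

```go
// Hypothetical illustration only; updateLog stands in for the real RPC call.
package main

import (
	"context"
	"fmt"
	"time"
)

func updateLog(ctx context.Context) error {
	select {
	case <-time.After(500 * time.Millisecond): // stand-in for the UpdateLog RPC
		return nil
	case <-ctx.Done():
		return ctx.Err() // surfaces as "context canceled" if the parent cancels early
	}
}

func main() {
	child, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(100 * time.Millisecond)
		cancel() // cancelling too early aborts the in-flight call
	}()
	fmt.Println(updateLog(child))
}
```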
good news @sayan-biswas @avinal @ramessesii2 @enarha - with c76523c I have bypassed the canceled context error in UpdateLogs on the API server. Seeing that error with the mem leak fix maybe makes sense, in that we are doing better at unblocking the tknlog.NewWriter(..).Write(...) call, but I'm having @pmacik run a parallel test with release 0.9.1 so we can confirm whether the error was ever happening with that version. I just now need to hack on this fix some more so that we sufficiently block on the context before we exit.
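To illustrate what "sufficiently block on the context before we exit" could look like, here is a minimal, self-contained sketch under assumed names (streamLogs, writeLogs, and errCh are hypothetical, not the watcher's actual code): the caller waits for either the writer goroutine to finish or the context to end before returning, and the buffered error channel guarantees the goroutine can always complete its send and exit.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func streamLogs(ctx context.Context, writeLogs func(context.Context) error) error {
	errCh := make(chan error, 1) // buffered so the goroutine can always send and exit

	go func() {
		errCh <- writeLogs(ctx)
	}()

	// Block until the writer is done or the context is finished; either way
	// the goroutine can complete its send thanks to the buffered channel.
	select {
	case err := <-errCh:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	err := streamLogs(ctx, func(ctx context.Context) error {
		time.Sleep(100 * time.Millisecond) // stand-in for tknlog.NewWriter(...).Write(...)
		return nil
	})
	fmt.Println("streamLogs returned:", err)
}
```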
my goodness golint even complains about misspelling words :-)
@sayan-biswas @avinal @ramessesii2 @enarha I think it would be good to get this PR in first, to help provide some immediate relief with our deployment; then I can rebase #715 as needed and, after our stress test for that one is done, follow up with getting you all to get that ready for merge. PTAL
bump @sayan-biswas @avinal per my comment #715 (comment) - let's merge this before #715 to get some immediate relief for the memory leak as well as get some baselines in production environments before we entertain #715
…mory leaks from unclosed buffered channels
fix panic in e2es and add dump of controller logs on errors
rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
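As context for the commit title, here is a generic sketch of the kind of channel practices it refers to (not the PR's actual code): the producer selects on ctx.Done() so it never hangs on a send nobody will receive, and it closes the channel so the consumer's range loop terminates instead of leaking a blocked goroutine.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func produce(ctx context.Context, out chan<- string) {
	defer close(out) // always close so the reader's range loop ends
	for i := 0; i < 5; i++ {
		select {
		case out <- fmt.Sprintf("line %d", i):
		case <-ctx.Done():
			return // reader went away; exit instead of hanging on the send
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	ch := make(chan string, 2) // small buffer smooths bursts but is not a leak fix by itself
	go produce(ctx, ch)

	for line := range ch {
		fmt.Println(line)
	}
}
```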
I've squashed the commits (there were several temp/debug commits in there)
Seems to work fine in my tests.
/hold cancel
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: avinal, sayan-biswas
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/lgtm
…next with timeout
switch from CloseSend to CloseAndRecv in streamLogs
add analysis, possible goroutine dump, to the context with timeout in dynamic reconciler

So we ran with the change from #712 in our production system, and while we confirmed the results logging memory leak was addressed, after about 12 to 13 hours our reconciler threads systematically became deadlocked and our watcher quit processing events. We have not yet been able to get a goroutine dump with stack traces when this problem occurs, so we are unclear whether the #712 fixes directly caused the deadlock or whether another issue was encountered. Among other things, our api server container restarted during the watcher deadlock, and the previous pod logs gave no clear indication as to why. This change pulls in a couple of potentially helpful bits to help either diagnose or work around the deadlock:
1) we have added a configurable timeout to the context used in the dynamic Reconcile method, in case a blockage in any RPC call using a context was somehow causing the problem
2) we also cancel the context on method exit, to again unblock things, and switch from CloseSend to CloseAndRecv to confirm that UpdateLog finished, so that our canceling of the streamLogs context does not intermittently cancel an UpdateLog call that would otherwise have succeeded
3) we analyze how a context is released, and if it is from a timeout and not a cancel, we initiate a goroutine dump with stack traces
4) using a context with timeout that is canceled on exit from the reconcile method requires no longer running 'sendLogs' on a separate goroutine, otherwise we would re-introduce intermittent cancelling of 'UpdateLog' processing before it could complete
5) we now log the deadlines for UpdateLog on the api server side
6) we are back to pulling errors off of the tkn client error channel
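A rough sketch of the diagnostic pattern this commit message describes, with hypothetical names (reconcileTimeout, doWork) standing in for the real reconciler and gRPC calls: the work runs under a context with a configurable timeout that is canceled on exit, and if the context ended because the deadline passed rather than because the work finished, a goroutine dump with stack traces is written.

```go
// Minimal sketch, not the actual watcher code.
package main

import (
	"context"
	"errors"
	"log"
	"os"
	"runtime/pprof"
	"time"
)

var reconcileTimeout = 2 * time.Second // hypothetical; configurable in the real change

func reconcile(parent context.Context, doWork func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, reconcileTimeout)
	defer cancel() // always release the context on method exit to unblock callees

	err := doWork(ctx)

	// If the context expired (timeout) rather than being canceled by us,
	// capture a goroutine dump with stack traces to diagnose a possible deadlock.
	if errors.Is(ctx.Err(), context.DeadlineExceeded) {
		log.Printf("reconcile deadline exceeded, dumping goroutines")
		_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
	}
	return err
}

func main() {
	// Simulated work that blocks longer than the timeout, triggering the dump.
	err := reconcile(context.Background(), func(ctx context.Context) error {
		select {
		case <-time.After(10 * time.Second):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	})
	log.Printf("reconcile returned: %v", err)
}
```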
Changes
Fixes #695 (possibly)
my team (including @ramessesii2 @avinal @sayan-biswas @pmacik) is doing some internal testing to both zero in on the cause of the leak and see whether our theory that the goroutines are not getting GC'ed, based on what we see in testing, proves out.
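For background on that theory, a small illustration (not from this PR): goroutines are never garbage collected, so one blocked forever on a channel send stays alive along with everything it references, which is exactly the kind of leak the buffered-channel fixes target.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func leak() {
	ch := make(chan int) // unbuffered and never read
	go func() {
		ch <- 1 // blocks forever; this goroutine can never be collected
	}()
}

func main() {
	before := runtime.NumGoroutine()
	for i := 0; i < 100; i++ {
		leak()
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Printf("goroutines before=%d after=%d\n", before, runtime.NumGoroutine())
}
```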
I've marked this as WIP and hold because aside from still needing to verify things, this PR currently has some debugging prints we will eventually remove.
/kind bug
/hold
Submitter Checklist
These are the criteria that every PR should meet; please check them off as you review them:
/kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
Release Notes