
employ some channel best practices to avoid hanging goroutines #712

Merged
merged 1 commit into tektoncd:main on Feb 27, 2024

Conversation

@gabemontero (Contributor) commented Feb 14, 2024

Changes

Fixes #695 (possibly).

My team (including @ramessesii2 @avinal @sayan-biswas @pmacik) is doing some internal testing both to zero in on the cause of the leak and to see whether our theory, that the goroutines are not getting GC'ed, holds up based on what we see in testing.

I've marked this as WIP and put it on hold because, aside from still needing to verify things, this PR currently has some debug prints that we will eventually remove.
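For context, here is a minimal, hypothetical sketch (not code from this PR; all names are illustrative) of the kind of channel usage that can leave goroutines hanging, alongside the buffered-channel/ctx.Done() pattern that avoids it:

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// leaky starts a worker whose result is never read; the send blocks forever,
// and a goroutine blocked on a channel send is never garbage collected.
func leaky() {
	ch := make(chan error) // unbuffered, and no reader ever shows up
	go func() {
		ch <- fmt.Errorf("work finished") // blocks forever: goroutine leak
	}()
}

// safer gives the send a buffer slot and a ctx.Done() escape hatch, so the
// worker can always exit even if nobody reads the result.
func safer(ctx context.Context) {
	ch := make(chan error, 1) // buffered: the send succeeds even with no reader
	go func() {
		select {
		case ch <- fmt.Errorf("work finished"):
		case <-ctx.Done(): // the caller went away; stop waiting
		}
	}()
}

func main() {
	for i := 0; i < 100; i++ {
		leaky()
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	for i := 0; i < 100; i++ {
		safer(ctx)
	}
	time.Sleep(100 * time.Millisecond)
	// Roughly 100 goroutines remain, all from leaky; the safer workers exited.
	fmt.Println("goroutines still alive:", runtime.NumGoroutine())
}
```

A goroutine parked on a channel operation always counts as reachable, so giving every send either a buffer slot or a cancellation path is what keeps the count from growing over time.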

/kind bug
/hold

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you review them:

  • Has Docs included if any changes are user facing
  • Has Tests included if any functionality added or changed
  • Tested your changes locally (if this is a code change)
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user-facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • Release notes contain the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Fixes around the threading/goroutines employed while storing logs mitigate an observed memory leak.

@tekton-robot added the kind/bug, do-not-merge/work-in-progress, release-note, and do-not-merge/hold labels on Feb 14, 2024
@tekton-robot added the size/M label on Feb 14, 2024
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 65.4% -3.9

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 65.0% -4.3

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 62.6% -6.8

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 65.0% -4.3

@tekton-robot added the size/L label and removed the size/M label on Feb 14, 2024
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 65.0% -4.3

@gabemontero (Contributor Author)

/retest

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 65.0% -4.3

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 65.0% -4.3

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 65.0% -4.3

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 63.9% -5.5

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 63.9% -5.5

@gabemontero (Contributor Author)

Hey @sayan-biswas @avinal @ramessesii2 @enarha, I'm going to need your help with the error I'm seeing in the integration tests here.

It turns out UpdateLog is not completing successfully. When I dump the API server logs, I see a couple of "context canceled" messages:

        {"level":"info","ts":1708029631.050909,"caller":"zap/options.go:212","msg":"finished streaming call with code Canceled","grpc.auth_disabled":true,"grpc.start_time":"2024-02-15T20:40:30Z","system":"grpc","span.kind":"server","grpc.service":"tekton.results.v1alpha2.Logs","grpc.method":"UpdateLog","peer.address":"10.244.1.14:34252","grpc.user":"system:serviceaccount:tekton-pipelines:tekton-results-watcher","grpc.issuer":"https://kubernetes.default.svc.cluster.local","error":"rpc error: code = Canceled desc = context canceled","grpc.code":"Canceled","grpc.time_duration_in_ms":97}

and with my latest commit, I also get

        {"level":"error","ts":1708032696.9937341,"caller":"zap/options.go:212","msg":"finished streaming call with code Unknown","grpc.auth_disabled":true,"grpc.start_time":"2024-02-15T21:31:36Z","system":"grpc","span.kind":"server","grpc.service":"tekton.results.v1alpha2.Logs","grpc.method":"UpdateLog","peer.address":"10.244.1.14:54414","grpc.user":"system:serviceaccount:tekton-pipelines:tekton-results-watcher","grpc.issuer":"https://kubernetes.default.svc.cluster.local","error":"context canceled","grpc.code":"Unknown","grpc.time_duration_in_ms":48,"stacktrace":"github.com/grpc-ecosystem/go-grpc-middleware/logging/zap.DefaultMessageProducer\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/options.go:212\ngithub.com/grpc-ecosystem/go-grpc-middleware/logging/zap.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/server_interceptors.go:61\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware/tags.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/tags/interceptors.go:39\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:58\ngoogle.golang.org/grpc.(*Server).processStreamingRPC\n\tgoogle.golang.org/grpc@v1.60.1/server.go:1673\ngoogle.golang.org/grpc.(*Server).handleStream\n\tgoogle.golang.org/grpc@v1.60.1/server.go:1787\ngoogle.golang.org/grpc.(*Server).serveStreams.func2.1\n\tgoogle.golang.org/grpc@v1.60.1/server.go:1016"}

I first thought it was related to the child context we originally put in the fix, but I've made a couple of successive commits, first removing the call to the cancel function and then removing the use of the child context, and they have not had any effect. I still get this canceled-context error on UpdateLog.

I did add debugging in the goroutine and confirmed that the Flush call returns no error, and there were positive counts with respect to the data being flushed:

{"level":"info","time":"2024-02-15T21:14:33.290Z","logger":"watcher","caller":"dynamic/dynamic.go:432","msg":"GGMGGM1 flush ret count","commit":"d7fe8ee","knative.dev/traceid":"65b9ff5a-eed4-415b-9b87-65df76f0cabb","knative.dev/key":"default/hello-gcs-hello","results.tekton.dev/kind":"TaskRun","name":"hello-gcs-hello","flushCount":162}
{"level":"info","time":"2024-02-15T21:14:33.330Z","logger":"watcher","caller":"dynamic/dynamic.go:432","msg":"GGMGGM1 flush ret count","commit":"d7fe8ee","knative.dev/traceid":"534637e7-7558-4ef5-bcba-0ce8b864a7d3","knative.dev/key":"default/hello-gcs","results.tekton.dev/kind":"PipelineRun","name":"hello-gcs","flushCount":197}

WDYT?

thanks
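As an illustration only, here is a hypothetical, minimal sketch (not the watcher's actual types or the Results API) of how a "context canceled" on UpdateLog can arise when the context backing the streaming call goes away before the stream is closed out:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fakeUpdateLogStream stands in for the gRPC client stream; the real one
// would be opened with something like logsClient.UpdateLog(ctx).
func fakeUpdateLogStream(ctx context.Context, done chan<- error) {
	select {
	case <-time.After(50 * time.Millisecond): // pretend flushing/closing takes a while
		done <- nil
	case <-ctx.Done():
		done <- fmt.Errorf("rpc error: %w", ctx.Err()) // surfaces as code Canceled server-side
	}
}

// reconcile models the problematic ordering: the log streaming runs on its
// own goroutine, and the reconciler cancels the context as soon as it exits,
// before the stream has finished.
func reconcile() {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan error, 1) // buffered so the goroutine can always finish
	go fakeUpdateLogStream(ctx, done)
	cancel() // reconciler is done; the in-flight stream loses its context
	fmt.Println("streamLogs result:", <-done)
}

func main() {
	reconcile()
}
```

If something like this ordering is in play, the server reporting code Canceled would be consistent with the log excerpts above.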

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 63.6% -5.7

@gabemontero (Contributor Author) commented Feb 16, 2024

Good news @sayan-biswas @avinal @ramessesii2 @enarha: with c76523c I have bypassed the canceled-context error in UpdateLog on the API server.

Seeing that error with the memory leak fix perhaps makes sense, in that we are doing better at unblocking the tknlog.NewWriter(..).Write(...) call, but I'm having @pmacik run a parallel test with release 0.9.1 so we can confirm whether the error was ever happening with that version.

I now just need to hack on this fix some more so that we block sufficiently on the context before we exit streamLogs.
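A minimal sketch of that idea, assuming streamLogs hands the tkn writer work to a goroutine (names and timings below are placeholders, not the actual implementation): wait for a completion signal before returning, so a deferred cancel cannot race an in-flight UpdateLog.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// streamLogs waits for the writer goroutine to signal completion before it
// returns, so the deferred cancel cannot race an in-flight UpdateLog.
func streamLogs(ctx context.Context) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // only fires after we have waited on writeDone below

	writeDone := make(chan error, 1) // buffered so the writer never blocks on send
	go func() {
		// stand-in for tknlog.NewWriter(...).Write(...) plus the flush/close
		time.Sleep(50 * time.Millisecond)
		writeDone <- nil
	}()

	select {
	case err := <-writeDone:
		return err // the writer finished; it is now safe to cancel and return
	case <-ctx.Done():
		return ctx.Err() // the caller gave up; propagate rather than leak
	}
}

func main() {
	fmt.Println("streamLogs returned:", streamLogs(context.Background()))
}
```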

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 62.7% -6.6

@gabemontero (Contributor Author)

my goodness, golangci-lint even complains about misspelled words :-)

 pkg/watcher/reconciler/dynamic/dynamic.go:374:5: `reconcilation` is a misspelling of `reconciliation` (misspell)
	// reconcilation in this controller
	   ^
make: *** [Makefile:88: golangci-lint-check] Error 1 

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 64.2% -5.1

@gabemontero changed the title from "[WIP] employ some channel best practices to avoid hanging goroutines" to "employ some channel best practices to avoid hanging goroutines" on Feb 22, 2024
@tekton-robot removed the do-not-merge/work-in-progress label on Feb 22, 2024
@gabemontero (Contributor Author)

@sayan-biswas @avinal @ramessesii2 @enarha I think it would be good to get this PR in first, to help provide some immediate relief for our deployment; then I can rebase #715 as needed and, after our stress test for that one is done, follow up with you all to get it ready for merge.

PTAL

@gabemontero (Contributor Author)


bump @sayan-biswas @avinal: per my comment in #715 (comment), let's merge this before #715 to get some immediate relief for the memory leak, as well as some baselines in production environments, before we entertain #715.

…mory leaks from unclosed buffered channels

fix panic in e2es and add dump of controller logs on errors

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
@gabemontero (Contributor Author)

I've squashed the commits (there were several temp/debug commits in there)

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 64.2% -5.1

@avinal removed the do-not-merge/hold label on Feb 27, 2024
@avinal (Member) left a comment

Seems to work fine in my tests.

@tekton-robot added the approved label on Feb 27, 2024
@gabemontero (Contributor Author)

/hold cancel

@sayan-biswas (Contributor)

/approve

@tekton-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: avinal, sayan-biswas

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [avinal,sayan-biswas]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sayan-biswas (Contributor)

/lgtm

@tekton-robot added the lgtm label on Feb 27, 2024
@tekton-robot merged commit 21d96ba into tektoncd:main on Feb 27, 2024
5 of 6 checks passed
@gabemontero deleted the update-log-chan-goroutine branch on February 27, 2024 at 14:16
gabemontero added a commit to gabemontero/results that referenced this pull request Mar 8, 2024
…amLogs

We ran with the change from tektoncd#712 in our production system, and while we confirmed that the Results logging memory leak was addressed, after about 12 to 13 hours our reconciler threads systematically became deadlocked and our watcher quit processing events.

We have not yet been able to get a goroutine dump with stack traces when this problem occurs, so we are unclear whether the tektoncd#712 fixes directly caused the deadlock or whether another issue was encountered. Among other things, our API server container restarted during the watcher deadlock, and the previous pod's logs gave no clear indication as to why.

This change pulls in a couple of potentially helpful bits to either diagnose or work around the deadlock:
1) we have added a timeout to the context used in streamLogs, in case a blockage in the UpdateLog call was somehow causing the problem
2) we also combine cancelling the context on method exit, to again unblock things, with a switch to CloseAndRecv instead of CloseSend to confirm that UpdateLog finished, so that our cancellation of the streamLogs context does not intermittently cancel an UpdateLog call that would otherwise have succeeded (see the sketch after this list)
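A minimal sketch of items 1 and 2, assuming a hypothetical client-streaming stub shaped like the generated UpdateLog client (the interface, function names, and timeout below are illustrative, not the Results API):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// updateLogStream stands in for the generated client-streaming stub. The key
// point is CloseAndRecv, which blocks until the server acknowledges the
// stream, whereas CloseSend returns without waiting for the server.
type updateLogStream interface {
	Send(chunk []byte) error
	CloseAndRecv() (string, error)
}

// streamLogs bounds the work with a timeout, cancels on exit, and only
// returns after the server has confirmed the stream via CloseAndRecv.
func streamLogs(parent context.Context, open func(context.Context) (updateLogStream, error), chunks [][]byte) error {
	ctx, cancel := context.WithTimeout(parent, 30*time.Second)
	defer cancel() // unblocks anything still tied to this context once we return

	stream, err := open(ctx) // the real code would open the UpdateLog stream on ctx
	if err != nil {
		return err
	}
	for _, c := range chunks {
		if err := ctx.Err(); err != nil {
			return err // the timeout or a cancellation already fired; stop sending
		}
		if err := stream.Send(c); err != nil {
			return err
		}
	}
	// Because CloseAndRecv confirms UpdateLog completed server-side, the
	// deferred cancel cannot kill an RPC that would otherwise have succeeded.
	summary, err := stream.CloseAndRecv()
	if err != nil {
		return err
	}
	fmt.Println("server summary:", summary)
	return nil
}

// fakeStream is a stand-in implementation so the sketch runs on its own.
type fakeStream struct{ bytes int }

func (f *fakeStream) Send(chunk []byte) error { f.bytes += len(chunk); return nil }
func (f *fakeStream) CloseAndRecv() (string, error) {
	return fmt.Sprintf("stored %d bytes", f.bytes), nil
}

func main() {
	open := func(context.Context) (updateLogStream, error) { return &fakeStream{}, nil }
	chunks := [][]byte{[]byte("step-build: ok\n"), []byte("step-push: ok\n")}
	fmt.Println("streamLogs error:", streamLogs(context.Background(), open, chunks))
}
```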

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
gabemontero added a commit to gabemontero/results that referenced this pull request Mar 15, 2024
…next with timeout

switch from CloseSend to CloseAndRecv in streamLogs
add analysis, and a possible goroutine dump, to the context with timeout in the dynamic reconciler

We ran with the change from tektoncd#712 in our production system, and while we confirmed that the Results logging memory leak was addressed, after about 12 to 13 hours our reconciler threads systematically became deadlocked and our watcher quit processing events.

We have not yet been able to get a goroutine dump with stack traces when this problem occurs, so we are unclear whether the tektoncd#712 fixes directly caused the deadlock or whether another issue was encountered. Among other things, our API server container restarted during the watcher deadlock, and the previous pod's logs gave no clear indication as to why.

This change pulls in a couple of potentially helpful bits to either diagnose or work around the deadlock:
1) we have added a timeout to the context used in the dynamic Reconcile method, in case a blockage in any RPC call using a context was somehow causing the problem
2) we also combine cancelling the context on method exit, to again unblock things, with a switch to CloseAndRecv instead of CloseSend to confirm that UpdateLog finished, so that our cancellation of the streamLogs context does not intermittently cancel an UpdateLog call that would otherwise have succeeded
3) we analyze how a context is released, and if it is from a timeout rather than a cancel, we initiate a goroutine dump with stack traces (see the sketch after this list)
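A minimal sketch of item 3, assuming the watcher simply wants a full goroutine dump when a context ends via its deadline rather than an explicit cancel (the trigger, timeout, and output destination are placeholders):

```go
package main

import (
	"context"
	"errors"
	"os"
	"runtime/pprof"
	"time"
)

// dumpGoroutinesOnTimeout waits for ctx to end; if it ended because the
// deadline passed (rather than an ordinary cancel), it writes a dump of
// every goroutine with full stack traces so the deadlock can be analyzed.
func dumpGoroutinesOnTimeout(ctx context.Context) {
	<-ctx.Done()
	if errors.Is(ctx.Err(), context.DeadlineExceeded) {
		// debug=2 prints each goroutine with its complete stack trace
		_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	dumpGoroutinesOnTimeout(ctx) // blocks until the timeout fires, then dumps
}
```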

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
gabemontero added a commit to gabemontero/results that referenced this pull request Mar 29, 2024
…next with timeout

switch from CloseSend to CloseAndRecv in streamLogs
add analysis, and a possible goroutine dump, to the context with timeout in the dynamic reconciler

We ran with the change from tektoncd#712 in our production system, and while we confirmed that the Results logging memory leak was addressed, after about 12 to 13 hours our reconciler threads systematically became deadlocked and our watcher quit processing events.

We have not yet been able to get a goroutine dump with stack traces when this problem occurs, so we are unclear whether the tektoncd#712 fixes directly caused the deadlock or whether another issue was encountered. Among other things, our API server container restarted during the watcher deadlock, and the previous pod's logs gave no clear indication as to why.

This change pulls in a couple of potentially helpful bits to either diagnose or work around the deadlock:
1) we have added a timeout to the context used in the dynamic Reconcile method, in case a blockage in any RPC call using a context was somehow causing the problem
2) we also combine cancelling the context on method exit, to again unblock things, with a switch to CloseAndRecv instead of CloseSend to confirm that UpdateLog finished, so that our cancellation of the streamLogs context does not intermittently cancel an UpdateLog call that would otherwise have succeeded
3) we analyze how a context is released, and if it is from a timeout rather than a cancel, we initiate a goroutine dump with stack traces
4) using a context with a timeout that is canceled on exit from the Reconcile method requires no longer running 'sendLogs' on a separate goroutine; otherwise we re-introduced intermittent cancelling of 'UpdateLog' processing before it could complete
5) we now log the deadlines for UpdateLog on the API server side (see the sketch after this list)
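A minimal sketch of item 5, assuming the API server only needs to read the deadline that gRPC propagates from the watcher's context into the UpdateLog handler's context (the function name and logging setup are placeholders, not the real server code):

```go
package main

import (
	"context"
	"log"
	"time"
)

// logUpdateLogDeadline would sit at the top of the UpdateLog handler; gRPC
// propagates the client's deadline into the server-side context, so the
// server can report how much time the watcher has given it.
func logUpdateLogDeadline(ctx context.Context) {
	if deadline, ok := ctx.Deadline(); ok {
		log.Printf("UpdateLog deadline in %s (at %s)",
			time.Until(deadline).Round(time.Millisecond), deadline.Format(time.RFC3339))
	} else {
		log.Print("UpdateLog called with no deadline")
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	logUpdateLogDeadline(ctx)
}
```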

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
gabemontero added a commit to gabemontero/results that referenced this pull request Apr 5, 2024
…next with timeout

switch from CloseSend to CloseAndRecv in streamLogs
add analysis, and a possible goroutine dump, to the context with timeout in the dynamic reconciler

We ran with the change from tektoncd#712 in our production system, and while we confirmed that the Results logging memory leak was addressed, after about 12 to 13 hours our reconciler threads systematically became deadlocked and our watcher quit processing events.

We have not yet been able to get a goroutine dump with stack traces when this problem occurs, so we are unclear whether the tektoncd#712 fixes directly caused the deadlock or whether another issue was encountered. Among other things, our API server container restarted during the watcher deadlock, and the previous pod's logs gave no clear indication as to why.

This change pulls in a couple of potentially helpful bits to either diagnose or work around the deadlock:
1) we have added a configurable timeout to the context used in the dynamic Reconcile method, in case a blockage in any RPC call using a context was somehow causing the problem
2) we also combine cancelling the context on method exit, to again unblock things, with a switch to CloseAndRecv instead of CloseSend to confirm that UpdateLog finished, so that our cancellation of the streamLogs context does not intermittently cancel an UpdateLog call that would otherwise have succeeded
3) we analyze how a context is released, and if it is from a timeout rather than a cancel, we initiate a goroutine dump with stack traces
4) using a context with a timeout that is canceled on exit from the Reconcile method requires no longer running 'sendLogs' on a separate goroutine; otherwise we re-introduced intermittent cancelling of 'UpdateLog' processing before it could complete
5) we now log the deadlines for UpdateLog on the API server side
6) we are back to pulling errors off of the tkn client error channel (see the sketch after this list)
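A minimal, generic sketch of the pattern behind item 6; the channel shapes are illustrative and not the tkn client's actual API, but the point is that someone must keep draining the error channel so its producer can never block:

```go
package main

import (
	"errors"
	"fmt"
)

// consume reads both channels until each is closed; pulling from the error
// channel as well means the producer can never block on an unread error.
func consume(logs <-chan string, errs <-chan error) error {
	for logs != nil || errs != nil {
		select {
		case line, ok := <-logs:
			if !ok {
				logs = nil // producer closed the log channel
				continue
			}
			fmt.Println("log:", line)
		case err, ok := <-errs:
			if !ok {
				errs = nil // producer closed the error channel
				continue
			}
			return err // surface the error instead of silently dropping it
		}
	}
	return nil
}

func main() {
	logs := make(chan string, 2)
	errs := make(chan error, 1)
	logs <- "step-build: done"
	errs <- errors.New("step-push: container failed")
	close(logs)
	close(errs)
	fmt.Println("consume returned:", consume(logs, errs))
}
```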
tekton-robot pushed a commit that referenced this pull request Apr 8, 2024
…next with timeout

switch from CloseSend to CloseAndRecv in streamLogs
add analysis, and a possible goroutine dump, to the context with timeout in the dynamic reconciler

We ran with the change from #712 in our production system, and while we confirmed that the Results logging memory leak was addressed, after about 12 to 13 hours our reconciler threads systematically became deadlocked and our watcher quit processing events.

We have not yet been able to get a goroutine dump with stack traces when this problem occurs, so we are unclear whether the #712 fixes directly caused the deadlock or whether another issue was encountered. Among other things, our API server container restarted during the watcher deadlock, and the previous pod's logs gave no clear indication as to why.

This change pulls in a couple of potentially helpful bits to either diagnose or work around the deadlock:
1) we have added a configurable timeout to the context used in the dynamic Reconcile method, in case a blockage in any RPC call using a context was somehow causing the problem
2) we also combine cancelling the context on method exit, to again unblock things, with a switch to CloseAndRecv instead of CloseSend to confirm that UpdateLog finished, so that our cancellation of the streamLogs context does not intermittently cancel an UpdateLog call that would otherwise have succeeded
3) we analyze how a context is released, and if it is from a timeout rather than a cancel, we initiate a goroutine dump with stack traces
4) using a context with a timeout that is canceled on exit from the Reconcile method requires no longer running 'sendLogs' on a separate goroutine; otherwise we re-introduced intermittent cancelling of 'UpdateLog' processing before it could complete
5) we now log the deadlines for UpdateLog on the API server side
6) we are back to pulling errors off of the tkn client error channel
Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lgtm: Indicates that a PR is ready to be merged.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Memory consumption of the watcher container keeps increasing over time
4 participants