
add timeout to dynamic Reconcile; switch from CloseSend to CloseRecv in streamLogs; add goroutine dump on timeouts/deadlocks #725

Merged
merged 1 commit into tektoncd:main from possible-hang-diag
Apr 8, 2024

Conversation

gabemontero
Contributor

@gabemontero gabemontero commented Mar 8, 2024

Changes

So we ran with the change from #712 in our production system, and while we confirmed the results logging memory leak was addressed, after about 12 to 13 hours our reconciler threads systematically became deadlocked and our watcher quit processing events.

We have not yet been able to get a goroutine dump with stack traces when this problem occurs, so we are unclear
whether the #712 fixes directly caused the deadlock or whether another issue was encountered. Among other things, our api server container restarted during the watcher deadlock, and the previous pod logs gave no clear indication as to why.

This change pulls in a few potentially helpful bits to either diagnose or work around the deadlock (a rough sketch of the overall pattern follows this list):
1) we have added a timeout to the context used in the dynamic Reconcile method, in case a blockage in any RPC call using a context was somehow causing the problem
2) we also cancel the context on method exit, to again unblock things, and switch to
CloseAndRecv instead of CloseSend to confirm the UpdateLog finished, so that our cancelling of the streamLog context does not
intermittently cancel an UpdateLog call that would have otherwise succeeded.
3) we analyze how a context is released, and if it is from a timeout and not a cancel, we initiate a goroutine dump with stack traces
4) using a context with timeout that is canceled on exit from the Reconcile method requires no longer running 'sendLogs' on a separate goroutine; otherwise we would re-introduce intermittent cancelling of 'UpdateLog' processing before it could complete.
5) we now log the deadlines for UpdateLog on the api server side
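
Roughly, the shape of the change (a hedged sketch, not the literal diff — the Reconciler, results client, and results.Object types are the watcher's, the 5 minute duration was later made configurable, and most of the existing reconcile logic is elided):

func (r *Reconciler) Reconcile(ctx context.Context, o results.Object) error {
	// bound every gRPC call made during this reconcile so a hung call cannot
	// block the reconciler thread forever, and unblock everything on exit
	ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
	defer cancel()

	// ... existing reconcile logic; sendLog/streamLogs now run synchronously on
	// this thread so the deferred cancel cannot cut an in-flight UpdateLog short ...
	return r.sendLog(ctx, o)
}

func (r *Reconciler) streamLogs(ctx context.Context, o results.Object, logType, logName string) error {
	logsClient, err := r.resultsClient.UpdateLog(ctx)
	if err != nil {
		return err
	}

	// ... copy the log into the stream via logsClient.Send(...) ...

	// CloseAndRecv, unlike CloseSend, waits for the server's final response, so we
	// know UpdateLog actually finished before the context is cancelled on return
	if _, err := logsClient.CloseAndRecv(); err != nil {
		return err
	}
	return nil
}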

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

/kind bug

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you review them:

  • [n/a] Has Docs included if any changes are user facing
  • [n/a] Has Tests included if any functionality added or changed
  • [y] Tested your changes locally (if this is a code change)
  • [y] Follows the commit message standard
  • [y] Meets the Tekton contributor standards (including functionality, content, code)
  • [y] Has a kind label. You can add a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • [y] Release notes block below has been updated with any user-facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • [n/a] Release notes contain the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Address reconciler threads that could deadlock or hang while streaming logs

@tekton-robot tekton-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Mar 8, 2024
@tekton-robot tekton-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 8, 2024
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 63.9% -0.3

@gabemontero
Contributor Author

OK, 4cc4cab passed the e2e tests.

Next, in a separate commit, I will attempt a more expansive use of a context with timeout around all gRPC calls from the dynamic reconciler.

@tekton-robot tekton-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 8, 2024
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 64.1% -0.1

@gabemontero
Contributor Author

ok @sayan-biswas I have 2 forms of a fix and/or workaround for the results watcher hang we saw in our prod env today

the first commit just covers updating logs to the results api server

the second commit handles all grpc calls to the api server coming out of the dynamic reconciler

after some code review, it centers on the notion that we were blocked on a grpc call to the api server, where the api server may have been in a bad state, given that I noticed an api server restart around the time I recycled the watcher pod

So basically,

  1. by setting a timeout on the context used for all grpc calls out of the dynamic reconciler, we time out and exit after 5 minutes if any of those calls are hung
  2. the second commit, in order to have 1 context cover both the log grpc and non-log grpc calls, has to move the call to streamLogs back onto the reconciler thread, as cancelling/cleaning up the parent context could inadvertently abort the UpdateLog call (see the fragment after this list)
  3. we also employ the CloseAndRecv improvement on the UpdateLog call to again coordinate with cleaning up the context
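
A small fragment illustrating point 2 (schematic, not the literal diff; logType and logName stand in for whatever sendLog derives for the object):

// before: the streaming ran on its own goroutine, so cancelling the shared reconcile
// context on return could abort an UpdateLog that was still in flight
//
//	go func() {
//		_ = r.streamLogs(ctx, o, logType, logName)
//	}()
//
// after: the call blocks the reconciler thread until UpdateLog completes; the shared
// timeout context still bounds how long that can take, and the error surfaces so the
// reconciliation can be retried
if err := r.streamLogs(ctx, o, logType, logName); err != nil {
	return err
}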

}

return nil
}

func (r *Reconciler) streamLogs(ctx context.Context, o results.Object, logType, logName string) error {
logger := logging.FromContext(ctx)
logsClient, err := r.resultsClient.UpdateLog(ctx)
streamCtx, streamCancel := context.WithTimeout(ctx, 5*time.Minute)
Contributor Author

if we go with the broader use of a context timeout within the dynamic reconciler loop, we would get rid of this particular child context with timeout

@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 9, 2024
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 62.4% -1.8

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 61.1% -3.1

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 60.4% -3.8

@gabemontero
Contributor Author

ok @sayan-biswas I now have a third commit that deciphers whether the context or any specific gRPC calls exited because of deadline exceeded, and if so, dumps a goroutine list with stack traces to stdout for analysis.

So, with our hang on Friday, either the tkn client threading ended in a deadlock, or more likely, we had a grpc call to the api server hang (perhaps because the api server got into a bad state, which might have occurred since it restarted on its own).

In either case, we should get a thread dump in the log that we can diagnose.
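
Roughly how such a dump can be produced (a self-contained sketch, not the PR's exact code; the helper name is mine):

package main

import (
	"context"
	"os"
	"runtime/pprof"
	"time"
)

// dumpGoroutinesOnTimeout is a hypothetical helper: if the context ended because its
// deadline was exceeded (rather than being cancelled on a normal exit), write every
// goroutine's stack trace to stdout so the hang can be analyzed from the pod logs.
func dumpGoroutinesOnTimeout(ctx context.Context) {
	if ctx.Err() == context.DeadlineExceeded {
		// debug=2 prints one entry per goroutine with a full stack trace
		_ = pprof.Lookup("goroutine").WriteTo(os.Stdout, 2)
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	<-ctx.Done() // stand-in for a reconcile that ran past its deadline
	dumpGoroutinesOnTimeout(ctx)
}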

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 58.9% -5.3

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 58.9% -5.3

@gabemontero gabemontero changed the title WIP: add timeout to streamLogs; switch from CloseSend to CloseRecv in streamLogs WIP: add timeout to dynamic Reconcile; switch from CloseSend to CloseRecv in streamLogs; add goroutine dump on timeouts/deadlocks Mar 15, 2024
@gabemontero gabemontero changed the title WIP: add timeout to dynamic Reconcile; switch from CloseSend to CloseRecv in streamLogs; add goroutine dump on timeouts/deadlocks add timeout to dynamic Reconcile; switch from CloseSend to CloseRecv in streamLogs; add goroutine dump on timeouts/deadlocks Mar 15, 2024
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 15, 2024
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 58.5% -5.7

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/api/server/v1alpha2/logs.go 68.3% 68.4% 0.1
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 58.5% -5.7

@gabemontero
Contributor Author

e2e tests are green with the timeout fully propagated and no go func for sendLogs

will update the configuration of the timeout now

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
cmd/watcher/main.go Do not exist 0.0%
pkg/api/server/v1alpha2/logs.go 68.3% 68.4% 0.1
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 56.5% -7.7

@gabemontero
Contributor Author

@sayan-biswas @khrm @avinal @enarha this PR is ready for review, iterate, merge

// context with timeout does not work with the partial end to end flow that exists with unit tests;
// this field will always be set for real
if r.cfg != nil && r.cfg.UpdateLogTimeout != nil {
dynamicContext, dynamicCancel = context.WithTimeout(ctx, *r.cfg.UpdateLogTimeout)
Contributor

Suggested change
dynamicContext, dynamicCancel = context.WithTimeout(ctx, *r.cfg.UpdateLogTimeout)
ctx, dynamicCancel = context.WithTimeout(ctx, *r.cfg.UpdateLogTimeout)

Contributor Author

yeah I'm not a fan of using the generic variable name for this ... I was intentional in this name change

can you elaborate @khrm on why you want the name to be ctx ?

Contributor

Someone modifying the function later on can easily use ctx by mistake, since ctx is still available.
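
A tiny illustration of that point (someGRPCCall is a placeholder, not an API in this repo; the deferred cancel mirrors the PR's cancel-on-exit behavior):

// with a separately named child context, a later edit can grab the original ctx and
// silently bypass the timeout
dynamicContext, dynamicCancel := context.WithTimeout(ctx, *r.cfg.UpdateLogTimeout)
defer dynamicCancel()
someGRPCCall(dynamicContext) // intended
someGRPCCall(ctx)            // easy mistake: this call carries no deadline

// shadowing ctx leaves only the bounded context in scope, so everything below the
// assignment picks up the deadline
ctx, cancel := context.WithTimeout(ctx, *r.cfg.UpdateLogTimeout)
defer cancel()
someGRPCCall(ctx)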

Contributor Author

ok you've swayed me ... will adjust

Contributor Author

I've pushed as a separate commit for now @khrm to facilitate your review .... I can squash once we are sync'ed on the change being ready

@@ -333,22 +414,21 @@ func (r *Reconciler) sendLog(ctx context.Context, o results.Object) error {
zap.String("name", o.GetName()),
)

go func() {
Contributor

Why don't we pass the context here and use select and ctx.Done()?

Contributor Author

because the end result is the same; we block the reconcile thread until the operation is complete

no reason for the added complexity

@@ -396,6 +499,13 @@ func (r *Reconciler) streamLogs(ctx context.Context, o results.Object, logType,
Err: inMemWriteBufferStderr,
}, logChan, errChan)

// pull the first error that occurred and return on that; reminder - per https://golang.org/ref/spec#Channel_types
// channels act as FIFO queues
chanErr, ok := <-errChan
Contributor Author

since @khrm's last review I've added here a simplified version of the error handling that @sayan-biswas used to do prior to PR #712

let's just return the first error seen; from what I see in the tkn code, multiple errors are not collected

with this, @sayan-biswas, plus the prior changes made with #712 and moving the sendLogs call back onto the reconciler thread, we now address one of the shortcomings of the old implementation, namely ignoring errors and preventing retries on reconciliation.
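
A sketch of what that simplified handling can look like (surrounding names follow the diff hunk above; the exact error wrapping is an assumption):

// pull the first error, if any, off the channel; channels act as FIFO queues, so this
// is the earliest failure the tkn reader reported, and returning it (instead of
// swallowing it) lets the reconciler record the failure and retry
chanErr, ok := <-errChan
if ok && chanErr != nil {
	return fmt.Errorf("tkn reader failed streaming the %s log for %s/%s: %w", logType, o.GetNamespace(), logName, chanErr)
}
return nil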

Reminder: with the threadiness arg, users can adjust the number of threads when they enable log support.

I will add a separate PR to add the k8s client tuning shortly

Contributor Author

#744 is the add tuning options PR

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/api/server/v1alpha2/logs.go 68.3% 68.4% 0.1
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 55.7% -8.5

@gabemontero
Contributor Author

@khrm @sayan-biswas this is ready for review again .... once we are good, I can squash as needed .. thanks

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/api/server/v1alpha2/logs.go 68.3% 68.4% 0.1
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 55.3% -8.9

Contributor

@khrm khrm left a comment

/approve

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 4, 2024
@sayan-biswas
Contributor

/approve

@tekton-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: khrm, sayan-biswas

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sayan-biswas
Contributor

sayan-biswas commented Apr 5, 2024

@khrm @sayan-biswas this is ready for review again .... once we are good, I can squash as needed .. thanks

Changes are fine.
Please go ahead and squash.

…next with timeout

switch from CloseSend to CloseRecv in streamLogs
add analysis, possible goroutine dump, to the context with timeout in dynamic reconciler

So we ran with the change from tektoncd#712 in our production system, and while we confirmed the results logging memory leak
was addressed, after about 12 to 13 hours our reconciler threads systematically became deadlocked and our watcher
quit processing events.

We have not yet been able to get a goroutine dump with stack traces when this problem occurs, so we are unclear
whether the tektoncd#712 fixes directly caused the deadlock or whether another issue was encountered.  Among other things,
our api server container restarted during the watcher deadlock, and the previous pod logs gave no clear indication as to why.

This change pulls in a few potentially helpful bits to either diagnose or work around the deadlock:
1) we have added a configurable timeout to the context used in the dynamic Reconcile method, in case a blockage in any RPC call using a context was somehow causing the problem
2) we also cancel the context on method exit, to again unblock things, and switch to
CloseAndRecv instead of CloseSend to confirm the UpdateLog finished, so that our cancelling of the streamLog context does not
intermittently cancel an UpdateLog call that would have otherwise succeeded.
3) we analyze how a context is released, and if it is from a timeout and not a cancel, we initiate a goroutine dump with stack traces
4) using a context with timeout that is canceled on exit from the Reconcile method requires no longer running 'sendLogs' on a separate goroutine; otherwise we would re-introduce intermittent cancelling of 'UpdateLog' processing before it could complete.
5) we now log the deadlines for UpdateLog on the api server side
6) we are back to pulling errors off the tkn client error channel
@gabemontero
Contributor Author

thanks @sayan-biswas @khrm commits are squashed

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/api/server/v1alpha2/logs.go 68.3% 68.4% 0.1
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 55.3% -8.9

@sayan-biswas
Contributor

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 8, 2024
@tekton-robot tekton-robot merged commit aa5bb71 into tektoncd:main Apr 8, 2024
6 checks passed
@gabemontero gabemontero deleted the possible-hang-diag branch April 12, 2024 14:15