fix flakes in log following related unit tests #112

Conversation

gabemontero
Member

Changes

Fixes #103

/kind cleanup

Hello @otaviof @HeavyWombat @SaschaSchwarze0

I finally got some shipwright cycles this week, and one of my long-standing to-dos was to spend quality time on the log following unit test flakes @otaviof reported.

Allow me to explain the process that got me to the point of opening this alternative PR.

While I initially took a brief pass at reviewing @otaviof's #104 a month ago, I decided to re-engage here by first making sure I understood first hand the crux of the original situation as I remember @otaviof describing it in our team meetings. If I were to attempt to summarize, it was the inherent fragility of the k8s fake clients, especially around watches (of course @otaviof, please do chime in if my brief summary misses key points).

So, running @otaviof's reproducers in #103 as well as running the unit tests in the debugger, the k8s fake client watch fragility is what was most noticeable to me. If I were to attempt to summarize, it did not seem thread safe, to the point where it was unclear whether the events the unit test reported as missing were the events that were actually missing.

But even more importantly, the "has synced" concept we have with the k8s watch-based informers used in controllers struck me as something that would be useful here (I describe this in more detail in #103 (comment)). So, building on that to see if it proved valid (which it did), after about a full day of coding, I've got a different take on addressing this flake with this PR.
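
For context, here is a minimal illustration of that informer "has synced" gate, using client-go's shared informer factory against a fake clientset. This is not code from this PR, just a sketch of the concept being borrowed:

package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func main() {
	client := fake.NewSimpleClientset()
	factory := informers.NewSharedInformerFactory(client, time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	factory.Start(ctx.Done())

	// WaitForCacheSync is the "has synced" gate: consumers only proceed once
	// the informer has completed its initial list/watch.
	if !cache.WaitForCacheSync(ctx.Done(), podInformer.HasSynced) {
		fmt.Println("informer never synced")
		return
	}
	fmt.Println("informer synced; consumers can now rely on watch events")
}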

I see #104 needs a rebase as it is. My initial thought is that we iterate on this PR and merge it, so we address the flake. Then we can reset and see if we want some of the refactoring from #104 that may not be immediately related to the k8s fake client flakiness, but that we like, either by reworking #104 and getting it ready again, or, if it makes more sense, by redoing #104 in a new PR. That said, if folks think of elements of #104 that can easily be applied here, sure, let's discuss.

A more detailed breakdown

  • no changes to the existing flags ... the "user experience" is the same
  • follower/pod_watcher Start methods have been split into Connect and WaitForCompletion ... this somewhat speaks to the "has synced" concept with k8s informers
  • the run command has a FollowerReady method to help with multi-threaded use by consumers of the follower (in theory, other commands that use the follower could follow this pattern if need be) ... this is more directly the "has synced" equivalent from k8s informers; see the sketch after this list
  • we have better consistency across the different commands, in that they create or initialize their Followers in their Complete methods, and they start and wait for them in their Run methods
  • removed the 3-second sleep in the follower event processing
  • removed the initialization of the follower in the run unit test
  • added lower-level tuning (not exposed as flags at this time) of some of the timing intervals to help speed up the unit tests
  • removed some of the multi-threaded elements in the pod_watcher unit tests; they intermittently stressed the flakier elements of the k8s fake clients, especially the watch pieces; I think in the end, running with race detection, we have sufficient coverage there anyway
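
To make the Connect / WaitForCompletion split and the readiness gate described above a bit more concrete, here is a rough sketch of the pattern. It is illustrative only; the method bodies are stubbed out and names like Ready are mine, not the PR's exact code (the PR exposes readiness via the run command's FollowerReady method):

// sketch only; not the PR's exact implementation
package follower

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type Follower struct {
	ready chan struct{} // closed once the pod watch is established
}

func NewFollower() *Follower {
	return &Follower{ready: make(chan struct{})}
}

// Connect establishes the pod watch and signals readiness; it does not block
// waiting for the pod to finish.
func (f *Follower) Connect(listOpts metav1.ListOptions) error {
	// ... set up the watch against the (fake or real) client here ...
	close(f.ready) // consumers, e.g. unit tests, unblock once the watch exists
	return nil
}

// WaitForCompletion runs the event loop until the pod completes or errors out.
func (f *Follower) WaitForCompletion() (*corev1.Pod, error) {
	// ... consume watch events here ...
	return nil, nil
}

// Start preserves the original single-call behavior by composing the two halves.
func (f *Follower) Start(listOpts metav1.ListOptions) (*corev1.Pod, error) {
	if err := f.Connect(listOpts); err != nil {
		return nil, err
	}
	return f.WaitForCompletion()
}

// Ready lets multi-threaded consumers wait until Connect has run, the moral
// equivalent of an informer's HasSynced.
func (f *Follower) Ready() <-chan struct{} {
	return f.ready
}

A test can then block on that readiness signal before injecting pod events into the fake client, instead of sleeping and hoping the watch is up.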

Some output from some of @otaviof 's flake reproducers:

# Attempt 97...
go test  -race -failfast -timeout=45s  ./pkg/shp/cmd/build -run="^TestStartBuildRunFollowLog"
ok  	github.com/shipwright-io/cli/pkg/shp/cmd/build	0.103s
# Attempt 98...
go test  -race -failfast -timeout=45s  ./pkg/shp/cmd/build -run="^TestStartBuildRunFollowLog"
ok  	github.com/shipwright-io/cli/pkg/shp/cmd/build	0.110s
# Attempt 99...
go test  -race -failfast -timeout=45s  ./pkg/shp/cmd/build -run="^TestStartBuildRunFollowLog"
ok  	github.com/shipwright-io/cli/pkg/shp/cmd/build	0.103s
# Attempt 100...
go test  -race -failfast -timeout=45s  ./pkg/shp/cmd/build -run="^TestStartBuildRunFollowLog"
ok  	github.com/shipwright-io/cli/pkg/shp/cmd/build	0.104s
gmontero ~/go/src/github.com/shipwright-io/cli-1  (k8s-unit-test-flake-impacts-podwatcher)$

Submitter Checklist

  • [x] Includes tests if functionality changed/was added
  • [n/a] Includes docs if changes are user-facing
  • [x] Set a kind label on this PR
  • [x] Release notes block has been filled in, or marked NONE

See the contributor guide for details on coding conventions, GitHub and Prow interactions, and the code review process.

Release Notes

NONE

@openshift-ci openshift-ci bot added release-note-none kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Apr 20, 2022
@gabemontero gabemontero force-pushed the k8s-unit-test-flake-impacts-podwatcher branch from 57d1f02 to 18d9f9e Compare April 20, 2022 22:06
@gabemontero
Member Author

clean tests

saw one minor correction needed in the code ... I had forgotten to use the new fail poll interval in follow.go

@gabemontero
Member Author

fyi @otaviof @HeavyWombat @SaschaSchwarze0 - another simplification occurred to me last night

I'm making an update this morning and will post again if it is in fact successful and I've updated this PR

thanks

@gabemontero gabemontero force-pushed the k8s-unit-test-flake-impacts-podwatcher branch 4 times, most recently from 3faebb8 to 773555a Compare April 21, 2022 13:55
@gabemontero
Member Author

fyi @otaviof @HeavyWombat @SaschaSchwarze0 - another simplification occurred to me last night

I'm making an update this morning and will post again if it is in fact successful and I've updated this PR

thanks

ok @otaviof @HeavyWombat @SaschaSchwarze0 my simplifications are in (along with some additional comments in the code explaining some details wrt multi-threaded testing, or the lack thereof)

PTAL when you all get the chance

thanks

@gabemontero
Member Author

ok we all discussed this one in the community meeting

@otaviof is good with moving forward with this reduced scope version of addressing the test flakes. He'll review, and when he and I iterate and land with consensus on these changes, we'll merge this. @HeavyWombat and @SaschaSchwarze0 signed off on this plan as well.

going ahead and

/approve

@openshift-ci
Contributor

openshift-ci bot commented Apr 25, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gabemontero

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 25, 2022
@otaviof
Member

otaviof commented Apr 28, 2022

On the very first run, it failed with a test timeout, as per:

$ for i in `seq 1 100` ; do
        echo "# Attempt ${i}..."
        make test-unit \
                GO_FLAGS='' \
                GO_TEST_FLAGS='-race -failfast -timeout=45s' \
                CMD='' \
                PKG=./pkg/shp/cmd/build \
                ARGS='-run="^TestStartBuildRunFollowLog"' || break
done
# Attempt 1...
go test  -race -failfast -timeout=45s  ./pkg/shp/cmd/build -run="^TestStartBuildRunFollowLog"
[container] fake logs
panic: test timed out after 45s

goroutine 84 [running]:
testing.(*M).startAlarm.func1()
	/opt/homebrew/Cellar/go/1.18.1/libexec/src/testing/testing.go:2029 +0xb8
created by time.goFunc
	/opt/homebrew/Cellar/go/1.18.1/libexec/src/time/sleep.go:176 +0x4c

goroutine 1 [chan receive]:
testing.(*T).Run(0xc000582340, {0x106c1e0de, 0x1a}, 0x1074a15e0)
	/opt/homebrew/Cellar/go/1.18.1/libexec/src/testing/testing.go:1487 +0x580
testing.runTests.func1(0x0?)
	/opt/homebrew/Cellar/go/1.18.1/libexec/src/testing/testing.go:1839 +0x98
testing.tRunner(0xc000582340, 0xc0007bfb98)
	/opt/homebrew/Cellar/go/1.18.1/libexec/src/testing/testing.go:1439 +0x190
testing.runTests(0xc0004595e0?, {0x108421bd0, 0x1, 0x1}, {0xc0004f7ce0?, 0x104f80a80?, 0x108451c80?})
	/opt/homebrew/Cellar/go/1.18.1/libexec/src/testing/testing.go:1837 +0x6cc
testing.(*M).Run(0xc0004595e0)
	/opt/homebrew/Cellar/go/1.18.1/libexec/src/testing/testing.go:1719 +0x87c
main.main()
	_testmain.go:47 +0x300

goroutine 21 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x0?)
	/Users/otaviof/go/src/github.com/shipwright-io/cli/vendor/k8s.io/klog/v2/klog.go:1283 +0x78
created by k8s.io/klog/v2.init.0
	/Users/otaviof/go/src/github.com/shipwright-io/cli/vendor/k8s.io/klog/v2/klog.go:420 +0x20c

goroutine 5 [chan receive]:
knative.dev/pkg/metrics.(*metricsWorker).start(0xc000010258)
	/Users/otaviof/go/src/github.com/shipwright-io/cli/vendor/knative.dev/pkg/metrics/metrics_worker.go:99 +0x48
created by knative.dev/pkg/metrics.init.0
	/Users/otaviof/go/src/github.com/shipwright-io/cli/vendor/knative.dev/pkg/metrics/exporter.go:39 +0x138

goroutine 39 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc0001fe000)
	/Users/otaviof/go/src/github.com/shipwright-io/cli/vendor/go.opencensus.io/stats/view/worker.go:276 +0x104
created by go.opencensus.io/stats/view.init.0
	/Users/otaviof/go/src/github.com/shipwright-io/cli/vendor/go.opencensus.io/stats/view/worker.go:34 +0xfc

goroutine 6 [select]:
github.com/shipwright-io/cli/pkg/shp/reactor.(*PodWatcher).Start(0xc000550300, {{{0x0, 0x0}, {0x0, 0x0}}, {0xc000138460, 0x44}, {0x0, 0x0}, 0x0, ...})
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/reactor/pod_watcher.go:133 +0x318
github.com/shipwright-io/cli/pkg/shp/cmd/follower.(*Follower).Start(...)
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/cmd/follower/follow.go:239
github.com/shipwright-io/cli/pkg/shp/cmd/build.(*RunCommand).Run(0xc000066d40, 0x1074c5f10?, 0xc000532810)
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/cmd/build/run.go:111 +0x48c
github.com/shipwright-io/cli/pkg/shp/cmd/build.TestStartBuildRunFollowLog(0xc0005824e0)
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/cmd/build/run_test.go:191 +0x1498
testing.tRunner(0xc0005824e0, 0x1074a15e0)
	/opt/homebrew/Cellar/go/1.18.1/libexec/src/testing/testing.go:1439 +0x190
created by testing.(*T).Run
	/opt/homebrew/Cellar/go/1.18.1/libexec/src/testing/testing.go:1486 +0x564

goroutine 25 [select]:
github.com/shipwright-io/cli/pkg/shp/reactor.(*PodWatcher).Start(0xc000550000, {{{0x0, 0x0}, {0x0, 0x0}}, {0xc00047e050, 0x44}, {0x0, 0x0}, 0x0, ...})
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/reactor/pod_watcher.go:133 +0x318
github.com/shipwright-io/cli/pkg/shp/cmd/follower.(*Follower).Start(...)
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/cmd/follower/follow.go:239
github.com/shipwright-io/cli/pkg/shp/cmd/build.(*RunCommand).Run(0xc0000666c0, 0x106c82150?, 0xc00053c360)
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/cmd/build/run.go:111 +0x48c
github.com/shipwright-io/cli/pkg/shp/cmd/build.TestStartBuildRunFollowLog.func3()
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/cmd/build/run_test.go:197 +0x54
created by github.com/shipwright-io/cli/pkg/shp/cmd/build.TestStartBuildRunFollowLog
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/cmd/build/run_test.go:196 +0x1600

goroutine 10 [select]:
github.com/shipwright-io/cli/pkg/shp/reactor.(*PodWatcher).Start(0xc00052a900, {{{0x0, 0x0}, {0x0, 0x0}}, {0xc000138230, 0x44}, {0x0, 0x0}, 0x0, ...})
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/reactor/pod_watcher.go:133 +0x318
github.com/shipwright-io/cli/pkg/shp/cmd/follower.(*Follower).Start(...)
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/cmd/follower/follow.go:239
github.com/shipwright-io/cli/pkg/shp/cmd/build.(*RunCommand).Run(0xc000112640, 0x106c82150?, 0xc000533230)
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/cmd/build/run.go:111 +0x48c
github.com/shipwright-io/cli/pkg/shp/cmd/build.TestStartBuildRunFollowLog.func3()
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/cmd/build/run_test.go:197 +0x54
created by github.com/shipwright-io/cli/pkg/shp/cmd/build.TestStartBuildRunFollowLog
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/cmd/build/run_test.go:196 +0x1600

goroutine 12 [chan receive]:
github.com/shipwright-io/cli/pkg/shp/tail.(*Tail).Start.func1.1()
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/tail/tail.go:54 +0x5c
created by github.com/shipwright-io/cli/pkg/shp/tail.(*Tail).Start.func1
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/tail/tail.go:53 +0x27c

goroutine 32 [select]:
github.com/shipwright-io/cli/pkg/shp/reactor.(*PodWatcher).Start(0xc000550180, {{{0x0, 0x0}, {0x0, 0x0}}, {0xc000a481e0, 0x44}, {0x0, 0x0}, 0x0, ...})
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/reactor/pod_watcher.go:133 +0x318
github.com/shipwright-io/cli/pkg/shp/cmd/follower.(*Follower).Start(...)
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/cmd/follower/follow.go:239
github.com/shipwright-io/cli/pkg/shp/cmd/build.(*RunCommand).Run(0xc0004bf4c0, 0x106c82150?, 0xc00069d590)
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/cmd/build/run.go:111 +0x48c
github.com/shipwright-io/cli/pkg/shp/cmd/build.TestStartBuildRunFollowLog.func3()
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/cmd/build/run_test.go:197 +0x54
created by github.com/shipwright-io/cli/pkg/shp/cmd/build.TestStartBuildRunFollowLog
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/cmd/build/run_test.go:196 +0x1600

goroutine 82 [chan receive (nil chan)]:
github.com/shipwright-io/cli/pkg/shp/tail.(*Tail).Start.func2()
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/tail/tail.go:65 +0x54
created by github.com/shipwright-io/cli/pkg/shp/tail.(*Tail).Start
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/tail/tail.go:64 +0x1d8
FAIL	github.com/shipwright-io/cli/pkg/shp/cmd/build	45.740s
FAIL
make: *** [test-unit] Error 1

@gabemontero
Member Author

very weird ... I ran that for loop a bunch of times

either I made a minor change that messed things up and forgot to re-run the for loop, or this flake is really intermittent

ok I'll take another pass today and report back

thanks

@gabemontero
Member Author

gabemontero commented Apr 28, 2022

@otaviof - are you sure you recompiled with my branch? The line numbers from your #112 (comment) do not line up with what I see in the code.

As an example:

goroutine 6 [select]:
github.com/shipwright-io/cli/pkg/shp/reactor.(*PodWatcher).Start(0xc000550300, {{{0x0, 0x0}, {0x0, 0x0}}, {0xc000138460, 0x44}, {0x0, 0x0}, 0x0, ...})
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/reactor/pod_watcher.go:133 +0x318
github.com/shipwright-io/cli/pkg/shp/cmd/follower.(*Follower).Start(...)
	/Users/otaviof/go/src/github.com/shipwright-io/cli/pkg/shp/cmd/follower/follow.go:239

When I look at the code, the Start in follow.go starts on line 274.

And line 133 of pod_watcher.go is a comment. The select actually starts on line 137 for me.

The line numbers in your stack traces do line up with what I see in the main branch.

Also, if you look at follow.go's Start now, it calls different methods rather than pod_watcher.go's Start:

func (f *Follower) Start(listOpts metav1.ListOptions) (*corev1.Pod, error) {
	err := f.Connect(listOpts)
	if err != nil {
		return nil, err
	}
	return f.WaitForCompletion()
}

@gabemontero
Member Author

Just ran

for i in `seq 1 100` ; do
        echo "# Attempt ${i}..."
        make test-unit \
                GO_FLAGS='' \
                GO_TEST_FLAGS='-race -failfast -timeout=45s' \
                CMD='' \
                PKG=./pkg/shp/cmd/build \
                ARGS='-run="^TestStartBuildRunFollowLog"' || break
done

and it completes OK for me:

# Attempt 98...
go test  -race -failfast -timeout=45s  ./pkg/shp/cmd/build -run="^TestStartBuildRunFollowLog"
ok  	github.com/shipwright-io/cli/pkg/shp/cmd/build	0.104s
# Attempt 99...
go test  -race -failfast -timeout=45s  ./pkg/shp/cmd/build -run="^TestStartBuildRunFollowLog"
ok  	github.com/shipwright-io/cli/pkg/shp/cmd/build	0.105s
# Attempt 100...
go test  -race -failfast -timeout=45s  ./pkg/shp/cmd/build -run="^TestStartBuildRunFollowLog"
ok  	github.com/shipwright-io/cli/pkg/shp/cmd/build	0.104s

@otaviof
Member

otaviof commented Apr 28, 2022

@otaviof - are you sure you recompiled with my branch? The line numbers from your #112 (comment) do not line up with what I see in the code.
[...]

Sorry 🤦‍♀️ I ended up running the tests in the wrong terminal window. Indeed, the tests are passing consistently, so I'll continue the review where I left off this morning 👍

@gabemontero
Member Author

@otaviof - are you sure you recompiled with my branch? The line numbers from your #112 (comment) do not line up with what I see in the code.
[...]

Sorry 🤦‍♀️ I ended up running the tests in the wrong terminal window. Indeed, the tests are passing consistently, so I'll continue the review where I left off this morning 👍

whew :-) ... great, thanks for confirming @otaviof


@otaviof (Member) left a comment


@gabemontero Just a couple of "nit" comments, all in all I liked the approach to avoid flaky results and reliably consume Watch events 👍 Thanks!

// Start runs the event loop based on a watch instantiated against informed pod. In case of errors
// the loop is interrupted.
func (p *PodWatcher) Start(listOpts metav1.ListOptions) (*corev1.Pod, error) {
func (p *PodWatcher) Connect(listOpts metav1.ListOptions) error {

Would be great to add a doc-comment explaining how Connect helps with the multi-threading, etc.
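
For illustration only, one possible shape for such a doc-comment; this wording is a suggestion, not text from the PR:

// Connect instantiates the watch for the informed pod and returns once that
// watch is established, without consuming events; the event loop itself runs
// in WaitForCompletion. Splitting the two lets other goroutines (unit tests,
// the run command's readiness gate) block until the watcher is up before they
// produce or expect events.
func (p *PodWatcher) Connect(listOpts metav1.ListOptions) error {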

@@ -27,6 +27,7 @@ type RunCommand struct {
buildRunSpec *buildv1alpha1.BuildRunSpec // stores command-line flags
follow bool // flag to tail pod logs
follower *follower.Follower
follwerReady chan bool

nit: typo on the variable name.

eliminate some multi threaded elements in pod watcher unit test to help fragile fake k8s client
decrease some unit test timeouts to speed them up
…its consumers

allow for optimize / eliminate wait/sleep/timeout intervals in follower unit test and follower consumer unit tests
remove follower initialization in unit test
split follower setup where Complete() is for creation and Run() for running follower
add buildrun name setter to facilitate Complete() vs. Run() split
add multi thread coordination via FollowerReady to make sure pod_watcher is up
@gabemontero gabemontero force-pushed the k8s-unit-test-flake-impacts-podwatcher branch from 773555a to 600428d Compare April 29, 2022 14:27
@gabemontero
Member Author

updates from your comments pushed @otaviof

given their nature, I just squashed them into the existing commits

thanks!

@otaviof
Member

otaviof commented Apr 29, 2022

/lgtm

Thanks @gabemontero!

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 29, 2022
@openshift-merge-robot openshift-merge-robot merged commit e427f92 into shipwright-io:main Apr 29, 2022
@gabemontero gabemontero deleted the k8s-unit-test-flake-impacts-podwatcher branch April 29, 2022 18:01