
Flaky Unit Test Results #103

Closed
otaviof opened this issue Mar 21, 2022 · 2 comments · Fixed by #112 · May be fixed by #104
Labels
kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt.

Comments

@otaviof
Member

otaviof commented Mar 21, 2022

TestStartBuildRunFollowLog (pkg/shp/cmd/build)

for i in `seq 1 100` ; do 
	echo "# Attempt ${i}..."
	make test-unit \
		GO_FLAGS='' \
		GO_TEST_FLAGS='-race -failfast -timeout=45s' \
		CMD='' \
		PKG=./pkg/shp/cmd/build \
		ARGS='-run="^TestStartBuildRunFollowLog"' || break
done

Fails after a few attempts:

# Attempt 1...
go test  -race -failfast -timeout=45s  ./pkg/shp/cmd/build -run="^TestStartBuildRunFollowLog"
ok      github.com/shipwright-io/cli/pkg/shp/cmd/build  20.195s
# Attempt 2...
go test  -race -failfast -timeout=45s  ./pkg/shp/cmd/build -run="^TestStartBuildRunFollowLog"
ok      github.com/shipwright-io/cli/pkg/shp/cmd/build  21.195s
# Attempt 3...
go test  -race -failfast -timeout=45s  ./pkg/shp/cmd/build -run="^TestStartBuildRunFollowLog"
ok      github.com/shipwright-io/cli/pkg/shp/cmd/build  21.195s
# Attempt 4...
go test  -race -failfast -timeout=45s  ./pkg/shp/cmd/build -run="^TestStartBuildRunFollowLog"
[container] fake logs
panic: test timed out after 45s

Test_PodWatcherEvents (pkg/shp/reactor)

make test-unit \
	GO_TEST_FLAGS='-race -cover -failfast' \
	CMD='' \
	PKG=./pkg/shp/reactor \
	ARGS='-run="^Test_PodWatcherEvents"'

Running the same test a few times in sequence eventually ends with the following:

go test -v -mod=vendor -race -cover -failfast  ./pkg/shp/reactor -run="^Test_PodWatcherEvents"
=== RUN   Test_PodWatcherEvents
=== RUN   Test_PodWatcherEvents/pod-is-added
=== RUN   Test_PodWatcherEvents/pod-is-modified
=== RUN   Test_PodWatcherEvents/pod-is-deleted
=== CONT  Test_PodWatcherEvents
    pod_watcher_test.go:234:
        Timed out after 10.001s.
        Expected
            <chan string | len:0, cap:5>: 0xc000444180
        to receive something.
--- FAIL: Test_PodWatcherEvents (10.00s)
    --- PASS: Test_PodWatcherEvents/pod-is-added (0.00s)
    --- PASS: Test_PodWatcherEvents/pod-is-modified (0.00s)
    --- PASS: Test_PodWatcherEvents/pod-is-deleted (0.00s)
FAIL
coverage: 59.1% of statements
FAIL    github.com/shipwright-io/cli/pkg/shp/reactor    10.038s
FAIL
make: *** [Makefile:60: test-unit] Error 1
@otaviof otaviof changed the title Flaky Unit Tests Flaky Unit Test Results Mar 21, 2022
@adambkaplan
Member

/kind cleanup

@openshift-ci openshift-ci bot added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Mar 23, 2022
@adambkaplan adambkaplan added this to the Backlog milestone Mar 23, 2022
@otaviof otaviof mentioned this issue Mar 26, 2022
@gabemontero
Member

hey @otaviof

I finally got to spend some quality time (like an entire afternoon) on this flake. What I ended up doing was to start from the main branch and reproduce the failures, both with your make test-unit reproducers above and by running TestStartBuildRunFollowLog and Test_PodWatcherEvents from my GoLand debugger, to make sure I understood the problem.

To that end, I saw firsthand the k8s test framework / fake client "fragility" I believe you articulated to us back when we discussed this in office hours (which, if I remember correctly, was some time after you posted #104).

Now, running in the debugger especially, where I was able to set breakpoints when the events came into the pod watcher, I saw some "oddities" which made me pivot a bit. For example, I saw failures reported at

g.Eventually(eventsCh, 10*time.Second).Should(Receive(&onPodDeletedFn))

for a missing delete event, yet I saw a delete event get processed (while perhaps a modify event was not processed).

Sorting this out led me down a path where I now have a set of changes (which still need some cleanup before I push a PR), falling into these categories:

  1. eliminated some of the "multi-threadedness", or concurrency, in Test_PodWatcherEvents, and also got a bit more precise about not just the arrival of events but their order; I have rationalized that the lost concurrency is not something to worry about (this post is already long, so I'll refrain for now), but we can certainly discuss that when the time comes
  2. added a rough approximation of the SharedIndexInformer.HasSynced and k8s.io/client-go/tools/cache.WaitForCacheSync constructs we have with watch-based controllers; my testing, especially with the debugger, pointed me to the fact that the PodWatcher's watch might not have been established before we populated the fake k8s clients, so we would lose events with the fake clients (whereas we do not with the real ones, where an initial list occurs); my prototype (which is where I need the most "cleanup" wrt coding standards) seems to have helped with this
  3. employing this "has synced" notion in TestStartBuildRunFollowLog also seemed to help
  4. lastly, there were some optimizations around reducing timeouts and polling intervals, and the elimination of a placeholder sleep we had in TestStartBuildRunFollowLog, which also tightened things up. Some of this also needs coding "cleanup"

When I pick back up, hopefully tomorrow, I want to cross-reference with your #104 to see if there are any changes I've made that resemble what you did, as well as tackle the coding cleanup.

But overall it is a much smaller change than your #104.

To that end, when I do get a PR up, we can talk about the staging of things. Certainly some form of your #104 refactor could make sense as a long-term goal. That said, it may also make sense to get a reviewed and agreed-upon form of my change in place to resolve the flakes in the interim.

Anyway, I wanted to at least give you some inkling of what I did today, and that I may have a PR coming in this space. We can certainly talk more during the next team office hours, when I could even share some code, even if I don't have it quite ready for a PR.

As a preview, here are some log snippets using your make test-unit reproducers above. In particular, I think you'll notice the improved performance, especially with the log following test case:

# Attempt 97...
go test  -race -failfast -timeout=45s  ./pkg/shp/cmd/build -run="^TestStartBuildRunFollowLog"
ok  	github.com/shipwright-io/cli/pkg/shp/cmd/build	0.103s
# Attempt 98...
go test  -race -failfast -timeout=45s  ./pkg/shp/cmd/build -run="^TestStartBuildRunFollowLog"
ok  	github.com/shipwright-io/cli/pkg/shp/cmd/build	0.110s
# Attempt 99...
go test  -race -failfast -timeout=45s  ./pkg/shp/cmd/build -run="^TestStartBuildRunFollowLog"
ok  	github.com/shipwright-io/cli/pkg/shp/cmd/build	0.103s
# Attempt 100...
go test  -race -failfast -timeout=45s  ./pkg/shp/cmd/build -run="^TestStartBuildRunFollowLog"
ok  	github.com/shipwright-io/cli/pkg/shp/cmd/build	0.104s
gmontero ~/go/src/github.com/shipwright-io/cli-1  (k8s-unit-test-flake-impacts-podwatcher)$

and

gmontero ~/go/src/github.com/shipwright-io/cli-1  (k8s-unit-test-flake-impacts-podwatcher)$ make test-unit \
	GO_TEST_FLAGS='-race -cover -failfast' \
	CMD='' \
	PKG=./pkg/shp/reactor \
	ARGS='-run="^Test_PodWatcherEvents"'
> > > > go test -v -mod=vendor -race -cover -failfast  ./pkg/shp/reactor -run="^Test_PodWatcherEvents"
=== RUN   Test_PodWatcherEvents
--- PASS: Test_PodWatcherEvents (0.00s)
PASS
coverage: 60.3% of statements
ok  	github.com/shipwright-io/cli/pkg/shp/reactor	0.061s	coverage: 60.3% of statements
gmontero ~/go/src/github.com/shipwright-io/cli-1  (k8s-unit-test-flake-impacts-podwatcher)$ 
