Skip to content

kvnemesis: first step towards a fuzzed KVNemesis #148634

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 7, 2025

Conversation

stevendanna
Copy link
Collaborator

You can run this test with the go fuzzer with something like:

go test ./pkg/kv/kvnemesis/ -test.fuzz=FuzzKVNemesisSingleNode
-test.fuzzcachedir=_fuzzcache -v -test.run=^$
-tags crdb_test -timeout=300m -parallel=4

It can also be run under bazel, but I have not yet sorted out all of the flags needed to get a coverage enabled build and to ensure that the failing test cases get written somewhere that can be referenced on subsequent runs.

The idea here is that the fuzzer provides a []byte that then determines the output of all random decisions in KVNemesis. This doesn't account for metamorphic decisions made outside of KVNemesis.

KVNemesis is a rather heavyweight test which seemed to be a problem for running it reliably under go-fuzz; however, go-fuzz's poor diagnostics when the test worker crash has made it hard to determine the exact cause so far.

Epic: none
Release note: None

@stevendanna stevendanna requested a review from a team as a code owner June 21, 2025 21:45
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@stevendanna stevendanna requested a review from srosenberg June 21, 2025 21:50
srosenberg added a commit to srosenberg/cockroach that referenced this pull request Jun 23, 2025
To run natively,

```
go build -buildmode=c-archive -gcflags=all=-d=libfuzzer -tags "crdb_test,gofuzz,gofuzz_libfuzzer,libfuzzer" -trimpath -gcflags syscall=-d=libfuzzer=0 -o kvnemesis_fuzz.a pkg/kv/kvnemesis/main/main.go
clang -o kvnemesis_fuzz kvnemesis_fuzz.a /home/stan_cockroachlabs_com/go/src/github.com/cockroachdb/cockroach/bin/c-deps/archived_cdep_libjemalloc_linux/lib/libjemalloc.a /home/stan_cockroachlabs_com/go/src/github.com/cockroachdb/cockroach/bin/c-deps/archived_cdep_libproj_linux/lib/libproj.a -fsanitize=fuzzer
./kvnemesis_fuzz -max_len=8192 seeds  >libfuzzer.out 2>&1
```

To run in go with libfuzzer instrumentation,

```
bazel run --override_repository=io_bazel_rules_go=/home/stan_cockroachlabs_com/go/src/github.com/cockroachdb/rules_go --@io_bazel_rules_go//go/config:gc_goopts=-d=libfuzzer --crdb_test pkg/kv/kvnemesis:kvnemesis_test -- -test.run notests -test.fuzz FuzzKVNemesisSingleNode  -test.fuzzcachedir /tmp/foobar/ -test.v >gofuzzer.out 2>&1
```

[1] cockroachdb#148634
[2] https://cockroachlabs.slack.com/archives/C023S0V4YEB/p1750542607409039?thread_ts=1750341116.758639&cid=C023S0V4YEB

Epic: none

Release note: None
@stevendanna
Copy link
Collaborator Author

Test failure is unrelated (but important and fixed here #148660)

You can run this test with the go fuzzer with something like:

  go test ./pkg/kv/kvnemesis/ -test.fuzz=FuzzKVNemesisSingleNode \
  -test.fuzzcachedir=_fuzzcache -v -test.run=^$ \
  -tags crdb_test  -timeout=300m -parallel=4

It can also be run under bazel, but I have not yet sorted out all of the
flags needed to get a coverage enabled build and to ensure that the
failing test cases get written somewhere that can be referenced on
subsequent runs.

The idea here is that the fuzzer provides a []byte that then determines
the output of all random decisions in KVNemesis. This doesn't account
for metamorphic decisions made outside of KVNemesis.

KVNemesis is a rather heavyweight test which seemed to be a problem for
running it reliably under go-fuzz; however, go-fuzz's poor diagnostics
when the test worker crash has made it hard to determine the exact cause
so far.

Epic: none
Release note: None
@stevendanna stevendanna requested a review from miraradeva June 30, 2025 18:13
@stevendanna
Copy link
Collaborator Author

Overall, this test might be a bit too big for a fuzz-based approach. But, I think this might still be worth committing just so we have an example of how it might work.

Copy link
Contributor

@miraradeva miraradeva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, just trying to understand how fuzzing works. Naively I would have just let the fuzzer pick a randomness source or just a seed, and pass it to kvnemesis, but then again I don't know how the fuzzer actually navigates a coverage area such as kvnemesis, which itself makes random choices. Feel free to point me to reading material; I just did the go fuzzing tutorial and that didn't do into depth.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @srosenberg)


pkg/kv/kvnemesis/kvnemesis_test.go line 443 at r1 (raw file):

		// reliably. With all default settings the test runner fails without
		// printing any useful info. I _think_ it might be the result of a
		// hard-coded 10s timeout in the go-fuzz test worker.

It seems like that can be changed with -fuzztime?


pkg/kv/kvnemesis/kvnemesis_test.go line 458 at r1 (raw file):

			assertRaftApply:              true,
		})
		f.Add(rndSource.Output())

Another fuzzing noob question: it seems important to actually run kvnemesis with the generated randomness source and pass the resulting output to the fuzzer? Why is that exactly? Isn't the output the same as grabbing a random value from the source a fixed number of times? Is the problem that we don't know what that fixed number is?


pkg/util/randutil/rand.go line 301 at r1 (raw file):

// fuzzer's next step can change a single decision without affecting all
// previous decisions in the test, giving it some ability to direct its
// exploration.

Fuzzing noob question: one step of the fuzzer is one run of the given test (e.g. kvnemesis), which might consist of multiple random decisions, right? If yes, then what does it mean for "the fuzzer's next step can change a single decision without affecting all previous decisions in the test"?

Copy link
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naively I would have just let the fuzzer pick a randomness source or just a seed

The fuzzer uses a different API than stdlib's Random; see LLVMFuzzerTestOneInput in [1]. Most other fuzzers including Go's essentially work the same way. That is, they use an internal PRNG to generate and mutate an input, i.e., a byte array. (You can see an example of a Go's fuzzing mutator in [2]. Go's fuzzer uses reflection to extract out typed inputs from the byte array.) Fuzzers usually implement their own Random, mostly for performance reasons. (Fundamentally, there is no difference.)

Since fuzzer input generation is guided by coverage, it's important that the entire set of inputs is generated by the fuzzer. Essentially we have to adapt Random to read directly from the fuzzer's byte array. We have to be careful when the fuzzer's input is exhausted before we can generate a typed (kvnemesis) input; i.e., it's not generally safe to return a sentinel value [3].

[1] https://llvm.org/docs/LibFuzzer.html
[2] https://github.com/golang/go/blob/fdd7713fe5a7bc054fe08f79a1877f90c7188e53/src/internal/fuzz/mutator.go#L37
[3] https://cockroachlabs.slack.com/archives/C023S0V4YEB/p1750687993416619?thread_ts=1750341116.758639&cid=C023S0V4YEB

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miraradeva and @stevendanna)

Copy link
Collaborator Author

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how the fuzzer actually navigates a coverage area such as kvnemesis, which itself makes random choices.

I don't either, this was based on some assumptions about what might be useful across a couple of different fuzzers that have been discussed, I put an argument for it in the response to one of your other questions but it is a hand-wave for sure. My assumption here is that if we just let the fuzzer choose the seed, then the fuzzer isn't any different than just letting the test run under stress.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miraradeva)


pkg/util/randutil/rand.go line 301 at r1 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

Fuzzing noob question: one step of the fuzzer is one run of the given test (e.g. kvnemesis), which might consist of multiple random decisions, right? If yes, then what does it mean for "the fuzzer's next step can change a single decision without affecting all previous decisions in the test"?

What I was trying to get at is this. Say that a single run of the test makes 5 random decisions driven from 5*8 byte buffer that (for ease of notation) represents the following uint64s:

[ 45, 67, 12, 56, 12]

My thinking here is that a profile guided fuzzer could then, on its next fuzz iteration, mutate this input such that it looked like:

[ 45, 67, 12, 57, 12]
              ^- mutated

The random choices cause by the common prefix [ 45, 67, 12 would be the same and then the test would diverse. Whereas if we just had the fuzzer choose the seed, then the next fuzz iteration would likely represent a completely different series of choices.

Note, I have no idea if this works out in practice as I don't know how the fuzzers we plan on using work.


pkg/kv/kvnemesis/kvnemesis_test.go line 443 at r1 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

It seems like that can be changed with -fuzztime?

The timeout I was referring to here is this rather unfortunately hard-coded timeout:

https://github.com/golang/go/blob/86fca3dcb63157b8e45e565e821e7fb098fcf368/src/internal/fuzz/worker.go#L492-L494

But, I wrote this before I changed the random source to skip the test on short input, so maybe we won't hit this as much anymore.


pkg/kv/kvnemesis/kvnemesis_test.go line 458 at r1 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

Another fuzzing noob question: it seems important to actually run kvnemesis with the generated randomness source and pass the resulting output to the fuzzer? Why is that exactly? Isn't the output the same as grabbing a random value from the source a fixed number of times? Is the problem that we don't know what that fixed number is?

The f.Add function allows you to add initial "input" into the corpus of tests that the fuzzer will use. This initial input can then be used to get initial coverage information and (depending on the fuzzer) serve as the basis for future, mutated, fuzz inputs. So my idea here was that by allowing the test to just run through completely and recording all the decisions, we would at least have some corpus data where the input byte array was a sufficient length to generate all the decisions kvnemesis needs to make. To be honest I'm not sure how well that works out in practice.

Copy link
Contributor

@miraradeva miraradeva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanations! I think I understand: we essentially take the randomness away from kvnemesis and let the fuzzer fully control it.

:lgtm:

Reviewed 3 of 3 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @stevendanna)


pkg/util/randutil/rand.go line 301 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

What I was trying to get at is this. Say that a single run of the test makes 5 random decisions driven from 5*8 byte buffer that (for ease of notation) represents the following uint64s:

[ 45, 67, 12, 56, 12]

My thinking here is that a profile guided fuzzer could then, on its next fuzz iteration, mutate this input such that it looked like:

[ 45, 67, 12, 57, 12]
              ^- mutated

The random choices cause by the common prefix [ 45, 67, 12 would be the same and then the test would diverse. Whereas if we just had the fuzzer choose the seed, then the next fuzz iteration would likely represent a completely different series of choices.

Note, I have no idea if this works out in practice as I don't know how the fuzzers we plan on using work.

Yeah, I get it. We make the random choices explicit to the fuzzer, so it can vary them more granularly. Though maybe it is smart enough to figure out: given random seed X that produces choices [ 45, 67, 12, 56, 12], what random seed would produce choices [ 45, 67, 12, 57, 12].


pkg/kv/kvnemesis/kvnemesis_test.go line 458 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

The f.Add function allows you to add initial "input" into the corpus of tests that the fuzzer will use. This initial input can then be used to get initial coverage information and (depending on the fuzzer) serve as the basis for future, mutated, fuzz inputs. So my idea here was that by allowing the test to just run through completely and recording all the decisions, we would at least have some corpus data where the input byte array was a sufficient length to generate all the decisions kvnemesis needs to make. To be honest I'm not sure how well that works out in practice.

I agree, though it seems like it's really the length of the input array that matters. E.g. if we know exactly how many random choices kvnemesis will make, we can simulate the resulting rndSource.Output() by just drawing that many random values from rndSource. It could speed up the test if it seems like a correct assumption.

@stevendanna
Copy link
Collaborator Author

I agree, though it seems like it's really the length of the input array that matters. E.g. if we know exactly how many random choices kvnemesis will make, we can simulate the resulting rndSource.Output() by just drawing that many random values from rndSource. It could speed up the test if it seems like a correct assumption.

Good idea, I'll take a look.

@stevendanna
Copy link
Collaborator Author

bors r=miraradeva

@craig
Copy link
Contributor

craig bot commented Jul 7, 2025

@craig craig bot merged commit 221c3df into cockroachdb:master Jul 7, 2025
21 of 22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants