
Experiment: random mutant sampling #2584

Closed
lukas-schaetzle opened this issue Oct 23, 2020 · 6 comments
Labels
☠ stale Marked as stale by the stale bot, will be removed after a certain time.

Comments

@lukas-schaetzle
Contributor

I've read this paper, which suggests that random mutant sampling on top of selective mutation might lead to a very efficient mutation process while still providing a relatively accurate mutation score. That's why I tried to incorporate a random sampling approach into Stryker, too. In the paper they state that complete randomness performs worse than a more sophisticated approach they also applied, but nonetheless it should still lead to good results and is easier to implement. :)
I did not want to make a PR directly because I plan on doing some experiments with my implementation first. But I thought it might be a good idea to already make a small note here in case anyone is interested in the implementation. It's available in the mutant-sampling branch in my fork. I've added 2 parameters for controlling the sampling: sampling <true/false> and samplingRate <percentage>.
Note that it's still a little rough around the edges (e.g. the progress bar is not adjusted).
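
For readers who want the gist without checking out the fork, here is a minimal sketch of what that sampling step could look like, assuming a simple `Mutant` shape and the `sampling`/`samplingRate` options described above (illustrative only, not the fork's actual code):

```ts
// Hypothetical Mutant shape for illustration.
interface Mutant {
  id: string;
  mutatorName: string;
  fileName: string;
}

function sampleMutants(mutants: Mutant[], sampling: boolean, samplingRate: number): Mutant[] {
  // When sampling is disabled, run every mutant as usual.
  if (!sampling) {
    return mutants;
  }
  // Keep each mutant with probability samplingRate (a percentage, e.g. 10).
  return mutants.filter(() => Math.random() * 100 < samplingRate);
}
```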

@spruce-bruce

spruce-bruce commented Oct 28, 2020

I have a large codebase with many tests that can't yet run in parallel (because of their use of a test database).

My entire test suite takes about two minutes to run, and stryker wants to try something like 27000 mutations on my app.

Random sampling is the only way I can think of to get stryker to work on an app like this.

Of course the tests can be made to run in parallel, and further optimized, but I'd like to produce a mutation score NOW that shows me the effectiveness of my tests, and even running small samples would give me that.

Edit:
I've been reading more through the issues and it seems like stryker is focused exclusively on unit tests, and would consider most of my tests integration tests, and therefore incompatible.

Random sampling would MAKE them compatible, though, at least in part.

@nicojs
Member

nicojs commented Nov 4, 2020

I understand the need for this feature in some use cases. I'm in the same spot with one of my day-job projects as well.

However, I don't like the "random" here. I would like this to be reproducible; that way you can test it locally and get the same result as on your CI pipeline.

The way I see it now, I think a sampleRate might work (a number between 0 and 1, with a default of 1). Stryker would then deterministically remove some mutants. Stryker should do a good job of selecting mutants here: not simply remove all mutants of one mutator, but instead keep some mutants from every category.

I do think the mutants should be visible in the report, but with an "ignored" state; that way you at least know which mutants weren't tested, so you don't get a false sense of security.
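
A minimal sketch of that idea, assuming a hypothetical `Mutant` shape and `sampleRate` option (neither is existing Stryker API): group mutants by mutator, keep every Nth one so the result is deterministic, and mark the rest "ignored" so they stay visible in the report.

```ts
// Hypothetical Mutant shape; `status` mirrors the "ignored" state suggested above.
interface Mutant {
  id: string;
  mutatorName: string;
  status?: 'ignored';
}

function sampleDeterministically(mutants: Mutant[], sampleRate: number): Mutant[] {
  // Group mutants by mutator so every category stays represented.
  const byMutator = new Map<string, Mutant[]>();
  for (const mutant of mutants) {
    const group = byMutator.get(mutant.mutatorName) ?? [];
    group.push(mutant);
    byMutator.set(mutant.mutatorName, group);
  }
  const sampled: Mutant[] = [];
  for (const group of byMutator.values()) {
    // Keep every Nth mutant: sampleRate 0.1 keeps every 10th, 1 keeps all.
    const step = Math.max(1, Math.round(1 / sampleRate));
    for (let i = 0; i < group.length; i++) {
      if (i % step === 0) {
        sampled.push(group[i]);
      } else {
        group[i].status = 'ignored'; // still shown in the report, just not tested
      }
    }
  }
  return sampled;
}
```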

I've been reading more through the issues and it seems like stryker is focused exclusively on unit tests, and would consider most of my tests integration tests, and therefore incompatible.

Indeed, we're focused on unit testing, but Stryker works for integration tests as well. You pointed out the challenges quite well.

Do you think having some way for each worker process to use a different database would help you? I've been thinking about ways to support that. For example, we could add a STRYKER_MUTATOR_WORKER_ID env variable (name pending) that could allow your test code to select a different database. This could allow you to run the integration tests in parallel.
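
A sketch of how test setup code might use such a variable; STRYKER_MUTATOR_WORKER_ID is the proposed (name-pending) variable from this comment, not an existing Stryker feature, and the database naming scheme is made up for illustration:

```ts
// Each concurrent worker connects to its own copy of the test database,
// so integration tests in different workers no longer conflict.
const workerId = process.env.STRYKER_MUTATOR_WORKER_ID ?? '0';
const databaseName = `myapp_test_${workerId}`; // myapp_test_0, myapp_test_1, ...

export const connectionString = `postgres://localhost:5432/${databaseName}`;
```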

@spruce-bruce

However, I don't like the "random" here. I would like this to be reproducible; that way you can test it locally and get the same result as on your CI pipeline.

I haven't read all the way through the paper, so I can't be authoritative here, but I'd expect the randomness to be an important part of producing a valid score. Would some kind of seed help address your concern? If random sampling were used, stryker could output a hash, and if you supplied that hash as an arg you'd get the same set of mutations. I personally don't know much about how seeding for "reproducible randomness" actually works, but I've played enough video games to know it's at least possible :)
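
For the record, that kind of reproducible randomness is straightforward: a seeded pseudo-random generator produces the same sequence for the same seed. A sketch using mulberry32, a well-known public-domain PRNG (the 10% rate and seed value are just examples):

```ts
// mulberry32: tiny seeded PRNG; identical seeds yield identical sequences.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // float in [0, 1)
  };
}

// Re-running with seed 42 samples exactly the same subset of mutants,
// locally and in CI.
const random = mulberry32(42);
const keepMutant = () => random() < 0.1; // 10% sampling rate
```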

Do you think having some way for each worker process to use a different database would help you? I've been thinking about ways to support that. For example, we could add a STRYKER_MUTATOR_WORKER_ID env variable (name pending) that could allow your test code to select a different database. This could allow you to run the integration tests in parallel.

I think that's a great idea! I was reading about how ava works with databases, and I was going to experiment with setting up a blank database for each test at run time. I expect this would increase overall test time for serial runs, of course, but it would enable significant parallelism. The question is how soon the db server becomes the bottleneck.

Your suggestion (one database per process) is a strong middle ground between "one database" and "one database per test", though!

@Lakitna
Contributor

Lakitna commented Nov 13, 2020

This is a super interesting approach! The conclusion of the paper is that you can, fairly safely, drop the sample rate all the way down to 5% and still get a >99% accurate overall mutation score! With that, I would gladly drop the sample rate way down in CI; I'd use 10% rather than 5% to account for real-world ugliness.

In the paper they state that complete randomness performs worse than a more sophisticated approach they also applied, but nonetheless it should still lead to good results and is easier to implement. :)

They've indeed used some fancy methods for selecting the mutations. It looks like it doesn't matter much which method you choose, though; the numbers in the paper are really close for every method. In the paper, random selection with 5% sampling is still 99.44% accurate, and the highest score is 99.52% for a 5% sampling rate.

I would start with a seeded random sampling of all mutations. Seeded to make it predictable, as Nico said. Random sampling of all mutants because it's the easiest to implement.

If the implementation is somewhat stable, I'm willing to generate some measurements using it. :)

@Lakitna
Contributor

Lakitna commented Nov 16, 2020

So I've been thinking about this some more during the weekend. I think the ideal situation in CI for me would be something like this:

  1. Global mutation score with a sampling rate of <= 10%. Record score somewhere. No score threshold.
  2. Incremental mutation score for changes introduced in the current PR (see Support incremental analysis #322) with a sampling rate of 100%. Record score somewhere. Score threshold of >= 50%.

This will provide us with meaningful metrics over time, and it prevents us from neglecting to write proper tests for new stuff. Or, in other words, it gives us Stryker's main CI advantages.

At the same time, we're not punished for old code. We're not ignoring old code either, which keeps test quality visible. This focus on testing new/changed code also allows us to start using Stryker with existing code bases. And, last but not least, this will reduce CI runtime significantly, especially for large codebases. For large codebases, you can probably get away with lower sampling rates too, reducing execution times even further.

I think it would be the best balance between using Stryker to write great tests and CI duration.

It's easy to think of this stuff, though implementing it will be an entirely different beast. 🤭
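
To make the two stages above concrete, here is a hedged sketch of what they might look like as Stryker config objects. Neither a `sampleRate` option nor an incremental mode existed in Stryker at the time of this thread, so those names are hypothetical; `mutate` and `thresholds` follow Stryker's real config shape.

```ts
// Stage 1: full-codebase run, sampled down to 10%, never failing the build.
const globalRun = {
  sampleRate: 0.1, // hypothetical option from this discussion
  thresholds: { high: 80, low: 60, break: null }, // record the score, no threshold
};

// Stage 2: run only mutants in code changed by the PR, at full sampling.
const incrementalRun = {
  sampleRate: 1,
  mutate: ['src/changed/**/*.ts'], // in practice derived from the PR diff
  thresholds: { high: 80, low: 60, break: 50 }, // fail below a 50% score
};

export { globalRun, incrementalRun };
```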

@stale

stale bot commented Nov 16, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the ☠ stale label Nov 16, 2021
stale bot closed this as completed Dec 16, 2021