
Experiment: random mutant sampling #2584

Closed
lukas-schaetzle opened this issue Oct 23, 2020 · 6 comments
Labels
☠ stale Marked as stale by the stale bot, will be removed after a certain time.

Comments

@lukas-schaetzle
Contributor

I've read this paper, which suggests that random mutant sampling on top of selective mutation might lead to a very efficient mutation process while still providing a relatively accurate mutation score. That's why I tried to incorporate a random sampling approach into Stryker, too. In the paper they state that complete randomness performs worse than a more sophisticated approach they also applied, but nonetheless it should still lead to good results and is easier to implement. :)
I did not want to make a PR directly because I plan on doing some experiments with my implementation first. But I thought it might be a good idea to already make a small note here in case anyone is interested in the implementation. It's available in the mutant-sampling branch in my fork. I've added 2 parameters for controlling the sampling: sampling <true/false> and samplingRate <percentage>.
Note that it's still a little rough around the edges (e.g. the progress bar is not adjusted).
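
For readers who want the gist without checking out the fork, here is a minimal sketch of what that sampling step could look like, assuming a simple `Mutant` shape and the `sampling`/`samplingRate` options described above (illustrative only, not the fork's actual code):

```ts
// Hypothetical Mutant shape for illustration.
interface Mutant {
  id: string;
  mutatorName: string;
  fileName: string;
}

function sampleMutants(mutants: Mutant[], sampling: boolean, samplingRate: number): Mutant[] {
  // When sampling is disabled, run every mutant as usual.
  if (!sampling) {
    return mutants;
  }
  // Keep each mutant with probability samplingRate (a percentage, e.g. 10).
  return mutants.filter(() => Math.random() * 100 < samplingRate);
}
```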

@spruce-bruce

spruce-bruce commented Oct 28, 2020

I have a large codebase with many tests that can't yet run in parallel (because of their use of a test database).

My entire test suite takes about two minutes to run, and stryker wants to try something like 27000 mutations on my app.

Random sampling is the only way I can think of to get stryker to work on an app like this.

Of course the tests can be made to run in parallel, and further optimized, but I'd like to produce a mutation score NOW that shows me the effectiveness of my tests, and even running small samples would give me that.

Edit:
I've been reading more through the issues and it seems like stryker is focused exclusively on unit tests, and would consider most of my tests integration tests, and therefore incompatible.

Random sampling would MAKE them compatible, though, at least in part.

@nicojs
Member

nicojs commented Nov 4, 2020

I understand the need for this feature in some use cases. I'm in the same spot with one of my day-job projects as well.

However, I don't like the "random" here. I would like this to be reproducible; that way you can test it locally and get the same result as on your CI pipeline.

The way I see it now, I think a sampleRate might work (a number between 0 and 1, with a default of 1). Stryker would then deterministically remove some mutants. Stryker should do a good job of selecting mutants here: not simply remove all mutants of one mutator, but instead keep some mutants from every category.

I do think the mutants should be visible in the report, but with an "ignored" state; that way you at least know which mutants weren't tested, so you don't get a false sense of security.
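
A minimal sketch of that idea, assuming a hypothetical `Mutant` shape and `sampleRate` option (neither is existing Stryker API): group mutants by mutator, keep every Nth one so the result is deterministic, and mark the rest "ignored" so they stay visible in the report.

```ts
// Hypothetical Mutant shape; `status` mirrors the "ignored" state suggested above.
interface Mutant {
  id: string;
  mutatorName: string;
  status?: 'ignored';
}

function sampleDeterministically(mutants: Mutant[], sampleRate: number): Mutant[] {
  // Group mutants by mutator so every category stays represented.
  const byMutator = new Map<string, Mutant[]>();
  for (const mutant of mutants) {
    const group = byMutator.get(mutant.mutatorName) ?? [];
    group.push(mutant);
    byMutator.set(mutant.mutatorName, group);
  }
  const sampled: Mutant[] = [];
  for (const group of byMutator.values()) {
    // Keep every Nth mutant: sampleRate 0.1 keeps every 10th, 1 keeps all.
    const step = Math.max(1, Math.round(1 / sampleRate));
    for (let i = 0; i < group.length; i++) {
      if (i % step === 0) {
        sampled.push(group[i]);
      } else {
        group[i].status = 'ignored'; // still shown in the report, just not tested
      }
    }
  }
  return sampled;
}
```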

I've been reading more through the issues and it seems like stryker is focused exclusively on unit tests, and would consider most of my tests integration tests, and therefore incompatible.

Indeed, we're focused on unit testing, but Stryker works for integration tests as well. You pointed out the challenges quite well.

Do you think having some way for each worker process to use a different database would help you? I've been thinking about ways to support that. For example, we could add a STRYKER_MUTATOR_WORKER_ID env variable (name pending) that could allow your test code to select a different database. This could allow you to run the integration tests in parallel.
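
A sketch of how test setup code might use such a variable; STRYKER_MUTATOR_WORKER_ID is the proposed (name-pending) variable from this comment, not an existing Stryker feature, and the database naming scheme is made up for illustration:

```ts
// Each concurrent worker connects to its own copy of the test database,
// so integration tests in different workers no longer conflict.
const workerId = process.env.STRYKER_MUTATOR_WORKER_ID ?? '0';
const databaseName = `myapp_test_${workerId}`; // myapp_test_0, myapp_test_1, ...

export const connectionString = `postgres://localhost:5432/${databaseName}`;
```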

@spruce-bruce

However, I don't like the "random" here. I would like this to be reproducible; that way you can test it locally and get the same result as on your CI pipeline.

I haven't read all the way through the paper, so I can't be authoritative here, but I'd expect the randomness to be an important part of producing a valid score. Would some kind of seed help address your concern? If random sampling were used, stryker could output a hash, and if you supplied that hash as an arg you'd get the same set of mutations. I personally don't know much about how seeding for "reproducible randomness" actually works, but I've played enough video games to know it's at least possible :)
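
For the record, that kind of reproducible randomness is straightforward: a seeded pseudo-random generator produces the same sequence for the same seed. A sketch using mulberry32, a well-known public-domain PRNG (the 10% rate and seed value are just examples):

```ts
// mulberry32: tiny seeded PRNG; identical seeds yield identical sequences.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // float in [0, 1)
  };
}

// Re-running with seed 42 samples exactly the same subset of mutants,
// locally and in CI.
const random = mulberry32(42);
const keepMutant = () => random() < 0.1; // 10% sampling rate
```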

Do you think having some way for each worker process to use a different database would help you? I've been thinking about ways to support that. For example, we could add a STRYKER_MUTATOR_WORKER_ID env variable (name pending) that could allow your test code to select a different database. This could allow you to run the integration tests in parallel.

I think that's a great idea! I was reading about how ava works with databases, and I was going to experiment with setting up a blank database for each test at run time. I expect this would increase overall test time for serial runs, of course, but it would enable significant parallelism. The question is how soon the db server becomes the bottleneck.

Your suggestion (one database per process) is a strong middle ground between "one database" and "one database per test", though!

@Lakitna
Contributor

Lakitna commented Nov 13, 2020

This is a super interesting approach! The conclusion of the paper is that you can, fairly safely, drop the sample rate all the way down to 5% and still get a >99% accurate overall mutation score! With that, I would gladly drop the sample rate way down in CI; I'd use 10% rather than 5% to account for real-world ugliness.

In the paper they state that complete randomness performs worse than a more sophisticated approach they also applied, but nonetheless it should still lead to good results and is easier to implement. :)

They've indeed used some fancy methods for selecting the mutations. It looks like it doesn't matter much which method you choose, though; the numbers in the paper are really close for every method. In the paper, random selection with 5% sampling is still 99.44% accurate, and the highest score is 99.52% for a 5% sampling rate.

I would start with a seeded random sampling of all mutations. Seeded to make it predictable, as Nico said. Random sampling of all mutants because it's the easiest to implement.

If the implementation is somewhat stable, I'm willing to generate some measurements using it. :)

@Lakitna
Contributor

Lakitna commented Nov 16, 2020

So I've been thinking about this some more during the weekend. I think the ideal situation in CI for me would be something like this:

  1. Global mutation score with a sampling rate of <= 10%. Record score somewhere. No score threshold.
  2. Incremental mutation score for changes introduced in the current PR (see Support incremental analysis #322) with a sampling rate of 100%. Record score somewhere. Score threshold of >= 50%.

This will provide us with meaningful metrics over time, and it prevents us from neglecting to write proper tests for new stuff. Or, in other words, it gives us Stryker's main CI advantages.

At the same time, we're not punished for old code. We're not ignoring old code either, which keeps test quality visible. This focus on testing new/changed code also allows us to start using Stryker with existing code bases. And, last but not least, this will reduce CI runtime significantly, especially for large codebases. For large codebases, you can probably get away with lower sampling rates too, reducing execution times even further.

I think it would be the best balance between using Stryker to write great tests and CI duration.

It's easy to think of this stuff, though implementing it will be an entirely different beast. 🤭
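
To make the two stages above concrete, here is a hedged sketch of what they might look like as Stryker config objects. Neither a `sampleRate` option nor an incremental mode existed in Stryker at the time of this thread, so those names are hypothetical; `mutate` and `thresholds` follow Stryker's real config shape.

```ts
// Stage 1: full-codebase run, sampled down to 10%, never failing the build.
const globalRun = {
  sampleRate: 0.1, // hypothetical option from this discussion
  thresholds: { high: 80, low: 60, break: null }, // record the score, no threshold
};

// Stage 2: run only mutants in code changed by the PR, at full sampling.
const incrementalRun = {
  sampleRate: 1,
  mutate: ['src/changed/**/*.ts'], // in practice derived from the PR diff
  thresholds: { high: 80, low: 60, break: 50 }, // fail below a 50% score
};

export { globalRun, incrementalRun };
```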

@stale

stale bot commented Nov 16, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the ☠ stale label Nov 16, 2021
stale bot closed this as completed Dec 16, 2021