Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hypothesis testing for vcf2zarr #249

Closed
tomwhite opened this issue Jun 10, 2024 · 13 comments · Fixed by #264
Closed

Hypothesis testing for vcf2zarr #249

tomwhite opened this issue Jun 10, 2024 · 13 comments · Fixed by #264

Comments

@tomwhite
Copy link
Contributor

It would be good to use the Hypothesis strategy for generating VCF files that's in sgkit against vcf2zarr, to check for corner cases in conversion.

I wonder if we should move the Hypothesis VCF code to this repo, or release as a separate package (it may be of general interest)?

@jeromekelleher
Copy link
Contributor

I'd be very happy to move it into this repo, but maybe it is actually something of more general use? How much trouble would packaging it separately be?

@tomwhite
Copy link
Contributor Author

Here's a branch that uses hypothesis-vcf to generate VCFs: main...tomwhite:bio2zarr:hypothesis-vcf-tests

It's been passing for ~1000s of generated examples, which gives me confidence that vcf2zarr is handling lots of edge cases. But I just ran it again and it found a failing VCF which needs looking into. Perhaps we should run it as a separate GitHub Action workflow, or maybe even manually for the moment.

I've also had to modify the VCF generating code (currently in sgkit), so that's probably not quite ready to release separately yet either.

@tomwhite
Copy link
Contributor Author

I'd be very happy to move it into this repo, but maybe it is actually something of more general use? How much trouble would packaging it separately be?

I think it would be useful generally, and could be listed on https://hypothesis.readthedocs.io/en/latest/strategies.html. It would need minimal packaging and just a README for documentation I think.

@tomwhite
Copy link
Contributor Author

I've moved the hypothesis-vcf code into its own repository at https://github.com/tomwhite/hypothesis-vcf.

if that looks OK, I'd like to move it under https://github.com/sgkit-dev.

@jeromekelleher
Copy link
Contributor

LGTM - I think it would be a great addition to sgkit-dev

@tomwhite
Copy link
Contributor Author

The hypothesis-vcf code is now in https://github.com/sgkit-dev/hypothesis-vcf.

Thanks for fixing #251 @jeromekelleher. I've rebased and rerun the code in my branch at https://github.com/tomwhite/hypothesis-vcf and it hasn't found any more problems.

What do you think the next step is? Have a CI job that runs just the hypothesis tests once a day?

@jeromekelleher
Copy link
Contributor

What do you think the next step is? Have a CI job that runs just the hypothesis tests once a day?

I'm not sure this would do anything different to just adding a hypothesis job as part of normal CI. If we tune it to run for < 30 seconds and it runs with a different seed each time, it shouldn't get in the way and give us good coverage. We're not expecting it to break, so shouldn't lead to noise for contributors.

@tomwhite
Copy link
Contributor Author

I just had a quick look at the timings, and each call to vcf2zarr.convert is taking just over 1 second - even for these tiny generated files with just a handful of variants and samples. So for the default Hypothesis setting of 200 examples it takes two or three minutes to run the test. We could lower the number of examples it generated, but do you think there's scope for reducing the conversion time?

@jeromekelleher
Copy link
Contributor

Is it much better with worker_processes=0?

@tomwhite
Copy link
Contributor Author

Is it much better with worker_processes=0?

Yes! It takes around 30 seconds with the default number of examples. So we could probably just use that.

@tomwhite
Copy link
Contributor Author

Should that be the default if you're not using the CLI?

@jeromekelleher
Copy link
Contributor

The issue is that it's using a home-grown syncronous exector to do it (

class SynchronousExecutor(cf.Executor):
) which seems to work perfectly well, but is doing stuff the Python docs are explicit about saying "you should only use this for testing". I'm sure it's probably fine, I just wanted to be conservative for now.

@tomwhite
Copy link
Contributor Author

Makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants