Hypothesis testing for vcf2zarr #249

tomwhite · 2024-06-10T13:50:17Z

It would be good to use the Hypothesis strategy for generating VCF files that's in sgkit against vcf2zarr, to check for corner cases in conversion.

I wonder if we should move the Hypothesis VCF code to this repo, or release as a separate package (it may be of general interest)?

jeromekelleher · 2024-06-10T14:40:45Z

I'd be very happy to move it into this repo, but maybe it is actually something of more general use? How much trouble would packaging it separately be?

tomwhite · 2024-06-11T11:31:54Z

Here's a branch that uses hypothesis-vcf to generate VCFs: main...tomwhite:bio2zarr:hypothesis-vcf-tests

It's been passing for thousands of generated examples, which gives me confidence that vcf2zarr is handling lots of edge cases. But I just ran it again and it found a failing VCF, which needs looking into. Perhaps we should run it as a separate GitHub Actions workflow, or maybe even manually for the moment.

I've also had to modify the VCF generating code (currently in sgkit), so that's probably not quite ready to release separately yet either.

tomwhite · 2024-06-11T11:35:57Z

> I'd be very happy to move it into this repo, but maybe it is actually something of more general use? How much trouble would packaging it separately be?

I think it would be generally useful, and it could be listed on https://hypothesis.readthedocs.io/en/latest/strategies.html. It would need minimal packaging and just a README for documentation, I think.

tomwhite · 2024-06-14T11:08:13Z

I've moved the hypothesis-vcf code into its own repository at https://github.com/tomwhite/hypothesis-vcf.

If that looks OK, I'd like to move it under https://github.com/sgkit-dev.

jeromekelleher · 2024-06-20T14:03:06Z

LGTM - I think it would be a great addition to sgkit-dev

tomwhite · 2024-06-27T11:05:14Z

The hypothesis-vcf code is now in https://github.com/sgkit-dev/hypothesis-vcf.

Thanks for fixing #251 @jeromekelleher. I've rebased and rerun the code in my branch at https://github.com/tomwhite/hypothesis-vcf and it hasn't found any more problems.

What do you think the next step is? Have a CI job that runs just the hypothesis tests once a day?

jeromekelleher · 2024-06-27T12:45:20Z

> What do you think the next step is? Have a CI job that runs just the hypothesis tests once a day?

I'm not sure this would do anything different from just adding a hypothesis job as part of normal CI. If we tune it to run for < 30 seconds and it runs with a different seed each time, it shouldn't get in the way and should give us good coverage. We're not expecting it to break, so it shouldn't lead to noise for contributors.

tomwhite · 2024-06-27T13:44:19Z

I just had a quick look at the timings, and each call to vcf2zarr.convert is taking just over 1 second, even for these tiny generated files with just a handful of variants and samples. So for the default Hypothesis setting of 100 examples it takes two or three minutes to run the test. We could lower the number of examples it generates, but do you think there's scope for reducing the conversion time?
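For reference, a minimal sketch of how the example count could be capped with Hypothesis settings. The function `convert_stub` and the integer strategies are hypothetical stand-ins for vcf2zarr.convert and the hypothesis-vcf strategy, not the actual test in the branch; only `given`, `settings` and `strategies` are real Hypothesis APIs.

```python
from hypothesis import given, settings, strategies as st

calls = []


def convert_stub(n_variants, n_samples):
    # Hypothetical stand-in for vcf2zarr.convert: it only records its inputs.
    calls.append((n_variants, n_samples))
    return n_variants * n_samples


# max_examples caps how many inputs Hypothesis generates per run.
# deadline=None turns off the per-example time limit (200 ms by default),
# which a conversion taking about a second would otherwise exceed.
# database=None avoids writing an example database to disk.
@settings(max_examples=25, deadline=None, database=None)
@given(n_variants=st.integers(0, 5), n_samples=st.integers(0, 5))
def test_convert_stub(n_variants, n_samples):
    assert convert_stub(n_variants, n_samples) == n_variants * n_samples
```

Total run time scales roughly with max_examples times the cost of one conversion, so the cap and the per-call cost can be traded off against each other.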

jeromekelleher · 2024-06-27T14:53:19Z

Is it much better with worker_processes=0?

tomwhite · 2024-06-27T15:09:00Z

> Is it much better with worker_processes=0?

Yes! It takes around 30 seconds with the default number of examples. So we could probably just use that.

tomwhite · 2024-06-27T15:12:21Z

Should that be the default if you're not using the CLI?

jeromekelleher · 2024-06-27T15:14:55Z

The issue is that it's using a home-grown synchronous executor to do it (see class SynchronousExecutor(cf.Executor) at line 82 of bio2zarr/bio2zarr/core.py in d192054), which seems to work perfectly well, but is doing stuff the Python docs are explicit about saying "you should only use this for testing". I'm sure it's probably fine, I just wanted to be conservative for now.
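To illustrate the idea, here is a minimal sketch of what a synchronous executor of this kind might look like. It is not the bio2zarr implementation: `InlineExecutor` and `make_executor` are hypothetical names, and the convention that 0 selects the in-process path is assumed from the discussion above. It sets results directly on a `Future`, which is the part the Python docs reserve for Executor implementations and unit tests.

```python
import concurrent.futures as cf


class InlineExecutor(cf.Executor):
    """Hypothetical sketch: run each task immediately in the calling thread."""

    def submit(self, fn, /, *args, **kwargs):
        future = cf.Future()
        # Mark the future as running so it can no longer be cancelled.
        future.set_running_or_notify_cancel()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            # Store the error so it is re-raised when result() is called.
            future.set_exception(exc)
        return future


def make_executor(worker_processes):
    # Assumed convention: 0 means "do the work in this process".
    if worker_processes == 0:
        return InlineExecutor()
    return cf.ProcessPoolExecutor(max_workers=worker_processes)
```

Because nothing is pickled or sent to a subprocess, the per-call startup cost disappears, which is consistent with the speed-up reported for tiny inputs.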

tomwhite · 2024-06-27T15:23:28Z

Makes sense.

tomwhite mentioned this issue Jun 11, 2024

ValueError: could not broadcast input array #251

Closed

tomwhite mentioned this issue Jul 1, 2024

Hypothesis tests for VCF #264

Merged

jeromekelleher closed this as completed in #264 Jul 1, 2024
