-
Notifications
You must be signed in to change notification settings - Fork 78
VCF masking #2300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VCF masking #2300
Conversation
|
Oh yes, this should also give us support for missing data in the VCF output, as a nice side-effect. |
|
For performance evaluation we would:
|
Codecov Report
@@ Coverage Diff @@
## main #2300 +/- ##
==========================================
- Coverage 93.28% 90.97% -2.32%
==========================================
Files 28 28
Lines 26455 27594 +1139
Branches 1202 1463 +261
==========================================
+ Hits 24679 25104 +425
- Misses 1744 2428 +684
- Partials 32 62 +30
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
|
Sounds good - my only suggestion is that shouldn't masks be Boolean? |
Yeah, agreed. |
|
I did some preliminary performance evaluation. Test 1 (no mask): 656.5268213189993 |
|
Very interesting, thanks @szhan. Can you tell me what the timing for the current code in |
|
I'm seeing 490.993569683 when doing this |
|
Recap: |
|
Is the dip in speed acceptable? |
Not for the nominal "no mask" case, no. I can update easily enough to make a "fast-path" for no masking, which should hopefully have exactly the same performance as the old code. Will do when I get a chance. |
|
Some updates here and a few tests. This should make sure the nominal case if fast - @szhan can you rerun your checks please? |
|
TODO:
|
|
The perf results with one run instead of ten runs: |
|
@szhan Is this ready for me to review now? |
There are still some TODO items Jerome listed. |
|
I was just documenting this and about to mark it as ready for review when I realised that the site_mask as implemented is pretty stupid - we should just leave out the whole site, not mark all the genotypes as missing. If someone wants the current behaviour they can use the |
|
This should be pretty much ready to go now, can you take a look please @benjeffery? |
benjeffery
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM Well tested, just a couple of typos.
|
Awesome, thanks. Marked for merging. |
After discussing things with @szhan and the bottlenecks he's experiencing, this is a proposal we came up with for how we can provide a flexible interface for masking out sites and samples in the VCF output. Usage:
Here, in this contrived example, we're masking out every 4th site, and have a sample mask that depends on the site. This is done by having
sample_maskoptionally be a callable, which takes the variant as input and returns the relevant mask.Output:
We haven't tested this for perf at all yet, so it could be a significant regression and need a little refactoring to make sure the common case is still fast.
Thoughts?