Skip to content

Conversation

@jeromekelleher
Copy link
Member

@jeromekelleher jeromekelleher commented May 25, 2022

After discussing things with @szhan and the bottlenecks he's experiencing, this is a proposal we came up with for how we can provide a flexible interface for masking out sites and samples in the VCF output. Usage:

ts = msprime.sim_ancestry(5, random_seed=1, sequence_length=100)
ts = msprime.sim_mutations(ts, rate=0.1, random_seed=1234)
site_mask = np.zeros(ts.num_sites, dtype=int)
site_mask[::4] = 1
sample_mask = np.zeros(ts.num_samples, dtype=int)
sample_mask[0] = 1
sample_mask[1] = 1

def sample_mask(variant):
    sample_mask = np.zeros(ts.num_samples, dtype=int)
    sample_mask[variant.site.id % ts.num_samples] = 1
    return sample_mask

ts.write_vcf(sys.stdout, site_mask=site_mask, sample_mask=sample_mask)

Here, in this contrived example, we're masking out every 4th site, and have a sample mask that depends on the site. This is done by having sample_mask optionally be a callable, which takes the variant as input and returns the relevant mask.

Output:

##fileformat=VCFv4.2
##source=tskit 0.4.1
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=100>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  tsk_0   tsk_1   tsk_2   tsk_3   tsk_4
1       0       0       A       C       .       PASS    .       GT      .|.     .|.     .|.     .|.     .|.
1       1       1       A       T,C     .       PASS    .       GT      0|.     0|0     0|0     0|1     0|0
1       2       2       T       G       .       PASS    .       GT      1|0     .|1     1|1     1|0     1|1
1       4       3       G       A,T     .       PASS    .       GT      2|0     0|.     0|1     1|0     2|1
1       7       4       T       C       .       PASS    .       GT      .|.     .|.     .|.     .|.     .|.
1       11      5       A       T       .       PASS    .       GT      0|1     1|0     0|.     0|0     0|0
1       13      6       A       T       .       PASS    .       GT      1|0     1|1     1|1     .|0     1|1
1       15      7       A       C       .       PASS    .       GT      0|1     0|0     0|0     0|.     0|0
1       16      8       G       A       .       PASS    .       GT      .|.     .|.     .|.     .|.     .|.
1       17      9       A       T       .       PASS    .       GT      1|0     1|1     1|1     1|0     1|.
1       26      10      T       G,A     .       PASS    .       GT      .|0     2|1     1|2     2|0     1|2
1       27      11      A       C       .       PASS    .       GT      0|.     0|0     0|0     0|0     1|0
1       28      12      G       C       .       PASS    .       GT      .|.     .|.     .|.     .|.     .|.
1       37      13      C       T       .       PASS    .       GT      0|0     0|.     0|0     0|1     0|0
1       39      14      T       G       .       PASS    .       GT      0|0     1|0     .|0     0|0     0|0
1       41      15      G       A       .       PASS    .       GT      1|0     1|1     1|.     1|0     1|1
1       45      16      G       C       .       PASS    .       GT      .|.     .|.     .|.     .|.     .|.
1       46      17      G       A       .       PASS    .       GT      0|0     1|0     0|0     0|.     0|0
1       47      18      G       T       .       PASS    .       GT      0|0     0|0     1|0     0|0     .|0
1       50      19      A       T       .       PASS    .       GT      0|0     0|0     0|1     0|0     0|.
1       53      20      T       G       .       PASS    .       GT      .|.     .|.     .|.     .|.     .|.
1       54      21      T       C       .       PASS    .       GT      0|.     0|0     0|1     1|0     0|1
1       56      22      T       A       .       PASS    .       GT      1|0     .|1     1|1     1|0     1|1
1       58      23      G       C       .       PASS    .       GT      0|1     0|.     0|0     0|1     0|0
1       61      24      C       T,A     .       PASS    .       GT      .|.     .|.     .|.     .|.     .|.
1       62      25      C       G       .       PASS    .       GT      0|0     0|0     1|.     0|0     0|0
1       64      26      A       G       .       PASS    .       GT      1|0     0|1     0|0     .|0     1|0
1       68      27      G       A       .       PASS    .       GT      1|0     0|1     0|0     0|.     1|0
1       69      28      G       A       .       PASS    .       GT      .|.     .|.     .|.     .|.     .|.
1       71      29      C       G       .       PASS    .       GT      0|0     0|0     0|0     0|1     0|.
1       73      30      G       A       .       PASS    .       GT      .|1     0|0     0|0     0|1     0|0
1       77      31      T       C       .       PASS    .       GT      0|.     0|0     0|0     0|0     1|0
1       78      32      C       G,A,T   .       PASS    .       GT      .|.     .|.     .|.     .|.     .|.
1       79      33      A       T       .       PASS    .       GT      0|0     1|.     0|0     0|0     0|0
1       80      34      A       C       .       PASS    .       GT      1|0     0|1     .|0     0|0     0|0
1       88      35      C       A       .       PASS    .       GT      0|0     0|0     0|.     0|0     0|1
1       91      36      G       A       .       PASS    .       GT      .|.     .|.     .|.     .|.     .|.
1       92      37      C       T       .       PASS    .       GT      1|0     0|1     0|0     0|.     1|0
1       93      38      G       T       .       PASS    .       GT      0|1     0|0     0|0     0|1     .|0
1       94      39      G       C       .       PASS    .       GT      0|1     0|0     0|0     0|1     0|.
1       95      40      G       A       .       PASS    .       GT      .|.     .|.     .|.     .|.     .|.
1       96      41      G       A       .       PASS    .       GT      0|.     0|0     0|1     1|0     0|1
1       97      42      C       T       .       PASS    .       GT      1|1     .|1     0|0     0|1     1|0
1       99      43      A       C,G,T   .       PASS    .       GT      3|1     0|.     2|0     0|1     3|0

We haven't tested this for perf at all yet, so it could be a significant regression and need a little refactoring to make sure the common case is still fast.

Thoughts?

@jeromekelleher
Copy link
Member Author

Oh yes, this should also give us support for missing data in the VCF output, as a nice side-effect.

@jeromekelleher
Copy link
Member Author

For performance evaluation we would:

  1. Get a large tree sequence (say 100k samples, a few thousand sites - something that takes 10s of seconds to output the VCF for), and time vcf output on main.
  2. Run this code with no masks, and time.
  3. Run this code with a constant sample mask (say masking out even samples or something) and time
  4. Run this code with a site-dependent sample mask (like the one above) and time.

@codecov
Copy link

codecov bot commented May 25, 2022

Codecov Report

Merging #2300 (eac7a88) into main (c0c32a5) will decrease coverage by 2.31%.
The diff coverage is 88.46%.

❗ Current head eac7a88 differs from pull request most recent head 2653c90. Consider uploading reports for the commit 2653c90 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2300      +/-   ##
==========================================
- Coverage   93.28%   90.97%   -2.32%     
==========================================
  Files          28       28              
  Lines       26455    27594    +1139     
  Branches     1202     1463     +261     
==========================================
+ Hits        24679    25104     +425     
- Misses       1744     2428     +684     
- Partials       32       62      +30     
Flag Coverage Δ
c-tests 92.31% <ø> (+<0.01%) ⬆️
lwt-tests 89.05% <ø> (ø)
python-c-tests 71.53% <3.84%> (+0.04%) ⬆️
python-tests 98.90% <100.00%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/tskit/trees.py 74.10% <25.00%> (-24.07%) ⬇️
python/tskit/vcf.py 100.00% <100.00%> (ø)
python/tskit/util.py 94.35% <0.00%> (-5.65%) ⬇️
c/tskit/trees.c 95.07% <0.00%> (-0.01%) ⬇️
c/tskit/core.c 97.90% <0.00%> (ø)
python/tskit/__init__.py 100.00% <0.00%> (ø)
c/tskit/tables.c 90.33% <0.00%> (+<0.01%) ⬆️
python/tskit/drawing.py 99.42% <0.00%> (+0.25%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c0c32a5...2653c90. Read the comment docs.

@benjeffery benjeffery added this to the Python 0.5.0 milestone May 25, 2022
@szhan szhan self-assigned this May 25, 2022
@szhan szhan added enhancement New feature or request Python API Issue is about the Python API labels May 25, 2022
@petrelharp
Copy link
Contributor

Sounds good - my only suggestion is that shouldn't masks be Boolean?

@jeromekelleher
Copy link
Member Author

Sounds good - my only suggestion is that shouldn't masks be Boolean?

Yeah, agreed.

@szhan
Copy link
Member

szhan commented May 27, 2022

I did some preliminary performance evaluation.

import timeit

my_setup = '''
import sys

import tskit
import msprime

import numpy as np

ts = msprime.sim_ancestry(1e6, sequence_length=1e5, random_seed=1234)
ts = msprime.sim_mutations(ts, rate=1e-4, random_seed=5678)

site_mask = np.zeros(ts.num_sites, dtype=int)
site_mask[::4] = 1

sample_mask = np.zeros(ts.num_samples, dtype=int)
sample_mask[::4] = 1

def get_site_specific_sample_mask(variant):
    sample_mask = np.zeros(ts.num_samples, dtype=int)
    sample_mask[variant.site.id % ts.num_samples] = 1
    return sample_mask
'''

print("Test 1: No mask")
print(
    timeit.timeit(
        stmt='''ts.write_vcf(open('/dev/null', 'w'), site_mask=None, sample_mask=None)''',
        setup=my_setup,
        number=10
    )
)

print("Test 2: Sample mask")
print(
    timeit.timeit(
        stmt='''ts.write_vcf(open('/dev/null', 'w'), site_mask=None, sample_mask=sample_mask)''',
        setup=my_setup,
        number=10
    )
)

print("Test 3: Site mask")
print(
    timeit.timeit(
        stmt='''ts.write_vcf(open('/dev/null', 'w'), site_mask=site_mask, sample_mask=None)''',
        setup=my_setup,
        number=10
    )
)

print("Test 4: Site-dependent sample mask")
print(
    timeit.timeit(
        stmt='''ts.write_vcf(open('/dev/null', 'w'), site_mask=None, sample_mask=get_site_specific_sample_mask)''',
        setup=my_setup,
        number=10
    )
)

Test 1 (no mask): 656.5268213189993
Test 2 (sample mask): 852.0801203749998 (30% slower)
Test 3 (site mask): 716.2780362409994 (9% slower)
Test 4 (sample and site mask): 810.0072484789998 (23% slower)

@jeromekelleher
Copy link
Member Author

Very interesting, thanks @szhan. Can you tell me what the timing for the current code in main is, for reference?

@szhan
Copy link
Member

szhan commented May 31, 2022

I'm seeing 490.993569683 when doing this timeit run.

print("Test 0: No mask (original)")
print(
    timeit.timeit(
        stmt='''ts.write_vcf(open('/dev/null', 'w'))''',
        setup=my_setup,
        number=10
    )
)

@szhan
Copy link
Member

szhan commented May 31, 2022

Recap:
Test 0 (no mask - original): 490.993569683
Test 1 (no mask): 656.5268213189993 (33% slower)
Test 2 (sample mask): 852.0801203749998 (74% slower)
Test 3 (site mask): 716.2780362409994 (46% slower)
Test 4 (sample and site mask): 810.0072484789998 (65% slower)

@szhan
Copy link
Member

szhan commented May 31, 2022

Is the dip in speed acceptable?

@jeromekelleher
Copy link
Member Author

Is the dip in speed acceptable?

Not for the nominal "no mask" case, no. I can update easily enough to make a "fast-path" for no masking, which should hopefully have exactly the same performance as the old code. Will do when I get a chance.

@jeromekelleher
Copy link
Member Author

Some updates here and a few tests. This should make sure the nominal case if fast - @szhan can you rerun your checks please?

@jeromekelleher
Copy link
Member Author

TODO:

  1. Error checking on mask inputs
  2. Test that pysam is OK with our missing data encoding
  3. Documentation

@szhan
Copy link
Member

szhan commented Jun 1, 2022

The perf results with one run instead of ten runs:

Test 0 (no mask - original): 49.657810238000025
Test 1 (no mask): 53.824011451000004 (8% slower)
Test 2 (sample mask): 63.393167141999996 (28% slower)
Test 3 (site mask): 53.221332816 (7% slower)
Test 4 (sample and site mask): 58.05120807399999 (17% slower)

@benjeffery
Copy link
Member

@szhan Is this ready for me to review now?

@szhan
Copy link
Member

szhan commented Jun 1, 2022

@szhan Is this ready for me to review now?

There are still some TODO items Jerome listed.

@jeromekelleher
Copy link
Member Author

I was just documenting this and about to mark it as ready for review when I realised that the site_mask as implemented is pretty stupid - we should just leave out the whole site, not mark all the genotypes as missing.

If someone wants the current behaviour they can use the sample_mask to do it.

@jeromekelleher jeromekelleher changed the title Initial proposal for VCF masking VCF masking Jun 9, 2022
@jeromekelleher jeromekelleher marked this pull request as ready for review June 9, 2022 10:02
@jeromekelleher
Copy link
Member Author

This should be pretty much ready to go now, can you take a look please @benjeffery?

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM Well tested, just a couple of typos.

@jeromekelleher
Copy link
Member Author

Awesome, thanks. Marked for merging.

@mergify mergify bot merged commit 2a7341c into tskit-dev:main Jun 9, 2022
@jeromekelleher jeromekelleher deleted the vcf-mask branch June 9, 2022 11:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request Python API Issue is about the Python API

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants