Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcftools consensus mask not applied when input vcf file contains no variants #1592

Closed
charlesfoster opened this issue Oct 12, 2021 · 5 comments

Comments

@charlesfoster
Copy link

Hi,
As part of a workflow I use bcftools consensus to generate consensus genomes, with low coverage sites masked with Ns. Command as follows:

bcftools consensus -p $PREFIX -f $REFERENCE --mark-del '-' -m $MASK -H I -i 'INFO/DP >= 10 & GT!="mis"' $VCF > consensus.fa

This command works well in most cases: variants are applied, and low-coverage sites are masked with 'N'. However, there are some cases where (for whatever reason) the reads files for sample(s) are completely empty, then the mapped BAM files are empty and no variants are called as expected (VCF file contains only headers and no variants). In this situation, my coverage mask (bed-format file) indicates that the whole consensus genome (29903 nt) should be masked:

NC_045512.2	0	29903	0

Instead, in these cases no variants are applied (obviously), but the resulting consensus genome is returned as identical to the reference genome since the full-genome mask is not applied. I would instead hope that the resulting consensus genome is 100% Ns.

Is it possible to have the mask be respected even when no variants are in the VCF file? If I add even a single fake variant with a depth of 1 to the VCF file, the resulting consensus genome is 100% Ns (as hoped).

Thanks!

@pd3
Copy link
Member

pd3 commented Oct 20, 2021

What version of bcftools are you using? I just tried with the latest version and with 1.12 and the masking works fine even for empty VCFs.

@charlesfoster
Copy link
Author

Thanks for checking on it. Version details:

bcftools 1.13
Using htslib 1.13

I installed bcftools via conda. I just tried the command again as a sanity test, and I got the same result. I use the same command as in the OP, and I get a sequence that is 100% identical to the SARS-CoV-2 reference genome rather than 100% Ns. The terminal outputs the following:

Warning: Sequence "NC_045512.2" not in test.vcf.gz
Applied 0 variants

I've attached the empty vcf file (as *.txt so github will accept it) in case there is something that lofreq puts in there that upsets bcftools consensus.

test.vcf.txt

@pd3
Copy link
Member

pd3 commented Oct 21, 2021

Can you provide also the fasta reference and the bed file please?

@charlesfoster
Copy link
Author

charlesfoster commented Oct 22, 2021

Files, as attached below, reproduce the problem for me (without the .txt suffix). Thanks.

mask.bed.txt
NC_045512.fasta.txt

@pd3 pd3 closed this as completed in 49aa0b0 Oct 22, 2021
@pd3
Copy link
Member

pd3 commented Oct 22, 2021

This is now fixed. Thank you for the test case and for reporting the problem!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants