Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcftools consensus: --mask/--mask-with overwrites --mask-del #1386

Closed
kpj opened this issue Jan 29, 2021 · 6 comments
Closed

bcftools consensus: --mask/--mask-with overwrites --mask-del #1386

kpj opened this issue Jan 29, 2021 · 6 comments

Comments

@kpj
Copy link

kpj commented Jan 29, 2021

The recently added parameters --mask-del (#1381) and --mask-with (#1382) exhibit unexpected behaviour.

Consider the following command:

bcftools consensus --output temp.majority.fasta --fasta-ref NC_045512.2.fasta --mask temp.lowcoverage.bed --mask-with N --mark-del - temp.bcf

where the regions defined in temp.lowcoverage.bed coincide with a deletion.
As the --mark-del parameter comes after the --mask/--mask-with parameters, I would expect the consensus sequence to have a - in the position of the deletion. However, the position contains an N.

Is this intended behaviour, or a bug?

@pd3
Copy link
Member

pd3 commented Feb 2, 2021

It is not exactly intended, but is a consequence of the implementation: after the sequence is masked with another character (which is the first step in the process) overlapping variants are skipped. The only reason for this behavior is to make the code simpler, some of the functionality (e.g. applying of IUPAC codes, etc) would break otherwise.

I can see that applying indels to the masked sequence informs about the total length, not sure how important that is. This behavior should be at least documented.

@kpj
Copy link
Author

kpj commented Feb 4, 2021

I agree that it is probably a good idea to document this more clearly.

You make a good point that .. --mask foo.bed --mask-with lc --iupac-codes .. and .. --iupac-codes --mask foo.bed --mask-with lc .. should maybe also produce different results (following the idea that the order of masking steps matters).
But this is probably out of scope.

@pd3
Copy link
Member

pd3 commented Feb 11, 2021

The documentation and the usage page has been updated to reflect this.

@pd3 pd3 closed this as completed Feb 11, 2021
@charlesfoster
Copy link

Hello,

Sorry to open an old issue, but I'm having the same problem and I'm trying to wrap my head around it.

Basically, I would like to generate a consensus fasta sequence for our SARS-CoV-2 samples based on a vcf file. I want (1) the reference bases to be replaced with well supported variants, and (2) low-depth regions to be masked with 'N'. It works fine, except for the case of deletions. The deletions can be strongly supported by the reads, and as a consequence those sites in the genome are 'low coverage' (depth < 10). Consider the following:

#CHROM POS ID REF ALT FORMAT/DP FORMAT/AD FORMAT/VAF
NC_045512.2 22028 . GAGTTCA G 3426 3402 0.992995

When I try to mark the deletions with '-', the deletions are applied as expected. The low-coverage sites (e.g., 5' and 3' UTRs) remain in the consensus since I am not masking them.

Command:

bcftools consensus -p $PRE -f $REF --mark-del '-' -i 'FORMAT/VAF > 0.90 & INFO/MQ >= 20 & FORMAT/DP >= 10' ${NAME}.bcftools.vcf.gz > ${NAME}.bcftools_noMask.fa

When I try to mark the deletions with '-' and mask low-coverage sites, the deletions are not applied as expected. The low-coverage sites are masked with N in the consensus.

Command:

bcftools consensus -p $PRE -f $REF --mark-del '-' -m $SITES -i 'FORMAT/VAF > 0.90 & INFO/MQ >= 20 & FORMAT/DP >= 10' ${NAME}.bcftools.vcf.gz > ${NAME}.bcftools_noMask.fa

The $SITES variable refers to a simple text file with column1 = chromosome (reference) name and column2 = position in genome (1-based). I first get the depth at every position using bedtools genomecov -d -ibam $BAM, then subset for sites with < 10x depth, and finally retain only the chrom name and position in genome. Example:

image

Screenshot of deletion region:

image

Since positions 22029, 22030, and 22031 have >10x depth (just), they are not depth masked and are in the consensus with ref bases. Positions 22032, 22033, and 22034 have <10x depth, and are masked with N. The desired outcome would be for all six bases to have a '-' character.

I would hope that there is a way to make the two settings play together, i.e. to mask low-coverage sites unless they correspond to a variant. In the case of a deletion, the sites should not be depth-masked with N if they fall within the deletion. I tried following the above conversation in this issue, and it seems that the sites are masked with N first, and then if a variant falls within the masked region then it is skipped (i.e., the opposite of my desired outcome).

Is there a way to have the desired outcome? Have there been any changes to bcftools to allow this behaviour that I'm missing? I tried looking at the documentation but I couldn't find anything addressing the usage of both options in tandem.

Version info:
bcftools 1.12-73-gc95bb12
Using htslib 1.12-53-gbfb2df1

Thanks,
Charles

@pd3
Copy link
Member

pd3 commented Jun 28, 2021

How about something like this, would this work?

bcftools query -f'%CHROM\t%POS0\t%END\n' file.bcf > variants.bed
bedtools subtract -a low-depth-regions.bed -b variants.bed > mask.bed
bcftools consensus -m mask.bed 

@charlesfoster
Copy link

Fantastic - that works a treat! Thanks for the fast reply. In case it helps anyone else, I had to change my upstream commands with bedtools a little to make it work properly since previously my coordinates were 1-based:

bcftools query -f'%CHROM\t%POS0\t%END\n' ${NAME}.bcftools.vcf.gz > variants.bed
# get low coverage sites in bedgraph format
bedtools genomecov -bga -ibam $BAM | awk '$4 < 10' > low_coverage_sites.bed
bedtools subtract -a low_coverage_sites.bed -b variants.bed > mask.bed
#generate consensus and rename header
bcftools consensus -p $PREFIX -f $REFERENCE --mark-del '-' -m mask.bed -i 'FORMAT/VAF > 0.90 & INFO/MQ >= 20 & FORMAT/DP >= 10' ${NAME}.bcftools.vcf.gz | sed "/^>/s/${PREFIX}.*/${PREFIX}/" > ${NAME}.bcftools.consensus.fa

Charles

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants