Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

overlapping variants #600

Closed
rebzzy opened this issue Apr 27, 2017 · 12 comments
Closed

overlapping variants #600

rebzzy opened this issue Apr 27, 2017 · 12 comments

Comments

@rebzzy
Copy link

rebzzy commented Apr 27, 2017

Hi -

I'm trying to generate a fastq file using bcftools consensus. However, I am getting the warning that consensus is skipping sites with overlapping variants (where there's two entries in the vcf for a particular position). Is there a way to circumvent this issue? It's not clear to me whether this is an upstream issue with how I'm generating my VCF file. I called SNPs after using samtools mpileup and bcftools call on a bam file with multiple individuals. I then extracted information for a single individual into a new vcf file and tried to run bcftools consensus.

Thanks.

@pd3
Copy link
Member

pd3 commented May 1, 2017

This is a simple question but difficult to help with. From the biological point of view, there should be no overlapping variants. However, all callers make false positives and the raw calls can be conflicting. There should be a filtering step to exclude low-confidence calls. Also, ideally the data should be phased.

@orionzhou
Copy link

orionzhou commented Apr 19, 2018

It makes sense to skip overlapping variants when generating a consensus for a particular sample. However, in some cases, even non-overlapping variants get reported as "overlapping" and skipped when running "bcftools consensus". Here is an example:

1 2288271 . AAC A 637.73 . AC=2;AF=1.0;AN=2;DP=13;ExcessHet=3.0103;FS=0.0;MLEAC=2;MLEAF=1.0;MQ=53.14;QD=27.76;SOR=3.767 GT:AD:DP:GQ:PL 1/1:0,13:13:45:675,45,0

1 2288273 . C CGTTTTTTT 547.73 . AC=2;AF=1.0;AN=2;DP=13;ExcessHet=3.0103;FS=0.0;MLEAC=2;MLEAF=1.0;MQ=53.14;QD=31.41;SOR=3.767 GT:AD:DP:GQ:PL 1/1:0,13:13:39:585,39,0

This is is a deletion immediately followed by an insertion, there is no overlap or conflict between them but "bcftools consensus" will report as "overlapping" and skip.

Any thoughts?

@pd3 pd3 closed this as completed in cbea92d Apr 20, 2018
@pd3
Copy link
Member

pd3 commented Apr 20, 2018

This should be now fixed. However, note that this fix is still not perfect in that it requires normalized VCF and may therefore falsely skip multiallelic indels.

@orionzhou
Copy link

orionzhou commented Apr 20, 2018

Thanks for the quick fix. However, running the program on that same VCF (see my above example) gives this error:

Note: the --sample option not given, applying all records regardless of the genotype
The fasta sequence does not match the REF allele at 1:2288273:
.vcf: [C]
.vcf: [CGTTTTTTT] <- (ALT)
.fa: [A]TTTTTGGGG

This error arises because position 1:2288273 has been removed by the previous deletion (1:2288271 AAC -> A), but the insertion right after the deletion (in the VCF file) still uses the raw reference (1:2288273:C) which causes a conflict.

@pd3
Copy link
Member

pd3 commented Apr 21, 2018

Any chance you could provide a test case to debug the problem?

pd3 added a commit that referenced this issue Apr 24, 2018
This improves the test introduced in cbea92d. Coincidentally the reference
base of the modified and the original reference was the same in the test
which masked the problem pointed out by
#600 (comment)
@pd3
Copy link
Member

pd3 commented Apr 24, 2018

Nevermind. This specific case should be now addressed by 21fd8da. Let's hope this will work for the rest of your VCF as well

@orionzhou
Copy link

Great! Now it works! Thanks!

@sprocha
Copy link

sprocha commented Jan 30, 2019

Hi!
would really appreciate some input here.
I am exploring how to deal with the issue of overlapping variants. I understand they should not exist and that they are caller's error... and I have multisample vcf(s) from GATK4 where there are quite some of these. I would like to understand exactly what bcftools consensus is doing with them, in order to decide how to process things downstream.

As far as I understood, it does:
1) default case (no -I):
a) writes down ALT (or one of the alts, in case both alleles for that sample are different from REF) snp and indel allelles;
b) in the case of indels, subsequent overlapping variants (i.e, within the size of any of the allelles for that sample) are skipped

  1. IUPAC ambiguities (-I):
    a) when an indel is heterozigous it is coded with the ref allelle, if both allelles are ALT, chooses one at random;
    b) overlapping SNPs or indels (i.e, inside the indel) are not skipped (which is kind of weird, but well)

will paste an example, referring to the genotype of first sample in vcf:

REF_NMAP_I3748 2556 . CGGCTG TGGCTG,,C 6236.95 . (...) GT:AD:DP:GQ:PGT:PID:PL 0/1:
REF_NMAP_I3748 2558 . G T,GTC 12154.4 (...) GT:AD:DP:GQ:PGT:PID:PL 0/1
REF_NMAP_I3748 2560 . TG CG,
,AG 76454 (...) GT:AD:DP:GQ:PGT:PID:PL 0/1
REF_NMAP_I3748 2561 . GTA *,CTA,ATA,G 2661.44 . GT:AD:DP:GQ:PL 0/3

if u look at the region in the aligned fastas:

captura de ecra 2019-01-30 as 18 32 40

you see that:

  1. the 2558 is skipped by the "no -I" (correctly) but not by the "-I";
  2. 2560 and 2561 are not in the warnings "skipped list" of "-I", so I assume he is writing down the ref allelle (and not skipping) - they do are skipped by the "no -I" output

most of my cases seem to fit the "rules" above, but not all. I do have cases (much less) of skipped positions for the "-I". they seem to be in problematic regions (multiple possible overlapping indels) but I could not quite figure out the rules... may it be that if it is an indel completely within the other indel it is skipped?

I am a bit confused, can you please explain me better what the program does?
many thanks in advance!!

@pd3
Copy link
Member

pd3 commented Feb 14, 2019

The program assumes that the VCF has the correct information and skips conflicting variants.

@pd3 pd3 reopened this Feb 14, 2019
@pd3 pd3 closed this as completed Feb 14, 2019
@milkschen
Copy link

Hi,
I'm using bcftools consensus to create consensus sequence and unfortunately my VCF has some conflicting variants. I wonder how bcftools deal with those variants. An example that confuses me is given as follows:

21:44457394-44457399
TCCCGT

21 44457396 rs76516641 C A
21 44457397 rs34512808 C A
21 44457397 rs141511289 C CA

The results I got with bcftools consensus is TCAAAGT in the corresponding region. However, since rs141511289 conflicts with rs34512808, if rs34512808 is used the output would be TCAAGT and if rs141511289 is used the output would be TCACAGT.

I'm using bcftools 1.9-206-g4694164 and htslib 1.9-258-ga428aa2. Thanks in advance!

@pd3
Copy link
Member

pd3 commented Jul 6, 2019

The behavior depends on how bcftools consensus is run: with -s -H or without? The program should print a warning message when conflicting (overlapping) variants are encountered and only the first is used, the rest is discarded.

However, there is no conflict in this case: the variant rs141511289 inserts a new A after 44457397, therefore the TCAAAGT sequence is correct.

Please open a new issue next time as this is not really a continuation of the old thread.

@milkschen
Copy link

Sorry for mis-posting. Thank you very much for the response.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants