Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BCFTools 1.9 - Consensus error "fasta sequence does not mach REF allele". #888

Closed
bvaldebenitom opened this issue Sep 19, 2018 · 11 comments
Closed

Comments

@bvaldebenitom
Copy link

Hi guys,

I have been looking to generate a consensus sequence. I mapped reads with Bowtie 2, then I used bcftools mpileup, and bcftools call. For this I used:

bcftools 1.9
Using htslib 1.9
Copyright (C) 2018 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

After mpileup and call, I used bcftools consensus, and got the following message:

The fasta sequence does not match the REF allele at JH651516.1:12426:
   .vcf: [AAAAAAA]
   .vcf: [AAAAAATAAAAAAAA] <- (ALT)
   .fa:  [CAAAAAA]TAAAAAATAAAAGAATGTGGTATGTTTATACAATGGAATACTACTTAGTTTCTTTTTAAGCTAAAAAAACTCTAAAAATTAAGGAACTCATGGATGCTACTA

The following is the info from the bcf file at said region:

JH651516.1	12426	.	A	C	49.2621	.	DP=36;VDB=0.0264014;SGB=-0.165224;MQ0F=0;AF1=1;AC1=30;DP4=0,0,0,3;MQ=35;FQ=-26.0753	GT:PL	1/1:0,0,0	1/1:0,0,0	1/1:0,0,0	1/1:0,0,0	1/1:29,3,0	1/1:0,0,0	1/1:0,0,0	1/1:0,0,0	1/1:32,3,0	1/1:0,0,0	1/1:0,0,0	1/1:23,3,0	1/1:0,0,0	1/1:0,0,0	1/1:0,0,0
JH651516.1	12426	.	AAAAAAA	AAAAAATAAAAAAAA,AAAAAAAAAAAAAAAA	57.008	.	INDEL;IDV=2;IMF=0.5;DP=36;VDB=0.999994;SGB=8.82107;MQSB=0.746849;MQ0F=0.0555556;AF1=0.66768;AC1=20;DP4=1,7,5,23;MQ=20;FQ=18.7946;PV4=1,1,0.000751967,1	GT:PL	0/1:0,0,0,0,0,0	1/1:23,3,0,23,3,23	0/1:0,0,0,0,0,0	0/1:0,0,0,0,0,0	0/1:0,3,31,3,31,31	0/1:0,0,0,0,0,0	1/1:44,17,11,33,0,28	1/1:41,9,0,41,3,38	0/0:10,16,45,0,32,30	1/1:22,6,0,23,4,20	1/1:8,8,5,9,3,0	0/1:67,68,80,0,13,4	0/1:0,0,0,0,0,0	0/1:0,0,0,0,0,0	0/1:0,0,0,0,0,0

Using samtools faidx for region 12426 on the same sequence returns the following:

>JH651516.1:12426-12426
A

Using bcftools downloaded from GitHub (following the instructions here http://samtools.github.io/bcftools/):

bcftools 1.9-51-g20a170e
Using htslib 1.9-38-g864c5b7
Copyright (C) 2018 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

I get the same error:

The fasta sequence does not match the REF allele at JH651516.1:12426:
   .vcf: [AAAAAAA]
   .vcf: [AAAAAATAAAAAAAA] <- (ALT)
   .fa:  [CAAAAAA]TAAAAAATAAAAGAATGTGGTATGTTTATACAATGGAATACTACTTAGTTTCTTTTTAAGCTAAAAAAACTCTAAAAATTAAGGAACTCATGGATGCTACTA

However, using an older version of bcftools,

bcftools 1.4.1-16-g034870e
Using htslib 1.4.1-32-g5b9361d
Copyright (C) 2016 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

I get no error, rather the warning:
The site JH651516.1:12426 overlaps with another variant, skipping...
and the FASTA file is generated.

@pd3
Copy link
Member

pd3 commented Oct 1, 2018

Thank you for the bug report, this should be fixed. Is there any chance you could provide a tarball with a small reproducible test case?

@bvaldebenitom
Copy link
Author

Thank you for the bug report, this should be fixed. Is there any chance you could provide a tarball with a small reproducible test case?

Hi there! Thanks for replying.

I have attached a small sample of the data, in which the error is reproducible.
Using bcftools v1.4 I only get:

bcftools consensus -f reference_sequence.fa sample_mpileup.bcf
The site JH651516.1:12426 overlaps with another variant, skipping...

And using bcftools 1.9-51-g20a170e the result is:

~/bcftools-git/bcftools/bcftools consensus -f reference_sequence.fa sample_mpileup.bcf
The fasta sequence does not match the REF allele at JH651516.1:12426:
   .vcf: [AAAAAAA]
   .vcf: [AAAAAATAAAAAAAA] <- (ALT)
   .fa:  [CAAAAAA]TAAAAAATAAAAGAATGTGGTATGTTTATACAATGGAATACTACTTAG

Checking the sequence with samtools faidx:

>JH651516.1:12426-12426
A

Thanks again, and please, let me know if there is anything else I can do to help in fixing this issue.

sample_data_for_bcftools1.9_github.tar.gz

@pd3 pd3 closed this as completed in 253a1fd Oct 3, 2018
@pd3
Copy link
Member

pd3 commented Oct 3, 2018

Thank you for the test case, this is now fixed by 253a1fd

@sammyjava
Copy link

I'm still experiencing this error, and I'm running the current dev branch (I checked that my consensus.c matches that from 253a1fd above);

$ git show
commit bf526a0175cdbc59a5b22d3701216513a275ecb9 (HEAD -> develop, origin/develop, origin/HEAD)
Author: Petr Danecek <pd3@sanger.ac.uk>
Date:   Thu Dec 20 14:31:49 2018 +0000
$ ./bcftools --version
bcftools 1.9-92-gbf526a0
Using htslib 1.9-71-g7492268

But I have the same problem as the original post on this issue:

$ samtools faidx hs37d5.fa 4:86698932-86698936
>4:86698932-86698936
GTGAA

and yet:

The fasta sequence does not match the REF allele at 4:86698932:
   .vcf: [GTGAA]
   .vcf: [GTGAATGAA] <- (ALT)
   .fa:  [ATGAA]TGAATGAATGAATACATTTCAAAAAGGTAATTTTAAAAATTATTTTTTACTAATGACATTGATGACAGCTGATTTTTCTTTATCCATTGATTACGTTT

results from the command

samtools faidx hs37d5.fa 4 | bcftools consensus --sample HG00119 --include 'TYPE="snp"||TYPE="indel"||TYPE="mnp"' --haplotype LA ALL.chr4.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz > HG00119.chr4.fasta

@pd3
Copy link
Member

pd3 commented Dec 21, 2018

I am not able to reproduce the error. Can you make sure that bcftools --version reports the same version string as ./bcftools --version?

@pd3 pd3 reopened this Dec 21, 2018
@sammyjava
Copy link

sammyjava commented Dec 21, 2018

Yup, it's the only version I've got installed:

shokin@haldane ~ $ bcftools --version
bcftools 1.9-92-gbf526a0
Using htslib 1.9-71-g7492268
Copyright (C) 2018 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

This is happening with the 1000 Genomes VCF against Chr4 from the appropriate reference, hs37d5. It's typically pretty far along the chromosome, but not always; here's the resulting length of the consensus fasta for the first bunch of individuals (the ref Chr4 is 191154276 bases):

HG00119.4 86708460
HG00120.4 23799660
HG00121.4 86708040
HG00122.4 14959860
HG00123.4 79398900
HG00125.4 56667480
HG00126.4 40404180
HG00127.4 37256520
HG00128.4 86708100
HG00129.4 86708880
HG00130.4 86708160
HG00131.4 86709240
HG00132.4 79397760
HG00133.4 22293060
HG00136.4 59075220
HG00137.4 85570440
HG00138.4 86708340

I can upload the data to reproduce this, it's fairly big, of course.

@sammyjava
Copy link

sammyjava commented Dec 21, 2018

BTW, it always occurs on repeat extensions, and the bug is always on the first base of the REF allele, e.g.

The fasta sequence does not match the REF allele at 4:86698932:
   .vcf: [GTGAA]
   .vcf: [GTGAATGAA] <- (ALT)
   .fa:  [ATGAA]TGAATGAATGAATACATTTCAAAAAGGTAATTTTAAAAATTATT

Presumably the VCF is saying this individual has an extra TGAA inserted in that short repeat. (I've been studying Huntington's Disease, which is a CAG repeat extension in the HTT gene which is luckily near the beginning of Chr4 and hasn't been affected by this bug for ranges within 0-5Mb.)

Do you know of another tool that folks use to pull individuals' genomes out of the 1kG VCF and corresponding reference? bcftools consensus seems to be the recommended tool, and it works great when it works!

@sammyjava
Copy link

Here's a very slightly different case: the REF is notated with a single base, but it is once again an extension of a four-base repeat:

The fasta sequence does not match the REF allele at 4:14957991:
   .vcf: [T]
   .vcf: [TAAACAAACAAAC] <- (ALT)
   .fa:  [C]AAACAAACATCATATAAATATATATAATTATATATCATATAATTATAT

@pd3 pd3 closed this as completed in 27c9e5d Dec 21, 2018
@pd3
Copy link
Member

pd3 commented Dec 21, 2018

I was able to reproduce the problem now and fixed. Let me know if you encounter any other issue.

@sammyjava
Copy link

Nice work! Thanks!!! Happy holidays!!

@MaximePolicarpo
Copy link

Hi,

I think the problem still exists. I used bcftools consensus and got this error :

The fasta sequence does not match the REF allele at NC_035902.1:45610288:
.vcf: [AT] <- (REF)
.vcf: [ATT] <- (ALT)
.fa: [TT]ATTATTATTATTATTATTATTATTATTATATTTT

[I have the last bcftools version, downloaded from https://github.com/samtools/bcftools]

Thanks and happy holidays,

Maxime

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants