bcftools norm -m +any gives incorrect AN value when SNP and INDEL entries are merged #2137

Open
@Fan-iX

When I join SNP and INDEL records using bcftools norm -m +any, one of the AN ("Total number of alleles in called genotypes") values is discarded instead of being added to the other.

Here is a reproducible example:

1.vcf (merged from two vcf files using bcftools merge --no-index part1.vcf part2.vcf, see below)

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=Chr1,length=100>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##bcftools_mergeVersion=1.19+htslib-1.19
##bcftools_mergeCommand=merge --no-index part1.vcf part2.vcf; Date=Sun Mar 24 15:29:30 2024
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	A001	A002	A003	A004	A005	A006	A007	A008	A009	A010	A011	A012	A013	C001	C002	C003	C004	C005	C006	C007	C008	C009	C010	C011	C012	C013	C014	C015	C016	C017	C018	C019	C020	C021	C022	C023	C024	C025	C026	C027	C028	C029	C030	C031	C032	C033	C034	C035	C036	C037	C038	C039	C040	C041	C042	C043	C044	C045	C046	C047	C048	C049	C050	C051	C052	C053	C054	C055	C056	C057	C058	C059	C060
Chr1	1	.	T	A	228.246	PASS	AN=24;AC=8	GT:DP	1/1:20	0/0:2	0/1:8	1/1:6	./.:0	0/0:8	0/0:2	0/0:5	0/0:1	0/0:1	1/1:17	0/0:2	0/1:13	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.
Chr1	1	.	T	TAAAAA,TAAA,TAA,TAAAA,TA	228.401	PASS	INDEL;AN=120;AC=28,43,10,8,30	GT:DP	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	./.:.	1/1:11	2/3:16	4/4:11	4/1:11	2/2:18	2/5:29	5/5:23	1/1:35	2/2:36	0/3:14	3/5:19	3/2:15	2/2:18	3/2:14	2/2:17	2/2:9	1/2:16	5/1:15	5/2:17	1/1:11	2/1:9	5/5:5	5/5:44	1/5:21	2/2:18	5/3:19	1/1:19	5/1:19	2/1:48	2/5:31	1/5:23	2/2:12	2/4:11	1/1:20	2/4:10	1/2:9	1/2:14	1/1:17	5/2:10	3/2:12	2/2:16	1/2:14	1/5:55	1/2:41	5/3:47	1/4:39	5/5:13	5/2:11	5/2:37	3/5:43	2/1:27	5/5:30	4/2:30	5/5:35	4/2:12	2/1:10	5/5:13	2/3:23	5/2:14	2/2:12

After bcftools norm -m +any 1.vcf

...
##bcftools_normVersion=1.19+htslib-1.19
##bcftools_normCommand=norm -m +any c.vcf; Date=Sun Mar 24 15:32:45 2024
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	A001	A002	A003	A004	A005	A006	A007	A008	A009	A010	A011	A012	A013	C001	C002	C003	C004	C005	C006	C007	C008	C009	C010	C011	C012	C013	C014	C015	C016	C017	C018	C019	C020	C021	C022	C023	C024	C025	C026	C027	C028	C029	C030	C031	C032	C033	C034	C035	C036	C037	C038	C039	C040	C041	C042	C043	C044	C045	C046	C047	C048	C049	C050	C051	C052	C053	C054	C055	C056	C057	C058	C059	C060
Chr1	1	.	T	A,TAAAAA,TAAA,TAA,TAAAA,TA	228.401	PASS	AN=24;AC=8,28,43,10,8,30	GT:DP	1/1:20	0/0:2	0/1:8	1/1:6	./.:0	0/0:8	0/0:2	0/0:5	0/0:1	0/0:1	1/1:17	0/0:2	0/1:13	2/2:.	3/4:.	5/5:.	5/2:.	3/3:.	3/6:.	6/6:.	2/2:.	3/3:.	./4:.	4/6:.	4/3:.	3/3:.	4/3:.	3/3:.	3/3:.	2/3:.	6/2:.	6/3:.	2/2:.	3/2:.	6/6:.	6/6:.	2/6:.	3/3:.	6/4:.	2/2:.	6/2:.	3/2:.	3/6:.	2/6:.	3/3:.	3/5:.	2/2:.	3/5:.	2/3:.	2/3:.	2/2:.	6/3:.	4/3:.	3/3:.	2/3:.	2/6:.	2/3:.	6/4:.	2/5:.	6/6:.	6/3:.	6/3:.	4/6:.	3/2:.	6/6:.	5/3:.	6/6:.	5/3:.	3/2:.	6/6:.	3/4:.	6/3:.	3/3:.

As you can see, the AN value of the joined record is 24 instead of the correct 144 (120+24).
This leads to an error when I run bcftools norm -m +any 1.vcf | bcftools view -q 0.1:nonmajor :

[E::bcf_calc_ac] Incorrect AN/AC counts at Chr1:1
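
The mismatch can be checked by hand from the INFO columns of the two outputs in this report. The sketch below is only an illustration: it assumes the error is raised when the sum of the AC values exceeds AN, which is my reading of the message and not something confirmed here. The helper names `parse_info` and `an_ac_consistent` are made up for this example and are not part of bcftools or htslib.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict; flags without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def an_ac_consistent(info):
    """Return True when the ALT allele counts (AC) do not exceed AN.

    Assumption: this mirrors the sanity check behind the
    "Incorrect AN/AC counts" message; the real check may differ.
    """
    fields = parse_info(info)
    an = int(fields["AN"])
    ac_total = sum(int(x) for x in fields["AC"].split(","))
    return ac_total <= an


# INFO columns copied from the outputs in this report
norm_info = "AN=24;AC=8,28,43,10,8,30"          # bcftools norm -m +any
merge_info = "INDEL;AN=144;AC=8,28,43,10,8,30"  # bcftools merge -m any

print(an_ac_consistent(norm_info))   # False: AC sums to 127, AN is 24
print(an_ac_consistent(merge_info))  # True: AC sums to 127, AN is 144
```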

On the other hand, bcftools merge -m any --no-index part1.vcf part2.vcf gives the correct AN value:

...
##bcftools_mergeVersion=1.19+htslib-1.19
##bcftools_mergeCommand=merge --no-index -m any part1.vcf part2.vcf; Date=Sun Mar 24 15:39:19 2024
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	A001	A002	A003	A004	A005	A006	A007	A008	A009	A010	A011	A012	A013	C001	C002	C003	C004	C005	C006	C007	C008	C009	C010	C011	C012	C013	C014	C015	C016	C017	C018	C019	C020	C021	C022	C023	C024	C025	C026	C027	C028	C029	C030	C031	C032	C033	C034	C035	C036	C037	C038	C039	C040	C041	C042	C043	C044	C045	C046	C047	C048	C049	C050	C051	C052	C053	C054	C055	C056	C057	C058	C059	C060
Chr1	1	.	T	A,TAAAAA,TAAA,TAA,TAAAA,TA	228.401	PASS	INDEL;AN=144;AC=8,28,43,10,8,30	GT:DP	1/1:20	0/0:2	0/1:8	1/1:6	./.:0	0/0:8	0/0:2	0/0:5	0/0:1	0/0:1	1/1:17	0/0:2	0/1:13	2/2:11	3/4:16	5/5:11	5/2:11	3/3:18	3/6:29	6/6:23	2/2:35	3/3:36	0/4:14	4/6:19	4/3:15	3/3:18	4/3:14	3/3:17	3/3:9	2/3:16	6/2:15	6/3:17	2/2:11	3/2:9	6/6:5	6/6:44	2/6:21	3/3:18	6/4:19	2/2:19	6/2:19	3/2:48	3/6:31	2/6:23	3/3:12	3/5:11	2/2:20	3/5:10	2/3:9	2/3:14	2/2:17	6/3:10	4/3:12	3/3:16	2/3:14	2/6:55	2/3:41	6/4:47	2/5:39	6/6:13	6/3:11	6/3:37	4/6:43	3/2:27	6/6:30	5/3:30	6/6:35	5/3:12	3/2:10	6/6:13	3/4:23	6/3:14	3/3:12
part1.vcf and part2.vcf

part1.vcf

##fileformat=VCFv4.2
##contig=<ID=Chr1,length=100>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	A001	A002	A003	A004	A005	A006	A007	A008	A009	A010	A011	A012	A013
Chr1	1	.	T	A	228.246	PASS	AN=24;AC=8	GT:DP	1/1:20	0/0:2	0/1:8	1/1:6	./.:0	0/0:8	0/0:2	0/0:5	0/0:1	0/0:1	1/1:17	0/0:2	0/1:13

part2.vcf

##fileformat=VCFv4.2
##contig=<ID=Chr1,length=100>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	C001	C002	C003	C004	C005	C006	C007	C008	C009	C010	C011	C012	C013	C014	C015	C016	C017	C018	C019	C020	C021	C022	C023	C024	C025	C026	C027	C028	C029	C030	C031	C032	C033	C034	C035	C036	C037	C038	C039	C040	C041	C042	C043	C044	C045	C046	C047	C048	C049	C050	C051	C052	C053	C054	C055	C056	C057	C058	C059	C060
Chr1	1	.	T	TAAAAA,TAAA,TAA,TAAAA,TA	228.401	PASS	INDEL;AN=120;AC=28,43,10,8,30	GT:DP	1/1:11	2/3:16	4/4:11	4/1:11	2/2:18	2/5:29	5/5:23	1/1:35	2/2:36	0/3:14	3/5:19	3/2:15	2/2:18	3/2:14	2/2:17	2/2:9	1/2:16	5/1:15	5/2:17	1/1:11	2/1:9	5/5:5	5/5:44	1/5:21	2/2:18	5/3:19	1/1:19	5/1:19	2/1:48	2/5:31	1/5:23	2/2:12	2/4:11	1/1:20	2/4:10	1/2:9	1/2:14	1/1:17	5/2:10	3/2:12	2/2:16	1/2:14	1/5:55	1/2:41	5/3:47	1/4:39	5/5:13	5/2:11	5/2:37	3/5:43	2/1:27	5/5:30	4/2:30	5/5:35	4/2:12	2/1:10	5/5:13	2/3:23	5/2:14	2/2:12
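
As a cross-check of the expected value, AN and AC can be recounted directly from the GT fields of the two input records. This is a minimal sketch, not how bcftools computes the tags: the function name `count_alleles` is made up for this example, and the genotype strings are copied from part1.vcf and part2.vcf above with the DP values dropped.

```python
from collections import Counter


def count_alleles(genotypes, n_alt):
    """Count called alleles (AN) and per-ALT allele counts (AC) from GT strings."""
    counts = Counter()
    for gt in genotypes:
        # Accept both unphased ("/") and phased ("|") separators
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":  # missing alleles do not contribute to AN
                counts[int(allele)] += 1
    an = sum(counts.values())
    ac = [counts[i] for i in range(1, n_alt + 1)]
    return an, ac


# GT values of A001..A013 from part1.vcf (one ALT allele)
part1_gt = "1/1 0/0 0/1 1/1 ./. 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/1".split()

# GT values of C001..C060 from part2.vcf (five ALT alleles)
part2_gt = (
    "1/1 2/3 4/4 4/1 2/2 2/5 5/5 1/1 2/2 0/3 "
    "3/5 3/2 2/2 3/2 2/2 2/2 1/2 5/1 5/2 1/1 "
    "2/1 5/5 5/5 1/5 2/2 5/3 1/1 5/1 2/1 2/5 "
    "1/5 2/2 2/4 1/1 2/4 1/2 1/2 1/1 5/2 3/2 "
    "2/2 1/2 1/5 1/2 5/3 1/4 5/5 5/2 5/2 3/5 "
    "2/1 5/5 4/2 5/5 4/2 2/1 5/5 2/3 5/2 2/2"
).split()

an1, ac1 = count_alleles(part1_gt, n_alt=1)
an2, ac2 = count_alleles(part2_gt, n_alt=5)

print(an1, ac1)   # 24 [8]
print(an2, ac2)   # 120 [28, 43, 10, 8, 30]
print(an1 + an2)  # 144, the AN reported by bcftools merge -m any
```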
