bcftools norm inconsistency with allelic depth (AD) when splitting multi-allelic sites #360

freeseek · 2015-12-08T22:49:34Z

I hope this is not something that has been already asked before. Here an example VCF file:

## fileformat=VCFv4.1
## FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
## FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
## contig=<ID=20,length=63025520>
# CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
20  20202020    .   C   G,T 0   PASS    .   GT:AD   1/2:0,48,61

If I split this file with the command "bcftools norm -m -any" I obtain:

## fileformat=VCFv4.1
## FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
## FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
## contig=<ID=20,length=63025520>
# CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
20  20202020    .   C   G   0   PASS    .   GT:AD   1/0:0,48
20  20202020    .   C   T   0   PASS    .   GT:AD   0/1:0,61

However now I am in the uncomfortable situation where each site is heterozygous despite the allelic depth supporting "1/1" calls rather than "0/1" calls. I am sure people will have different opinions about this, but part of the reason many want to split multi-allelic sites is to consider each alternate allele as an allele to be interpreted as that allele against every other allele. It would be great to have at least an option to properly re-format the AD field so that the total sum of the AD fields is maintained after splitting, so that instead of splitting:

1/2:AD[0],AD[1],AD[2] -> 1/0:AD[0],AD[1] and 0/1:AD[0],AD[2]

It gets split instead as:

1/2:AD[0],AD[1],AD[2] -> 1/0:AD[0]+AD[2],AD[1] and 0/1:AD[0]+AD[1],AD[2]

I hope this makes sense.

The text was updated successfully, but these errors were encountered:

jrandall · 2016-01-25T12:15:55Z

@freeseek I think I understand what you are asking for, but I'm not sure it would make sense as a core feature. What would you suggest that the option be called, and how could it be made clear that the depths reported don't actually refer to the reference allele?

I can understand your frustration with the current output in this particular situation, but I'm not sure there is any better way to represent heterozygous sites where both alleles are non-reference. It seems like the "right" thing to do in this case might be to refuse the split the multiallelic line at all (perhaps on the basis that there is no evidence for it actually being multiallelic except for the fact that the reference contains a different allele than anything that was observed).

Wouldn't your suggestion to report all of the AD in the multiallelic input on each output line be even more confusing? In that case it would appear that there was actually evidence for a heterozygous call vs the reference allele, when in fact there was no evidence for the reference at all. Also, it would mean that if you sum up all of the AD for a particular location, there would appear to be twice as much as there actually was in the input data.

freeseek · 2017-09-13T15:23:20Z

@jrandall I totally see how what I am asking for would not make sense to everybody, but there are several situations where it would make sense, especially if allelic depth is being taken into account after the split. As a matter of fact, HAIL does the splitting leaving the sum of AD constant:
https://hail.is/docs/stable/hail.VariantDataset.html#hail.VariantDataset.split_multi
It is really hurtful for my applications not to have the AD split the way HAIL does. I do understand that is not what everybody might want, but it would be great if there was an option to perform the splitting of AD where the "ref AD entry is the sum of the other multiallelic entries".

Again, to reiterate, I would like the following VCF file:

##fileformat=VCFv4.1
##FORMAT=<ID=AD,Number=R,Type=Integer,Description=Allelic depths for the ref and alt alleles in the order listed>
##FORMAT=<ID=GT,Number=1,Type=String,Description=Genotype>
##contig=<ID=20,length=63025520>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
20	20202020	.	C	G,T	0	PASS	.	GT:AD	1/2:2,48,61

to split as follows:

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description=Allelic depths for the ref and alt alleles in the order listed>
##FORMAT=<ID=GT,Number=1,Type=String,Description=Genotype>
##contig=<ID=20,length=63025520>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
20	20202020	.	C	G	0	PASS	.	GT:AD	0/1:63,48
20	20202020	.	C	T	0	PASS	.	GT:AD	0/1:50,61

the same way HAIL does it, even if requiring a special additional parameter AD-specific. Thank you! :-)

freeseek · 2017-09-15T22:58:17Z

I wrote the following patch that adds the HAIL-like splitting functionality (using the "--keep-sum-AD" flag):

--- vcfnorm.c	2017-09-15 12:00:00.000000000 -0400
+++ vcfnorm.c	2017-09-15 12:00:00.000000000 -0400
@@ -77,7 +77,7 @@
     char **argv, *output_fname, *ref_fname, *vcf_fname, *region, *targets;
     int argc, rmdup, output_type, n_threads, check_ref, strict_filter, do_indels;
     int nchanged, nskipped, nsplit, ntotal, mrows_op, mrows_collapse, parsimonious;
-    int record_cmd_line;
+    int record_cmd_line, keep_sum_ad;
 }
 args_t;
 
@@ -589,7 +589,7 @@
         assert( nvals>0 ); \
         type_t *vals = (type_t *) args->tmp_arr1; \
         int len = bcf_hdr_id2length(args->hdr,BCF_HL_FMT,fmt->id); \
-        int i, nsmpl = bcf_hdr_nsamples(args->hdr); \
+        int i, j, nsmpl = bcf_hdr_nsamples(args->hdr); \
         if ( nvals==nsmpl ) /* all values are missing */ \
         { \
             bcf_update_format_##type(args->hdr,dst,tag,vals,nsmpl); \
@@ -620,6 +620,8 @@
             for (i=0; i<nsmpl; i++) \
             { \
                 dst_vals[0] = src_vals[0]; \
+                if (args->keep_sum_ad && fmt->id==bcf_hdr_id2int(args->hdr,BCF_DT_ID,"AD") ) \
+                  for (j=1; j<nvals; j++) if ( j!=ialt+1 ) dst_vals[0] += src_vals[j]; \
                 dst_vals[1] = src_vals[ialt+1]; \
                 dst_vals += 2; \
                 src_vals += nvals; \
@@ -1746,6 +1748,7 @@
     fprintf(stderr, "    -d, --rm-dup <type>               remove duplicate snps|indels|both|any\n");
     fprintf(stderr, "    -f, --fasta-ref <file>            reference sequence (MANDATORY)\n");
     fprintf(stderr, "    -m, --multiallelics <-|+>[type]   split multiallelics (-) or join biallelics (+), type: snps|indels|both|any [both]\n");
+    fprintf(stderr, "    -k, --keep-sum-AD                 keep sum AD vector constant\n");
     fprintf(stderr, "        --no-version                  do not append version and command line to the header\n");
     fprintf(stderr, "    -N, --do-not-normalize            do not normalize indels (with -m or -c s)\n");
     fprintf(stderr, "    -o, --output <file>               write output to a file [standard output]\n");
@@ -1771,6 +1774,7 @@
     args->output_type = FT_VCF;
     args->n_threads = 0;
     args->record_cmd_line = 1;
+    args->keep_sum_ad = 0;
     args->aln_win = 100;
     args->buf_win = 1000;
     args->mrows_collapse = COLLAPSE_BOTH;
@@ -1784,6 +1788,7 @@
         {"fasta-ref",required_argument,NULL,'f'},
         {"do-not-normalize",no_argument,NULL,'N'},
         {"multiallelics",required_argument,NULL,'m'},
+        {"keep-sum-AD",no_argument,NULL,'k'},
         {"regions",required_argument,NULL,'r'},
         {"regions-file",required_argument,NULL,'R'},
         {"targets",required_argument,NULL,'t'},
@@ -1800,7 +1805,7 @@
         {NULL,0,NULL,0}
     };
     char *tmp;
-    while ((c = getopt_long(argc, argv, "hr:R:f:w:Dd:o:O:c:m:t:T:sN",loptions,NULL)) >= 0) {
+    while ((c = getopt_long(argc, argv, "hr:R:f:w:Dd:o:O:c:m:t:T:sNk",loptions,NULL)) >= 0) {
         switch (c) {
             case 'N': args->do_indels = 0; break;
             case 'd':
@@ -1823,6 +1828,7 @@
                     else error("The argument to -m not recognised: %s\n", optarg);
                 }
                 break;
+            case 'k': args->keep_sum_ad = 1; break;
             case 'c':
                 if ( strchr(optarg,'w') ) args->check_ref |= CHECK_REF_WARN;
                 if ( strchr(optarg,'x') ) args->check_ref |= CHECK_REF_SKIP;

But I am not sure whether this will cause unintended behaviors in non-diploid cases. Could something like this be added to the main bcftools code?

dat4git · 2017-09-15T23:23:49Z

@freeseek this is a super helpful option! thanks for generating the nice patch!

pd3 · 2017-09-21T08:15:51Z

Hi, thank you for contributing the patch. However, it would be nicer to have a general way to specify rules for arbitrary tags, something like what merge -i implements.

freeseek · 2017-09-21T14:55:22Z

I see your point, but I am not aware of how people would want to split the format fields. Notice that HAIL:
https://hail.is/docs/stable/hail.VariantDataset.html#hail.VariantDataset.split_multi
also has a specific way to split the PL format which is quite specific to that format and it has a way to recompute the GQ format after splitting. I do think HAIL has the most appropriate behavior for AD and PL.
Specific handling would not be too different from what is in the convert.c code which contains special handling for PL and GP.
Either way, thank you for writing vcfnorm.c as it has been invaluable in so many applications.

pd3 · 2020-01-30T01:43:04Z

I added this feature. Please check it out and let me know in case of problems.

jrandall added P3: Low priority D2: Moderate labels Jan 25, 2016

pd3 closed this as completed in fdbda9a Jan 30, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bcftools norm inconsistency with allelic depth (AD) when splitting multi-allelic sites #360

bcftools norm inconsistency with allelic depth (AD) when splitting multi-allelic sites #360

freeseek commented Dec 8, 2015 •

edited

Loading

jrandall commented Jan 25, 2016

freeseek commented Sep 13, 2017

freeseek commented Sep 15, 2017

dat4git commented Sep 15, 2017

pd3 commented Sep 21, 2017

freeseek commented Sep 21, 2017

pd3 commented Jan 30, 2020

bcftools norm inconsistency with allelic depth (AD) when splitting multi-allelic sites #360

bcftools norm inconsistency with allelic depth (AD) when splitting multi-allelic sites #360

Comments

freeseek commented Dec 8, 2015 • edited Loading

jrandall commented Jan 25, 2016

freeseek commented Sep 13, 2017

freeseek commented Sep 15, 2017

dat4git commented Sep 15, 2017

pd3 commented Sep 21, 2017

freeseek commented Sep 21, 2017

pd3 commented Jan 30, 2020

freeseek commented Dec 8, 2015 •

edited

Loading