Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcftools norm inconsistency with allelic depth (AD) when splitting multi-allelic sites #360

Closed
freeseek opened this issue Dec 8, 2015 · 7 comments

Comments

@freeseek
Copy link
Contributor

freeseek commented Dec 8, 2015

I hope this is not something that has been already asked before. Here an example VCF file:

## fileformat=VCFv4.1
## FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
## FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
## contig=<ID=20,length=63025520>
# CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
20  20202020    .   C   G,T 0   PASS    .   GT:AD   1/2:0,48,61

If I split this file with the command "bcftools norm -m -any" I obtain:

## fileformat=VCFv4.1
## FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
## FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
## contig=<ID=20,length=63025520>
# CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
20  20202020    .   C   G   0   PASS    .   GT:AD   1/0:0,48
20  20202020    .   C   T   0   PASS    .   GT:AD   0/1:0,61

However now I am in the uncomfortable situation where each site is heterozygous despite the allelic depth supporting "1/1" calls rather than "0/1" calls. I am sure people will have different opinions about this, but part of the reason many want to split multi-allelic sites is to consider each alternate allele as an allele to be interpreted as that allele against every other allele. It would be great to have at least an option to properly re-format the AD field so that the total sum of the AD fields is maintained after splitting, so that instead of splitting:

1/2:AD[0],AD[1],AD[2] -> 1/0:AD[0],AD[1] and 0/1:AD[0],AD[2]

It gets split instead as:

1/2:AD[0],AD[1],AD[2] -> 1/0:AD[0]+AD[2],AD[1] and 0/1:AD[0]+AD[1],AD[2]

I hope this makes sense.

@jrandall
Copy link
Contributor

@freeseek I think I understand what you are asking for, but I'm not sure it would make sense as a core feature. What would you suggest that the option be called, and how could it be made clear that the depths reported don't actually refer to the reference allele?

I can understand your frustration with the current output in this particular situation, but I'm not sure there is any better way to represent heterozygous sites where both alleles are non-reference. It seems like the "right" thing to do in this case might be to refuse the split the multiallelic line at all (perhaps on the basis that there is no evidence for it actually being multiallelic except for the fact that the reference contains a different allele than anything that was observed).

Wouldn't your suggestion to report all of the AD in the multiallelic input on each output line be even more confusing? In that case it would appear that there was actually evidence for a heterozygous call vs the reference allele, when in fact there was no evidence for the reference at all. Also, it would mean that if you sum up all of the AD for a particular location, there would appear to be twice as much as there actually was in the input data.

@freeseek
Copy link
Contributor Author

@jrandall I totally see how what I am asking for would not make sense to everybody, but there are several situations where it would make sense, especially if allelic depth is being taken into account after the split. As a matter of fact, HAIL does the splitting leaving the sum of AD constant:
https://hail.is/docs/stable/hail.VariantDataset.html#hail.VariantDataset.split_multi
It is really hurtful for my applications not to have the AD split the way HAIL does. I do understand that is not what everybody might want, but it would be great if there was an option to perform the splitting of AD where the "ref AD entry is the sum of the other multiallelic entries".

Again, to reiterate, I would like the following VCF file:

##fileformat=VCFv4.1
##FORMAT=<ID=AD,Number=R,Type=Integer,Description=Allelic depths for the ref and alt alleles in the order listed>
##FORMAT=<ID=GT,Number=1,Type=String,Description=Genotype>
##contig=<ID=20,length=63025520>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
20	20202020	.	C	G,T	0	PASS	.	GT:AD	1/2:2,48,61

to split as follows:

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description=Allelic depths for the ref and alt alleles in the order listed>
##FORMAT=<ID=GT,Number=1,Type=String,Description=Genotype>
##contig=<ID=20,length=63025520>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
20	20202020	.	C	G	0	PASS	.	GT:AD	0/1:63,48
20	20202020	.	C	T	0	PASS	.	GT:AD	0/1:50,61

the same way HAIL does it, even if requiring a special additional parameter AD-specific. Thank you! :-)

@freeseek
Copy link
Contributor Author

I wrote the following patch that adds the HAIL-like splitting functionality (using the "--keep-sum-AD" flag):

--- vcfnorm.c	2017-09-15 12:00:00.000000000 -0400
+++ vcfnorm.c	2017-09-15 12:00:00.000000000 -0400
@@ -77,7 +77,7 @@
     char **argv, *output_fname, *ref_fname, *vcf_fname, *region, *targets;
     int argc, rmdup, output_type, n_threads, check_ref, strict_filter, do_indels;
     int nchanged, nskipped, nsplit, ntotal, mrows_op, mrows_collapse, parsimonious;
-    int record_cmd_line;
+    int record_cmd_line, keep_sum_ad;
 }
 args_t;
 
@@ -589,7 +589,7 @@
         assert( nvals>0 ); \
         type_t *vals = (type_t *) args->tmp_arr1; \
         int len = bcf_hdr_id2length(args->hdr,BCF_HL_FMT,fmt->id); \
-        int i, nsmpl = bcf_hdr_nsamples(args->hdr); \
+        int i, j, nsmpl = bcf_hdr_nsamples(args->hdr); \
         if ( nvals==nsmpl ) /* all values are missing */ \
         { \
             bcf_update_format_##type(args->hdr,dst,tag,vals,nsmpl); \
@@ -620,6 +620,8 @@
             for (i=0; i<nsmpl; i++) \
             { \
                 dst_vals[0] = src_vals[0]; \
+                if (args->keep_sum_ad && fmt->id==bcf_hdr_id2int(args->hdr,BCF_DT_ID,"AD") ) \
+                  for (j=1; j<nvals; j++) if ( j!=ialt+1 ) dst_vals[0] += src_vals[j]; \
                 dst_vals[1] = src_vals[ialt+1]; \
                 dst_vals += 2; \
                 src_vals += nvals; \
@@ -1746,6 +1748,7 @@
     fprintf(stderr, "    -d, --rm-dup <type>               remove duplicate snps|indels|both|any\n");
     fprintf(stderr, "    -f, --fasta-ref <file>            reference sequence (MANDATORY)\n");
     fprintf(stderr, "    -m, --multiallelics <-|+>[type]   split multiallelics (-) or join biallelics (+), type: snps|indels|both|any [both]\n");
+    fprintf(stderr, "    -k, --keep-sum-AD                 keep sum AD vector constant\n");
     fprintf(stderr, "        --no-version                  do not append version and command line to the header\n");
     fprintf(stderr, "    -N, --do-not-normalize            do not normalize indels (with -m or -c s)\n");
     fprintf(stderr, "    -o, --output <file>               write output to a file [standard output]\n");
@@ -1771,6 +1774,7 @@
     args->output_type = FT_VCF;
     args->n_threads = 0;
     args->record_cmd_line = 1;
+    args->keep_sum_ad = 0;
     args->aln_win = 100;
     args->buf_win = 1000;
     args->mrows_collapse = COLLAPSE_BOTH;
@@ -1784,6 +1788,7 @@
         {"fasta-ref",required_argument,NULL,'f'},
         {"do-not-normalize",no_argument,NULL,'N'},
         {"multiallelics",required_argument,NULL,'m'},
+        {"keep-sum-AD",no_argument,NULL,'k'},
         {"regions",required_argument,NULL,'r'},
         {"regions-file",required_argument,NULL,'R'},
         {"targets",required_argument,NULL,'t'},
@@ -1800,7 +1805,7 @@
         {NULL,0,NULL,0}
     };
     char *tmp;
-    while ((c = getopt_long(argc, argv, "hr:R:f:w:Dd:o:O:c:m:t:T:sN",loptions,NULL)) >= 0) {
+    while ((c = getopt_long(argc, argv, "hr:R:f:w:Dd:o:O:c:m:t:T:sNk",loptions,NULL)) >= 0) {
         switch (c) {
             case 'N': args->do_indels = 0; break;
             case 'd':
@@ -1823,6 +1828,7 @@
                     else error("The argument to -m not recognised: %s\n", optarg);
                 }
                 break;
+            case 'k': args->keep_sum_ad = 1; break;
             case 'c':
                 if ( strchr(optarg,'w') ) args->check_ref |= CHECK_REF_WARN;
                 if ( strchr(optarg,'x') ) args->check_ref |= CHECK_REF_SKIP;

But I am not sure whether this will cause unintended behaviors in non-diploid cases. Could something like this be added to the main bcftools code?

@dat4git
Copy link

dat4git commented Sep 15, 2017

@freeseek this is a super helpful option! thanks for generating the nice patch!

@pd3
Copy link
Member

pd3 commented Sep 21, 2017

Hi, thank you for contributing the patch. However, it would be nicer to have a general way to specify rules for arbitrary tags, something like what merge -i implements.

@freeseek
Copy link
Contributor Author

I see your point, but I am not aware of how people would want to split the format fields. Notice that HAIL:
https://hail.is/docs/stable/hail.VariantDataset.html#hail.VariantDataset.split_multi
also has a specific way to split the PL format which is quite specific to that format and it has a way to recompute the GQ format after splitting. I do think HAIL has the most appropriate behavior for AD and PL.
Specific handling would not be too different from what is in the convert.c code which contains special handling for PL and GP.
Either way, thank you for writing vcfnorm.c as it has been invaluable in so many applications.

@pd3 pd3 closed this as completed in fdbda9a Jan 30, 2020
@pd3
Copy link
Member

pd3 commented Jan 30, 2020

I added this feature. Please check it out and let me know in case of problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants