Fix bcftools norm -m +indels
The code was handling the merging in a simplistic way, SNPs vs everything else.
As pointed out in #2084, it also would not merge indels with `-m +indels`, only
with `-m +both` or `-m +any`.

Also, splitting by type did not work properly because two incompatible
bitmasks were used together (e.g. COLLAPSE_SNPS vs VCF_SNP).
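
To illustrate the bitmask mix-up, here is a minimal, self-contained sketch. The numeric values are assumptions that mirror the htslib definitions as far as I recall them (check `vcf.h` and `synced_bcf_reader.h` before relying on them), and the two helper functions are hypothetical names, not part of bcftools.

```c
#include <assert.h>

/* Assumed values, believed to mirror htslib; verify against the headers. */
enum { COLLAPSE_SNPS = 1, COLLAPSE_INDELS = 2 };              /* option flags  */
enum { VCF_REF = 0, VCF_SNP = 1, VCF_MNP = 2, VCF_INDEL = 4 }; /* variant types */

/* Buggy pattern: a variant-type mask is tested against an option flag.
   The two sets of bits live in different namespaces, so this only works
   by accident for SNPs (both happen to be 1). */
static int selected_buggy(int variant_types, int collapse)
{
    return (variant_types & collapse) != 0;
}

/* Fixed pattern: translate the option into the variant-type namespace first. */
static int selected_fixed(int variant_types, int collapse)
{
    assert(collapse == COLLAPSE_SNPS || collapse == COLLAPSE_INDELS);
    int type = (collapse == COLLAPSE_SNPS) ? VCF_SNP : VCF_INDEL;
    return (variant_types & type) != 0;
}
```

Under these assumed values, the buggy test misses indels (4 & 2 is zero) and wrongly selects MNPs (2 & 2 is non-zero), which is consistent with the symptoms described above.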

This is now fixed and improved. The new behavior is as follows:

- multiallelic sites containing SNPs but not indels are split
  with `-m -snps` but not with `-m -indels`, and analogously for
  indels.

- multiallelic sites containing both SNPs and indels are split when
  any of the following is given: `-m -snps`, `-m -indels`, `-m -both`,
  `-m -any`

- merging with `-m +snps` and `-m +indels` should work as expected
  in case of pure SNP or indel sites. When the input sites contain
  a mixture of types (e.g. SNP + indel), such sites will not be merged.

- merging with `-m +both` will merge together not just SNPs with SNPs
  and indels with indels, but also "other types" with "other types".
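
The rules above can be summarised as two small predicates. This is only a sketch of the described behavior, not the code in vcfnorm.c: the `T_*` values are assumed to mirror htslib's `VCF_*` bits, `enum norm_mode` and both function names are hypothetical, and reference-only records are left out for brevity.

```c
#include <assert.h>

/* Variant-type bits; values assumed to mirror htslib's VCF_* constants. */
enum { T_SNP = 1, T_MNP = 2, T_INDEL = 4, T_OTHER = 8 };

/* Hypothetical stand-ins for the value given to -m (snps, indels, both, any). */
enum norm_mode { M_SNPS, M_INDELS, M_BOTH, M_ANY };

/* A "pure" record carries exactly one variant-type bit. */
static int is_pure(int types)
{
    return types == T_SNP || types == T_MNP || types == T_INDEL || types == T_OTHER;
}

/* Is a multiallelic site with combined ALT type mask `types`
   split under `-m -<mode>`? */
static int should_split(enum norm_mode mode, int types)
{
    assert(types >= 0);
    switch (mode) {
        case M_SNPS:   return (types & T_SNP) != 0;
        case M_INDELS: return (types & T_INDEL) != 0;
        default:       return 1;              /* both, any: no restriction */
    }
}

/* Are two records at the same position, with type masks `a` and `b`,
   merged into one record under `-m +<mode>`? */
static int can_merge(enum norm_mode mode, int a, int b)
{
    switch (mode) {
        case M_ANY:    return 1;                          /* everything with anything */
        case M_SNPS:   return a == T_SNP && b == T_SNP;     /* pure SNP sites only */
        case M_INDELS: return a == T_INDEL && b == T_INDEL; /* pure indel sites only */
        case M_BOTH:   return a == b && is_pure(a);         /* like with like */
    }
    return 0;
}
```

For example, a site with mask `T_SNP | T_INDEL` satisfies `should_split` for both `M_SNPS` and `M_INDELS`, while `can_merge(M_INDELS, T_SNP | T_INDEL, T_INDEL)` is false because one of the inputs is a mixed site.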

Note: this could be improved by providing the user with a way to
fine-tune the desired behaviour, for example something like
    -m +snps+mnps,indels
to merge SNPs with MNPs together and indels together. This would
not be too difficult to add, but would complicate the user interface.
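
As a rough idea of what parsing such an argument could involve, the sketch below turns a spec like `+snps+mnps,indels` into one type mask per comma-separated group. Everything here is hypothetical: the syntax is only the suggestion made above and is not an existing bcftools option, the function name is invented, and the `T_*` values are assumed to mirror htslib's `VCF_*` bits.

```c
#include <assert.h>
#include <string.h>

/* Variant-type bits; values assumed to mirror htslib's VCF_* constants. */
enum { T_SNP = 1, T_MNP = 2, T_INDEL = 4 };

/* Parse a hypothetical spec such as "+snps+mnps,indels".
   Types joined by '+' share a group; ',' starts a new group.
   Returns the number of groups written to `groups`, or -1 on error. */
static int parse_merge_groups(const char *spec, int *groups, int max_groups)
{
    assert(spec && groups && max_groups >= 0);
    const char *p = spec;
    int n = 0;
    if (*p == '+') p++;                       /* leading '+' means "merge" */
    while (*p) {
        if (n == max_groups) return -1;       /* too many groups */
        int mask = 0;
        for (;;) {
            size_t len = strcspn(p, "+,");    /* length of the next type name */
            if      (len == 4 && !strncmp(p, "snps", 4))   mask |= T_SNP;
            else if (len == 4 && !strncmp(p, "mnps", 4))   mask |= T_MNP;
            else if (len == 6 && !strncmp(p, "indels", 6)) mask |= T_INDEL;
            else return -1;                   /* unknown or empty type name */
            p += len;
            if (*p != '+') break;             /* end of this group */
            p++;
        }
        groups[n++] = mask;
        if (*p == ',') {
            p++;
            if (!*p) return -1;               /* trailing comma */
        }
    }
    return n;
}
```

With this sketch, `+snps+mnps,indels` yields two groups: one mask with the SNP and MNP bits set and one with the indel bit set.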

Another improvement would be to make it possible to split multiallelic
sites containing both SNPs and indels so that
a) two multiallelic sites are emitted, one with SNPs only and one with
   indels only
b) as above, but one is transformed into multiple biallelic sites and
   one multiallelic site
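
Option a) boils down to partitioning the ALT alleles of a site by type. The sketch below shows one way this might look, under a deliberately simplified classification that is an assumption of this example and not bcftools' own: single-base REF and ALT count as a SNP, differing lengths count as an indel, and anything else (for example MNPs) is skipped. The function name is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Partition ALT allele indexes into SNPs and indels using a simplified
   length-based classification (an assumption of this sketch).
   `snps` and `indels` must each have room for `nalt` entries.
   Returns the number of alleles that fit neither category. */
static int partition_alts(const char *ref, const char **alts, int nalt,
                          int *snps, int *nsnps, int *indels, int *nindels)
{
    assert(ref && alts && nalt >= 0);
    size_t rlen = strlen(ref);
    int i, nskipped = 0;
    *nsnps = *nindels = 0;
    for (i = 0; i < nalt; i++) {
        size_t alen = strlen(alts[i]);
        if (rlen == 1 && alen == 1) snps[(*nsnps)++] = i;     /* e.g. C>T   */
        else if (rlen != alen)      indels[(*nindels)++] = i; /* e.g. C>CTT */
        else                        nskipped++;               /* e.g. MNPs  */
    }
    return nskipped;
}
```

For a site with REF `C` and ALT `T,CTT,A,CAA`, this puts allele indexes 0 and 2 into the SNP set and 1 and 3 into the indel set; the two subsets could then be emitted as separate multiallelic records.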

This could be further improved (and complicated) by considering other
variant types.

Resolves #2084
pd3 committed Feb 8, 2024
1 parent f33fd1d commit 7a4f801
Showing 14 changed files with 207 additions and 71 deletions.
4 changes: 4 additions & 0 deletions NEWS
@@ -13,6 +13,10 @@ Changes affecting specific commands:
- Fix Type=String multiallelic splitting for Number=A,R,G tags with incorrect number
of values.

- Merging into multiallelic sites with `bcftools norm -m +indels` did not work. This is
now fixed and the merging is now more strict about variant types, for example complex
events, such as AC>TGA, are not considered as indels anymore (#2084)

* bcftools +setGT

- Support for custom genotypes based on the allele with higher depth, such
4 changes: 4 additions & 0 deletions doc/bcftools.txt
@@ -2503,6 +2503,10 @@ the *<<fasta_ref,--fasta-ref>>* option is supplied.
both SNPs and indels should be merged separately into two records, specify
'both'; if SNPs and indels should be merged into a single record, specify
'any'.
{nbsp} +
{nbsp} +
Note that multiallelic sites with both SNPs and indels will be split into
biallelic sites with both *-m -snps* and *-m -indels*.

*--multi-overlaps* '0'|'.'::
use the reference ('0') or missing ('.') allele for overlapping alleles after
7 changes: 7 additions & 0 deletions test/norm.merge.4.1.out
@@ -0,0 +1,7 @@
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=248387328>
##reference=file:ref.fa
#CHROM POS ID REF ALT QUAL FILTER INFO
1 1 . C T,CTT,A,CAA . . .
1 2 . C <DEL>,<DUP> . . .
8 changes: 8 additions & 0 deletions test/norm.merge.4.2.out
@@ -0,0 +1,8 @@
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=248387328>
##reference=file:ref.fa
#CHROM POS ID REF ALT QUAL FILTER INFO
1 1 . C T,CTT . . .
1 1 . C A,CAA . . .
1 2 . C <DEL>,<DUP> . . .
8 changes: 8 additions & 0 deletions test/norm.merge.4.vcf
@@ -0,0 +1,8 @@
##fileformat=VCFv4.2
##contig=<ID=1,length=248387328>
##reference=file:ref.fa
#CHROM POS ID REF ALT QUAL FILTER INFO
1 1 . C T,CTT . . .
1 1 . C A,CAA . . .
1 2 . C <DEL> . . .
1 2 . C <DUP> . . .
6 changes: 4 additions & 2 deletions test/norm.merge.out
@@ -38,8 +38,10 @@
2 101 . ATTTTTTTTTTTTT ATTTTTTTTTTTTTTT 999 PASS INDEL;AN=4;AC=4 GT:DP 1/1:1 1/1:1
2 114 . TC TTCC,TTC 999 FAIL1 INDEL;AN=4;AC=2,2 GT:DP:FGS 1/2:1:A,BB,CCC,EEEE,.,FFFFF 1/2:1:AA,BB,CCC,EEEE,.,FFFFF
2 115 . C T 999 PASS INDEL;AN=4;AC=4 GT:DP 1/1:1 1/1:1
20 3 . GATG CTATG,GACT 999 PASS INDEL;AN=4;AC=2,2 GT 2/1 2/1
20 5 id0001;id0002 TGGG TAC,TG,TGGGG,AC . PASS INDEL;AN=4;AC=2,2,0,0 GT:PL:DP 1/2:1,2,3,4,.,6,7,.,.,10,11,.,.,.,15:1 1/2:1,2,3,4,.,6,7,.,.,10,11,.,.,.,15:1
20 3 . GATG GACT 999 PASS INDEL;AN=4;AC=2 GT 1/0 1/0
20 3 . G CT 999 PASS INDEL;AN=4;AC=2 GT 0/1 0/1
20 5 id0001;id0002 TGGG TG,TGGGG . PASS INDEL;AN=4;AC=2,0 GT:PL:DP 0/1:1,4,6,7,.,10:1 0/1:1,4,6,7,.,10:1
20 5 . TGGG TAC,AC . PASS INDEL;AN=4;AC=2,0 GT:PL:DP 1/0:1,2,3,11,.,15:1 1/0:1,2,3,11,.,15:1
20 59 id0003 AG . 999 PASS AN=4 GT:PL:DP 0/0:0:4 0/0:0:4
20 80 . CACAG CACAT 999 PASS AN=4;AC=2 GT:PL:DP 0/1:255,0,255:13 0/1:255,0,255:13
20 81 . A C 999 PASS AN=4;AC=2 GT:PL:DP 0/1:255,0,255:13 0/1:255,0,255:13
6 changes: 4 additions & 2 deletions test/norm.merge.strict.out
@@ -38,8 +38,10 @@
2 101 . ATTTTTTTTTTTTT ATTTTTTTTTTTTTTT 999 PASS INDEL;AN=4;AC=4 GT:DP 1/1:1 1/1:1
2 114 . TC TTCC,TTC 999 PASS INDEL;AN=4;AC=2,2 GT:DP:FGS 1/2:1:A,BB,CCC,EEEE,.,FFFFF 1/2:1:AA,BB,CCC,EEEE,.,FFFFF
2 115 . C T 999 PASS INDEL;AN=4;AC=4 GT:DP 1/1:1 1/1:1
20 3 . GATG CTATG,GACT 999 PASS INDEL;AN=4;AC=2,2 GT 2/1 2/1
20 5 id0001;id0002 TGGG TAC,TG,TGGGG,AC . PASS INDEL;AN=4;AC=2,2,0,0 GT:PL:DP 1/2:1,2,3,4,.,6,7,.,.,10,11,.,.,.,15:1 1/2:1,2,3,4,.,6,7,.,.,10,11,.,.,.,15:1
20 3 . GATG GACT 999 PASS INDEL;AN=4;AC=2 GT 1/0 1/0
20 3 . G CT 999 PASS INDEL;AN=4;AC=2 GT 0/1 0/1
20 5 id0001;id0002 TGGG TG,TGGGG . PASS INDEL;AN=4;AC=2,0 GT:PL:DP 0/1:1,4,6,7,.,10:1 0/1:1,4,6,7,.,10:1
20 5 . TGGG TAC,AC . PASS INDEL;AN=4;AC=2,0 GT:PL:DP 1/0:1,2,3,11,.,15:1 1/0:1,2,3,11,.,15:1
20 59 id0003 AG . 999 PASS AN=4 GT:PL:DP 0/0:0:4 0/0:0:4
20 80 . CACAG CACAT 999 PASS AN=4;AC=2 GT:PL:DP 0/1:255,0,255:13 0/1:255,0,255:13
20 81 . A C 999 PASS AN=4;AC=2 GT:PL:DP 0/1:255,0,255:13 0/1:255,0,255:13
12 changes: 12 additions & 0 deletions test/norm.split.merge.1.out
@@ -0,0 +1,12 @@
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=248387328>
##contig=<ID=chr2,length=242696752>
##contig=<ID=chr3,length=201105948>
##reference=file:ref.fa
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 29291 . C T,G,A . . .
chr2 29292 . T C . . .
chr2 29292 . T TCCCTCTCCTTTCTCCTCTCTAGCC,TCTCTTTCTCACTGTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTCTAGC,TCCATCTGTATCCTCTCTAAGC,TCCCTCTCCTTTCTCCTCAGCC,TCCCTCTCCCTTTCTCCTCTCTAGCC,TCCTCTCCTTTCTCCTCTACCGC,TCCCTCTCCTTTCTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTAGCC,TCCCTCTCCTTTTCCTCCCCAGCC,TCCCTCTCCTTCTCCTCTCTAGCC,TCCCTCTCCCTTCTCCTCTCTCAC . . .
chr3 29292 . T TCCCTCTCCTTTCTCCTCTCTAGCC,TCTCTTTCTCACTGTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTCTAGC,TCCATCTGTATCCTCTCTAAGC,TCCCTCTCCTTTCTCCTCAGCC,TCCCTCTCCCTTTCTCCTCTCTAGCC,TCCTCTCCTTTCTCCTCTACCGC,TCCCTCTCCTTTCTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTAGCC,TCCCTCTCCTTTTCCTCCCCAGCC,TCCCTCTCCTTCTCCTCTCTAGCC,TCCCTCTCCCTTCTCCTCTCTCAC . . .
chr3 29292 . T CGTA . . .
14 changes: 14 additions & 0 deletions test/norm.split.merge.2.out
@@ -0,0 +1,14 @@
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=248387328>
##contig=<ID=chr2,length=242696752>
##contig=<ID=chr3,length=201105948>
##reference=file:ref.fa
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 29291 . C T . . .
chr1 29291 . C G . . .
chr1 29291 . C A . . .
chr2 29292 . T C . . .
chr2 29292 . T TCCCTCTCCTTTCTCCTCTCTAGCC,TCTCTTTCTCACTGTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTCTAGC,TCCATCTGTATCCTCTCTAAGC,TCCCTCTCCTTTCTCCTCAGCC,TCCCTCTCCCTTTCTCCTCTCTAGCC,TCCTCTCCTTTCTCCTCTACCGC,TCCCTCTCCTTTCTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTAGCC,TCCCTCTCCTTTTCCTCCCCAGCC,TCCCTCTCCTTCTCCTCTCTAGCC,TCCCTCTCCCTTCTCCTCTCTCAC . . .
chr3 29292 . T TCCCTCTCCTTTCTCCTCTCTAGCC,TCTCTTTCTCACTGTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTCTAGC,TCCATCTGTATCCTCTCTAAGC,TCCCTCTCCTTTCTCCTCAGCC,TCCCTCTCCCTTTCTCCTCTCTAGCC,TCCTCTCCTTTCTCCTCTACCGC,TCCCTCTCCTTTCTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTAGCC,TCCCTCTCCTTTTCCTCCCCAGCC,TCCCTCTCCTTCTCCTCTCTAGCC,TCCCTCTCCCTTCTCCTCTCTCAC . . .
chr3 29292 . T CGTA . . .
34 changes: 34 additions & 0 deletions test/norm.split.merge.3.out
@@ -0,0 +1,34 @@
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=248387328>
##contig=<ID=chr2,length=242696752>
##contig=<ID=chr3,length=201105948>
##reference=file:ref.fa
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 29291 . C T,G,A . . .
chr2 29292 . T C . . .
chr2 29292 . T TCCCTCTCCTTTCTCCTCTCTAGCC . . .
chr2 29292 . T TCTCTTTCTCACTGTCTCTCTAGCC . . .
chr2 29292 . T TCCCTCTCCTTTCTCCTCTCTAGC . . .
chr2 29292 . T TCCATCTGTATCCTCTCTAAGC . . .
chr2 29292 . T TCCCTCTCCTTTCTCCTCAGCC . . .
chr2 29292 . T TCCCTCTCCCTTTCTCCTCTCTAGCC . . .
chr2 29292 . T TCCTCTCCTTTCTCCTCTACCGC . . .
chr2 29292 . T TCCCTCTCCTTTCTCTCTCTAGCC . . .
chr2 29292 . T TCCCTCTCCTTTCTCCTCTAGCC . . .
chr2 29292 . T TCCCTCTCCTTTTCCTCCCCAGCC . . .
chr2 29292 . T TCCCTCTCCTTCTCCTCTCTAGCC . . .
chr2 29292 . T TCCCTCTCCCTTCTCCTCTCTCAC . . .
chr3 29292 . T TCCCTCTCCTTTCTCCTCTCTAGCC . . .
chr3 29292 . T TCTCTTTCTCACTGTCTCTCTAGCC . . .
chr3 29292 . T TCCCTCTCCTTTCTCCTCTCTAGC . . .
chr3 29292 . T TCCATCTGTATCCTCTCTAAGC . . .
chr3 29292 . T TCCCTCTCCTTTCTCCTCAGCC . . .
chr3 29292 . T TCCCTCTCCCTTTCTCCTCTCTAGCC . . .
chr3 29292 . T TCCTCTCCTTTCTCCTCTACCGC . . .
chr3 29292 . T TCCCTCTCCTTTCTCTCTCTAGCC . . .
chr3 29292 . T TCCCTCTCCTTTCTCCTCTAGCC . . .
chr3 29292 . T TCCCTCTCCTTTTCCTCCCCAGCC . . .
chr3 29292 . T TCCCTCTCCTTCTCCTCTCTAGCC . . .
chr3 29292 . T TCCCTCTCCCTTCTCCTCTCTCAC . . .
chr3 29292 . T CGTA . . .
10 changes: 10 additions & 0 deletions test/norm.split.merge.4.out
@@ -0,0 +1,10 @@
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=248387328>
##contig=<ID=chr2,length=242696752>
##contig=<ID=chr3,length=201105948>
##reference=file:ref.fa
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 29291 . C T,G,A . . .
chr2 29292 . T C,TCCCTCTCCTTTCTCCTCTCTAGCC,TCTCTTTCTCACTGTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTCTAGC,TCCATCTGTATCCTCTCTAAGC,TCCCTCTCCTTTCTCCTCAGCC,TCCCTCTCCCTTTCTCCTCTCTAGCC,TCCTCTCCTTTCTCCTCTACCGC,TCCCTCTCCTTTCTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTAGCC,TCCCTCTCCTTTTCCTCCCCAGCC,TCCCTCTCCTTCTCCTCTCTAGCC,TCCCTCTCCCTTCTCCTCTCTCAC . . .
chr3 29292 . T CGTA,TCCCTCTCCTTTCTCCTCTCTAGCC,TCTCTTTCTCACTGTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTCTAGC,TCCATCTGTATCCTCTCTAAGC,TCCCTCTCCTTTCTCCTCAGCC,TCCCTCTCCCTTTCTCCTCTCTAGCC,TCCTCTCCTTTCTCCTCTACCGC,TCCCTCTCCTTTCTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTAGCC,TCCCTCTCCTTTTCCTCCCCAGCC,TCCCTCTCCTTCTCCTCTCTAGCC,TCCCTCTCCCTTCTCCTCTCTCAC . . .
9 changes: 9 additions & 0 deletions test/norm.split.merge.vcf
@@ -0,0 +1,9 @@
##fileformat=VCFv4.2
##contig=<ID=chr1,length=248387328>
##contig=<ID=chr2,length=242696752>
##contig=<ID=chr3,length=201105948>
##reference=file:ref.fa
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 29291 . C T,G,A . . .
chr2 29292 . T C,TCCCTCTCCTTTCTCCTCTCTAGCC,TCTCTTTCTCACTGTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTCTAGC,TCCATCTGTATCCTCTCTAAGC,TCCCTCTCCTTTCTCCTCAGCC,TCCCTCTCCCTTTCTCCTCTCTAGCC,TCCTCTCCTTTCTCCTCTACCGC,TCCCTCTCCTTTCTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTAGCC,TCCCTCTCCTTTTCCTCCCCAGCC,TCCCTCTCCTTCTCCTCTCTAGCC,TCCCTCTCCCTTCTCCTCTCTCAC . . .
chr3 29292 . T CGTA,TCCCTCTCCTTTCTCCTCTCTAGCC,TCTCTTTCTCACTGTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTCTAGC,TCCATCTGTATCCTCTCTAAGC,TCCCTCTCCTTTCTCCTCAGCC,TCCCTCTCCCTTTCTCCTCTCTAGCC,TCCTCTCCTTTCTCCTCTACCGC,TCCCTCTCCTTTCTCTCTCTAGCC,TCCCTCTCCTTTCTCCTCTAGCC,TCCCTCTCCTTTTCCTCCCCAGCC,TCCCTCTCCTTCTCCTCTCTAGCC,TCCCTCTCCCTTCTCCTCTCTCAC . . .
6 changes: 6 additions & 0 deletions test/test.pl
@@ -298,6 +298,12 @@
run_test(\&test_vcf_norm,$opts,in=>'norm.right-align',fai=>'norm.right-align',out=>'norm.right-align.2.out',args=>'--old-rec-tag ORI -g {PATH}/norm.right-align.gff');
run_test(\&test_vcf_norm,$opts,in=>'norm.atom-split-norm',fai=>'norm.atom-split-norm',out=>'norm.atom-split-norm.1.out',args=>'--old-rec-tag ORI -a -m -any');
run_test(\&test_vcf_norm,$opts,in=>'norm.string-tags',out=>'norm.string-tags.1.out',args=>'-m -any');
run_test(\&test_vcf_norm,$opts,in=>'norm.split.merge',out=>'norm.split.merge.1.out',args=>['-m -','-m +both']);
run_test(\&test_vcf_norm,$opts,in=>'norm.split.merge',out=>'norm.split.merge.2.out',args=>['-m -','-m +indels']);
run_test(\&test_vcf_norm,$opts,in=>'norm.split.merge',out=>'norm.split.merge.3.out',args=>['-m -','-m +snps']);
run_test(\&test_vcf_norm,$opts,in=>'norm.split.merge',out=>'norm.split.merge.4.out',args=>['-m -','-m +any']);
run_test(\&test_vcf_norm,$opts,in=>'norm.merge.4',out=>'norm.merge.4.1.out',args=>'-m +any');
run_test(\&test_vcf_norm,$opts,in=>'norm.merge.4',out=>'norm.merge.4.2.out',args=>'-m +both');
run_test(\&test_vcf_view,$opts,in=>'merge.gvcf.2.a',out=>'merge.gvcf.2.a.1.out',args=>'-HA');
run_test(\&test_vcf_view,$opts,in=>'merge.gvcf.2.a',out=>'merge.gvcf.2.a.2.out',args=>'-HAA');
run_test(\&test_vcf_view,$opts,in=>'weird-chr-names',out=>'weird-chr-names.1.out',args=>'',reg=>'-r 1');
150 changes: 83 additions & 67 deletions vcfnorm.c
@@ -79,8 +79,8 @@ typedef struct
{
char *tseq, *seq;
int mseq;
bcf1_t **lines, **tmp_lines, **alines, **blines, *mrow_out;
int ntmp_lines, mtmp_lines, nalines, malines, nblines, mblines;
bcf1_t **lines, **tmp_lines, **mrows, *mrow_out;
int ntmp_lines, mtmp_lines, nmrows, mmrows, mrows_first;
map_t *maps; // mrow map for each buffered record
char **als;
int mmaps, nals, mals;
@@ -1874,72 +1874,98 @@ static void merge_biallelics_to_multiallelic(args_t *args, bcf1_t *dst, bcf1_t *
}

#define SWAP(type_t, a, b) { type_t t = a; a = b; b = t; }
static void mrows_schedule(args_t *args, bcf1_t **line)
static void mrows_push(args_t *args, bcf1_t **line)
{
int i,m;
if ( args->mrows_collapse==COLLAPSE_ANY // merge all record types together
|| bcf_get_variant_types(*line)&VCF_SNP // SNP, put into alines
|| bcf_get_variant_types(*line)==VCF_REF ) // ref
{
args->nalines++;
m = args->malines;
hts_expand(bcf1_t*,args->nalines,args->malines,args->alines);
for (i=m; i<args->malines; i++) args->alines[i] = bcf_init1();
SWAP(bcf1_t*, args->alines[args->nalines-1], *line);
}
else
{
args->nblines++;
m = args->mblines;
hts_expand(bcf1_t*,args->nblines,args->mblines,args->blines);
for (i=m; i<args->mblines; i++) args->blines[i] = bcf_init1();
SWAP(bcf1_t*, args->blines[args->nblines-1], *line);
if ( !args->nmrows ) args->mrows_first = 0;
args->nmrows++;
m = args->mmrows;
hts_expand(bcf1_t*,args->nmrows,args->mmrows,args->mrows);
for (i=m; i<args->mmrows; i++) args->mrows[i] = bcf_init1();
SWAP(bcf1_t*, args->mrows[args->nmrows-1], *line);

if ( args->mrows_collapse==COLLAPSE_ANY ) return;

// move the line up the sorted list so that the same variant types end up together
int cur_type = bcf_get_variant_types(args->mrows[args->nmrows-1]);
i = args->mrows_first + args->nmrows - 1;
while (i>0)
{
int prev_type = bcf_get_variant_types(args->mrows[i-1]);
if ( prev_type <= cur_type ) break;
bcf1_t *tmp = args->mrows[i-1];
args->mrows[i-1] = args->mrows[i];
args->mrows[i] = tmp;
i--;
}
}
static int mrows_ready_to_flush(args_t *args, bcf1_t *line)
static int mrows_can_flush(args_t *args, bcf1_t *line)
{
if ( args->nalines && (args->alines[0]->rid!=line->rid || args->alines[0]->pos!=line->pos) ) return 1;
if ( args->nblines && (args->blines[0]->rid!=line->rid || args->blines[0]->pos!=line->pos) ) return 1;
if ( !args->nmrows ) return 0;
int ibeg = args->mrows_first;
if ( args->mrows[ibeg]->rid != line->rid ) return 1;
if ( args->mrows[ibeg]->pos != line->pos ) return 1;
return 0;
}
static bcf1_t *mrows_flush(args_t *args)
{
if ( args->nblines && args->nalines==1 && bcf_get_variant_types(args->alines[0])==VCF_REF )
if ( !args->nmrows ) return NULL;

int ibeg = args->mrows_first;

//fprintf(stderr,"flush: ibeg=%d n=%d\n",ibeg,args->nmrows);
//int i;
//for (i=ibeg; i<ibeg+args->nmrows; i++)
// fprintf(stderr,"\ti=%d type=%d %s %s\n",i,bcf_get_variant_types(args->mrows[i]),args->mrows[i]->d.allele[0],args->mrows[i]->d.allele[1]);

if ( args->nmrows==1 )
{
// By default, REF lines are merged with SNPs if SNPs and indels are to be kept separately.
// However, if there are indels only and a single REF line, merge it with indels.
args->nblines++;
int i,m = args->mblines;
hts_expand(bcf1_t*,args->nblines,args->mblines,args->blines);
for (i=m; i<args->mblines; i++) args->blines[i] = bcf_init1();
SWAP(bcf1_t*, args->blines[args->nblines-1], args->alines[0]);
args->nalines--;
args->nmrows = 0;
return args->mrows[ibeg];
}
if ( args->nalines )

if ( args->mrows_collapse==COLLAPSE_ANY )
{
if ( args->nalines==1 )
{
args->nalines = 0;
return args->alines[0];
}
// merge everything with anything
bcf_clear(args->mrow_out);
merge_biallelics_to_multiallelic(args, args->mrow_out, args->alines, args->nalines);
args->nalines = 0;
merge_biallelics_to_multiallelic(args, args->mrow_out, &args->mrows[ibeg], args->nmrows - ibeg);
args->nmrows = 0;
return args->mrow_out;
}
else if ( args->nblines )

int j;
int types[] = { VCF_SNP, VCF_MNP, VCF_INDEL, VCF_OTHER, -1 }; // merge everything within the same category
if ( args->mrows_collapse==COLLAPSE_SNPS ) types[1] = -1; // merge SNPs only
else if ( args->mrows_collapse==COLLAPSE_INDELS ) types[0] = VCF_INDEL, types[1] = -1; // merge indels only
for (j=0; types[j]!=-1; j++)
{
if ( args->nblines==1 )
int i, type = types[j]; // to keep the compiler happy
for (i=ibeg; i<ibeg+args->nmrows; i++)
{
args->nblines = 0;
return args->blines[0];
type = bcf_get_variant_types(args->mrows[i]);
if ( type!=types[j] && type!=VCF_REF ) break;
}
if ( i==ibeg+1 && type!=VCF_REF )
{
// just one line of this type, no merging, but multiple lines of different type follow
args->nmrows--;
args->mrows_first++;
return args->mrows[ibeg];
}
if ( i>ibeg )
{
// more than one line, merging is needed
int nflush = i - ibeg;
bcf_clear(args->mrow_out);
merge_biallelics_to_multiallelic(args, args->mrow_out, &args->mrows[ibeg], nflush);
args->nmrows -= nflush;
args->mrows_first += nflush;
return args->mrow_out;
}
bcf_clear(args->mrow_out);
merge_biallelics_to_multiallelic(args, args->mrow_out, args->blines, args->nblines);
args->nblines = 0;
return args->mrow_out;
}
return NULL;
args->nmrows--;
args->mrows_first++;
return args->mrows[ibeg];
}
static void cmpals_add(cmpals_t *ca, bcf1_t *rec)
{
@@ -2013,21 +2039,13 @@ static void flush_buffer(args_t *args, htsFile *file, int n)
k = rbuf_shift(&args->rbuf);
if ( args->mrows_op==MROWS_MERGE )
{
if ( mrows_ready_to_flush(args, args->lines[k]) )
if ( mrows_can_flush(args, args->lines[k]) )
{
while ( (line=mrows_flush(args)) )
if ( bcf_write1(file, args->out_hdr, line)!=0 ) error("[%s] Error: cannot write to %s\n", __func__,args->output_fname);
}
int merge = 1;
if ( args->mrows_collapse!=COLLAPSE_BOTH && args->mrows_collapse!=COLLAPSE_ANY )
{
if ( !(bcf_get_variant_types(args->lines[k]) & args->mrows_collapse) ) merge = 0;
}
if ( merge )
{
mrows_schedule(args, &args->lines[k]);
continue;
}
mrows_push(args, &args->lines[k]);
continue;
}
else if ( args->rmdup )
{
@@ -2125,12 +2143,9 @@ static void destroy_data(args_t *args)
for (i=0; i<args->mtmp_lines; i++)
if ( args->tmp_lines[i] ) bcf_destroy1(args->tmp_lines[i]);
free(args->tmp_lines);
for (i=0; i<args->malines; i++)
bcf_destroy1(args->alines[i]);
free(args->alines);
for (i=0; i<args->mblines; i++)
bcf_destroy1(args->blines[i]);
free(args->blines);
for (i=0; i<args->mmrows; i++)
bcf_destroy1(args->mrows[i]);
free(args->mrows);
for (i=0; i<args->mmaps; i++)
free(args->maps[i].map);
for (i=0; i<args->ntmp_als; i++)
@@ -2228,7 +2243,8 @@ static int split_and_normalize(args_t *args)
// any restrictions on variant types to split?
if ( args->mrows_collapse!=COLLAPSE_BOTH && args->mrows_collapse!=COLLAPSE_ANY )
{
if ( !(bcf_get_variant_types(line) & args->mrows_collapse) )
int type = args->mrows_collapse==COLLAPSE_SNPS ? VCF_SNP : VCF_INDEL;
if ( !(bcf_get_variant_types(line) & type) )
{
normalize_line(args, line);
return 0;