Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pileup outputs incorrect bases for insertions following deletions #59

Closed
jkbonfield opened this issue Feb 11, 2014 · 8 comments
Closed

Comments

@jkbonfield
Copy link
Contributor

I think this is an HTSlib issue rather than samtools issue. It was initially reported in 2010, but has only partially been fixed:

https://sourceforge.net/mailarchive/forum.php?thread_name=424A790A-3AC0-4FC2-8981-5E3F6671C0DD%40sanger.ac.uk&forum_name=samtools-devel

For example, the following SAM file:

@SQ     SN:c2   LN:9
@CO      c2    CC***AA**T**AA***CC
@CO
@CO     +s1    CT***AA**T**AA***TC
@CO     +s1b   CT*******T*******TC
@CO     +s2    CT*****G**G******TC
@CO     +s3    CT*****GG*GG*****TC
@CO     +s3b   CT****CGGCGGC****TC
@CO     +s4    CT***AAG**G*AA***TC
@CO     +s5    CTGGG*********GGGTC
s1      0       c2      1       0       9M      *       0       0       CTAATAATC       XXXXXXXXX
s1b     0       c2      1       0       2M2D1M2D2M      *       0       0       CTTTC   *
s2      0       c2      1       0       2M2D1I1D1I2D2M  *       0       0       CTGGTC  *
s3      0       c2      1       0       2M2D2I1D2I2D2M  *       0       0       CTGGGGTC        *
s3b     0       c2      1       0       2M1D1M2I1M2I1M1D2M      *       0       0       CTCGGCGGCTC     *
s4      0       c2      1       0       4M1I1D1I4M      *       0       0       CTAAGGAATC      *
s5      0       c2      1       0       2M3I5D3I2M      *       0       0       CTGGGGGGTC      *

displays for position 5

c2      5       T       7       ..-2AA*+1T*+2GTC+2GG*+1A*       X~~~~~~

It should be:

c2      5       T       7       ..-2AA*+1G*+2GGC+2GG*+1A*       X~~~~~~
@kirkmcclure
Copy link
Contributor

Should it really be:

c2      5       T       7       ..-2AA*+1G*+2GGC+2GG*+1G*       X~~~~~~

CTAA-T-AATC Reference
||||   |||| (CIGAR 4M1I1D1I4M)
CTAAG-GAATC Query

@jkbonfield
Copy link
Contributor Author

Yes my mistake. s4 is *G so *+1G would be correct.

@kirkmcclure
Copy link
Contributor

If we replace sam.c resolve_cigar2()

else if (op == BAM_CINS || op == BAM_CSOFT_CLIP) s->y += l;

with:

else if (op == BAM_CINS || op == BAM_CSOFT_CLIP) {
      if (op != BAM_CINS || _cop(cigar[s->k-1]) != BAM_CMATCH) // don't increment on match+insert
          s->y += l;
      if (op == BAM_CINS && _cop(cigar[s->k-1]) == BAM_CINS) // actually delete+insert signature
          s->y += l;
}

the results for the example are:

c2      1       C       7       ^!.^!.^!.^!.^!.^!.^!.   X~~~~~~
c2      2       T       7       ..-2AA.-2AA.-2AA.-1A..+3GGG     X~~~~~~
c2      3       A       7       .****.* X~~~~~~
c2      4       A       7       .**+1G*+2GGC+2GG.+1G*   X~~~~~~
c2      5       T       7       ..-2AA*+1G*+2GGC+2GG*+1G*       X~~~~~~
c2      6       A       7       .***C-1A.*      X~~~~~~
c2      7       A       6       .***.*+3GGG     X~~~~~
c2      8       T       6       ......  X~~~~~
c2      9       C       6       .$.$.$.$.$.$    X~~~~~

(and ~ 35 failures for samtools test/mpileup)

@jkbonfield
Copy link
Contributor Author

Looks promising and I think it is definitely along the right lines. It may need an s->k > 0 check too so that s->k-1 doesn't fail. This would prevent problems when cigar strings start with an insertion. Admittedly this is odd, but it can be valid in a multiple alignment.

I also wonder if it solves issues with 10M1D1P1I10M. Ie a pad operator between the D and the I. I haven't checked yet if this gives incorrect pileup, but most likely. Perhaps therefore it just needs a flag being tracked to indicate "last_base_is_del" which is set on D and unset of M=XI etc (not not P). This would also solve starting on an insertion. I need to think of more corner cases for testing.

@jkbonfield
Copy link
Contributor Author

Alas I see more problems that I didn't spot originally. Look at s5 in isolation:

ref CC***AATAA***CC
s5  CTGGG*****GGGTC
    2M 3I  5D 3I 2M

The cigar string shows s5 having insertion followed by deletion followed by insertion.

Position 2 therefore should probably read T+3GGG-5AATAA and position 7 is *+3GGG (correct).
It looks like there is no support for emitting both + and - from the same reference coordinate. I'm not even sure if most mpileup parsers would accept that syntax, although it sounds like it is valid.

@kirkmcclure
Copy link
Contributor

We need another int in (sam.h) bam_pileup1_t struct, say mismatchdellen
Then, in sam.c resolve_cigar2(), replace:

            else if (op2 == BAM_CINS)
                p->indel = l2;

with:

            else if (op2 == BAM_CINS) {
                p->indel = l2;
                if (s->k+2<c->n_cigar && _cop(cigar[s->k+2]) == BAM_CDEL) {
                    p->mismatchdellen = 0 - _cln(cigar[s->k+2]);
                }
            }

and augment bam_plcmd.c pileup_seq() p->indel condition with:

        if (p->mismatchdellen < 0) {
            printw(p->mismatchdellen, fp);
            for (j = 1; j <= -p->mismatchdellen; ++j) {
                int c = (ref && (int)pos+j < ref_len)? ref[pos+j] : 'N';
                putc(bam_is_rev(p->b)? tolower(c) : toupper(c), fp);
            }
        }

the results for the example are:

c2      1       C       7       ^!.^!.^!.^!.^!.^!.^!.   X~~~~~~
c2      2       T       7       ..-2AA.-2AA.-2AA.-1A..+3GGG-5AATAA      X~~~~~~
c2      3       A       7       .****.* X~~~~~~
c2      4       A       7       .**+1G-1T*+2GG-1TC+2GG.+1G-1T*  X~~~~~~
c2      5       T       7       ..-2AA*+1G-2AA*+2GG-2AAC+2GG*+1G*       X~~~~~~
c2      6       A       7       .***C-1A.*      X~~~~~~
c2      7       A       7       .****.*+3GGG    X~~~~~~
c2      8       T       7       ....... X~~~~~~
c2      9       C       7       .$.$.$.$.$.$.$  X~~~~~~

P.S. corrections to previous proposal:

    if (op != BAM_CINS || (s->k > 0 && _cop(cigar[s->k-1]) != BAM_CMATCH)) // don't increment on match+insert
        s->y += l;
    if (op == BAM_CINS && k>=2 && _cop(cigar[k-1]) == BAM_CDEL && _cop(cigar[k-2]) == BAM_CINS)  // increment on 2nd ins of ins+del+ins
        s->y += l;

jmarshall pushed a commit to samtools/samtools that referenced this issue May 17, 2016
Fixes #496.  Also fixed the tests for c1#pad1.bam and c1#pad2.bam;
they should have been known failures instead of know passes (see
issues samtools/htslib#57 and samtools/htslib#59).
pd3 pushed a commit to pd3/htslib that referenced this issue May 4, 2018
It may be annoying to manually search binary expansions of decimal
integers to get the corresponding flags.  Hat tip @karel-brinda.

Add footnote to SAM example noting the flag meanings of each bit in the
example, as this is the first time the reader sees the bitwise FLAGs.

Add footnote link to Wikipedia discussion of bitwise manipulation.

Add decimal bit values to flag table: non-programmers can construct bit
masks by adding powers of two together, but they don't know hexadecimal.
Fix 0x20 description: the mate is reverse-COMPLEMENTED.

Add hex equivalents to multi-segment annotation example, and reformat
as a table for space and to emphasize the interesting parts.  Fix typo
in GenBank entry.

Closes samtools#59.
@jkbonfield
Copy link
Contributor Author

More dumb examples, this time with "N".

s1	0	z	1	0	2N4M2N3M2N	*	0	0	GCTTCAG	*

Reported as:

z	1	T	1	^!.	B
z	2	A	1	.	a
z	3	G	1	.	~
z	4	C	1	.	~
z	5	T	1	.	~
z	6	T	1	.	~
z	7	A	1	>	~
z	8	G	1	>	~
z	9	C	1	.	~
z	10	A	1	.	~
z	11	G	1	.	~
z	12	G	0	*	*
z	13	T	0	*	*

Some oddities. resolve_cigar2 treats ref_skip as special when starting an alignment, but not when ending. Hence "." vs "*" at head and tail. Secondly it reports quality "B" and "a", which are coming from the last few bases of sequence instead of quality itself.

We can even dump out chunks of internal memory before a read starts, eg with a single read with cigar 50000N1M and ./samtools.dev mpileup n.sam |egrep -v '\*$'|awk '{print $NF}'|tr -d '\012' | tr 'B-}' '\041-\134'

I have a fix for this and the above problems, being tested now.

@valeriuo
Copy link
Contributor

Fixed by samtools/samtools#847

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants