Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added base modification interfaces. #1132

Merged
merged 7 commits into from
Aug 10, 2021
Merged

Conversation

jkbonfield
Copy link
Contributor

First draft of the base modification (methylation etc) API for comment on interface.

I'll work on samtools to add an option to get this into the mpileup text output too, once I've figured out exactly what we want it to look like. (Probably like one of the test/pileup_mod choices.)


The general principle is

  1. Allocate a base modification state
  2. Parse mods
  3. Iterate over mods or query positions
  4. Destroy state.

An example usage is as follows (minus error checking for clarity):

hts_base_mod_state *m = hts_base_mod_state_alloc();
sam_read1(fp, hdr, b);
bam_parse_basemod(b, m);

hts_base_mod mod;
while ((n = bam_next_basemod(b, m, &mod, 1, &pos) > 0))
    printf("Modified %c to %c at %d\n",
        mod.canonical_base, mod.modified_base, pos);
}
hts_base_mod_state_free(m);

If being called from an iterator that marches along the query sequence
we may wish to have a loop calling bam_mods_at_next_pos instead (pos
0, 1, 2, 3 etc).

A simple example of this is in test/test_mod.c

It may also be used from within a pileup iterator which marches along
query position instead of sequence position. In this case call
bam_mods_at_qpos instead with pileup->qpos; which is normally +1 each
time but may be more or less with indels.

A simple example of this is in test/pileup_mod.c although it doesn't
look for modifications within indels. (That would need to mix in
bam_mods_at_next_pos calls for each base in the indel.)

Note currently bam_mods_at_qpos is a brain-dead loop around
bam_mods_at_next_pos which means it may be excessive when huge
soft-clips are being used, however this optimisation may happen later.

Right now the overhead of pileup with modifications isn't vast. On a
~1Kb long seqs it's around 70% slower than the same code without these
calls. On ~50Kb seqs it drops to aroudn 20% slower. Hence the
significant cost in the initial bam_parse_basemod call.

#
# Copyright (C) 2020 Genome Research Ltd.
#
# Author: James Bonfield <rmd@sanger.ac.uk>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rmd?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol cut, paste ;)

The general principle is

1. Allocate a base modification state
2. Parse mods
3. Iterate over mods or query positions
4. Destroy state.

An example usage is as follows (minus error checking for clarity):

    hts_base_mod_state *m = hts_base_mod_state_alloc();
    sam_read1(fp, hdr, b);
    bam_parse_basemod(b, m);

    hts_base_mod mod;
    while ((n = bam_next_basemod(b, m, &mod, 1, &pos) > 0))
        printf("Modified %c to %c at %d\n",
            mod.canonical_base, mod.modified_base, pos);
    }
    hts_base_mod_state_free(m);

If being called from an iterator that marches along the query sequence
we may wish to have a loop calling bam_mods_at_next_pos instead (pos
0, 1, 2, 3 etc).

A simple example of this is in test/test_mod.c

It may also be used from within a pileup iterator which marches along
query position instead of sequence position.  In this case call
bam_mods_at_qpos instead with pileup->qpos; which is normally +1 each
time but may be more or less with indels.

A simple example of this is in test/pileup_mod.c although it doesn't
look for modifications within indels.  (That would need to mix in
bam_mods_at_next_pos calls for each base in the indel.)

Note currently bam_mods_at_qpos is a brain-dead loop around
bam_mods_at_next_pos which means it may be excessive when huge
soft-clips are being used, however this optimisation may happen later.

Right now the overhead of pileup with modifications isn't vast. On a
~1Kb long seqs it's around 70% slower than the same code without these
calls.  On ~50Kb seqs it drops to aroudn 20% slower.  Hence the
significant cost in the initial bam_parse_basemod call.
The API for this always had both constructor and destructor as type
int, but it was never documented what this return value is for and nor
was it ever checked.

Although the ABI hasn't changed, this is an API change as we have
added meaning to an unused element.  However (hopefully) the new
meaning is in line with expectations and library norms.
This is a generalised version of bam_plp_insertion (which now calls
this with a NULL argument) that can also include markup for base
modifications.  Used by samtools mpileup.
When we have no base modification, the return value and ins->l are
both the same thing (number of bases inserted).

When we have base modifications, the old length was string length (and
ironically ins->l was incorrectly left as the number of bases).
That's not particularly obvious, nor useful.

It's now reversed, so ins->l is the length of the string ins->s and
the return value is the number of bases.  Eg "+3AG[+o99]A" vs "+3AGA"
both now return 3.
@whitwham
Copy link
Contributor

Removed a couple of my comments as I think the problem is elsewhere. Though as far as I can tell, bam_next_basemod can never return -1 for an error.

htslib/sam.h Show resolved Hide resolved
jkbonfield added a commit to jkbonfield/hts-specs that referenced this pull request Jul 29, 2021
jkbonfield added a commit to jkbonfield/hts-specs that referenced this pull request Jul 29, 2021
There is one more element in Ml than Mm so the final 212 is not used.

Reported by Andrew Whitwham at
samtools/htslib#1132 (comment)
jkbonfield added a commit to jkbonfield/hts-specs that referenced this pull request Jul 29, 2021
There is one more element in Ml than Mm so the final 169 is not used.

Reported by Andrew Whitwham at
samtools/htslib#1132 (comment)
Having an MM tag refer to bases that are beyond the end of the
sequence now generates a warning and returns -1 instead of 0 from the
function. It's a warning because it's harmless to ignore these, but
the caller may wish to check and treat it as a hard error (as the test
harness does).

Note: this isn't trivial to do for the reverse strand, so (for now) we
only check the top strand data.

Also fixed a typo in a comment.
@cjw85
Copy link

cjw85 commented Aug 4, 2021

@jkbonfield Thanks for this.

I've been using the branch in anger recently with a little program to summarize human 5mC data to bedMethyl format. I've not observed any issue. The performance is fine in this use case particularly writing out all contexts; writing the records to file becomes an issue. When filtering to CpG sites I can process 10s of fold coverage of chr11 in less than half a minute --- easily sufficient for practical purposes.

@jkbonfield
Copy link
Contributor Author

Thanks. It's always useful to get comments "from the wild" rather than just internal to the group.

It's long overdue merging, and I think we're pretty much there.

The format of data was correct, but the values summed to a probability
higher than 1.

As reported in samtools/hts-specs#584
@whitwham whitwham merged commit 7ba9ecd into samtools:develop Aug 10, 2021
@jmarshall
Copy link
Member

PR #1320 adds some minor source file fixes that were not noticed before this was merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants