Added base modification interfaces. #1132

jkbonfield · 2020-08-27T15:44:47Z

First draft of the base modification (methylation etc) API for comment on interface.

I'll work on samtools to add an option to get this into the mpileup text output too, once I've figured out exactly what we want it to look like. (Probably like one of the test/pileup_mod choices.)

The general principle is

Allocate a base modification state
Parse mods
Iterate over mods or query positions
Destroy state.

An example usage is as follows (minus error checking for clarity):

hts_base_mod_state *m = hts_base_mod_state_alloc();
sam_read1(fp, hdr, b);
bam_parse_basemod(b, m);

hts_base_mod mod;
while ((n = bam_next_basemod(b, m, &mod, 1, &pos) > 0))
    printf("Modified %c to %c at %d\n",
        mod.canonical_base, mod.modified_base, pos);
}
hts_base_mod_state_free(m);

If being called from an iterator that marches along the query sequence
we may wish to have a loop calling bam_mods_at_next_pos instead (pos
0, 1, 2, 3 etc).

A simple example of this is in test/test_mod.c

It may also be used from within a pileup iterator which marches along
query position instead of sequence position. In this case call
bam_mods_at_qpos instead with pileup->qpos; which is normally +1 each
time but may be more or less with indels.

A simple example of this is in test/pileup_mod.c although it doesn't
look for modifications within indels. (That would need to mix in
bam_mods_at_next_pos calls for each base in the indel.)

Note currently bam_mods_at_qpos is a brain-dead loop around
bam_mods_at_next_pos which means it may be excessive when huge
soft-clips are being used, however this optimisation may happen later.

Right now the overhead of pileup with modifications isn't vast. On a
~1Kb long seqs it's around 70% slower than the same code without these
calls. On ~50Kb seqs it drops to aroudn 20% slower. Hence the
significant cost in the initial bam_parse_basemod call.

jmarshall · 2020-08-28T09:25:51Z

test/base_mods/base-mods.sh

+#
+#    Copyright (C) 2020 Genome Research Ltd.
+#
+#    Author: James Bonfield <rmd@sanger.ac.uk>


Lol cut, paste ;)

The general principle is 1. Allocate a base modification state 2. Parse mods 3. Iterate over mods or query positions 4. Destroy state. An example usage is as follows (minus error checking for clarity): hts_base_mod_state *m = hts_base_mod_state_alloc(); sam_read1(fp, hdr, b); bam_parse_basemod(b, m); hts_base_mod mod; while ((n = bam_next_basemod(b, m, &mod, 1, &pos) > 0)) printf("Modified %c to %c at %d\n", mod.canonical_base, mod.modified_base, pos); } hts_base_mod_state_free(m); If being called from an iterator that marches along the query sequence we may wish to have a loop calling bam_mods_at_next_pos instead (pos 0, 1, 2, 3 etc). A simple example of this is in test/test_mod.c It may also be used from within a pileup iterator which marches along query position instead of sequence position. In this case call bam_mods_at_qpos instead with pileup->qpos; which is normally +1 each time but may be more or less with indels. A simple example of this is in test/pileup_mod.c although it doesn't look for modifications within indels. (That would need to mix in bam_mods_at_next_pos calls for each base in the indel.) Note currently bam_mods_at_qpos is a brain-dead loop around bam_mods_at_next_pos which means it may be excessive when huge soft-clips are being used, however this optimisation may happen later. Right now the overhead of pileup with modifications isn't vast. On a ~1Kb long seqs it's around 70% slower than the same code without these calls. On ~50Kb seqs it drops to aroudn 20% slower. Hence the significant cost in the initial bam_parse_basemod call.

The API for this always had both constructor and destructor as type int, but it was never documented what this return value is for and nor was it ever checked. Although the ABI hasn't changed, this is an API change as we have added meaning to an unused element. However (hopefully) the new meaning is in line with expectations and library norms.

This is a generalised version of bam_plp_insertion (which now calls this with a NULL argument) that can also include markup for base modifications. Used by samtools mpileup.

When we have no base modification, the return value and ins->l are both the same thing (number of bases inserted). When we have base modifications, the old length was string length (and ironically ins->l was incorrectly left as the number of bases). That's not particularly obvious, nor useful. It's now reversed, so ins->l is the length of the string ins->s and the return value is the number of bases. Eg "+3AG[+o99]A" vs "+3AGA" both now return 3.

test/base_mods/MM-chebi.sam

whitwham · 2021-07-28T14:52:44Z

Removed a couple of my comments as I think the problem is elsewhere. Though as far as I can tell, bam_next_basemod can never return -1 for an error.

htslib/sam.h

Reported by Andrew Whitwham at samtools/htslib#1132 (comment)

There is one more element in Ml than Mm so the final 212 is not used. Reported by Andrew Whitwham at samtools/htslib#1132 (comment)

There is one more element in Ml than Mm so the final 169 is not used. Reported by Andrew Whitwham at samtools/htslib#1132 (comment)

Having an MM tag refer to bases that are beyond the end of the sequence now generates a warning and returns -1 instead of 0 from the function. It's a warning because it's harmless to ignore these, but the caller may wish to check and treat it as a hard error (as the test harness does). Note: this isn't trivial to do for the reverse strand, so (for now) we only check the top strand data. Also fixed a typo in a comment.

cjw85 · 2021-08-04T17:50:11Z

@jkbonfield Thanks for this.

I've been using the branch in anger recently with a little program to summarize human 5mC data to bedMethyl format. I've not observed any issue. The performance is fine in this use case particularly writing out all contexts; writing the records to file becomes an issue. When filtering to CpG sites I can process 10s of fold coverage of chr11 in less than half a minute --- easily sufficient for practical purposes.

jkbonfield · 2021-08-05T08:09:50Z

Thanks. It's always useful to get comments "from the wild" rather than just internal to the group.

It's long overdue merging, and I think we're pretty much there.

The format of data was correct, but the values summed to a probability higher than 1. As reported in samtools/hts-specs#584

jmarshall · 2021-08-10T21:40:10Z

PR #1320 adds some minor source file fixes that were not noticed before this was merged.

jmarshall reviewed Aug 28, 2020

View reviewed changes

jkbonfield mentioned this pull request Sep 17, 2020

Describe MM and MP methylation tags. samtools/hts-specs#418

Merged

jkbonfield mentioned this pull request Jan 22, 2021

bcftools consensus: mark deletions with character samtools/bcftools#1381

Closed

a-slide mentioned this pull request Apr 1, 2021

Support for extra DNA/RNA modified bases. igvteam/igv#945

Closed

jkbonfield added 4 commits July 14, 2021 16:36

Fix copyright notice

4763c69

Added bam_plp_insertion_mod API.

471488a

This is a generalised version of bam_plp_insertion (which now calls this with a NULL argument) that can also include markup for base modifications. Used by samtools mpileup.

jkbonfield force-pushed the methylation branch from 7df03ce to 471488a Compare July 14, 2021 15:39

jkbonfield force-pushed the methylation branch from 424e4b0 to 770eb39 Compare July 15, 2021 15:12

daviesrob assigned whitwham Jul 20, 2021

whitwham reviewed Jul 28, 2021

View reviewed changes

test/base_mods/MM-chebi.sam Outdated Show resolved Hide resolved

whitwham reviewed Jul 29, 2021

View reviewed changes

htslib/sam.h Show resolved Hide resolved

jkbonfield added a commit to jkbonfield/hts-specs that referenced this pull request Jul 29, 2021

Remove extra Ml quality value.

3d47ec4

Reported by Andrew Whitwham at samtools/htslib#1132 (comment)

jkbonfield added a commit to jkbonfield/hts-specs that referenced this pull request Jul 29, 2021

Remove extra Ml quality value.

e5244c9

There is one more element in Ml than Mm so the final 212 is not used. Reported by Andrew Whitwham at samtools/htslib#1132 (comment)

jkbonfield added a commit to jkbonfield/hts-specs that referenced this pull request Jul 29, 2021

Remove extra Ml quality value.

8cc4d4a

There is one more element in Ml than Mm so the final 169 is not used. Reported by Andrew Whitwham at samtools/htslib#1132 (comment)

jkbonfield mentioned this pull request Jul 29, 2021

Remove extra Ml quality value. samtools/hts-specs#584

Merged

jkbonfield force-pushed the methylation branch from 435e00c to 56b4930 Compare August 4, 2021 15:43

Small fix to MM-chebi.sam methylation test data.

8d39bef

The format of data was correct, but the values summed to a probability higher than 1. As reported in samtools/hts-specs#584

whitwham merged commit 7ba9ecd into samtools:develop Aug 10, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added base modification interfaces. #1132

Added base modification interfaces. #1132

jkbonfield commented Aug 27, 2020

jmarshall Aug 28, 2020

jkbonfield Aug 28, 2020

whitwham commented Jul 28, 2021

cjw85 commented Aug 4, 2021

jkbonfield commented Aug 5, 2021

jmarshall commented Aug 10, 2021

Added base modification interfaces. #1132

Added base modification interfaces. #1132

Conversation

jkbonfield commented Aug 27, 2020

jmarshall Aug 28, 2020

Choose a reason for hiding this comment

jkbonfield Aug 28, 2020

Choose a reason for hiding this comment

whitwham commented Jul 28, 2021

cjw85 commented Aug 4, 2021

jkbonfield commented Aug 5, 2021

jmarshall commented Aug 10, 2021