-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added base modification interfaces. #1132
Conversation
test/base_mods/base-mods.sh
Outdated
# | ||
# Copyright (C) 2020 Genome Research Ltd. | ||
# | ||
# Author: James Bonfield <rmd@sanger.ac.uk> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rmd?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lol cut, paste ;)
The general principle is 1. Allocate a base modification state 2. Parse mods 3. Iterate over mods or query positions 4. Destroy state. An example usage is as follows (minus error checking for clarity): hts_base_mod_state *m = hts_base_mod_state_alloc(); sam_read1(fp, hdr, b); bam_parse_basemod(b, m); hts_base_mod mod; while ((n = bam_next_basemod(b, m, &mod, 1, &pos) > 0)) printf("Modified %c to %c at %d\n", mod.canonical_base, mod.modified_base, pos); } hts_base_mod_state_free(m); If being called from an iterator that marches along the query sequence we may wish to have a loop calling bam_mods_at_next_pos instead (pos 0, 1, 2, 3 etc). A simple example of this is in test/test_mod.c It may also be used from within a pileup iterator which marches along query position instead of sequence position. In this case call bam_mods_at_qpos instead with pileup->qpos; which is normally +1 each time but may be more or less with indels. A simple example of this is in test/pileup_mod.c although it doesn't look for modifications within indels. (That would need to mix in bam_mods_at_next_pos calls for each base in the indel.) Note currently bam_mods_at_qpos is a brain-dead loop around bam_mods_at_next_pos which means it may be excessive when huge soft-clips are being used, however this optimisation may happen later. Right now the overhead of pileup with modifications isn't vast. On a ~1Kb long seqs it's around 70% slower than the same code without these calls. On ~50Kb seqs it drops to aroudn 20% slower. Hence the significant cost in the initial bam_parse_basemod call.
The API for this always had both constructor and destructor as type int, but it was never documented what this return value is for and nor was it ever checked. Although the ABI hasn't changed, this is an API change as we have added meaning to an unused element. However (hopefully) the new meaning is in line with expectations and library norms.
This is a generalised version of bam_plp_insertion (which now calls this with a NULL argument) that can also include markup for base modifications. Used by samtools mpileup.
When we have no base modification, the return value and ins->l are both the same thing (number of bases inserted). When we have base modifications, the old length was string length (and ironically ins->l was incorrectly left as the number of bases). That's not particularly obvious, nor useful. It's now reversed, so ins->l is the length of the string ins->s and the return value is the number of bases. Eg "+3AG[+o99]A" vs "+3AGA" both now return 3.
Removed a couple of my comments as I think the problem is elsewhere. Though as far as I can tell, bam_next_basemod can never return -1 for an error. |
Reported by Andrew Whitwham at samtools/htslib#1132 (comment)
There is one more element in Ml than Mm so the final 212 is not used. Reported by Andrew Whitwham at samtools/htslib#1132 (comment)
There is one more element in Ml than Mm so the final 169 is not used. Reported by Andrew Whitwham at samtools/htslib#1132 (comment)
Having an MM tag refer to bases that are beyond the end of the sequence now generates a warning and returns -1 instead of 0 from the function. It's a warning because it's harmless to ignore these, but the caller may wish to check and treat it as a hard error (as the test harness does). Note: this isn't trivial to do for the reverse strand, so (for now) we only check the top strand data. Also fixed a typo in a comment.
@jkbonfield Thanks for this. I've been using the branch in anger recently with a little program to summarize human 5mC data to bedMethyl format. I've not observed any issue. The performance is fine in this use case particularly writing out all contexts; writing the records to file becomes an issue. When filtering to CpG sites I can process 10s of fold coverage of chr11 in less than half a minute --- easily sufficient for practical purposes. |
Thanks. It's always useful to get comments "from the wild" rather than just internal to the group. It's long overdue merging, and I think we're pretty much there. |
The format of data was correct, but the values summed to a probability higher than 1. As reported in samtools/hts-specs#584
PR #1320 adds some minor source file fixes that were not noticed before this was merged. |
First draft of the base modification (methylation etc) API for comment on interface.
I'll work on samtools to add an option to get this into the mpileup text output too, once I've figured out exactly what we want it to look like. (Probably like one of the test/pileup_mod choices.)
The general principle is
An example usage is as follows (minus error checking for clarity):
If being called from an iterator that marches along the query sequence
we may wish to have a loop calling bam_mods_at_next_pos instead (pos
0, 1, 2, 3 etc).
A simple example of this is in test/test_mod.c
It may also be used from within a pileup iterator which marches along
query position instead of sequence position. In this case call
bam_mods_at_qpos instead with pileup->qpos; which is normally +1 each
time but may be more or less with indels.
A simple example of this is in test/pileup_mod.c although it doesn't
look for modifications within indels. (That would need to mix in
bam_mods_at_next_pos calls for each base in the indel.)
Note currently bam_mods_at_qpos is a brain-dead loop around
bam_mods_at_next_pos which means it may be excessive when huge
soft-clips are being used, however this optimisation may happen later.
Right now the overhead of pileup with modifications isn't vast. On a
~1Kb long seqs it's around 70% slower than the same code without these
calls. On ~50Kb seqs it drops to aroudn 20% slower. Hence the
significant cost in the initial bam_parse_basemod call.