Fix issue 1566 - basemod API and hardclips #1567

jts · 2023-02-17T18:54:19Z

This PR checks whether an aligned read has hard clipped bases, and if so returns -1 from bam_parse_basemod. Fixes #1566

jkbonfield · 2023-02-21T12:38:04Z

The change seems reasonable as an interim measure. Would you mind if I did some squashing/rebasing and pushed back to your branch? (I can do it elsewhere if not, no worries, but your PR will show as closed instead of merged when we get around to it.)

There is the potential for an improvement, but I'm unsure if it really matters. Specifically if vendors used N+m instead of C+m and counted every base instead of every C'th base then the calculation could be made to work (it doesn't right now in that scenario either) by adding in the hard-clip count. However using N instead isn't recommended if lots of base mods are recorded as it increases the file size, but it may be the only option.

I'm unsure of exactly how we cure this long term. The fundamental problem of storing sequence base specific information in a tag is keeping it up to date, given tags are optional and lots of tools manipulate BAM files without knowing about all the latest tag types. This is where nearly all the complexity comes from in handling base mods. (We also couldn't put it in the SEQ field as BAM wouldn't accept that.)

We could say that tools that do the hard-clipping also do the trimming of the MM/ML tags, but how would you be able to distinguish between tools which have been updated to cope with this tag, and those which haven't? You simply can't. So this seems like a very unwise route to go down. Unfortunately it also means that even if we wanted to be good and get tools that hard-clip to do the right thing, there isn't really a right thing available.

So it is indeed feeling like the only solution which is viable is not to count every C'th (or Gth, Ath, etc) base and just case every base - ie the N+m syntax instead. Sad that these pitfalls weren't considered earlier. That means after hard clipping, it's still possible to work out which bases are being referred to rather than the total rejection we have here. It's quite a bit more work to handle this case though, which is why I say this is a good interim solution.

Thoughts?

jts · 2023-02-21T14:20:36Z

Thanks @jkbonfield, fine to squash and rebase in any way that's convenient for you. I'll come back with some thoughts for the points you raised a little later.

jts · 2023-02-24T14:45:29Z

I'm back now with some comments. I do think hardclips should be handled somehow but this isn't an urgent problem since users that care about modifications at supplementary alignments can tell minimap2 to use softclips instead of hardclips with the -Y option.

We could say that tools that do the hard-clipping also do the trimming of the MM/ML tags, but how would you be able to distinguish between tools which have been updated to cope with this tag, and those which haven't? You simply can't.

I agree we can't ask the SAM producers to update the MM tags but is it worth considering a tool that fixes the MM tag for hardclipped reads? There's precedent with samtools fixmate and samtools calmd. We'd need another flag though to indicate whether MM accounts for hardclips (I guess something like a mode bit to say whether the offsets refer to SEQ or whatever the basecalled read was). At this point it might be best to switch to a bitwise flag instead of the symbolic flags (this has the benefit of reserving bits for future cases that come up).

I'm not sure how I feel about switching everything to N+m notation. I think here explicit mode would be required since nearly all of the skipped bases were not called.

@cjw85 thoughts?

cjw85 · 2023-02-24T15:01:08Z

I like the idea of a samtools command that fixes up "broken" MM tags. It does immediately raise the need to then encode some extra state as you say. What a nightmare!

This would then I think necessitate a need to beef up the API in order to serve clients with the correct interpretation of all the information.

I'm in two minds about N+m. On the one hand it just feels wrong. On the other before the MM tags were created I was just storing modification information in dense array-valued tags.

jkbonfield · 2023-02-24T16:12:15Z

Indeed. One reason I haven't started working on the API again is I'm not 100% sure I know the final requirements yet! All help and suggestions welcomed here.

It's hard to know what fix can be applied to a read with hard clipping and MM, unless we extend the MM syntax and the fixmates equiv command turn adds the additional flags as appropriate. So that needs both spec update and API updates. Or it uses N+m style syntax and we just define what that means with hardclips.

jkbonfield · 2023-03-07T15:17:14Z

There is ongoing discussion about this on twitter:

https://twitter.com/Psy_Fer_/status/1632943849386692608
https://twitter.com/DrKangabu/status/1632934256019402752

This appears to show that some tools are doing hard-clipping and creating MM tags in the same pipeline (I assume just the two steps in that order). So this PR would make that correct looking data fail.

We're kind of stuck, but I wonder whether the right thing is just to sress that the MM tags count along the sequence as recorded in the SEQ record (albeit in original orientation) - this is already stated in the spec - and for the implementations to have sanity checks to ensure that out of bounds data is reported.

So rather than just rejecting the data, for reads with hard-clipping we could check if Nth C base is beyond the end of the recorded seq. It's not ideal from a performance viewpoint, and we may erroneously accept some duff data, but given a sufficient number of sequences it seems highly unlikely that they'd all accidentally pass the checks.

Long term we need to figure out a way of recording the difference between MM-corrected and MM-uncorrected (and now broken). I can't think how to do that unless it's a third party piece of data. Eg embedding the sequence length as an error check. If we modify SEQ len in any way and don't make a corresponding edit to MM, then MM is now invalid. It's disappointing as it'd be yet another format change, but I cannot see any other robust way to do it unless it's an additional tag. Eg MZ:i:<seq_len> for Modification-siZe check.

Thoughts?

jkbonfield · 2023-03-21T12:13:32Z

I'm going to close this as we know it will affect some pipelines which already do the right thing. So stuck between a rock and a hard place of detecting erroneous data vs breaking corrected data.

I'll work on an MZ:i alternative tag instead to act as a validation.

jts added 3 commits February 17, 2023 13:42

fail bam_parse_basemod for hardclips (samtools#1566)

eac5d68

explicitly check for hard clip instead of query/ref consumption

266223b

fix comment

6b9e816

daviesrob assigned jkbonfield Feb 28, 2023

jkbonfield mentioned this pull request Mar 7, 2023

MM tag and hard clipped alignments samtools/hts-specs#646

Open

jkbonfield closed this Mar 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix issue 1566 - basemod API and hardclips #1567

Fix issue 1566 - basemod API and hardclips #1567

jts commented Feb 17, 2023 •

edited

jkbonfield commented Feb 21, 2023

jts commented Feb 21, 2023

jts commented Feb 24, 2023 •

edited

cjw85 commented Feb 24, 2023

jkbonfield commented Feb 24, 2023

jkbonfield commented Mar 7, 2023

jkbonfield commented Mar 21, 2023

Fix issue 1566 - basemod API and hardclips #1567

Fix issue 1566 - basemod API and hardclips #1567

Conversation

jts commented Feb 17, 2023 • edited

jkbonfield commented Feb 21, 2023

jts commented Feb 21, 2023

jts commented Feb 24, 2023 • edited

cjw85 commented Feb 24, 2023

jkbonfield commented Feb 24, 2023

jkbonfield commented Mar 7, 2023

jkbonfield commented Mar 21, 2023

jts commented Feb 17, 2023 •

edited

jts commented Feb 24, 2023 •

edited