Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue 1566 - basemod API and hardclips #1567

Closed
wants to merge 3 commits into from

Conversation

jts
Copy link

@jts jts commented Feb 17, 2023

This PR checks whether an aligned read has hard clipped bases, and if so returns -1 from bam_parse_basemod. Fixes #1566

@jkbonfield
Copy link
Contributor

The change seems reasonable as an interim measure. Would you mind if I did some squashing/rebasing and pushed back to your branch? (I can do it elsewhere if not, no worries, but your PR will show as closed instead of merged when we get around to it.)

There is the potential for an improvement, but I'm unsure if it really matters. Specifically if vendors used N+m instead of C+m and counted every base instead of every C'th base then the calculation could be made to work (it doesn't right now in that scenario either) by adding in the hard-clip count. However using N instead isn't recommended if lots of base mods are recorded as it increases the file size, but it may be the only option.

I'm unsure of exactly how we cure this long term. The fundamental problem of storing sequence base specific information in a tag is keeping it up to date, given tags are optional and lots of tools manipulate BAM files without knowing about all the latest tag types. This is where nearly all the complexity comes from in handling base mods. (We also couldn't put it in the SEQ field as BAM wouldn't accept that.)

We could say that tools that do the hard-clipping also do the trimming of the MM/ML tags, but how would you be able to distinguish between tools which have been updated to cope with this tag, and those which haven't? You simply can't. So this seems like a very unwise route to go down. Unfortunately it also means that even if we wanted to be good and get tools that hard-clip to do the right thing, there isn't really a right thing available.

So it is indeed feeling like the only solution which is viable is not to count every C'th (or Gth, Ath, etc) base and just case every base - ie the N+m syntax instead. Sad that these pitfalls weren't considered earlier. That means after hard clipping, it's still possible to work out which bases are being referred to rather than the total rejection we have here. It's quite a bit more work to handle this case though, which is why I say this is a good interim solution.

Thoughts?

@jts
Copy link
Author

jts commented Feb 21, 2023

Thanks @jkbonfield, fine to squash and rebase in any way that's convenient for you. I'll come back with some thoughts for the points you raised a little later.

@jts
Copy link
Author

jts commented Feb 24, 2023

I'm back now with some comments. I do think hardclips should be handled somehow but this isn't an urgent problem since users that care about modifications at supplementary alignments can tell minimap2 to use softclips instead of hardclips with the -Y option.

We could say that tools that do the hard-clipping also do the trimming of the MM/ML tags, but how would you be able to distinguish between tools which have been updated to cope with this tag, and those which haven't? You simply can't.

I agree we can't ask the SAM producers to update the MM tags but is it worth considering a tool that fixes the MM tag for hardclipped reads? There's precedent with samtools fixmate and samtools calmd. We'd need another flag though to indicate whether MM accounts for hardclips (I guess something like a mode bit to say whether the offsets refer to SEQ or whatever the basecalled read was). At this point it might be best to switch to a bitwise flag instead of the symbolic flags (this has the benefit of reserving bits for future cases that come up).

I'm not sure how I feel about switching everything to N+m notation. I think here explicit mode would be required since nearly all of the skipped bases were not called.

@cjw85 thoughts?

@cjw85
Copy link

cjw85 commented Feb 24, 2023

I like the idea of a samtools command that fixes up "broken" MM tags. It does immediately raise the need to then encode some extra state as you say. What a nightmare!

This would then I think necessitate a need to beef up the API in order to serve clients with the correct interpretation of all the information.

I'm in two minds about N+m. On the one hand it just feels wrong. On the other before the MM tags were created I was just storing modification information in dense array-valued tags.

@jkbonfield
Copy link
Contributor

Indeed. One reason I haven't started working on the API again is I'm not 100% sure I know the final requirements yet! All help and suggestions welcomed here.

It's hard to know what fix can be applied to a read with hard clipping and MM, unless we extend the MM syntax and the fixmates equiv command turn adds the additional flags as appropriate. So that needs both spec update and API updates. Or it uses N+m style syntax and we just define what that means with hardclips.

@jkbonfield
Copy link
Contributor

There is ongoing discussion about this on twitter:

https://twitter.com/Psy_Fer_/status/1632943849386692608
https://twitter.com/DrKangabu/status/1632934256019402752

This appears to show that some tools are doing hard-clipping and creating MM tags in the same pipeline (I assume just the two steps in that order). So this PR would make that correct looking data fail.

We're kind of stuck, but I wonder whether the right thing is just to sress that the MM tags count along the sequence as recorded in the SEQ record (albeit in original orientation) - this is already stated in the spec - and for the implementations to have sanity checks to ensure that out of bounds data is reported.

So rather than just rejecting the data, for reads with hard-clipping we could check if Nth C base is beyond the end of the recorded seq. It's not ideal from a performance viewpoint, and we may erroneously accept some duff data, but given a sufficient number of sequences it seems highly unlikely that they'd all accidentally pass the checks.

Long term we need to figure out a way of recording the difference between MM-corrected and MM-uncorrected (and now broken). I can't think how to do that unless it's a third party piece of data. Eg embedding the sequence length as an error check. If we modify SEQ len in any way and don't make a corresponding edit to MM, then MM is now invalid. It's disappointing as it'd be yet another format change, but I cannot see any other robust way to do it unless it's an additional tag. Eg MZ:i:<seq_len> for Modification-siZe check.

Thoughts?

@jkbonfield
Copy link
Contributor

I'm going to close this as we know it will affect some pipelines which already do the right thing. So stuck between a rock and a hard place of detecting erroneous data vs breaking corrected data.

I'll work on an MZ:i alternative tag instead to act as a validation.

@jkbonfield jkbonfield closed this Mar 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

basemod API and hardclipped alignments
3 participants