Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CIGAR parsing API #1147

Closed
d-cameron opened this issue Sep 29, 2020 · 5 comments
Closed

CIGAR parsing API #1147

d-cameron opened this issue Sep 29, 2020 · 5 comments
Assignees

Comments

@d-cameron
Copy link

CIGAR text parsing is currently hidden inside sam_parse1() and not exposed by the API. CIGAR parsing is a generally useful function that should be exposed - especially since it's required to handle the spec-defined MC and SA tags.

This issue is a feature request for the htslib API to be expanded to include a function to parse a string into the htslib CIGAR representation format.

@valeriuo
Copy link
Contributor

valeriuo commented Sep 29, 2020

This issue is a feature request for the htslib API to be expanded to include a function to parse a string into the htslib CIGAR representation format.

Which format are you referring to? The CIGAR string is converted to a uint32_t array (with 32 bits for every CIGAR op: length<<4|operation), which is attached to a bam1_t record. Now, to carry a dummy bam1_t record around just to hold a CIGAR string is not very convenient, so, most likely, you would end up with something like:

uint32_t *bam_cigar2array(char *cigar, size_t n_cigar)

Is this what you are looking for?

@jmarshall
Copy link
Member

A good name for this would perhaps be sam_parse_cigar() — to parallel sam_parse1() and {fai,hts,sam}_parse_region() etc.

@valeriuo is right that one of the reasons this doesn't exist yet is that HTSlib doesn't really have a standalone CIGAR representation format(!). sam_parse1() effectively decodes it into a uint32_t array embedded in a bam1_t with the bam1_t handling the memory management, and there are bam_cigar2[qr]len() which take a count and uint32_t pointer, but that's about it.

So returning a count and uint32_t array is probably the desired representation, but it would be good if the new function was reasonably flexible re memory management and let you reuse an existing buffer. e.g. it could write to a kstring_t, or have uint32_t **vecp, size_t *maxp parameters (a la kstring_t's s and m) that it would reallocate for you if necessary. (I thought HTSlib had functions like the latter already, but didn't immediately spot one in htslib/*.h.)

@d-cameron
Copy link
Author

Is this what you are looking for?
So returning a count and uint32_t array is probably the desired representation

Yes, this is exactly what I'm after. My particular use case is parsing the MC and SA tags and feeding to bam_cigar2[qr]len() to determine fragment alignment overlaps so if the return value was another format then I'd just be converting to count/uint32_t array anyway.

What I wasn't sure of (hence the issue) was how minimal the parameter set should be@jmarshall makes a good point about memory management and there's also the question of whether a hts_str2int()-style end pointer should be an argument. It would be useful both to a refactored sam_parse1() as well as SA tag parsing but it's a less clean API.

@jkbonfield
Copy link
Contributor

I'd think a sam_format_cigar function taking uint32_t* (in) and kstring_t* (out) arguments would be a useful addition too, and may be called from sam_format1 (although it may need inlining if it's not to harm speed).

@jkbonfield
Copy link
Contributor

Fixed in 9a55e4e

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants