Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AlignmentHeader.from_dict() silently discards the TP attribute from SQ lines #1237

Open
tfenne opened this issue Oct 18, 2023 · 3 comments
Open

Comments

@tfenne
Copy link

tfenne commented Oct 18, 2023

The TP attribute in sequence lines in the SAM header is used to encode whether a reference sequence is linear or circular. Unfortunately AlignmentHeader.from_dict() silently removes this tag from the data:

> header = { "SQ": [{"SN": "some-plasmid", "LN": 5000, "TP": "circular"}] }
> AlignmentHeader.from_dict(header).to_dict()
OrderedDict([('SQ', [{'SN': 'some-plasmid', 'LN': 5000}])])
@jmarshall
Copy link
Member

I really dislike this “I only believe in fields I know about” aspect of this part of pysam and would rather remove it. Those tables of known attributes are useful to return LN as int and so on, but IMHO unknown-at-the-time attributes should be preserved and default to str instead of being dropped.

Perhaps that's a quick fix too, but in the meantime PR #1238 would be a good stopgap.

@tfenne
Copy link
Author

tfenne commented Oct 19, 2023

I totally agree @jmarshall.

Quoting the spec on the header (emphasis in the second paragraph mine):

1.3 The header section
Each header line begins with the character ‘@’ followed by one of the two-letter header record type codes
defined in this section. In the header, each line is TAB-delimited and, apart from @co lines, each data field
follows a format ‘TAG:VALUE’ where TAG is a two-character string that defines the format and content of
VALUE. Thus header lines match /^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/ or /^@CO\t.*/. Within each (non-@co) header line, no field tag may appear more than once and the order in which the fields
appear is not significant.

The following table describes the header record types that may be used and their predefined tags. Tags
listed with ‘*’ are required; e.g., every @sq header line must have SN and LN fields. As with alignment optional
fields (see Section 1.5), you can freely add new tags for further data fields. Tags containing lowercase letters
are reserved for local use and will not be formally defined in any future version of this specification.

jmarshall added a commit that referenced this issue Oct 20, 2023
Add omitted tags to KNOWN_HEADER_FIELDS and VALID_HEADER_ORDER, so that
AlignmentHeader.from_dict()/to_dict() will preserve all header field tags
defined in the SAM specification. Addresses #1237.

Exercise TP:linear in some header tests.

Co-authored-by: John Marshall <jmarshall@hey.com>
@jmarshall
Copy link
Member

Yes, I remember writing that text 😄 — though it was longer ago than I expected, in 2015.

It turns out that KNOWN_HEADER_FIELDS as used by to_dict() does default to str. However from_dict() uses VALID_HEADER_ORDER to constrain which non-lowercase fields it will process. So I've merged PR #1238 to improve this in the short term.

However in the medium term, I think the right thing to do for these conversions will be to rewrite this code using the post-1.10 HTSlib header API instead. At that point, I think most of this code that enumerates known tags will naturally disappear.

If you don't mind, I'll leave this issue open to represent that work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants