AlignmentHeader.from_dict() silently discards the TP attribute from SQ lines #1237

tfenne · 2023-10-18T23:28:39Z

The TP attribute in sequence lines in the SAM header is used to encode whether a reference sequence is linear or circular. Unfortunately AlignmentHeader.from_dict() silently removes this tag from the data:

> header = { "SQ": [{"SN": "some-plasmid", "LN": 5000, "TP": "circular"}] }
> AlignmentHeader.from_dict(header).to_dict()
OrderedDict([('SQ', [{'SN': 'some-plasmid', 'LN': 5000}])])

The text was updated successfully, but these errors were encountered:

See pysam-developers#1237

jmarshall · 2023-10-19T00:39:21Z

I really dislike this “I only believe in fields I know about” aspect of this part of pysam and would rather remove it. Those tables of known attributes are useful to return LN as int and so on, but IMHO unknown-at-the-time attributes should be preserved and default to str instead of being dropped.

Perhaps that's a quick fix too, but in the meantime PR #1238 would be a good stopgap.

tfenne · 2023-10-19T11:05:20Z

I totally agree @jmarshall.

Quoting the spec on the header (emphasis in the second paragraph mine):

1.3 The header section
Each header line begins with the character ‘@’ followed by one of the two-letter header record type codes
defined in this section. In the header, each line is TAB-delimited and, apart from @co lines, each data field
follows a format ‘TAG:VALUE’ where TAG is a two-character string that defines the format and content of
VALUE. Thus header lines match /^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/ or /^@CO\t.*/. Within each (non-@co) header line, no field tag may appear more than once and the order in which the fields
appear is not significant.

The following table describes the header record types that may be used and their predefined tags. Tags
listed with ‘*’ are required; e.g., every @sq header line must have SN and LN fields. As with alignment optional
fields (see Section 1.5), you can freely add new tags for further data fields. Tags containing lowercase letters
are reserved for local use and will not be formally defined in any future version of this specification.

Add omitted tags to KNOWN_HEADER_FIELDS and VALID_HEADER_ORDER, so that AlignmentHeader.from_dict()/to_dict() will preserve all header field tags defined in the SAM specification. Addresses #1237. Exercise TP:linear in some header tests. Co-authored-by: John Marshall <jmarshall@hey.com>

jmarshall · 2023-10-20T10:54:53Z

Yes, I remember writing that text 😄 — though it was longer ago than I expected, in 2015.

It turns out that KNOWN_HEADER_FIELDS as used by to_dict() does default to str. However from_dict() uses VALID_HEADER_ORDER to constrain which non-lowercase fields it will process. So I've merged PR #1238 to improve this in the short term.

However in the medium term, I think the right thing to do for these conversions will be to rewrite this code using the post-1.10 HTSlib header API instead. At that point, I think most of this code that enumerates known tags will naturally disappear.

If you don't mind, I'll leave this issue open to represent that work.

nh13 added a commit to nh13/pysam that referenced this issue Oct 18, 2023

Add missing SQ tags as per the spec

2a58ec4

See pysam-developers#1237

nh13 mentioned this issue Oct 18, 2023

Add missing SQ tags as per the spec #1238

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AlignmentHeader.from_dict() silently discards the TP attribute from SQ lines #1237

AlignmentHeader.from_dict() silently discards the TP attribute from SQ lines #1237

tfenne commented Oct 18, 2023

jmarshall commented Oct 19, 2023

tfenne commented Oct 19, 2023

jmarshall commented Oct 20, 2023

AlignmentHeader.from_dict() silently discards the TP attribute from SQ lines #1237

AlignmentHeader.from_dict() silently discards the TP attribute from SQ lines #1237

Comments

tfenne commented Oct 18, 2023

jmarshall commented Oct 19, 2023

tfenne commented Oct 19, 2023

jmarshall commented Oct 20, 2023