Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sam: Consider more conventional format versioning #725

Open
zaeleus opened this issue May 9, 2023 · 6 comments
Open

sam: Consider more conventional format versioning #725

zaeleus opened this issue May 9, 2023 · 6 comments
Labels

Comments

@zaeleus
Copy link

zaeleus commented May 9, 2023

This is in regard to Sequence Alignment/Map Format Specification (2022-08-02) § "Appendix B: SAM Version History".

Specification changes are currently made to the latest version. This is both confusing and unconventional, as versioned formats should be considered immutable.

I propose 1.6 be finalized and future changes made to a draft of the next version of SAM.

@jmarshall
Copy link
Member

The format itself is versioned in a conventional manner: generally when syntax is introduced that previous implementations would fail to parse, the version number is bumped to indicate that. This is indicated by the bold items in Appendix B. For example, when B arrays were added as a new form of optional field, the @HD VN version number was bumped to 1.4 accordingly.

The SAM format is designed to be an extensible format. Thus it defines syntax for various kinds of optional SAM record fields, and invites tools and research groups to define meanings for particularly named tagged fields. Similarly it defines (simpler untyped) syntax for SAM header fields, and invites users to define meanings for particularly named fields. There are also particular tagged fields and header fields that are predefined by the SAM format authors and maintainers because they (and the shared vocabulary they define) are of general use; for convenience, these are listed in the same specification documents.

There is no issue with a particular tagged or header field defined later appearing in a SAM/BAM/etc file that declares an earlier @HD VN version (provided it is not a B array in a VN 1.3 file or a similar scenario, of course), as it is really just another field. For example, @SQ AH and MI:Z were defined in 2017 while the SAM version number was 1.5, but it is perfectly appropriate for them to appear in a file that declares VN:1.3 — it would not benefit users if tools rejected such a scenario.

Similar comments apply to other semantic information such as the list of (valid) well-known values for such fields, as mentioned in #718 (comment).

This means that the specification document contains a mixture of categories of information: there is the format proper that determines syntax and what implementations need to parse, and semantic definitions of particular tags perhaps accompanied by lists of valid values and their definitions. (The definitions of optional record fields used to be a section within SAMv1.pdf too BTW. SAMtags.pdf was moved to a separate document for two main reasons: so that it could be more reasonably referred to from the CRAM specification; to somewhat decouple it and its maintenance from the SAM document and its versioning. These header field definitions could be moved to SAMtags.pdf too for similar reasons, but IMHO that would be overkill — I have a backburner PR to reformat the multipage header table as not-a-table, which would mean we had more space to expand on what “valid values” really means and clarify this.)

versioned formats should be considered immutable

What I think you are suggesting implies that absolutely everything defined in the specification documents would no longer be considered extensible, in the sense that fields and valid values defined in a particular edition of the specification would not be allowed in SAM files that predated that edition (or that declared an @HD VN value prior to that edition). Also the specification would need to define much more fine-grained version numbers for use in VN.

The question is who this would be an improvement for. There are at least three categories of people with an interest in these specifications:

  • Users of SAM/BAM/etc files and of tools that deal with these files
  • Implementors of tools that deal with these formats
  • The specification authors

It is true that tools that validate things carefully may wish to say things like “I support PL:ULTIMA because I support edition XYZ of the SAM specification”. But SAM philosophy has always been that such statements are not particularly useful for the users working with SAM files, and that where it means that otherwise compatible new fields can't be used in old files it's actively counterproductive for users.

TL;DR There is a difference between versioning of the format and versioning of the specification document. This has been discussed ad nauseam before, for example on #194 and #454.

In practice, specification documents are not as immutable as you might think. Typo fixes and formatting improvements do not change the format described by a specification. Clarification of unclear text should not and generally does not change the format described by a specification. Thus not all changes in a specification document should be reflected in the version number declared by a file in that format. (These are examples of when specification documents are not usefully considered immutable.)

Hence the SAM specification separates the SAM version number (like the current 1.6) from the specification's own versioning. The specification document itself is lightly versioned: its title page mentions the Git commit hash it was built from and the date of that commit. This light versioning has IMHO proved sufficient for readers' needs.

That is the status quo.

If the status quo were to be revisited, those advocating for a change ought to familiarise themselves with the previous discussions and consider the following:

  • Do you accept the idea of SAM as an extensible format and the implications of that described above?
  • Do you accept the distinction between syntax and semantics described here and on e.g. sam: Add PL value "MGIG400" as seen in the wild #718? (Perhaps you may have suggestions on making the distinction in the spec clearer.)
  • What, specifically, would be your proposal, and what actual benefits would it have for users, for implementors, and for the specification editors?

@zaeleus
Copy link
Author

zaeleus commented May 10, 2023

There is no issue with a particular tagged or header field defined later appearing in a SAM/BAM/etc file that declares an earlier @HD VN version... For example, @SQ AH and MI:Z were defined in 2017...

Yes, but this is not extensibility but backwards compatibility. Both examples were already accepted before they were formally defined.

But SAM philosophy has always been that such statements are not particularly useful for the users working with SAM files, and that where it means that otherwise compatible new fields can't be used in old files it's actively counterproductive for users.

Validations are quite useful when expecting conformance. If a new field is incompatible with an old format, then that compatibility wasn't there in the first place. I suspect this knowledge actually improves productivity for users.

This has been discussed ad nauseam before, for example on #194 and #454.

I think these issues are unrelated to this discussion? My proposal is about format versioning, not the standard operating procedure of document releases.

Do you accept the idea of SAM as an extensible format and the implications of that described above?

A format is as extensible as it is allowed to be. For example, @RG PL is not extensible. The description gives a list of valid values, not a list of maybe values. I have given my suggestion to allow for extensibility previously in #718 (comment).

Do you accept the distinction between syntax and semantics described here...

No, I consider this all part of the grammar. You're arguing that the lexical analysis overrules all of it.

Your first example is a counterexample to your argument:

For example, when B arrays were added as a new form of optional field, the @HD VN version number was bumped to 1.4 accordingly.

TAG:TYPE:VALUE is, as you call, the syntax. TYPE is "a single case-sensitive letter". Therefore, TAG:B:VALUE was already accepted. Given that TYPE is a token and not an invariant, VALUE can be arbitrary. Why was this able to influence the version bump?

What, specifically, would be your proposal, and what actual benefits would it have for users, for implementors, and for the specification editors?

My proposal still stands from the opening message: "I propose 1.6 be finalized and future changes made to a draft of the next version of SAM."

Ambiguity and undefined behavior in a specification is detrimental to interoperability. Barring naive usages and implementations, a lack of guidance makes this bad for both users and implementors. Changing the behavior of the format within the same version is not conventional and confusing for all parties (except the spec maintainers?).

@jmarshall
Copy link
Member

jmarshall commented May 10, 2023

Do you accept the distinction between syntax and semantics described here[…]

No, I consider this all part of the grammar.

Well, that is a view that is not held by others. It is IMHO an implementor-centric view, and not one that is especially useful for users of SAM or for most implementors of bioinformatics tools.

If it is not clear that things like the lists of valid well-known values for PL are semantics information and of a different category from the grammar proper, then that is something to be clarified in the spec. I outlined how I think we should do this in parentheses in my previous comment.

Certainly there is a tension here for implementations that want to validate PL values: it is useful that there exist implementations that produce diagnostics for unknown values, because that is how we actually get consistency in the values written by users; but there is a deployment issue for such implementations when new platforms arrive and are added to the spec. The spec just lists the canonical values (imperfectly) and leaves decisions about whether and how strictly to validate this field value up to the implementations. Users would not be helped by having to set @HD VN to some inflated value if they used PL:ELEMENT, and having their SAM files rejected by (some) tools if they did not.

Your first example is a counterexample to your argument:

For example, when B arrays were added as a new form of optional field, the @HD VN version number was bumped to 1.4 accordingly.

TAG:TYPE:VALUE is, as you call, the syntax. TYPE is "a single case-sensitive letter". Therefore, TAG:B:VALUE was already accepted. Given that TYPE is a token and not an invariant, VALUE can be arbitrary. Why was this able to influence the version bump?

With our spec maintainers hats on, we need to think about what SAM as a Platonic ideal is and means, and what the spec text means to say and ought to say, not just what the imperfect text as it currently stands actually says.

Here TYPE is “a single case-sensitive letter from the following table”, whether the text says that explicitly or not. The defined TYPE letters are a different category of thing to the PL values, because they define how to interpret VALUE and what syntax is expected when parsing VALUE,1 and they define how the field is represented in BAM. Certainly an implementation would have no hope of parsing a BAM record that contained a B array prior to the addition of B to the specification or prior to the addition of specific support for B arrays as defined in the spec to the implementation. Hence the defined TYPE letters and the implications for representation of VALUE in SAM and BAM are part of the syntax, in a way that the list of PL values is not.

In the light of those considerations, it is obvious that TAG:B:VALUE could not already have been accepted. Hence the version bump.

Footnotes

  1. I certainly agree that regexes by themselves are an insufficient way of describing what things specified via regex are supposed to mean.

@jmarshall
Copy link
Member

Changing the behavior of the format within the same version is not conventional and confusing for all parties (except the spec maintainers?).

The spec maintainers would claim that they have not substantially changed the behaviour of the format within the same version.

Your views on PL are clear. You do have a point and the description of this field should be expanded and improved in some approximation of the ways talked about.

Are there other specific problems you have in mind? (Or is this a storm in a PL-sized teacup…?)

@jkbonfield
Copy link
Contributor

jkbonfield commented May 15, 2023

TAG:TYPE:VALUE is, as you call, the syntax. TYPE is "a single case-sensitive letter". Therefore, TAG:B:VALUE was already accepted. Given that TYPE is a token and not an invariant, VALUE can be arbitrary. Why was this able to influence the version bump?

SAM and BAM are two different specs, with different version numbers, but there is some rather poorly defined tie between them. BAM has its own binary version number, but that is more for the higher encapsulation and encoding of fixed fields. There is also the SAM version number within the text header within BAM.

For B, arguably both BAM and SAM could have changed version, but given BAM encapsulation elsewhere doesn't change and the SAM header version can be specified within BAM, I guess we decided to stick with SAM only.

An example with B:

In SAM, it's just a text string with an array of ASCII numbers. Basically it's just a new type as you say. The syntax doesn't change, and we can still skip from SAM tag to SAM tag trivially by looking for the tabs and recognising TAG:TYPE:VALUE. Even if we get a TYPE we don't know, our parser could be written to be liberal and just skip over the tag, or store it in pure text and regurgitate it back out as-is without any knowledge of the new type code.

In BAM this is fundamentally impossible. BAM tags are all squished together without a separator like TAB. We must know what the new TYPE means and how to parse it, otherwise our entire file is unparsable. To that degree, the B tag is a hard stop. It's not just a case of interpretation of data and recognising new key-value names, but of the fundamental data layout.

I agree it's nuanced, and maybe not desireable, to have both BAM and SAM intertwined in this way, but it's where we are and we cannot change history. So do remember this when thinking in pure abstract terms of syntax and semantics - we're not just in SAM world.

@jkbonfield
Copy link
Contributor

I'll also add my views here, which are pretty much a copy of what John has already summarised.

You have to be pragmatic and think about the software life cycle.

Let us imagine a case where we add OMNICORP as a new platform to the PL field. What should our existing infrastructure do? Should it simply fail? Should it warn? Or should it swallow the data unless we're actively doing something that specifically needs to do platform-specific processing and it's now met one it isn't aware of?

Personally I'd say the users will likely accept the latter, and reject the former as being user-hostile. I know from looking at download logs that we get a VAST number of downloads of ancient legacy copies of software. I wish people didn't, but they do. Would we really be serving the community better if our software just flat out broke when given a file with a syntacically valid data stream but having a new field or new value? Obviously not.

So what is the point therefore of updating the version number to reflect this field has been added? I cannot see one.

We separated out the aux tags into their own document primarily because we wanted to use a shared specification that could be referenced from both SAM/BAM and CRAM. However it served a rather useful purpose in largely divorcing the changes that happen there from the changes in the SAM/BAM and CRAM version numbers. When we added base modification tags (MM/ML/MN) we didn't need to change version. The only tag change that altered version was when the actual encapsulation had to change to accommodate new encoding byte streams - B in BAM basically.

Arguably we could be clearer with the SAM header, and this is also shared between formats. However I largely view it in a similar vein as the aux tags. The key:value there (not key:type:value) should be considered as high level syntax within the specification, with the specific keys and values not being deemed to be something that changes the specification versions. If however we do something that we know would seriously break implementations, such as a new @XY header field, then we would likely consider this to be a version bump.

Obviously you could, and maybe should, question how adding @XY is a version change but adding @RG XY:foo may not be, given the overall syntax doesn't really change, but this is where the pragmatism comes in. We know adding new fundamental header types will break the mainstream implementations. To date, minor changes with sub-keys and definitely changes of values hasn't broken them. This isn't ideal, but it's pragmatic and again we need to do what is best for the community here. (That applies to both us as specificartion maintainers, but also us as the authors of implementations.)

@jkbonfield jkbonfield added the sam label May 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Stalled
Development

No branches or pull requests

3 participants