Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: specification versioning #194

Closed
jkbonfield opened this issue Mar 9, 2017 · 16 comments
Closed

Discussion: specification versioning #194

jkbonfield opened this issue Mar 9, 2017 · 16 comments
Assignees

Comments

@jkbonfield
Copy link
Contributor

Following on from discussion here: #40 (comment). This is SAM specific discussion, but the general thought processes apply to all specs. (Although SAM appears unique here in that it has multiple versions in one document instead of multiple documents?)

The SAM specification no longer contains an explicit string explaining the version number. We have examples that contain HD lines, but there is more than one example in the specification and historically they haven't always agreed with each other on the version number. Further more, a SAM file with @HD VN:1.4 is still a valid file within the 1.5 specification so it shouldn't necessarily be interpreted that such an example is implying the entire rest of the document is for 1.4 only.

Commit 321f786 removed the version number string from the specification. I propose adding this back. It used to also have -r49 etc attached. That's hard to keep up to date manually and also tricky for git describe as we have lots of commits in other specifications (and no tags). However maybe it's still beneficial. I find a commit hash of limited use as by itself it doesn't inform readers which of two documents are newest.

I would also like a standard procedure for bumping major or minor version number including commit messages. Commit 80bc68f changed SAM spec from 1.4 to 1.5, but there is no mention of this in the commit message. It took some digging and binary searching with git blame to figure out when this version changed. It would be wrong to go back and correct these historically, but for future we could require something like "Updated version to X.Y" somewhere in the commit message.

Therefore I'd also think about adding tags. These aren't ideal for hts-specs as we have many formats, but it's possible to add say SAM:v1.4, SAM:v1.5, CRAM:v2.1, CRAM:v3.0 etc. Yes a CRAM tag will be meaningless for the SAM spec and vice versa, but it does at least mean we can checkout the SAM v1.4 spec without lots of git blaming.

Comments?

@jkbonfield
Copy link
Contributor Author

Following on to this, if we want to automate the versioning then we could add tags such as SAM:v1.4, CRAM:v3.0 etc as above and then use git describe -match "CRAM:*" to get a string like CRAM:v3.0-7-g1234abc for the CRAM spec or SAM:v1.3-17-g789abcd for the SAM spec.

It's not perfect as the number of edits since the last tag doesn't apply to just that document, but its monotonically increasing and clearly demonstrates one revision as newer than another; something we cannot do with a comittish alone.

@jmarshall
Copy link
Member

jmarshall commented Apr 6, 2017

Commit 321f786 removed the version number string from the specification. I propose adding this back. It used to also have -r49 etc attached. [...] I find a commit hash of limited use as by itself it doesn't inform readers which of two documents are newest.

  1. Having a revision number like -r49 in the source text that we have to update manually is IMHO ridiculous and going back to that would be a step backwards.

  2. The commit hash is not by itself. The document title pages say “This printing is version 494628a from that repository, last modified on [28 Apr 2016]” — it's the date stamp that tells readers which of two documents is newer, while the commit hash tells readers exactly which commit history the document contains.

  3. That said, using git describe --match might be a nice addition to the existing infrastructure.

The SAM specification no longer contains an explicit string explaining the version number. We have examples that contain HD lines, but there is more than one example in the specification and historically they haven't always agreed with each other on the version number. Further more, a SAM file with @hd VN:1.4 is still a valid file within the 1.5 specification so it shouldn't necessarily be interpreted that such an example is implying the entire rest of the document is for 1.4 only.
[...]
I would also like a standard procedure for bumping major or minor version number including commit messages. Commit 80bc68f changed SAM spec from 1.4 to 1.5, but there is no mention of this in the commit message. It took some digging and binary searching with git blame to figure out when this version changed. It would be wrong to go back and correct these historically, but for future we could require something like "Updated version to X.Y" somewhere in the commit message.

What the SAM spec needs is an appendix describing the functionality introduced by each @HD version number change. This is #47, for which I have a draft.

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Apr 6, 2017

Summarising the meeting.

  • It has a date so given two printouts we can still tell which is the newer of the two.
  • It lacks any easily noticed version still (1.5 currently), and no git tags either. (Tags a problematic anyway given the mix of documents, but theoretically doable if a bit messy.)
  • There was agreement that a summarised list of changes per specification would be useful, although not entire agreement on where.

My proposal is that the latter two points are solved at the same time. A manually curated summary of significant changes (ie not punctuation or minor text clarifications - they're already visible in git logs) at the start of the document would also act to define the current version number. Eg for SAM it shows just how much has changed between 1.4 and 1.5, without all the commits discussing TeX formatting tweaks or tidying up of wording and footnotes.

1.5 (current)

  • Define @sq AH header tag. (Mar 2017)
  • Auxiliary tags migrated to SAMtags document. (Sep 2016)
  • Z and H auxiliary tags are permitted to be zero length. (Jun 2016)
  • QNAME limited to 254 bytes (was 255). (Aug 2015)
  • Generalise 0x200 flag bit as filtered-out bit. (Aug 2015)
  • Add @hd GO, "ONT" to @rg PL and @rg PM header tags. (Mar 2015)
  • Add meaning to reverse FLAG on unmapped reads. (Mar 2015)
  • Document the "idxstats" .bai elements. (Nov 2014)
  • Addition of CSI index. (Sep 2014)
  • Addition of MC auxiliary tag. (Dec 2013)
  • Add @pg DS header field. (Dec 2013)
  • Document the BAM EOF byte values. (Dec 2013)
  • Glossary of alignment types. (May 2013)
  • Added supplementary FLAG (May 2013)

1.4 (to March 2012).

  • Added padded SAM and CT/PT tags. (Sep 2011)
  • Added barcode BC tag. (Sep 2011)
  • Changed FZ tag from type H to B,S. (Aug 2011)
  • (etc!)

1.3 (to April 2011)

It's quite clear from all of these that the actual spec version number is only loosely descriptive of what you can expect to work for the minute details, but major things like supplementary reads are the cause of spec version bumping. The above is also culled from the first 20k of git log for SAMv1.tex, so a summary most definitely is useful to have and we don't simply want a ChangeLog file with the appropriate bits of git commit messages automatically pasted in.

@jkbonfield
Copy link
Contributor Author

Bump on this. We have choices to discuss at the conference call.

  1. As the example above.
  2. The list in an appendix with a sentence at the very start, e.g. "This is specification version 1.5. For a list of changes please see Appendix A."
  3. Johns draft for SAM spec versioning #47 (please commit it to a branch somewhere).

@cyenyxe
Copy link
Member

cyenyxe commented May 10, 2017

A manually curated summary of significant changes (ie not punctuation or minor text clarifications - they're already visible in git logs) at the start of the document would also act to define the current version number

I would be fine with that granularity level, but would prefer it was in the appendix instead of at the beginning of the document. If that model was accepted, we must ensure that substantial changes are added to the appendix before merging.

If the PDF was also updated before merging, wouldn't that help to make the commit hash unnecessary?

@jmarshall
Copy link
Member

jmarshall commented May 11, 2017

IMHO what the SAM spec needs re format version is an extra paragraph in §1 to the effect of

The current version of the SAM format is 1.5. Each SAM or BAM file optionally declares the format version it adheres to via an @HD-VN header tag, which for the current version is written as VN:1.5. See [appendix] for a list of the historical format versions and the significant differences between them.

(We can argue about whether @HD should be optional separately and later — this reflects existing practice.)

This along with an appendix describing the extant historical versions and their significant differences. (I am in agreement with @cyenyxe that this information is less important than the description of the format so should be in an appendix rather than at the start of the document.)

I am still hunting out my branch with a draft of that paragraph and appendix; below are my notes from constructing that text. There are fewer extant version numbers than you might guess; the appendix is in reverse chronological order, viz

1.5 (23 May 2013)

Added SUPPLEMENTARY flag bit; added SA:Z tag; PNEXT/RNEXT points to next read, not segment

1.4 (12 April 2011)

Added "B" arrays.
Clarify chaining of PG records; [etc]

1.3 (July 2010)

Added CIGAR N means intron; [etc].

1.0 (2009)

Initial edition.


2009 1.0 as in 0.1.2-draft pages doc

July 2010 VERSION VN:1.3; First commit of v1.3 SAMv1.tex [b351164]
23 July 2010 Dropped flag letters [8c15525]

Nov 2010 Adding BAM description
Added BQ
Added index sections
Added @rg-pg

Dec 2010 CIGAR N means intron

2 March 2011 Re-add CC:Z, CP:i

2 April 2011 Add @RG-FO,KS, FZ:, apply IUPAC in MD,SEQ
[598d014; big change; accidentally readded flag letters]
4 April 2011 Clarify chaining of PG records
12 April 2011 Added "B" arrays
12 April 2011 VERSION VN:1.4 [729b3f0]
13 April 2011 Crazy possibility of all-QNAMEs-the-same; nuked flag letters
18 April 2011 Describe QNAME "*" instead of that crazy possibility

29 Aug 2011 Change FZ:H to FZ:B:S

7 Sept 2011 Add BC:Z; s/fragment/segment/g

Sep/Oct 2011 Added the padding guide

27 Sept 2011 Max chr length from 2^29 to 2^31; (clarify MD5)
27 Sept 2011 Added CO:Z, PT:Z, RT:Z
12 Oct 2011 Peter Cock clarifies PT/RT
21 Oct 2011 More from Peter Cock

16 Mar 2012 Peter Cock adds "Annotation dummy reads" recommendations

23 May 2013 VERSION VN:1.5; added SUPPLEMENTARY flag bit [80bc68f]
Added SA:Z tag; PNEXT/RNEXT points to next read, not segment
29 May 2013 Incorporated Alec Wysoker's glossary rewrite

26 July 2013 JM removed v1.5-r49 from the title, added git describe instead

18 Oct 2013 Update PNEXT/TLEN from 2^29 to 2^31 (fixed Sept 2011 oops)

5 Dec 2013 Added @PG-DS

19 Dec 2013 Describe BGZF trailer blocks

28 Feb 2014 Added MC:Z mate cigar

12 Sep 2014 Add links to CSI etc

21 Nov 2014 Merge BAI stats pseudobins and n_no_coor (from April :-()

2 March 2015 Reverse flag is meaningful for unmapped reads

4 March 2015 Merge SAM 'i' range limit removal (from November :-()
4 March 2015 Add @HD-GO, @RG-PL:ONT, @RG-PM

@jmarshall
Copy link
Member

@cyenyxe wrote:

If the PDF was also updated before merging, wouldn't that help to make the commit hash unnecessary?

I'm not sure what you mean. The point of the commit hash on the title page is so that someone who just has a printout of the PDF can go to a git repository and find out exactly which commits have contributed to the document that they're looking at.

Are you saying that if the PDF is updated on the branch, you don't need to update it again post-merge as the embedded on-the-PR-branch commit hash will be close enough? Unfortunately this fails when other changes to that spec have been made on or merged to master in the meantime.

Because of that, to get a meaningful embedded commit hash, PDF updates can only take place as non-merge commits on the mainline. (This means that updating the PDF during a PR, like in some of the commits on #200, is mostly pointless and only bloats the repo.)

@jkbonfield jkbonfield self-assigned this Jun 1, 2017
@jkbonfield
Copy link
Contributor Author

There's some interesting discrepancy between when changes are made and which version they occur in.

Clearly the @PG DS field is in 1.5; 1.5 was released May 2013 and the DS field was added Dec 2013). However we see a flurry of commits during April 2011 with the 1.4 release mid-way. Almost certainly the changes that occurred just before 1.4 are to be considered as part of 1.4 itself as I doubt we pushed out a 1.3 PDF with those changes in it. I think we need a bit of (perhaps arbitrary) judgement when deciding where the boundaries are!

I'm working on it anyway.

@jmarshall
Copy link
Member

Most of the tag changes are immaterial and could be noted in passing, if at all. For example, @PG-DS may have arrived when 1.5 was the current version, but there's no compatibility issue or other problem introduced if a file that says VN:1.0 also has a @PG-DS field.

The significant changes that need to be listed in this Format Version History section are the ones that do affect compatibility. At a first stab, this is limited to:

1.5 introduced the SUPPLEMENTARY flag. If your file contains supplementary reads, it must identify itself as 1.5 or later; files with version numbers <= 1.4 can be assumed not to contain supplementary reads.

1.4 introduced the B array tag type. If your file contains B-arrays, it must identify itself as 1.4 or later; files with version numbers <= 1.3 can be assumed not to contain B-arrays.

etc. What else is there?

@jkbonfield
Copy link
Contributor Author

The key changes are SUPPLEMENTARY flag (1.5), B array (1.4), IUPAC sequence (1.4), maybe QNAME "*" (1.4) and FLAG no longer supporting letters (1.3). My change adds these in bold to indicate they are the primary causes of the version number being bumped.

However all the other things are still important and need to go somewhere. I'm not after a full and gory "git log", but a summary saying when we introduced N as a cigar OP for example (half way through a version release) or added the ability to store padded sequence assemblies in SAM/BAM (early 1.4) is still useful information.

Some of it is fluff though, so we can decide if it needs culling.

@jkbonfield
Copy link
Contributor Author

In addition to the above, I state in my change "The key changes that caused the version number to change are shown in bold". However maybe it should be phrased as "Changes that are likely to cause compatibility issues are shown in bold". This should include the CIGAR "N" operator as old code that didn't support it would almost certainly choke or give the wrong alignments, yet it appeared half way through a version without a bump in version number so we apparently didn't consider it vital enough to update the version.

More debatable, FZ:H to FZ:B may break some tools, but most programs don't decode this tag anyway so it would be very isolated.

@jmarshall
Copy link
Member

[Since this issue was filed, the SAM specification has gained an appendix listing historical significant changes and a clear statement in the introduction noting the version that the document describes. There remains the question of whether adding tags in this repository would be useful in general (not just for SAM) — see also related discussion in #316.]

To my mind, there isn't much to be gained from tagging commits in this repository.

As noted in the original issue comment, the repository contains a disparate collection of unrelated documents, so a tag would not pertain to the repository as a whole. A convention involving git describe --match allows tags to be considered to relate to subsets of the repository, but for specifications where there are separate e.g. VCFv4.1/VCFv4.2/VCFv4.3 documents a version tag provides little information anyway.

What information would a version tag provide for the CRAM or VCF specifications? These have separate documents for each extant major/minor format version, so presumably the best document to look at for e.g. VCF 4.2 is the master HEAD edition of VCFv4.2.tex/.pdf. So a purely version tag like VCF:v4.2 would always just point at master's HEAD.

Conversely, the SAM/SAMtags, htsget, and refget specifications have a single document apiece, reflecting the most up-to-date version of the format or protocol. These documents contain appendices listing the significant differences between historical format/protocol versions.

The current version of SAM is 1.6. On the face of it, it would be useful to have a tag pointing at e.g. SAM version 1.4, but this would have different meanings for different audiences. General readers would want to be pointed at the best document for SAM 1.4, which would be the last commit before it was bumped to VN:1.5. Maintainers would want to record the initial revision that can be considered to be SAM 1.4, which would be approximately the commit in which it was bumped to VN:1.4.

The situation is similar for htsget and refget, though they are so new that they do not have multiple extant versions or much history as yet.

IMHO most needs are best served by the history appendix in the master HEAD edition of the various documents. So version tags are unnecessary. Anyone wanting to see details of the exact historical e.g. SAM v1.4 specification will be spelunking around in the Git repository history anyway.

@jkbonfield
Copy link
Contributor Author

The versioning is text based. It's a big step forward, but a change log is not the same thing as a version control system.

If I want to actually see a spec at a specific version number then currently it is a complex task. Git could make this a cinch with "git checkout SAM_1.5; make" and voila.

It's not a biggy, but IMO you're conflating two issues together; documentation and tagging.

@jmarshall
Copy link
Member

I have two main points:

  1. “I want to actually see a spec at a specific version number” is not well-defined. I reckon different audiences mean different things by this.

  2. Nobody has actually demonstrated a pressing demand for needing to do this.

In your initial comment starting this issue, you said:

I would also like a standard procedure for bumping major or minor version number including commit messages.

For example, from b1ae9f9: Bump SAM version to VN:1.6 due to >65535 CIGAR strings (PR#307). IMHO adhering to such a convention for these commit messages is all we really need.

@jkbonfield
Copy link
Contributor Author

Agreed, point 2 fixes things going forwards. It doesn't for historical commits which have been bad at this (it was a surprising challenge to write that history section), but you're right there's not a huge need so I'm happy.

@jmarshall
Copy link
Member

Discussed again at September 20th meeting. All comments on issue and in meeting equate to “happy with the status quo”. No further points raised subsequently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants