New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SAM Read Group (@RG) is SM required? #286
Comments
This comment has been minimized.
This comment has been minimized.
@bioinformed good point about versions. The specs are versioned and SAM/BAM files themselves can be 'optionally' versioned with the Looking closer at the pages file, its version is listed as However, this would be a rather significant specification change that is not mentioned in the version change history appendix, e.g. from version 1.5 (current, https://github.com/samtools/hts-specs/blob/7326e54db05075707fe133d003297681a5313c52/SAMv1.tex ):
which indicates that there is an oversight here -- either a failure to label as being required, or a failure to convey the significant change. |
The transition from a Pages document to LaTeX eight years ago was essentially a complete rewrite by @lh3 (of mostly his own previous document) and the differences were not individually called out in source control — which was barely in use for this spec at the time. However Heng noted on the mailing list at the time that he did indeed intend this particular relaxation:
I can see his point. Certainly in the typical usage, clinical or otherwise, a read group corresponds to a library or a sample and sensible pipelines will use the defined Picard in strict mode requires |
Picard has its own set of definitions about what the Picard tool chain requires, but please do not confuse this with the specification. It has a number of additional restrictions (which may be relaxed via VALIDATION_STRINGENCY) of which this is one. The SAM spec is in a much better, more tracked, situation since it converted from Pages to LaTeX as it is now version controlled. That in turn means that in a clinical setting a very precise version may be specified and adhered to. Failure to do this isn't a problem caused through fluidity of the spec, but is poor experimental design or failure to follow procedures in a clinical environment. @bioinformed: Face facts - things change and technology marches on. This is what the versioning is for! |
@blankenberg I wrote the version history part of the spec, but I did so purely based on git commit history. This means we cannot write anything prior to 1.3 as the spec wasn't in version control then and it'd be too much work to try and figure out all the older changes. |
Thanks again for the history and clarifications. |
* Improve description of header @sq M5 tag. Make the description of how the digest is calculated clearer. Expand on padding characters a bit, including a link to the section about padded SAM. Add examples, including one that uses pad characters. * Updated SAMv1.pdf * changing VCF maintainer in MAINTAINERS.md Louis Bergelson is replacing David Roazen as the Broad's VCF maintainer * Add changed requirement of @rg SM to version history Fixes samtools#286 (see discussion there; Heng Li noted at the time [1] that this change was intentional: "Sometimes, we simply want to group reads without sample information"). [1] https://sourceforge.net/p/samtools/mailman/message/25788014/ * add htsget_interop.html * Update htsget_interop.html * Update htsget_interop.html * Update htsget_interop.html * Update htsget_interop.html * Definition of SB INFO field Number and Type (samtools#290) * Changed gVCF format broken link (samtools#288) * Explicitly state character encoding, locale, and RE syntax. Fixes samtools#197. Document SAM header fields that permit UTF-8. Fixes samtools#140. * Updated SAMv1.pdf * htsget VCF support squash of https://github.com/samtools/hts-specs/pull/233/files * htsget.md minor reformatting * Added FASTA and FASTQ external links (PR samtools#247) * INFO key 1000G is allowed as a special legacy value (samtools#289) INFO key 1000G is allowed as a special legacy value * Tidy up BAM definition Split combined fields (bin_mq_nl and flag_nc) into their component parts. Move BIN calculation and Auxiliary tag encoding out of footnotes and into their own sections. * Move the n_cigar_op footnote into its own section. Also clarified the SEQ encoding order (Fixes issue#269) and updated the version history. Rephrased BAM block_size to avoid "remainder". (Fixes samtools#279) * Rewrite auxiliary field section to really show the BAM representation Whether the array element _count_ is properly uint32_t or int32_t has not been described previously. Ideally it would be unsigned (as it is an item count!), but it is limited by block_size which the spec already describes as a signed int32_t, and it is implemented as such in HTSlib. Fixes samtools#273. * BAM table formatting and thinko [minor] No final fullstops within table entries. Fix long-standing s/mate/next/ straggler. * Fix BAI linear index description: "overlapping", not "starting in" As per Heng Li, "Re: Understanding the linear index", on the samtools-help mailing list in June 2010 [1]: "The spec is wrong here. We should store the file offset of the leftmost alignment that *overlaps* (not starts in) the 16kb window. The implementation is different from what the spec is saying." Also hat tip to Adam Birnbaum of Edico Genome who rediscovered this in 2015 (private communication). Rediscovered once again in 2016 in samtools/htslib#405. As noted in a comment on that issue [2], samtools, HTSlib, and Picard have always constructed the linear index based on *overlapping* alignments. [1] https://sourceforge.net/p/samtools/mailman/message/25602941/ [2] samtools/htslib#405 (comment) * Fix array tag typo
Looking at the original spec
SAM1.pages
spec file (e.g: dbdec8b / https://github.com/samtools/hts-specs/raw/dbdec8b0405fa2c4a4708a91722168c7e654bbab/SAM1.pages) bothID
andSM
are listed as being required (pages formatted file removed mid 2013: 78aa805):however, when the LaTeX version appeared (07dc1c6) this requirement did not appear.
Certain tools, e.g. Picard Tools, require
SM
tag in read groups, which matches with the original spec.Should
SM
be required or not?The text was updated successfully, but these errors were encountered: