Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SAM Read Group (@RG) is SM required? #286

Closed
blankenberg opened this issue Feb 21, 2018 · 6 comments
Closed

SAM Read Group (@RG) is SM required? #286

blankenberg opened this issue Feb 21, 2018 · 6 comments

Comments

@blankenberg
Copy link
Contributor

Looking at the original spec SAM1.pages spec file (e.g: dbdec8b / https://github.com/samtools/hts-specs/raw/dbdec8b0405fa2c4a4708a91722168c7e654bbab/SAM1.pages) both ID and SM are listed as being required (pages formatted file removed mid 2013: 78aa805):
image
however, when the LaTeX version appeared (07dc1c6) this requirement did not appear.

Certain tools, e.g. Picard Tools, require SM tag in read groups, which matches with the original spec.

Should SM be required or not?

@bioinformed

This comment has been minimized.

@blankenberg
Copy link
Contributor Author

@bioinformed good point about versions. The specs are versioned and SAM/BAM files themselves can be 'optionally' versioned with the @HD record (VN is required when @HD is specified).

Looking closer at the pages file, its version is listed asVersion 0.1.2-draft (20090702) whereas the LaTex spec appeared as v1.3 draft.

However, this would be a rather significant specification change that is not mentioned in the version change history appendix, e.g. from version 1.5 (current, https://github.com/samtools/hts-specs/blob/7326e54db05075707fe133d003297681a5313c52/SAMv1.tex ):

\subsection*{1.3: July 2010 to April 2011}

\begin{itemize}
\item Re-add {\tt CC} and {\tt CP} auxiliary tags. (Mar 2011)
\item Add CIGAR N intron/skip operator. (Dec 2010)
\item Add {\tt BQ} BAQ tag. (Nov 2010)
\item Add {\tt RG PG} header field. (Nov 2010)
\item Add BAM description and index sections. (Nov 2010)
\item \textbf{Removal of FLAG letters.} (July 2010)
\end{itemize}

\subsection*{1.0: 2009 to July 2010}

Initial edition.

which indicates that there is an oversight here -- either a failure to label as being required, or a failure to convey the significant change.

@jmarshall
Copy link
Member

The transition from a Pages document to LaTeX eight years ago was essentially a complete rewrite by @lh3 (of mostly his own previous document) and the differences were not individually called out in source control — which was barely in use for this spec at the time.

However Heng noted on the mailing list at the time that he did indeed intend this particular relaxation:

RG SM (sample) is no longer a required field like it was in the previous version of the spec, is this intentional?

This is intentional. Sometimes, we simply want to group reads without sample information.

I can see his point. Certainly in the typical usage, clinical or otherwise, a read group corresponds to a library or a sample and sensible pipelines will use the defined @RG LB and SM tags to track that information. However in other circumstances there may not be significant sample information available, and the specification shouldn't unnecessarily preclude the use of readgroups in such circumstances.

Picard in strict mode requires @RG SM (but not e.g. @RG LB), and this code would have been written thus because SM was required in the original spec, as you noted. Picard validates a number of things that are recommended practice for the typical SAM/BAM usage, but are not strictly required by the spec; and this could be considered one of them.

@jkbonfield
Copy link
Contributor

Picard has its own set of definitions about what the Picard tool chain requires, but please do not confuse this with the specification. It has a number of additional restrictions (which may be relaxed via VALIDATION_STRINGENCY) of which this is one.

The SAM spec is in a much better, more tracked, situation since it converted from Pages to LaTeX as it is now version controlled. That in turn means that in a clinical setting a very precise version may be specified and adhered to. Failure to do this isn't a problem caused through fluidity of the spec, but is poor experimental design or failure to follow procedures in a clinical environment. @bioinformed: Face facts - things change and technology marches on. This is what the versioning is for!

@jkbonfield
Copy link
Contributor

@blankenberg I wrote the version history part of the spec, but I did so purely based on git commit history. This means we cannot write anything prior to 1.3 as the spec wasn't in version control then and it'd be too much work to try and figure out all the older changes.

@blankenberg
Copy link
Contributor Author

Thanks again for the history and clarifications.

thefferon added a commit to thefferon/hts-specs that referenced this issue Apr 24, 2018
* Improve description of header @sq M5 tag.

Make the description of how the digest is calculated clearer.

Expand on padding characters a bit, including a link to the section
about padded SAM.

Add examples, including one that uses pad characters.

* Updated SAMv1.pdf

* changing VCF maintainer in MAINTAINERS.md

Louis Bergelson is replacing David Roazen as the Broad's VCF maintainer

* Add changed requirement of @rg SM to version history

Fixes samtools#286 (see discussion there; Heng Li noted at the time [1] that this
change was intentional: "Sometimes, we simply want to group reads without
sample information").

[1] https://sourceforge.net/p/samtools/mailman/message/25788014/

* add htsget_interop.html

* Update htsget_interop.html

* Update htsget_interop.html

* Update htsget_interop.html

* Update htsget_interop.html

* Definition of SB INFO field Number and Type (samtools#290)

* Changed gVCF format broken link (samtools#288)

* Explicitly state character encoding, locale, and RE syntax.

Fixes samtools#197.

Document SAM header fields that permit UTF-8.

Fixes samtools#140.

* Updated SAMv1.pdf

* htsget VCF support

squash of https://github.com/samtools/hts-specs/pull/233/files

* htsget.md minor reformatting

* Added FASTA and FASTQ external links (PR samtools#247)

* INFO key 1000G is allowed as a special legacy value (samtools#289)

INFO key 1000G is allowed as a special legacy value

* Tidy up BAM definition

Split combined fields (bin_mq_nl and flag_nc) into their component
parts.

Move BIN calculation and Auxiliary tag encoding out of footnotes
and into their own sections.

* Move the n_cigar_op footnote into its own section.

Also clarified the SEQ encoding order (Fixes issue#269) and updated
the version history.

Rephrased BAM block_size to avoid "remainder". (Fixes samtools#279)

* Rewrite auxiliary field section to really show the BAM representation

Whether the array element _count_ is properly uint32_t or int32_t has
not been described previously. Ideally it would be unsigned (as it is
an item count!), but it is limited by block_size which the spec already
describes as a signed int32_t, and it is implemented as such in HTSlib.

Fixes samtools#273.

* BAM table formatting and thinko [minor]

No final fullstops within table entries.
Fix long-standing s/mate/next/ straggler.

* Fix BAI linear index description: "overlapping", not "starting in"

As per Heng Li, "Re: Understanding the linear index", on the
samtools-help mailing list in June 2010 [1]:

   "The spec is wrong here. We should store the file offset of the
    leftmost alignment that *overlaps* (not starts in) the 16kb window.
    The implementation is different from what the spec is saying."

Also hat tip to Adam Birnbaum of Edico Genome who rediscovered this
in 2015 (private communication).

Rediscovered once again in 2016 in samtools/htslib#405. As noted in
a comment on that issue [2], samtools, HTSlib, and Picard have always
constructed the linear index based on *overlapping* alignments.

[1] https://sourceforge.net/p/samtools/mailman/message/25602941/
[2] samtools/htslib#405 (comment)

* Fix array tag typo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants