New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Developing a StrictSam standard #281
Comments
I think that overall the start is good. I think that it is overzealous at places (for example forbidding I think it would be good to agree on an identifier for each "rule" some thing like
So that
Also, in the call, we discussed tests. I'm wondering if we should have SAM files (as individual files!) that can be used by validation program authors to check their programs. It should be clear, for each SAM file which validations are expected to fail (if any) |
I'd like to see levels of strictness. Something like INVALID, NON_COMPLIANT and INCOMPLETE. There are clearly some things which we can define as broken / invalid as it's basically nonsensical.
Non-compliance is examples of auxiliary tags in the reserved address space. Sadly this is very common, hence why it should be a separate category. Similarly there is some looseness around the definition of TLEN; the specification is quite clear, but a huge portion of commonly used programs do something subtly different. While technically these are failures, if we label all these as invalid then we lose discrimination power as most data in NCBI / ENA is now labelled as invalid - not very useful in a diagnostic tool. For incomplete data we have valid SAM, but one that may perhaps cause problems for fragile tools. For example
I view these largely as groupings of the failure-ID Yossi listed above (NON_STAR_RNAME_MUST_HAVE_SN_IN_HEADER etc). We could explicitly give a list of the strict IDs we want to check or exclude, or instead give a category which is essentially an "OR" operation on a large set of IDs. |
One of the reasons I requested feedback was because I wasn't sure what the best way to label the errors. Some options are:
I'm currently leaning towards a hierarchy (matching the document format hierarchy) with tags as well.
That particular one I pulled straight from Section 2 (Recommended Practice for the SAM Format) of the SAM specs.
The use case I had in mind was for program/archive input validation. A given program would specify the conformance rules that it requires. BreakDancer would require basic sanity checks and mate record consistency checks to pass but not care about any of the standard tags checks since it doesn't use them. GRIDSS would be much more particular and also require all StandardTags.SA.* checks to pass. EMBL might care only about completeness and data consistency of primary alignments.
Even without including problematic checks such as TLEN (which incidentally you can't properly check since an indel-aware program can write a correct TLEN that doesn't match either of the simple definitions), I expect most data to fail at least one check and that won't actually be problematic. I envisage something like the following: $validatesam --summary mybam.bam I'll keep fleshing out more rules then make a PR and we can have an in-depth discussion how to name and organise them. |
We have two distinct things here.
The first case I'm uneasy about. I'd be happy with something that states if a read claims its mate is reverse complemented but its mate isn't, then that is an inconsistency (this is arguably still point 2; a semantic validation on top of the syntacially validation). However something entirely new like requiring that CIGAR must have at least one operator (disallowing "*") is turning semantically valid SAM files into invalid ones purely for ease of program coding. I don't see this gaining much traction. Better would be tightening up SAM spec itself, where appropriate, with a corresponding version bump. Regarding the second point only, some rules need to be tied to others. For example, if we do not know what version of the SAM specification we have then we cannot validate the header tags correspond to the version. "AH" SQ tag appeared in 1.5 and is therefore invalid in a file with "HD VN:1.4" but not invalid in one without an HD line. Even for StrictSAM, we'd need these rules tied to the SAM version numbers. The should be annotated as such in the text, eg "[1.4]" meaning 1.4 and newer or "[=1.4]" being precisely 1.4 only and not anything else (I doubt we'd need that though). We also need to be clear which case we're talking about - a SAMStrict only rule or a SAM validation rule. Comments:
|
It might be useful to tag SamStrict constraints that might be violated by
subsetting. That would enable a smarter validation tool that can be told to
ignore that set of restrictions, during the parallel part of the pipeline.
then, when the final file is put together one could run the validator again
without that flag.
…On Wed, Jan 31, 2018 at 7:02 AM, James Bonfield ***@***.***> wrote:
We have two distinct things here.
1. A new format derived from SAM but with additional rules.
2. A validator for the existing format.
The first case I'm uneasy about. I'd be happy with something that states
if a read claims its mate is reverse complemented but its mate isn't, then
that is an inconsistency (this is arguably still point 2; a semantic
validation on top of the syntacially validation). However something
entirely new like requiring that CIGAR must have at least one operator
(disallowing "*") is turning semantically valid SAM files into invalid ones
purely for ease of program coding. I don't see this gaining much traction.
Better would be tightening up SAM spec itself, where appropriate, with a
corresponding version bump.
Regarding the second point only, some rules need to be tied to others. For
example, if we do not know what version of the SAM specification we have
then we cannot validate the header tags correspond to the version. "AH" SQ
tag appeared in 1.5 and is therefore invalid in a file with "HD VN:1.4" but
not invalid in one without an HD line.
Even for StrictSAM, we'd need these rules tied to the SAM version numbers.
The should be annotated as such in the text, eg "[1.4]" meaning 1.4 and
newer or "[=1.4]" being precisely 1.4 only and not anything else (I doubt
we'd need that though). We also need to be clear which case we're talking
about - a SAMStrict only rule or a SAM validation rule.
Comments:
-
Some obvious things which are simply restating the main SAM spec, such
as SQ tags SN and LN being mandatory and matching specific regular
expressions. These are fine and it's good to be able to give a failure a
symbolic name.
-
In the StrictSAM you make the HD line mandatory. I'd have loved this
to be so when SAM was designed, but sadly it wasn't done and we have
thousands of files that this rule would make non-conformant. I'd prefer to
have a new SAM version where, say, 1.6 onwards HD is mandatory. Old files
don't get validated so rigorously, but we could run a tool that converts
them to 1.6 and does corresponding validation. That way we get this into
the real spec rather than a sibling spec. The same applies for many other
strict rules. For example currently we have "If RNAME is *, no assumptions
can be made about POS and CIGAR". I think it's justified for a *new*
SAM spec version number to remove the "no assumptions" rule and simply
dictate them as 0 and *.
-
Limiting SQ length to 2Gb is arbitrary. Similarly for POS. *SAM* has
no limit as it is entirely textual. BAM and CRAM do, but I could easily see
a CRAMv4 where these are 64-bit fields instead. I see no reason to limit
the textual version. If we wish to have a BAM validator then that is a
subtly different thing. Perhaps this rule needs to come under a
compatability / warning level rather than something strict.
-
Please use symbolic names for the FLAG values instead of 0x1, 0x2 etc.
We can define them in a table before referring to them (suggest the same
ones used in various bits of software).
-
(SAM spec) "Multiple segments in sequencing" (flag 0x1) is hard to
understand and needs careful interpretation. A single segment in sequencing
may yield multiple primary alignment records in the case of a chimeric or
split alignment. In SAM each line is an alignment, not a segment, so the
word segment doesn't really have any impact on whether we expect to see it
once or not in the file. I think it's hard therefore to have a rule that
validates this.
-
The SAM spec states if 0x1 is set then no assumptions can be made
about 0x40 and 0x80. For validation we cannot therefore state that they
must be unset, so this needs to be clear that it is a SAMstrict additional
rule rather than a SAM validation rule. Similarly for the flag 0x4 rules
and RNAME *.
-
CIGAR operators can, in theory, be entirely insertion. A combination
of P and I ops permits sequences overlapping a large insertion to be
represented within SAM. I don't see how this violates the SAM spec.
-
Similarly it is valid to have reads starting/ending with insertions
(deletions are a bit hard to argue, but maybe possible in some chimeric
situations?). (Neighbouring I and D ops are also valid, despite being
argued against in the past.)
-
Unmapped reads should have their RNAME/POS identical to *one* of the
original template. (It's possible to have more than 2 segments, although
not in use for modern techniques I think.) Again this is a SAMStrict
restriction rather than a SAM semantic validation issue, so needs
appropriate markup. Also I could envisage a joint assembler / aligner that
has concluded a read belongs somewhere inside a large insertion at point X,
but it doesn't have a proper alignment for it so it "places" it there as
unmapped.
-
I don't see why 0x800 (supplementary) *requires* the existance of an
SA tag. The tags are useful, but optional. Tools can trivially just check
for the tag existance and this is less work than running a new tool to
strip out all reads with flag 0x800 and no SA tag so this doesn't gain us
anything.
-
"Each chimeric alignment must have a record with FLAG 0x800 not set" -
this comes down to my argument before about complete vs incomplete data
sets. If we do a range query to produce a new SAM, the above constraint can
trivially be violated. Sub sets of SAM are still valid SAM and we should
never treat them otherwise as this is an extremely common operation, not
least for purposes of parallel processing. A validator when run on the
entire data set can however be in a position to complain about such things.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#281 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ACnk0oh1v5TGxE8nFf3g6v8Ym-SJJjdiks5tQFZVgaJpZM4RcqVQ>
.
|
I've attempted to avoid making a new spec, and merely forcing resolution of ambiguities and semantic consistency. Almost all of the additional rules that you are concerned about are actually in the SAM specs themselves.
It became pretty clear that the document was going to turn into a checklist for a validator so I've been including base SAM validator rules in there as well (with a \samv15 tag).
Hence \samv15 on about half of the rules I’ve made thus far.
These (and most of the others that you consider a sub-spec) have been taken directly from section 2 of the SAM specifications themselves. I've pulled in everything in section 2 of the SAM specs wholesale as these are all things the SAM specifications themselves already consider 'recommended practice'. What was the point behind section 2 to start with? Clearly it is considered part of the existing SAM specs as it is in the document, but why have it in there at all if you're uncomfortable in enforcing these recommendations? Would you be happy if I labelled all rules derived from section 2 of the specs with a \samv15bestpractice tag? Should SAM strict just replace Section 2 of the SAM specs?
Section 1.4 of the SAM specs explicitly restrict POS to [0, 2^31 - 1]. I’d like to add additional validation warnings so e.g. >512Mb would generate a warning that you can’t make a .bai index from this reference.
Good idea.
Multiple non-secondary alignment record, yes. Section 1.4.2 does define and constrain a “primary line” and a “supplementary line” wrt chimeric alignments.
We can constrain it to make sense though. 0x1, 0x2, 0x4, 0x8, 0x40, 0x80 all refer to segments, and SAM Section 1.4.2 point 1 requires that each “read” has one “primary line”. The SAM specs themselves are indeed a bit inconsistent about what is a “segment” and what is a “read” so I interpreted “read” as “segment” which I think is a reasonable position to take.
I’ve tagged all the SAM rule as \samv15.
Section 1.4.4 defines POS in terms of
In that case, we can argue that it should be mapped with an insertion CIGAR, not unmapped to a random location in the genome. Technically, one could argue that sequencing technique such as Chromium are multi-segment but their random locations don’t fit the SAM multi-segment model anyway so it’s a moot point.
I enforced this as otherwise multiple secondary chimeric alignments could not be resolved unambiguously. This is somewhat of a symptom of the debate a few years back about whether alignments came in set of chimeric alignment records (the interpretation I used), or as nodes in an alignment graph. Happy to drop the requirement down to requiring SA tags on all or none for a given read.
Not true. SAM Section 1.4.2 point 1 states: “For each read/contig in a SAM file, it is required that one and only one line associated with the read satisfies ‘FLAG & 0x900 == 0’.” If you have any secondary or supplementary alignments, sub-setting your valid SAM will most likely result in a file which is not a valid SAM. This particular constraint is part of the SAM specs (and is marked as such), not an additional SAM strict check. From a practical perspective, sub-setting is done all the time and we need to be able to deal with it. How does an overall SAM strict recommendation that these all validators should have the capability of performing validation only a specified a subset of genomic regions sound to you? Coordinates are required as I don’t want to allow arbitrary subsets as random downsampling that loses half your read pairing should be treated very differently to just validating the chr1 records. |
I've made a PR to continue this as discussing particular rules is much easier with a diff in front of you. |
As discussed on the last and penultimate call, there's a desire to standardize a "Strict Sam" specification.
There's already a stub of an implementation here: https://github.com/d-cameron/hts-specs/blob/strict/SAMstrict.tex
@d-cameron
I didn't know how to respond to your request that we take a look at your initial branch...so I am opening this issue for a discussion...if you have a better idea, I'm game.
The text was updated successfully, but these errors were encountered: