Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRAM header #201

Closed
jkbonfield opened this issue Apr 11, 2017 · 2 comments
Closed

CRAM header #201

jkbonfield opened this issue Apr 11, 2017 · 2 comments
Labels

Comments

@jkbonfield
Copy link
Contributor

jkbonfield commented Apr 11, 2017

Section 8.3 of the specification has several errors, but also things that the C and Java implementations have drifted apart on. We need to tighten this up.

  1. The SQ:MD5 checksum is required unless the reference sequence has been embedded into the file. Maybe. If I use scramble -x to produce a referenceless CRAM file then it also should operate without MD5sums surely?

This then segways into a totally different incompatibility. I wasn't able to validate whether C/Java interop works without M5 tags for referenceless encoding because of my use of MD5 zero in slice headers. The C code knows the slice reference is unused, therefore does not need validating. It does this by setting MD5 to zero. The Java code however checks this and fails:

Exception in thread "main" htsjdk.samtools.cram.CRAMException: Reference sequence MD5 mismatch for slice: sequence id 0, start 10001, span 33954, expected MD5 00000000000000000000000000000000
	at htsjdk.samtools.CRAMIterator.nextContainer(CRAMIterator.java:187)
	at htsjdk.samtools.CRAMIterator.hasNext(CRAMIterator.java:261)
	at htsjdk.samtools.SamReader$AssertingIterator.hasNext(SamReader.java:592)
	at picard.sam.SamFormatConverter.doWork(SamFormatConverter.java:83)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
	at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
	at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)

This is perhaps a bug in my code. I'm way out beyond the spec here (although I thought we discussed and changed it, but apparently that was just on my to-do list rather than a reality!) because there is no mention of zero in that situation. We so have a specification that 0 is used for unmapped or unsorted data, so this is a minor tweak. Is it a useful addition though?

What does picard.jar do when it creates referenceless CRAM files?

  1. "At least one RG record is required."

This is not true. C and Java implementations can read and write CRAM without RG tags. (It was true for version 1.0 I believe.)

  1. "The HD:SO sort order is always POS."

Firstly, POS is not a valid sort order; it is "coordinate" instead.
Secondly we all support unsorted and name sorted data too, so the statement is incorrect.

Although name sorted data in picard is "interesting". With SO:queryname it strictly enforces one interpretation of name sorting, so A100 is before A99. This is not the same name sort order than samtools uses, so the two are incompatible. Perhaps the SAM spec needs tightening here as it has no definition of what sort actually means and whether it is an intelligent sort or not.

@jmarshall
Copy link
Member

What SO:queryname is supposed to mean is #5.

@jkbonfield
Copy link
Contributor Author

Thanks for the link. It's a bit of an aside anyway and I probably shouldn't have gone down that particular rabbit hole. :-) The particular issue here is that CRAM doesn't have a requirement of position sorted data so the text is invalid.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants