CRAM: Refactor and Test Slice #1242
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1242 +/- ##
===============================================
+ Coverage 69.376% 69.535% +0.159%
- Complexity 8140 8325 +185
===============================================
Files 547 549 +2
Lines 32622 33199 +577
Branches 5515 5654 +139
===============================================
+ Hits 22632 23085 +453
- Misses 7772 7865 +93
- Partials 2218 2249 +31
 */
- void recordMetaData(Slice slice) {
+ void recordMetaData(final SliceBAIMetadata sliceMetadata) {
Does not require a proper `Slice`. Callers to this method would fake a `Slice` to do so: see the changes in `CRAMBAIIndexer`.
I think the fake slices are only in the multiref code path (?), and are that way because multi-ref was grafted onto the existing code. It seems like this should take a slice, or maybe even be delegated to Slice ?
This method is accessed 3 ways:
- indexing multi-ref Slices by constructing "fake slices" (old way) or a `SliceBAIMetadata` (new way). I really want this to be something other than `Slice` because I don't want to imply they have the full capabilities of a Slice.
- indexing single-ref Slices.
- converting CRAI index entries to BAI - this is missing functionality (see `CRAIIndex.java`).
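For readers following along, a value type like the one described - carrying only what BAI indexing needs - might look roughly like this. This is a sketch only; the real `SliceBAIMetadata` in the PR may have different fields and names:

```java
// Hypothetical sketch of a slice-metadata value class for BAI indexing.
// Field names are assumptions, not the PR's actual API.
class SliceMetadataSketch {
    private final int sequenceId;
    private final int alignmentStart;
    private final int alignmentSpan;
    private final int recordCount;
    private final long containerByteOffset;

    SliceMetadataSketch(final int sequenceId, final int alignmentStart, final int alignmentSpan,
                        final int recordCount, final long containerByteOffset) {
        this.sequenceId = sequenceId;
        this.alignmentStart = alignmentStart;
        this.alignmentSpan = alignmentSpan;
        this.recordCount = recordCount;
        this.containerByteOffset = containerByteOffset;
    }

    int getSequenceId() { return sequenceId; }
    int getAlignmentStart() { return alignmentStart; }
    int getAlignmentSpan() { return alignmentSpan; }
    int getRecordCount() { return recordCount; }
    long getContainerByteOffset() { return containerByteOffset; }
}
```

An indexer that accepts a type like this can then serve both real slices and the CRAI-conversion path without faking a full `Slice`.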
    return;
}

final long start = slice.offset;
final long end = slice.offset + 0;
This made no sense to me - please let me know if there was a reason for this line.
No idea. Even though your change seems right, I'd be reluctant to take it in with all of these other changes without knowing for sure that we have good tests that ensure that an index created with this code can actually satisfy queries in all cases (single/multiple containers, multiple slices, etc). This change will affect the code below, and thus change the resulting index. We either need to do the test audit now, and beef them up, or else maybe we should revert this for now, add a comment, and make a ticket to follow up separately ?
I'll revert and point to the open ticket #401
for (final int refId : mappedReferences) {
    final AlignmentSpan span = referenceAlignmentMap.get(refId);
    processSingleReferenceSlice(slice.getBAIMetadata(refId, span, container.offset));
Instead of creating a fake `Slice`, we generate the BAI Metadata from a real `Slice` and a few parameters.
final byte[] refs;
This was a confusing name. It's a single reference.
Yeah, that's definitely better...
@@ -450,11 +446,7 @@ protected void flushContainer() throws IllegalArgumentException, IllegalAccessEx
    }
}

final Container container = containerFactory.buildContainer(cramRecords);
for (final Slice slice : container.slices) {
    slice.setRefMD5(refs);
Moved this and the `offset` assignment inside `buildContainer`.
@@ -45,7 +46,7 @@
private final ArrayList<SAMRecord> records;
private SAMRecord nextRecord = null;
private final CramNormalizer normalizer;
- private byte[] refs;
+ private byte[] refBases;
Same as before: this is a single reference.
    slice.alignmentStart,
    slice.alignmentSpan,
    String.format("%032x", new BigInteger(1, slice.refMD5)));
throw new CRAMException(msg);
Moved inside the validate call: see SliceHeader.java line 239.
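As context for the `String.format` call in the diff above: `%032x` zero-pads the digest to 32 hex characters, and the signum argument of 1 in `new BigInteger(1, bytes)` keeps bytes with the high bit set from being read as a negative number. A small self-contained illustration:

```java
import java.math.BigInteger;

class Md5Hex {
    // Render a 16-byte digest as lowercase, zero-padded, 32-character hex.
    static String toHex32(final byte[] digest) {
        // signum 1 => treat the bytes as an unsigned magnitude
        return String.format("%032x", new BigInteger(1, digest));
    }
}
```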
/**
 * A class representing CRAI index entry: file and alignment offsets for each slice.
 * Created by vadim on 10/08/2015.
 */
public class CRAIEntry implements Comparable<CRAIEntry>, Cloneable {
This class is now immutable. It also no longer implements `Cloneable` because it no longer has a `clone()` - that was only used by tests.
}

/**
 * Create a SliceBAIMetadata object from a CRAIEntry by supplying the necessary fields
BAI stores information that CRAI does not, so we need to add the two missing fields here.
See #531
@@ -114,19 +128,6 @@ public int compareTo(final CRAIEntry o) {
    return (int) (containerStartOffset - o.containerStartOffset);
}

@Override
public CRAIEntry clone() throws CloneNotSupportedException {
was only used by tests. I updated those to use constructors.
excellent.
entries.add(e);
if (!container.isEOF()) {
    for (final IndexableSlice slice : container.slices) {
Containers need `IndexableSlice`s because they need to refer to the index metadata in cases like this.
    return entries;
}

public static SeekableStream openCraiFileAsBaiStream(final File cramIndexFile, final SAMSequenceDictionary dictionary) throws IOException {
unused
 * Return a list of CRAI Entries; one for each reference in the multireference slice.
 * TODO: this should be refactored and delegate to container/slice
 */
private static Collection<CRAIEntry> getCRAIEntriesForMultiRefSlice(
Inlined above, with the `CRAIEntry` creation step changed to a stream-mapping of `Slice.getCRAIEntry()`.
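The stream-mapping shape mentioned here, sketched with stand-in types (the real code maps `Slice.getCRAIEntry()` over a container's slices; the types and entry format below are illustrative only):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class StreamMapDemo {
    // Stand-in for a Slice; the real class has many more fields.
    static class Slice {
        final int sequenceId;
        Slice(final int sequenceId) { this.sequenceId = sequenceId; }
        // Stand-in for Slice.getCRAIEntry(): derive an index entry from the slice.
        String toCRAIEntry() { return "entry:" + sequenceId; }
    }

    // One index entry per slice, via a stream mapping instead of a manual loop.
    static List<String> entriesFor(final List<Slice> slices) {
        return slices.stream().map(Slice::toCRAIEntry).collect(Collectors.toList());
    }
}
```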
// https://github.com/samtools/htsjdk/issues/531
// entry.sliceSize is the slice size in bytes, not the number of
// records; this results in the BAMIndex metadata being wrong
slice.nofRecords = entry.sliceSize;
Now called `recordCount`, set to 0 here.
@@ -52,57 +54,75 @@ public ContainerFactory(final SAMFileHeader samFileHeader, final int recordsPerS
    this.recordsPerSlice = recordsPerSlice;
}

public Container buildContainer(final List<CramCompressionRecord> records) {
Merged these two methods. `substitutionMatrix` was never set so it's just `null` below.
Is there any codepath left that actually passes one of these into CompressionHeaderFactory and isn't a test? It seems like it's a test artifact that can be removed from both buildContainer and the CompressionHeader. Then the tests would have to be updated since they're probably useful tests, just at the wrong layer, so if you don't want to do that in this PR, can you add a TODO or something to make sure we do it later?
You're right. I'll add a TODO.
final Container container = new Container();
container.header = header;
container.nofRecords = records.size();
container.globalRecordCounter = globalRecordCounter;
container.bases = 0;
container.blockCount = 0;
container.offset = offset;
This was set outside the `buildContainer()` call before, for some reason.
// assuming one sequence per container max:
if (container.sequenceId == -1 && slice.sequenceId != -1)
    container.sequenceId = slice.sequenceId;
// write to a stream to determine byte sizes for Slice BAI Metadata
Is there a better way?
This also seems expensive. Before, this was doing the indexing after the fact, right? Maybe better and simpler to go back to that?
By that do you mean storing unindexed Slices in Containers?
slices.add(slice.withIndexingMetadata(sliceByteOffset, sliceByteSize, sliceIndex++));

container.bases += sliceRecords.stream().map(r -> r.readLength).reduce(Integer::sum).orElse(0);
`Slice.bases` only existed to help populate `Container.bases`. It no longer exists, replaced by this step.
I'd be worried this will be expensive.
The old code is:
- Loop through chunks of records. Populate `Slice.bases` by looping through individual `CramCompressionRecord`s and accumulating their `readLength`s. Add each Slice's total to `Container.bases`.

Is the concern that it's another loop through the record chunk?
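On the cost question: the stream in the diff boxes each `readLength` through `Integer` on its way into `reduce`. If that overhead ever matters, `mapToInt(...).sum()` performs the same accumulation without boxing. A sketch with a stand-in record type (the real class is `CramCompressionRecord`):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class BaseCountDemo {
    // Stand-in for CramCompressionRecord; only the field we need here.
    static class Rec {
        final int readLength;
        Rec(final int readLength) { this.readLength = readLength; }
    }

    // Equivalent result to records.stream().map(r -> r.readLength).reduce(Integer::sum).orElse(0),
    // but the IntStream avoids boxing each length.
    static long totalBases(final List<Rec> records) {
        return records.stream().mapToInt(r -> r.readLength).sum();
    }
}
```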
if (s.sequenceId != SAMRecord.NO_ALIGNMENT_REFERENCE_INDEX) {
    start = Math.min(start, s.alignmentStart);
    end = Math.max(end, s.alignmentStart + s.alignmentSpan);
for (final Slice slice : container.slices) {
Being able to ignore the indexing parts when we don't need them is my motivation for making `IndexableSlice` inherit from `Slice`.
    return spanMap;
}

ArrayList<CramCompressionRecord> getRecords(ArrayList<CramCompressionRecord> records,
-    final Slice slice, final CompressionHeader header, final ValidationStringency validationStringency) throws IllegalArgumentException {
+    final IndexableSlice slice,
Needs to be an `IndexableSlice` because we store the index in the record.
 * @return CRAM container or null if no more data
 * @throws IOException
 */
private static Container readContainer(final int major, final InputStream inputStream) throws IOException {
Merged with the other method of the same name. The other was never called, so we can remove its 2 parameters
final Container container = readContainerHeader(major, inputStream);
if (container.isEOF()) {
    return container;
}

container.header = CompressionHeader.read(major, inputStream);

howManySlices = Math.min(container.landmarks.length, howManySlices);
this was always MAX_VALUE
if (fromSlice > 0) //noinspection ResultOfMethodCallIgnored
this was always 0
container.blockCount++;
container.blockCount++;
if (slice.embeddedRefBlock != null) container.blockCount++;
Do we know of any Container block-count bugs? I bet this is one.
@@ -1,294 +0,0 @@
/**
Moved to a new `slice` subpackage.
 * Since this is not part of the CRAM stream, we must first construct a {@link Slice} from the stream
 * and then construct the IndexableSlice using values derived from the Slice
 */
public class IndexableSlice extends Slice {
Alternate implementation: contains a `Slice` and indexing values.
I think the alternative you propose (put these in `Slice`) would be better. It seems like the incremental values modeled by this class overlap completely with `SliceBAIMetadata`, and the rest of `SliceBAIMetadata` in turn overlaps with values already modeled by `Slice` (or `SliceHeader`). I think it would be cleaner to factor out a single class representation of the index metadata, and just have `Slice` hold one of those or not as appropriate.
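The factoring the reviewer describes - one small index-metadata class, held (or not) by `Slice` instead of an `IndexableSlice` subclass - might look roughly like this. All names here are illustrative, not the PR's API:

```java
import java.util.Optional;

class CompositionSketch {
    // The single index-metadata representation the reviewer suggests factoring out.
    static final class SliceIndexMetadata {
        final long byteOffset;
        final int byteSize;
        final int index;
        SliceIndexMetadata(final long byteOffset, final int byteSize, final int index) {
            this.byteOffset = byteOffset;
            this.byteSize = byteSize;
            this.index = index;
        }
    }

    // Slice holds the metadata optionally, instead of gaining it via inheritance.
    static final class Slice {
        private final Optional<SliceIndexMetadata> indexMetadata;
        Slice() { this.indexMetadata = Optional.empty(); }
        private Slice(final SliceIndexMetadata meta) { this.indexMetadata = Optional.of(meta); }
        // Copy-style attachment keeps the un-indexed Slice immutable.
        Slice withIndexMetadata(final long offset, final int size, final int index) {
            return new Slice(new SliceIndexMetadata(offset, size, index));
        }
        boolean isIndexed() { return indexMetadata.isPresent(); }
    }
}
```

Code paths that never index can ignore the metadata entirely, which preserves the motivation stated above for the subclass approach.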
/**
 * Uses a Multiple Reference Slice Alignment Reader to determine the Reference Spans of a Slice.
 * The intended use is for CRAI indexing.
Hmmm. Maybe I should move this to IndexableSlice.
    return getEmbeddedRefBlockContentID() != NO_EMBEDDED_REFERENCE;
}

@Override
generated these for a test
- entry.sliceOffset = 3;
- entry.sliceSize = 4;
- entry.containerStartOffset = 5;
+ final CRAIEntry entry = new CRAIEntry(0, 1, 2, 5, 3, 4);
This is the order that corresponds to the above. shrug
slice.sequenceId = 1;
slice.alignmentStart = 2;
slice.alignmentSpan = 3;
slice.containerOffset = 4;
I removed this field from Slice because it was redundant to `Container.offset` and I believe it got out of sync in some cases.
Sounds good.
- public void testIntersetcsZeroSpan() {
-     Assert.assertFalse(CRAIEntry.intersect(newEntry(1, 1), newEntry(1, 0)));
+ public void testIntersectsZeroSpan() {
+     Assert.assertFalse(CRAIEntry.intersect(newEntry(1, 1, 1), newEntry(1, 1, 0)));
Changed because I made `newEntry` more generic.
private static CRAIEntry newEntrySeqStart(final int seqId, final int start) {
    return new CRAIEntry(seqId, start, 0, 0, 0, 0);
}
private static CRAIEntry updateStart(final CRAIEntry toClone, final int alignmentStart) {
these replaced CRAIEntry.clone()
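Helpers like `updateStart` follow the standard copy-with-update pattern for immutable types: build a new instance copying every field except the one being changed. A generic sketch (the field set is illustrative, not `CRAIEntry`'s actual fields):

```java
class WithHelperDemo {
    // A small immutable value type with a copy-with-update helper.
    static final class Entry {
        final int seqId;
        final int alignmentStart;
        final int span;
        Entry(final int seqId, final int alignmentStart, final int span) {
            this.seqId = seqId;
            this.alignmentStart = alignmentStart;
            this.span = span;
        }
        // The immutable replacement for mutating a cloned object:
        // copy all fields, substituting only the changed one.
        Entry withAlignmentStart(final int newStart) {
            return new Entry(seqId, newStart, span);
        }
    }
}
```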
- Assert.assertTrue(slice.validateRefMD5(null));
- Assert.assertTrue(slice.validateRefMD5(new byte[0]));
- Assert.assertTrue(slice.validateRefMD5(new byte[1024]));
+ slice.validateRefMD5(null);
now throws instead of returning boolean
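Switching a validator from returning `boolean` to throwing changes how tests assert failure: with TestNG (which htsjdk uses, as in the `expectedExceptions` attribute visible below) the framework catches the exception, and in plain code the equivalent is a try/catch. A sketch with a hypothetical validator, not the real `validateRefMD5`:

```java
class ThrowingValidatorDemo {
    static class ValidationException extends RuntimeException {
        ValidationException(final String msg) { super(msg); }
    }

    // Hypothetical stand-in: throws on mismatch instead of returning false.
    static void validateLength(final byte[] ref, final int expectedLength) {
        if (ref == null || ref.length != expectedLength) {
            throw new ValidationException("reference length mismatch");
        }
    }

    // Returns true iff validation threw - the try/catch shape a caller
    // uses when it cannot rely on a framework's expectedExceptions.
    static boolean failsValidation(final byte[] ref, final int expectedLength) {
        try {
            validateLength(ref, expectedLength);
            return false;
        } catch (final ValidationException e) {
            return true;
        }
    }
}
```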
    return dummySliceForTesting(sequenceId, alignmentStart, alignmentSpan, refBases);
}

@Test(dataProvider = "badRefs1", expectedExceptions = CRAMException.class)
Should we consolidate or update these 3 error cases?
@@ -0,0 +1,24 @@
package htsjdk.samtools.cram.structure.slice
Everyone loves Scala, right?
Quite a few comments for the first pass on this one. See what you think.
InputStreamUtils.readFully(inputStream, referenceMD5s, 0, MD5_LEN);

final byte[] tagBytes = InputStreamUtils.readFully(inputStream);
final SAMBinaryTagAndValue tags = (major >= CramVersions.CRAM_v3.major) ?
So these are (slice header?) tags, not SAM tags, right? Not sure what they're for, but does CRAM v2 not have these at all? I vaguely recall there being issues with tags in v2, but we should find out.
CRAM 2 slice headers don't have tags. I don't know enough about tags to answer the first question, but the relevant section of the CRAM 3 spec has "a series of tag,type,value tuples encoded as per BAM auxiliary fields."
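For reference, the "tag,type,value tuples encoded as per BAM auxiliary fields" layout from the spec is: a 2-character tag, a 1-character type code, then a little-endian value. A minimal illustration for the `i` (int32) type only; real readers must also handle the other type codes (A, c, C, s, S, I, f, Z, H, B):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class BamAuxDemo {
    // Encode one BAM auxiliary field of type 'i': tag (2 bytes), type (1 byte),
    // then the int32 value in little-endian order - 7 bytes total.
    static byte[] encodeIntTag(final char t1, final char t2, final int value) {
        final ByteBuffer buf = ByteBuffer.allocate(7).order(ByteOrder.LITTLE_ENDIAN);
        buf.put((byte) t1).put((byte) t2).put((byte) 'i').putInt(value);
        return buf.array();
    }

    static int decodeIntTag(final byte[] field) {
        final ByteBuffer buf = ByteBuffer.wrap(field).order(ByteOrder.LITTLE_ENDIAN);
        buf.position(3); // skip tag (2 bytes) and type (1 byte)
        return buf.getInt();
    }
}
```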
public static final int MULTI_REFERENCE = -2;
public static final int NO_REFERENCE = SAMRecord.NO_ALIGNMENT_REFERENCE_INDEX; // -1
public static final int NO_ALIGNMENT_START = -1;
public static final int NO_ALIGNMENT_SPAN = 0;
I'm a bit unclear on what these various values imply about the slice. When does a slice not have a valid start, span, or reference (when it's unmapped, but also when it's multi-ref)? If a slice doesn't have a valid start, or a valid span, or a valid reference, does it imply something about the other values, i.e., if it has no start, must it always be the case that span isn't valid either? I think we should try to tighten up the state management of these through mutators of some kind, rather than publicly exposing all of these. There are a few places outside of `Slice` that check one or the other of these and do something, but it's not always clear from the context what exactly they're checking for. Maybe we could even get rid of the need for `alignmentBordersSanityCheck`. Also maybe some of the state sanity checking can be done in `AlignmentSpan` (see my comment elsewhere about trying to use that class in place of the individual start/span/id values).
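One way to tighten that state management, as a sketch only (names hypothetical, not the PR's code): replace the public sentinel ints with a small type whose factory methods enforce which field combinations are legal, so callers ask questions instead of comparing magic values.

```java
class ReferenceContextDemo {
    enum Kind { SINGLE_REFERENCE, MULTIPLE_REFERENCE, UNMAPPED_UNPLACED }

    static final class ReferenceContext {
        final Kind kind;
        final int sequenceId; // only meaningful for SINGLE_REFERENCE

        private ReferenceContext(final Kind kind, final int sequenceId) {
            this.kind = kind;
            this.sequenceId = sequenceId;
        }

        static ReferenceContext single(final int sequenceId) {
            if (sequenceId < 0) {
                throw new IllegalArgumentException("single-reference requires a real sequence id");
            }
            return new ReferenceContext(Kind.SINGLE_REFERENCE, sequenceId);
        }

        static ReferenceContext multiRef() { return new ReferenceContext(Kind.MULTIPLE_REFERENCE, -2); }
        static ReferenceContext unmapped() { return new ReferenceContext(Kind.UNMAPPED_UNPLACED, -1); }

        // Callers ask a question instead of comparing sentinel ints;
        // the invariant "no reference => no start/span" lives in one place.
        boolean hasAlignment() { return kind == Kind.SINGLE_REFERENCE; }
    }
}
```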
- if (slice.alignmentSpan < 1) {
-     unAlignedRecords += slice.nofRecords;
+ if (sliceMetadata.getAlignmentSpan() == SliceHeader.NO_ALIGNMENT_SPAN) {
It's not clear to me that `NO_ALIGNMENT_SPAN` is really equivalent to having only unaligned records. I don't think this PR changes the issue; if it's wrong it was already wrong - but see my other comments below about the meaning of these various states.
final SliceHeader sliceHeader = new SliceHeader(sequenceId, alignmentStart, alignmentSpan, records.size(), globalRecordCounter,
    dataBlockCount, contentIDs, embeddedRefBlockContentId, refMD5, hasher.getAsTags());

return new Slice(sliceHeader, coreBlock, buildBlocksFromStreams(compressionHeader.externalCompressors, externalStreamMap));
Is it possible to just delegate more of this to `Slice`, and let Slice create its own SliceHeader?
Something like this?
return new Slice(sequenceId, alignmentStart, alignmentSpan, records.size(),
globalRecordCounter, dataBlockCount, contentIDs, embeddedRefBlockContentId, refMD5,
hasher.getAsTags(), coreBlock, buildBlocksFromStreams(compressionHeader.externalCompressors, externalStreamMap));
Not sure what that would gain us. I would have to add another `Slice` constructor for this case while keeping the current one for `Slice.read()` to use.
- minor cleanup of ContainerParser
Closing this while we work through the spec implications here: samtools/hts-specs#370
Description
- Encapsulate `Slice` and make immutable
- `SliceHeader` class
- `IndexableSlice` subclass with indexing metadata
- `SliceBAIMetadata` class and update index-consuming methods to take this as input instead of slice
- `CRAIEntry` immutable and add a method to convert to a `SliceBAIMetadata`
Checklist