Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRAM: Refactor and Test Slice #1242

Closed
wants to merge 16 commits into from
Closed

CRAM: Refactor and Test Slice #1242

wants to merge 16 commits into from

Conversation

jmthibault79
Copy link
Contributor

Description

Encapsulate Slice and make immutable

  • extract immutable SliceHeader class
  • create immutable IndexableSlice subclass with indexing metadata
  • create immutable SliceBAIMetadata class and update index-consuming methods to take this as input instead of slice
  • make CRAIEntry immutable and add a method to convert to a SliceBAIMetadata

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@codecov-io
Copy link

codecov-io commented Dec 13, 2018

Codecov Report

Merging #1242 into master will increase coverage by 0.159%.
The diff coverage is 82.441%.

@@               Coverage Diff               @@
##              master     #1242       +/-   ##
===============================================
+ Coverage     69.376%   69.535%   +0.159%     
- Complexity      8140      8325      +185     
===============================================
  Files            547       549        +2     
  Lines          32622     33199      +577     
  Branches        5515      5654      +139     
===============================================
+ Hits           22632     23085      +453     
- Misses          7772      7865       +93     
- Partials        2218      2249       +31
Impacted Files Coverage Δ Complexity Δ
.../samtools/cram/build/CompressionHeaderFactory.java 90.842% <ø> (ø) 66 <0> (ø) ⬇️
...java/htsjdk/samtools/cram/structure/Container.java 81.25% <100%> (ø) 7 <0> (ø) ⬇️
...oding/reader/MultiRefSliceAlignmentSpanReader.java 100% <100%> (ø) 6 <2> (ø) ⬇️
src/main/java/htsjdk/samtools/CRAMIterator.java 78.676% <100%> (-1.044%) 32 <0> (-1)
...amtools/cram/encoding/writer/CramRecordWriter.java 89.063% <100%> (ø) 26 <0> (ø) ⬇️
.../samtools/cram/structure/slice/SliceAlignment.java 86.207% <100%> (ø) 10 <4> (?)
...jdk/samtools/cram/structure/CompressionHeader.java 90.244% <100%> (ø) 28 <0> (ø) ⬇️
...amtools/cram/encoding/reader/CramRecordReader.java 85.256% <100%> (ø) 26 <0> (ø) ⬇️
...c/main/java/htsjdk/samtools/cram/build/CramIO.java 83.721% <100%> (ø) 20 <1> (ø) ⬇️
...amtools/cram/structure/slice/SliceBAIMetadata.java 100% <100%> (ø) 9 <9> (?)
... and 28 more

*/
void recordMetaData(Slice slice) {
void recordMetaData(final SliceBAIMetadata sliceMetadata) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does not require a proper Slice. Callers to this method would fake a Slice to do so: see the changes in CRAMBAIIndexer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the fake slices are only in the multiref code path (?), and are that way because multi-ref was grafted onto the existing code. It seems like this should take a slice, or maybe even be delegated to Slice ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is accessed 3 ways:

  • indexing Multi-ref Slices by constructing "fake slices" (old way) or a SliceBAIMetadata (new way). I really want this to be something other than Slice because I don't want to imply they have the full capabilities of a Slice.
  • indexing single-ref Slices.
  • converting CRAI index entries to BAI - this is missing functionality. (see CRAIIndex.java)

return;
}

final long start = slice.offset;
final long end = slice.offset + 0;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This made no sense to me - please let me know if there was a reason for this line.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea. Even though your change seems right, I'd be reluctant to take it in with all of these other changes without knowing for sure that we have good tests that ensure that an index created with this code can actually satisfy queries in all cases (single/multiple containers, multiple slices, etc). This change will affect the code below, and thus change the resulting index. We either need to do the test audit now, and beef them up, or else maybe we should revert this for now, add a comment, and make a ticket to follow up separately ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll revert and point to the open ticket #401


for (final int refId : mappedReferences) {
final AlignmentSpan span = referenceAlignmentMap.get(refId);
processSingleReferenceSlice(slice.getBAIMetadata(refId, span, container.offset));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of creating a fakeSlice we generate the BAI Metadata from a real Slice and a few parameters


final byte[] refs;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a confusing name. It's a single reference.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, thats definitely better...

@@ -450,11 +446,7 @@ protected void flushContainer() throws IllegalArgumentException, IllegalAccessEx
}
}

final Container container = containerFactory.buildContainer(cramRecords);
for (final Slice slice : container.slices) {
slice.setRefMD5(refs);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this and the offset assignment inside buildContainer

@@ -45,7 +46,7 @@
private final ArrayList<SAMRecord> records;
private SAMRecord nextRecord = null;
private final CramNormalizer normalizer;
private byte[] refs;
private byte[] refBases;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as before: this is a single reference.

slice.alignmentStart,
slice.alignmentSpan,
String.format("%032x", new BigInteger(1, slice.refMD5)));
throw new CRAMException(msg);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved inside the validate call: see SliceHeader.java line 239


/**
* A class representing CRAI index entry: file and alignment offsets for each slice.
* Created by vadim on 10/08/2015.
*/
public class CRAIEntry implements Comparable<CRAIEntry>, Cloneable {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is now immutable.

It also does not implement Cloneable any more because it no longer has a clone() - that was only used by tests.

}

/**
* Create a SliceBAIMetadata object from a CRAIEntry by supplying the necessary fields
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BAI stores information that CRAI does not, so we need to add the two missing fields here.

See #531

@@ -114,19 +128,6 @@ public int compareTo(final CRAIEntry o) {
return (int) (containerStartOffset - o.containerStartOffset);
}

@Override
public CRAIEntry clone() throws CloneNotSupportedException {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was only used by tests. I updated those to use constructors.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excellent.


entries.add(e);
if (!container.isEOF()) {
for (final IndexableSlice slice : container.slices) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Containers need IndexableSlices because they need to refer to the index metadata in cases like this

return entries;
}

public static SeekableStream openCraiFileAsBaiStream(final File cramIndexFile, final SAMSequenceDictionary dictionary) throws IOException {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused

* Return a list of CRAI Entries; one for each reference in the multireference slice.
* TODO: this should be refactored and delegate to container/slice
*/
private static Collection<CRAIEntry> getCRAIEntriesForMultiRefSlice(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inlined above, with the CRAIEntry creation step changed to a stream-mapping of Slice.getCRAIEntry()

// https://github.com/samtools/htsjdk/issues/531
// entry.sliceSize is the slice size in bytes, not the number of
// records; this results in the BAMIndex metadata being wrong
slice.nofRecords = entry.sliceSize;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now called recordCount, set to 0 here

@@ -52,57 +54,75 @@ public ContainerFactory(final SAMFileHeader samFileHeader, final int recordsPerS
this.recordsPerSlice = recordsPerSlice;
}

public Container buildContainer(final List<CramCompressionRecord> records) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged these two methods. substitutionMatrix was never set so it's just null below

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any codepath left that actually passes one of these into CompressionHeaderFactory and isn't a test ? It seems like its a test artifact that can be removed from both buildContainer and the CompressionHeader. Then the tests would have to be updated since they're probably useful tests, just at the wrong layer, so if you don't want to do that in this PR, can you add a TODO or something to make sure we do it later ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I'll add a TODO.


final Container container = new Container();
container.header = header;
container.nofRecords = records.size();
container.globalRecordCounter = globalRecordCounter;
container.bases = 0;
container.blockCount = 0;
container.offset = offset;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was set outside the buildContainer() call before, for some reason

// assuming one sequence per container max:
if (container.sequenceId == -1 && slice.sequenceId != -1)
container.sequenceId = slice.sequenceId;
// write to a stream to determine byte sizes for Slice BAI Metadata
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better way?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also seems expensive. Before, this was doing the indexing after the fact right ? Maybe better and simpler to go back to that ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By that do you mean storing unindexed Slices in Containers?


slices.add(slice.withIndexingMetadata(sliceByteOffset, sliceByteSize, sliceIndex++));

container.bases += sliceRecords.stream().map(r -> r.readLength).reduce(Integer::sum).orElse(0);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slice.bases only existed to help populate Container.bases. It no longer exists, replaced by this step.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be worried this will be expensive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old code is:

  • Loop through chunks of records. Populate Slice.bases by looping through individual CramCompressionRecords and accumulating their readLengths. Add each Slice's total to Container.bases

Is the concern that it's another loop through the record chunk?

if (s.sequenceId != SAMRecord.NO_ALIGNMENT_REFERENCE_INDEX) {
start = Math.min(start, s.alignmentStart);
end = Math.max(end, s.alignmentStart + s.alignmentSpan);
for (final Slice slice : container.slices) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being able to ignore the indexing parts when we don't need them is my motivation for making IndexableSlice inherit from Slice

return spanMap;
}

ArrayList<CramCompressionRecord> getRecords(ArrayList<CramCompressionRecord> records,
final Slice slice, final CompressionHeader header, final ValidationStringency validationStringency) throws IllegalArgumentException {
final IndexableSlice slice,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs to be an IndexableSlice because we store the index in the record

* @return CRAM container or null if no more data
* @throws IOException
*/
private static Container readContainer(final int major, final InputStream inputStream) throws IOException {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged with the other method of the same name. The other was never called, so we can remove its 2 parameters

final Container container = readContainerHeader(major, inputStream);
if (container.isEOF()) {
return container;
}

container.header = CompressionHeader.read(major, inputStream);

howManySlices = Math.min(container.landmarks.length, howManySlices);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was always MAX_VALUE


if (fromSlice > 0) //noinspection ResultOfMethodCallIgnored
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was always 0

container.blockCount++;
container.blockCount++;
if (slice.embeddedRefBlock != null) container.blockCount++;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know of any Container block-count bugs? I bet this is one.

@@ -1,294 +0,0 @@
/**
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to a new slice subpackage

* Since this is not part of the CRAM stream, we must first construct a {@link Slice} from the stream
* and then construct the IndexableSlice using values derived from the Slice
*/
public class IndexableSlice extends Slice {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternate implementation: contains a Slice and indexing values

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the alternative you propose (put these in Slice) would be better. It seems like the incremental values modeled by this class overlap completely with SliceBAIMetadata, and the rest of SliceBAIMetadata in turn overlaps with values already modeled by Slice (or SliceHeader). I think it would be cleaner to factor out a single class representation of the index metadata, and just have Slice hold one of those or not as appropriate.


/**
* Uses a Multiple Reference Slice Alignment Reader to determine the Reference Spans of a Slice.
* The intended use is for CRAI indexing.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. Maybe I should move this to IndexableSlice.

return getEmbeddedRefBlockContentID() != NO_EMBEDDED_REFERENCE;
}

@Override
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generated these for a test

entry.sliceOffset = 3;
entry.sliceSize = 4;
entry.containerStartOffset = 5;
final CRAIEntry entry = new CRAIEntry(0, 1, 2, 5, 3, 4);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the order that corresponds to the above. shrug

slice.sequenceId = 1;
slice.alignmentStart = 2;
slice.alignmentSpan = 3;
slice.containerOffset = 4;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this field from Slice because it was redundant to Container.offset and I believe it got out of sync in some cases.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

public void testIntersetcsZeroSpan() {
Assert.assertFalse(CRAIEntry.intersect(newEntry(1, 1), newEntry(1, 0)));
public void testIntersectsZeroSpan() {
Assert.assertFalse(CRAIEntry.intersect(newEntry(1, 1, 1), newEntry(1, 1, 0)));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed because I made newEntry more generic

private static CRAIEntry newEntrySeqStart(final int seqId, final int start) {
return new CRAIEntry(seqId, start, 0, 0, 0, 0);
}
private static CRAIEntry updateStart(final CRAIEntry toClone, final int alignmentStart) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these replaced CRAIEntry.clone()

Assert.assertTrue(slice.validateRefMD5(null));
Assert.assertTrue(slice.validateRefMD5(new byte[0]));
Assert.assertTrue(slice.validateRefMD5(new byte[1024]));
slice.validateRefMD5(null);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now throws instead of returning boolean

return dummySliceForTesting(sequenceId, alignmentStart, alignmentSpan, refBases);
}

@Test(dataProvider = "badRefs1", expectedExceptions = CRAMException.class)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consolidate or update these 3 error cases?

@@ -0,0 +1,24 @@
package htsjdk.samtools.cram.structure.slice
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everyone loves Scala, right?

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite a few comments for the first pass on this one. See what you think.

InputStreamUtils.readFully(inputStream, referenceMD5s, 0, MD5_LEN);

final byte[] tagBytes = InputStreamUtils.readFully(inputStream);
final SAMBinaryTagAndValue tags = (major >= CramVersions.CRAM_v3.major) ?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So these are (slice header ?) tags, not SAM tags, right ? Not sure what they're for, but does CRAM v2 not have these at all ? I vaguely recall there being issues with tags in V2, but we should find out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CRAM 2 slice headers don't have tags. I don't know enough about tags to answer the first question, but the relevant section of the CRAM 3 spec has "a series of tag,type,value tuples encoded as per BAM auxiliary fields."

public static final int MULTI_REFERENCE = -2;
public static final int NO_REFERENCE = SAMRecord.NO_ALIGNMENT_REFERENCE_INDEX; // -1
public static final int NO_ALIGNMENT_START = -1;
public static final int NO_ALIGNMENT_SPAN = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit unclear on what these various values imply about the slice. When does a slice not have a valid start, span, or reference (when its unmapped, but also its multi-ref) ? If a slice doesn't have a valid start, or a valid span, or a valid reference, does it imply something about the other values, ie., if it has no start, it must always the case that span isn't valid either ? I think we should try to tighten up the state management of these through mutators of some kind, rather than publicly exposing all of these. There are a few places outside of Slice that check one or the other of these and do something, but its not always clear from the context what exactly they're checking for. Maybe we could even get rid of the need for alignmentBordersSanityCheck. Also maybe some of the state sanity checking can be done in AlignmentSpan (see my comment elsewhere about trying using that class in place of the individual start/span/id values).


if (slice.alignmentSpan < 1) {
unAlignedRecords += slice.nofRecords;
if (sliceMetadata.getAlignmentSpan() == SliceHeader.NO_ALIGNMENT_SPAN) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not clear to me that NO_ALIGNMENT_SPAN is really equivalent to having only unaligned records. I don't think this PR changes the issue; if its wrong it was already wrong - but see my other comments below about the meaning of these various states.

*/
void recordMetaData(Slice slice) {
void recordMetaData(final SliceBAIMetadata sliceMetadata) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the fake slices are only in the multiref code path (?), and are that way because multi-ref was grafted onto the existing code. It seems like this should take a slice, or maybe even be delegated to Slice ?

final SliceHeader sliceHeader = new SliceHeader(sequenceId, alignmentStart, alignmentSpan, records.size(), globalRecordCounter,
dataBlockCount, contentIDs, embeddedRefBlockContentId, refMD5, hasher.getAsTags());

return new Slice(sliceHeader, coreBlock, buildBlocksFromStreams(compressionHeader.externalCompressors, externalStreamMap));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to just delegate more of this to Slice, and let Slice create it's own SliceHeader ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this?

return new Slice(sequenceId, alignmentStart, alignmentSpan, records.size(),
 globalRecordCounter, dataBlockCount, contentIDs, embeddedRefBlockContentId, refMD5,
 hasher.getAsTags(), coreBlock, buildBlocksFromStreams(compressionHeader.externalCompressors, externalStreamMap));

Not sure what that would gain us. I would have to add another Slice constructor for this case while keeping the current one for Slice.read() to use.

@jmthibault79
Copy link
Contributor Author

Closing this while we work through the spec implications here: samtools/hts-specs#370

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants