-
Notifications
You must be signed in to change notification settings - Fork 243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for partial VCF writing #904
Conversation
Codecov Report
@@ Coverage Diff @@
## master #904 +/- ##
===============================================
+ Coverage 65.048% 66.012% +0.964%
- Complexity 7256 7471 +215
===============================================
Files 528 529 +1
Lines 31938 31973 +35
Branches 5454 5462 +8
===============================================
+ Hits 20775 21106 +331
+ Misses 9019 8725 -294
+ Partials 2144 2142 -2
|
@@ -221,6 +221,11 @@ public void close() { | |||
super.close(); | |||
} | |||
|
|||
@Override | |||
public void setVcfHeader(VCFHeader header) { | |||
//no-op |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this a "no-op"? This doesn't seem right. The add()
method in BCF2Writer
will blow up without a header, since it calls vc = vc.fullyDecode(header, false);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented.
@@ -49,4 +49,6 @@ | |||
public boolean checkError(); | |||
|
|||
public void add(VariantContext vc); | |||
|
|||
void setVcfHeader(VCFHeader header); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add javadoc for the new interface method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
@Override | ||
public void setVcfHeader(VCFHeader header) { | ||
this.underlyingWriter.setVcfHeader(header); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
untested according to code cov
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the test case
@@ -132,6 +132,11 @@ public synchronized void add(VariantContext vc) { | |||
emitSafeRecords(); | |||
} | |||
|
|||
@Override | |||
public void setVcfHeader(VCFHeader header) { | |||
innerWriter.setVcfHeader(header); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also untested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but don't worry about this one since SortingVariantContextWriterBase is basically ununsed.
@@ -26,10 +26,13 @@ | |||
package htsjdk.variant.variantcontext.writer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need tests to prove that the setVcfHeader()
works for all concrete implementations of VariantContextWriter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should also test interaction between setVcfHeader()
and writeHeader()
(ie., test calling one and then the other for the various implementations).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@Override | ||
public void setVcfHeader(VCFHeader header) { | ||
this.mHeader = doNotWriteGenotypes ? new VCFHeader(header.getMetaDataInSortedOrder()) : header; | ||
this.vcfEncoder = new VCFEncoder(this.mHeader, this.allowMissingFieldsInHeader, this.writeFullFormatField); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this duplicates code in the existing writeHeader()
method, would it be possible to have writeHeader()
call into setVcfHeader()
, or call into a shared helper method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I think we should guard against the problematic case where both setHeader and writeHeader are called, in any order, using different headers. The simplest way would be to just reject either call if a header has already been established. That would ensure consistency. @amilamanoj would that work for your use case ? If so, we should include that contract in the javadoc here at the interface level, and include tests that verify that for each implementation. If not we may need to resort to something more complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@droazen I've changed it so that writeHeader()
calls into setVcfHeader()
.
@cmnbroad Yes that should work for our case, since we don't need to change the header once it's set. When rejecting, do we still allow repeated calls of writeHeader()
and setVcfHeader()
(individually, without calling them both) or just simply check if header
has been set once before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amilamanoj I originally thought the right thing would be for both methods to reject if the header is already set, but changing writeHeader to work that way could break existing code. So perhaps we should just make setVcfHeader enforce that, and leave writeHeader's current promiscuous behavior as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amilamanoj Actually, on giving this more thought, any code that currently calls writeHeader more than once is already on a bad path, so I would vote for rejecting in either case if a header exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmnbroad Thanks for the feedback. I've also discussed this with @cyenyxe. We also think changing header should be prevented if the header (or just a part of body, in our use case) is already written to the output stream, since there's only one output stream per writer instance. But setting the header many times should not matter as long as nothing is written to the output stream yet. I've added some checks along with tests, please take a look and see if it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Test(dataProvider = "vcfExtensionsDataProvider") | ||
public void testWriteAndReadVCFBody(final String extension) throws IOException { | ||
final File fakeVCFFile = File.createTempFile("testWriteAndReadVCFBody.", extension); | ||
fakeVCFFile.deleteOnExit(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use method VariantBaseTest.createTempFile()
here, which does this for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
writer.setVcfHeader(header); | ||
writer.add(createVC(header)); | ||
writer.add(createVC(header)); | ||
writer.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a try-with-resources
block rather than an explicit close()
call, to ensure that the writer gets closed even if there's an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
writer.close(); | ||
|
||
final VCFCodec codec = new VCFCodec(); | ||
codec.setVCFHeader(header, VCFHeaderVersion.VCF4_0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 4.0? The writer above will write a 4.2 vcf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used 4.2.
if (".vcf.gz".equals(extension)) { | ||
iterator = new AsciiLineReaderIterator(new AsciiLineReader(new BlockCompressedInputStream(fakeVCFFile))); | ||
} else { | ||
iterator = new AsciiLineReaderIterator(new AsciiLineReader(new FileInputStream(fakeVCFFile))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All input stream creations should be wrapped in a try-with-resources
to ensure the streams get closed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is it possible to use AbstractFeatureReader.getFeatureReader()
here to get a reader on the vcf file? It will handle the block-gzipped case for you automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AbstractFeatureReader.getFeatureReader()
cannot be used here because it includes a check for header is the vcf file. Since we are reading a data-block, calling this method throws htsjdk.tribble.TribbleException$MalformedFeatureFile
.
Used the method: VariantBaseTest.createTempFile(). Used try-with-resources. Updated VCF version to 4.2.
…ing BCF body without header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amilamanoj Thanks for doing this - it actually raises a few interesting issues. I did a first review pass and added a few requests, along with a couple questions. I expect it will take another round or two of review to get this finished. It would be helpful for the next round if you could squash any subsequent commits down to a single, incremental commit with just the changes based on the code review. Thanks.
@Override | ||
public void setVcfHeader(VCFHeader header) { | ||
this.mHeader = doNotWriteGenotypes ? new VCFHeader(header.getMetaDataInSortedOrder()) : header; | ||
this.vcfEncoder = new VCFEncoder(this.mHeader, this.allowMissingFieldsInHeader, this.writeFullFormatField); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I think we should guard against the problematic case where both setHeader and writeHeader are called, in any order, using different headers. The simplest way would be to just reject either call if a header has already been established. That would ensure consistency. @amilamanoj would that work for your use case ? If so, we should include that contract in the javadoc here at the interface level, and include tests that verify that for each implementation. If not we may need to resort to something more complicated.
@@ -0,0 +1,253 @@ | |||
/* | |||
* Copyright (c) 2012 The Broad Institute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure we need the copyright notice, but it should probably be 2017 ?
/** | ||
* @author amila | ||
* <p/> | ||
* Class VCFWriterUnitTest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't match the class name.
* <p/> | ||
* Class VCFWriterUnitTest | ||
* <p/> | ||
* This class tests out the ability of the VCF writer to correctly write VCF files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for testing the BCF writer/BCF files, right ?
private Set<String> additionalColumns; | ||
|
||
private File tempDir; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can these variables be scoped locally to each method that uses them ?
@@ -0,0 +1,169 @@ | |||
/* | |||
* Copyright (c) 2012 The Broad Institute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*/ | ||
public static VCFHeader createFakeHeader(final Set<VCFHeaderLine> metaData, final Set<String> additionalColumns, | ||
final SAMSequenceDictionary sequenceDict) { | ||
metaData.add(new VCFHeaderLine(VCFHeaderVersion.VCF4_0.getFormatString(), VCFHeaderVersion.VCF4_0.getVersionString())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be removed, or at the least use VCF4_2.
f.delete(); | ||
} | ||
tempDir.delete(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above, use deleteOnExit.
int counter = 0; | ||
while (iterator.hasNext()) { | ||
VariantContext context = codec.decode(iterator.next()); | ||
if (context != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above. Can this be removed ?
int counter = 0; | ||
while (iterator.hasNext()) { | ||
VariantContext context = codec.decode(iterator.next()); | ||
if (context != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what this is guarding against ? If hasNet returns true I think next and decode should succeed. Can this be removed ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the null check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amilamanoj I think we're getting closer. There are still a couple of logic issues/questions, and a few more minor requests. For some of the comments (marking args final, making sure all public methods have javadoc, and breaking up long tests into individual test methods), please make those changes wherever applicable - I didn't comment on every individual instance). Thanks.
|
||
// is at least one part of body written to the output stream? | ||
private boolean isBodyWritten; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amilamanoj Is there a reason that the writers need to keep two state variables to track what is essentially a single binary state (the header is frozen) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought of providing two distinct error messages based on which one (header/body) is written (In our use case, for example, we don't write the header). But I can merge the two variables if it's not important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's about using an int
variable for different error messages? If different to -1, there is an error, and that can be switch to different messages...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@magicDGS I didn't understand how exactly you meant using an int
variable. For now I merged the two variables. We can change it later if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind @amilamanoj, it was just a proposal if you really want to be more verbose about the error and what happened. Just FYI, this was what I meant:
// this integer indicates the status for writing the header down
// -1 indicates no problem to write the header
// 0 means header was already written
// 1 means body was already written
// otherwise, this is an unexpected error
private int shouldWriteHeader = -1;
@Override
public void setVCFHeader(final VCFHeader header) {
if (shouldWriteHeader != -1) {
final String msg;
switch(shouldWriteHeader) {
case 0: msg = "header already written"; break;
case 1: msg = "body already written"; break;
default: msg = "unexpected error code (" + shouldWriteHeader + ")";
}
throw new IllegalStateException("The header cannot be modified after the header or variants have been written to the output stream. Error was: " + msg);
}
....
}
for ( int i = 0; i < dict.size(); i++ ) { | ||
stringDictionaryMap.put(dict.get(i), i); | ||
if (isHeaderWritten) { | ||
throw new IllegalStateException("Header is already written"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code block seems redundant - the next statement is a call to setVcfheader, which handles this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will remove this.
|
||
// setup the field encodings | ||
fieldManager.setup(header, encoder, stringDictionaryMap); | ||
setVcfHeader(header); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, even though htsjdk is not completely consistent about this, can we name these methods "setVCFHeader" instead of "setVcfHeader"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
@@ -221,6 +206,43 @@ public void close() { | |||
super.close(); | |||
} | |||
|
|||
@Override | |||
public void setVcfHeader(VCFHeader header) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the param should be marked final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
// make sure the header is sorted correctly | ||
header = new VCFHeader(header.getMetaDataInSortedOrder(), header.getGenotypeSamples()); | ||
|
||
this.header = doNotWriteGenotypes ? new VCFHeader(header.getMetaDataInSortedOrder()) : header; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should need to make a copy of the header twice here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean something like this? this.header = doNotWriteGenotypes ? new VCFHeader(header.getMetaDataInSortedOrder()) : new VCFHeader(header.getMetaDataInSortedOrder(), header.getGenotypeSamples());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amilamanoj Yes, thats what I meant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmnbroad Fixed.
@@ -53,4 +53,10 @@ public void writeHeader(final VCFHeader header) { | |||
public boolean checkError() { | |||
return false; | |||
} | |||
|
|||
@Override | |||
public void setVcfHeader(VCFHeader header) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing code isn't consistent about this, but we generally mark arguments as final in new code. Thanks.
|
||
try { | ||
// write out the header into a byte stream, get its length, and write everything to the file | ||
final ByteArrayOutputStream capture = new ByteArrayOutputStream(); | ||
final OutputStreamWriter writer = new OutputStreamWriter(capture); | ||
this.header = VCFWriter.writeHeader(header, writer, doNotWriteGenotypes, VCFWriter.getVersionLine(), "BCF2 stream"); | ||
this.header = VCFWriter.writeHeader(header, writer, VCFWriter.getVersionLine(), "BCF2 stream"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this code is using the wrong header now. Previous to this PR, this method updated the "header" argument variable (questionable practice) with the updated header, and then passed that to the writer. With these changes, it call setVcfHeader, which updates the header member variable with the updated header, but then passes the (now stale) input argument "header" to the writer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to use the member variable.
* <p/> | ||
* Class BCF2WriterUnitTest | ||
* <p/> | ||
* This class tests out the ability of the VCF writer to correctly write VCF files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment is still wrong - this is testing that the BCF writer works correctly.
writer2.writeHeader(header); | ||
Assert.fail("Should not allow writing header twice"); | ||
} catch (Exception e) { | ||
Assert.assertTrue(e instanceof IllegalStateException); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test (prevent writing the header twice) should be factored out into a separate test method that declares IllegalStateException as an expectedException in the test annotation (see example here). It eliminates the need to write a try/catch block and cleanly separates the individual tests from each other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Assert.fail("Should not allow changing header if header is already written"); | ||
} catch (Exception e) { | ||
Assert.assertTrue(e instanceof IllegalStateException); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here, and in the other test files. Each individual test should be a separate method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
…t cases, renamed setVCFHeader() method for consistency,
this.mHeader = writeHeader(header, writer, doNotWriteGenotypes, getVersionLine(), getStreamName()); | ||
this.vcfEncoder = new VCFEncoder(this.mHeader, this.allowMissingFieldsInHeader, this.writeFullFormatField); | ||
setVCFHeader(header); | ||
writeHeader(header, writer, getVersionLine(), getStreamName()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amilamanoj This one also is using the stale (local) header. It should use the member variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
Set<VCFHeaderLine> metaData = new HashSet<VCFHeaderLine>(); | ||
Set<String> additionalColumns = new HashSet<String>(); | ||
final SAMSequenceDictionary sequenceDict = createArtificialSequenceDictionary(); | ||
final VCFHeader header = createFakeHeader(metaData, additionalColumns, sequenceDict); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like all of the calls to createFakeHeader pass in the same empty metadata and additional columns lists, and fake seq dictionary, so this can be much simplified by just removing those and returning the fake header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
*/ | ||
private VariantContext createVC(final VCFHeader header) { | ||
|
||
return createVCGeneral(header, "1", 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method should just be collapsed into createVCGeneral (it seems like all calls are identical).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
.build(); | ||
writer2.writeHeader(header); | ||
writer2.setVCFHeader(header); | ||
writer2.add(createVC(header)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This call is unnecessary for the test - the exception gets thrown in the previous stmt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
public void testChangeHeaderAfterWritingHeader() { | ||
final File bcfOutputFile = VariantBaseTest.createTempFile("testWriteAndReadVCF.", ".bcf"); | ||
|
||
Tribble.indexFile(bcfOutputFile).deleteOnExit(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I suggested creating a temporary directory in the earlier review pass was so this kind of call would be unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, JUnit supports that via rules: http://junit.org/junit4/javadoc/4.12/org/junit/rules/TemporaryFolder.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
// prevent changing header if it's already written | ||
final VariantContextWriter writer2 = new VariantContextWriterBuilder() | ||
.setOutputFile(bcfOutputFile).setReferenceDictionary(sequenceDict) | ||
.setOptions(EnumSet.of(Options.INDEX_ON_THE_FLY)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think both the sequence dictionary and the INDEX_ON_THE_FLY options are unnecessary for this test, and can be removed. We're just trying to test that we throw if you change the header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed INDEX_ON_THE_FLY option. But sequence dictionary has to be set because INDEX_ON_THE_FLY is a default option anyway, and without setting the sequence dictionary, writer cannot be created (an exception gets thrown).
@amilamanoj I added a couple more comments on the tests (most of which apply to both test files, though I only put comment on one of them), since I think they can be simplified a bit. In general we should only include the minimal code necessary for the specific test. Thanks in advance for your patience in seeing these through. |
@amilamanoj I need to do another review pass on the updated changes. Sorry for the long turnaround time, but we have quite a backlog of PRs. I'll do it as soon as I can; hopefully it will be this week. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amilamanoj There are a couple of minor things in the code itself in this round, but most of these comments are on the test code, which I think can be simplified a lot. Also, I realize I steered you wrong on the deleteOnExit comment last time - my apologies for that (see the comments).
Many of the test comments are just about reducing the test case code to the minimal set of things that are necessary for the specific test at hand; though some are stylistic. In many cases the comments apply in places other than where I mentioned them (i.e., final, try-with-resources), so please apply them wherever possible. I do expect we'll need at least one more round of review for this - thanks for your patience.
fieldManager.setup(header, encoder, stringDictionaryMap); | ||
|
||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We really need to ensure that this PR doesn't introduce any subtle changes that are not essential to the core change. There are still several places here where the stale header that was passed in is being used, which makes it different from the previous implementation (which updated the header and used the updated version after that). Its not clear to me if the differences are significant or not; but thats just all the more reason to make sure we don't change it unless its necessary.
*/ | ||
public static VCFHeader createFakeHeader(final SAMSequenceDictionary sequenceDict) { | ||
Set<VCFHeaderLine> metaData = new HashSet<VCFHeaderLine>(); | ||
Set<String> additionalColumns = new HashSet<String>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These an be final. Also, you don't need to include the types in the constructor calls - you can just use the diamond operator ("<>") and they will be inferred. The should be removed everywhere in this file/PR.
ret.addMetaDataLine(new VCFFormatHeaderLine("BB", 1, VCFHeaderLineType.String, "x")); | ||
ret.addMetaDataLine(new VCFFormatHeaderLine("GQ", 1, VCFHeaderLineType.String, "x")); | ||
ret.setSequenceDictionary(sequenceDict); | ||
return ret; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please name the header variable something like "header" instead of "ret". Also, I don't think all these column/info/format lines are adding anything relevant to this test (?). Maybe just one of each to simplify this, unless I'm missing something.
* | ||
* @return a fake VCF header | ||
*/ | ||
public static VCFHeader createFakeHeader(final SAMSequenceDictionary sequenceDict) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be private. Also, it looks like every single call to this method use the same fake sequenceDictionary. If so, that should all be collapsed to the minimum amount of code possible.
@Test | ||
public void testWriteAndReadBCF() throws IOException { | ||
final File bcfOutputFile = File.createTempFile("testWriteAndReadVCF.", ".bcf", tempDir); | ||
Tribble.indexFile(bcfOutputFile); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My apologies - I think I steered you wrong on this in the last review round. This indexFile() call was previously here so that deleteOnExit could be called on the resulting index file, which we do need to do if we're creating an index, even if the file is created in the temp dir. So the deleteOnExit calls need to be restored for these index files everywhere (my bad).
However, as I mentioned before, we could get rid of this index problem (and the need for a sequence dictionary) altogether in most of these tests by turning off index creation (call ".unsetOption(Options.INDEX_ON_THE_FLY)" on the builder. It would be useful to have one separate test that does create an index, just to make sure we have test coverage for that, but then that would be the only place we have to deal with deleting the index file. Then the rest of these tests could be very minimal.
.build(); | ||
writer2.writeHeader(header); | ||
writer2.setVCFHeader(header); | ||
writer2.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try
try (FileInputStream fis = new FileInputStream(fakeVCFFile)) { | ||
AsciiLineReaderIterator iterator; | ||
|
||
iterator = new AsciiLineReaderIterator(new AsciiLineReader(fis)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please combine lines 96 and 98 into one line.
final VCFCodec codec = new VCFCodec(); | ||
codec.setVCFHeader(header, VCFHeaderVersion.VCF4_2); | ||
|
||
try (FileInputStream fis = new FileInputStream(fakeVCFFile)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final
iterator = new AsciiLineReaderIterator(new AsciiLineReader(bcis)); | ||
} else { | ||
iterator = new AsciiLineReaderIterator(new AsciiLineReader(fis)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ternary operator cold make this all much cleaner and easier to read:
AsciiLineReaderIterator iterator =
new AsciiLineReaderIterator(new AsciiLineReader(".vcf.gz".equals(extension) ? bcis : fis));
|
||
|
||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get rid of all of the blank lines?
@cmnbroad Thanks for the feedback. I've addressed your comments. |
@cmnbroad I'm waiting for any further feedback. |
@amilamanoj I haven't forgotten that this needs another review pass, and will get to it as soon as possible. Realistically, that will not be until next week though. |
This still needs a final review. In addition to enabling the original motivating (htsget) use case, it would also allow us to get rid of this Hadoop-BAM hack. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amilamanoj Thanks for doing this, I have a few nitpicky comments and then this looks good to merge 👍
@@ -49,4 +55,13 @@ | |||
public boolean checkError(); | |||
|
|||
public void add(VariantContext vc); | |||
|
|||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like the javadoc should mention that you should call exactly one of writeHeader()
/ setHeader()
@@ -122,6 +122,10 @@ | |||
private VCFHeader lastVCFHeaderOfUnparsedGenotypes = null; | |||
private boolean canPassOnUnparsedGenotypeDataForLastVCFHeader = false; | |||
|
|||
// is the header or body written to the output stream? | |||
private boolean isWrittenToOutput; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the name outputHasBeenWritten
would be more clear. Can you change this in all places where this variable has been added?
*/ | ||
@Test | ||
public void testWriteAndReadBCF() throws IOException { | ||
final File bcfOutputFile = File.createTempFile("testWriteAndReadVCF.", ".bcf", tempDir); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you mark this for deletion?
*/ | ||
@Test | ||
public void testWriteAndReadBCFWithIndex() throws IOException { | ||
final File bcfOutputFile = File.createTempFile("testWriteAndReadVCF.", ".bcf", tempDir); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same with this
*/ | ||
@Test | ||
public void testWriteAndReadBCFHeaderless() throws IOException { | ||
final File bcfOutputFile = File.createTempFile("testWriteAndReadBCFWithHeader.", ".bcf", tempDir); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same with these
|
||
@Test(expectedExceptions = IllegalStateException.class) | ||
public void testChangeHeaderAfterWritingHeader() throws IOException { | ||
final File bcfOutputFile = File.createTempFile("testWriteAndReadVCF.", ".bcf", tempDir); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
|
||
@Test(expectedExceptions = IllegalStateException.class) | ||
public void testChangeHeaderAfterWritingBody() throws IOException { | ||
final File bcfOutputFile = File.createTempFile("testWriteAndReadVCF.", ".bcf", tempDir); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
|
||
final VCFHeader header = createFakeHeader(); | ||
// prevent writing header twice | ||
try (final VariantContextWriter writer1 = new VariantContextWriterBuilder() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would drop the number from these writers 1, 2, and 3
/** test, using the writer and reader, that we can output and input a VCF body without problems */ | ||
@Test | ||
public void testWriteAndReadAsyncVCFHeaderless() throws IOException { | ||
final File fakeVCFFile = VariantBaseTest.createTempFile("testWriteAndReadAsyncVCFHeaderless.", ".vcf"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, sadly all the test files have to be manually marked as deleteOnExit()
@@ -77,28 +80,21 @@ | |||
@BeforeClass | |||
private void createTemporaryDirectory() { | |||
tempDir = TestUtil.getTempDirectory("VCFWriter", "StaleIndex"); | |||
tempDir.deleteOnExit(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this pattern works. The older deleteTemporaryDirectory
method that was called previously seems more likely to work. I would just mark all of the temporary files as deleteOnExit()
@lbergelson Thanks for the feedback. I've addressed your comments. |
@amilamanoj Thank you for sticking through with this! This is a useful change, I'm not sure how it became so complicated with so many reviewers. |
@tomwhite This is now merged into htjskd and we may want to update hadoop bam to take advantage of it. Although that will probably require a release and a coordinated change with GATK since this breaks our GVCF writer. |
Description
This PR addresses Issue #902.
Added support for setting and writing the header in two different steps for the classes derived from
VariantContextWriter
.Checklist