Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
Fix for issue #432 in Picard Tools. CreateSequenceDictionary stalls indefinitely with large genomes #744
Conversation
SilinPavel
referenced
this pull request
in broadinstitute/picard
Nov 17, 2016
Merged
Fix for issue #432 CreateSequenceDictionary stalls indefinitely with large genomes #687
coveralls
commented
Nov 17, 2016
coveralls
commented
Nov 18, 2016
coveralls
commented
Nov 18, 2016
coveralls
commented
Nov 18, 2016
|
| +import java.io.Writer; | ||
| + | ||
| +/** | ||
| + * "On fly" codec SAMSequenceDictionaryCodec. |
| + * Write {@link SAMSequenceRecord}. | ||
| + * @param sequenceRecord object to be converted to text. | ||
| + */ | ||
| + public void encodeSQLine(final SAMSequenceRecord sequenceRecord) { |
| + */ | ||
| + public void encode(final SAMSequenceDictionary dictionary) { | ||
| + dictionary.getSequences().forEach(this::encodeSQLine); | ||
| + |
| + reader = new StringLineReader(writer.toString()); | ||
| + SAMSequenceDictionary actual = codec.decode(reader, null); | ||
| + assertEquals(actual, dictionary); | ||
| + }finally { |
| + reader.close(); | ||
| + } | ||
| + } | ||
| + |
yfarjoun
Nov 20, 2016
Contributor
move this newline to the end of the file so that github doesn't have an ugly arrow
| @@ -437,8 +441,8 @@ private void writeHDLine(final boolean keepExistingVersionNumber) { | ||
| println(StringUtil.join(FIELD_SEPARATOR, fields)); | ||
| } | ||
| - private void writeSQLine(final SAMSequenceRecord sequenceRecord) { | ||
| - final int numAttributes =sequenceRecord.getAttributes() != null ? sequenceRecord.getAttributes().size() : 0; | ||
| + void writeSQLine(final SAMSequenceRecord sequenceRecord) { |
yfarjoun
Nov 20, 2016
Contributor
hmmm. I'm not so happy about the access change of this internal function. could you perhaps wrap it with a public function that also checks that the writer is not null and is more cognizant of it's package-protected status? For example, a java-doc would be good that would inform the assumptions in place (like the existence of a writer..)
SilinPavel
added some commits
Nov 17, 2016
coveralls
commented
Nov 20, 2016
|
@yfarjoun I have done what you asked for. Take a look, please! |
coveralls
commented
Nov 20, 2016
| + public void testEncodeDecode() throws Exception { | ||
| + LineReader reader = null; | ||
| + try { | ||
| + codec.encodeHeaderLine(false); |
yfarjoun
Nov 20, 2016
Contributor
I think that this line is throwing a NPE since mFileHeader is null...how does encodeHeaderLine when mFileHeader hasn't been initialized?
SilinPavel
Nov 23, 2016
Contributor
Header is set to SAMTextHeaderCodec in SAMSequenceDictionaryCodec constructor now. It fixes the problem with NPE.
| + */ | ||
| +public class SAMSequenceDictionaryCodecTest { | ||
| + | ||
| + private static final Random RANDOM = new Random(); |
| + * "On the fly" codec SAMSequenceDictionaryCodec. | ||
| + * Encodes each sequence and directly writes it to the Dictionary file. | ||
| + * | ||
| + * @author Pavel_Silin@epam.com, EPAM Systems, Inc. <www.epam.com> |
yfarjoun
Nov 20, 2016
Contributor
Could you include a short paragraph on how to use this class? something that would describe the various steps needed (open a writer, set the writer, encode the header, encode each sequence record, close the writer). Also include whatever fix you put in-place for the possible NPE of mFileHeader.
coveralls
commented
Nov 23, 2016
coveralls
commented
Nov 23, 2016
|
@yfarjoun We have done what you asked for. Take a look, please! |
coveralls
commented
Nov 23, 2016
| + * SAMSequenceDictionaryCodec codec = new SAMSequenceDictionaryCodec(writer); | ||
| + * | ||
| + * //we have complete SAMSequenceDictionary, so just encode it. | ||
| + * codec.encode(dict); |
yfarjoun
Nov 23, 2016
Contributor
when you encode a SAMSequenceDictionary do you need to encode the header line yourself? your test below suggest that you do, but the example here doesn't.
SilinPavel
Nov 23, 2016
Contributor
Method encode from SAMSequenceDictionaryCodec encodes the header, so we don't need to do it ourself:
public void encode(final SAMSequenceDictionary dictionary) {
encodeHeaderLine(false);
dictionary.getSequences().forEach(this::encodeSequenceRecord);
}
As you have noticed, in the test I encode @hd line 2 times, I will fix it, or maybe it would be better to delete encodeHeaderLine(false); from encode() method and do it manually every time, when we use this class, what is your opinion?
yfarjoun
Nov 23, 2016
Contributor
I would not break backwards compatibility. I think that encode(SAMSequenceDictionary) should encode the header line and I would change that. simply do the right thing in the test and make sure that the usage comment up top is correct. also, could you please, in your test verify that the encoded file reads correctly?
coveralls
commented
Nov 23, 2016
|
Sorry, @SilinPavel I misunderstood you. encode on a SAMSequenceDictionary should behave like SAMTextHeaderCodec::encode which does emit the @hd line on its own. Please change the behavior and the example in the comment back to what you (rightly) did before. |
coveralls
commented
Nov 23, 2016
|
@yfarjoun I did the rollback to the previous version. Could you check please? |
|
great! thanks and sorry for the back and forth. |
yfarjoun
merged commit 6e4e875
into
samtools:master
Nov 24, 2016
|
@yfarjoun Thanks! |
magicDGS
referenced
this pull request
in broadinstitute/gatk
Nov 24, 2016
Closed
Port fix for CreateSequenceDictionary with large genomes #2284
ronlevine
added a commit
that referenced
this pull request
Nov 27, 2016
|
|
SilinPavel + ronlevine |
e1f49cb
|
ronlevine
added a commit
that referenced
this pull request
Nov 27, 2016
|
|
SilinPavel + ronlevine |
e69ad8a
|
ronlevine
added a commit
that referenced
this pull request
Nov 27, 2016
|
|
SilinPavel + ronlevine |
1879f01
|
SilinPavel commentedNov 17, 2016
•
edited
We introduce two pull requests to Htsjdk and Picard Tools(broadinstitute/picard#687) to resolve the issue broadinstitute/picard#432 .
Original version of CreateSequenceDictionary creates and stores all reference sequences entirely in memory in a SAMHeader object and then writes it to a file.
In order to check sequence's names for uniqueness, they are added to a HashSet. This might lead to "Out of memory" exceptions for files with a large number of sequences .
Changes in Htsjdk:
In our implementation we introduce a new "on fly" codec SAMSequenceDictionaryCodec, which encodes each sequence and directly writes it to the Dictionary file. This allows us not to store sequences in memory.
Changes in Picard Tools:
CreateSequenceDictionary now uses new SAMSequenceDictionaryCodec to write dictionary file.
SortingCollection is used instead of a HashSet to verify the sequences names uniqueness , this allows us not to keep all the names in memory since identical names will be stored in the collection one after another and we can find duplicates in one pass.
Please pay attention, that now CreateSequenceDictionary doesn't write HD line, since we think that it is not required, SequenceDictionary must contain only SQ line, not HD. If our assumption is wrong, correct us, please.
So, if we look at the plot "GC Times" from JMC: FlightRecorder, we will see that standard version of CreateSequenceDictionary have a big trouble with a large amount of records (several millions):
The plot "GC Times" for a Standard version:
The plot "GC Times" for a fixed version:


How we can see, now CreateSequenceDictionary processes a big reference without troubles, for example time of execution for file from issue broadinstitute/picard#432 is:
Full GC reports (jmc flight recorder):
JMC FR logs.zip
Checklist