Fix for issue #432 in Picard Tools. CreateSequenceDictionary stalls indefinitely with large genomes #744

Merged
merged 8 commits into from Nov 24, 2016

Conversation

Projects
None yet
3 participants
Contributor

SilinPavel commented Nov 17, 2016 edited

We introduce two pull requests to Htsjdk and Picard Tools(broadinstitute/picard#687) to resolve the issue broadinstitute/picard#432 .

Original version of CreateSequenceDictionary creates and stores all reference sequences entirely in memory in a SAMHeader object and then writes it to a file.

In order to check sequence's names for uniqueness, they are added to a HashSet. This might lead to "Out of memory" exceptions for files with a large number of sequences .

Changes in Htsjdk:

In our implementation we introduce a new "on fly" codec SAMSequenceDictionaryCodec, which encodes each sequence and directly writes it to the Dictionary file. This allows us not to store sequences in memory.

Changes in Picard Tools:

CreateSequenceDictionary now uses new SAMSequenceDictionaryCodec to write dictionary file.

SortingCollection is used instead of a HashSet to verify the sequences names uniqueness , this allows us not to keep all the names in memory since identical names will be stored in the collection one after another and we can find duplicates in one pass.

Please pay attention, that now CreateSequenceDictionary doesn't write HD line, since we think that it is not required, SequenceDictionary must contain only SQ line, not HD. If our assumption is wrong, correct us, please.

So, if we look at the plot "GC Times" from JMC: FlightRecorder, we will see that standard version of CreateSequenceDictionary have a big trouble with a large amount of records (several millions):

The plot "GC Times" for a Standard version:

selection_041
selection_043

The plot "GC Times" for a fixed version:
selection_040
selection_042

How we can see, now CreateSequenceDictionary processes a big reference without troubles, for example time of execution for file from issue broadinstitute/picard#432 is:

real 762.83
user 626.94
sys 67.02

Full GC reports (jmc flight recorder):
JMC FR logs.zip

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing

SilinPavel referenced this pull request in broadinstitute/picard Nov 17, 2016

Merged

Fix for issue #432 CreateSequenceDictionary stalls indefinitely with large genomes #687

2 of 3 tasks complete

Coverage Status

Coverage decreased (-0.03%) to 69.81% when pulling 370362a on EpamLifeSciencesTeam:epam-ls_CreateSequenceDictionary_fix into 70783fc on samtools:master.

Coverage Status

Coverage increased (+0.006%) to 69.845% when pulling c3fe849 on EpamLifeSciencesTeam:epam-ls_CreateSequenceDictionary_fix into 70783fc on samtools:master.

Coverage Status

Coverage increased (+0.01%) to 69.852% when pulling 37dd981 on EpamLifeSciencesTeam:epam-ls_CreateSequenceDictionary_fix into 70783fc on samtools:master.

Coverage Status

Coverage increased (+0.006%) to 69.845% when pulling 37dd981 on EpamLifeSciencesTeam:epam-ls_CreateSequenceDictionary_fix into 70783fc on samtools:master.

Contributor

yfarjoun commented Nov 19, 2016

  1. This is great performance! Thanks for fixing this.
  2. The sequence dictionary should indeed contain the @hd line. please reinstate it.
@yfarjoun

Thanks for this addition. a few small requests.

+import java.io.Writer;
+
+/**
+ * "On fly" codec SAMSequenceDictionaryCodec.
@yfarjoun

yfarjoun Nov 20, 2016

Contributor

"on the fly"

+ * Write {@link SAMSequenceRecord}.
+ * @param sequenceRecord object to be converted to text.
+ */
+ public void encodeSQLine(final SAMSequenceRecord sequenceRecord) {
@yfarjoun

yfarjoun Nov 20, 2016

Contributor

change name to encodeSequenceRecord

+ */
+ public void encode(final SAMSequenceDictionary dictionary) {
+ dictionary.getSequences().forEach(this::encodeSQLine);
+
@yfarjoun

yfarjoun Nov 20, 2016

Contributor

frivolous newline

+ reader = new StringLineReader(writer.toString());
+ SAMSequenceDictionary actual = codec.decode(reader, null);
+ assertEquals(actual, dictionary);
+ }finally {
@yfarjoun

yfarjoun Nov 20, 2016

Contributor

space after }

+ reader.close();
+ }
+ }
+
@yfarjoun

yfarjoun Nov 20, 2016

Contributor

move this newline to the end of the file so that github doesn't have an ugly arrow

@@ -437,8 +441,8 @@ private void writeHDLine(final boolean keepExistingVersionNumber) {
println(StringUtil.join(FIELD_SEPARATOR, fields));
}
- private void writeSQLine(final SAMSequenceRecord sequenceRecord) {
- final int numAttributes =sequenceRecord.getAttributes() != null ? sequenceRecord.getAttributes().size() : 0;
+ void writeSQLine(final SAMSequenceRecord sequenceRecord) {
@yfarjoun

yfarjoun Nov 20, 2016

Contributor

hmmm. I'm not so happy about the access change of this internal function. could you perhaps wrap it with a public function that also checks that the writer is not null and is more cognizant of it's package-protected status? For example, a java-doc would be good that would inform the assumptions in place (like the existence of a writer..)

SilinPavel added some commits Nov 17, 2016

@SilinPavel SilinPavel fix issue 432 d0bd9eb
@SilinPavel SilinPavel add test for SAMSDC f549271
@SilinPavel SilinPavel add java docs d7c12bb
@SilinPavel SilinPavel correction to comments: new package-local methods & java doc
64a9c07
@SilinPavel SilinPavel fix java docs, reorder methods
e63442d

Coverage Status

Coverage increased (+0.03%) to 69.964% when pulling 64a9c07 on EpamLifeSciencesTeam:epam-ls_CreateSequenceDictionary_fix into 398c0ee on samtools:master.

Contributor

SilinPavel commented Nov 20, 2016

@yfarjoun I have done what you asked for. Take a look, please!

Coverage Status

Coverage increased (+0.02%) to 69.958% when pulling e63442d on EpamLifeSciencesTeam:epam-ls_CreateSequenceDictionary_fix into 398c0ee on samtools:master.

+ public void testEncodeDecode() throws Exception {
+ LineReader reader = null;
+ try {
+ codec.encodeHeaderLine(false);
@yfarjoun

yfarjoun Nov 20, 2016

Contributor

I think that this line is throwing a NPE since mFileHeader is null...how does encodeHeaderLine when mFileHeader hasn't been initialized?

@SilinPavel

SilinPavel Nov 23, 2016

Contributor

Header is set to SAMTextHeaderCodec in SAMSequenceDictionaryCodec constructor now. It fixes the problem with NPE.

+ */
+public class SAMSequenceDictionaryCodecTest {
+
+ private static final Random RANDOM = new Random();
@yfarjoun

yfarjoun Nov 20, 2016

Contributor

change to lowercase random

+ * "On the fly" codec SAMSequenceDictionaryCodec.
+ * Encodes each sequence and directly writes it to the Dictionary file.
+ *
+ * @author Pavel_Silin@epam.com, EPAM Systems, Inc. <www.epam.com>
@yfarjoun

yfarjoun Nov 20, 2016

Contributor

Could you include a short paragraph on how to use this class? something that would describe the various steps needed (open a writer, set the writer, encode the header, encode each sequence record, close the writer). Also include whatever fix you put in-place for the possible NPE of mFileHeader.

@SilinPavel SilinPavel add java docs and rename constant field in test
97fbe63

Coverage Status

Coverage increased (+0.02%) to 70.026% when pulling 97fbe63 on EpamLifeSciencesTeam:epam-ls_CreateSequenceDictionary_fix into de27f18 on samtools:master.

Coverage Status

Coverage increased (+0.02%) to 70.026% when pulling 97fbe63 on EpamLifeSciencesTeam:epam-ls_CreateSequenceDictionary_fix into de27f18 on samtools:master.

Contributor

SilinPavel commented Nov 23, 2016

@yfarjoun We have done what you asked for. Take a look, please!

Coverage Status

Coverage increased (+0.02%) to 70.027% when pulling 97fbe63 on EpamLifeSciencesTeam:epam-ls_CreateSequenceDictionary_fix into de27f18 on samtools:master.

+ * SAMSequenceDictionaryCodec codec = new SAMSequenceDictionaryCodec(writer);
+ *
+ * //we have complete SAMSequenceDictionary, so just encode it.
+ * codec.encode(dict);
@yfarjoun

yfarjoun Nov 23, 2016

Contributor

when you encode a SAMSequenceDictionary do you need to encode the header line yourself? your test below suggest that you do, but the example here doesn't.

@SilinPavel

SilinPavel Nov 23, 2016

Contributor

Method encode from SAMSequenceDictionaryCodec encodes the header, so we don't need to do it ourself:

public void encode(final SAMSequenceDictionary dictionary) { 
        encodeHeaderLine(false);
        dictionary.getSequences().forEach(this::encodeSequenceRecord);
}

As you have noticed, in the test I encode @hd line 2 times, I will fix it, or maybe it would be better to delete encodeHeaderLine(false); from encode() method and do it manually every time, when we use this class, what is your opinion?

@yfarjoun

yfarjoun Nov 23, 2016

Contributor

I would not break backwards compatibility. I think that encode(SAMSequenceDictionary) should encode the header line and I would change that. simply do the right thing in the test and make sure that the usage comment up top is correct. also, could you please, in your test verify that the encoded file reads correctly?

@SilinPavel SilinPavel encode, java doc, tests
ec354b3

Coverage Status

Coverage increased (+0.03%) to 70.033% when pulling ec354b3 on EpamLifeSciencesTeam:epam-ls_CreateSequenceDictionary_fix into de27f18 on samtools:master.

Contributor

yfarjoun commented Nov 23, 2016

Sorry, @SilinPavel I misunderstood you. encode on a SAMSequenceDictionary should behave like SAMTextHeaderCodec::encode which does emit the @hd line on its own. Please change the behavior and the example in the comment back to what you (rightly) did before.

@SilinPavel SilinPavel rollback: encode HD line in the encode method
7536f6f

Coverage Status

Coverage increased (+0.02%) to 70.027% when pulling 7536f6f on EpamLifeSciencesTeam:epam-ls_CreateSequenceDictionary_fix into de27f18 on samtools:master.

Contributor

SilinPavel commented Nov 23, 2016

@yfarjoun I did the rollback to the previous version. Could you check please?

Contributor

yfarjoun commented Nov 24, 2016

great! thanks and sorry for the back and forth. 👍

@yfarjoun yfarjoun merged commit 6e4e875 into samtools:master Nov 24, 2016

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.02%) to 70.027%
Details
Contributor

SilinPavel commented Nov 24, 2016

@yfarjoun Thanks!

magicDGS referenced this pull request in broadinstitute/gatk Nov 24, 2016

Closed

Port fix for CreateSequenceDictionary with large genomes #2284

@ronlevine ronlevine added a commit that referenced this pull request Nov 27, 2016

@SilinPavel @ronlevine SilinPavel + ronlevine Fix for issue #432 in Picard Tools. CreateSequenceDictionary stalls i…
…ndefinitely with large genomes (#744)

* preparation for fix issue  broadinstitute/picard#432
* added a codec for SAMSequenceDictionary
* added an option to encode each Sequence individually, rather than having to old the entire Dictionary in memory (for cases were there are many sequences, the source of trouble).
* add tests
e1f49cb

@ronlevine ronlevine added a commit that referenced this pull request Nov 27, 2016

@SilinPavel @ronlevine SilinPavel + ronlevine Fix for issue #432 in Picard Tools. CreateSequenceDictionary stalls i…
…ndefinitely with large genomes (#744)

* preparation for fix issue  broadinstitute/picard#432
* added a codec for SAMSequenceDictionary
* added an option to encode each Sequence individually, rather than having to old the entire Dictionary in memory (for cases were there are many sequences, the source of trouble).
* add tests
e69ad8a

@ronlevine ronlevine added a commit that referenced this pull request Nov 27, 2016

@SilinPavel @ronlevine SilinPavel + ronlevine Fix for issue #432 in Picard Tools. CreateSequenceDictionary stalls i…
…ndefinitely with large genomes (#744)

* preparation for fix issue  broadinstitute/picard#432
* added a codec for SAMSequenceDictionary
* added an option to encode each Sequence individually, rather than having to old the entire Dictionary in memory (for cases were there are many sequences, the source of trouble).
* add tests
1879f01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment