Yf add tag aware merge samdictionary #689

Merged
merged 4 commits into from Oct 27, 2016

Conversation

Projects
None yet
4 participants
Contributor

yfarjoun commented Aug 21, 2016 edited

Description

This adds a static function that can merge two SamSequenceDictionaries into one. The problem it's solving is that in MergeBamAlignment the reference dictionary and the aligned Bam dictionary, while having the same sequences, may have different tags. In particular, the reference dictionary has lots of interesting tags like MD, SP, UR etc. and the (bwa) dictionary in the aligned bam may contain AH which informs us that the sequence is an Alternative Haplotype.

This PR enables MergeBamAlignment to merge the two dictionaries rather than only use the reference one (as it currently does)

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

yfarjoun added some commits Aug 20, 2016

@yfarjoun yfarjoun Add a mergeDictionary utility to SAMSequenceDictionary to be used by …
…picard in MergeBamAlignment
8041c8b
@yfarjoun yfarjoun tests (found a bug!)
d240b2f

Coverage Status

Coverage increased (+0.09%) to 68.895% when pulling d240b2f on yf_add_tag_Aware_merge_samdictionary into b81d036 on master.

@yfarjoun yfarjoun more reviewer comments now incorporated
ec52360
Contributor

yfarjoun commented Aug 22, 2016

@tfenne This is the htsjdk part we discussed in broadinstitute/picard#627

Can you take a look?

Coverage Status

Coverage increased (+0.08%) to 68.889% when pulling ec52360 on yf_add_tag_Aware_merge_samdictionary into b81d036 on master.

tfenne was assigned by yfarjoun Aug 28, 2016

Contributor

yfarjoun commented Sep 15, 2016

friendly ping. @tfenne.

+ * Requires that dictionaries have the same SAMSequence records in the same order.
+ * For each sequenceIndex, the union of the tags from both sequences will be added to the new sequence, mismatching
+ * values (for tags that are in both) will generate a warning, and the value from dict1 will be used.
+ * For tags that are in tagsToEquate. an unequal value will generate an error (an IllegalArgumentException will
@tfenne

tfenne Sep 20, 2016

Owner

Extra period after tagsToEquate

@tfenne

tfenne Sep 20, 2016

Owner

Also it's called tagsToMatch

+ * @param tagsToMatch list of tags that must be equal if present in both sequence. Typical values will be MD, and LN
+ * @return dictionary consisting of the same sequences as the two inputs with the merged values of tags.
+ */
+ static public SAMSequenceDictionary mergeDictionaries(final SAMSequenceDictionary dict1,
@tfenne

tfenne Sep 20, 2016

Owner

Function name should be more descriptive given all the different ways in which one could mergeDictionaries. Perhaps mergeTags instead?

+ final SAMSequenceDictionary dict2,
+ final List<String> tagsToMatch) {
+
+ if(dict1.getSequences().size()!=dict2.getSequences().size()) {
@tfenne

tfenne Sep 20, 2016

Owner

if( -> if (

@tfenne

tfenne Sep 20, 2016

Owner

Spaces around != please.

@tfenne

tfenne Sep 20, 2016

Owner

If you're going to check this, why not go all the way and check that the two dictionaries contain the exact same set of names up front?

@yfarjoun

yfarjoun Sep 22, 2016

Contributor

what's the value of doing that upfront?

@tfenne

tfenne Sep 22, 2016

Owner

I think it's much clearer to write methods where possible that:
a) validate their inputs up front
b) proceed to implement the intended function assuming/knowing they have valid inputs

@tfenne

Lots of comments @yfarjoun. Most are minor but I think it would be good to:
a) Do all the validation up front
b) Not continually do1dict.getSequence(name)...
c) Always require LN and MD to match if they are [logically] present on both records

+ */
+ static public SAMSequenceDictionary mergeDictionaries(final SAMSequenceDictionary dict1,
+ final SAMSequenceDictionary dict2,
+ final List<String> tagsToMatch) {
@tfenne

tfenne Sep 20, 2016

Owner

Circling back, I think this parameter should probably go away. Length (if present) and MD5 (if present) should be required to match in all cases, since by definition they should match. Other tags, I'm not sure it matters if they match.

@yfarjoun

yfarjoun Sep 22, 2016

Contributor

I agree about Length and MD5.

Other tags there are two options when they do not match: take the first, or throw an exception.

My idea is that if the tags are in the list, an exception will be thrown, otherwise the value from the first dictionary will be taken. Sounds good?

@tfenne

tfenne Sep 22, 2016

Owner

I guess my thought is that the set of tags on SQ lines is known, so we know which are ok to differ vs. not, and there's no need to supply a list of tags. I don't care that much though.

+ final SAMSequenceDictionary finalDict = new SAMSequenceDictionary();
+ for (final SAMSequenceRecord sequence : dict1.getSequences()) {
+ final SAMSequenceRecord mergedSequence = new SAMSequenceRecord(sequence.getSequenceName(),
+ UNKNOWN_SEQUENCE_LENGTH);
@tfenne

tfenne Sep 20, 2016

Owner

SAMSequenceRecord already has a constructor that takes just the name - no need to define UNKNOWN_SEQUENCE_LENGTH and do this.

@yfarjoun

yfarjoun Sep 22, 2016

Contributor

It's deprecated for some reason...

+ UNKNOWN_SEQUENCE_LENGTH);
+ finalDict.addSequence(mergedSequence);
+ final int sequenceIndex = sequence.getSequenceIndex();
+ if (!dict2.getSequence(sequenceIndex).getSequenceName().equals(sequence.getSequenceName())) {
@tfenne

tfenne Sep 20, 2016

Owner

I think it would be better to validate up front, and then here you can assume everything is going to line up.

@yfarjoun

yfarjoun Sep 22, 2016

Contributor

ok. I'll do that.

+ sequence.getSequenceName(), dict2.getSequence(sequenceIndex).getSequenceName(), sequenceIndex));
+ }
+
+ final Set<String> allTags = new HashSet<>(dict1.getSequence(sequenceIndex).getAttributes()
@tfenne

tfenne Sep 20, 2016

Owner

I think this would be cleaner as:

final Set<String> allTags = new HashSet<>;
dict1.getSequence(sequenceIndex).getAttributes().forEach(a -> allTags.add(a.getKey());
dict2.getSequence(sequenceIndex).getAttributes().forEach(a -> allTags.add(a.getKey());

Though I would also suggest renaming sequence to s1 and defining an s2 at line 305 and then using them througout instead of repeatedly accessing through the dictionaries.

@yfarjoun

yfarjoun Sep 22, 2016

Contributor

good idea. done.

+
+ if (value1 != null && value2 != null && !value1.equals(value2)) {
+ if (tagsToMatch.contains(tag)) {
+ final String error = String.format("Cannot merge the two dictionaries. Found sequence entry for which " +
@tfenne

tfenne Sep 20, 2016

Owner

No need for String.format here, since log.error() can just be fed a comma separated list of things to concatenate and log.

+ final int length2 = dict2.getSequence(sequenceIndex).getSequenceLength();
+
+ if (length1 != UNKNOWN_SEQUENCE_LENGTH && length2 != UNKNOWN_SEQUENCE_LENGTH && length1 != length2) {
+ if (tagsToMatch.contains(SAMSequenceRecord.SEQUENCE_LENGTH_TAG)) {
@tfenne

tfenne Sep 20, 2016

Owner

I don't think you should allow merging to happen if the lengths differ, regardless.

@yfarjoun

yfarjoun Sep 22, 2016

Contributor

ok.

+ length1, length2);
+
+ log.error(error);
+ throw new IllegalArgumentException(error);
@tfenne

tfenne Sep 20, 2016

Owner

Logging and then throwing is just going to lead to the same error getting printed multiple times. Please pick one!

@yfarjoun

yfarjoun Sep 22, 2016

Contributor

OK. I thought we tend to do both to protect against someone catching the exception...

+ log.error(error);
+ throw new IllegalArgumentException(error);
+ } else {
+ log.warn("Found sequence entry for which " +
@tfenne

tfenne Sep 20, 2016

Owner

a) String concat when not needed. b) I don't think log.warn() will String.format() for you.

@yfarjoun

yfarjoun Sep 22, 2016

Contributor

ok.

@tfenne tfenne assigned yfarjoun and unassigned tfenne Sep 20, 2016

+ log.warn("Found sequence entry for which " +
+ "lengths differ: Sequence %s has lengths %s and %s. Using Value %s",
+ dict1.getSequence(sequenceIndex).getSequenceName(),
+ length1, length2, length1);
@magicDGS

magicDGS Sep 20, 2016

Contributor

If I understand correctly, the logging of the used value is not always length1 regarding the next line.

@yfarjoun

yfarjoun Sep 22, 2016

Contributor

I removed this if statement. length must be equal.

@yfarjoun yfarjoun - responding to review comments
316c085
@yfarjoun

I responded to the comments.. please re-review when available.

Coverage Status

Coverage increased (+0.3%) to 69.084% when pulling 316c085 on yf_add_tag_Aware_merge_samdictionary into b81d036 on master.

Contributor

yfarjoun commented Oct 2, 2016

@tfenne I responded to your review requests, could you take another look?

Contributor

yfarjoun commented Oct 27, 2016

@tfenne bump 🙏

yfarjoun closed this Oct 27, 2016

yfarjoun reopened this Oct 27, 2016

Coverage Status

Coverage increased (+1.0%) to 69.771% when pulling 316c085 on yf_add_tag_Aware_merge_samdictionary into b81d036 on master.

@tfenne

tfenne approved these changes Oct 27, 2016

@yfarjoun yfarjoun merged commit 45ee380 into master Oct 27, 2016

3 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details
coverage/coveralls Coverage increased (+1.0%) to 69.771%
Details

yfarjoun deleted the yf_add_tag_Aware_merge_samdictionary branch Oct 27, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment