Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
Yf add tag aware merge samdictionary #689
Conversation
coveralls
commented
Aug 21, 2016
|
@tfenne This is the htsjdk part we discussed in broadinstitute/picard#627 Can you take a look? |
coveralls
commented
Aug 22, 2016
tfenne
was assigned
by yfarjoun
Aug 28, 2016
|
friendly ping. @tfenne. |
| + * Requires that dictionaries have the same SAMSequence records in the same order. | ||
| + * For each sequenceIndex, the union of the tags from both sequences will be added to the new sequence, mismatching | ||
| + * values (for tags that are in both) will generate a warning, and the value from dict1 will be used. | ||
| + * For tags that are in tagsToEquate. an unequal value will generate an error (an IllegalArgumentException will |
| + * @param tagsToMatch list of tags that must be equal if present in both sequence. Typical values will be MD, and LN | ||
| + * @return dictionary consisting of the same sequences as the two inputs with the merged values of tags. | ||
| + */ | ||
| + static public SAMSequenceDictionary mergeDictionaries(final SAMSequenceDictionary dict1, |
tfenne
Sep 20, 2016
Owner
Function name should be more descriptive given all the different ways in which one could mergeDictionaries. Perhaps mergeTags instead?
| + final SAMSequenceDictionary dict2, | ||
| + final List<String> tagsToMatch) { | ||
| + | ||
| + if(dict1.getSequences().size()!=dict2.getSequences().size()) { |
tfenne
Sep 20, 2016
Owner
If you're going to check this, why not go all the way and check that the two dictionaries contain the exact same set of names up front?
tfenne
Sep 22, 2016
Owner
I think it's much clearer to write methods where possible that:
a) validate their inputs up front
b) proceed to implement the intended function assuming/knowing they have valid inputs
tfenne
requested changes
Sep 20, 2016
Lots of comments @yfarjoun. Most are minor but I think it would be good to:
a) Do all the validation up front
b) Not continually do1dict.getSequence(name)...
c) Always require LN and MD to match if they are [logically] present on both records
| + */ | ||
| + static public SAMSequenceDictionary mergeDictionaries(final SAMSequenceDictionary dict1, | ||
| + final SAMSequenceDictionary dict2, | ||
| + final List<String> tagsToMatch) { |
tfenne
Sep 20, 2016
Owner
Circling back, I think this parameter should probably go away. Length (if present) and MD5 (if present) should be required to match in all cases, since by definition they should match. Other tags, I'm not sure it matters if they match.
yfarjoun
Sep 22, 2016
Contributor
I agree about Length and MD5.
Other tags there are two options when they do not match: take the first, or throw an exception.
My idea is that if the tags are in the list, an exception will be thrown, otherwise the value from the first dictionary will be taken. Sounds good?
tfenne
Sep 22, 2016
Owner
I guess my thought is that the set of tags on SQ lines is known, so we know which are ok to differ vs. not, and there's no need to supply a list of tags. I don't care that much though.
| + final SAMSequenceDictionary finalDict = new SAMSequenceDictionary(); | ||
| + for (final SAMSequenceRecord sequence : dict1.getSequences()) { | ||
| + final SAMSequenceRecord mergedSequence = new SAMSequenceRecord(sequence.getSequenceName(), | ||
| + UNKNOWN_SEQUENCE_LENGTH); |
tfenne
Sep 20, 2016
Owner
SAMSequenceRecord already has a constructor that takes just the name - no need to define UNKNOWN_SEQUENCE_LENGTH and do this.
| + UNKNOWN_SEQUENCE_LENGTH); | ||
| + finalDict.addSequence(mergedSequence); | ||
| + final int sequenceIndex = sequence.getSequenceIndex(); | ||
| + if (!dict2.getSequence(sequenceIndex).getSequenceName().equals(sequence.getSequenceName())) { |
tfenne
Sep 20, 2016
Owner
I think it would be better to validate up front, and then here you can assume everything is going to line up.
| + sequence.getSequenceName(), dict2.getSequence(sequenceIndex).getSequenceName(), sequenceIndex)); | ||
| + } | ||
| + | ||
| + final Set<String> allTags = new HashSet<>(dict1.getSequence(sequenceIndex).getAttributes() |
tfenne
Sep 20, 2016
Owner
I think this would be cleaner as:
final Set<String> allTags = new HashSet<>;
dict1.getSequence(sequenceIndex).getAttributes().forEach(a -> allTags.add(a.getKey());
dict2.getSequence(sequenceIndex).getAttributes().forEach(a -> allTags.add(a.getKey());Though I would also suggest renaming sequence to s1 and defining an s2 at line 305 and then using them througout instead of repeatedly accessing through the dictionaries.
| + | ||
| + if (value1 != null && value2 != null && !value1.equals(value2)) { | ||
| + if (tagsToMatch.contains(tag)) { | ||
| + final String error = String.format("Cannot merge the two dictionaries. Found sequence entry for which " + |
tfenne
Sep 20, 2016
Owner
No need for String.format here, since log.error() can just be fed a comma separated list of things to concatenate and log.
| + final int length2 = dict2.getSequence(sequenceIndex).getSequenceLength(); | ||
| + | ||
| + if (length1 != UNKNOWN_SEQUENCE_LENGTH && length2 != UNKNOWN_SEQUENCE_LENGTH && length1 != length2) { | ||
| + if (tagsToMatch.contains(SAMSequenceRecord.SEQUENCE_LENGTH_TAG)) { |
tfenne
Sep 20, 2016
Owner
I don't think you should allow merging to happen if the lengths differ, regardless.
| + length1, length2); | ||
| + | ||
| + log.error(error); | ||
| + throw new IllegalArgumentException(error); |
tfenne
Sep 20, 2016
Owner
Logging and then throwing is just going to lead to the same error getting printed multiple times. Please pick one!
yfarjoun
Sep 22, 2016
Contributor
OK. I thought we tend to do both to protect against someone catching the exception...
| + log.error(error); | ||
| + throw new IllegalArgumentException(error); | ||
| + } else { | ||
| + log.warn("Found sequence entry for which " + |
tfenne
Sep 20, 2016
Owner
a) String concat when not needed. b) I don't think log.warn() will String.format() for you.
tfenne
assigned yfarjoun and unassigned tfenne
Sep 20, 2016
| + log.warn("Found sequence entry for which " + | ||
| + "lengths differ: Sequence %s has lengths %s and %s. Using Value %s", | ||
| + dict1.getSequence(sequenceIndex).getSequenceName(), | ||
| + length1, length2, length1); |
magicDGS
Sep 20, 2016
Contributor
If I understand correctly, the logging of the used value is not always length1 regarding the next line.
coveralls
commented
Sep 23, 2016
|
@tfenne I responded to your review requests, could you take another look? |
|
@tfenne bump |
yfarjoun commentedAug 21, 2016
•
edited
Description
This adds a static function that can merge two SamSequenceDictionaries into one. The problem it's solving is that in MergeBamAlignment the reference dictionary and the aligned Bam dictionary, while having the same sequences, may have different tags. In particular, the reference dictionary has lots of interesting tags like MD, SP, UR etc. and the (bwa) dictionary in the aligned bam may contain AH which informs us that the sequence is an Alternative Haplotype.
This PR enables MergeBamAlignment to merge the two dictionaries rather than only use the reference one (as it currently does)
Checklist