SAMUtils:getOtherCanonicalAlignments extract 'SA' tag and return a list of supplementary alignments #685

Merged
merged 5 commits into from Aug 25, 2016

Conversation

Projects
None yet
5 participants
Contributor

lindenb commented Aug 16, 2016

Description

I've added a new function getOtherCanonicalAlignments in SAMUtils. This function is used to extract the 'SA' tag of a SAMRecord as a List<SAMRecord>of supplementary alignements.

    /**
     * Extract a List of 'other canonical alignments' from a SAM record. Those alignments are stored as a string in the 'SA' tag as defined
     * in the SAM specification.
     * Each record in the List is a non-paired read.
     * The name, sequence and qualities are copied from the original record
     * The SAM flag is set to <code>SUPPLEMENTARY_ALIGNMENT (+ READ_REVERSE_STRAND )</code>
     * @param record must be non null and must have a non-null associated header.
     * @return a list of 'other canonical alignments' SAMRecords. The list is empty if the 'SA' attribute is missing.
     */
public static List<SAMRecord> getOtherCanonicalAlignments(final SAMRecord record)
  • testOtherCanonicalAlignments was added to SAMUtilsTest ( create one read with SA flag, check the values of the supplementary alignments)

PS: The SA tag is defined in the spec as

Other canonical alignments in a chimeric alignment, formatted as a semicolon-delimited list: (rname,pos,strand,CIGAR,mapQ,NM;)+. Each element in the list represents a part of the chimeric alignment.

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

lindenb added some commits Aug 16, 2016

@lindenb lindenb extract suppl alignments 31b7195
@lindenb lindenb typo c100ed8
@lindenb lindenb variable name
1a8c36a

Coverage Status

Coverage increased (+0.02%) to 68.827% when pulling 1a8c36a on lindenb:sam_xa into c8202f2 on samtools:master.

Coverage Status

Coverage increased (+0.03%) to 68.833% when pulling 1a8c36a on lindenb:sam_xa into c8202f2 on samtools:master.

Coverage Status

Coverage increased (+0.02%) to 68.827% when pulling 1a8c36a on lindenb:sam_xa into c8202f2 on samtools:master.

Contributor

droazen commented Aug 16, 2016

@jamesemery could you please review?

@lindenb lindenb strand
de3873f

Coverage Status

Coverage increased (+0.03%) to 68.835% when pulling de3873f on lindenb:sam_xa into c8202f2 on samtools:master.

@jamesemery jamesemery commented on an outdated diff Aug 17, 2016

src/main/java/htsjdk/samtools/SAMUtils.java
+ otherRec.setReadBases( record.getReadBases() );
+ otherRec.setBaseQualities( record.getBaseQualities() );
+
+ /* get reference sequence */
+ final int tid = record.getHeader().getSequenceIndex( commaStrs[0] );
+ if( tid == -1 ) throw new SAMException("Unknown contig in " + semiColonStr);
+ otherRec.setReferenceIndex( tid );
+
+ /* fill other fields */
+ otherRec.setAlignmentStart( Integer.parseInt(commaStrs[1]) );
+ otherRec.setFlags(
+ SAMFlag.SUPPLEMENTARY_ALIGNMENT.flag +
+ (commaStrs[2].equals("+") ? 0 : SAMFlag.READ_REVERSE_STRAND.flag) );
+ otherRec.setCigar( TextCigarCodec.decode( commaStrs[3] ) );
+ otherRec.setMappingQuality( Integer.parseInt(commaStrs[4]) );
+ otherRec.setAttribute( SAMTagUtil.getSingleton().NM , Integer.parseInt(commaStrs[5]) );
@jamesemery

jamesemery Aug 17, 2016

Contributor

All of these .parseInt() calls open the possibility of breaking on bad input and exposing a stack trace to the user. Perhaps you should emit a SAMException if any of these fields don't parse correctly?

@jamesemery jamesemery commented on an outdated diff Aug 17, 2016

src/main/java/htsjdk/samtools/SAMUtils.java
@@ -1096,4 +1097,79 @@ public static SAMRecord clipOverlappingAlignedBases(final SAMRecord record, fina
public static boolean isValidUnsignedIntegerAttribute(long value) {
return value >= 0 && value <= BinaryCodec.MAX_UINT;
}
+
+ /**
+ * Extract a List of 'other canonical alignments' from a SAM record. Those alignments are stored as a string in the 'SA' tag as defined
+ * in the SAM specification.
+ * Each record in the List is a non-paired read.
+ * The name, sequence and qualities are copied from the original record.
+ * The SAM flag is set to <code>SUPPLEMENTARY_ALIGNMENT (+ READ_REVERSE_STRAND )</code>
+ * @param record must be non null and must have a non-null associated header.
+ * @return a list of 'other canonical alignments' SAMRecords. The list is empty if the 'SA' attribute is missing.
+ */
+ public static List<SAMRecord> getOtherCanonicalAlignments(final SAMRecord record) {
+ final Pattern semiColonPattern = Pattern.compile("[;]");
+ final Pattern commaPattern = Pattern.compile("[,]");
@jamesemery

jamesemery Aug 17, 2016

Contributor

Do you need to compile these with every call of this method? Perhaps these could be extracted into a static field that you pre-compile so you don't have to remake these patterns every time.

@jamesemery jamesemery commented on the diff Aug 17, 2016

src/test/java/htsjdk/samtools/SAMUtilsTest.java
+
+ SAMRecord other = suppl.get(0);
+ Assert.assertEquals(other.getReferenceName(),"2");
+ Assert.assertEquals(other.getAlignmentStart(),500);
+ Assert.assertFalse(other.getReadNegativeStrandFlag());
+ Assert.assertEquals(other.getMappingQuality(), 60);
+ Assert.assertEquals(other.getAttribute(SAMTagUtil.getSingleton().NM),1);
+ Assert.assertEquals(other.getCigarString(),"3S2=1X2=2S");
+
+ other = suppl.get(1);
+ Assert.assertEquals(other.getReferenceName(),"1");
+ Assert.assertEquals(other.getAlignmentStart(),191);
+ Assert.assertTrue(other.getReadNegativeStrandFlag());
+ Assert.assertEquals(other.getMappingQuality(), 60);
+ Assert.assertEquals(other.getAttribute(SAMTagUtil.getSingleton().NM),0);
+ Assert.assertEquals(other.getCigarString(),"8M2S");
@jamesemery

jamesemery Aug 17, 2016

Contributor

You should at least add a test that a record with no SA tag returns an empty list properly

@jamesemery jamesemery commented on the diff Aug 17, 2016

src/main/java/htsjdk/samtools/SAMUtils.java
+
+ for( final String semiColonStr : semiColonStrs ) {
+ /* ignore empty string */
+ if( semiColonStr.isEmpty() ) continue;
+
+ /* break string using comma */
+ final String commaStrs[] = commaPattern.split(semiColonStr);
+ if( commaStrs.length != 6 ) throw new SAMException("Bad 'SA' attribute in " + semiColonStr);
+
+ /* create the new record */
+ final SAMRecord otherRec = samReaderFactory.createSAMRecord( record.getHeader() );
+
+ /* copy fields from the original record */
+ otherRec.setReadName( record.getReadName() );
+ otherRec.setReadBases( record.getReadBases() );
+ otherRec.setBaseQualities( record.getBaseQualities() );
@jamesemery

jamesemery Aug 17, 2016

Contributor

Furthermore, what about the mate information?

@jamesemery jamesemery commented on an outdated diff Aug 17, 2016

src/main/java/htsjdk/samtools/SAMUtils.java
+ final SAMRecord otherRec = samReaderFactory.createSAMRecord( record.getHeader() );
+
+ /* copy fields from the original record */
+ otherRec.setReadName( record.getReadName() );
+ otherRec.setReadBases( record.getReadBases() );
+ otherRec.setBaseQualities( record.getBaseQualities() );
+
+ /* get reference sequence */
+ final int tid = record.getHeader().getSequenceIndex( commaStrs[0] );
+ if( tid == -1 ) throw new SAMException("Unknown contig in " + semiColonStr);
+ otherRec.setReferenceIndex( tid );
+
+ /* fill other fields */
+ otherRec.setAlignmentStart( Integer.parseInt(commaStrs[1]) );
+ otherRec.setFlags(
+ SAMFlag.SUPPLEMENTARY_ALIGNMENT.flag +
@jamesemery

jamesemery Aug 17, 2016

Contributor

Is it correct to set the Supplementary Alignment flag for every SA read? What if record is supplementary? In that case wouldn't you want to set the the first flag to primary according to convention?

@lindenb lindenb fix PR
899227b
Contributor

lindenb commented Aug 17, 2016

Thank you for the review and back to you @jamesemery

  • I've considered the mate
  • surrounded parseInt with try catch
  • considered the first suppl alignment if the read is supplementary
  • moved the Pattern out of the function.
  • updated the test

NB: I'll be away from github for a few days.

Coverage Status

Coverage increased (+0.03%) to 68.841% when pulling 899227b on lindenb:sam_xa into c8202f2 on samtools:master.

Contributor

jamesemery commented Aug 19, 2016

@droazen These changes look good to me

@lbergelson lbergelson merged commit c3d5a88 into samtools:master Aug 25, 2016

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.03%) to 68.841%
Details
Contributor

lbergelson commented Aug 25, 2016

Thanks @lindenb and @jamesemery

jamesemery referenced this pull request in broadinstitute/gatk Aug 25, 2016

Closed

Add API support for SA tags #2115

lindenb deleted the lindenb:sam_xa branch Aug 25, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment