Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring FastqRecord #572

Merged
merged 6 commits into from
Feb 27, 2017
Merged

Conversation

magicDGS
Copy link
Member

@magicDGS magicDGS commented Apr 15, 2016

Motivation

SAMRecords store bases and qualities in byte[], but using StringUtil and SAMUtils for it. For an user that would like to retrieve in the same way the sequence/qualities from a FastqRecord, which basically store the same read, it is needed to go deeper in the API to understand how this is transformed (personal experience).

Having a Read RawRead interface with defined contract for these methods could help to overcome that issue. Like that, methods like SolexaQualityConverter or TrimmingUtils could be used for FastqRecord, and QualityEncodingDetector could be clean up to include any kind of reads.

Another usage is when transforming a SAMRecord to a FastqRecord, which could be done easier if the original base qualities are not requested. It will be useful, for instance, in Picard tools SamToFastq.

Description (only first commit)

  • Implementation of ReadRawRead interface
  • SAMRecord and FastqRecord extends Read (does not change the API)
  • Adding constructors to FastqRecord for byte[] and from other Read
  • Clean up of FastqRecord interface changing private variable names according to the Read interface
  • Adding new javadoc descriptions for FastqRecord
  • Method deprecation in favor of ReadRawRead API

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary

@magicDGS
Copy link
Member Author

magicDGS commented Apr 15, 2016

I would like to re-implement also FastqWriter to include a method to write ReadRawRead in a FASTQ file, but this will change the API and I don't know if it will be useful for everybody.

@magicDGS
Copy link
Member Author

magicDGS commented Apr 15, 2016

Third commit change the API: toString() does not encode a FASTQ-file format and FastqCodec class do that job.

*
* @author Daniel Gómez-Sánchez (magicDGS)
*/
public interface Read {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling this Read will be confusing for a lot of people who think of a SamRecord as a Read. Could you rename this so it's clear that it's more limited than SamRecord? RawRead was a suggestion from an in person conversation that a lot of people agreed with.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ReadSequence may be a better name. Or some other name. Read is too broad

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about RawRead?

On Tue, Jun 14, 2016 at 2:26 PM, Adam Kiezun notifications@github.com
wrote:

In src/java/htsjdk/samtools/Read.java
#572 (comment):

  • * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
  • * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
  • * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
  • * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
  • * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
  • * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
  • * THE SOFTWARE.
  • /
    +package htsjdk.samtools;
    +
    +/
    *
  • * Simple interface for reads: DNA sequences with associated base quality.
  • * @author Daniel Gómez-Sánchez (magicDGS)
  • */
    +public interface Read {

I think ReadSequence may be a better name. Or some other name. Read is
too broad


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/samtools/htsjdk/pull/572/files/15a832457f0e5ac062053ea04d8c5532c0efe6c5#r67028043,
or mute the thread
https://github.com/notifications/unsubscribe/ACnk0uttdRMtS17ZmRv-AHpWhEK0ml2Eks5qLvJCgaJpZM4IIcf-
.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change it for the one that you prefer...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the name should actually depend on the (as yet unspecified) behaviour of some of the interface methods! RawRead would imply to me that I'm getting what came off the sequencer without any trimming, RC'ing or BSQR or other changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed as RawRead (also in the PR)

@magicDGS
Copy link
Member Author

I think that this is also a good opportunity for change the behavior of empty reads as suggested in #557.

* Default constructor
*
* @param readName the read name
* @param readStrig the sequence string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in readStrig. Also, could you make clear that these are the bases? I feel like sequence can be interpreted in a number of ways

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed for more descriptive parameter + improving javadoc

@magicDGS magicDGS changed the title added read interface and refactoring FastqRecord added RawRead interface and refactoring FastqRecord Jun 17, 2016
@magicDGS
Copy link
Member Author

magicDGS commented Jun 17, 2016

@lbergelson, back to you with some questions:

  1. Which FASTQ convention will we follow? I guess that applying the chages in the API proposed by 100% test coverage for FastqRecord #557 (no zero-length reads), and non-null read names will be the best practice here.
  2. Should we get rid of the quality header? I think that for FASTQ records the quality header will be still needed if people want to add some information, although in my case I don't use it never...
  3. Copy constructor for FastqRecord and RawRead are different, but I don't know if it will be more or less expensive to check the class with instanceof.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.001%) to 68.399% when pulling 4198545 on magicDGS:dgs_read_interface into bcc3f4a on samtools:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.007%) to 68.406% when pulling 4198545 on magicDGS:dgs_read_interface into bcc3f4a on samtools:master.

@magicDGS
Copy link
Member Author

Now it pass the checks, @lbergelson. Just waiting for the decision of which FASTQ convention should we follow...

@droazen
Copy link
Contributor

droazen commented Jun 28, 2016

@magicDGS Discussing this PR, it seems to us that the union of SAMRecord and FastQRecord under a single interface is a bit awkward/forced, since you still need to bypass the interface in your codec to get at the quality header, and the read name means different things for the two formats. Perhaps simple overloads of encode() for both SAMRecord and FastQRecord would be preferable to a forced union at the type system level? What do you think? Would separate overloads cause problems for you?

@magicDGS
Copy link
Member Author

magicDGS commented Jul 1, 2016

I don't see the forcing in the single interface for both, because a read which comes in a FASTQ is feed to the aligner to obtain a SAM record. Thus, they are representing the same kind of record before and after processing them. In addition, some services are starting to provide raw reads in the BAM format, so it also make sense for me to generate the interface. Another benefit for the common interface is that other code in htsjdk and related projects could be simplify, like the QualityEncodingDetector.

For me it will be easy to have them together because I'm using FASTQ/BAM sources for raw reads, and develop two different utils classes/methods generates lots of repetitive code. In addition, the quality encoding in bytes for SAMRecords is not straightforward and I had troubles trying to mimic the behavious in my own code for FastqRecords.

Nevertheless, if the rest of the community thinks that the RawRead interface is not worth, I can change the FastqRecord code in this PR to just generate the byte arrays in the same way as SAMRecord and update the methods. One of the aims of this PR is make SolexaQualityConverter work with both classes in a later PR, and generate an IlluminaQualityConverter to simplify the management of missencoded SAM/BAM files and standardize FASTQ files to Sanger.

I will be happy either with the RawRead interface or with the update of FastqRecord, so I wait to your answer @droazen. Thank you again for the help and fruitful discussion in my PRs.

@magicDGS
Copy link
Member Author

@lbergelson or @droazen, any feedback about this?

@nh13
Copy link
Member

nh13 commented Jul 26, 2016

@magicDGS

  1. Read names for pairs must be the same in SAM, but in FASTQ there are different conventions (ex. trailing /1 and /2 for end one and two respectively). See this method too.
  2. Quality scores in SAM have one format, in FASTQ they don't (see this). If folks don't have the correct encoding, then the should look to the tools that originally encoded them! I recommend Picard's FastqToSam.
  3. Not using FASTQs to store raw reads but using SAM has been done for a long time now. For example, see Picard's IlluminaBasecallsToSam which converts an Illumina run folder (BCLs) to SAM directly. The only time we want to create FASTQs is temporarily to feed into the aligner. The latter is usually streaming anyways (Picard's SamToFastq into bwa into Picard's MergeBamAlignment).

I give this a 👎 for two reasons:

  1. The use of FASTQs should be minimized as much as possible as a best practice. Get the data into SAM as soon as possible, which in many cases can be done without creating FASTQs, and then work on it from there.
  2. Supporting non-standard base quality encodings in SAM is a bad idea, as then it is not a SAM file according to the spec. Look to the tool or users that created the non-standard encoded SAM file. I'd be curious about this line of inquiry.

@magicDGS
Copy link
Member Author

@nh13, first of all thank you for your comments. I really understand your point because I'm very annoyed by the lack of FASTQ standard practices. Most of the aligners feed the FASTQ files directly without any checking of the qualities, and this result a lot of times into misencoded BAM files. I agree that FASTQ files should be used (I'm not using them) and that supporting other encodings in SAM/BAM is a very bad idea, and I'm not suggesting here that (actually, my PR does not contain anything about supported qualities).

Answering your first list:

  1. I know that the conventions for the read names are different in the specs, but actually there is no restriction in them for any of the implementations (there is not checking as far as I know). The interface that I implemented here is just a convenience interface, but implementations could impose their own rules regarding the restriction of the read name.
  2. I really like this point because I'm very disappointed with the decisions regarding FASTQ quality encodings, and that's one of the reasons that I would like this interface into htsjdk. My plan is not support SAM/BAM files in other encodings, but have common methods for SAMRecord/FastqRecord to convert to the standard format both kind of files in the same way, without dealing with differences in the implementation.
  3. I don't like to store reads in FASTQ format, but it doesn't depends on me at the moment. The problem that I'm facing now is when the reads comes from very different sources and I have to deal with both formats at the same time.

I agree with you in every point, but it's not up to me how the reads comes from the service or how my institution keeps the reads, and at this point both are providing raw reads in tons of different formats.

In addition, I think that this interface is a good idea to start normalizing things in genomic data, because from my point of view is very weird that data before processing and data after processing could not be represented by a common interface. According to a programming structure, it is exactly the same data but after include more information: the FastqRecord represents a sequence associated with quality scores and nothing else; the SAMRecod represents the same sequence with quality scores, but aligned to a reference genome.

Nevertheless, I can change this PR to only implement the changes in the FastqRecord API, becuase I think that at least the method to retrieve qualities and bases that mimic the methods in SAMRecord could be useful for other API users who need to deal with FASTQ files.

@nh13
Copy link
Member

nh13 commented Jul 26, 2016

@magicDGS

  • I am a bit confused by "My plan is not support SAM/BAM files in other encodings, but have common methods for SAMRecord/FastqRecord to convert to the standard format" since that implies that SAMRecord is in the wrong format or is itself not a standard format. Again, I would urge you to understand why the SAM record has the wrong encoding, and correct the upstream issue (when was the first time you or someone else converted it to SAM?).
  • If you think that there is a high-value use case to use both FASTQ and SAM at the same time for which this PR would make it easier, please present it. Keep in mind that if you could convert FASTQ to SAM in my mind it doesn't qualify.
  • I am suggesting that you do not deal with FASTQ and SAM in parallel, but namely convert FASTQ to SAM as soon as possible; treat FASTQ as a temporary format and SAM as the standard format.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) to 68.932% when pulling 9dfe7da on magicDGS:dgs_read_interface into c3d5a88 on samtools:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) to 68.932% when pulling 9dfe7da on magicDGS:dgs_read_interface into c3d5a88 on samtools:master.

@magicDGS
Copy link
Member Author

Can you review this, @lbergelson? Thanks a lot in advance!

@magicDGS
Copy link
Member Author

This is here for a long time now. Should I close, @lbergelson? I think that tons of methods here are very useful, apart of making more consistent the API...

@lbergelson
Copy link
Member

Ack, yes, this has been here for a very long time. Don't close. I will take a look.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@magicDGS A few minor comments and then good to go. Sorry this one sat forever. I assigned it to the "things to do later" mental category and the longer it sat the more easily I ignored it.

import htsjdk.samtools.util.SequenceUtil;

/**
* Codec for encode records into FASTQ format.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reads better as "codec for encoding"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


/**
* @return the read name
* @deprecated use {@link #getReadName()} instead
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a note to each deprecated function saying when they were deprecated? month and year is enough. (assume we'll merge this this month....)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return true;
}


/** Simple toString() that gives a read name and length */
@Override
public String toString() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'm a bit afraid that there is a code out there in the while that relies on this giving the old output.

Could you also add a method toFastQString() that also calls through to encode so that's there's an obvious way to do what this used to do? I think toString should probably call through to that and produce the same output as it used to just to avoid giving people surprise headaches.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is discourage to use toString() to other purposes that is not developmental, but it is true that some code may rely on this. I changed it and point to the toFastQString, but I would rather prefer to add a comment saying that this is discouraged and that this will change at some point (if we set a date, will be nice). What do you think, @lbergelson?

*
* @author Daniel Gomez-Sanchez (magicDGS)
*/
public class FastqCodec {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really a codec? It doesn't have a decode, just an encode. Maybe it should be called FastqEncoder? Or maybe it should just be FastqUtils since it also does the conversion with SamRecords?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't remember why I named it like that, but I guess that it is because it is also decoding SAMRecord into FastqRecord. Renamed as FastqEncoder, because in FastqUtils I would expect to have the constants and methods like getSamReadNameFromFastqHeader.

*
* @author Daniel Gomez-Sanchez (magicDGS)
*/
public class FastqCodec {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make class final

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* @author Daniel Gomez-Sanchez (magicDGS)
*/
public class FastqCodec {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a private noarg constructor to prevent people from instantiating instances of this class since it's really a utility class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, and adding comment to the constructor to point out why.

/**
* Encodes a FastqRecord in the String FASTQ format.
*/
public static String encode(final FastqRecord record) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be an encode override that takes a sam record as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a very nice idea, although it is just a sugar syntax. Done with the following implementation: encode(asFastqRecord(record))

@magicDGS
Copy link
Member Author

I addressed all your comments, @lbergelson. I would like to have a more developmental-like toString in FastqRecord because using it for other purposes should be discouraged, but I understand that some code may rely on it. I propose to add a comment to the javadoc saying that it is going to change and be substituted by toFastQString at some point (setting a date/release will be great). What do you think? And the rest of people involved in the conversation?

@codecov-io
Copy link

codecov-io commented Feb 24, 2017

Codecov Report

Merging #572 into master will increase coverage by 0.021%.
The diff coverage is 82.051%.

@@               Coverage Diff               @@
##              master      #572       +/-   ##
===============================================
+ Coverage     64.843%   64.864%   +0.021%     
- Complexity      7160      7174       +14     
===============================================
  Files            525       526        +1     
  Lines          31701     31731       +30     
  Branches        5420      5424        +4     
===============================================
+ Hits           20556     20582       +26     
- Misses          8996      9000        +4     
  Partials        2149      2149
Impacted Files Coverage Δ Complexity Δ
...ain/java/htsjdk/samtools/fastq/FastqConstants.java 0% <ø> (ø) 0 <0> (ø)
...c/main/java/htsjdk/samtools/util/SequenceUtil.java 69.821% <100%> (ø) 115 <0> (ø)
...n/java/htsjdk/samtools/fastq/BasicFastqWriter.java 50% <100%> (-3.846%) 5 <1> (ø)
...c/main/java/htsjdk/samtools/fastq/FastqRecord.java 85.507% <80.435%> (-1.993%) 33 <15> (+2)
.../main/java/htsjdk/samtools/fastq/FastqEncoder.java 82.759% <82.759%> (ø) 12 <12> (?)
...dk/samtools/seekablestream/SeekableHTTPStream.java 54.545% <0%> (-1.515%) 9% <0%> (-1%)
...samtools/util/AsyncBlockCompressedInputStream.java 76% <0%> (+4%) 13% <0%> (+1%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 912c28b...a990401. Read the comment docs.

@lindenb
Copy link
Contributor

lindenb commented Feb 24, 2017

A minor suggestion: for FastqCodec, instead of creating a StringBuilder for each record, wouldn't it be better to create a method using Appendable , then we can use this FastqCodec in a FastqWriter without creating a new String in memory ?. Also, for large reads, we can reserve some memory for StringBuilder as we have an idea of the final length ?

Something like:

    public static String encode(final FastqRecord record) {
        return write(new StringBuilder(record.getReadLength()*2+record.getReadName().length()+5),record).toString();
    }


    public static java.lang.Appendable write(final java.lang.Appendable out,final FastqRecord record) {
        final String readName = record.getReadName();
        final String readString = record.getReadString();
        final String qualHeader = record.getBaseQualityHeader();
        final String qualityString = record.getBaseQualityString();
        return out.append(FastqConstants.SEQUENCE_HEADER).append(readName == null ? "" : readName).append('\n')
                .append(readString == null ? "" : readString).append('\n')
                .append(FastqConstants.QUALITY_HEADER).append(qualHeader == null ? "" : qualHeader).append('\n')
                .append(qualityString == null ? "" : qualityString);
    }

P.

@magicDGS
Copy link
Member Author

@lindenb, sounds like a good improvement. I'd push a commit with the change and use it in FastqWriter.

@magicDGS
Copy link
Member Author

Commit d5cc096 includes @lindenb suggestion, which is quite useful.

* Writes a FastqRecord into the Appendable output.
* @throws SAMException if any I/O error occurs.
*/
public static Appendable write(final Appendable out,final FastqRecord record) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Good addition.

*/
public final class FastqEncoder {

// cannot be instantiated because it is an utility class
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

*/
public static String encode(final FastqRecord record) {
// reserve some memory based on the read length and read name
final int capacity = record.getReadLength() * 2 + record.getReadName().length() + 5;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@magicDGS The tests are failing with a null pointer exception. It looks like it's caused by this unguarded call of record.getReadName().length()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking if I should change the contract of the getters to do not return null? What do you think? I guess that it's safer for all the cases, including the new getters.

@@ -23,62 +23,169 @@
*/
package htsjdk.samtools.fastq;

import htsjdk.samtools.SAMRecord;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This import should probably go away.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@lbergelson
Copy link
Member

@magicDGS There's a test failure due to a NullPointerException. Could you fix that and rebase this on master? I think it's good to go once that's done. Thanks for the suggestion @lindenb .

@magicDGS
Copy link
Member Author

Three new commits in the branch, @lbergelson:

  1. Removed the unused import (although this is added in 3)
  2. Fix the NPE for null read names
  3. New getters for byte[] bases/quals does not return null anymore, but SAMRecord.NULL_SEQUENCE and SAMRecord.NULL_QUALS

I separated the 3rd commit because it's the only one that should be reviewed and/or removed if not accepted the solution. Because they are new getters, we are not changing the API. But maybe it will be important to change it for the other ones to avoid NPE for clients.

@lbergelson
Copy link
Member

@magicDGS I hate the original null returning api. I don't know why the original decision was to return null instead of empty string, that always seems wrong to me. It's very explicitly choosing to return null though, so I imagine breaking it would probably cause problems. 👍 To returning the samrecord empty values for the new ones.

@lbergelson lbergelson merged commit c1227d8 into samtools:master Feb 27, 2017
@magicDGS
Copy link
Member Author

Thanks @lbergelson!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.