Refactoring FastqRecord #572

Merged
merged 6 commits into from Feb 27, 2017

Conversation

Projects
None yet
Contributor

magicDGS commented Apr 15, 2016 edited

Motivation

SAMRecords store bases and qualities in byte[], but using StringUtil and SAMUtils for it. For an user that would like to retrieve in the same way the sequence/qualities from a FastqRecord, which basically store the same read, it is needed to go deeper in the API to understand how this is transformed (personal experience).

Having a Read RawRead interface with defined contract for these methods could help to overcome that issue. Like that, methods like SolexaQualityConverter or TrimmingUtils could be used for FastqRecord, and QualityEncodingDetector could be clean up to include any kind of reads.

Another usage is when transforming a SAMRecord to a FastqRecord, which could be done easier if the original base qualities are not requested. It will be useful, for instance, in Picard tools SamToFastq.

Description (only first commit)

  • Implementation of ReadRawRead interface
  • SAMRecord and FastqRecord extends Read (does not change the API)
  • Adding constructors to FastqRecord for byte[] and from other Read
  • Clean up of FastqRecord interface changing private variable names according to the Read interface
  • Adding new javadoc descriptions for FastqRecord
  • Method deprecation in favor of ReadRawRead API

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
Contributor

magicDGS commented Apr 15, 2016 edited

I would like to re-implement also FastqWriter to include a method to write ReadRawRead in a FASTQ file, but this will change the API and I don't know if it will be useful for everybody.

Contributor

magicDGS commented Apr 15, 2016 edited

Third commit change the API: toString() does not encode a FASTQ-file format and FastqCodec class do that job.

yfarjoun referenced this pull request Jun 11, 2016

Open

Review "Party" #548

lbergelson was assigned by droazen Jun 14, 2016

@lbergelson lbergelson and 4 others commented on an outdated diff Jun 14, 2016

src/java/htsjdk/samtools/Read.java
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+package htsjdk.samtools;
+
+/**
+ * Simple interface for reads: DNA sequences with associated base quality.
+ *
+ * @author Daniel Gómez-Sánchez (magicDGS)
+ */
+public interface Read {
@lbergelson

lbergelson Jun 14, 2016

Contributor

Calling this Read will be confusing for a lot of people who think of a SamRecord as a Read. Could you rename this so it's clear that it's more limited than SamRecord? RawRead was a suggestion from an in person conversation that a lot of people agreed with.

@akiezun

akiezun Jun 14, 2016

Contributor

I think ReadSequence may be a better name. Or some other name. Read is too broad

@yfarjoun

yfarjoun Jun 14, 2016

Contributor

how about RawRead?

On Tue, Jun 14, 2016 at 2:26 PM, Adam Kiezun notifications@github.com
wrote:

In src/java/htsjdk/samtools/Read.java
#572 (comment):

  • * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
  • * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
  • * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
  • * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
  • * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
  • * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
  • * THE SOFTWARE.
  • /
    +package htsjdk.samtools;
    +
    +/
    *
  • * Simple interface for reads: DNA sequences with associated base quality.
  • * @author Daniel Gómez-Sánchez (magicDGS)
  • */
    +public interface Read {

I think ReadSequence may be a better name. Or some other name. Read is
too broad


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/samtools/htsjdk/pull/572/files/15a832457f0e5ac062053ea04d8c5532c0efe6c5#r67028043,
or mute the thread
https://github.com/notifications/unsubscribe/ACnk0uttdRMtS17ZmRv-AHpWhEK0ml2Eks5qLvJCgaJpZM4IIcf-
.

@magicDGS

magicDGS Jun 15, 2016

Contributor

I will change it for the one that you prefer...

@tfenne

tfenne Jun 15, 2016

Owner

I think that the name should actually depend on the (as yet unspecified) behaviour of some of the interface methods! RawRead would imply to me that I'm getting what came off the sequencer without any trimming, RC'ing or BSQR or other changes.

@magicDGS

magicDGS Jun 17, 2016

Contributor

Renamed as RawRead (also in the PR)

Contributor

magicDGS commented Jun 15, 2016

I think that this is also a good opportunity for change the behavior of empty reads as suggested in #557.

@lbergelson lbergelson and 1 other commented on an outdated diff Jun 15, 2016

src/java/htsjdk/samtools/fastq/FastqRecord.java
- if (seqHeaderPrefix != null && !seqHeaderPrefix.isEmpty()) this.seqHeaderPrefix = seqHeaderPrefix;
- else this.seqHeaderPrefix = null;
- if (qualHeaderPrefix != null && !qualHeaderPrefix.isEmpty()) this.qualHeaderPrefix = qualHeaderPrefix;
- else this.qualHeaderPrefix = null;
- this.seqLine = seqLine ;
- this.qualLine = qualLine ;
+ private final String readName;
+ private final String readString;
+ private final String qualityHeader;
+ private final String qualityString;
+
+ /**
+ * Default constructor
+ *
+ * @param readName the read name
+ * @param readStrig the sequence string
@lbergelson

lbergelson Jun 15, 2016

Contributor

Typo in readStrig. Also, could you make clear that these are the bases? I feel like sequence can be interpreted in a number of ways

@magicDGS

magicDGS Jun 17, 2016

Contributor

Changed for more descriptive parameter + improving javadoc

@lbergelson lbergelson and 1 other commented on an outdated diff Jun 15, 2016

src/java/htsjdk/samtools/fastq/FastqRecord.java
- this.qualLine = qualLine ;
+ private final String readName;
+ private final String readString;
+ private final String qualityHeader;
+ private final String qualityString;
+
+ /**
+ * Default constructor
+ *
+ * @param readName the read name
+ * @param readStrig the sequence string
+ * @param qualityHeader the quality header
+ * @param qualityString the quality string
+ */
+ public FastqRecord(final String readName, final String readStrig, final String qualityHeader, final String qualityString) {
+ if (readName != null && !readName.isEmpty()) this.readName = readName;
@lbergelson

lbergelson Jun 15, 2016

Contributor

could you either add brackets to theses lines or make them into ternary expressions?

@magicDGS

magicDGS Jun 17, 2016

Contributor

Added brackets.

@lbergelson lbergelson and 1 other commented on an outdated diff Jun 15, 2016

src/java/htsjdk/samtools/fastq/FastqRecord.java
- /** shortcut to getReadString().length() */
- public int length() { return this.seqLine==null?0:this.seqLine.length();}
-
+ @Override
+ public String getBaseQualityString() {
+ return qualityString;
+ }
+
+ @Override
+ public byte[] getBaseQualities() {
+ return SAMUtils.fastqToPhred(qualityString);
+ }
+
+ @Override
+ public int getReadLength() {
+ return readString.length();
@lbergelson

lbergelson Jun 15, 2016

Contributor

Shouldn't this have a null check?

@magicDGS

magicDGS Jun 17, 2016

Contributor

Done

@lbergelson lbergelson and 1 other commented on an outdated diff Jun 15, 2016

src/java/htsjdk/samtools/fastq/FastqRecord.java
+
+ /**
+ * Get the base quality header
+ *
+ * @return the base quality header
+ */
+ public String getBaseQualityHeader() {
+ return qualityHeader;
+ }
+
+ /**
+ * shortcut to getReadString().length()
+ *
+ * @deprecated use {@link #getReadLength()} instead
+ */
+ @Deprecated
@lbergelson

lbergelson Jun 15, 2016

Contributor

This should probably just delegate to getReadLength()

@magicDGS

magicDGS Jun 17, 2016

Contributor

Done

@lbergelson lbergelson and 1 other commented on an outdated diff Jun 15, 2016

src/java/htsjdk/samtools/fastq/FastqRecord.java
- /** copy constructor */
+
+ /**
+ * Constructor for byte[] arrays
+ *
+ * @param readName the read name
+ * @param readBases the read sequence as ASCII bytes ACGTN=.
+ * @param qualityHeader the quality header
+ * @param readQualities the base qualities as binary PHRED scores (not ASCII)
+ */
+ public FastqRecord(final String readName, final byte[] readBases, final String qualityHeader, final byte[] readQualities) {
+ this(readName, StringUtil.bytesToString(readBases), qualityHeader, SAMUtils.phredToFastq(readQualities));
+ }
+
+ /**
+ * Copy constructor from a read
@lbergelson

lbergelson Jun 15, 2016

Contributor

remember to update this doc when you update the interface name

@magicDGS

magicDGS Jun 17, 2016

Contributor

Done

@lbergelson lbergelson and 1 other commented on an outdated diff Jun 15, 2016

src/java/htsjdk/samtools/fastq/FastqRecord.java
+ }
+
+ /**
+ * Copy constructor from a read
+ *
+ * @param read the read to convert to a FastqRecord
+ */
+ public FastqRecord(final Read read) {
+ this(read.getReadName(), read.getReadString(), (read instanceof FastqRecord) ? ((FastqRecord) read).qualityHeader : null, read.getBaseQualityString());
+ }
+
+ /**
+ * Copy constructor
+ *
+ * @param other record to copy
+ */
public FastqRecord(final FastqRecord other) {
@lbergelson

lbergelson Jun 15, 2016

Contributor

Do we need this constructor since we already have the copy constructor from Read? It doesn't seem like it's going to be particularly more efficient since the one from Read already uses the string encoding.

@lbergelson

lbergelson Jun 15, 2016

Contributor

Unless that instance of check is very expensive...

@magicDGS

magicDGS Jun 17, 2016

Contributor

I don't know if it is very expensive or not. I didn't change it yet, so it's up to you if I remove this constructor...

@lbergelson lbergelson and 1 other commented on an outdated diff Jun 15, 2016

src/java/htsjdk/samtools/fastq/FastqRecord.java
+ * @param readName the read name
+ * @param readBases the read sequence as ASCII bytes ACGTN=.
+ * @param qualityHeader the quality header
+ * @param readQualities the base qualities as binary PHRED scores (not ASCII)
+ */
+ public FastqRecord(final String readName, final byte[] readBases, final String qualityHeader, final byte[] readQualities) {
+ this(readName, StringUtil.bytesToString(readBases), qualityHeader, SAMUtils.phredToFastq(readQualities));
+ }
+
+ /**
+ * Copy constructor from a read
+ *
+ * @param read the read to convert to a FastqRecord
+ */
+ public FastqRecord(final Read read) {
+ this(read.getReadName(), read.getReadString(), (read instanceof FastqRecord) ? ((FastqRecord) read).qualityHeader : null, read.getBaseQualityString());
@lbergelson

lbergelson Jun 15, 2016

Contributor

This instanceOf check is kind of unfortunate to need. I wonder if we can eliminate the qualityHeader field completely. The spec only allows it to take 2 values, nothing or the same value as the read name as far as I understand.

@magicDGS

magicDGS Jun 16, 2016

Contributor

It is true that it is the same value, but the problem is that the qualityHeader could contain some information that it is not in the readHeader. For instance, you can have a readHeader only with the read name and the qualityHeader with some other information like read group and other stuff.

I'm actually don't know if anyone is using the qualityHeader, because in my experience this one does not have any value...

@tfenne tfenne and 3 others commented on an outdated diff Jun 15, 2016

src/java/htsjdk/samtools/Read.java
+ */
+package htsjdk.samtools;
+
+/**
+ * Simple interface for reads: DNA sequences with associated base quality.
+ *
+ * @author Daniel Gómez-Sánchez (magicDGS)
+ */
+public interface Read {
+
+ /**
+ * Get the read name
+ *
+ * @return the read name
+ */
+ public String getReadName();
@tfenne

tfenne Jun 15, 2016

Owner

Needs more specification. E.g. can it contain spaces? In a fastq header like @foo 1:Y:N what should name be?

@lbergelson

lbergelson Jun 15, 2016

Contributor

@tfenne What FASTQ standard should htsjdk support? The samtools FASTQ format page doesn't allow spaces, but the Illumina one specifies them. I assume in practice everyone uses the Illumina style?

@jmarshall

jmarshall Jun 16, 2016

Owner

The samtools FASTQ format page

That's an ancient maq page, nothing to do with samtools. You should read wikipedia's FASTQ article and the paper linked therein, if you have not already.

@lbergelson

lbergelson Jun 16, 2016

Contributor

@jmarshall Ack, you're totally right. I'm not sure why I attributed to samtools. Sorry about that. I just saw that there seem to be several different different specs out in the wild and pulled some up from google. I rarely use Fastq directly so I'm not sure what the accepted standard is.

@magicDGS

magicDGS Jun 17, 2016

Contributor

I haven't change it yet, but I'm sure that we will need to follow some conventions. As far as I know, the read name could be whatever String. Nevertheless, the current implementation for FastqRecord could have records with null read names...

magicDGS changed the title from added read interface and refactoring FastqRecord to added RawRead interface and refactoring FastqRecord Jun 17, 2016

Contributor

magicDGS commented Jun 17, 2016 edited

@lbergelson, back to you with some questions:

  1. Which FASTQ convention will we follow? I guess that applying the chages in the API proposed by #557 (no zero-length reads), and non-null read names will be the best practice here.
  2. Should we get rid of the quality header? I think that for FASTQ records the quality header will be still needed if people want to add some information, although in my case I don't use it never...
  3. Copy constructor for FastqRecord and RawRead are different, but I don't know if it will be more or less expensive to check the class with instanceof.

magicDGS referenced this pull request in magicDGS/ReadTools Jun 17, 2016

Closed

Use RawRead interface when added to htsjdk #25

0 of 4 tasks complete

Coverage Status

Coverage increased (+0.001%) to 68.399% when pulling 4198545 on magicDGS:dgs_read_interface into bcc3f4a on samtools:master.

Coverage Status

Coverage increased (+0.007%) to 68.406% when pulling 4198545 on magicDGS:dgs_read_interface into bcc3f4a on samtools:master.

Contributor

magicDGS commented Jun 26, 2016

Now it pass the checks, @lbergelson. Just waiting for the decision of which FASTQ convention should we follow...

Contributor

droazen commented Jun 28, 2016

@magicDGS Discussing this PR, it seems to us that the union of SAMRecord and FastQRecord under a single interface is a bit awkward/forced, since you still need to bypass the interface in your codec to get at the quality header, and the read name means different things for the two formats. Perhaps simple overloads of encode() for both SAMRecord and FastQRecord would be preferable to a forced union at the type system level? What do you think? Would separate overloads cause problems for you?

Contributor

magicDGS commented Jul 1, 2016

I don't see the forcing in the single interface for both, because a read which comes in a FASTQ is feed to the aligner to obtain a SAM record. Thus, they are representing the same kind of record before and after processing them. In addition, some services are starting to provide raw reads in the BAM format, so it also make sense for me to generate the interface. Another benefit for the common interface is that other code in htsjdk and related projects could be simplify, like the QualityEncodingDetector.

For me it will be easy to have them together because I'm using FASTQ/BAM sources for raw reads, and develop two different utils classes/methods generates lots of repetitive code. In addition, the quality encoding in bytes for SAMRecords is not straightforward and I had troubles trying to mimic the behavious in my own code for FastqRecords.

Nevertheless, if the rest of the community thinks that the RawRead interface is not worth, I can change the FastqRecord code in this PR to just generate the byte arrays in the same way as SAMRecord and update the methods. One of the aims of this PR is make SolexaQualityConverter work with both classes in a later PR, and generate an IlluminaQualityConverter to simplify the management of missencoded SAM/BAM files and standardize FASTQ files to Sanger.

I will be happy either with the RawRead interface or with the update of FastqRecord, so I wait to your answer @droazen. Thank you again for the help and fruitful discussion in my PRs.

Contributor

magicDGS commented Jul 26, 2016

@lbergelson or @droazen, any feedback about this?

Contributor

nh13 commented Jul 26, 2016 edited

@magicDGS

  1. Read names for pairs must be the same in SAM, but in FASTQ there are different conventions (ex. trailing /1 and /2 for end one and two respectively). See this method too.
  2. Quality scores in SAM have one format, in FASTQ they don't (see this). If folks don't have the correct encoding, then the should look to the tools that originally encoded them! I recommend Picard's FastqToSam.
  3. Not using FASTQs to store raw reads but using SAM has been done for a long time now. For example, see Picard's IlluminaBasecallsToSam which converts an Illumina run folder (BCLs) to SAM directly. The only time we want to create FASTQs is temporarily to feed into the aligner. The latter is usually streaming anyways (Picard's SamToFastq into bwa into Picard's MergeBamAlignment).

I give this a 👎 for two reasons:

  1. The use of FASTQs should be minimized as much as possible as a best practice. Get the data into SAM as soon as possible, which in many cases can be done without creating FASTQs, and then work on it from there.
  2. Supporting non-standard base quality encodings in SAM is a bad idea, as then it is not a SAM file according to the spec. Look to the tool or users that created the non-standard encoded SAM file. I'd be curious about this line of inquiry.
Contributor

magicDGS commented Jul 26, 2016

@nh13, first of all thank you for your comments. I really understand your point because I'm very annoyed by the lack of FASTQ standard practices. Most of the aligners feed the FASTQ files directly without any checking of the qualities, and this result a lot of times into misencoded BAM files. I agree that FASTQ files should be used (I'm not using them) and that supporting other encodings in SAM/BAM is a very bad idea, and I'm not suggesting here that (actually, my PR does not contain anything about supported qualities).

Answering your first list:

  1. I know that the conventions for the read names are different in the specs, but actually there is no restriction in them for any of the implementations (there is not checking as far as I know). The interface that I implemented here is just a convenience interface, but implementations could impose their own rules regarding the restriction of the read name.
  2. I really like this point because I'm very disappointed with the decisions regarding FASTQ quality encodings, and that's one of the reasons that I would like this interface into htsjdk. My plan is not support SAM/BAM files in other encodings, but have common methods for SAMRecord/FastqRecord to convert to the standard format both kind of files in the same way, without dealing with differences in the implementation.
  3. I don't like to store reads in FASTQ format, but it doesn't depends on me at the moment. The problem that I'm facing now is when the reads comes from very different sources and I have to deal with both formats at the same time.

I agree with you in every point, but it's not up to me how the reads comes from the service or how my institution keeps the reads, and at this point both are providing raw reads in tons of different formats.

In addition, I think that this interface is a good idea to start normalizing things in genomic data, because from my point of view is very weird that data before processing and data after processing could not be represented by a common interface. According to a programming structure, it is exactly the same data but after include more information: the FastqRecord represents a sequence associated with quality scores and nothing else; the SAMRecod represents the same sequence with quality scores, but aligned to a reference genome.

Nevertheless, I can change this PR to only implement the changes in the FastqRecord API, becuase I think that at least the method to retrieve qualities and bases that mimic the methods in SAMRecord could be useful for other API users who need to deal with FASTQ files.

Contributor

nh13 commented Jul 26, 2016

@magicDGS

  • I am a bit confused by "My plan is not support SAM/BAM files in other encodings, but have common methods for SAMRecord/FastqRecord to convert to the standard format" since that implies that SAMRecord is in the wrong format or is itself not a standard format. Again, I would urge you to understand why the SAM record has the wrong encoding, and correct the upstream issue (when was the first time you or someone else converted it to SAM?).
  • If you think that there is a high-value use case to use both FASTQ and SAM at the same time for which this PR would make it easier, please present it. Keep in mind that if you could convert FASTQ to SAM in my mind it doesn't qualify.
  • I am suggesting that you do not deal with FASTQ and SAM in parallel, but namely convert FASTQ to SAM as soon as possible; treat FASTQ as a temporary format and SAM as the standard format.
Contributor

nh13 commented Jul 26, 2016

@droazen @lbergelson @yfarjoun I would welcome others to contribute to the discussion here.

Contributor

magicDGS commented Jul 26, 2016 edited

  • The problem is that I'm not mapping all the files, and the problem is feeding into bwa FASTQ files in Illumina encoding. This produce BAM files wrongly encoded. In addition, some service provide us raw reads in BAM format with Illumina encoding. So it is not something that I can fix upstream my pipeline.
  • The use case is in my repository, where I had a tool for de-multiplex BAM/SAM and FASTQ files because of the issue that we have in our institution. Because I have to deal with barcodes in the read name (FASTQ/BAM/SAM) and in the BC tag (BAM/SAM), I'm not able to use SamToFastq. In addition, I implemented a quality trimmer for Pool-Seq data that should work with both kind of inputs because this issue that we have (and I'm sure that other labs will also face this regarding how different services provide the reads from my own experience). Although I prefer to establish a proper pipeline for store/manage the reads in BAM format, it is not up to me to do it, and I have to stick to the way that nowadays is done. Thus, I can't support in my software only one kind of source (BAM), and methods that works with the common interface will be perfect for handle this kind of issues.
  • I also prefer not to deal with them in parallel, as I've already mentioned. I'm trying to convince the people in charge to convert all the reads that we receive into a suitable BAM file with barcodes in the read names and in sanger encoding, and provide to the final user of the reads this files. But even like that, I should provide them this software with methods to deal with the two file sources and store them properly.

Thanks again for the fruitful discussion, @nh13. Even if this PR is not accepted, I would appreciate if I can include some of the methods and abstraction in FastqRecordto make easier the conversion between the two sources of reads and use quality fixers and trimming methods that are common to both of them.

Contributor

droazen commented Jul 26, 2016

@magicDGS I continue to think that in order to properly unify under a common interface, there needs to be not only shared data between the two types of records, but shared contracts for that data -- and I think the shared contracts are somewhat lacking in this case. As @nh13 pointed out, there are important differences between the two formats in the name and qual fields. And you're already finding yourself having to bypass the new interface with code like:

(rawRead instanceof FastqRecord) ? ((FastqRecord) rawRead).getBaseQualityHeader()

which is a warning sign that unification at the type system level might not be the right approach here.

I think the non-interface aspect of this PR (easier conversion) is less controversial and could probably be merged with minor changes.

It would be good to get some additional opinions before we make a decision on this PR, though -- @lbergelson @yfarjoun @tfenne care to chime in?

Contributor

nh13 commented Jul 26, 2016

@magicDGS thanks for the insight into the problems you are trying to solve and what is in your control.

My inclination would be to standardize all the data you receive into valid SAM format, then move onto analysis (mapping, QC, etc.). This would make your analysis pipeline simpler, quarantine the parts of your pipeline that need to deal with non-standard data formats, and reduce the burden on htsjdk to support non-standard formats (wrong quality format) and not-so-best practices (storing things in the read names). So perhaps try to reduce the PR to as little as possible that would enable your use case of reformatting the input data, but keep no methods that would allow creation of non-standard SAM? More succinctly, perhaps @droazen gives the right solution.

Contributor

magicDGS commented Jul 27, 2016

Thanks for your feedback, @nh13 and @droazen. I think that your suggestions are great, and I'm very grateful for them. Should I remove the interface and keep the utility methods in FastqRecord to get this in or wait until @lbergelson, @yfarjoun and @tfenne give some feedback?

Contributor

yfarjoun commented Jul 27, 2016

I don't think you need to wait for every-one to chime in. I agree that if
the interface is clunky then the right thing to do is not have it....

On Wed, Jul 27, 2016 at 4:44 AM, Daniel Gómez-Sánchez <
notifications@github.com> wrote:

Thanks for your feedback, @nh13 https://github.com/nh13 and @droazen
https://github.com/droazen. I think that your suggestions are great,
and I'm very grateful for them. Should I remove the interface and keep the
utility methods in FastqRecord to get this in or wait until @lbergelson
https://github.com/lbergelson, @yfarjoun https://github.com/yfarjoun
and @tfenne https://github.com/tfenne give some feedback?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#572 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/ACnk0vTnsgaCMxk8Cr5kBL_vzwintSWNks5qZxpbgaJpZM4IIcf-
.

Coverage Status

Coverage increased (+0.03%) to 68.432% when pulling 28bc986 on magicDGS:dgs_read_interface into bcc3f4a on samtools:master.

Coverage Status

Coverage increased (+0.05%) to 68.445% when pulling 5f3a90b on magicDGS:dgs_read_interface into bcc3f4a on samtools:master.

Coverage Status

Coverage increased (+0.05%) to 68.448% when pulling 553ce72 on magicDGS:dgs_read_interface into bcc3f4a on samtools:master.

Contributor

magicDGS commented Jul 27, 2016

Thanks @yfarjoun. Now this PR is ready for review, @lbergelson (you are the one assigned, aren't you?). The changes are the following:

  • Renamed methods for mimic the SAMRecord ones. Deprecation of previous methods.
  • New methods for FastqRecord to retrieve sequence/quality as byte arrays as in SAMRecord
  • Changed FastqRecord.toString() to do not print out a formatted String as in FASTQ files, but a summary of the record.
  • Implemented FastqCodec to encapsulate the conversion to a String as in FASTQ files, and a helper method for convert a SAMRecord to a FastqRecord.
  • Added /1 and /2 to FastqConstants

Other stuff that could be implemented in this PR is the changes in the contract suggested in #557 for empty reads.

Contributor

SHuang-Broad commented Aug 8, 2016

While you are making this change, may i ask you for a favor and add a functionality to convert a FastqRecord to SAMRecord?
Code exists here that does similar things, in particular, this function:

private SAMRecord createSamRecord(final SAMFileHeader header, 
                                  final String baseName, 
                                  final FastqRecord frec, 
                                  final boolean paired);

Thanks!

Contributor

magicDGS commented Aug 9, 2016

Sure @SHuang-Broad, it was also in my plans to include something like that in the FastqCodec because I will need it too. Thanks for the suggestion, because I completely forgot.

magicDGS changed the title from added RawRead interface and refactoring FastqRecord to Refactoring FastqRecord Aug 9, 2016

Contributor

magicDGS commented Aug 9, 2016

Added the method, @SHuang-Broad, but a very simple one compared to the method that you mentioned.

Contributor

magicDGS commented Sep 1, 2016

I squashed and rebased the branch. Could you have a look @lbergelson? I would like to have this in htsjdk, will be very useful for my work! Thanks in advance!

Coverage Status

Coverage increased (+0.01%) to 68.932% when pulling 9dfe7da on magicDGS:dgs_read_interface into c3d5a88 on samtools:master.

Coverage Status

Coverage increased (+0.01%) to 68.932% when pulling 9dfe7da on magicDGS:dgs_read_interface into c3d5a88 on samtools:master.

Contributor

magicDGS commented Sep 13, 2016

Can you review this, @lbergelson? Thanks a lot in advance!

Contributor

magicDGS commented Feb 23, 2017

This is here for a long time now. Should I close, @lbergelson? I think that tons of methods here are very useful, apart of making more consistent the API...

Contributor

lbergelson commented Feb 23, 2017

Ack, yes, this has been here for a very long time. Don't close. I will take a look.

@lbergelson

@magicDGS A few minor comments and then good to go. Sorry this one sat forever. I assigned it to the "things to do later" mental category and the longer it sat the more easily I ignored it.

+import htsjdk.samtools.util.SequenceUtil;
+
+/**
+ * Codec for encode records into FASTQ format.
@lbergelson

lbergelson Feb 23, 2017

Contributor

reads better as "codec for encoding"

@magicDGS

magicDGS Feb 24, 2017

Contributor

Done.

+ *
+ * @author Daniel Gomez-Sanchez (magicDGS)
+ */
+public class FastqCodec {
@lbergelson

lbergelson Feb 23, 2017

Contributor

Is this really a codec? It doesn't have a decode, just an encode. Maybe it should be called FastqEncoder? Or maybe it should just be FastqUtils since it also does the conversion with SamRecords?

@magicDGS

magicDGS Feb 24, 2017

Contributor

I can't remember why I named it like that, but I guess that it is because it is also decoding SAMRecord into FastqRecord. Renamed as FastqEncoder, because in FastqUtils I would expect to have the constants and methods like getSamReadNameFromFastqHeader.

+ *
+ * @author Daniel Gomez-Sanchez (magicDGS)
+ */
+public class FastqCodec {
@lbergelson

lbergelson Feb 23, 2017

Contributor

make class final

@magicDGS

magicDGS Feb 24, 2017

Contributor

Done.

+ * @author Daniel Gomez-Sanchez (magicDGS)
+ */
+public class FastqCodec {
+
@lbergelson

lbergelson Feb 23, 2017

Contributor

add a private noarg constructor to prevent people from instantiating instances of this class since it's really a utility class.

@magicDGS

magicDGS Feb 24, 2017

Contributor

Done, and adding comment to the constructor to point out why.

+ /**
+ * Encodes a FastqRecord in the String FASTQ format.
+ */
+ public static String encode(final FastqRecord record) {
@lbergelson

lbergelson Feb 23, 2017

Contributor

Should there be an encode override that takes a sam record as well?

@magicDGS

magicDGS Feb 24, 2017

Contributor

That's a very nice idea, although it is just a sugar syntax. Done with the following implementation: encode(asFastqRecord(record))

+
+ /**
+ * @return the read name
+ * @deprecated use {@link #getReadName()} instead
@lbergelson

lbergelson Feb 23, 2017

Contributor

Could you add a note to each deprecated function saying when they were deprecated? month and year is enough. (assume we'll merge this this month....)

@magicDGS

magicDGS Feb 24, 2017

Contributor

Done.

return true;
}
-
+
+ /** Simple toString() that gives a read name and length */
@Override
public String toString() {
@lbergelson

lbergelson Feb 23, 2017

Contributor

So I'm a bit afraid that there is a code out there in the while that relies on this giving the old output.

Could you also add a method toFastQString() that also calls through to encode so that's there's an obvious way to do what this used to do? I think toString should probably call through to that and produce the same output as it used to just to avoid giving people surprise headaches.

@magicDGS

magicDGS Feb 24, 2017

Contributor

It is discourage to use toString() to other purposes that is not developmental, but it is true that some code may rely on this. I changed it and point to the toFastQString, but I would rather prefer to add a comment saying that this is discouraged and that this will change at some point (if we set a date, will be nice). What do you think, @lbergelson?

Contributor

magicDGS commented Feb 24, 2017

I addressed all your comments, @lbergelson. I would like to have a more developmental-like toString in FastqRecord because using it for other purposes should be discouraged, but I understand that some code may rely on it. I propose to add a comment to the javadoc saying that it is going to change and be substituted by toFastQString at some point (setting a date/release will be great). What do you think? And the rest of people involved in the conversation?

codecov-io commented Feb 24, 2017 edited

Codecov Report

Merging #572 into master will increase coverage by 0.021%.
The diff coverage is 82.051%.

@@               Coverage Diff               @@
##              master      #572       +/-   ##
===============================================
+ Coverage     64.843%   64.864%   +0.021%     
- Complexity      7160      7174       +14     
===============================================
  Files            525       526        +1     
  Lines          31701     31731       +30     
  Branches        5420      5424        +4     
===============================================
+ Hits           20556     20582       +26     
- Misses          8996      9000        +4     
  Partials        2149      2149
Impacted Files Coverage Δ Complexity Δ
...ain/java/htsjdk/samtools/fastq/FastqConstants.java 0% <ø> (ø) 0 <0> (ø)
...c/main/java/htsjdk/samtools/util/SequenceUtil.java 69.821% <100%> (ø) 115 <0> (ø)
...n/java/htsjdk/samtools/fastq/BasicFastqWriter.java 50% <100%> (-3.846%) 5 <1> (ø)
...c/main/java/htsjdk/samtools/fastq/FastqRecord.java 85.507% <80.435%> (-1.993%) 33 <15> (+2)
.../main/java/htsjdk/samtools/fastq/FastqEncoder.java 82.759% <82.759%> (ø) 12 <12> (?)
...dk/samtools/seekablestream/SeekableHTTPStream.java 54.545% <0%> (-1.515%) 9% <0%> (-1%)
...samtools/util/AsyncBlockCompressedInputStream.java 76% <0%> (+4%) 13% <0%> (+1%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 912c28b...a990401. Read the comment docs.

Contributor

lindenb commented Feb 24, 2017 edited

A minor suggestion: for FastqCodec, instead of creating a StringBuilder for each record, wouldn't it be better to create a method using Appendable , then we can use this FastqCodec in a FastqWriter without creating a new String in memory ?. Also, for large reads, we can reserve some memory for StringBuilder as we have an idea of the final length ?

Something like:

    public static String encode(final FastqRecord record) {
        return write(new StringBuilder(record.getReadLength()*2+record.getReadName().length()+5),record).toString();
    }


    public static java.lang.Appendable write(final java.lang.Appendable out,final FastqRecord record) {
        final String readName = record.getReadName();
        final String readString = record.getReadString();
        final String qualHeader = record.getBaseQualityHeader();
        final String qualityString = record.getBaseQualityString();
        return out.append(FastqConstants.SEQUENCE_HEADER).append(readName == null ? "" : readName).append('\n')
                .append(readString == null ? "" : readString).append('\n')
                .append(FastqConstants.QUALITY_HEADER).append(qualHeader == null ? "" : qualHeader).append('\n')
                .append(qualityString == null ? "" : qualityString);
    }

P.

Contributor

magicDGS commented Feb 24, 2017

@lindenb, sounds like a good improvement. I'd push a commit with the change and use it in FastqWriter.

Contributor

magicDGS commented Feb 24, 2017

Commit d5cc096 includes @lindenb suggestion, which is quite useful.

+ * Writes a FastqRecord into the Appendable output.
+ * @throws SAMException if any I/O error occurs.
+ */
+ public static Appendable write(final Appendable out,final FastqRecord record) {
@lbergelson

lbergelson Feb 24, 2017

Contributor

👍 Good addition.

+ */
+public final class FastqEncoder {
+
+ // cannot be instantiated because it is an utility class
@lbergelson

lbergelson Feb 24, 2017

Contributor

👍

+ */
+ public static String encode(final FastqRecord record) {
+ // reserve some memory based on the read length and read name
+ final int capacity = record.getReadLength() * 2 + record.getReadName().length() + 5;
@lbergelson

lbergelson Feb 24, 2017

Contributor

@magicDGS The tests are failing with a null pointer exception. It looks like it's caused by this unguarded call of record.getReadName().length()

@magicDGS

magicDGS Feb 27, 2017

Contributor

I'm thinking if I should change the contract of the getters to do not return null? What do you think? I guess that it's safer for all the cases, including the new getters.

@@ -23,62 +23,169 @@
*/
package htsjdk.samtools.fastq;
+import htsjdk.samtools.SAMRecord;
@lbergelson

lbergelson Feb 24, 2017

Contributor

This import should probably go away.

@magicDGS

magicDGS Feb 27, 2017

Contributor

Done.

Contributor

lbergelson commented Feb 24, 2017

@magicDGS There's a test failure due to a NullPointerException. Could you fix that and rebase this on master? I think it's good to go once that's done. Thanks for the suggestion @lindenb .

Contributor

magicDGS commented Feb 27, 2017

Three new commits in the branch, @lbergelson:

  1. Removed the unused import (although this is added in 3)
  2. Fix the NPE for null read names
  3. New getters for byte[] bases/quals does not return null anymore, but SAMRecord.NULL_SEQUENCE and SAMRecord.NULL_QUALS

I separated the 3rd commit because it's the only one that should be reviewed and/or removed if not accepted the solution. Because they are new getters, we are not changing the API. But maybe it will be important to change it for the other ones to avoid NPE for clients.

magicDGS added some commits Apr 15, 2016

@magicDGS magicDGS FastqRecord refactoring and FastqCodec for encode/decode 28bc84d
@magicDGS magicDGS Addressing comments 0f19663
@magicDGS magicDGS Add @lindenb suggestion 6fd2459
@magicDGS magicDGS Removed unused import d03b390
@magicDGS magicDGS Fix NPE 64e55a3
@magicDGS magicDGS Non-null return for new getters
a990401
Contributor

lbergelson commented Feb 27, 2017

@magicDGS I hate the original null returning api. I don't know why the original decision was to return null instead of empty string, that always seems wrong to me. It's very explicitly choosing to return null though, so I imagine breaking it would probably cause problems. 👍 To returning the samrecord empty values for the new ones.

@lbergelson lbergelson merged commit c1227d8 into samtools:master Feb 27, 2017

4 checks passed

codecov/changes Unable to determine changes, no report found at pull request base.
Details
codecov/patch 82.051% of diff hit (target 64.843%)
Details
codecov/project 64.864% (+0.021%) compared to b964598
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Contributor

magicDGS commented Feb 27, 2017

Thanks @lbergelson!

magicDGS referenced this pull request in magicDGS/ReadTools Feb 27, 2017

Closed

Simplify FASTQ classes by using new HTSJDK functionality #138

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment