-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactoring FastqRecord #572
Conversation
I would like to re-implement also |
Third commit change the API: toString() does not encode a FASTQ-file format and |
src/java/htsjdk/samtools/Read.java
Outdated
* | ||
* @author Daniel Gómez-Sánchez (magicDGS) | ||
*/ | ||
public interface Read { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling this Read
will be confusing for a lot of people who think of a SamRecord
as a Read. Could you rename this so it's clear that it's more limited than SamRecord? RawRead
was a suggestion from an in person conversation that a lot of people agreed with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ReadSequence
may be a better name. Or some other name. Read
is too broad
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about RawRead?
On Tue, Jun 14, 2016 at 2:26 PM, Adam Kiezun notifications@github.com
wrote:
In src/java/htsjdk/samtools/Read.java
#572 (comment):
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
- * THE SOFTWARE.
- /
+package htsjdk.samtools;
+
+/*- * Simple interface for reads: DNA sequences with associated base quality.
- * @author Daniel Gómez-Sánchez (magicDGS)
- */
+public interface Read {I think ReadSequence may be a better name. Or some other name. Read is
too broad—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/samtools/htsjdk/pull/572/files/15a832457f0e5ac062053ea04d8c5532c0efe6c5#r67028043,
or mute the thread
https://github.com/notifications/unsubscribe/ACnk0uttdRMtS17ZmRv-AHpWhEK0ml2Eks5qLvJCgaJpZM4IIcf-
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will change it for the one that you prefer...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the name should actually depend on the (as yet unspecified) behaviour of some of the interface methods! RawRead
would imply to me that I'm getting what came off the sequencer without any trimming, RC'ing or BSQR or other changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed as RawRead
(also in the PR)
I think that this is also a good opportunity for change the behavior of empty reads as suggested in #557. |
* Default constructor | ||
* | ||
* @param readName the read name | ||
* @param readStrig the sequence string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in readStrig. Also, could you make clear that these are the bases? I feel like sequence can be interpreted in a number of ways
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed for more descriptive parameter + improving javadoc
@lbergelson, back to you with some questions:
|
c15fe32
to
c8a8720
Compare
Now it pass the checks, @lbergelson. Just waiting for the decision of which FASTQ convention should we follow... |
@magicDGS Discussing this PR, it seems to us that the union of |
I don't see the forcing in the single interface for both, because a read which comes in a FASTQ is feed to the aligner to obtain a SAM record. Thus, they are representing the same kind of record before and after processing them. In addition, some services are starting to provide raw reads in the BAM format, so it also make sense for me to generate the interface. Another benefit for the common interface is that other code in htsjdk and related projects could be simplify, like the For me it will be easy to have them together because I'm using FASTQ/BAM sources for raw reads, and develop two different utils classes/methods generates lots of repetitive code. In addition, the quality encoding in bytes for Nevertheless, if the rest of the community thinks that the I will be happy either with the |
@lbergelson or @droazen, any feedback about this? |
I give this a 👎 for two reasons:
|
@nh13, first of all thank you for your comments. I really understand your point because I'm very annoyed by the lack of FASTQ standard practices. Most of the aligners feed the FASTQ files directly without any checking of the qualities, and this result a lot of times into misencoded BAM files. I agree that FASTQ files should be used (I'm not using them) and that supporting other encodings in SAM/BAM is a very bad idea, and I'm not suggesting here that (actually, my PR does not contain anything about supported qualities). Answering your first list:
I agree with you in every point, but it's not up to me how the reads comes from the service or how my institution keeps the reads, and at this point both are providing raw reads in tons of different formats. In addition, I think that this interface is a good idea to start normalizing things in genomic data, because from my point of view is very weird that data before processing and data after processing could not be represented by a common interface. According to a programming structure, it is exactly the same data but after include more information: the Nevertheless, I can change this PR to only implement the changes in the |
|
Can you review this, @lbergelson? Thanks a lot in advance! |
This is here for a long time now. Should I close, @lbergelson? I think that tons of methods here are very useful, apart of making more consistent the API... |
Ack, yes, this has been here for a very long time. Don't close. I will take a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@magicDGS A few minor comments and then good to go. Sorry this one sat forever. I assigned it to the "things to do later" mental category and the longer it sat the more easily I ignored it.
import htsjdk.samtools.util.SequenceUtil; | ||
|
||
/** | ||
* Codec for encode records into FASTQ format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reads better as "codec for encoding"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
/** | ||
* @return the read name | ||
* @deprecated use {@link #getReadName()} instead |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a note to each deprecated function saying when they were deprecated? month and year is enough. (assume we'll merge this this month....)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
return true; | ||
} | ||
|
||
|
||
/** Simple toString() that gives a read name and length */ | ||
@Override | ||
public String toString() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I'm a bit afraid that there is a code out there in the while that relies on this giving the old output.
Could you also add a method toFastQString()
that also calls through to encode so that's there's an obvious way to do what this used to do? I think toString
should probably call through to that and produce the same output as it used to just to avoid giving people surprise headaches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is discourage to use toString()
to other purposes that is not developmental, but it is true that some code may rely on this. I changed it and point to the toFastQString
, but I would rather prefer to add a comment saying that this is discouraged and that this will change at some point (if we set a date, will be nice). What do you think, @lbergelson?
* | ||
* @author Daniel Gomez-Sanchez (magicDGS) | ||
*/ | ||
public class FastqCodec { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really a codec? It doesn't have a decode, just an encode. Maybe it should be called FastqEncoder
? Or maybe it should just be FastqUtils since it also does the conversion with SamRecords?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't remember why I named it like that, but I guess that it is because it is also decoding SAMRecord
into FastqRecord
. Renamed as FastqEncoder
, because in FastqUtils
I would expect to have the constants and methods like getSamReadNameFromFastqHeader
.
* | ||
* @author Daniel Gomez-Sanchez (magicDGS) | ||
*/ | ||
public class FastqCodec { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make class final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* @author Daniel Gomez-Sanchez (magicDGS) | ||
*/ | ||
public class FastqCodec { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a private noarg constructor to prevent people from instantiating instances of this class since it's really a utility class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, and adding comment to the constructor to point out why.
/** | ||
* Encodes a FastqRecord in the String FASTQ format. | ||
*/ | ||
public static String encode(final FastqRecord record) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be an encode override that takes a sam record as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a very nice idea, although it is just a sugar syntax. Done with the following implementation: encode(asFastqRecord(record))
I addressed all your comments, @lbergelson. I would like to have a more developmental-like |
Codecov Report
@@ Coverage Diff @@
## master #572 +/- ##
===============================================
+ Coverage 64.843% 64.864% +0.021%
- Complexity 7160 7174 +14
===============================================
Files 525 526 +1
Lines 31701 31731 +30
Branches 5420 5424 +4
===============================================
+ Hits 20556 20582 +26
- Misses 8996 9000 +4
Partials 2149 2149
Continue to review full report at Codecov.
|
A minor suggestion: for FastqCodec, instead of creating a StringBuilder for each record, wouldn't it be better to create a method using Something like:
P. |
@lindenb, sounds like a good improvement. I'd push a commit with the change and use it in |
* Writes a FastqRecord into the Appendable output. | ||
* @throws SAMException if any I/O error occurs. | ||
*/ | ||
public static Appendable write(final Appendable out,final FastqRecord record) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Good addition.
*/ | ||
public final class FastqEncoder { | ||
|
||
// cannot be instantiated because it is an utility class |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
*/ | ||
public static String encode(final FastqRecord record) { | ||
// reserve some memory based on the read length and read name | ||
final int capacity = record.getReadLength() * 2 + record.getReadName().length() + 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@magicDGS The tests are failing with a null pointer exception. It looks like it's caused by this unguarded call of record.getReadName().length()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking if I should change the contract of the getters to do not return null? What do you think? I guess that it's safer for all the cases, including the new getters.
@@ -23,62 +23,169 @@ | |||
*/ | |||
package htsjdk.samtools.fastq; | |||
|
|||
import htsjdk.samtools.SAMRecord; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This import should probably go away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Three new commits in the branch, @lbergelson:
I separated the 3rd commit because it's the only one that should be reviewed and/or removed if not accepted the solution. Because they are new getters, we are not changing the API. But maybe it will be important to change it for the other ones to avoid NPE for clients. |
41a175d
to
a990401
Compare
@magicDGS I hate the original null returning api. I don't know why the original decision was to return null instead of empty string, that always seems wrong to me. It's very explicitly choosing to return null though, so I imagine breaking it would probably cause problems. 👍 To returning the samrecord empty values for the new ones. |
Thanks @lbergelson! |
Motivation
SAMRecords
store bases and qualities inbyte[]
, but usingStringUtil
andSAMUtils
for it. For an user that would like to retrieve in the same way the sequence/qualities from aFastqRecord
, which basically store the same read, it is needed to go deeper in the API to understand how this is transformed (personal experience).Having a
Read
RawRead
interface with defined contract for these methods could help to overcome that issue. Like that, methods likeSolexaQualityConverter
orTrimmingUtils
could be used forFastqRecord
, andQualityEncodingDetector
could be clean up to include any kind of reads.Another usage is when transforming a
SAMRecord
to aFastqRecord
, which could be done easier if the original base qualities are not requested. It will be useful, for instance, in Picard toolsSamToFastq
.Description (only first commit)
Read
RawRead
interfaceSAMRecord
andFastqRecord
extendsRead
(does not change the API)FastqRecord
forbyte[]
and from otherRead
FastqRecord
interface changing private variable names according to theRead
interfaceFastqRecord
Read
RawRead
APIChecklist