Refactoring FastqRecord #572
Conversation
|
I would like to re-implement also |
|
Third commit change the API: toString() does not encode a FASTQ-file format and |
yfarjoun
added Review-party candidate and removed Review-party candidate
labels
Jun 11, 2016
lbergelson
was assigned
by droazen
Jun 14, 2016
lbergelson
and 4 others
commented on an outdated diff
Jun 14, 2016
| + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
| + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
| + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
| + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
| + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
| + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN | ||
| + * THE SOFTWARE. | ||
| + */ | ||
| +package htsjdk.samtools; | ||
| + | ||
| +/** | ||
| + * Simple interface for reads: DNA sequences with associated base quality. | ||
| + * | ||
| + * @author Daniel Gómez-Sánchez (magicDGS) | ||
| + */ | ||
| +public interface Read { |
lbergelson
Contributor
|
|
I think that this is also a good opportunity to change the behavior of empty reads as suggested in #557. |
lbergelson
and 1 other
commented on an outdated diff
Jun 15, 2016
| - if (seqHeaderPrefix != null && !seqHeaderPrefix.isEmpty()) this.seqHeaderPrefix = seqHeaderPrefix; | ||
| - else this.seqHeaderPrefix = null; | ||
| - if (qualHeaderPrefix != null && !qualHeaderPrefix.isEmpty()) this.qualHeaderPrefix = qualHeaderPrefix; | ||
| - else this.qualHeaderPrefix = null; | ||
| - this.seqLine = seqLine ; | ||
| - this.qualLine = qualLine ; | ||
| + private final String readName; | ||
| + private final String readString; | ||
| + private final String qualityHeader; | ||
| + private final String qualityString; | ||
| + | ||
| + /** | ||
| + * Default constructor | ||
| + * | ||
| + * @param readName the read name | ||
| + * @param readStrig the sequence string |
lbergelson
Contributor
|
lbergelson
and 1 other
commented on an outdated diff
Jun 15, 2016
| - this.qualLine = qualLine ; | ||
| + private final String readName; | ||
| + private final String readString; | ||
| + private final String qualityHeader; | ||
| + private final String qualityString; | ||
| + | ||
| + /** | ||
| + * Default constructor | ||
| + * | ||
| + * @param readName the read name | ||
| + * @param readStrig the sequence string | ||
| + * @param qualityHeader the quality header | ||
| + * @param qualityString the quality string | ||
| + */ | ||
| + public FastqRecord(final String readName, final String readStrig, final String qualityHeader, final String qualityString) { | ||
| + if (readName != null && !readName.isEmpty()) this.readName = readName; |
lbergelson
Contributor
|
lbergelson
and 1 other
commented on an outdated diff
Jun 15, 2016
| - /** shortcut to getReadString().length() */ | ||
| - public int length() { return this.seqLine==null?0:this.seqLine.length();} | ||
| - | ||
| + @Override | ||
| + public String getBaseQualityString() { | ||
| + return qualityString; | ||
| + } | ||
| + | ||
| + @Override | ||
| + public byte[] getBaseQualities() { | ||
| + return SAMUtils.fastqToPhred(qualityString); | ||
| + } | ||
| + | ||
| + @Override | ||
| + public int getReadLength() { | ||
| + return readString.length(); |
|
|
lbergelson
and 1 other
commented on an outdated diff
Jun 15, 2016
| + | ||
| + /** | ||
| + * Get the base quality header | ||
| + * | ||
| + * @return the base quality header | ||
| + */ | ||
| + public String getBaseQualityHeader() { | ||
| + return qualityHeader; | ||
| + } | ||
| + | ||
| + /** | ||
| + * shortcut to getReadString().length() | ||
| + * | ||
| + * @deprecated use {@link #getReadLength()} instead | ||
| + */ | ||
| + @Deprecated |
|
|
lbergelson
and 1 other
commented on an outdated diff
Jun 15, 2016
| - /** copy constructor */ | ||
| + | ||
| + /** | ||
| + * Constructor for byte[] arrays | ||
| + * | ||
| + * @param readName the read name | ||
| + * @param readBases the read sequence as ASCII bytes ACGTN=. | ||
| + * @param qualityHeader the quality header | ||
| + * @param readQualities the base qualities as binary PHRED scores (not ASCII) | ||
| + */ | ||
| + public FastqRecord(final String readName, final byte[] readBases, final String qualityHeader, final byte[] readQualities) { | ||
| + this(readName, StringUtil.bytesToString(readBases), qualityHeader, SAMUtils.phredToFastq(readQualities)); | ||
| + } | ||
| + | ||
| + /** | ||
| + * Copy constructor from a read |
|
|
lbergelson
and 1 other
commented on an outdated diff
Jun 15, 2016
| + } | ||
| + | ||
| + /** | ||
| + * Copy constructor from a read | ||
| + * | ||
| + * @param read the read to convert to a FastqRecord | ||
| + */ | ||
| + public FastqRecord(final Read read) { | ||
| + this(read.getReadName(), read.getReadString(), (read instanceof FastqRecord) ? ((FastqRecord) read).qualityHeader : null, read.getBaseQualityString()); | ||
| + } | ||
| + | ||
| + /** | ||
| + * Copy constructor | ||
| + * | ||
| + * @param other record to copy | ||
| + */ | ||
| public FastqRecord(final FastqRecord other) { |
lbergelson
Contributor
|
lbergelson
and 1 other
commented on an outdated diff
Jun 15, 2016
| + * @param readName the read name | ||
| + * @param readBases the read sequence as ASCII bytes ACGTN=. | ||
| + * @param qualityHeader the quality header | ||
| + * @param readQualities the base qualities as binary PHRED scores (not ASCII) | ||
| + */ | ||
| + public FastqRecord(final String readName, final byte[] readBases, final String qualityHeader, final byte[] readQualities) { | ||
| + this(readName, StringUtil.bytesToString(readBases), qualityHeader, SAMUtils.phredToFastq(readQualities)); | ||
| + } | ||
| + | ||
| + /** | ||
| + * Copy constructor from a read | ||
| + * | ||
| + * @param read the read to convert to a FastqRecord | ||
| + */ | ||
| + public FastqRecord(final Read read) { | ||
| + this(read.getReadName(), read.getReadString(), (read instanceof FastqRecord) ? ((FastqRecord) read).qualityHeader : null, read.getBaseQualityString()); |
lbergelson
Contributor
|
tfenne
and 3 others
commented on an outdated diff
Jun 15, 2016
| + */ | ||
| +package htsjdk.samtools; | ||
| + | ||
| +/** | ||
| + * Simple interface for reads: DNA sequences with associated base quality. | ||
| + * | ||
| + * @author Daniel Gómez-Sánchez (magicDGS) | ||
| + */ | ||
| +public interface Read { | ||
| + | ||
| + /** | ||
| + * Get the read name | ||
| + * | ||
| + * @return the read name | ||
| + */ | ||
| + public String getReadName(); |
tfenne
Owner
|
magicDGS
changed the title from
added read interface and refactoring FastqRecord to added RawRead interface and refactoring FastqRecord
Jun 17, 2016
|
@lbergelson, back to you with some questions:
|
magicDGS
referenced
this pull request
in magicDGS/ReadTools
Jun 17, 2016
Closed
Use RawRead interface when added to htsjdk #25
coveralls
commented
Jun 26, 2016
coveralls
commented
Jun 26, 2016
|
Now it passes the checks, @lbergelson. Just waiting for the decision on which FASTQ convention we should follow... |
|
@magicDGS Discussing this PR, it seems to us that the union of |
|
I don't see a single interface for both as forced, because a read that comes in a FASTQ is fed to the aligner to obtain a SAM record. Thus, they represent the same kind of record before and after processing. In addition, some services are starting to provide raw reads in the BAM format, so it also makes sense to me to create the interface. Another benefit of the common interface is that other code in htsjdk and related projects could be simplified, like the For me it will be easier to have them together because I'm using FASTQ/BAM sources for raw reads, and developing two different utils classes/methods generates lots of repetitive code. In addition, the quality encoding in bytes for Nevertheless, if the rest of the community thinks that the I will be happy either with the |
|
@lbergelson or @droazen, any feedback about this? |
I give this a
|
|
@nh13, first of all thank you for your comments. I really understand your point because I'm very annoyed by the lack of FASTQ standard practices. Most aligners are fed FASTQ files directly without any checking of the qualities, and this often results in misencoded BAM files. I agree that FASTQ files should be used (I'm not using them) and that supporting other encodings in SAM/BAM is a very bad idea, and I'm not suggesting that here (actually, my PR does not contain anything about supported qualities). Answering your first list:
I agree with you on every point, but it's not up to me how the reads come from the service or how my institution keeps the reads, and at this point both are providing raw reads in tons of different formats. In addition, I think that this interface is a good way to start normalizing things in genomic data, because from my point of view it is very weird that data before processing and data after processing cannot be represented by a common interface. In terms of program structure, it is exactly the same data with more information included: the Nevertheless, I can change this PR to only implement the changes in the |
|
|
@droazen @lbergelson @yfarjoun I would welcome others to contribute to the discussion here. |
Thanks again for the fruitful discussion, @nh13. Even if this PR is not accepted, I would appreciate if I can include some of the methods and abstraction in |
|
@magicDGS I continue to think that in order to properly unify under a common interface, there needs to be not only shared data between the two types of records, but shared contracts for that data -- and I think the shared contracts are somewhat lacking in this case. As @nh13 pointed out, there are important differences between the two formats in the name and qual fields. And you're already finding yourself having to bypass the new interface with code like:
which is a warning sign that unification at the type system level might not be the right approach here. I think the non-interface aspect of this PR (easier conversion) is less controversial and could probably be merged with minor changes. It would be good to get some additional opinions before we make a decision on this PR, though -- @lbergelson @yfarjoun @tfenne care to chime in? |
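The bypass described above resembles the `instanceof` check in the `FastqRecord(final Read read)` constructor quoted earlier in this conversation. The following is only a minimal sketch of why that pattern suggests a leaky abstraction; `RawReadSketch`, `FastqLike`, `SamLike`, and `BypassDemo` are hypothetical stand-ins and not htsjdk types.

```java
// Hypothetical stand-in for the proposed shared interface (not the htsjdk type).
interface RawReadSketch {
    String getReadName();
}

// FASTQ-like record: carries a quality header that the shared interface does not expose.
final class FastqLike implements RawReadSketch {
    private final String name;
    private final String qualityHeader;

    FastqLike(final String name, final String qualityHeader) {
        this.name = name;
        this.qualityHeader = qualityHeader;
    }

    @Override
    public String getReadName() { return name; }

    String getQualityHeader() { return qualityHeader; }
}

// SAM-like record: has no quality header at all.
final class SamLike implements RawReadSketch {
    private final String name;

    SamLike(final String name) { this.name = name; }

    @Override
    public String getReadName() { return name; }
}

final class BypassDemo {
    private BypassDemo() {}

    // The caller has to look past the interface to recover format-specific data.
    static String qualityHeaderOf(final RawReadSketch read) {
        return (read instanceof FastqLike) ? ((FastqLike) read).getQualityHeader() : null;
    }
}
```

When callers need this kind of downcast, the two record types probably do not share a full contract, which is the concern raised in the comment above.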
|
@magicDGS thanks for the insight into the problems you are trying to solve and what is in your control. My inclination would be to standardize all the data you receive into valid SAM format, then move onto analysis (mapping, QC, etc.). This would make your analysis pipeline simpler, quarantine the parts of your pipeline that need to deal with non-standard data formats, and reduce the burden on htsjdk to support non-standard formats (wrong quality format) and not-so-best practices (storing things in the read names). So perhaps try to reduce the PR to as little as possible that would enable your use case of reformatting the input data, but keep no methods that would allow creation of non-standard SAM? More succinctly, perhaps @droazen gives the right solution. |
|
Thanks for your feedback, @nh13 and @droazen. I think that your suggestions are great, and I'm very grateful for them. Should I remove the interface and keep the utility methods in |
|
I don't think you need to wait for everyone to chime in. I agree that if
|
coveralls
commented
Jul 27, 2016
coveralls
commented
Jul 27, 2016
coveralls
commented
Jul 27, 2016
|
Thanks @yfarjoun. Now this PR is ready for review, @lbergelson (you are the one assigned, aren't you?). The changes are the following:
Other stuff that could be implemented in this PR is the changes in the contract suggested in #557 for empty reads. |
|
While you are making this change, may I ask you for a favor and add a functionality to convert a FastqRecord to SAMRecord?
private SAMRecord createSamRecord(final SAMFileHeader header,
                                  final String baseName,
                                  final FastqRecord frec,
                                  final boolean paired);
Thanks! |
|
Sure @SHuang-Broad, it was also in my plans to include something like that in the |
magicDGS
changed the title from
added RawRead interface and refactoring FastqRecord to Refactoring FastqRecord
Aug 9, 2016
|
Added the method, @SHuang-Broad, but a very simple one compared to the method that you mentioned. |
|
I squashed and rebased the branch. Could you have a look @lbergelson? I would like to have this in htsjdk, will be very useful for my work! Thanks in advance! |
coveralls
commented
Sep 1, 2016
coveralls
commented
Sep 1, 2016
|
Can you review this, @lbergelson? Thanks a lot in advance! |
|
This has been here for a long time now. Should I close it, @lbergelson? I think that tons of the methods here are very useful, apart from making the API more consistent... |
|
Ack, yes, this has been here for a very long time. Don't close. I will take a look. |
lbergelson
requested changes
Feb 23, 2017
@magicDGS A few minor comments and then good to go. Sorry this one sat forever. I assigned it to the "things to do later" mental category and the longer it sat the more easily I ignored it.
| +import htsjdk.samtools.util.SequenceUtil; | ||
| + | ||
| +/** | ||
| + * Codec for encode records into FASTQ format. |
| + * | ||
| + * @author Daniel Gomez-Sanchez (magicDGS) | ||
| + */ | ||
| +public class FastqCodec { |
lbergelson
Feb 23, 2017
Contributor
Is this really a codec? It doesn't have a decode, just an encode. Maybe it should be called FastqEncoder? Or maybe it should just be FastqUtils since it also does the conversion with SamRecords?
magicDGS
Feb 24, 2017
Contributor
I can't remember why I named it like that, but I guess that it is because it is also decoding SAMRecord into FastqRecord. Renamed as FastqEncoder, because in FastqUtils I would expect to have the constants and methods like getSamReadNameFromFastqHeader.
| + * @author Daniel Gomez-Sanchez (magicDGS) | ||
| + */ | ||
| +public class FastqCodec { | ||
| + |
lbergelson
Feb 23, 2017
Contributor
add a private noarg constructor to prevent people from instantiating instances of this class since it's really a utility class.
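A minimal sketch of the pattern requested above, assuming a class that holds only static helpers. `FastqUtilitySketch` and `sequenceHeader` are hypothetical names used for illustration, not part of this PR.

```java
// Hypothetical utility class; the class and method names are illustrative only.
final class FastqUtilitySketch {

    // Private no-arg constructor: the class only holds static helpers,
    // so nobody should be able to create an instance of it.
    private FastqUtilitySketch() {
        throw new AssertionError("utility class, do not instantiate");
    }

    // Example static helper: prefixes a read name with the FASTQ '@' marker.
    static String sequenceHeader(final String readName) {
        return "@" + (readName == null ? "" : readName);
    }
}
```

Declaring the class `final` as well prevents a subclass from reintroducing a usable constructor.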
| + /** | ||
| + * Encodes a FastqRecord in the String FASTQ format. | ||
| + */ | ||
| + public static String encode(final FastqRecord record) { |
lbergelson
Feb 23, 2017
Contributor
Should there be an encode override that takes a sam record as well?
magicDGS
Feb 24, 2017
Contributor
That's a very nice idea, although it is just syntactic sugar. Done with the following implementation: encode(asFastqRecord(record))
| + | ||
| + /** | ||
| + * @return the read name | ||
| + * @deprecated use {@link #getReadName()} instead |
lbergelson
Feb 23, 2017
Contributor
Could you add a note to each deprecated function saying when they were deprecated? month and year is enough. (assume we'll merge this this month....)
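One way the requested note could look, as a sketch only. The class `DeprecationSketch`, the old method name `getName()`, and the `02/2017` date format are assumptions for illustration; the comment above only asks for a month and year.

```java
// Hypothetical record; only the Javadoc convention is the point here.
final class DeprecationSketch {
    private final String readName;

    DeprecationSketch(final String readName) {
        this.readName = readName;
    }

    /** @return the read name */
    String getReadName() {
        return readName;
    }

    /**
     * @return the read name
     * @deprecated since 02/2017 (assumed date format), use {@link #getReadName()} instead
     */
    @Deprecated
    String getName() {
        // The old accessor simply delegates to the new one.
        return getReadName();
    }
}
```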
| return true; | ||
| } | ||
| - | ||
| + | ||
| + /** Simple toString() that gives a read name and length */ | ||
| @Override | ||
| public String toString() { |
lbergelson
Feb 23, 2017
Contributor
So I'm a bit afraid that there is code out there in the wild that relies on this giving the old output.
Could you also add a method toFastQString() that also calls through to encode so that there's an obvious way to do what this used to do? I think toString should probably call through to that and produce the same output as it used to just to avoid giving people surprise headaches.
magicDGS
Feb 24, 2017
Contributor
Using toString() for anything other than development purposes is discouraged, but it is true that some code may rely on this. I changed it to point to toFastQString, but I would rather add a comment saying that this is discouraged and that it will change at some point (if we set a date, that would be nice). What do you think, @lbergelson?
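A minimal sketch of the delegation discussed in this thread, assuming a four-line FASTQ layout with no trailing newline. `FastqRecordSketch` is a hypothetical stand-in whose field names follow the diff above; it is not the merged implementation.

```java
// Hypothetical minimal record; field names follow the diff quoted above.
final class FastqRecordSketch {
    private final String readName;
    private final String readString;
    private final String qualityHeader;
    private final String qualityString;

    FastqRecordSketch(final String readName, final String readString,
                      final String qualityHeader, final String qualityString) {
        this.readName = readName;
        this.readString = readString;
        this.qualityHeader = qualityHeader;
        this.qualityString = qualityString;
    }

    // Explicit, named way to obtain the FASTQ text (assumed layout: four lines, no trailing newline).
    String toFastQString() {
        return "@" + readName + "\n"
                + readString + "\n"
                + "+" + (qualityHeader == null ? "" : qualityHeader) + "\n"
                + qualityString;
    }

    // Keeps delegating so that callers relying on the old toString() output are not surprised.
    @Override
    public String toString() {
        return toFastQString();
    }
}
```

With this shape, new code can call `toFastQString()` explicitly, and `toString()` can later be changed without breaking callers that have migrated.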
|
I addressed all your comments, @lbergelson. I would like to have a more developmental-like |
codecov-io
commented
Feb 24, 2017
•
Codecov Report
@@ Coverage Diff @@
## master #572 +/- ##
===============================================
+ Coverage 64.843% 64.864% +0.021%
- Complexity 7160 7174 +14
===============================================
Files 525 526 +1
Lines 31701 31731 +30
Branches 5420 5424 +4
===============================================
+ Hits 20556 20582 +26
- Misses 8996 9000 +4
Partials 2149 2149
Continue to review full report at Codecov.
|
|
A minor suggestion: for FastqCodec, instead of creating a StringBuilder for each record, wouldn't it be better to create a method using Something like:
P. |
|
@lindenb, sounds like a good improvement. I'd push a commit with the change and use it in |
| + * Writes a FastqRecord into the Appendable output. | ||
| + * @throws SAMException if any I/O error occurs. | ||
| + */ | ||
| + public static Appendable write(final Appendable out,final FastqRecord record) { |
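The idea behind an `Appendable`-based method is that a single implementation can target a `StringBuilder`, a `Writer`, or a `PrintStream` without building an intermediate string per record. The sketch below assumes plain string arguments instead of a `FastqRecord` and uses `UncheckedIOException` as a stand-in for the `SAMException` mentioned in the diff; `FastqAppendSketch` is a hypothetical name.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper; not the FastqEncoder from this PR.
final class FastqAppendSketch {
    private FastqAppendSketch() {}

    // Appends one four-line FASTQ record to any Appendable and returns it for chaining.
    static <A extends Appendable> A write(final A out, final String name, final String bases,
                                          final String qualityHeader, final String qualities) {
        try {
            out.append('@').append(name).append('\n')
               .append(bases).append('\n')
               .append('+').append(qualityHeader == null ? "" : qualityHeader).append('\n')
               .append(qualities);
            return out;
        } catch (final IOException e) {
            // Stand-in for the SAMException used in the diff.
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper that reuses write() with a StringBuilder.
    static String encode(final String name, final String bases,
                         final String qualityHeader, final String qualities) {
        return write(new StringBuilder(), name, bases, qualityHeader, qualities).toString();
    }
}
```

Note that `Appendable.append(CharSequence)` writes the text `null` for a null argument, so a real implementation would need to decide how to handle null names and sequences.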
| + */ | ||
| +public final class FastqEncoder { | ||
| + | ||
| + // cannot be instantiated because it is an utility class |
| + */ | ||
| + public static String encode(final FastqRecord record) { | ||
| + // reserve some memory based on the read length and read name | ||
| + final int capacity = record.getReadLength() * 2 + record.getReadName().length() + 5; |
lbergelson
Feb 24, 2017
Contributor
@magicDGS The tests are failing with a null pointer exception. It looks like it's caused by this unguarded call of record.getReadName().length()
magicDGS
Feb 27, 2017
Contributor
I'm wondering whether I should change the contract of the getters so that they do not return null. What do you think? I guess that it's safer for all the cases, including the new getters.
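If the getters keep returning null, the capacity estimate from the diff needs a guard. This is only a sketch of one option, mirroring the `getReadLength() * 2 + getReadName().length() + 5` expression above; `CapacitySketch` and `lengthOrZero` are hypothetical names.

```java
// Hypothetical helper illustrating a null-safe capacity estimate.
final class CapacitySketch {
    private CapacitySketch() {}

    // Treats a null string as having length zero instead of throwing a NullPointerException.
    static int lengthOrZero(final String s) {
        return s == null ? 0 : s.length();
    }

    // Mirrors the estimate in the diff: twice the read length, plus the name, plus 5 extra characters.
    static int capacity(final String readName, final String readString) {
        return lengthOrZero(readString) * 2 + lengthOrZero(readName) + 5;
    }
}
```

The alternative raised in the comment, returning empty strings from the getters, would remove the need for such guards at every call site but changes the existing contract.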
| @@ -23,62 +23,169 @@ | ||
| */ | ||
| package htsjdk.samtools.fastq; | ||
| +import htsjdk.samtools.SAMRecord; |
|
Three new commits in the branch, @lbergelson:
I separated the 3rd commit because it's the only one that should be reviewed and/or removed if the solution is not accepted. Because they are new getters, we are not changing the API. But maybe it will be important to change it for the other ones to avoid NPEs for clients. |
magicDGS
added some commits
Apr 15, 2016
|
@magicDGS I hate the original null returning api. I don't know why the original decision was to return null instead of empty string, that always seems wrong to me. It's very explicitly choosing to return null though, so I imagine breaking it would probably cause problems. |
lbergelson
merged commit c1227d8
into
samtools:master
Feb 27, 2017
|
Thanks @lbergelson! |
magicDGS commented Apr 15, 2016
•
edited
Motivation
SAMRecords store bases and qualities in byte[], but using StringUtil and SAMUtils for it. For a user who would like to retrieve the sequence/qualities from a FastqRecord in the same way, which basically stores the same read, it is necessary to go deeper into the API to understand how this is transformed (personal experience).
Having a RawRead (originally Read) interface with a defined contract for these methods could help to overcome that issue. That way, methods like SolexaQualityConverter or TrimmingUtils could be used for FastqRecord, and QualityEncodingDetector could be cleaned up to include any kind of reads.
Another usage is when transforming a SAMRecord to a FastqRecord, which could be done more easily if the original base qualities are not requested. It will be useful, for instance, in Picard tools SamToFastq.
Description (only first commit)
RawRead (originally Read) interface
SAMRecord and FastqRecord extends Read (does not change the API)
FastqRecord for byte[] and from other Read
FastqRecord interface changing private variable names according to the Read interface
FastqRecord RawRead (originally Read) API
Checklist