-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[I/O] Small fix when reading field::cigar and changing the alignment default fields. #1642
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some snippet is failing :)
Is there a card for the further discussion?
test/unit/io/alignment_file/alignment_file_format_test_template.hpp
Outdated
Show resolved
Hide resolved
test/unit/io/alignment_file/alignment_file_format_test_template.hpp
Outdated
Show resolved
Hide resolved
test/unit/io/alignment_file/alignment_file_format_test_template.hpp
Outdated
Show resolved
Hide resolved
test/unit/io/alignment_file/alignment_file_format_test_template.hpp
Outdated
Show resolved
Hide resolved
Codecov Report
@@ Coverage Diff @@
## master #1642 +/- ##
=======================================
Coverage 97.68% 97.68%
=======================================
Files 238 238
Lines 9078 9080 +2
=======================================
+ Hits 8868 8870 +2
Misses 210 210
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I agree that it's good to allow streaming the sam format into another file, because you can make copies while applying filters etc.
At the same time I'm worried that our alignment_file module becomes too specialised for sam/bam and it may become difficult to implement further formats like Aligned Fasta, msf, Clustal, Stockholm etc. which rely on aligned sequences. My point is that with cigar compared to alignment we loose information and by enabling a nice transformation of sam files we disable it for all others.
I don't know if you agreed in @seqan/core about this change, but for me it contradicts with our original design decision here that we had all agreed on.
I think we should label this commit/PR also with the [API] tag.
@joergi-w Thanks a lot for your input! I had to force push the changes since they are quite different now. Please review anew :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, I think this is a good solution. We need to add a test case where alignment and CIGAR are contradicting to prove that we use the CIGAR version for sam. (And maybe check coverage test for other new code paths.) There is one not resolved comment and a discussion point (which does not block this PR in my opinion):
@@ -388,6 +375,23 @@ TYPED_TEST_P(alignment_file_read, format_error_ref_id_not_in_reference_informati | |||
// alignment_file_write | |||
// ---------------------------------------------------------------------------- | |||
|
|||
// Note that these differ from the alignment_file_output default fields: | |||
// 1. They don't contain field::bit_score and field::evalue since these belong to the BLAST format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reasoning suggests that this file is already specialised for sam/bam. Can this template file still be easily included and used for BLAST (and other) format tests? If the answer is no, I suggest to rename this file into something more specific to avoid confusion in case someone implements another alignment format. It seems like the formats are so different that we cannot have a single template anyway?
Can be discussed independently of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right, this test template would not work for BLAST just like this.
It would need to be split into a generic alignment format template and a sam/bam template. I would suggest doing so when BLAST will be implemented to avoid code refactorings / renamings before the new format can be tested with it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, alright! :)
@joergi-w Sorry for the delay but in order to fix the code coverage, there was some major discussion about |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some minor things. The only thing that I find odd is that ther is no explicit test that demonstrates that now the cigar field is read in addition to the alignment field per default. But maybe that is handled somewhere?
Also can you fix the typo in the commit message? |
I will fix it after you approved and I fix the commit history ok? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some minor things left 💅
@@ -515,8 +515,8 @@ inline void format_bam::read_alignment_record(stream_type & stream, | |||
if constexpr (!detail::decays_to_ignore_v<align_type>) | |||
{ | |||
assign_unaligned(get<1>(align), | |||
seq | views::slice(static_cast<decltype(std::ranges::distance(seq))>(offset_tmp), | |||
std::ranges::distance(seq) - soft_clipping_end)); | |||
seq | views::slice(static_cast<std::ranges::difference_type<seq_type>>(offset_tmp), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seq | views::slice(static_cast<std::ranges::difference_type<seq_type>>(offset_tmp), | |
seq | views::slice(static_cast<std::ranges::range_difference_t<seq_type>>(offset_tmp), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is the other even possible? Is this never tested? Or is std::ranges::difference_type
an alias?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already outdated! I pushed the same fix a minute afterwards :D
Stale review because @eseiler was sick
@rrahn I don't find the typo? |
If the cigar string matches ‘kSmN’ the cigar was larger then 65535 operations and was stored in the optional CG tag of the sam_tag_dictionary. Before this fix, the cigar string was only checked if the field::alignment was read.
…/BAM. The alignment cannot represent hard clipping, resulting in information loss when reading and writing SAM/BAM files if the alignment field is preferred over the cigar field.
Set for seqan3::alignment_file_input and seqan3::alignment_file_output.
Forgot to merge this after jenkins was fixed |
Resolves #1616
Blocked by #1697The
seqan3::field::cigar
was added to the default fields for reading and writing alignment files.This has the following impact:
seqan3::alignment_file_output{"foo.sam"} = seqan3::alignment_file_input{"bar.sam"};
seqan3::alignment_file_output
now acceptsseqan3::field::cigar
andseqan3::field::alignment
although they store redundant information. For the SAM/BAM format, this ambiguity is handled by favoring the CIGAR
information at all times if present.
Afterwards, we can discuss if we want to discard the dummy sequence and only allow reading the alignment field if the reference information is present.