Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[I/O] Small fix when reading field::cigar and changing the alignment default fields. #1642

Merged
merged 3 commits into from
Apr 24, 2020

Conversation

smehringer
Copy link
Member

@smehringer smehringer commented Mar 13, 2020

Resolves #1616
Blocked by #1697

The seqan3::field::cigar was added to the default fields for reading and writing alignment files.
This has the following impact:

  1. Reading and writing in one line is now possible without additional reference information:
    seqan3::alignment_file_output{"foo.sam"} = seqan3::alignment_file_input{"bar.sam"};
  2. The seqan3::alignment_file_output now accepts seqan3::field::cigar and seqan3::field::alignment
    although they store redundant information. For the SAM/BAM format, this ambiguity is handled by favoring the CIGAR
    information at all times if present.

Afterwards, we can discuss if we want to discard the dummy sequence and only allow reading the alignment field if the reference information is present.

@smehringer smehringer requested review from a team and eseiler and removed request for a team March 13, 2020 07:20
Copy link
Member

@eseiler eseiler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some snippet is failing :)

Is there a card for the further discussion?

CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
doc/tutorial/alignment_file/index.md Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Mar 13, 2020

Codecov Report

Merging #1642 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1642   +/-   ##
=======================================
  Coverage   97.68%   97.68%           
=======================================
  Files         238      238           
  Lines        9078     9080    +2     
=======================================
+ Hits         8868     8870    +2     
  Misses        210      210           
Impacted Files Coverage Δ
include/seqan3/io/alignment_file/input.hpp 100.00% <ø> (ø)
include/seqan3/io/alignment_file/output.hpp 100.00% <ø> (ø)
include/seqan3/io/alignment_file/format_bam.hpp 95.49% <100.00%> (+0.01%) ⬆️
include/seqan3/io/alignment_file/format_sam.hpp 97.99% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d921134...50b60aa. Read the comment docs.

@smehringer smehringer requested review from eseiler and joergi-w and removed request for eseiler March 17, 2020 07:31
Copy link
Member

@joergi-w joergi-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I agree that it's good to allow streaming the sam format into another file, because you can make copies while applying filters etc.
At the same time I'm worried that our alignment_file module becomes too specialised for sam/bam and it may become difficult to implement further formats like Aligned Fasta, msf, Clustal, Stockholm etc. which rely on aligned sequences. My point is that with cigar compared to alignment we loose information and by enabling a nice transformation of sam files we disable it for all others.

I don't know if you agreed in @seqan/core about this change, but for me it contradicts with our original design decision here that we had all agreed on.

I think we should label this commit/PR also with the [API] tag.

CHANGELOG.md Outdated Show resolved Hide resolved
doc/tutorial/alignment_file/index.md Outdated Show resolved Hide resolved
doc/tutorial/alignment_file/index.md Outdated Show resolved Hide resolved
@smehringer
Copy link
Member Author

@joergi-w Thanks a lot for your input!
We reconsidered the decision and now instead of replacing field::cigar with field::alignment we will read in both by default.

I had to force push the changes since they are quite different now. Please review anew :)

Copy link
Member

@joergi-w joergi-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I think this is a good solution. We need to add a test case where alignment and CIGAR are contradicting to prove that we use the CIGAR version for sam. (And maybe check coverage test for other new code paths.) There is one not resolved comment and a discussion point (which does not block this PR in my opinion):

doc/tutorial/alignment_file/index.md Outdated Show resolved Hide resolved
@@ -388,6 +375,23 @@ TYPED_TEST_P(alignment_file_read, format_error_ref_id_not_in_reference_informati
// alignment_file_write
// ----------------------------------------------------------------------------

// Note that these differ from the alignment_file_output default fields:
// 1. They don't contain field::bit_score and field::evalue since these belong to the BLAST format.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reasoning suggests that this file is already specialised for sam/bam. Can this template file still be easily included and used for BLAST (and other) format tests? If the answer is no, I suggest to rename this file into something more specific to avoid confusion in case someone implements another alignment format. It seems like the formats are so different that we cannot have a single template anyway?

Can be discussed independently of this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right, this test template would not work for BLAST just like this.
It would need to be split into a generic alignment format template and a sam/bam template. I would suggest doing so when BLAST will be implemented to avoid code refactorings / renamings before the new format can be tested with it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, alright! :)

@smehringer
Copy link
Member Author

@joergi-w Sorry for the delay but in order to fix the code coverage, there was some major discussion about noexcept. Since these are resolved I rebased this PR (unfortunately I had to force push). But I did not change anything here so your review is rather to check if rebasing caused any issues. Thank you !

Copy link
Member

@joergi-w joergi-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, looks good!

@smehringer smehringer requested a review from rrahn April 14, 2020 07:20
@rrahn rrahn added this to the Sprint 2 milestone Apr 15, 2020
Copy link
Contributor

@rrahn rrahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor things. The only thing that I find odd is that ther is no explicit test that demonstrates that now the cigar field is read in addition to the alignment field per default. But maybe that is handled somewhere?

CHANGELOG.md Show resolved Hide resolved
include/seqan3/io/alignment_file/format_bam.hpp Outdated Show resolved Hide resolved
include/seqan3/io/alignment_file/format_bam.hpp Outdated Show resolved Hide resolved
@rrahn
Copy link
Contributor

rrahn commented Apr 15, 2020

Also can you fix the typo in the commit message? fierlds ➡️ fields

@smehringer
Copy link
Member Author

Also can you fix the typo in the commit message? fierlds arrow_right fields

I will fix it after you approved and I fix the commit history ok?

@smehringer smehringer requested a review from rrahn April 15, 2020 16:25
Copy link
Contributor

@rrahn rrahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor things left 💅

CHANGELOG.md Show resolved Hide resolved
@@ -515,8 +515,8 @@ inline void format_bam::read_alignment_record(stream_type & stream,
if constexpr (!detail::decays_to_ignore_v<align_type>)
{
assign_unaligned(get<1>(align),
seq | views::slice(static_cast<decltype(std::ranges::distance(seq))>(offset_tmp),
std::ranges::distance(seq) - soft_clipping_end));
seq | views::slice(static_cast<std::ranges::difference_type<seq_type>>(offset_tmp),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
seq | views::slice(static_cast<std::ranges::difference_type<seq_type>>(offset_tmp),
seq | views::slice(static_cast<std::ranges::range_difference_t<seq_type>>(offset_tmp),

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is the other even possible? Is this never tested? Or is std::ranges::difference_type an alias?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already outdated! I pushed the same fix a minute afterwards :D

@smehringer smehringer dismissed eseiler’s stale review April 16, 2020 04:29

Stale review because @eseiler was sick

@smehringer smehringer requested a review from rrahn April 16, 2020 04:34
@smehringer
Copy link
Member Author

smehringer commented Apr 16, 2020

Also can you fix the typo in the commit message? fierlds arrow_right fields

@rrahn I don't find the typo?

@marehr marehr changed the title [I/O] Small fix when reading field::cigar and changing the alignment default fierlds. [I/O] Small fix when reading field::cigar and changing the alignment default fields. Apr 16, 2020
smehringer and others added 3 commits April 21, 2020 09:04
If the cigar string matches ‘kSmN’ the cigar was larger then 65535 operations
and was stored in the optional CG tag of the sam_tag_dictionary.
Before this fix, the cigar string was only checked if the field::alignment was read.
…/BAM.

The alignment cannot represent hard clipping, resulting in information
loss when reading and writing SAM/BAM files if the alignment field is
preferred over the cigar field.
Set for seqan3::alignment_file_input and seqan3::alignment_file_output.
@smehringer
Copy link
Member Author

Forgot to merge this after jenkins was fixed

@smehringer smehringer merged commit 949aea9 into seqan:master Apr 24, 2020
@smehringer smehringer deleted the io_api branch November 4, 2020 11:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

error when reading bam files with default fields.
4 participants