Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] not accepting BAM with empty header #2536

Merged
merged 2 commits into from
Apr 21, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -96,6 +96,7 @@ If possible, provide tooling that performs the changes, e.g. a shell-script.
* Requesting the alignment without also requesting the sequence for BAM files containing empty CIGAR strings does now
not result in erroneous parsing ([\#2418](https://github.com/seqan/seqan3/pull/2418)).
* BAM files with 64 references are now parsed correctly ([\#2423](https://github.com/seqan/seqan3/pull/2423)).
* BAM files not containing a plain text header are now accepted ([\#2536](https://github.com/seqan/seqan3/pull/2536)).
* Writing `gz`-compressed output no longer results in `bgzf`-compressed output. This change may have following effects
([\#2458](https://github.com/seqan/seqan3/pull/2458)):
* A noticeable slowdown when writing `gz`-compressed content since, in contrast to `bgzf`, `gz` does not feature
Expand Down
39 changes: 29 additions & 10 deletions include/seqan3/io/sam_file/format_bam.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -333,24 +333,43 @@ inline void format_bam::read_alignment_record(stream_type & stream,
if (!std::ranges::equal(stream_view | views::take_exactly_or_throw(4), std::string_view{"BAM\1"}))
throw format_error{"File is not in BAM format."};

int32_t tmp32{};
read_field(stream_view, tmp32);
int32_t l_text{}; // length of header text including \0 character
int32_t n_ref{}; // number of reference sequences
int32_t l_name{}; // 1 + length of reference name including \0 character
int32_t l_ref{}; // length of reference sequence
Comment on lines +336 to +339
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per specification, these should be uint32_t. Maybe there was a reason they are not, and maybe we can find that reason.
Just replacing does not work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, definitely a follow-up!


if (tmp32 > 0) // header text is present
read_header(stream_view | views::take_exactly_or_throw(tmp32), header, ref_seqs);
read_field(stream_view, l_text);

if (l_text > 0) // header text is present
read_header(stream_view | views::take_exactly_or_throw(l_text), header, ref_seqs);

int32_t n_ref;
read_field(stream_view, n_ref);

for (int32_t ref_idx = 0; ref_idx < n_ref; ++ref_idx)
{
read_field(stream_view, tmp32); // l_name (length of reference name including \0 character)
read_field(stream_view, l_name);

string_buffer.resize(tmp32 - 1);
std::ranges::copy_n(std::ranges::begin(stream_view), tmp32 - 1, string_buffer.data()); // copy without \0 character
string_buffer.resize(l_name - 1);
std::ranges::copy_n(std::ranges::begin(stream_view), l_name - 1, string_buffer.data()); // copy without \0 character
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, this should copy the \0 character, because l_name is 1 + length including \0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean :)

Do want to copy that \0 too?

Copy link
Member Author

@eseiler eseiler Apr 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says: We do not copy \0.
The code says: We copy \0.

So follow-up would be to double-check and, if we do not copy the \0, to add a comment as to why is that. Because l_name is one more than the length including \0, so l_name - 1 should include the \0.

std::ranges::next(std::ranges::begin(stream_view)); // skip \0 character

read_field(stream_view, tmp32); // l_ref (length of reference sequence)
read_field(stream_view, l_ref);

if constexpr (detail::decays_to_ignore_v<ref_seqs_type>) // no reference information given
{
// If there was no header text, we parse reference sequences block as header information
if (l_text == 0)
{
auto & reference_ids = header.ref_ids();
// put the length of the reference sequence into ref_id_info
header.ref_id_info.emplace_back(l_ref, "");
// put the reference name into reference_ids
reference_ids.push_back(string_buffer);
// assign the reference name an ascending reference id (starts at index 0).
header.ref_dict.emplace(reference_ids.back(), reference_ids.size() - 1);
continue;
}
}

auto id_it = header.ref_dict.find(string_buffer);

Expand All @@ -367,7 +386,7 @@ inline void format_bam::read_alignment_record(stream_type & stream,
" does not correspond to the position ", id_it->second,
" in the header (header.ref_ids():", header.ref_ids(), ").")};
}
else if (std::get<0>(header.ref_id_info[id_it->second]) != tmp32) // [unlikely]
else if (std::get<0>(header.ref_id_info[id_it->second]) != l_ref) // [unlikely]
{
throw format_error{"Provided reference has unequal length as specified in the header."};
}
Expand Down
20 changes: 20 additions & 0 deletions test/unit/io/sam_file/format_bam_test.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -666,3 +666,23 @@ TEST_F(bam_format, issue2417)

EXPECT_EQ(num_records, 1u);
}

// https://github.com/seqan/seqan3/issues/1201
TEST_F(bam_format, issue1201)
{
// A BAM file with an l_text of 0
std::string const input{
// read1 117 ref 1 0 * = 1 0 ACGTA IIIII
'\x42', '\x41', '\x4d', '\x01', '\x00', '\x00', '\x00', '\x00', '\x01', '\x00', '\x00', '\x00', '\x04', '\x00',
'\x00', '\x00', '\x72', '\x65', '\x66', '\x00', '\x70', '\x07', '\x00', '\x00', '\x2e', '\x00', '\x00', '\x00',
'\x00', '\x00', '\x00', '\x00', '\x00', '\x00', '\x00', '\x00', '\x06', '\x00', '\x49', '\x12', '\x00', '\x00',
'\x75', '\x00', '\x05', '\x00', '\x00', '\x00', '\x00', '\x00', '\x00', '\x00', '\x00', '\x00', '\x00', '\x00',
'\x00', '\x00', '\x00', '\x00', '\x72', '\x65', '\x61', '\x64', '\x31', '\x00', '\x12', '\x48', '\x10', '\x28',
'\x28', '\x28', '\x28', '\x28'
};

std::istringstream stream{input};
seqan3::sam_file_input fin{stream, seqan3::format_bam{}};

fin.header().format_version;
}