Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

noodles bam failled to write record with large mapping position. #229

Closed
natir opened this issue Feb 1, 2024 · 1 comment
Closed

noodles bam failled to write record with large mapping position. #229

natir opened this issue Feb 1, 2024 · 1 comment
Assignees
Labels
bam bug Something isn't working

Comments

@natir
Copy link

natir commented Feb 1, 2024

The bam writer cannot write a record with a position greater than 997,048,321, whereas the specification indicates that the max value is 2^31 - 1.

Col Field Type Regexp/Range
4 POS Int [0, $2^{31} - 1$]

There's no problem for the sam writer here's a small example:

use noodles_bam as bam;
use noodles_sam as sam;
use noodles_util::alignment;

fn main() {
    let mut writer = bam::io::Writer::new(Vec::new());
    let header = sam::Header::default();

    let record1 = sam::alignment::RecordBuf::builder()
        .set_alignment_start(
            noodles_core::Position::new(997048321).expect("Position conversion failled"),
        )
        .build();
    let record2 = sam::alignment::RecordBuf::builder()
        .set_alignment_start(
            noodles_core::Position::new(997048322).expect("Position conversion failled"),
        )
        .build();

    let mut sam_writer = alignment::io::writer::builder::Builder::default()
        .set_format(alignment::io::Format::Sam)
        .build_from_writer(std::io::stdout().lock())
        .unwrap();
    sam_writer
        .write_record(&header, &record1)
        .expect("sam write record1 failed");
    sam_writer
        .write_record(&header, &record2)
        .expect("sam write record2 failed");
    println!("Write sam record success");

    let mut bam_writer = alignment::io::writer::builder::Builder::default()
        .set_format(alignment::io::Format::Bam)
        .build_from_writer(std::io::stdout().lock())
        .unwrap();
    bam_writer
        .write_record(&header, &record1)
        .expect("bam write record1 failed");
    bam_writer
        .write_record(&header, &record2)
        .expect("bam write record2 failed");
    println!("Write bam record success");
}

@zaeleus zaeleus self-assigned this Feb 1, 2024
@zaeleus zaeleus added bug Something isn't working bam labels Feb 1, 2024
@zaeleus zaeleus closed this as completed in 36c14bf Feb 1, 2024
zaeleus added a commit that referenced this issue Feb 1, 2024
… CIGAR operations that consume the reference sequence

Calculating the wrong alignment end in this case was found while
investigating #229.
@zaeleus
Copy link
Owner

zaeleus commented Feb 1, 2024

Oh, that's an interesting case!

Writing the position isn't failing but calculating the BAM index bin ID is. While the max position for BAM is indeed 231 - 1 (0-based), the max supported position in BAI is 229 - 1 (0-based). Each BAM record does include its supposed bin ID if it were to be placed in an index, calculated from its position, and 997048321 is the first position that overflows the field. (This position works in your example due to the alignment end being incorrectly calculated for records with no reference-consuming CIGAR operations. This is fixed in d19eb8e.)

The specification doesn't say what to do with this field when you have a nonsensical value (for BAM, > 37449) or when the bin ID overflows. I ended up taking the same approach as htslib and using the overflowing values.

Thanks for the report and test case! This fix is released in noodles 0.62.0.


If you're interested in the math, binning indices are m-ary trees, where $m$ = 8. For BAI, depth $d$ of the tree is 5, and the window size $w$ is $2^{s}$, where $s$ is 14. The number of nodes in the last level of the tree $L$ is $m^{d}$. A BAM bin ID is 2 bytes, meaning there are $u = 2^{16}$ values. The total number of nodes in the tree $N$, which is also the maximum bin ID, is $\frac{m * L - 1}{m - 1}$. The max position $P$ is $Lw$.

The last position that does not overflow the bin ID is given by

$$\begin{align} p_{last} &= P + w(u - N) \\\ &= Lw + (2^{s})(u - \frac{m * L - 1}{m - 1}) \\\ &= (m^{d})(2^{s}) + (2^{s})(u - \frac{m * m^{d} - 1}{m - 1}) \\\ &= (8^5)(2^{14}) + (2^{14})(2^{16} - \frac{8 * 8^{5} - 1}{8 - 1}) \\\ &= 997048320 \end{align}$$

Therefore, 997048320 + 1 = 997048321 (1-based) will overflow the bin ID calculation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bam bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants