Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unmapped reads are not added to the BAI linear index #1142

Closed
jmarshall opened this issue Sep 23, 2020 · 9 comments · Fixed by #1158
Closed

Unmapped reads are not added to the BAI linear index #1142

jmarshall opened this issue Sep 23, 2020 · 9 comments · Fixed by #1158

Comments

@jmarshall
Copy link
Member

jmarshall commented Sep 23, 2020

Unmapped reads are not added to the BAI linear index, whether they are placed or unplaced:

htslib/hts.c

Lines 1939 to 1950 in 1814ae3

if ( tid>=0 )
{
if (idx->bidx[tid] == 0) idx->bidx[tid] = kh_init(bin);
if (is_mapped) {
// shoehorn [-1,0) (VCF POS=0) into the leftmost bottom-level bin
if (beg < 0) beg = 0;
if (end <= 0) end = 1;
// idx->z.last_off points to the start of the current record
if (insert_to_l(&idx->lidx[tid], beg, end,
idx->z.last_off, idx->min_shift) < 0) return -1;
}
}

Since PR #1031, which landed in HTSlib 1.11 and made iterators directly take advantage of the linear index, iterator behaviour has changed for some data files. For example, if the first read in a query region is a placed unmapped read (i.e., it's filled in with its mate's coordinate) and that read is earlier in the file than the first mapped read in the same linear interval — then using the BAI linear index will cause that first read to be omitted from the iteration output as it skips over it to the linear interval's first mapped read.

Unmapped reads stopped being added to the linear index in samtools/samtools@cb5f449 for not entirely clear reasons. It may be that this was a bug fix to stop invalid data being added to the linear index in the case of unplaced reads (with tid set, slightly oddly, and a pos of -1), but that placed unmapped reads should have continued to be added.

Originally posted by @jmarshall in #1031 (comment) — see further discussion there, including note of the resulting pysam test case failures.

@jmarshall
Copy link
Member Author

Testing with picard BuildBamIndex on the problematic pysam data file (ex2_sorted.bam) shows that HTSJDK does (in this case at least) record the placed unmapped read in the linear index.

@daviesrob
Copy link
Member

Out of interest, does picard use the linear index? And if so has it given not-quite-correct results for htslib-made indexes for a very long time, now?

@jmarshall
Copy link
Member Author

I was going to say No and No on the basis of picard ViewSam INTERVAL_LIST=…, but it seems that I have no idea what Picard command to use to ensure it uses an index…

@valeriuo
Copy link
Contributor

I think ValidateSamFile uses the index.

@daviesrob
Copy link
Member

htsjdk does make use of the linear index, so it's possible that it has this behaviour.
https://github.com/samtools/htsjdk/blob/5240c404/src/main/java/htsjdk/samtools/BinningIndexContent.java#L140
https://github.com/samtools/htsjdk/blob/5240c404/src/main/java/htsjdk/samtools/Chunk.java#L164

On the other hand, picard ViewSam does appear to use the index (confirmed via strace) and does print the unmapped reads, so maybe it isn't affected.

@daviesrob
Copy link
Member

While fixing the index generation isn't too hard, we are left with the problem of compatibility with all of the .bai indexes made since 2011. I think it might be possible to fix without reverting #1031 completely by correcting the linear index offset by using data in the binning index.

As unmapped, placed reads should be 1bp long for indexing purposes, they should be in the smallest bin that overlaps the linear index interval. If this is <= 16kb in size, we just have to look for the smallest chunk_beg and use that if it's less than the linear ioffset.

If the bin is bigger than 16kb (for example because the compress_binning() function pushed the unmapped reads to a longer bin) then there's a bit more work to be done. In that case we would need to search backwards the part of the linear index that overlaps that bin to find an entry with an offset less than our original ioffset. If there is one, we can use the new offset to cull some of the entries from the bin we're searching as we know they are outside the region of interest. The smallest chunk_beg in the remaining entries can again be used to correct the original ioffset.

I think this should fix the unmapped reads problem on old indexes while mostly retaining the speed benefits. I would expect the amount of extra data that would be read would usually be fairly small. As an additional optimisation, we could also consider checking n_unmapped in the "magic" bin. If it's zero, we will know that there are no unmapped reads, and so no additional correction is needed to the original linear index location.

@jkbonfield
Copy link
Contributor

jkbonfield commented Sep 24, 2020

Are the unmapped reads still added to the bin chunks, even though they've been excluded from the linear index? If so then I see your point.

Using the conceptual example in the spec (a 3-way instead of 8-way tree) an alignment A(m) from 51k to 70k ends up in bin 2, but its unmapped sibling A(u) would end up in bin 7. A region query of 50k to 60k will hit the linear index covering 48k-64k which is (wrongly if A(u) is before A(m) in the file) pointing to the virtual offset for A(m). The only way this can happen however is if there are no mapped reads in bin 7. If there were, then the linear index would point to the first of those.

Thus the only time in that scenario that we do any more reading at all is indeed the same scenario we are trying to fix - unmapped data is missing from the linear index.

What about if bins 7 and 8 got merged into 2 because they were too small? Well in that case with just A(u) followed by A(m) the linear index points to a virtual offset which is part way through the chunk-beg to chunk-end range of bin 2, but chunk-beg is where we need to be so we can use that instead of the linear index. That's not a penalty because it's the correct place to start had the index been built correctly. If however we had data elsewhere in the bin, eg Z(m) at pos 48K to 49K, then this would also likely be part of the merged chunk-beg to chunk-end range and hence we do indeed have to read more data, negating the whole benefit of the linear index. That basically boils down to not using the linear index I believe and back to square 1, but it only applies when bins have been merged.

I don't have a good handle on how chunking is done (how many we coalesce together) and the frquency of bin merging to know what percentage of times we couldn't use the linear index.

What a total can of worms. :(

PS. I think the spec is also wrong, or at least highly confusing. It states that any alignment starting beyond 64Mb will always need to seek to some chunks in bin 0 (and then goes on to explain how linear indexes can be used to resolve that). Firstly, the data may be very sparse and we may actually have no chunks at all in bin 0, hence no seeking. Secondly we may have a mammothly long read in bin 0 (eg 10Mbp to 80Mbp) in which case we'll be reading bin 0 even for queries that start before 64Mbp. I'm not really sure this spec sentence particularly helps clarify things.

@daviesrob
Copy link
Member

A bin is merged if the amount of compressed data between the first and last entries in the indexed file is less than 64K. So it will only happen when the data is fairly sparse. It can in theory go all the way to bin 0. Even if it did happen to end up in a very long bin, as long as there are a few mapped reads around it should still be possible to use the linear index to remove parts of the file before the area of interest, just not quite as much as before.

@jkbonfield
Copy link
Contributor

Note the reason for adding linear index support was specifially for sparse data. From #1031

"It looks stark, but remember this is due to shallow data - around 2x coverage. For deep data the difference is minimal."

So that's the case we need to test on.

jmarshall added a commit to jmarshall/pysam that referenced this issue Oct 19, 2020
Avoid exercising samtools/htslib#1142 (present in the HTSlib-1.11 release),
which leads to test failures due to the way we test using iterators.
This is not what these pysam test cases are trying to test, so alter the
input data file to avoid the issue.
jmarshall added a commit to jmarshall/pysam that referenced this issue Oct 19, 2020
Avoid exercising samtools/htslib#1142 (present in the HTSlib-1.11 release),
which leads to test failures due to the way we test using iterators.
This is not what these pysam test cases are trying to test, so alter the
input data file to avoid the issue.
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Nov 3, 2021
Unmapped-but-placed (having REF/POS) reads are not included in the
index.  Hence if an placed unmapped is the first record in a bin, then
it may not be returned.  Note most aligners write out mapped followed
by unmapped which does not trigger this problem.

The SAM spec states that all unmapped placed reads should be
considered as having an alignment length of 1.  While it doesn't seem
to explicitly state these must therefore be in the index, it does
imply it.  It appears that picard also indexes placed reads in this
manner.

Originally reported by John Marshall.  Fixes samtools#1142
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Nov 3, 2021
Unmapped-but-placed (having REF/POS) reads are not included in the
index.  Hence if an placed unmapped is the first record in a bin, then
it may not be returned.  Note most aligners write out mapped followed
by unmapped which does not trigger this problem.

The SAM spec states that all unmapped placed reads should be
considered as having an alignment length of 1.  While it doesn't seem
to explicitly state these must therefore be in the index, it does
imply it.  It appears that picard also indexes placed reads in this
manner.

Originally reported by John Marshall.  Fixes samtools#1142
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Nov 3, 2021
Unmapped-but-placed (having REF/POS) reads are not included in the
index.  Hence if an placed unmapped is the first record in a bin, then
it may not be returned.  Note most aligners write out mapped followed
by unmapped which does not trigger this problem.

The SAM spec states that all unmapped placed reads should be
considered as having an alignment length of 1.  While it doesn't seem
to explicitly state these must therefore be in the index, it does
imply it.  It appears that picard also indexes placed reads in this
manner.

Originally reported by John Marshall.  Fixes samtools#1142
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Nov 3, 2021
Unmapped-but-placed (having REF/POS) reads are not included in the
index.  Hence if an placed unmapped is the first record in a bin, then
it may not be returned.  Note most aligners write out mapped followed
by unmapped which does not trigger this problem.

The SAM spec states that all unmapped placed reads should be
considered as having an alignment length of 1.  While it doesn't seem
to explicitly state these must therefore be in the index, it does
imply it.  It appears that picard also indexes placed reads in this
manner.

Originally reported by John Marshall.  Fixes samtools#1142
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Nov 4, 2021
Unmapped-but-placed (having REF/POS) reads are not included in the
index.  Hence if an placed unmapped is the first record in a bin, then
it may not be returned.  Note most aligners write out mapped followed
by unmapped which does not trigger this problem.

The SAM spec states that all unmapped placed reads should be
considered as having an alignment length of 1.  While it doesn't seem
to explicitly state these must therefore be in the index, it does
imply it.  It appears that picard also indexes placed reads in this
manner.

Originally reported by John Marshall.  Fixes samtools#1142
daviesrob pushed a commit that referenced this issue Nov 12, 2021
Unmapped-but-placed (having REF/POS) reads are not included in the
index.  Hence if an placed unmapped is the first record in a bin, then
it may not be returned.  Note most aligners write out mapped followed
by unmapped which does not trigger this problem.

The SAM spec states that all unmapped placed reads should be
considered as having an alignment length of 1.  While it doesn't seem
to explicitly state these must therefore be in the index, it does
imply it.  It appears that picard also indexes placed reads in this
manner.

Originally reported by John Marshall.  Fixes #1142
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment