Unmapped reads are not added to the BAI linear index #1142

jmarshall · 2020-09-23T14:17:27Z

Unmapped reads are not added to the BAI linear index, whether they are placed or unplaced:

Lines 1939 to 1950 in 1814ae3

    
           if ( tid>=0 ) 
        
           { 
        
               if (idx->bidx[tid] == 0) idx->bidx[tid] = kh_init(bin); 
        
               if (is_mapped) { 
        
                   // shoehorn [-1,0) (VCF POS=0) into the leftmost bottom-level bin 
        
                   if (beg < 0)  beg = 0; 
        
                   if (end <= 0) end = 1; 
        
                   // idx->z.last_off points to the start of the current record 
        
                   if (insert_to_l(&idx->lidx[tid], beg, end, 
        
                                   idx->z.last_off, idx->min_shift) < 0) return -1; 
        
               } 
        
           }

Since PR #1031, which landed in HTSlib 1.11 and made iterators directly take advantage of the linear index, iterator behaviour has changed for some data files. For example, if the first read in a query region is a placed unmapped read (i.e., it's filled in with its mate's coordinate) and that read is earlier in the file than the first mapped read in the same linear interval — then using the BAI linear index will cause that first read to be omitted from the iteration output as it skips over it to the linear interval's first mapped read.

Unmapped reads stopped being added to the linear index in samtools/samtools@cb5f449 for not entirely clear reasons. It may be that this was a bug fix to stop invalid data being added to the linear index in the case of unplaced reads (with tid set, slightly oddly, and a pos of -1), but that placed unmapped reads should have continued to be added.

Originally posted by @jmarshall in #1031 (comment) — see further discussion there, including note of the resulting pysam test case failures.

The text was updated successfully, but these errors were encountered:

jmarshall · 2020-09-23T14:47:52Z

Testing with picard BuildBamIndex on the problematic pysam data file (ex2_sorted.bam) shows that HTSJDK does (in this case at least) record the placed unmapped read in the linear index.

daviesrob · 2020-09-23T14:56:08Z

Out of interest, does picard use the linear index? And if so has it given not-quite-correct results for htslib-made indexes for a very long time, now?

jmarshall · 2020-09-23T15:19:05Z

I was going to say No and No on the basis of picard ViewSam INTERVAL_LIST=…, but it seems that I have no idea what Picard command to use to ensure it uses an index…

valeriuo · 2020-09-23T15:39:58Z

I think ValidateSamFile uses the index.

daviesrob · 2020-09-23T16:42:39Z

htsjdk does make use of the linear index, so it's possible that it has this behaviour.
https://github.com/samtools/htsjdk/blob/5240c404/src/main/java/htsjdk/samtools/BinningIndexContent.java#L140
https://github.com/samtools/htsjdk/blob/5240c404/src/main/java/htsjdk/samtools/Chunk.java#L164

On the other hand, picard ViewSam does appear to use the index (confirmed via strace) and does print the unmapped reads, so maybe it isn't affected.

daviesrob · 2020-09-24T15:24:32Z

While fixing the index generation isn't too hard, we are left with the problem of compatibility with all of the .bai indexes made since 2011. I think it might be possible to fix without reverting #1031 completely by correcting the linear index offset by using data in the binning index.

As unmapped, placed reads should be 1bp long for indexing purposes, they should be in the smallest bin that overlaps the linear index interval. If this is <= 16kb in size, we just have to look for the smallest chunk_beg and use that if it's less than the linear ioffset.

If the bin is bigger than 16kb (for example because the compress_binning() function pushed the unmapped reads to a longer bin) then there's a bit more work to be done. In that case we would need to search backwards the part of the linear index that overlaps that bin to find an entry with an offset less than our original ioffset. If there is one, we can use the new offset to cull some of the entries from the bin we're searching as we know they are outside the region of interest. The smallest chunk_beg in the remaining entries can again be used to correct the original ioffset.

I think this should fix the unmapped reads problem on old indexes while mostly retaining the speed benefits. I would expect the amount of extra data that would be read would usually be fairly small. As an additional optimisation, we could also consider checking n_unmapped in the "magic" bin. If it's zero, we will know that there are no unmapped reads, and so no additional correction is needed to the original linear index location.

jkbonfield · 2020-09-24T16:53:05Z

Are the unmapped reads still added to the bin chunks, even though they've been excluded from the linear index? If so then I see your point.

Using the conceptual example in the spec (a 3-way instead of 8-way tree) an alignment A(m) from 51k to 70k ends up in bin 2, but its unmapped sibling A(u) would end up in bin 7. A region query of 50k to 60k will hit the linear index covering 48k-64k which is (wrongly if A(u) is before A(m) in the file) pointing to the virtual offset for A(m). The only way this can happen however is if there are no mapped reads in bin 7. If there were, then the linear index would point to the first of those.

Thus the only time in that scenario that we do any more reading at all is indeed the same scenario we are trying to fix - unmapped data is missing from the linear index.

What about if bins 7 and 8 got merged into 2 because they were too small? Well in that case with just A(u) followed by A(m) the linear index points to a virtual offset which is part way through the chunk-beg to chunk-end range of bin 2, but chunk-beg is where we need to be so we can use that instead of the linear index. That's not a penalty because it's the correct place to start had the index been built correctly. If however we had data elsewhere in the bin, eg Z(m) at pos 48K to 49K, then this would also likely be part of the merged chunk-beg to chunk-end range and hence we do indeed have to read more data, negating the whole benefit of the linear index. That basically boils down to not using the linear index I believe and back to square 1, but it only applies when bins have been merged.

I don't have a good handle on how chunking is done (how many we coalesce together) and the frquency of bin merging to know what percentage of times we couldn't use the linear index.

What a total can of worms. :(

PS. I think the spec is also wrong, or at least highly confusing. It states that any alignment starting beyond 64Mb will always need to seek to some chunks in bin 0 (and then goes on to explain how linear indexes can be used to resolve that). Firstly, the data may be very sparse and we may actually have no chunks at all in bin 0, hence no seeking. Secondly we may have a mammothly long read in bin 0 (eg 10Mbp to 80Mbp) in which case we'll be reading bin 0 even for queries that start before 64Mbp. I'm not really sure this spec sentence particularly helps clarify things.

daviesrob · 2020-09-25T08:45:44Z

A bin is merged if the amount of compressed data between the first and last entries in the indexed file is less than 64K. So it will only happen when the data is fairly sparse. It can in theory go all the way to bin 0. Even if it did happen to end up in a very long bin, as long as there are a few mapped reads around it should still be possible to use the linear index to remove parts of the file before the area of interest, just not quite as much as before.

jkbonfield · 2020-09-25T10:24:56Z

Note the reason for adding linear index support was specifially for sparse data. From #1031

"It looks stark, but remember this is due to shallow data - around 2x coverage. For deep data the difference is minimal."

So that's the case we need to test on.

Avoid exercising samtools/htslib#1142 (present in the HTSlib-1.11 release), which leads to test failures due to the way we test using iterators. This is not what these pysam test cases are trying to test, so alter the input data file to avoid the issue.

Unmapped-but-placed (having REF/POS) reads are not included in the index. Hence if an placed unmapped is the first record in a bin, then it may not be returned. Note most aligners write out mapped followed by unmapped which does not trigger this problem. The SAM spec states that all unmapped placed reads should be considered as having an alignment length of 1. While it doesn't seem to explicitly state these must therefore be in the index, it does imply it. It appears that picard also indexes placed reads in this manner. Originally reported by John Marshall. Fixes samtools#1142

Unmapped-but-placed (having REF/POS) reads are not included in the index. Hence if an placed unmapped is the first record in a bin, then it may not be returned. Note most aligners write out mapped followed by unmapped which does not trigger this problem. The SAM spec states that all unmapped placed reads should be considered as having an alignment length of 1. While it doesn't seem to explicitly state these must therefore be in the index, it does imply it. It appears that picard also indexes placed reads in this manner. Originally reported by John Marshall. Fixes #1142

This was referenced Oct 2, 2020

Correct the linear index to include unmapped placed reads #1152

Closed

Adjust the starting offset of the iterator for unmapped placed reads #1155

Closed

valeriuo mentioned this issue Oct 16, 2020

Adjust the min offset of a bin when unmapped placed reads detected #1158

Merged

jmarshall mentioned this issue Oct 17, 2020

Test failures with samtools/htslib/bcftools 1.11 pysam-developers/pysam#961

Closed

jkbonfield closed this as completed in #1158 Oct 21, 2020

jmarshall mentioned this issue Jun 29, 2021

Summer 2021 NEWS update #1303

Merged

jmarshall mentioned this issue Nov 3, 2021

Fix indexing bug with "placed" unmapped reads. #1352

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unmapped reads are not added to the BAI linear index #1142

Unmapped reads are not added to the BAI linear index #1142

jmarshall commented Sep 23, 2020 •

edited

Loading

jmarshall commented Sep 23, 2020

daviesrob commented Sep 23, 2020

jmarshall commented Sep 23, 2020

valeriuo commented Sep 23, 2020

daviesrob commented Sep 23, 2020

daviesrob commented Sep 24, 2020

jkbonfield commented Sep 24, 2020 •

edited

Loading

daviesrob commented Sep 25, 2020

jkbonfield commented Sep 25, 2020

Unmapped reads are not added to the BAI linear index #1142

Unmapped reads are not added to the BAI linear index #1142

Comments

jmarshall commented Sep 23, 2020 • edited Loading

jmarshall commented Sep 23, 2020

daviesrob commented Sep 23, 2020

jmarshall commented Sep 23, 2020

valeriuo commented Sep 23, 2020

daviesrob commented Sep 23, 2020

daviesrob commented Sep 24, 2020

jkbonfield commented Sep 24, 2020 • edited Loading

daviesrob commented Sep 25, 2020

jkbonfield commented Sep 25, 2020

jmarshall commented Sep 23, 2020 •

edited

Loading

jkbonfield commented Sep 24, 2020 •

edited

Loading