Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[index] Initialize the missing entries of the linear index in reverse order #1286

Closed
wants to merge 3 commits into from

Conversation

valeriuo
Copy link
Contributor

For each contig, the linear index is an array of the smallest file offsets corresponding to alignments that cover a particular contig window, each 2min_shift bp wide. In case there are no alignments covering a particular genomic window, the linear index entries are filled from adjacent valid entries.
Whereas before, the empty linear index entries were filled with values of the nearest valid entry from the left, the current PR changes this procedure to use the nearest valid entry from the right, making the linear index values of empty windows slightly larger. This change has the potential to improve some reading operations by skipping over unnecessary data. However, in practice, most of HTSlib indexed reads use iterators, which rely on actual index chunks of existing alignments. So the linear index value is used as a starting point for searching valid chunks and not as an actual file offset. Since there are no valid chunks corresponding to the empty windows, there will be no change in performance for iterator reads.

The dump_index method has been renamed into idx_dump and a few new display capabilities have been added to it:

  • showing the index level of a bin
  • showing the genomic region covered by a bin

The hts_idx_save_core has been renamed into idx_save_core to reflect its static use and to match the reading equivalent idx_read_core.

The new hts_bin_level method has been added. It computes the index level of a bin.

Fixes #486

htslib/hts.h Outdated Show resolved Hide resolved
@jkbonfield
Copy link
Contributor

I'm struggling to observe any impact on this.

I have 10,000 1KB regions spread across chr22 on a TopHat aligned Exome. It has ~11% of bases covered, so there should be lots of holes for the linear index to make an impact on. Using strace to dump out all the lseeks I cannot find a single difference, both with test_view and test_view -M (although I didn't expect the latter to differ anyway). This was with a .bai index. I tried with .csi and there's no difference either. Although there is a big difference of course between using .bai and .csi, sadly with the latter having 23% more seeks and 16% more bytes read. Losing the linear index in CSI was a bad move. :-(

I can't really understand why there is no difference. The crux of it is seeding the empty linear index slots with the previous non-empty index item vs the next non-empty index item. The latter definitely sounds like the right thing to do, but I don't understand how it doesn't become a change in seek location.

I also tested the impact on rebuilding the index. That did change things, marginally for the better. On the old index vs new index I get:

# Old
    Num. seeks   	8359
    Num. read    	33762
    Bytes read   	1098949004
    
 # New
    Num. seeks   	8292
    Num. read    	33651
    Bytes read   	1095311756

Maybe I misunderstand how this is being called then. I assumed the linear index contained empty items on disk and the empty bits got populated on reading, but this doesn't appear to be the case. I can confirm that the alignments returned by these 10,000 queries are identical.

I'll continue digging and use your dump index tool to understand it better (thanks).

There are still some other useful additions here, such as the improved index dumping, improved documentation and hts_bin_level function. So irrespective of our assessment on update_loff, this is worth persuing further.

@jkbonfield
Copy link
Contributor

I can see that there are missing bins, which presumably also means missing loff figures, but the bins that are present all have an loff assigned. I see 1269 loff differences between the two indices, out of 72207. So subtle, but there.

I've looked at the code and see hts_read_core contains this:

https://github.com/valeriuo/htslib/blob/lindex_rev/hts.c#L2592-L2593

            for (j = 1; j < l->n; ++j) // fill missing values; may happen given older samtools and tabix
                if (l->offset[j] == 0) l->offset[j] = l->offset[j-1];

This appears to be a duplicate of the bit of code you changed, which is called by the index generation function. Hence why we see no difference when running on an existing index. However I tried changing it to the same logic and it made no difference still:

-            for (j = 1; j < l->n; ++j) // fill missing values; may happen given older samtools and tabix
-                if (l->offset[j] == 0) l->offset[j] = l->offset[j-1];
+            for (j=l->n-2; j >= 0; j--)
+                if (l->offset[j] == 0) l->offset[j] = l->offset[j+1];

@valeriuo
Copy link
Contributor Author

I also tested the impact on rebuilding the index. That did change things, marginally for the better. On the old index vs new index I get:

Indeed. You can only see a difference by regenerating the index file.

Maybe I misunderstand how this is being called then. I assumed the linear index contained empty items on disk and the empty bits got populated on reading, but this doesn't appear to be the case. I can confirm that the alignments returned by these 10,000 queries are identical.

No, the linear index on the disk has no holes. So the new code and the old code will return the same index structure in memory.

I've looked at the code and see hts_read_core contains this:

https://github.com/valeriuo/htslib/blob/lindex_rev/hts.c#L2592-L2593

That probably has no effect at all with the current index files that have the full linear index on disk, but I will change it to reflect the new filling strategy.

Show bin level.
Add documentation to bin methods.
Rename `hts_idx_save_core` to `idx_save_core`, to indicate its
static use.
@jkbonfield
Copy link
Contributor

jkbonfield commented Jun 1, 2021

Squashed / rebased and merged as 1830551

@jkbonfield jkbonfield closed this Jun 1, 2021
@valeriuo valeriuo deleted the lindex_rev branch June 3, 2021 10:58
jmarshall added a commit to jmarshall/pysam that referenced this pull request Jul 9, 2021
This accounts for samtools/htslib#1286, shipped in HTSlib 1.13, which
changes the contents of the various index files. Only example.gtf.gz.tbi
caused test failures, but regenerate all of these anyway (other than
example_0v2[36]*, which are intentionally using index files generated
by old versions of tabix).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ioff for linear index appears to be set improperly for empty bins
3 participants