New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[index] Initialize the missing entries of the linear index in reverse order #1286
Conversation
I'm struggling to observe any impact on this. I have 10,000 1KB regions spread across chr22 on a TopHat aligned Exome. It has ~11% of bases covered, so there should be lots of holes for the linear index to make an impact on. Using I can't really understand why there is no difference. The crux of it is seeding the empty linear index slots with the previous non-empty index item vs the next non-empty index item. The latter definitely sounds like the right thing to do, but I don't understand how it doesn't become a change in seek location. I also tested the impact on rebuilding the index. That did change things, marginally for the better. On the old index vs new index I get:
Maybe I misunderstand how this is being called then. I assumed the linear index contained empty items on disk and the empty bits got populated on reading, but this doesn't appear to be the case. I can confirm that the alignments returned by these 10,000 queries are identical. I'll continue digging and use your dump index tool to understand it better (thanks). There are still some other useful additions here, such as the improved index dumping, improved documentation and |
I can see that there are missing bins, which presumably also means missing loff figures, but the bins that are present all have an loff assigned. I see 1269 loff differences between the two indices, out of 72207. So subtle, but there. I've looked at the code and see https://github.com/valeriuo/htslib/blob/lindex_rev/hts.c#L2592-L2593
This appears to be a duplicate of the bit of code you changed, which is called by the index generation function. Hence why we see no difference when running on an existing index. However I tried changing it to the same logic and it made no difference still:
|
Indeed. You can only see a difference by regenerating the index file.
No, the linear index on the disk has no holes. So the new code and the old code will return the same index structure in memory.
That probably has no effect at all with the current index files that have the full linear index on disk, but I will change it to reflect the new filling strategy. |
Show bin level. Add documentation to bin methods.
Rename `hts_idx_save_core` to `idx_save_core`, to indicate its static use.
Squashed / rebased and merged as 1830551 |
This accounts for samtools/htslib#1286, shipped in HTSlib 1.13, which changes the contents of the various index files. Only example.gtf.gz.tbi caused test failures, but regenerate all of these anyway (other than example_0v2[36]*, which are intentionally using index files generated by old versions of tabix).
For each contig, the linear index is an array of the smallest file offsets corresponding to alignments that cover a particular contig window, each 2min_shift bp wide. In case there are no alignments covering a particular genomic window, the linear index entries are filled from adjacent valid entries.
Whereas before, the empty linear index entries were filled with values of the nearest valid entry from the left, the current PR changes this procedure to use the nearest valid entry from the right, making the linear index values of empty windows slightly larger. This change has the potential to improve some reading operations by skipping over unnecessary data. However, in practice, most of HTSlib indexed reads use iterators, which rely on actual index chunks of existing alignments. So the linear index value is used as a starting point for searching valid chunks and not as an actual file offset. Since there are no valid chunks corresponding to the empty windows, there will be no change in performance for iterator reads.
The
dump_index
method has been renamed intoidx_dump
and a few new display capabilities have been added to it:The
hts_idx_save_core
has been renamed intoidx_save_core
to reflect its static use and to match the reading equivalentidx_read_core
.The new
hts_bin_level
method has been added. It computes the index level of a bin.Fixes #486