[FEATURE] Reduce memory footprint of FM-indices over text collections #1363
Changes
Using sdsl::sd_vector_builder
The builder can be used iff
- you know the number of 1s you want to set
- the 1s are set in a strictly increasing order
Since we actually fulfil all these requirements, we can use the builder.
This already reduces the memory footprint by a little bit: you do not need to allocate a bitvector of size text_size (a vector using text_size/8 bytes).
This may be more noticeable for bigger texts, since the allocation will take longer. Since I only measured memory peak, it's hard to tell exactly (the other steps use way more memory).
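To put the text_size/8 figure into perspective, here is a rough back-of-the-envelope sketch. The helper name plain_bitvector_bytes and the constant example_text_size are hypothetical, and reading "256Mbp" as 256 * 2^20 positions is an assumption; container overhead is ignored.

```cpp
#include <cstdint>

// Hypothetical helper: bytes needed for a plain bitvector with one bit per
// text position, rounded up to whole bytes (container overhead is ignored).
constexpr std::uint64_t plain_bitvector_bytes(std::uint64_t text_size)
{
    return (text_size + 7) / 8;
}

// Assumption: the 256Mbp benchmark text is read as 256 * 2^20 positions.
constexpr std::uint64_t example_text_size = 256ull * 1024 * 1024;

// Under that assumption the temporary bitvector would take 32 MiB.
static_assert(plain_bitvector_bytes(example_text_size) == 32ull * 1024 * 1024,
              "one bit per position of a 256 Mi text is 32 MiB");
```

Under these assumptions the temporary bitvector is in the tens of megabytes, which is small next to a peak of several gigabytes. That fits the observation that this change alone saves only a little.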
Avoiding a copy
We can get rid of the copy from std::vector<uint8_t> to sdsl::int_vector<8>.
Note that we need to do a std::ranges::reverse(tmp_text); afterwards.
The alternative would be to do a views::deep{views::reverse} followed by a views::reverse in the std::ranges::move, but the deep reverse is very expensive (run time wise), so it's actually better to reverse the vector after flattening (views::join) it.
Also: We cannot do the views::reverse after the views::join because views::join strips the bidirectional(?) property off the range and hence views::reverse won't work.
I did some local benchmarks with a text of size 256Mbp.
The times are around the same, with this new version tending to be a bit faster.
Memory peak was reduced from 3,213,168 KB to 2,885,388 KB (-10%).
Considering the text is only 256MB, the memory consumption is still rather high, though.
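As a minimal illustration of why reversing after flattening is a valid replacement for the deep reverse, the following sketch compares both orders of operation on a plain std::vector collection. The function names flatten_then_reverse and deep_reverse_then_flatten are hypothetical. The sketch uses only the standard library rather than sdsl::int_vector<8> or the views mentioned above, and it assumes that no delimiters are inserted between the sequences.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using collection_t = std::vector<std::vector<std::uint8_t>>;

// Hypothetical helper: concatenate all sequences, then reverse the flat
// result once in place. This mirrors "join first, reverse afterwards".
std::vector<std::uint8_t> flatten_then_reverse(collection_t const & collection)
{
    std::size_t total = 0;
    for (auto const & seq : collection)
        total += seq.size();

    std::vector<std::uint8_t> flat;
    flat.reserve(total); // one allocation for the whole text
    for (auto const & seq : collection)
        flat.insert(flat.end(), seq.begin(), seq.end());

    std::reverse(flat.begin(), flat.end());
    return flat;
}

// Hypothetical helper: visit the sequences in reverse order and append each
// one reversed. This mirrors "deep reverse plus outer reverse, then join".
std::vector<std::uint8_t> deep_reverse_then_flatten(collection_t const & collection)
{
    std::vector<std::uint8_t> flat;
    for (auto outer = collection.rbegin(); outer != collection.rend(); ++outer)
        flat.insert(flat.end(), outer->rbegin(), outer->rend());
    return flat;
}
```

Both helpers produce the same flat sequence for any input collection, so a single in-place reverse on the flattened buffer can stand in for the more expensive nested reversal.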
Todos
Add std::ranges::reverse to std/algorithm and use it instead of std::reverse