Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Reduce memory footprint of FM-indices over text collections #1363

Merged
merged 2 commits into from
Nov 14, 2019

Conversation

eseiler
Copy link
Member

@eseiler eseiler commented Nov 12, 2019

Changes

Using sdsl::sd_vector_builder

The builder can be used iff

  • You know the bitvector size
  • You know the number of 1s you want to set
  • The positions of the 1s are in a strictly increasing order

Since we actually fulfil all these requirements, we can use the builder.
This already reduces the memory footprint by a little bit - you do not need to allocate a bitvector of size text_size ( a vector using text_size/8 many bytes).
This may be more noticeable for bigger texts since the allocation will take longer. Since I only measured memory peak, it's hard to tell exactly (the other steps use way more memory).

Avoiding a copy

We can get rid of the copy from std::vector<uint8_t> to sdsl::int_vector<8>.
Note that we need to do a std::ranges::reverse(tmp_text); afterwards.
The alternative would be to do a views::deep{views::reverse} followed by an views::reverse in the std::ranges::move, but the deep reverse is very expensive (run time wise), so it's actually better to reverse the vector after flattening (views::join) it.
Also: We cannot do the views::reverse after the views::join because views::join strips the bidirectional(?) property off the range and hence views::reverse won't work.

I did some local benchmarks with a text of size 256Mbp.
The times are around the same, with this new version tending to be a bit faster.
Memory peak was reduced from 3,213,168 KB to 2,885,388 KB (- 10%).
Considering the text is only 256MB, the memory consumption is still rather high, though.

Todos

  • Add std::ranges::reverse to std/algorithm and use it instead of std::reverse

@eseiler eseiler changed the title [FEATURE] Reduce memory footprint of FM-indices [FEATURE] Reduce memory footprint of FM-indices over text collections Nov 12, 2019
@eseiler eseiler force-pushed the feature/less_ram branch 2 times, most recently from 9b7b298 to 5d5d645 Compare November 12, 2019 16:26
@codecov
Copy link

codecov bot commented Nov 12, 2019

Codecov Report

Merging #1363 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1363   +/-   ##
=======================================
  Coverage   97.49%   97.49%           
=======================================
  Files         218      218           
  Lines        8703     8703           
=======================================
  Hits         8485     8485           
  Misses        218      218
Impacted Files Coverage Δ
include/seqan3/search/fm_index/fm_index.hpp 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ca6dd02...f7f51e1. Read the comment docs.

@eseiler eseiler requested a review from rrahn November 13, 2019 16:23
Copy link
Contributor

@rrahn rrahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rrahn rrahn merged commit b74697f into seqan:master Nov 14, 2019
@eseiler eseiler deleted the feature/less_ram branch November 22, 2019 10:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants