
[Search] Use a buffer to cache the intermediate results between search invocations such that memory allocations are reduced on multiple invocations of search. #29

Open
rrahn opened this issue Mar 27, 2020 · 9 comments
Labels
needs refinement A story that was not discussed and/or estimated by the team yet but is planned for upcoming sprints.

Comments

@rrahn (Contributor) commented Mar 27, 2020

Description

A thread-local buffer used for the search can have an impact on performance by avoiding unnecessary memory allocations, which also have to be synchronised by the OS.

Also see here for possible improvements: seqan/seqan3#1528

Acceptance Criteria

  • the buffer is always available automatically; nothing changes from the user's perspective.
  • no reallocations happen between search invocations
  • the buffer is cleared before the search is invoked, but not shrunk.

Tasks

  • how can this be designed and implemented?

Definition of Done

  • Implementation and design approved
  • Unit tests pass
  • Test coverage = 100%
  • Microbenchmarks added and/or affected microbenchmarks < 5% performance drop
  • API documentation added
  • Tutorial/teaching material added
  • Test suite compiles in less than 30 seconds (on Travis)
  • Changelog entry added
@rrahn rrahn added the needs refinement A story that was not discussed and/or estimated by the team yet but is planned for upcoming sprints. label Mar 27, 2020
@rrahn rrahn added this to the Sprint 1 milestone Mar 27, 2020
@rrahn rrahn modified the milestones: Sprint 1, Sprint 2 Apr 15, 2020
@rrahn rrahn added this to the Sprint 4 milestone May 12, 2020
@rrahn rrahn modified the milestones: Sprint 4, Sprint 5 May 26, 2020
@rrahn rrahn modified the milestones: Sprint 5, Sprint 6 Jun 8, 2020
@h-2 (Member) commented Jul 20, 2020

how can this be designed and implemented?

  • low-level utilities like .locate() on the iterators need an out-parameter version.
  • add a publicly documented struct called search_buffers whose members are public, but not part of the API and not visible in the user documentation.
  • one member should be a buffer for the occurrences that is passed to .locate() as an in-out parameter (.locate() appends to this buffer).
  • the actual member is not one buffer, but a vector of buffers (one for every thread).
  • regular search creates one object of type search_buffers and passes the respective member to downstream functions.
  • a config object is added that allows the user to create an object outside the function call. In this case the search doesn't create an object but uses the one provided by the config element.

So for regular usage nothing changes, but users can optionally pass in an external buffers object via the config system. The internals of this type are not exposed and can therefore be changed at any time.
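
A minimal sketch of what the list above could translate to (all names and signatures here are hypothetical, not actual SeqAn3 API; the real position type would depend on the index):

#include <cstddef>
#include <vector>

// Publicly documented struct, but the members are implementation details
// that are not part of the API and may change at any time.
struct search_buffers
{
    std::vector<std::vector<std::size_t>> occurrences_per_thread; // one buffer per thread
};

// Hypothetical out-parameter overload of locate(): appends to `out`
// instead of returning a freshly allocated vector on every call.
template <typename cursor_t>
void locate(cursor_t const & cursor, std::vector<std::size_t> & out)
{
    for (auto position : cursor.locate())
        out.push_back(position);
}

Regular search would then clear (but never shrink) the respective thread's buffer before each invocation, which is exactly the acceptance criterion above.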

@rrahn (Contributor, Author) commented Aug 10, 2020

Resolution 10.06.2020
  1. We will investigate a strategy using pmr::vector and a polymorphic memory resource to make it easy to provide a generic buffer from the outside without knowing the type.

  2. Agent-client pattern, as used for the IBF?
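
A rough illustration of the pmr idea (nothing here is SeqAn3 API; search_impl is a made-up stand-in): the algorithm allocates its intermediate results from a caller-provided memory resource, so the caller can reuse memory across invocations without ever naming the buffer's element type.

#include <cstddef>
#include <memory_resource>
#include <vector>

// Made-up stand-in for the search internals: the intermediate buffer is a
// pmr::vector, so all of its allocations go to the caller's resource.
void search_impl(std::pmr::memory_resource * resource)
{
    std::pmr::vector<std::size_t> occurrences{resource};
    occurrences.push_back(42); // placeholder for the actual work
}

int main()
{
    // The pool hands previously freed blocks back out, so repeated
    // invocations stop hitting the OS allocator after warm-up.
    std::pmr::unsynchronized_pool_resource pool{};

    for (int i = 0; i < 100; ++i)
        search_impl(&pool);
}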

@h-2 (Member) commented Aug 10, 2020

Hm, that sounds quite complicated (for the user), but I don't know what you have discussed yet. Why do we need polymorphism? We know the type of the buffers.

My proposal would look like this:

seqan3::search_buffers buffers; // the user doesn't know what's inside, but we do

seqan3::configuration const default_cfg = seqan3::search_cfg::max_error_total{zero_errors} |
                                          seqan3::search_cfg::max_error_substitution{zero_errors} |
                                          seqan3::search_cfg::max_error_insertion{zero_errors} |
                                          seqan3::search_cfg::max_error_deletion{zero_errors} |
                                          seqan3::search_cfg::output_query_id |
                                          seqan3::search_cfg::output_reference_id |
                                          seqan3::search_cfg::output_reference_begin_position |
                                          seqan3::search_cfg::hit_all |
                                          seqan3::search_cfg::with_buffers{buffers}; // <--------

// after that, search normally with the config

@h-2 (Member) commented Aug 10, 2020

Note that this whole issue is not about how results are returned from the top-level search but how intermediate results are stored internally (cursors and results of the cursors' locate operation).

@marehr (Member) commented Aug 10, 2020

Note that this whole issue is not about how results are returned from the top-level search but how intermediate results are stored internally (cursors and results of the cursors' locate operation).

We know.

The main point is:

  • Allocating internal memory is expensive; reusing (internal) memory would be better.
    • => This can only be achieved by passing that (internal) memory in from the outside.

We (briefly) discussed how we could model this with a sound interface. (Just passing in a std::vector is a bad interface, because it is too specific to the internals.)

I noticed that we already solved this problem for a different data structure (e.g. the IBF), where we used a client-agent pattern to model this.

And René noted that std::pmr::memory_resource might be a different way to implement this (passing internal storage in from the outside), but we need to check whether it actually fits our use case.
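
For reference, a simplified sketch of that client-agent pattern (reconstructed from memory with made-up names; the IBF's agent works along these lines): the agent owns the buffers, so repeated queries through the same agent reuse them, and one agent per thread avoids synchronisation.

#include <cstddef>
#include <vector>

class some_index // stand-in for the actual data structure
{
public:
    class agent_type
    {
        std::vector<std::size_t> buffer{}; // owned by the agent, reused across calls

    public:
        // Returns a reference into the agent-owned buffer; the result is
        // only valid until the next call on this agent.
        std::vector<std::size_t> const & query(std::size_t value)
        {
            buffer.clear();          // keep the capacity, drop the contents
            buffer.push_back(value); // placeholder for the actual lookup
            return buffer;
        }
    };

    agent_type make_agent() const { return {}; }
};

The user never sees the buffer type, only the agent, so the internals stay free to change.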

If you want and have time, we could talk about this in a web-call. 👍

@h-2 (Member) commented Aug 10, 2020

Allocating internal memory is expensive; reusing (internal) memory would be better.
=> This can only be achieved by passing that (internal) memory in from the outside.

Yep, that's why I propose to just create a named struct that can be used to create objects that are passed to the algorithm via the config. This struct can contain anything we need (and can be changed later on, since none of its members are part of the API).

If you want and have time, we could talk about this in a web-call. +1

Maybe we can do a Lambda call sometime soon and then have a 30-minute nerd session on this topic afterwards ;)

@rrahn (Contributor, Author) commented Aug 10, 2020

Well, that's the point. I am not completely into the details of the index cursors, but we often end up with different types depending on the configuration (and I am not sure what will come in the future: different indices, different cursors, etc.). So while these types usually depend on the configuration, you need to allocate a buffer for something whose type you don't know yet, and you quickly get hard-to-use interfaces with strange dependencies (get the correct buffer type for the configuration element, but then add it to the configuration again).

I have seen (but not pursued any further, i.e. it has to be investigated) that with the pmr memory resource the user can simply provide memory (chunks of std::byte). That is great, because it lets us manage and reuse the memory inside the algorithms without caring too much about the types. Of course we can wrap a default implementation of this inside a dedicated struct. But the user can also do more advanced things, such as providing memory specifically allocated on high-bandwidth memory or on the stack.
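
To make that concrete, a sketch of what user-provided memory could look like with the standard pmr machinery (search_impl is again a made-up stand-in, not SeqAn3 API):

#include <array>
#include <cstddef>
#include <memory_resource>
#include <vector>

// Hypothetical: the algorithm sees only a memory_resource and builds its
// configuration-dependent internal buffers on top of it.
void search_impl(std::pmr::memory_resource * resource)
{
    std::pmr::vector<std::size_t> positions{resource}; // concrete type stays internal
    positions.push_back(42);                           // placeholder for the actual work
}

int main()
{
    // The user hands over plain chunks of std::byte, here stack memory;
    // null_memory_resource() as upstream guarantees nothing else is used.
    std::array<std::byte, 1 << 16> stack_memory;
    std::pmr::monotonic_buffer_resource resource{stack_memory.data(), stack_memory.size(),
                                                 std::pmr::null_memory_resource()};

    search_impl(&resource);
}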

@h-2 (Member) commented Aug 10, 2020

It's true that the buffers might depend on the configuration at some point. My initial response would be to just put all potentially usable buffers into seqan3::search_buffers. It doesn't matter if some members aren't used by some configurations.

While the approach you suggest is powerful, it sounds like it will add complexity to the simple use case. And while some algorithms might profit from specialised storage, the things we are talking about for search are small, simple objects that need to be in main memory (AFAICT).

An alternative with a non-generic buffer type could be exposing the buffers type as a member typedef of the config, but this is already suboptimal usability-wise from my POV:

seqan3::configuration cfg = seqan3::search_cfg::max_error_total{zero_errors} |
                            seqan3::search_cfg::max_error_substitution{zero_errors} |
                            seqan3::search_cfg::max_error_insertion{zero_errors} |
                            seqan3::search_cfg::max_error_deletion{zero_errors} |
                            seqan3::search_cfg::output_query_id |
                            seqan3::search_cfg::output_reference_id |
                            seqan3::search_cfg::output_reference_begin_position |
                            seqan3::search_cfg::hit_all;

decltype(cfg)::buffers_type buffers;

cfg.set_buffers(buffers);

// after that, search normally with the config

edit: let's talk about this on a call sometime.

@marehr marehr added the spike label Sep 28, 2020
@marehr (Member) commented Sep 28, 2020

2020-09-28

We briefly discussed this, and we think this is a spike that needs a prototype before it can be discussed further.

@eseiler eseiler removed the spike label Oct 20, 2022
@smehringer smehringer changed the title Use a buffer to cache the intermediate results between search invocations such that memory allocations are reduced on multiple invocations of search. [Search] Use a buffer to cache the intermediate results between search invocations such that memory allocations are reduced on multiple invocations of search. Oct 20, 2022