Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reader_concurrency_semaphore: handle read blocked on memory being registered as inactive #12777

Merged
merged 19 commits into from
Mar 16, 2023

Conversation

denesb
Copy link
Contributor

@denesb denesb commented Feb 8, 2023

A read that requested memory and has to wait for it can be registered as inactive. This can happen for example if the memory request originated from a background I/O operation (a read-ahead maybe).
Handling this case is currently very difficult. What we want to do is evict such a read on-the-spot: the fact that there is a read waiting on memory means memory is in demand and so inactive reads should be evicted. To evict this reader, we'd first have to remove it from the memory wait list, which is almost impossible currently, because expiring_fifo<>, the type used for the wait list, doesn't allow for that. So in this PR we set out to make this possible first, by transforming all current queues to be intrusive lists of permits. Permits are already linked into an intrusive list, to allow for enumerating all existing permits. We use these existing hooks to link the permits into the appropriate queue, and back to _permit_list when they are not in any special queue. To make this possible we first have to make all lists store naked permits, moving all auxiliary data fields currently stored in wrappers like entry into the permit itself. With this, all queues and lists in the semaphore are intrusive lists, storing permits directly, which has the following implications:

  • queues no longer take extra memory, as all of them are intrusive
  • permits are completely self-sufficient w.r.t to queuing: code can queue or dequeue permits just with a reference to a permit at hand, no other wrapper, iterator, pointer, etc. is necessary.
  • queues don't keep permits alive anymore; destroying a permit will automatically unlink it from the respective queue, although this might lead to use-after-free. Not a problem in practice, only one code-path (reader_concurrenc_semaphore::with_permit()) had to be adjusted.

After all that extensive preparations, we can now handle the case of evicting a reader which is queued on memory.

Fixes: #12700

@denesb denesb requested a review from tgrabiec as a code owner February 8, 2023 10:12
@denesb
Copy link
Contributor Author

denesb commented Feb 8, 2023

First two commits are from #12756, will rebase them away, once said PR is merged.

@scylladb-promoter
Copy link
Contributor

@scylladb-promoter
Copy link
Contributor

@avikivity
Copy link
Member

Before reviewing, the gist is to convert a chain of futures to a chain of objects? If so that's pretty much what we've been doing for years while taming those queues.

@denesb
Copy link
Contributor Author

denesb commented Feb 9, 2023

Before reviewing, the gist is to convert a chain of futures to a chain of objects? If so that's pretty much what we've been doing for years while taming those queues.

I did not change how waiting work. What I did was throw out expiring_fifo<entry> and queue<entry> and replace them with boost::intrusive::list<reader_permit>.

@denesb
Copy link
Contributor Author

denesb commented Feb 9, 2023

v1.1:

  • Fixed scylla-gdb.py.

@denesb
Copy link
Contributor Author

denesb commented Feb 9, 2023

v1.2:

  • Fix use-after-free while evicting all inactive reads for a table.
  • Bonus cosmetic commit.

@scylladb-promoter
Copy link
Contributor

mutation_fragment.cc Outdated Show resolved Hide resolved
@avikivity
Copy link
Member

Looks good. I think the move to intrusive_list was worthwhile even without fixing the bug, reader_concurrency_semaphore is central enough that we can use customized data types for it.

@denesb
Copy link
Contributor Author

denesb commented Feb 17, 2023

Rebased (but #12756 is still not merged).

@scylladb-promoter
Copy link
Contributor

…oexcept

It is in fact noexcept and so it is expected to be, so document this.
It already is conceptually, as it passes const references to the permits
it iterates over. The only reason it wasn't const before is a technical
issue which is solved here with a const_cast.
… permit list param

This param is from a time when _permit_list was not accessible from the
outside, so it was passed along the semaphore instance to avoid making
the diagnostics methods friends.
To allow the semaphore freedom in how permits are stored, the
diagnostics code is instead made to use foreach_permit(), instead of
accessing the underlying list directly.
As the diagnostics code wants reader_permit::impl& directly, a new
variant of foreach_permit() passing impl references is introduced.
Instead of having callers use get_timeout(), then compare it against the
current time, set up a timeout timer in the permit, which assigned a new
`_ex` member (a `std::exception_ptr`) to the appropriate exception type
when it fires.
Callers can now just poll check_abort() which will throw when `_ex`
is not null. This is more natural and allows for more general reasons
for aborting reads in the future.
This prepares the ground for timeouts being managed inside the permit,
instead of by the semaphore. Including timing out while in a wait queue.
Use it to keep track of all permits that are currently waiting on
something: admission, memory or execution.
Currently we keep track of size, by adding up the result of size() of
the various queues. In future patches we are going to change the queues
such that they will not have constant time size anymore, move to an
explicit counter in preperation to that.
Another change this commit makes is to also include ready list entries
in this counter. Permits in the ready list are also waiters, they wait
to be executed. Soon we will have a separate wait state for this too.
There is now a field in stats with the same information, use that.
Currently the reader_permit has some private methods that only the
semaphore's internal calls. But this method of communication is not
consistent, other times the semaphore accesses the permit impl directly,
calling methods on that.
This commit introduces operator * and -> for reader_permit. With this,
the semaphore internals always call the reader_permit::impl methods
direcly, either via a direct reference, or via the above operators.
This makes the permit internface a little narrower and reduces
boilerplate code.
Instead of the `entry` wrapper. In _wait_list and _ready_list, that is.
Data stored in the `entry` wrapper is moved to a new
`reader_permit::auxiliary_data` type. This makes the reader permit
self-sufficient. This in turn prepares the ground for the ability to
de-queue a permit from any queue, with nothing but a permit reference at
hand: no need to have back pointer to wrappers and/or iterators.
They will soon depend on the definition of the reader_permit::impl,
which is only available in the .cc file.
Instead of using expiring_fifo to store queued permits, use the same
intrusive list mechanism we use to keep track of all permits.
Permits are now moved between the _permit_list and the wait queues,
depending on which state they are in. This means _permit_list is now not
the definitive list containing all permits, instead it is the list
containing all permits that are not in a more specialized queue at the
moment.
Code wishing to iterate over all permits should now use
foreach_permits(). For outside code, this was already the only way and
internal users are already patched.
Making the wait lists intrusive allows us to dequeue a permit from any
position, with nothing but a permit reference at hand. It also means
the wait queues don't have any additional memory requirements, other
than the memory for the permit itself.
Timeout while being queued is now handled by the permit's on_timeout()
callback.
Used while the permit is in the _ready_list, waiting for the execution
loop to pick it up. This just acknowledging the existence of this
wait-state. This state will now show up in permit diagnostics printouts
and we can now determine whether a permit is waiting for execution,
without checking which queue it is in.
@denesb
Copy link
Contributor Author

denesb commented Mar 9, 2023

New in v2:

  • Rebased on top of now merged dependencies.
  • Use a naked std::exception_ptr instead of abort_source for aborting permits (via timeout).
  • Remove default clause in dequeue_permit().
  • Use condition_variable::when() instead of condition_variable::wait().
  • use std::optional instead of std::unique_ptr in auxiliary_data::ir.

@denesb
Copy link
Contributor Author

denesb commented Mar 9, 2023

./build/release/scylla perf-simple-query --duration=30 -c1 -m2G

Before

median 145442.20 tps ( 60.1 allocs/op,  12.1 tasks/op,   42956 insns/op,        0 errors)
median absolute deviation: 109.61
maximum: 145831.52
minimum: 144404.47

After:

median 149573.43 tps ( 61.1 allocs/op,  13.1 tasks/op,   42920 insns/op,        0 errors)
median absolute deviation: 698.79
maximum: 150463.39
minimum: 141666.46

Hmm, looks like we improved on on instructions and therefore on tps but degraded on tasks/op (and therefore allocas/op).

@denesb
Copy link
Contributor Author

denesb commented Mar 9, 2023

I think I know where the extra continuations is coming from, I will look into getting rid of it.

@scylladb-promoter
Copy link
Contributor

@denesb
Copy link
Contributor Author

denesb commented Mar 10, 2023

v3:

  • Get rid of the extra allocation (remove continuation from with_permit()).
./build/release/scylla perf-simple-query --duration=30 -c1 -m2G

Before:

median 145341.99 tps ( 60.1 allocs/op,  12.1 tasks/op,   42949 insns/op,        0 errors)
median absolute deviation: 137.63
maximum: 145790.51
minimum: 143386.36

After:

median 148466.67 tps ( 60.1 allocs/op,  12.1 tasks/op,   42718 insns/op,        0 errors)
median absolute deviation: 145.02
maximum: 148742.83
minimum: 147603.15

@scylladb-promoter
Copy link
Contributor

@avikivity
Copy link
Member

Now we improved on instructions but degraded on passing the tests.

Following the same scheme we used to make the wait lists intrusive.
Permits are added to the ready list intrusive list while waiting to be
executed and moved back to the _permit_list when de-queued from this
list.
We now use a conditional variable for signaling when there are permits
ready to be executed.
They will soon need to access reader_permit::impl internals, only
available in the .cc file.
Add an member of type `inactive_read` to reader permit, and store permit
instances in `_inactive_reads`. This list is now just another intrusive
list the permit can be linked into, depending on its state.
Inactive read handles now just store a reader permit pointer.
It is not used in the header anymore and moving it to the .cc allows us
to remove the dependency on flat_mutation_reader_v2.hh.
If the read is inactive when the timeout clock fires, evict it.
Now that `_inactive_reads` just store permits, we can do this easily.
A mostly cosmetic change. Also add a comment mentioning that this is the
catch-all list.
…g inactive

Kill said read's memory requests with std::bad_alloc and dequeue it from
the memory wait list, then evict it on the spot.
Now that `_inactive_reads` just store permits, we can do this easily.
@denesb
Copy link
Contributor Author

denesb commented Mar 13, 2023

v4:

  • Fix use-after-free in on_timeout(): instead of clearing the keepalive field with defer(), just move it onto the stack of said method.

@scylladb-promoter
Copy link
Contributor

@denesb
Copy link
Contributor Author

denesb commented Mar 15, 2023

Ping.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

reader_concurrency_semaphore: background reads can trigger assert failure when the reader is made inactive
3 participants