Soft-pressure memtable flusher picks unflushable memtable lists even though flushable ones exist #3716

Closed
tgrabiec opened this Issue Aug 22, 2018 · 5 comments

tgrabiec commented Aug 22, 2018

The flusher picks the memtable list that contains the largest region according to region_impl::evictable_occupancy().total_space(), which follows region::occupancy().total_space(). But only the latest memtable in a list can start flushing. It can happen that the memtable corresponding to the largest region was already flushed to an sstable (its flush permit released) but not yet moved to cache, so it is still in the memtable list. The latest memtable in that group can be much smaller, and there could be other lists with larger latest memtables. The flusher should consider only flushable (latest) memtables.

The latest memtable in the winning list may be small (or empty), in which case the soft pressure flusher will not be able to make much progress. This can lead to writes blocking on dirty.

I observed this for the system memtable group, where it's easy for the memtables to overshoot small soft pressure limits. The flusher kept trying to flush empty memtables, while the previous non-empty memtable was still in the group.

The CPU scheduler makes this worse, because it runs memtable_to_cache in a separate scheduling group, which further delays the removal of the flushed memtable from the memtable list.
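
To make the failure mode concrete, below is a minimal C++ sketch of the selection step. The types and names (memtable, memtable_list, total_occupancy, flushable_occupancy, pick_list) are simplified stand-ins, not Scylla's actual classes; the point is only that ranking lists by total occupancy can favor a list whose bulk is an already-flushed memtable, while ranking by the latest (flushable) memtable's occupancy cannot.

```cpp
// Hypothetical, simplified model of the selection step; names and types are
// illustrative, not Scylla's. Each memtable list holds older memtables whose
// flush already started plus one "latest" memtable, which is the only one
// that can start a new flush.
#include <algorithm>
#include <cstddef>
#include <vector>

struct memtable {
    size_t occupancy;     // bytes held by this memtable's region
    bool flush_started;   // flush permit released, sstable written, not yet moved to cache
};

struct memtable_list {
    std::vector<memtable> memtables;   // back() is the latest, flushable one

    // Problematic metric: total occupancy of the whole list, which still counts
    // memtables whose flush already started but which were not yet moved to cache.
    size_t total_occupancy() const {
        size_t sum = 0;
        for (const auto& mt : memtables) { sum += mt.occupancy; }
        return sum;
    }

    // What the soft-pressure flusher can actually act on: only the latest memtable.
    size_t flushable_occupancy() const {
        return memtables.empty() ? 0 : memtables.back().occupancy;
    }
};

// Picks the list with the largest value of the given metric. Ranking by
// total_occupancy() reproduces the bug described above: a list dominated by an
// already-flushed memtable wins even if its latest memtable is tiny or empty.
template <typename Metric>
memtable_list* pick_list(std::vector<memtable_list>& lists, Metric metric) {
    auto it = std::max_element(lists.begin(), lists.end(),
        [&](const memtable_list& a, const memtable_list& b) {
            return metric(a) < metric(b);
        });
    return it == lists.end() ? nullptr : &*it;
}
```

Under this model, the fix that was eventually merged corresponds to making already-flushing memtables contribute zero to the metric the flusher compares (see the commits referenced below).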

glommer commented Aug 22, 2018

Awesome work as usual, Tomek.

Indeed, of all the problems you have opened today, I think this is the most crucial one. We will have poor utilization if we are not able to start flushing a new memtable ASAP.

tzach commented Aug 25, 2018

@tgrabiec @glommer is this a 2.3 regression compared to 2.2?

avikivity added a commit that referenced this issue Aug 26, 2018

database: Avoid OOM when soft pressure but nothing to flush

There could be soft pressure, but the soft-pressure flusher may not be
able to make progress (Refs #3716). It will keep trying to flush empty
memtables, which block until earlier flushes complete, and thus
allocate continuations. Those continuations accumulate in memory and
can cause OOM.

The flush will take longer to complete. Due to scheduling group
isolation, the soft-pressure flusher will keep getting the CPU.

This causes bad_alloc and crashes of dtest:
limits_test.py:TestLimits.max_cells_test

Fixes #3717

Message-Id: <1535102520-23039-1-git-send-email-tgrabiec@scylladb.com>
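
A speculative sketch of the guard this commit message implies, using made-up names (soft_pressure_state, should_start_flush) rather than the actual patch: under soft pressure, only start a flush when some list still has a non-empty flushable memtable; otherwise do nothing, instead of queueing flushes of empty memtables whose continuations pile up behind earlier flushes.

```cpp
// Hypothetical sketch (not the actual patch) of a "nothing to flush" guard.
#include <cstddef>
#include <vector>

struct soft_pressure_state {
    size_t soft_limit = 0;
    size_t dirty_bytes = 0;
    std::vector<size_t> flushable_occupancies;  // occupancy of the latest memtable of each list

    bool should_start_flush() const {
        if (dirty_bytes <= soft_limit) {
            return false;                       // no soft pressure
        }
        for (size_t occ : flushable_occupancies) {
            if (occ > 0) {
                return true;                    // there is real work to flush
            }
        }
        // Only empty memtables remain: starting a flush now would just park
        // another continuation behind earlier flushes and waste memory.
        return false;
    }
};
```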

avikivity added a commit that referenced this issue Aug 26, 2018

database: Make soft-pressure memtable flusher not consider already flushed memtables

The flusher picks the memtable list which contains the largest region
according to region_impl::evictable_occupancy().total_space(), which
follows region::occupancy().total_space(). But only the latest
memtable in the list can start flushing. It can happen that the
memtable corresponding to the largest region was already flushed to an
sstable (flush permit released), but not yet fsynced or moved to
cache, so it's still in the memtable list.

The latest memtable in the winning list may be small, or empty, in
which case the soft pressure flusher will not be able to make much
progress. There could be other memtable lists with non-empty
(flushable) latest memtables. This can lead to writes unnecessarily
blocking on dirty.

I observed this for the system memtable group, where it's easy for the
memtables to overshoot small soft pressure limits. The flusher kept
trying to flush empty memtables, while the previous non-empty memtable
was still in the group.

The CPU scheduler makes this worse, because it runs memtable_to_cache
in a separate scheduling group, so it further defers in time the
removal of the flushed memtable from the memtable list.

This patch fixes the problem by making regions corresponding to
memtables which started flushing report evictable_occupancy() as 0, so
that they're picked by the flusher last.

Fixes #3716.
Message-Id: <1535040132-11153-2-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit 1e50f85)
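
A standalone sketch of the fix idea, again with hypothetical names (region, used_bytes) rather than the real classes: once a memtable's flush has started, its region reports an evictable occupancy of zero, so the list-level metric the flusher compares is dominated by the memtables it can still flush.

```cpp
// Minimal sketch of the fix described above, using illustrative names only.
#include <cstddef>
#include <vector>

struct region {
    size_t used_bytes = 0;
    bool flush_started = false;

    // The fix: a region whose memtable already started flushing reports zero
    // evictable occupancy, so it is picked by the flusher last.
    size_t evictable_occupancy() const {
        return flush_started ? 0 : used_bytes;
    }
};

// The list-level metric the flusher compares is now dominated by the latest,
// still-flushable memtable rather than by already-flushed memtables that are
// waiting to be moved to cache.
size_t evictable_occupancy(const std::vector<region>& list) {
    size_t sum = 0;
    for (const auto& r : list) { sum += r.evictable_occupancy(); }
    return sum;
}
```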

avikivity added a commit that referenced this issue Aug 26, 2018

database: Avoid OOM when soft pressure but nothing to flush
(cherry picked from commit 2afce13)

avikivity added a commit that referenced this issue Aug 26, 2018

database: Make soft-pressure memtable flusher not consider already flushed memtables
(cherry picked from commit 1e50f85)

avikivity added a commit that referenced this issue Aug 26, 2018

database: Avoid OOM when soft pressure but nothing to flush
(cherry picked from commit 2afce13)

tgrabiec commented Aug 27, 2018

@tzach I don't think it's a regression compared to 2.2; the soft-pressure flusher was introduced in 1.6 (f08162e).

I think this issue (edit: the issue I am talking about is #3717) may now be more likely to hit because of the CPU scheduler, which spreads related tasks further apart in time, giving the flusher more time to OOM.

glommer commented Aug 27, 2018

If it generates bad behavior now, whereas before it didn't, it's a regression, regardless of whether the code itself was already there unchanged.

tgrabiec commented Aug 27, 2018

@glommer For the bad behavior related to this we have a separate issue: #3717. The fix for #3717 could be backported without needing to backport the fix for this issue.

syuu1228 added a commit to syuu1228/scylla that referenced this issue Sep 22, 2018

database: Make soft-pressure memtable flusher not consider already flushed memtables

syuu1228 added a commit to syuu1228/scylla that referenced this issue Sep 22, 2018

database: Avoid OOM when soft pressure but nothing to flush