
Major compaction: flush commitlog by forcing new active segment and flushing all tables #15820

Merged

Conversation

bhalevy
Member

@bhalevy bhalevy commented Oct 24, 2023

Major compaction already flushes each table to make
sure it considers any mutations that are present in the
memtable for the purpose of tombstone purging.
See 64ec1c6

However, tombstone purging may be inhibited by data
in commitlog segments based on gc_time_min in the
tombstone_gc_state (See f42eb4d).

Flushing all tables in the database releases
all references to commitlog segments and thereby
maximizes the potential for tombstone purging,
which is typically the reason for running major compaction.

However, flushing all tables too frequently might
result in tiny sstables. Since when major-compacting all
keyspaces using nodetool compact the force_keyspace_compaction
api is invoked for each keyspace successively, we need a mechanism
to prevent too-frequent flushes by major compaction.

Hence a compaction_flush_all_tables_before_major_seconds interval
configuration option is added (defaults to 24 hours).

In the case that not all tables are flushed prior
to major compaction, we revert to the old behavior of
flushing each table in the keyspace before major-compacting it.
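
For illustration, a minimal standalone sketch of the interval throttle described above; the class and method names are hypothetical, and the real logic lives in the compaction code driven by compaction_flush_all_tables_before_major_seconds:

```cpp
#include <chrono>

// Hypothetical stand-in for the throttle controlled by
// compaction_flush_all_tables_before_major_seconds (default 86400 s).
class flush_all_tables_throttle {
    std::chrono::seconds _min_interval;
    std::chrono::steady_clock::time_point _last_flush_all{}; // default == "never flushed yet"
public:
    explicit flush_all_tables_throttle(std::chrono::seconds min_interval)
        : _min_interval(min_interval) {}

    // Returns true if all tables should be flushed before this major compaction,
    // i.e. the interval elapsed (or no database-wide flush happened yet).
    // Returning false means we fall back to flushing only the tables being major-compacted.
    bool should_flush_all_tables() {
        if (_min_interval == std::chrono::seconds::zero()) {
            return false; // 0 disables the automatic flush-all
        }
        auto now = std::chrono::steady_clock::now();
        bool never_flushed = (_last_flush_all == std::chrono::steady_clock::time_point{});
        if (never_flushed || now - _last_flush_all >= _min_interval) {
            _last_flush_all = now;
            return true;
        }
        return false;
    }
};
```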

Fixes #15777

@bhalevy bhalevy requested a review from nyh as a code owner October 24, 2023 06:06
@bhalevy bhalevy requested review from raphaelsc and elcallio and removed request for nyh October 24, 2023 06:06
@bhalevy bhalevy force-pushed the major-compaction-flush-commitlog branch from 0b602d7 to 7691e07 Compare October 24, 2023 06:08
Contributor

@elcallio elcallio left a comment

If we are ok with the additional overhead, this is a neat way to more or less ensure behaviour akin to before.

@scylladb-promoter
Contributor

🔴 CI State: FAILURE

✅ - Build
❌ - Sanity Tests
✅ - Unit Tests

Failed Tests (1/21010):

Build Details:

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Sanity Tests
✅ - Unit Tests

Build Details:

@bhalevy
Member Author

bhalevy commented Oct 31, 2023

@avikivity ok to merge?

@raphaelsc
Member

Without reading through the discussion, cannot one argue that minor is affected by the same problem? And that's something that should be dealt with by commitlog itself? IIRC, Calle's patch aims at that, by narrowing the purgeable timestamp range on a per table basis.

@raphaelsc
Member

Just a note: one of the side effects of early flushing is increased write amplification, and the impact of course depends on the frequency of majors.

@elcallio
Contributor

I have no idea if a minor compaction is allowed to GC tombstones... :-/
The original patch (f42eb4d) will lock out any compaction where CL could cause resurrection.
The goal here is to restore/preserve GC behaviour, mainly for tests. Another approach (as previously discussed) is forcing tests to do manual nodetool flush (which nowadays will also try to purge CL).

@avikivity
Member

minor compaction will gc tombstones, just not those that potentially cover data in commitlog segments pinned by other tables

@raphaelsc
Member

Do we have a concept of TTL for active segments? I am really worried about the new logic backfiring when tables with either very low or very high write activity co-exist. Then they could share CL segments, and tables with high write activity could suffer by not being able to GC tombstones until the tables with very low write activity finally release the segments. I also wonder if the simplest solution was indeed the best we had on the table. A tombstone-like record could also work without introducing this cross-table dependency for tombstone GC efficiency. But maybe the problem I mentioned above is already solved, in which case my concern is not valid.

@elcallio
Contributor

elcallio commented Nov 1, 2023

@raphaelsc : Segments don't have any concept of TTL as such, but the previous patch added keeping track of the lowest mutation timestamp per segment, and retrieving it on compaction. Indeed, segments kept alive by tables whose memtables are not flushed could prevent GC this way, which in a sense is an argument for the above change here, forcing global flushes etc. on full compaction. Btw, perhaps you mean tablets, not tables, since I've previously suggested keeping CL per tablet, not per CF, in that world...

Not sure if f42eb4d is the easiest solution, but it probably is the safest/lowest overall impact one. I.e. does not add all that much work, though again, in pathological cases it can prevent GC somewhat...

What do you mean by "tombstone-like record"?

Note that we could alleviate the problem by making the tracking of sstable RP:s better/actually working, but any compaction across shards prevents that. Do we ever compact sstables created during a node's current run (uptime) across shards? If not, just tracking the highest RP originating in the current node run into the resulting sstable(s) would allow us to effectively eliminate any segments from replay based on the sstable set.

@avikivity
Member

Do we have a concept of TTL for active segments? I am really worried about the new logic backfiring when tables with either very low or very high write activity co-exist.

If there are tables with high write activity then they will recycle commitlog quickly.

We have about 8GB of commitlog per shard in average configurations. At a 2MB/s per-shard commitlog write rate (relatively low) we recycle all of the commitlog in ~4000s ≈ 1h. That's much lower than reasonable repair intervals or gc_grace_seconds.

Then they could share CL segments, and tables with high write activity could suffer by not being able to GC tombstones until the tables with very low write activity finally release the segments. I also wonder if the simplest solution was indeed the best we had on the table. A tombstone-like record could also work without introducing this cross-table dependency for tombstone GC efficiency. But maybe the problem I mentioned above is already solved, in which case my concern is not valid.
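
For reference, a quick back-of-the-envelope check of the recycle-time estimate above, using the quoted ~8GB per-shard commitlog and ~2MB/s per-shard write rate:

```cpp
#include <cstdio>

int main() {
    constexpr double commitlog_bytes = 8.0 * 1024 * 1024 * 1024; // ~8 GB of commitlog per shard
    constexpr double write_rate_bps  = 2.0 * 1024 * 1024;        // ~2 MB/s per-shard write rate
    double seconds = commitlog_bytes / write_rate_bps;           // = 4096 s
    std::printf("full commitlog recycled in ~%.0f s (~%.1f h)\n", seconds, seconds / 3600.0);
}
```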

@bhalevy
Member Author

bhalevy commented Nov 2, 2023

Do we have a concept of TTL for active segments? I am really worried about the new logic backfiring when tables with either very low or very high write activity co-exist.

If there are tables with high write activity then they will recycle commitlog quickly.

But the slow tables will still hold up CL segments from recycling.

Maybe we need a TTL on the memtables instead so they are flushed after an idle timeout (e.g. 1 hour) even if the table has infrequent writes.

@raphaelsc
Member

Do we have a concept of TTL for active segments? I am really worried about the new logic backfiring when tables with either very low or very high write activity co-exist.

If there are tables with high write activity then they will recycle commitlog quickly.

But the slow tables will still hold up CL segments from recycling.

Maybe we need a TTL on the memtables instead so they are flushed after an idle timeout (e.g. 1 hour) even if the table has infrequent writes.

Exactly. We can't assume write frequency. There are tables that are almost idle and could affect other co-existing tables that are tombstone-heavy. One of the things I thought of yesterday, later on, was indeed flushing stale memtables. We have to handle it, or otherwise users with this scenario might potentially face a regression.

@bhalevy
Member Author

bhalevy commented Nov 2, 2023

With #15920 covering the regular-compaction path, I think we can merge this PR as is.

@bhalevy
Member Author

bhalevy commented Nov 8, 2023

@avikivity, @denesb ok to merge this PR?

@denesb
Contributor

denesb commented Nov 8, 2023

@avikivity, @denesb ok to merge this PR?

Fine by me.

@mykaul
Contributor

mykaul commented Nov 9, 2023

@avikivity - please review.

@bhalevy bhalevy force-pushed the major-compaction-flush-commitlog branch from 7691e07 to c0a67c6 Compare November 9, 2023 13:33
std::optional<flush_mode> fm = std::nullopt,
seastar::condition_variable* cv = nullptr,
tasks::task_manager::task_ptr* current_task = nullptr) noexcept
: major_compaction_task_impl(module, tasks::task_id::create_random_id(), module->new_sequence_number(), "keyspace", std::move(keyspace), "", "", parent_id,
Contributor

Consider using parent_id ? 0 : module->new_sequence_number() so as not to waste a new sequence number on a child task (which is going to override it anyway).

Member Author

fixed in v7

@@ -84,6 +84,14 @@
"type":"string",
"paramType":"path"
},
{
"name":"flush_memtables",
"description":"Controls flushing of memtables before compaction (true by default). Set to \"false\" to skip automatic flushing of memtables before compaction, e.g. when the table is flushed explicitly invoking the compaction api.",
Contributor

I don't understand this explanation. What do you mean "e.g., when the table is flushed explicitly invoking the compaction api"? Did you forget the word before? I.e., "when the table was flushed explicitly before invoking the compaction api"?

In general, I don't like the phenomenon that recently we're adding too many obscure options to the REST API which solve in a manual way some very specific problems that nobody will remember in a year (or tomorrow) that even need to be solved. If these are important as manual options, they should be clearly documented in a document - not just one line. If they are not important as manual options, they should have been automatic.

Contributor

I see below that you have another copy of this JSON, and it has exactly the word "before" I was missing.
Please choose the best text for both.

Member Author

sure, will add "before".
The option is not needed if scylla nodetool compact is used, then everything is done automatically and efficiently.

Member Author

fixed in v7



@pytest.mark.parametrize("flush", ("true", "false"))
def test_keyspace_flush(nodetool, scylla_only, flush):
Contributor

It's not clear what a "keyspace flush" actually tests. It doesn't test "nodetool flush"... Please add a comment explaining that it actually tests a specific parameter of "nodetool compact". Maybe also use "compact" in the test's name.
And please also explain why it's a Scylla-only test, because it tests a nodetool option "--flush-memtables" that didn't exist in Cassandra's nodetool.

Member Author

It's not clear what a "keyspace flush" actually tests. It doesn't test "nodetool flush"... Please add a comment explaining that it actually tests a specific parameter of "nodetool compact".

I'll rename the function to mention flush_memtables_option.

Maybe also use "compact" in the test's name.

test_compaction.py is all about scylla nodetool compact and it's not mentioned in any other test case.
I think that given the mention of the specific option in the function name, this isn't required.

And please also explain why it's a Scylla-only test, because it tests a nodetool option "--flush-memtables" that didn't exist in Cassandra's nodetool.

No problem.

Member Author

done in v7

, compaction_flush_all_tables_before_major_seconds(this, "compaction_flush_all_tables_before_major_seconds", value_status::Used, 86400,
"Set the minimum interval in seconds between flushing all tables before each major compaction (default is 86400). "
"This option is useful for maximizing tombstone garbage collection by releasing all active commitlog segments. "
"Set to 0 to disable automatic flushing all tables before major compaction")
Contributor

I didn't understand this description. The default is 86400 seconds, i.e., a day. What does it mean? If I do a "nodetool compact", will it or won't it flush the tables? What does "a day" do here - does it mean that if the previous nodetool compact happened more than a day ago vs. less than a day ago, the behavior is different?

Member Author

Yes, a flush of all tables happens at most once a day by default since it is expensive and may create small sstables due to excessive memtable flushing.
The reason is to deal with the user / management tool invoking the keyspace_compaction api repeatedly, for each keyspace. It's sufficient to flush all tables only once, on the first invocation.
If more than a day has passed since the last flush, then all tables would be flushed again.

BTW, this is related to #15971, which will help flush idle memtables in a timely manner; with it, this mechanism here should skip tables that were already flushed in the last 24 hours, but we're not yet committed to it.

# 4 Flush it into sstable
resp = rest_api.send("POST", f"storage_service/keyspace_flush/{test_keyspace}")
# 4 Trigger major compaction on that sstable (flushes memtable by default)
resp = rest_api.send("POST", f"column_family/major_compaction/{test_keyspace}:{test_table}")
Contributor

I don't understand something here.... It seems you are assuming that major compaction will flush the memtable. But I see two patches later that there is some 1-day threshold that may avoid this flush. So is this flush guaranteed? Was it always guaranteed, also before this patch, and the flush step was always unnecessary?

Member Author

The addition in this series is the flush of all tables in the database before major compaction.
Flushing the particular tables that are major compacted was already done before (in all supported versions), and is still done in the case that flushing all db tables is skipped due to the 1-day threshold (but not if the flush_memtables=false option is given).
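
A hypothetical sketch of that decision, with made-up function and parameter names, just to make the combinations explicit:

```cpp
// Flush the individual table right before major-compacting it only when the
// database-wide flush was skipped (1-day threshold) and the caller did not
// already flush externally (flush_memtables=false).
bool should_flush_table_before_major(bool flushed_all_tables, bool flush_memtables_option) {
    if (!flush_memtables_option) {
        return false; // caller flushed explicitly, e.g. `nodetool flush` before `nodetool compact`
    }
    return !flushed_all_tables; // old behavior: flush each table being major-compacted
}
```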

"operations":[
{
"method":"POST",
"summary":"Flush all memtables in all keyspaces.",
Contributor

It is sad that the real explanation of why this API exists is hidden in a commit message and not visible to the users. We need to start investing more effort into documentation (for users) and comments (for developers) than into commit messages :-(

Member Author

The functionality has existed forever at the nodetool level; the new api is there to allow nodetool/jmx to implement it in a more efficient way. The api level is not exposed to the end user, but rather to field engineering and to power admins. The summary text here can provide some additional information about the call's usage, but I'm not sure the motivation for it is necessarily required here.

@pytest.fixture(scope="function")
def cassandra_only(request):
if request.config.getoption("nodetool") != "cassandra":
pytest.skip('Cassandra-only test skipped')
Contributor

Ideally, such a fixture should not exist:

  1. If this is a cassandra feature that should be in Scylla, it's a Scylla bug, and it should be marked "@xfail", not "cassandra_only".
  2. If this is a cassandra feature that should never be in Scylla, we shouldn't have a test for it - we're not writing a test suite for Cassandra.

Contributor

Ok, I understand better now why you added cassandra_only - because you wanted the Java ("cassandra") nodetool to behave differently than the C++ one. More on this below.

Member Author

yes, it's not a cassandra-only feature, but rather a cassandra-specific implementation detail that is tested (in a grey-box way)

def test_flush_all_tables(nodetool, scylla_only):
nodetool("flush", expected_requests=[
expected_request("POST", "/storage_service/flush")
])
Contributor

Something very strange is happening here, and I'd like @denesb 's opinion.
You decided to change the behavior of "nodetool flush" (to call some new Scylla-only API), but decided to do this only in the new nodetool command that nobody uses yet - and not in the Java nodetool. And these tests verify that this is the case. But why? Why should the C++ nodetool behave differently from the Java one? The whole point of this test file is to verify that it doesn't. At least not yet (I'm sure that in the future we'll stop adding features to the old nodetool, but have we reached that state?)

Member Author

@nyh it doesn't change the "black-box" behavior from the user's perspective.
It achieves the same goal for bare nodetool flush in either nodetool implementation.
It is a mistake to have scylla_nodetool just mimic the exact api calls that happen to be made in the Java nodetool implementation over scylla-jmx, as it inhibits progress.
If the semantics of the operation are indistinguishable I see no reason not to evolve the underlying implementation.

Contributor

I agree; as long as there are no semantic differences, the two implementations can diverge.
For the most part, there is no reason to diverge, but when a more optimal set of API endpoints is available, there is no reason not to use it.
I suspect we will see more and more changes like this, and slowly, even semantic differences will creep in. The reason is that updating the Java nodetool is such a chore that everybody avoids it as much as they can.

"operations":[
{
"method":"POST",
"summary":"Forces major compaction in all keyspaces",
Contributor

To be honest, I don't understand when this API is ever useful. Maybe just when you have exactly one big table?
I remember in the past we tried to claim that users should never explicitly call major compaction at all - that it just ruins performance. So now we're giving them a way to do a major compaction of all tables at once?

Member Author

In dtest (which represents the users' mindset), bare nodetool compact, which causes major compaction of all keyspaces, is used extensively, and naive users may also invoke it sparingly.
Unfortunately in many use cases major compaction is still required once in a while to prevent accumulation of deleted data and tombstones, though in this case I admit the usage may typically be needed for a specific table or keyspace.
In any case, the API is there also to help field-engineering in extreme corner cases. Having an API for something doesn't mean we're encouraging anyone to use it on a regular basis. It's a power-tool that should be used with care, understanding what one is doing.

@bhalevy bhalevy force-pushed the major-compaction-flush-commitlog branch from 1bf4ab4 to b5e2603 Compare November 22, 2023 09:36
@bhalevy
Member Author

bhalevy commented Nov 22, 2023

In v7 (b5e2603):

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Sanity Tests
✅ - Unit Tests

Build Details:

@bhalevy
Member Author

bhalevy commented Nov 27, 2023

rebased

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Sanity Tests
✅ - Unit Tests

Build Details:

@denesb
Contributor

denesb commented Nov 28, 2023

There are conflicts with next, a new rebase will be needed after promotion.

Document the behavior if no keyspace is specified
or no table(s) are specified for a given keyspace.

Fixes scylladb#16032

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The current implementation makes no sense.

Like `nodetool_path`, base the default `jmx_path`
on the assumption that the test is run using, e.g.
```
(cd test/nodetool; pytest --nodetool=cassandra test_compact.py)
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When flushing is done externally, e.g. by running
`nodetool flush` prior to `nodetool compact`,
flush_memtables=false can be passed to skip flushing
of tables right before they are major-compacted.

This is useful to prevent creation of small sstables
due to excessive memtable flushing.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Flush all tables after forcing a new active segment
(force_new_active_segment) of the commitlog, to make sure
all commitlog segments can get recycled.

Otherwise, due to "false sharing", rarely-written tables
might inhibit recycling of the commitlog segments they reference.

After f42eb4d,
that would prevent compaction from purging some tombstones
based on the min_gc_time.

To be used in the next patch by major compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
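
As an illustration of the flow just described, a standalone sketch using stand-in types (not Scylla's real classes): force a new active segment once, then flush every table so previously referenced segments can be recycled.

```cpp
#include <vector>

// Stand-in for the commitlog: forcing a new active segment closes the current
// segment so it can be recycled once nothing references it anymore.
struct commitlog_stub {
    void force_new_active_segment() { /* close the current segment, open a new one */ }
};

// Stand-in for a table: flushing its memtable releases the table's references
// to commitlog segments.
struct table_stub {
    bool dirty = true;
    void flush() { dirty = false; }
};

struct database_stub {
    commitlog_stub log;
    std::vector<table_stub> tables; // all tables in all keyspaces

    void flush_all_tables() {
        log.force_new_active_segment(); // done once for the whole database
        for (auto& t : tables) {
            t.flush();                  // every table then drops its segment references
        }
    }
};
```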
Major compaction already flushes each table to make
sure it considers any mutations that are present in the
memtable for the purpose of tombstone purging.
See 64ec1c6

However, tombstone purging may be inhibited by data
in commitlog segments based on `gc_time_min` in the
`tombstone_gc_state` (See f42eb4d).

Flushing all tables in the database releases
all references to commitlog segments and thereby
maximizes the potential for tombstone purging,
which is typically the reason for running major compaction.

However, flushing all tables too frequently might
result in tiny sstables.  Since when major-compacting all
keyspaces using `nodetool compact` the `force_keyspace_compaction`
api is invoked for each keyspace successively, we need a mechanism
to prevent too-frequent flushes by major compaction.

Hence a `compaction_flush_all_tables_before_major_seconds` interval
configuration option is added (defaults to 24 hours).

In the case that not all tables are flushed prior
to major compaction, we revert to the old behavior of
flushing each table in the keyspace before major-compacting it.

Fixes scylladb#15777

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For flushing all tables in the database.
The advantage of this api is that `commitlog->force_new_active_segment`
happens only once in `database::flush_all_tables` rather than
once per keyspace (when `nodetool flush` translates to
a sequence of `/storage_service/keyspace_flush` calls).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For major compacting all tables in the database.
The advantage of this api is that `commitlog->force_new_active_segment`
happens only once in `database::flush_all_tables` rather than
once per keyspace (when `nodetool compact` translates to
a sequence of `/storage_service/keyspace_compaction` calls).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It looks like `nodetool compact standard1` is meant
to show how to compact a specified table, not a keyspace.
Note that the previous example line is for a keyspace.
So fix the table compaction example to:
`nodetool compact keyspace1 standard1`

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Provide 3 examples, like in the nodetool/compact page:
global, per-keyspace, per-table.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
@bhalevy bhalevy force-pushed the major-compaction-flush-commitlog branch from 19f529e to 310ff20 Compare November 28, 2023 14:50
@bhalevy
Copy link
Member Author

bhalevy commented Nov 28, 2023

In v8 (310ff20):

  • rebased + retested
  • docs: nodetool: compact: fix example
  • docs: nodetool: flush: enrich examples

@annastuchlik can you please review the last 2 patches?
Sorry for the last-minute changes.

Collaborator

@annastuchlik annastuchlik left a comment

The doc updates look good.

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Sanity Tests
✅ - Unit Tests

Build Details:

@bhalevy
Member Author

bhalevy commented Nov 29, 2023

@denesb / @scylladb/scylla-maint please queue

@scylladb-promoter scylladb-promoter merged commit 3ed6925 into scylladb:master Nov 29, 2023
4 checks passed
denesb added a commit that referenced this pull request Jan 16, 2024
… active segment and flushing all tables' from Kefu Chai

Major compaction already flushes each table to make
sure it considers any mutations that are present in the
memtable for the purpose of tombstone purging.
See 64ec1c6

However, tombstone purging may be inhibited by data
in commitlog segments based on `gc_time_min` in the
`tombstone_gc_state` (See f42eb4d).

Flushing all tables in the database releases
all references to commitlog segments and thereby
maximizes the potential for tombstone purging,
which is typically the reason for running major compaction.

However, flushing all tables too frequently might
result in tiny sstables.  Since when major-compacting all
keyspaces using `nodetool flush` the `force_keyspace_compaction`
api is invoked for each keyspace successively, we need a mechanism
to prevent too-frequent flushes by major compaction.

Hence a `compaction_flush_all_tables_before_major_seconds` interval
configuration option is added (defaults to 24 hours).

In the case that not all tables are flushed prior
to major compaction, we revert to the old behavior of
flushing each table in the keyspace before major-compacting it.

Fixes #15777

Closes #15820

To address the conflict, the following change is also included in this changeset:

tools/scylla-nodetool: implement the cleanup command

The --jobs command-line argument is accepted but ignored, just like the
current nodetool does.

Refs: #15588

Closes #16756

* github.com:scylladb/scylladb:
  docs: nodetool: flush: enrich examples
  docs: nodetool: compact: fix example
  api: add /storage_service/compact
  api: add /storage_service/flush
  tools/scylla-nodetool: implement the flush command
  compaction_manager: flush_all_tables before major compaction
  database: add flush_all_tables
  api: compaction: add flush_memtables option
  test/nodetool: jmx: fix path to scripts/scylla-jmx
  scylla-nodetool, docs: improve optional params documentation
  tools/scylla-nodetool: extract keyspace/table parsing
  tools/scylla-nodetool: implement the cleanup command
  test/nodetool: rest_api_mock: add more options for multiple requests