-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Major compaction: flush commitlog by forcing new active segment and flushing all tables #15820
Major compaction: flush commitlog by forcing new active segment and flushing all tables #15820
Conversation
0b602d7
to
7691e07
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are ok with the additional overhead, this is a neat way to more or less ensure behaviour akin to before.
🔴 CI State: FAILURE✅ - Build Failed Tests (1/21010):Build Details:
|
🟢 CI State: SUCCESS✅ - Build Build Details:
|
@avikivity ok to merge? |
Without reading through the discussion, cannot one argue that minor is affected by the same problem? And that's something that should be dealt with by commitlog itself? IIRC, Calle's patch aims at that, by narrowing the purgeable timestamp range on a per table basis. |
Just a note: one of the side affects with early flush is increased write amp, and the impact of course depends on frequency of majors. |
I have no idea if a minor compaction is allowed to GC tombstones... :-/ |
minor compaction will gc tombstones, just not those that potentially cover data in commitlog segments pinned by other tables |
Do we have a concept of TTL for active segments? I am really worried about the new logic backfiring with tables, that is either very low or very high in write activity co-existing. Then they could share CL segments, and tables with high write activity could suffer by not being able to GC tombstones until tables with very low writes finally release the segments. I also wonder if the simplest solution was indeed the best we had on table. A tombstone-like record could also work while not introducing this cross-table dependency for tombstone GC efficiency. But maybe the problem I mentioned above is already solved, then my concern is not valid. |
@raphaelsc : Segments don't have any concept of TTL as such, but the previous patch added keeping track of lowest mutation timestamp per segment, and retrieving it on compaction. Indeed, segments kept alive by tables not being memtable flushed could prevent GC thusly, which in a sense is an argument for the above change here, forcing global flushes etc on full comapction. Btw, perhaps you mean tablets, not tables, since I've previously suggesting keeping CL per tablet, not CF in that world... Not sure if f42eb4d is the easiest solution, but it probably is the safest/lowest overall impact one. I.e. does not add all that much work, though again, in pathological cases it can prevent GC somewhat... What do you mean by "tombstone-like record"? Note that we could alleviate the problem by working on making keeping track of sstable RP:s better/actually working, but any compaction across shards prevents that. Do we ever compact sstables created during a nodes current run (uptime) across shards? If not, just tracking highest RP originating in the current node run into resulting sstable(s) would allow us to effectively |
If there are tables with high write activity then they will recycle commitlog quickly. We have amount 8GB of commitlog per shard in average configurations. AT 2MB/s per shard commitlog write rate (relatively low) we recycle all of commitlog in 4000s = 1h. That's much lower than reasonable repair intervals or gc_grace_seconds.
|
But the slow tables will still hold up CL segments from recycling. Maybe we need a TTL on the memtables instead so they are flushed after an idle timeout (e.g. 1 hour) even if the table has infrequent writes. |
Exactly. We can't assume write frequency. There are tables that are almost idle and could affect other co-existing tables that are tombstone heavy. One of the things I thought yesterday, later, was indeed flushing stale memtables. We have to handle it or otherwise users, with this scenario, might potentially face a regression. |
With #15920 covering the regular-compaction path, I think we can merge this PR as is. |
@avikivity, @denesb ok to merge this PR? |
Fine by me. |
@avikivity - please review. |
7691e07
to
c0a67c6
Compare
compaction/task_manager_module.hh
Outdated
std::optional<flush_mode> fm = std::nullopt, | ||
seastar::condition_variable* cv = nullptr, | ||
tasks::task_manager::task_ptr* current_task = nullptr) noexcept | ||
: major_compaction_task_impl(module, tasks::task_id::create_random_id(), module->new_sequence_number(), "keyspace", std::move(keyspace), "", "", parent_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using parent_id ? 0 : module->new_sequence_number()
not to waste a new sequence number for a child task (which is going to override it anyway)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in v7
api/api-doc/column_family.json
Outdated
@@ -84,6 +84,14 @@ | |||
"type":"string", | |||
"paramType":"path" | |||
}, | |||
{ | |||
"name":"flush_memtables", | |||
"description":"Controls flushing of memtables before compaction (true by default). Set to \"false\" to skip automatic flushing of memtables before compaction, e.g. when the table is flushed explicitly invoking the compaction api.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this explanation. What do you mean "e.g., when the table is flushed explicitly invoking the compaction api"? Did you forget the word before? I.e., "when the table was flushed explicitly before invoking the compacation api"?
In general, I don't like the phenomenon that recently we're adding too many obscure options to the REST API which solve in a manual way some very specific problems that nobody will remember in a year (or tomorrow) that even need to be solved. If these are important as manual options, they should be clearly documented in a document - not just one line. If they are not important as manual options, they should have been automatic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see below that you have another copy of this JSON, and it has exactly the word "before" I was missing.
Please choose the best text for both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, will add "before".
The option is not needed if scylla nodetool compact
is used, then everything is done automatically and efficiently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in v7
test/nodetool/test_compact.py
Outdated
|
||
|
||
@pytest.mark.parametrize("flush", ("true", "false")) | ||
def test_keyspace_flush(nodetool, scylla_only, flush): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear what a "keyspace flush" actually test? It doesn't test "nodetool flush"... Please add a comment explaining that it actually tests a specific parameter of "nodetool compact". Maybe also use "compact" in the test's name.
And please also explain why it's a Scylla-only test, because it tests a nodetool option "--flush-memtables" that didn't exist in Cassandra's nodetool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear what a "keyspace flush" actually test? It doesn't test "nodetool flush"... Please add a comment explaining that it actually tests a specific parameter of "nodetool compact".
I'll rename the function to mention flush_memtables_option
.
Maybe also use "compact" in the test's name.
test_compaction.py is all about scylla nodetool compact
and it's not mentioned in any other test case.
I think that given the mention of the specific option in the function name, this isn't requried.
And please also explain why it's a Scylla-only test, because it tests a nodetool option "--flush-memtables" that didn't exist in Cassandra's nodetool.
No problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in v7
, compaction_flush_all_tables_before_major_seconds(this, "compaction_flush_all_tables_before_major_seconds", value_status::Used, 86400, | ||
"Set the minimum interval in seconds between flushing all tables before each major compaction (default is 86400). " | ||
"This option is useful for maximizing tombstone garbage collection by releasing all active commitlog segments. " | ||
"Set to 0 to disable automatic flushing all tables before major compaction") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't understand this description. The default is 86400 seconds, i.e., a day. What does it mean? If I do a "nodetool compact" it will or won't flush the tables? What does "a day" do here - does it mean that if the previous nodetool compact happened more than a day or less than a day, the behavior is different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, flush of all tables happens at most once a day by default since it is expensive and may create small sstables due to excessive memtable flushing.
The reason is to deal with the user / management tool invoking the keyspace_compaction
api repeatedly, for each keyspace. It's sufficient to flush all tables only once, on the first invocation.
If more than a day passed since last flush then all tables would be flush again.
BTW, this is related to #15971 that will help flush idle memtable in a timely manner, and with it this mechanism here should skip tables that were already flushed in the last 24 hours, but we're not yet committed to it.
# 4 Flush it into sstable | ||
resp = rest_api.send("POST", f"storage_service/keyspace_flush/{test_keyspace}") | ||
# 4 Trigger major compaction on that sstable (flushes memtable by default) | ||
resp = rest_api.send("POST", f"column_family/major_compaction/{test_keyspace}:{test_table}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand something here.... It seems you are assuming that major compaction will flush the memtable. But I see two patches later that there is some 1-day threshold that may avoid this flush. So is this flush guaranteed? Was it always guaranteed, also before this patch, and the flush step was always unnecessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The addition in this series is the flush of all tables in the database before major compaction.
Flushing the particular tables that are major compacted was already done before (in all supported versions), and is still done in the case that flushing all db tables is skipped due to the 1-day threshold (but not if the flush_memtables=false
option is given).
"operations":[ | ||
{ | ||
"method":"POST", | ||
"summary":"Flush all memtables in all keyspaces.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is sad that the real explanation why this API exists is hidden in a commit message and not visible to the users. We need to start investing more efforts into documentation (for users) and comments (for developers), than in commit messages :-(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The functionality exists forever at the nodetool
level, the new api is there to allow nodetool/jmx to implement it in a more efficient way. The api level is not exposed to the end user, but rather to field-engineering and to power-admins. The summary
text here can provide some additional information about the call usage, but I'm not sure the motivation for it is necessarily required here.
@pytest.fixture(scope="function") | ||
def cassandra_only(request): | ||
if request.config.getoption("nodetool") != "cassandra": | ||
pytest.skip('Cassandra-only test skipped') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, such a fixture should not exist:
- If this is a cassandra feature that should be in Scylla, it's a Scylla bug, and it should be marked "@xfail", not "cassandra_only".
- If this is a cassandra feature that should never be a Scylla, we shouldn't have a test for it - we're not writing a test suite for Cassandra.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I understand better now why you added cassandra_only - because you wanted the Java ("cassandra") nodetool to behave differently than the C++ one. More on this below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it's not a cassandra-only feature, but rather cassandra-specific implementation detail that is tested (in a grey-box way)
def test_flush_all_tables(nodetool, scylla_only): | ||
nodetool("flush", expected_requests=[ | ||
expected_request("POST", "/storage_service/flush") | ||
]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something very strange is happening here, and I'd like @denesb 's opinion.
You decided to change the behavior of "nodetool flush" (to call some new Scylla-only API), but decided to do this only in the new nodetool command that nobody uses yet - and not in the Java nodetool. And these tests verify that this is the case. But why? Why should the C++ nodetool behave differently from the Java one? The whole point of this test file is to verify that it doesn't. At least not yet (I'm sure that in the future we'll stop adding features to the old nodetool, but have we reached that state?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nyh it doesn't change the "black-box" behavior from the user's perspective.
It achieves the same goal for bare nodetool flush
in either nodetool
implementations.
It is a mistake to have scylla_nodetool just mimic the exact api calls that happen to happen in the json nodetool implementation over scylla-jmx, as it inhibits progress.
If the semantics of the operation are indistinguishable I see no reason not to evolve the underlying implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, as long as there are no semantical differences, the two implementations can diverge.
For the most part, there is no reason to diverge, but when there are more optimal set of API endpoints available, there is no reason to not use it.
I suspect we will see more and more changes like this, and slowly, even semantic differences will creep in. The reason is that updating the Java nodetool is such a chore, that everybody avoids it as much as they can.
"operations":[ | ||
{ | ||
"method":"POST", | ||
"summary":"Forces major compaction in all keyspaces", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I don't understand when this API is ever useful. Maybe just when you have exactly one big table?
I remember in the past we tried to claim that users should never explicitly call major compaction at all - that it just ruins performance. So now we're giving them a way to do a major compaction of all tables at once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In dtest (that represents users mindset) bare nodetool compact
, that causes major compaction of all keyspaces, is used elaborately, and naive users may also invoke it sparringly.
Unfortunately in many use cases major compaction is still required once in a while to prevent accumulation of deleted data and tombstones, though in this case I admit the usage may typically be needed for a specific table or keyspace.
In any case, the API is there also to help field-engineering in extreme corner cases. Having an API for something doesn't mean we're encouraging anyone to use it on a regular basis. It's a power-tool that should be used with care, understanding what one is doing.
1bf4ab4
to
b5e2603
Compare
In v7 (b5e2603):
|
🟢 CI State: SUCCESS✅ - Build Build Details:
|
b5e2603
to
19f529e
Compare
rebased |
🟢 CI State: SUCCESS✅ - Build Build Details:
|
There are conflicts with next, a new rebase will be needed after promotion. |
Document the behavior if no keyspace is specified or no table(s) are specified for a given keyspace. Fixes scylladb#16032 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The current implementation makes no sense. Like `nodetool_path`, base the default `jmx_path` on the assumption that the test is run using, e.g. ``` (cd test/nodetool; pytest --nodetool=cassandra test_compact.py) ``` Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When flushing is done externally, e.g. by running `nodetool flush` prior to `nodetool compact`, flush_memtables=false can be passed to skip flushing of tables right before they are major-compacted. This is useful to prevent creation of small sstables due to excessive memtable flushing. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Flushes all tables after forcing force_new_active_segment of the commitlog to make sure all commitlog segments can get recycled. Otherwise, due to "false sharing", rarely-written tables might inhibit recycling of the commitlog segments they reference. After f42eb4d, that won't allow compaction to purge some tombstones based on the min_gc_time. To be used in the next patch by major compaction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Major compaction already flushes each table to make sure it considers any mutations that are present in the memtable for the purpose of tombstone purging. See 64ec1c6 However, tombstone purging may be inhibited by data in commitlog segments based on `gc_time_min` in the `tombstone_gc_state` (See f42eb4d). Flushing all sstables in the database release all references to commitlog segments and there it maximizes the potential for tombstone purging, which is typically the reason for running major compaction. However, flushing all tables too frequently might result in tiny sstables. Since when flushing all keyspaces using `nodetool flush` the `force_keyspace_compaction` api is invoked for keyspace successively, we need a mechanism to prevent too frequent flushes by major compaction. Hence a `compaction_flush_all_tables_before_major_seconds` interval configuration option is added (defaults to 24 hours). In the case that not all tables are flushed prior to major compaction, we revert to the old behavior of flushing each table in the keyspace before major-compacting it. Fixes scylladb#15777 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For flushing all tables in the database. The advantage of this api is that `commitlog->force_new_active_segment` happens only once in `database::flush_all_tables` rather than once per keyspace (when `nodetool flush` translates to a sequence of `/storage_service/keyspace_flush` calls). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For major compacting all tables in the database. The advantage of this api is that `commitlog->force_new_active_segment` happens only once in `database::flush_all_tables` rather than once per keyspace (when `nodetool compact` translates to a sequence of `/storage_service/keyspace_compaction` calls). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It looks like `nodetool compact standard1` is meant to show how to compact a specified table, not a keyspace. Note that the previous example like is for a keyspace. So fix the table compaction example to: `nodetool compact keyspace1 standard1` Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Provide 3 examples, like in the nodetool/compact page: global, per-keyspace, per-table. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
19f529e
to
310ff20
Compare
In v8 (310ff20):
@annastuchlik can you please review last 2 patches. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doc updates look good.
🟢 CI State: SUCCESS✅ - Build Build Details:
|
@denesb / @scylladb/scylla-maint please queue |
… active segment and flushing all tables' from Kefu Chai Major compaction already flushes each table to make sure it considers any mutations that are present in the memtable for the purpose of tombstone purging. See 64ec1c6 However, tombstone purging may be inhibited by data in commitlog segments based on `gc_time_min` in the `tombstone_gc_state` (See f42eb4d). Flushing all sstables in the database release all references to commitlog segments and there it maximizes the potential for tombstone purging, which is typically the reason for running major compaction. However, flushing all tables too frequently might result in tiny sstables. Since when flushing all keyspaces using `nodetool flush` the `force_keyspace_compaction` api is invoked for keyspace successively, we need a mechanism to prevent too frequent flushes by major compaction. Hence a `compaction_flush_all_tables_before_major_seconds` interval configuration option is added (defaults to 24 hours). In the case that not all tables are flushed prior to major compaction, we revert to the old behavior of flushing each table in the keyspace before major-compacting it. Fixes #15777 Closes #15820 to address the confliction, following change is also included in this changeset: tools/scylla-nodetool: implement the cleanup command The --jobs command-line argument is accepted but ignored, just like the current nodetool does. Refs: #15588 Closes #16756 * github.com:scylladb/scylladb: docs: nodetool: flush: enrich examples docs: nodetool: compact: fix example api: add /storage_service/compact api: add /storage_service/flush tools/scylla-nodetool: implement the flush command compaction_manager: flush_all_tables before major compaction database: add flush_all_tables api: compaction: add flush_memtables option test/nodetool: jmx: fix path to scripts/scylla-jmx scylla-nodetool, docs: improve optional params documentation tools/scylla-nodetool: extract keyspace/table parsing tools/scylla-nodetool: implement the cleanup command test/nodetool: rest_api_mock: add more options for multiple requests
Major compaction already flushes each table to make
sure it considers any mutations that are present in the
memtable for the purpose of tombstone purging.
See 64ec1c6
However, tombstone purging may be inhibited by data
in commitlog segments based on
gc_time_min
in thetombstone_gc_state
(See f42eb4d).Flushing all sstables in the database release
all references to commitlog segments and there
it maximizes the potential for tombstone purging,
which is typically the reason for running major compaction.
However, flushing all tables too frequently might
result in tiny sstables. Since when flushing all
keyspaces using
nodetool flush
theforce_keyspace_compaction
api is invoked for keyspace successively, we need a mechanism
to prevent too frequent flushes by major compaction.
Hence a
compaction_flush_all_tables_before_major_seconds
intervalconfiguration option is added (defaults to 24 hours).
In the case that not all tables are flushed prior
to major compaction, we revert to the old behavior of
flushing each table in the keyspace before major-compacting it.
Fixes #15777