[RFC] db/system_keyspace: add data_source virtual table #14083
Conversation
Allows reading the content of the memtable(s).
Allows reading the content of all the sstables of the table.
So it can be used where an instance is not available (or cannot be used).
Allows querying the mutation-level content of any table, from any of the mutation sources (memtable, sstables, row-cache). The table supports reading only a single partition from the given table. The content is presented at the mutation-fragment level, separately for each mutation source. The content of individual mutation sources is merged, e.g. even if there are multiple memtables/sstables, their contents will be merged into a single per-mutation-source stream.

This virtual table provides us a long-needed peek into the content of memtables and cache, something we could only do in coredumps until now.

Example: given a table:

```cql
CREATE TABLE ks.tbl (
    pk int,
    ck int,
    s int static,
    v text,
    PRIMARY KEY (pk, ck)
)
```

with the following content:

```
cqlsh> select * from ks.tbl where pk = 100;

 pk  | ck  | s    | v
-----+-----+------+-----
 100 | 200 | null | www
 100 | 300 | null | vvv
 100 | 400 | null | www
```

we can query the underlying mutation fragments:

```
cqlsh> select * from system.data_source where keyspace_name = 'ks' and table_name = 'tbl' and partition_key = ['100'];

 keyspace_name | table_name | partition_key | source   | partition_region | clustering_key | position_weight | kind            | value
---------------+------------+---------------+----------+------------------+----------------+-----------------+-----------------+-------
 ks            | tbl        | ['100']       | memtable | 0                | []             | 0               | partition start | null
 ks            | tbl        | ['100']       | memtable | 2                | ['200']        | 0               | clustering row  | {mutation_fragment: clustering row {position: clustered, ckp{0004000000c8}, 0} {clustering_row: ck 0004000000c8 dr {deletable_row: {row_marker: 1685427018668474 0 0} {{row:\n v{ atomic_cell{www,ts=1685427018668474,expiry=-1,ttl=0} }}}}}}
 ks            | tbl        | ['100']       | memtable | 2                | ['400']        | 0               | clustering row  | {mutation_fragment: clustering row {position: clustered, ckp{000400000190}, 0} {clustering_row: ck 000400000190 dr {deletable_row: {row_marker: 1685427016170123 0 0} {{row:\n v{ atomic_cell{www,ts=1685427016170123,expiry=-1,ttl=0} }}}}}}
 ks            | tbl        | ['100']       | memtable | 3                | []             | 0               | partition end   | null
 ks            | tbl        | ['100']       | sstable  | 0                | []             | 0               | partition start | null
 ks            | tbl        | ['100']       | sstable  | 2                | ['200']        | 0               | clustering row  | {mutation_fragment: clustering row {position: clustered, ckp{0004000000c8}, 0} {clustering_row: ck 0004000000c8 dr {deletable_row: {row_marker: 1685364217042150 0 0} {{row:\n v{ atomic_cell{vvv,ts=1685364217042150,expiry=-1,ttl=0} }}}}}}
 ks            | tbl        | ['100']       | sstable  | 2                | ['300']        | 0               | clustering row  | {mutation_fragment: clustering row {position: clustered, ckp{00040000012c}, 0} {clustering_row: ck 00040000012c dr {deletable_row: {row_marker: 1685364217047564 0 0} {{row:\n v{ atomic_cell{vvv,ts=1685364217047564,expiry=-1,ttl=0} }}}}}}
 ks            | tbl        | ['100']       | sstable  | 2                | ['400']        | 0               | clustering row  | {mutation_fragment: clustering row {position: clustered, ckp{000400000190}, 0} {clustering_row: ck 000400000190 dr {deletable_row: {row_marker: 1685364217050305 0 0} {{row:\n v{ atomic_cell{vvv,ts=1685364217050305,expiry=-1,ttl=0} }}}}}}
 ks            | tbl        | ['100']       | sstable  | 3                | []             | 0               | partition end   | null
```
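The per-source merging described above (several memtables collapsing into a single "memtable" stream, with the newest write winning per clustering key) can be illustrated with a small standalone sketch. `Row` and `merge_streams` are hypothetical stand-ins for illustration only, not Scylla's actual reader types:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Illustrative stand-in for a clustering row carried by one mutation source.
struct Row {
    int ck;             // clustering key
    int64_t ts;         // write timestamp
    std::string value;
};

// Merge several streams of the same source class (e.g. multiple memtables)
// into one stream: for each clustering key, the cell with the newest
// timestamp wins, mirroring the usual reconciliation rule.
std::vector<Row> merge_streams(const std::vector<std::vector<Row>>& streams) {
    std::map<int, Row> merged; // ordered by clustering key
    for (const auto& s : streams) {
        for (const auto& r : s) {
            auto it = merged.find(r.ck);
            if (it == merged.end() || it->second.ts < r.ts) {
                merged[r.ck] = r;
            }
        }
    }
    std::vector<Row> out;
    for (auto& [ck, row] : merged) {
        out.push_back(row);
    }
    return out;
}
```

With the example data above, two memtable streams writing `www` at a newer timestamp over `vvv` would merge into the three-row result shown by the plain `SELECT`.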
TODO:
This table will allow us to count tombstones in memtable and/or cache, something we've never been able to do before either.
Doesn't tracing provide the same information? At least, if we enhance the memtable reader to report it. In addition, it reports which sstables are involved.
Not sure which information you refer to. Tracing doesn't provide any information w.r.t. the content of any mutation sources. It just mentions some (not all) of the mutation sources involved. The way I see it, the advantage of using tracing is ease of implementation and ease of very basic usage. We could just shove the mutation fragments to the trace ptr which is already propagated everywhere. One would just add some keywords to an existing query and see the mutation fragments in the tracing output. But any additional analytics would then require custom scripts which extract, parse and query the included data.
I agree with @avikivity - this belongs to Tracing: at the places where you write data to the virtual table, you can simply add a tracing point with the same content, in exactly the same "parsable" way. Note that the table above is going to require post-processing as well - exactly like the Tracing data. However, the virtual table from this RFC does not provide the full trace of the query in question, e.g. the same information from all replicas involved in the query. And this is just one of the examples that shows the advantage of using Tracing in this RFC instead of a virtual table. Another question about this RFC: in the example above you seem to be printing the information about each single tombstone. This is not going to give good results when we have millions of tombstones in a single partition. I think you'd rather print some statistics instead.
If tracing misses some mutation source (memtables?) then it needs to be fixed. It's true that tracing provides much more information than is needed. btw, don't we also have an HTTP API to find sstables that contain a key? That makes for yet a third way to obtain the same information.
I feel like there is some misunderstanding as to what this table aims to provide. This table is not a means to find out where a certain partition is coming from (which mutation sources). This table is a way to dump the raw content of those mutation sources. It is a capability that we are currently missing entirely, and I think tracing is not the right tool for this job. If, however, you want to know the exact mutation-level content of the cache for a given partition, then this table is the right tool for that. And I recall many occasions when we spent a lot of time guessing what cache and/or memtable contains in the context of a performance/correctness issue. So I think there is a place for a way to dump their content.
Got it. I indeed misunderstood the intent of this RFC. It makes total sense to me now. Good idea indeed!
@tgrabiec please review too
Ah, thanks for the explanation. So at least the name can be improved.

I think a virtual table is the wrong interface. Perhaps the right interface is a modification to the SELECT statement via a pseudo-function:

```cql
SELECT scylladb_mutation_fragment(*) FROM tab WHERE pk = ?
```

or even a new statement:

```cql
DUMP INTERNALS
SELECT * FROM tab WHERE pk = ?
```

Of course, I wouldn't want to turn this into a mega-project just for the sake of syntactic sugar.
Oracle has the concept of a pseudo-column: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/ROWID-Pseudocolumn.html
Yes, in v2 (not published yet) I renamed the table to `mutation_dump`.

It is legal; supporting it wouldn't even be a lot of work. Running a full scan of this table just wouldn't be useful, because it would return all the content of all the tables, in an unspecified order (token order). Since I didn't think it would be of much use, I deferred implementing it, to focus on the more valuable parts.

The schema of the vtable is the following:

```cql
CREATE TABLE system.mutation_dump (
    keyspace_name text,
    table_name text,
    partition_key frozen<list<text>>,
    source text,
    partition_region smallint,
    clustering_key frozen<list<text>>,
    position_weight smallint,
    kind text,
    value text,
    PRIMARY KEY ((keyspace_name, table_name, partition_key), source, partition_region, clustering_key, position_weight)
);
```

The only "unnatural" thing when querying this table is that partitions in its output will not be ordered according to the token order of the underlying table. So selecting a token range from this table will not work. I don't know how to work around that; maybe with the new syntax proposed above it is possible. But again, I don't think we will want to select partition ranges from this table, though I may be wrong.
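The clustering order this schema implies (within a source: `partition_region`, then `clustering_key`, then `position_weight`) can be sketched as a tuple comparator. `DumpPosition` is a hypothetical illustration, not the actual Scylla type; it just shows why "partition start" rows sort first and "partition end" rows sort last in the dump output:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// Illustrative flattened position of one output row in the dump's
// clustering order: partition_region 0 is the partition start, 2 is the
// clustered-rows region, 3 is the partition end; within a region, rows
// sort by clustering key, then by position weight.
struct DumpPosition {
    int16_t partition_region;
    std::vector<std::string> clustering_key;
    int16_t position_weight;

    bool operator<(const DumpPosition& o) const {
        return std::tie(partition_region, clustering_key, position_weight)
             < std::tie(o.partition_region, o.clustering_key, o.position_weight);
    }
};
```

Sorting a shuffled set of rows with this comparator reproduces the stream order seen in the example output: partition start, clustering rows in key order, partition end.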
```cpp
            streamed_mutation::forwarding::no,
            mutation_reader::forwarding::no,
            [&readers] (size_t memtable_count) { readers.reserve(memtable_count); });
    return make_combined_reader(std::move(schema), std::move(permit), std::move(readers), streamed_mutation::forwarding::no, mutation_reader::forwarding::no);
```
It would be more accurate to report each memtable separately, but no big deal.
I considered this, but thought it would maybe interact weirdly with paging (memtable could disappear between pages, unless we pin it).
Wouldn't capturing the memtable in the reader (which is how it works) be enough? The memtable will still be flushable, in which case the reader will read from the resulting sstable.

It should just work, but we don't have to do it now.
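The lifetime argument above (a reader capturing the memtable keeps it alive even after the owning table drops it on flush) is the usual shared-ownership pattern. A generic sketch, with `MemtableLike` as a hypothetical stand-in for the real memtable type:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-in for a memtable: just a bag of values.
struct MemtableLike {
    std::vector<int> rows;
};

// A "reader" that captures shared ownership of the memtable, so the
// memtable stays alive even if the owning table drops its reference
// (e.g. after a flush replaces it with an sstable).
// Returns rows one by one, then -1 when exhausted.
std::function<int()> make_reader(std::shared_ptr<MemtableLike> mt) {
    std::size_t i = 0;
    return [mt, i] () mutable {
        return i < mt->rows.size() ? mt->rows[i++] : -1;
    };
}
```

The point is only that the `shared_ptr` copy inside the lambda co-owns the memtable; the table flushing it does not invalidate an in-flight reader.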
```diff
@@ -36,6 +36,7 @@
 #include "gms/feature_service.hh"
 #include "system_keyspace_view_types.hh"
 #include "schema/schema_builder.hh"
+#include "schema/schema_registry.hh"
```
I expected (after reading the more detailed description) that we'd see the source as an individual sstable, not "sstable" as a class. I guess it can be added as needed.
See above.
This would also just work. An sstable reader captures the sstable and prevents it from disappearing.
```cpp
    })});
    rs->data_sources.push_back(data_source{"row_cache", mutation_source([] (schema_ptr schema, reader_permit permit, const dht::partition_range& pr, const query::partition_slice& ps) {
        //TODO
        return make_empty_flat_reader_v2(std::move(schema), std::move(permit));
```
This would need to be a special row_cache reader that doesn't fall back to sstables on misses.
Yes, turns out I can reuse the memtable's existing reader for this (done in v2).
```cpp
static schema_ptr build_schema() {
    auto id = generate_legacy_id(system_keyspace::NAME, "data_source");
    return schema_builder(system_keyspace::NAME, "data_source", std::make_optional(id))
```
It's still called data_source, I thought you renamed it.
Please add the word "debug" to the name so we'd be at liberty to play with it.
Ah, still unpublished. Anyway, series looks good.
I did rename it in v2, but it is not published yet.
The virtual table interface breaks down badly at the clustering ordering, so I will have to look into the alternatives provided by @avikivity at #14083 (comment).
After some digging, I think the way to go is a new kind of statement, similar to our existing
You could put the keys as regular columns, and order by sequence number in the fragment stream.
That would solve the ordering, but it would make it really awkward to select a single row or a range of rows. One would have to first scan the entire partition, to obtain the row number. Also, the row number would change every time the rows would change, so the mapping between this number and the clustering key would not be stable. Overall, I think this would lead to a poor UX.
I think the easiest (but not prettiest) way is `scylladb_mutation_fragment(*)`, returning a `list<text>`. The reason it's the easiest is that we preserve one primary key == one row (but: how to present the partition tombstone and static row?).

Maybe the cleanest is a new statement. The new statement can return different metadata. Let's say the primary key is ((pk1, pk2), ck1, ck2). We'd return metadata of ((pk1, pk2), ck1, ck2, type, source) where type = memtable | sstable | cache and source names the particular memtable or sstable. The partition and static row would have ck1 = ck2 = NULL.
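The proposed result shape could be modeled as rows whose clustering columns are nullable, so partition-level entries (partition tombstone, static row) simply carry NULL clustering columns. A sketch under that assumption; `DumpRow` and all field names are illustrative, not an actual Scylla type:

```cpp
#include <cassert>
#include <optional>
#include <string>

// Sketch of the proposed result metadata ((pk1, pk2), ck1, ck2, type,
// source). Partition-level entries (partition tombstone, static row)
// carry NULL clustering columns, modeled here with std::optional.
struct DumpRow {
    int pk1, pk2;
    std::optional<int> ck1, ck2; // NULL for partition-level entries
    std::string type;            // "memtable" | "sstable" | "cache"
    std::string source;          // names the particular memtable or sstable
};

// A row with both clustering columns NULL describes the partition
// tombstone or the static row rather than a clustered row.
bool is_partition_level(const DumpRow& r) {
    return !r.ck1 && !r.ck2;
}
```

This keeps the "one primary key == one row" property while still giving partition-level fragments a natural place in the output.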
I agree and I already started working on this. I will publish an RFC as soon as I get a simplest example working. Thankfully, I can salvage most of the code I wrote for the virtual table; what I need to figure out is how to properly wire in the new statement. I think I'm getting there and hopefully will be able to publish an early RFC today or tomorrow.
Closing this, as the virtual table approach is a dead-end. I opened a new PR for the new statement approach: #14347.
Allows querying the mutation-level content of any table, from any of the
mutation sources (memtable, sstables, row-cache).
The table supports reading only a single partition from the given table.
The content is presented at the mutation-fragment level, separately for
each mutation source. The content of individual mutation sources is
merged, e.g. even if there are multiple memtables/sstables, their
contents will be merged into a single per-mutation-source stream.
This virtual table provides us a long-needed peek into the content of
memtables and cache, something we could only do in coredumps until now.
Fixes: #11130
Example:
given a table:
with the following content:
we can query the underlying mutation fragments: