
[RFC] db/system_keyspace: add data_source virtual table #14083

Closed
wants to merge 4 commits

Conversation

denesb
Contributor

@denesb denesb commented May 30, 2023

Allows querying the mutation-level content of any table, from any of the
mutation sources (memtable, sstables, row-cache).
The table supports reading only a single partition from the given table.
The content is presented at the mutation-fragment level, separately for
each mutation source. The contents of individual mutation sources are
merged, e.g. even if there are multiple memtables/sstables, their
contents will be merged into a single per-mutation-source stream.

This virtual table gives us a long-needed peek into the content of
memtables and the cache, something we could only do via coredumps until now.

Fixes: #11130

Example:

given a table:

CREATE TABLE ks.tbl (
pk int,
ck int,
s int static,
v text,
PRIMARY KEY (pk, ck)
)

with the following content:

cqlsh> select * from ks.tbl where pk = 100;

 pk  | ck  | s    | v
-----+-----+------+-----
 100 | 200 | null | www
 100 | 300 | null | vvv
 100 | 400 | null | www

we can query the underlying mutation fragments:

cqlsh> select * from system.data_source where keyspace_name = 'ks' and table_name = 'tbl' and partition_key = ['100'];

 keyspace_name | table_name | partition_key | source   | partition_region | clustering_key | position_weight | kind            | value
---------------+------------+---------------+----------+------------------+----------------+-----------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
	ks |        tbl |       ['100'] | memtable |                0 |               [] |               0 | partition start |                                                                                                                                                                                                                                             null
	ks |        tbl |       ['100'] | memtable |                2 |        ['200'] |               0 |  clustering row | {mutation_fragment: clustering row {position: clustered, ckp{0004000000c8}, 0} {clustering_row: ck 0004000000c8 dr {deletable_row: {row_marker: 1685427018668474 0 0} {{row:\n    v{ atomic_cell{www,ts=1685427018668474,expiry=-1,ttl=0} }}}}}}
	ks |        tbl |       ['100'] | memtable |                2 |        ['400'] |               0 |  clustering row | {mutation_fragment: clustering row {position: clustered, ckp{000400000190}, 0} {clustering_row: ck 000400000190 dr {deletable_row: {row_marker: 1685427016170123 0 0} {{row:\n    v{ atomic_cell{www,ts=1685427016170123,expiry=-1,ttl=0} }}}}}}
	ks |        tbl |       ['100'] | memtable |                3 |               [] |               0 |   partition end |                                                                                                                                                                                                                                             null
	ks |        tbl |       ['100'] |  sstable |                0 |               [] |               0 | partition start |                                                                                                                                                                                                                                             null
	ks |        tbl |       ['100'] |  sstable |                2 |        ['200'] |               0 |  clustering row | {mutation_fragment: clustering row {position: clustered, ckp{0004000000c8}, 0} {clustering_row: ck 0004000000c8 dr {deletable_row: {row_marker: 1685364217042150 0 0} {{row:\n    v{ atomic_cell{vvv,ts=1685364217042150,expiry=-1,ttl=0} }}}}}}
	ks |        tbl |       ['100'] |  sstable |                2 |        ['300'] |               0 |  clustering row | {mutation_fragment: clustering row {position: clustered, ckp{00040000012c}, 0} {clustering_row: ck 00040000012c dr {deletable_row: {row_marker: 1685364217047564 0 0} {{row:\n    v{ atomic_cell{vvv,ts=1685364217047564,expiry=-1,ttl=0} }}}}}}
	ks |        tbl |       ['100'] |  sstable |                2 |        ['400'] |               0 |  clustering row | {mutation_fragment: clustering row {position: clustered, ckp{000400000190}, 0} {clustering_row: ck 000400000190 dr {deletable_row: {row_marker: 1685364217050305 0 0} {{row:\n    v{ atomic_cell{vvv,ts=1685364217050305,expiry=-1,ttl=0} }}}}}}
	ks |        tbl |       ['100'] |  sstable |                3 |               [] |               0 |   partition end |                                                                                                                                                                                                                                             null

The series also allows reading the content of the memtable(s), as well
as the content of all the sstables of the table, so it can be used
where an instance is not available (or cannot be used).
@denesb denesb changed the title db/system_keyspace: add data_source virtual table [RFC] db/system_keyspace: add data_source virtual table May 30, 2023
@denesb
Contributor Author

denesb commented May 30, 2023

TODO:

  • add mutation fragments from cache
  • nicer format for value
    • format metadata using JSON
    • separate field for actual value (so it can be omitted from query)
  • take slice into consideration
    • clustering range
    • drop excluded columns early, especially the value column
  • deal with mutation fragments having the same position (range-tombstone-changes)
  • tests
  • docs

@denesb
Contributor Author

denesb commented May 30, 2023

This table will allow us to count tombstones in memtable and/or cache, something we've never been able to do before either.
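As an illustration, counting tombstone fragments in a given mutation source could then become a plain CQL aggregation over the virtual table. The query below is hypothetical: the `kind` value is an assumption extrapolated from the example output above, and filtering on non-key columns would need `ALLOW FILTERING`.

```cql
-- Hypothetical query against the proposed virtual table: count
-- range-tombstone-change fragments for one partition in the memtable.
-- The 'kind' string is an assumed value, not a confirmed one.
SELECT COUNT(*)
FROM system.data_source
WHERE keyspace_name = 'ks'
  AND table_name = 'tbl'
  AND partition_key = ['100']
  AND source = 'memtable'
  AND kind = 'range tombstone change'
ALLOW FILTERING;
```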

@avikivity
Member

Doesn't tracing provide the same information? At least, if we enhance the memtable reader to report it. In addition it reports which sstables are involved.

@denesb
Contributor Author

denesb commented May 30, 2023

Doesn't tracing provide the same information? At least, if we enhance the memtable reader to report it. In addition it reports which sstables are involved.

Not sure which information you refer to. Tracing doesn't provide any information w.r.t. the content of any mutation sources. It just mentions some (not all) of the mutation sources involved.

The way I see it, the advantage of using tracing is ease of implementation and ease of very basic usage. We could just shove the mutation fragments to the trace ptr which is already propagated everywhere. One would just add some keywords to an existing query and see the mutation fragments in the tracing output. But any additional analytics would then require custom scripts which extract, parse and query the included data.
With a virtual table, all that comes for free at the slight inconvenience that one has to type up a second query to see the mutations at work for the first one. But this allows for so much more flexibility.

@vladzcloudius
Contributor

vladzcloudius commented May 30, 2023

I agree with @avikivity - this belongs in Tracing: at the places where you write data to the virtual table, you can simply add a tracing point with exactly the same content, in the same "parsable" way.

Note that the table above is going to require post-processing as well - exactly like the Tracing data.

However, the virtual table from this RFC does not provide the full trace of the query in question, e.g. the same information from all replicas involved in the query. And this is just one example of the advantage of using Tracing in this RFC instead of a virtual table.

Another question about this RFC: in the example above you seem to be printing the information about each single tombstone. This is not going to produce good results when we have millions of tombstones in a single partition. I think you'd rather print some statistics instead.

@avikivity
Member

If tracing misses some mutation source (memtables?) then it needs to be fixed.

It's true that tracing provides much more information than is needed.

btw, don't we also have an HTTP API to find sstables that contain a key? That makes for yet a third way to obtain the same information.

@denesb
Contributor Author

denesb commented May 30, 2023

I feel like there is some misunderstanding as to what this table aims to provide. This table is not a means to find out where a certain partition is coming from (which mutation sources). This table is a way to dump the raw content of those mutation sources. It is a capability that we are currently missing entirely and I think tracing is not the right tool for this job.
Yes, we can add more and better stats to tracing, but that is orthogonal to this. If all you care about is which sstables a certain partition can be found in, then this is not the (best) tool for that.

If however you want to know the exact mutation-level content of the cache for a given partition, then this table is the right tool for that. And I recall many occasions when we spent a lot of time guessing what the cache and/or memtable contains, in the context of a performance/correctness issue. So I think there is a place for a way to dump their content.
So far, the only way to dump the content of the cache and memtable was to generate a coredump, which is quite an intrusive method and requires an engineer to inspect the core. With this table, anybody can do the dumping, although I guess not everybody can make sense of it still.

@vladzcloudius
Contributor

I feel like there is some misunderstanding as to what this table aims to provide. This table is not a means to find out where a certain partition is coming from (which mutation sources). This table is a way to dump the raw content of those mutation sources. It is a capability that we are currently missing entirely and I think tracing is not the right tool for this job. Yes, we can add more and better stats to tracing, but that is orthogonal to this. If all you care about is which sstables a certain partition can be found in, then this is not the (best) tool for that.

If however you want to know the exact mutation-level content of the cache for a given partition, then this table is the right tool for that. And I recall many occasions when we spent a lot of time guessing what the cache and/or memtable contains, in the context of a performance/correctness issue. So I think there is a place for a way to dump their content. So far, the only way to dump the content of the cache and memtable was to generate a coredump, which is quite an intrusive method and requires an engineer to inspect the core. With this table, anybody can do the dumping, although I guess not everybody can make sense of it still.

Got it. I indeed misunderstood the intent of this RFC. It makes total sense to me now. Good idea indeed!

@bhalevy bhalevy requested a review from tgrabiec June 11, 2023 13:58
@bhalevy
Member

bhalevy commented Jun 11, 2023

@tgrabiec please review too

@avikivity
Member

I feel like there is some misunderstanding as to what this table aims to provide. This table is not a means to find out where a certain partition is coming from (which mutation sources). This table is a way to dump the raw content of those mutation sources. It is a capability that we are currently missing entirely and I think tracing is not the right tool for this job. Yes, we can add more and better stats to tracing, but that is orthogonal to this. If all you care about is which sstables a certain partition can be found in, then this is not the (best) tool for that.

If however you want to know the exact mutation-level content of the cache for a given partition, then this table is the right tool for that. And I recall many occasions when we spent a lot of time guessing what the cache and/or memtable contains, in the context of a performance/correctness issue. So I think there is a place for a way to dump their content. So far, the only way to dump the content of the cache and memtable was to generate a coredump, which is quite an intrusive method and requires an engineer to inspect the core. With this table, anybody can do the dumping, although I guess not everybody can make sense of it still.

Ah, thanks for the explanation. So at least the name can be improved.

I think a virtual table is the wrong interface. A virtual table is something for which SELECT * FROM vtable should be legal, with any WHERE clause serving to filter the data, not to generate inputs.

Perhaps the right interface is a modification to the SELECT statement via a pseudo-function:

SELECT scylladb_mutation_fragment(*) FROM tab WHERE pk = ?

or even a new statement:

DUMP INTERNALS
SELECT * FROM tab WHERE pk = ?

Of course, I wouldn't want to turn this into a mega-project just for the sake of syntactic sugar.

@kostja
Contributor

kostja commented Jun 12, 2023

@denesb
Contributor Author

denesb commented Jun 13, 2023

Ah, thanks for the explanation. So at least the name can be improved.

Yes, in v2 (not published yet) I renamed the table to mutation_dump, to further emphasize that it is a means to dump the content of mutation sources.

I think a virtual table is the wrong interface. A virtual table is something that SELECT * FROM vtable should be legal, and any WHERE clause will serve to filter the data, not to generate inputs.

It is legal; supporting it wouldn't even be a lot of work. Running a full scan of this table just wouldn't be useful, because it would return the entire content of all the tables, in an unspecified order (token order). Since I didn't think it would be of much use, I deferred implementing it to focus on the more valuable parts.

The WHERE clause doesn't generate input. The partition key selects the table to be queried, as well as the partition to be read. The mapping is direct; there are no magic values. The clustering constraints select the data sources and the underlying clustering ranges; again, the mapping is direct, without any magic values.

The schema of the vtable is the following:

CREATE TABLE system.mutation_dump (
    keyspace_name text,
    table_name text,
    partition_key frozen<list<text>>,
    source text,
    partition_region smallint,
    clustering_key frozen<list<text>>,
    position_weight smallint,
    kind text,
    value text,
    PRIMARY KEY ((keyspace_name, table_name, partition_key), source, partition_region, clustering_key, position_weight)
);

The only "unnatural" thing when querying this table is that partitions in its output will not be ordered according to the token order of the underlying table. So selecting a token range from this table will not work. I don't know how to work around that; maybe with the new syntax proposed above it is possible. But again, I don't think we will want to select partition ranges from this table, though I may be wrong.
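To illustrate the direct mapping, a query like the following (hypothetical, written against the v2 schema above) would dump the memtable stream of a single partition, with the target table, partition, and mutation source all spelled out literally in the WHERE clause:

```cql
-- Hypothetical query against the v2 mutation_dump schema sketched
-- above: the partition key names the target table and partition, and
-- the leading clustering column restricts to one mutation source.
SELECT source, partition_region, clustering_key, kind, value
FROM system.mutation_dump
WHERE keyspace_name = 'ks'
  AND table_name = 'tbl'
  AND partition_key = ['100']
  AND source = 'memtable';
```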

streamed_mutation::forwarding::no,
mutation_reader::forwarding::no,
[&readers] (size_t memtable_count) { readers.reserve(memtable_count); });
return make_combined_reader(std::move(schema), std::move(permit), std::move(readers), streamed_mutation::forwarding::no, mutation_reader::forwarding::no);
Member

It would be more accurate to report each memtable separately, but no big deal.

Contributor Author

I considered this, but thought it might interact weirdly with paging (a memtable could disappear between pages, unless we pin it).

Member

Wouldn't capturing the memtable in the reader (which is how it works) be enough? The memtable will still be flushable, in which case the reader will read from the resulting sstable.

It should just work, but we don't have to do it now.

@@ -36,6 +36,7 @@
#include "gms/feature_service.hh"
#include "system_keyspace_view_types.hh"
#include "schema/schema_builder.hh"
#include "schema/schema_registry.hh"
Member

I expected (after reading the more detailed description) that we'd see the source as an individual sstable, not "sstable" as a class. I guess it can be added as needed.

Contributor Author

See above.

Member

This would also just work. An sstable reader captures the sstable and prevents it from disappearing.

})});
rs->data_sources.push_back(data_source{"row_cache", mutation_source([] (schema_ptr schema, reader_permit permit, const dht::partition_range& pr, const query::partition_slice& ps) {
//TODO
return make_empty_flat_reader_v2(std::move(schema), std::move(permit));
Member

This would need to be a special row_cache reader that doesn't fall back to sstables on misses.

Contributor Author

Yes, turns out I can reuse the memtable's existing reader for this (done in v2).


static schema_ptr build_schema() {
auto id = generate_legacy_id(system_keyspace::NAME, "data_source");
return schema_builder(system_keyspace::NAME, "data_source", std::make_optional(id))
Member

It's still called data_source, I thought you renamed it.

Please add the word "debug" to the name so we'd be at liberty to play with it.

Member

Ah, still unpublished. Anyway, series looks good.

Contributor Author

I did rename it in v2, but it is not published yet.

@denesb
Contributor Author

denesb commented Jun 20, 2023

The virtual table interface breaks down badly at the clustering ordering, because list<text> (my abstraction for all possible clustering-key types) is ordered differently than tuple<$ck_type1, $ck_type2, ...> (the real underlying type).

I will have to look into the alternatives provided by @avikivity at #14083 (comment).

@denesb
Contributor Author

denesb commented Jun 20, 2023

After some digging, I think the way to go is a new kind of statement, similar to our existing SELECT JSON FROM statement: SELECT MUTATION_FRAGMENTS FROM $column_family WHERE ...; We could add host and mutation_source pseudo columns, that can also be used to restrict the input. I will think about this some more, but any early feed-back is welcome.
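A sketch of what such a statement might look like in use (the syntax is a proposal at this point, not a final or implemented form, and the mutation_source pseudo column is hypothetical):

```cql
-- Proposed (not yet implemented) statement shape: dump the mutation
-- fragments backing one partition, restricted via the hypothetical
-- 'mutation_source' pseudo column.
SELECT MUTATION_FRAGMENTS FROM ks.tbl
WHERE pk = 100 AND mutation_source = 'memtable';
```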

@tgrabiec
Contributor

You could put the keys as regular columns, and order by sequence number in the fragment stream.

@denesb
Contributor Author

denesb commented Jun 21, 2023

You could put the keys as regular columns, and order by sequence number in the fragment stream.

That would solve the ordering, but it would make it really awkward to select a single row or a range of rows. One would have to first scan the entire partition to obtain the row number. Also, the row number would change every time the rows change, so the mapping between this number and the clustering key would not be stable. Overall, I think this would lead to a poor UX.

@avikivity
Member

avikivity commented Jun 21, 2023

The virtual table interface breaks down badly at the clustering ordering, because list<text> (my abstraction for all possible clustering-key types) is ordered differently than tuple<$ck_type1, $ck_type2, ...> (the real underlying type).

I will have to look into the alternatives provided by @avikivity at #14083 (comment).

I think the easiest (but not prettiest) way is scylladb_mutation_fragment(*), returning a list<text>.

The reason it's the easiest is that we preserve one primary key == one row (but: how to present the partition tombstone and static row?)

Maybe the cleanest is a new statement. The new statement can return different metadata. Let's say the primary key is ((pk1, pk2), ck1, ck2). We'd return metadata of ((pk1, pk2), ck1, ck2, type, source) where type = memtable | sstable | cache and source names the particular memtable or sstable. The partition and static row would have ck1 = ck2 = NULL.

@denesb
Contributor Author

denesb commented Jun 21, 2023

Maybe the cleanest is a new statement. The new statement can return different metadata. Let's say the primary key is ((pk1, pk2), ck1, ck2). We'd return metadata of ((pk1, pk2), ck1, ck2, type, source) where type = memtable | sstable | cache and source names the particular memtable or sstable. The partition and static row would have ck1 = ck2 = NULL.

I agree, and I have already started working on this. I will publish an RFC as soon as I get the simplest example working. Thankfully, I can salvage most of the code I wrote for the virtual table; what I need to figure out is how to properly wire in the new statement. I think I'm getting there and hopefully will be able to publish an early RFC today or tomorrow.

@denesb
Contributor Author

denesb commented Jun 21, 2023

Closing this, as the virtual table approach is a dead-end. I opened a new PR for the new statement approach: #14347.

Successfully merging this pull request may close these issues.

Add ability to trace mutation form from each data source during query
7 participants