
[RFC] db/system_keyspace: add data_source virtual table #14083

Closed
wants to merge 4 commits

Conversation

denesb
Contributor

@denesb denesb commented May 30, 2023

Allows querying the mutation-level content of any table, from any of the
mutation sources (memtable, sstables, row-cache).
The table supports reading only a single partition from the given table.
The content is presented at the mutation-fragment level, separately for
each mutation source. The contents of individual mutation sources are
merged, e.g. even if there are multiple memtables/sstables, their
contents will be merged into a single per-mutation-source stream.

This virtual table gives us a long-needed peek into the content of
memtables and the cache, something we could only do via coredumps until now.

Fixes: #11130

Example:

given a table:

CREATE TABLE ks.tbl (
pk int,
ck int,
s int static,
v text,
PRIMARY KEY (pk, ck)
)

with the following content:

cqlsh> select * from ks.tbl where pk = 100;

 pk  | ck  | s    | v
-----+-----+------+-----
 100 | 200 | null | www
 100 | 300 | null | vvv
 100 | 400 | null | www

we can query the underlying mutation fragments:

cqlsh> select * from system.data_source where keyspace_name = 'ks' and table_name = 'tbl' and partition_key = ['100'];

 keyspace_name | table_name | partition_key | source   | partition_region | clustering_key | position_weight | kind            | value
---------------+------------+---------------+----------+------------------+----------------+-----------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
	ks |        tbl |       ['100'] | memtable |                0 |               [] |               0 | partition start |                                                                                                                                                                                                                                             null
	ks |        tbl |       ['100'] | memtable |                2 |        ['200'] |               0 |  clustering row | {mutation_fragment: clustering row {position: clustered, ckp{0004000000c8}, 0} {clustering_row: ck 0004000000c8 dr {deletable_row: {row_marker: 1685427018668474 0 0} {{row:\n    v{ atomic_cell{www,ts=1685427018668474,expiry=-1,ttl=0} }}}}}}
	ks |        tbl |       ['100'] | memtable |                2 |        ['400'] |               0 |  clustering row | {mutation_fragment: clustering row {position: clustered, ckp{000400000190}, 0} {clustering_row: ck 000400000190 dr {deletable_row: {row_marker: 1685427016170123 0 0} {{row:\n    v{ atomic_cell{www,ts=1685427016170123,expiry=-1,ttl=0} }}}}}}
	ks |        tbl |       ['100'] | memtable |                3 |               [] |               0 |   partition end |                                                                                                                                                                                                                                             null
	ks |        tbl |       ['100'] |  sstable |                0 |               [] |               0 | partition start |                                                                                                                                                                                                                                             null
	ks |        tbl |       ['100'] |  sstable |                2 |        ['200'] |               0 |  clustering row | {mutation_fragment: clustering row {position: clustered, ckp{0004000000c8}, 0} {clustering_row: ck 0004000000c8 dr {deletable_row: {row_marker: 1685364217042150 0 0} {{row:\n    v{ atomic_cell{vvv,ts=1685364217042150,expiry=-1,ttl=0} }}}}}}
	ks |        tbl |       ['100'] |  sstable |                2 |        ['300'] |               0 |  clustering row | {mutation_fragment: clustering row {position: clustered, ckp{00040000012c}, 0} {clustering_row: ck 00040000012c dr {deletable_row: {row_marker: 1685364217047564 0 0} {{row:\n    v{ atomic_cell{vvv,ts=1685364217047564,expiry=-1,ttl=0} }}}}}}
	ks |        tbl |       ['100'] |  sstable |                2 |        ['400'] |               0 |  clustering row | {mutation_fragment: clustering row {position: clustered, ckp{000400000190}, 0} {clustering_row: ck 000400000190 dr {deletable_row: {row_marker: 1685364217050305 0 0} {{row:\n    v{ atomic_cell{vvv,ts=1685364217050305,expiry=-1,ttl=0} }}}}}}
	ks |        tbl |       ['100'] |  sstable |                3 |               [] |               0 |   partition end |                                                                                                                                                                                                                                             null

The series also allows reading the content of the memtable(s), as well
as the content of all the sstables of the table, so it can be used
where an instance is not available (or cannot be used).
@denesb denesb changed the title db/system_keyspace: add data_source virtual table [RFC] db/system_keyspace: add data_source virtual table May 30, 2023
@denesb
Contributor Author

denesb commented May 30, 2023

TODO:

  • add mutation fragments from cache
  • nicer format for value
    • format metadata using JSON
    • separate field for actual value (so it can be omitted from query)
  • take slice into consideration
    • clustering range
    • drop excluded columns early, especially the value column
  • deal with mutation fragments having the same position (range-tombstone-changes)
  • tests
  • docs

@denesb
Contributor Author

denesb commented May 30, 2023

This table will allow us to count tombstones in memtable and/or cache, something we've never been able to do before either.
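As an illustration, counting tombstone fragments in a given mutation source could then become a plain CQL aggregation over the virtual table. The query below is hypothetical: the `kind` value is an assumption extrapolated from the example output above, and filtering on non-key columns would need `ALLOW FILTERING`.

```cql
-- Hypothetical query against the proposed virtual table: count
-- range-tombstone-change fragments for one partition in the memtable.
-- The 'kind' string is an assumed value, not a confirmed one.
SELECT COUNT(*)
FROM system.data_source
WHERE keyspace_name = 'ks'
  AND table_name = 'tbl'
  AND partition_key = ['100']
  AND source = 'memtable'
  AND kind = 'range tombstone change'
ALLOW FILTERING;
```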

@avikivity
Member

Doesn't tracing provide the same information? At least, if we enhance the memtable reader to report it. In addition it reports which sstables are involved.

@denesb
Contributor Author

denesb commented May 30, 2023

Doesn't tracing provide the same information? At least, if we enhance the memtable reader to report it. In addition it reports which sstables are involved.

Not sure which information you refer to. Tracing doesn't provide any information w.r.t. the content of any mutation sources. It just mentions some (not all) of the mutation sources involved.

The way I see it, the advantage of using tracing is ease of implementation and ease of very basic usage. We could just shove the mutation fragments to the trace ptr which is already propagated everywhere. One would just add some keywords to an existing query and see the mutation fragments in the tracing output. But any additional analytics would then require custom scripts which extract, parse and query the included data.
With a virtual table, all that comes for free at the slight inconvenience that one has to type up a second query to see the mutations at work for the first one. But this allows for so much more flexibility.

@vladzcloudius
Contributor

vladzcloudius commented May 30, 2023

I agree with @avikivity - this belongs in Tracing: at the places where you write data to the virtual table, you can simply add a tracing point with exactly the same content, in the same "parsable" way.

Note that the table above is going to require post-processing as well - exactly like the Tracing data.

However, the virtual table from this RFC does not provide the full trace of the query in question, e.g. the same information from all replicas involved in the query. And this is just one example of the advantage of using Tracing in this RFC instead of a virtual table.

Another question about this RFC: in the example above you seem to be printing the information about each single tombstone. This is not going to produce good results when we have millions of tombstones in a single partition. I think you'd rather print some statistics instead.

@avikivity
Member

If tracing misses some mutation source (memtables?) then it needs to be fixed.

It's true that tracing provides much more information than is needed.

btw, don't we also have an HTTP API to find sstables that contain a key? That makes for yet a third way to obtain the same information.

@denesb
Contributor Author

denesb commented May 30, 2023

I feel like there is some misunderstanding as to what this table aims to provide. This table is not a means to find out where a certain partition is coming from (which mutation sources). This table is a way to dump the raw content of those mutation sources. It is a capability that we are currently missing entirely and I think tracing is not the right tool for this job.
Yes, we can add more and better stats to tracing, but that is orthogonal to this. If all you care about is which sstables a certain partition can be found in, then this is not the (best) tool for that.

If however you want to know the exact mutation-level content of the cache for a given partition, then this table is the right tool for that. And I recall many occasions when we spent a lot of time guessing what the cache and/or memtable contains, in the context of a performance/correctness issue. So I think there is a place for a way to dump their content.
So far, the only way to dump the content of the cache and memtable was to generate a coredump, which is quite an intrusive method and requires an engineer to inspect the core. With this table, anybody can do the dumping, although I guess not everybody can make sense of it still.

@vladzcloudius
Contributor

I feel like there is some misunderstanding as to what this table aims to provide. This table is not a means to find out where a certain partition is coming from (which mutation sources). This table is a way to dump the raw content of those mutation sources. It is a capability that we are currently missing entirely and I think tracing is not the right tool for this job. Yes, we can add more and better stats to tracing, but that is orthogonal to this. If all you care about is which sstables a certain partition can be found in, then this is not the (best) tool for that.

If however you want to know the exact mutation-level content of the cache for a given partition, then this table is the right tool for that. And I recall many occasions when we spent a lot of time guessing what the cache and/or memtable contains, in the context of a performance/correctness issue. So I think there is a place for a way to dump their content. So far, the only way to dump the content of the cache and memtable was to generate a coredump, which is quite an intrusive method and requires an engineer to inspect the core. With this table, anybody can do the dumping, although I guess not everybody can make sense of it still.

Got it. I indeed misunderstood the intent of this RFC. It makes total sense to me now. Good idea indeed!

@bhalevy bhalevy requested a review from tgrabiec June 11, 2023 13:58
@bhalevy
Member

bhalevy commented Jun 11, 2023

@tgrabiec please review too

@avikivity
Member

I feel like there is some misunderstanding as to what this table aims to provide. This table is not a means to find out where a certain partition is coming from (which mutation sources). This table is a way to dump the raw content of those mutation sources. It is a capability that we are currently missing entirely and I think tracing is not the right tool for this job. Yes, we can add more and better stats to tracing, but that is orthogonal to this. If all you care about is which sstables a certain partition can be found in, then this is not the (best) tool for that.

If however you want to know the exact mutation-level content of the cache for a given partition, then this table is the right tool for that. And I recall many occasions when we spent a lot of time guessing what the cache and/or memtable contains, in the context of a performance/correctness issue. So I think there is a place for a way to dump their content. So far, the only way to dump the content of the cache and memtable was to generate a coredump, which is quite an intrusive method and requires an engineer to inspect the core. With this table, anybody can do the dumping, although I guess not everybody can make sense of it still.

Ah, thanks for the explanation. So at least the name can be improved.

I think a virtual table is the wrong interface. A virtual table is something for which SELECT * FROM vtable should be legal, with any WHERE clause serving to filter the data, not to generate inputs.

Perhaps the right interface is a modification to the SELECT statement via a pseudo-function:

SELECT scylladb_mutation_fragment(*) FROM tab WHERE pk = ?

or even a new statement:

DUMP INTERNALS
SELECT * FROM tab WHERE pk = ?

Of course, I wouldn't want to turn this into a mega-project just for the sake of syntactic sugar.

@kostja
Contributor

kostja commented Jun 12, 2023

@denesb
Contributor Author

denesb commented Jun 13, 2023

Ah, thanks for the explanation. So at least the name can be improved.

Yes, in v2 (not published yet) I renamed the table to mutation_dump, to further emphasize that it is a means to dump the content of mutation sources.

I think a virtual table is the wrong interface. A virtual table is something that SELECT * FROM vtable should be legal, and any WHERE clause will serve to filter the data, not to generate inputs.

It is legal; supporting it wouldn't even be a lot of work. Running a full scan of this table just wouldn't be useful, because it would return the entire content of all the tables, in an unspecified order (token order). Since I didn't think it would be of much use, I deferred implementing it to focus on the more valuable parts.

The WHERE clause doesn't generate input. The partition key selects the table to be queried, as well as the partition to be read. The mapping is direct; there are no magic values. The clustering constraints select the data sources and the underlying clustering ranges; again, the mapping is direct, without any magic values.

The schema of the vtable is the following:

CREATE TABLE system.mutation_dump (
    keyspace_name text,
    table_name text,
    partition_key frozen<list<text>>,
    source text,
    partition_region smallint,
    clustering_key frozen<list<text>>,
    position_weight smallint,
    kind text,
    value text,
    PRIMARY KEY ((keyspace_name, table_name, partition_key), source, partition_region, clustering_key, position_weight)
);

The only "unnatural" thing when querying this table is that partitions in its output will not be ordered according to the token order of the underlying table. So selecting a token range from this table will not work. I don't know how to work around that; maybe with the new syntax proposed above it is possible. But again, I don't think we will want to select partition ranges from this table, though I may be wrong.
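To illustrate the direct mapping, a query like the following (hypothetical, written against the v2 schema above) would dump the memtable stream of a single partition, with the target table, partition, and mutation source all spelled out literally in the WHERE clause:

```cql
-- Hypothetical query against the v2 mutation_dump schema sketched
-- above: the partition key names the target table and partition, and
-- the leading clustering column restricts to one mutation source.
SELECT source, partition_region, clustering_key, kind, value
FROM system.mutation_dump
WHERE keyspace_name = 'ks'
  AND table_name = 'tbl'
  AND partition_key = ['100']
  AND source = 'memtable';
```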

streamed_mutation::forwarding::no,
mutation_reader::forwarding::no,
[&readers] (size_t memtable_count) { readers.reserve(memtable_count); });
return make_combined_reader(std::move(schema), std::move(permit), std::move(readers), streamed_mutation::forwarding::no, mutation_reader::forwarding::no);
Member

It would be more accurate to report each memtable separately, but no big deal.

Contributor Author

I considered this, but thought it might interact weirdly with paging (a memtable could disappear between pages, unless we pin it).

Member

Wouldn't capturing the memtable in the reader (which is how it works) be enough? The memtable will still be flushable, in which case the reader will read from the resulting sstable.

It should just work, but we don't have to do it now.

@@ -36,6 +36,7 @@
#include "gms/feature_service.hh"
#include "system_keyspace_view_types.hh"
#include "schema/schema_builder.hh"
#include "schema/schema_registry.hh"
Member

I expected (after reading the more detailed description) that we'd see the source as an individual sstable, not "sstable" as a class. I guess it can be added as needed.

Contributor Author

See above.

Member

This would also just work. An sstable reader captures the sstable and prevents it from disappearing.

})});
rs->data_sources.push_back(data_source{"row_cache", mutation_source([] (schema_ptr schema, reader_permit permit, const dht::partition_range& pr, const query::partition_slice& ps) {
//TODO
return make_empty_flat_reader_v2(std::move(schema), std::move(permit));
Member

This would need to be a special row_cache reader that doesn't fall back to sstables on misses.

Contributor Author

Yes, turns out I can reuse the memtable's existing reader for this (done in v2).


static schema_ptr build_schema() {
auto id = generate_legacy_id(system_keyspace::NAME, "data_source");
return schema_builder(system_keyspace::NAME, "data_source", std::make_optional(id))
Member

It's still called data_source, I thought you renamed it.

Please add the word "debug" to the name so we'd be at liberty to play with it.

Member

Ah, still unpublished. Anyway, series looks good.

Contributor Author

I did rename it in v2, but it is not published yet.

@denesb
Contributor Author

denesb commented Jun 20, 2023

The virtual table interface breaks down badly at the clustering ordering, because list<text> (my abstraction for all possible clustering-key types) is ordered differently than tuple<$ck_type1, $ck_type2, ...> (the real underlying type).

I will have to look into the alternatives provided by @avikivity at #14083 (comment).

@denesb
Contributor Author

denesb commented Jun 20, 2023

After some digging, I think the way to go is a new kind of statement, similar to our existing SELECT JSON FROM statement: SELECT MUTATION_FRAGMENTS FROM $column_family WHERE ...; We could add host and mutation_source pseudo columns, that can also be used to restrict the input. I will think about this some more, but any early feed-back is welcome.
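A sketch of what such a statement might look like in use (the syntax is a proposal at this point, not a final or implemented form, and the mutation_source pseudo column is hypothetical):

```cql
-- Proposed (not yet implemented) statement shape: dump the mutation
-- fragments backing one partition, restricted via the hypothetical
-- 'mutation_source' pseudo column.
SELECT MUTATION_FRAGMENTS FROM ks.tbl
WHERE pk = 100 AND mutation_source = 'memtable';
```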

@tgrabiec
Contributor

You could put the keys as regular columns, and order by sequence number in the fragment stream.

@denesb
Contributor Author

denesb commented Jun 21, 2023

You could put the keys as regular columns, and order by sequence number in the fragment stream.

That would solve the ordering, but it would make it really awkward to select a single row or a range of rows. One would have to first scan the entire partition to obtain the row number. Also, the row number would change every time the rows change, so the mapping between this number and the clustering key would not be stable. Overall, I think this would lead to a poor UX.

@avikivity
Member

avikivity commented Jun 21, 2023

The virtual table interface breaks down badly at the clustering ordering, because list<text> (my abstraction for all possible clustering-key types) is ordered differently than tuple<$ck_type1, $ck_type2, ...> (the real underlying type).

I will have to look into the alternatives provided by @avikivity at #14083 (comment).

I think the easiest (but not prettiest) way is scylladb_mutation_fragment(*), returning a list<text>.

The reason it's the easiest is that we preserve one primary key == one row (but: how to present the partition tombstone and static row?)

Maybe the cleanest is a new statement. The new statement can return different metadata. Let's say the primary key is ((pk1, pk2), ck1, ck2). We'd return metadata of ((pk1, pk2), ck1, ck2, type, source) where type = memtable | sstable | cache and source names the particular memtable or sstable. The partition and static row would have ck1 = ck2 = NULL.

@denesb
Contributor Author

denesb commented Jun 21, 2023

Maybe the cleanest is a new statement. The new statement can return different metadata. Let's say the primary key is ((pk1, pk2), ck1, ck2). We'd return metadata of ((pk1, pk2), ck1, ck2, type, source) where type = memtable | sstable | cache and source names the particular memtable or sstable. The partition and static row would have ck1 = ck2 = NULL.

I agree, and I have already started working on this. I will publish an RFC as soon as I get the simplest example working. Thankfully, I can salvage most of the code I wrote for the virtual table; what I need to figure out is how to properly wire in the new statement. I think I'm getting there and hopefully will be able to publish an early RFC today or tomorrow.

@denesb
Contributor Author

denesb commented Jun 21, 2023

Closing this, as the virtual table approach is a dead-end. I opened a new PR for the new statement approach: #14347.

Successfully merging this pull request may close these issues.

Add ability to trace mutation form from each data source during query
7 participants