
implement system administration functions that get the size of table, index, and MV #7766

Closed
3 of 4 tasks
Tracked by #2954 ...
lmatz opened this issue Feb 8, 2023 · 35 comments

@lmatz
Contributor

lmatz commented Feb 8, 2023

  • implement pg_table_size
  • implement pg_relation_size
  • implement pg_indexes_size
  • implement pg_size_pretty

I didn't find a function that gets the size of a single index, though?

  1. Users have requested a way to check the size of each table, index, and MV (all three are implemented as MVs internally). Right now, although we have a Grafana dashboard that shows the size of MVs, it does not show the name or the type of the object (which can be any of the three types).
  2. Sometimes the user may not have access to the Grafana dashboard.
  3. Also, since PG supports these functions, it is better for RW to support them too. Some DB management tools may also require them.

It does not conflict with the metrics currently collected by Prometheus and shown on Grafana: those are primarily driven by developers' needs, show a series of data points over time rather than a snapshot, and can take whatever form is helpful for development and debugging, as decided by the devs.

With these functions completed, Cloud will be able to integrate them into the Cloud catalog.

@github-actions github-actions bot added this to the release-0.1.17 milestone Feb 8, 2023
@lmatz lmatz added type/feature priority/high help wanted Issues that need help from contributors labels Feb 8, 2023
@neverchanje
Contributor

neverchanje commented Feb 9, 2023

sometimes the user may not have access to the Grafana dashboard.

Just curious. Why would it happen? When will a user think it's a must to query metrics through SQL?

@lmatz
Contributor Author

lmatz commented Feb 9, 2023

sometimes the user may not have access to the Grafana dashboard.

Just curious. Why would it happen? When will a user think it's a must to query metrics through SQL?

Maybe they only have a terminal.
Also, it seems Cloud's Grafana is not fully ready,
and I am not sure whether it will be opened up to average users.

@hzxa21
Collaborator

hzxa21 commented Feb 20, 2023

We already maintain the key-value total size on a per-table basis and persist it in the meta store (#6390). We can expose it via SQL. Is this an urgent issue? It seems like a candidate for good-first-issue.

@lmatz
Contributor Author

lmatz commented Feb 20, 2023

The intent is to let Cloud users discover these sizes by themselves. Last time, @zwang28 manually checked the Cloud Grafana dashboard (not accessible by users) and used risedev ctl (also not accessible by users) against the meta node to get the id-name mapping.
Therefore, Cloud users currently have no access to the table name -> table size information.

As Cloud is adding support for showing the information contained in the catalog right now, once this feature is implemented, Cloud can show the information directly on the web console instead of Grafana.

Not urgent in the sense that it doesn't block anything.

@lmatz lmatz added the good first issue Good for newcomers label Feb 20, 2023
@erichgess
Contributor

May I hop on this issue?

@lmatz
Contributor Author

lmatz commented Mar 7, 2023

May I hop on this issue?

Sure @erichgess!

Feel free to raise questions if you have any. Thanks a lot!

@erichgess
Contributor

@lmatz thanks! I may be a few days delayed in working on this, as my laptop died this morning.

@erichgess
Contributor

While I'm waiting on getting my laptop fixed, I wanted to confirm that these are the function specs:

  1. https://pgpedia.info/p/pg_table_size.html
  2. https://pgpedia.info/p/pg_relation_size.html
  3. https://pgpedia.info/p/pg_indexes_size.html

Can the functionality used by the Prometheus metrics (mentioned in the earlier posts) be used to implement these functions?

@erichgess
Contributor

@lmatz I will not be able to get to this issue until the weekend; I hope that's not a problem.

@hzxa21
Collaborator

hzxa21 commented Mar 10, 2023

While I'm waiting on getting my laptop fixed, I wanted to confirm that these are the function specs:

  1. https://pgpedia.info/p/pg_table_size.html
  2. https://pgpedia.info/p/pg_relation_size.html
  3. https://pgpedia.info/p/pg_indexes_size.html

Can the functionality used by the Prometheus metrics (mentioned in the earlier posts) be used to implement these functions?

We already store the physical key count of a table here, so I think we can leverage that instead of relying on Prometheus (in fact, the Prometheus metrics also come from the version stats we maintain).

@lmatz I will not be able to get to this issue until the weekend; I hope that's not a problem.

No problem. Take your time. Let us know if you have any questions.

@erichgess
Contributor

Can the functionality used by the Prometheus metrics (mentioned in the earlier posts) be used to implement these functions?

We already store the physical key count of a table here, so I think we can leverage that instead of relying on Prometheus (in fact, the Prometheus metrics also come from the version stats we maintain).

Cool, this makes sense to me. I figured the Prometheus metrics used some existing feature to derive their data and that these functions just needed to use those same "primitives"; glad to know that's the case.

@fuyufjh
Contributor

fuyufjh commented Mar 22, 2023

Hi @erichgess Any updates?

@fuyufjh fuyufjh modified the milestones: release-0.18, release-0.19 Mar 22, 2023
@erichgess
Contributor

erichgess commented Mar 23, 2023 via email

@erichgess
Contributor

@hzxa21 I have two questions. I've been grepping through the code and have not been able to find good examples of the following two operations:

  1. What's the best way to get access to the HummockVersionStats so that I can look up the stats for a TableId? I'm assuming that the table_stats HashMap maps TableIds to the stats for that table.
  2. Is there an example where the query compiler takes a table name and looks up the TableId? I'm assuming that I'll need to do this to get the u32 that is needed for HummockVersionStats::table_stats.

@fuyufjh
Contributor

fuyufjh commented Mar 27, 2023

@hzxa21 I have two questions. I've been grepping through the code and have not been able to find good examples of the following two operations:

  1. What's the best way to get access to the HummockVersionStats so that I can look up the stats for a TableId? I'm assuming that the table_stats HashMap maps TableIds to the stats for that table.
  2. Is there an example where the query compiler takes a table name and looks up the TableId? I'm assuming that I'll need to do this to get the u32 that is needed for HummockVersionStats::table_stats.

These functions run in the frontend, while the requested data (HummockVersionStats) lives in the meta node, which makes this feature harder than it looks at first glance.

The most closely related component is HummockSnapshotManager. You might need to enhance it to support fetching these table_stats from the meta node, as it does for epochs. Alternatively, you can send an RPC to meta to retrieve this info every time a user calls these functions.

I am not an expert on this part. Please correct me if I'm wrong. @hzxa21

@erichgess
Contributor

My initial design for this was to add a new function to the expr crate, with this signature:

#[function("pg_table_size(int32) -> int32")] // TODO: is int32 the right type for this operation?
pub fn pg_table_size(table_id: i32) -> i32 {
    todo!() // query HummockVersionStats for the table's size at evaluation time
}

This function would take the TableId (looked up in the frontend), then query HummockVersionStats (I assumed through HummockSnapshotManager) at evaluation time and return the size (i32 is an initial placeholder until I find out the correct type to use).

But I wasn't able to find where TableIds are looked up while grepping through the frontend code.

This was based upon my initial, though wrong, understanding that HummockVersionStats was local to each node.

I assume the correct design is to query a Meta node to get the stats for a table, so I see two possible designs:

  1. The frontend queries the Meta node for the TableId and the table size, and then replaces pg_table_size (and the other functions) with an integer literal.
  2. The frontend queries the Meta node for the TableId and creates a pg_table_size node, and then the implementation of pg_table_size queries the Meta node for the size.

Assuming these size functions are not executed often, my instinct is to query the Meta node because it's the simplest design and being slower than other designs is mitigated by the functions being rarely executed.

Design 1 seems more correct to me, because I would want to minimize calling Meta nodes from within the expression evaluators (unless that's perfectly acceptable, in which case I would go with design 2). But it would mean there could be considerable lag between when the frontend looks up the size of an object (table, index, MV) and when the rest of the expression is evaluated.

As a follow-up question: is HummockSnapshotManager local to each node? And does it hold global state for a table, or only the metadata for the partition of the table that's on that node? If it's only local metadata, then I assume we'd have to execute these functions across all the nodes and sum the results together?

@hzxa21
Collaborator

hzxa21 commented Mar 27, 2023

The most closely related component is HummockSnapshotManager. You might need to enhance it to support fetching these table_stats from the meta node, as it does for epochs. Alternatively, you can send an RPC to meta to retrieve this info every time a user calls these functions.

+1. I think we can enhance the hummock snapshot notification to include HummockVersionStats updates so that each frontend node has its own cache of the stats info. With this push-based approach, we avoid the complexity of querying the meta node on each pg_table_size call and don't need to worry about optimizing the extra RPC. Also, this is consistent with how we query system catalogs on the current main.

Some more background here: currently we leverage push-based notifications to propagate meta information from the meta service to the frontend and compute nodes. Each frontend node and compute node has its own observer manager that listens for changes from the meta service. The meta service pushes metadata updates to these observer managers via a streaming RPC. For example, catalogs and hummock snapshots are pushed to all frontend nodes via this mechanism.
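
To make the flow concrete, here is a rough sketch of the frontend side of such a notification handler. All names here are hypothetical, not the actual observer API:

// Hypothetical sketch: the frontend's observer receives a stats update
// streamed from the meta service and swaps it into a shared cache.
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Per-table stats as carried by a HummockVersionStats update
// (field names follow the ones mentioned in this thread).
pub struct TableStats {
    pub total_key_count: i64,
    pub total_key_size: i64,
    pub total_value_size: i64,
}

// Shared cache owned by the frontend node; readers (e.g. the system
// catalog reader) hold a clone of the Arc.
pub type StatsCache = Arc<RwLock<HashMap<u32, TableStats>>>;

// Called by the observer manager whenever meta pushes a new version:
// replace the local snapshot with the latest one.
pub fn handle_version_stats_notification(cache: &StatsCache, update: HashMap<u32, TableStats>) {
    *cache.write().unwrap() = update;
}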

@erichgess
Contributor

This makes sense to me.

So, with this design, the frontend would resolve calls to functions like pg_table_size by looking up the table size in the cached Hummock stats and then emit an integer literal rather than a "function call" when compiling the query plan?

Are there consistency issues across frontend nodes if some nodes' caches have missed notifications? The Meta Service doc mentions that the master node must wait for other nodes to acknowledge before continuing; is this the master node within the set of Meta nodes? Does this mean that transactions will wait until all nodes have updated their local metadata?

@hzxa21
Collaborator

hzxa21 commented Mar 28, 2023

the frontend would resolve calls to functions like pg_table_size by looking up the table size in the cached Hummock stats and then emit an integer literal rather than a "function call" when compiling the query plan

Though this approach works, I suggest we go a step further and maintain our own system table that exposes all the information from the hummock stats, instead of surfacing just the size via pg_table_size.

To be more specific, I suggest we implement a rw_table_stats system table in rw_catalog (example here) and let pg_table_size query rw_table_stats (example here). That way, we can show more stats of a table (e.g. total_key_size, total_value_size, total_key_count) instead of just the total size when querying rw_table_stats.
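
For illustration, here is a sketch of what the reader behind rw_table_stats might materialize from the cached stats. The row type and function name are made up for this example, not the actual rw_catalog API:

use std::collections::HashMap;

pub struct TableStats {
    pub total_key_count: i64,
    pub total_key_size: i64,
    pub total_value_size: i64,
}

// One row of the proposed rw_table_stats system table.
pub struct RwTableStatsRow {
    pub table_id: u32,
    pub total_key_count: i64,
    pub total_key_size: i64,
    pub total_value_size: i64,
}

// A virtual-table reader: rows are generated on demand from the stats
// cache rather than read from persisted storage.
pub fn read_rw_table_stats(stats: &HashMap<u32, TableStats>) -> Vec<RwTableStatsRow> {
    stats
        .iter()
        .map(|(id, s)| RwTableStatsRow {
            table_id: *id,
            total_key_count: s.total_key_count,
            total_key_size: s.total_key_size,
            total_value_size: s.total_value_size,
        })
        .collect()
}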

@broccoliSpicy
Contributor

broccoliSpicy commented Mar 28, 2023

the frontend would resolve calls to functions like pg_table_size by looking up the table size in the cached Hummock stats and then emit an integer literal rather than a "function call"

Actually, there might be another issue with this approach. Consider queries like these:

postgres=# create table t1(a int);
CREATE TABLE
postgres=# create table t2(a real);
CREATE TABLE
postgres=# create table t3(a varchar);
CREATE TABLE
postgres=# insert into t3 values('t1');
INSERT 0 1
postgres=# insert into t3 values('t2');
INSERT 0 1
postgres=# select pg_table_size(a) from t3;
 pg_table_size 
---------------
             0
             0
(2 rows)
postgres=# insert into t1 values (7);
INSERT 0 1
postgres=# insert into t2 values (6.28);
INSERT 0 1
postgres=# select pg_table_size(a) from t3;
 pg_table_size 
---------------
          8192
          8192
(2 rows)

For use cases like these, we might need to go through the function-call framework and do some RPC on the compute node side.
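
To make the distinction concrete: with a non-literal argument, the table name is only known per input row, so the lookup has to happen at execution time on whichever node evaluates the expression. A rough sketch of that shape (names are hypothetical; the lookup could hit a local cache or an RPC):

use std::collections::HashMap;

pub struct TableStats {
    pub total_key_size: i64,
    pub total_value_size: i64,
}

// Evaluate pg_table_size over a column of table names: each row needs
// its own name -> id -> stats lookup, which cannot be folded away at
// bind time the way a single literal argument can.
pub fn eval_pg_table_size_per_row(
    names: &[&str],
    name_to_id: &HashMap<String, u32>,
    stats: &HashMap<u32, TableStats>,
) -> Vec<i64> {
    names
        .iter()
        .map(|name| {
            name_to_id
                .get(*name)
                .and_then(|id| stats.get(id))
                .map(|s| s.total_key_size + s.total_value_size)
                .unwrap_or(0)
        })
        .collect()
}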

@erichgess
Contributor

erichgess commented Mar 28, 2023

the frontend would resolve calls to functions like pg_table_size by looking up the table size in the cached Hummock stats and then emit an integer literal rather than a "function call" when compiling the query plan

Though this approach works, I suggest we go a step further and maintain our own system table that exposes all the information from the hummock stats, instead of surfacing just the size via pg_table_size.

To be more specific, I suggest we implement a rw_table_stats system table in rw_catalog (example here) and let pg_table_size query rw_table_stats (example here). That way, we can show more stats of a table (e.g. total_key_size, total_value_size, total_key_count) instead of just the total size when querying rw_table_stats.

This design feels the most correct to me: all queries about any system or user state should ultimately be rooted in a table. My prior proposal would mean that some state queries go to random internal objects while every other query about state goes to a user table or a system table. This also decouples how object stats are tracked from how they are queried.

If we implemented a rw_table_stats system table, how would it be updated? Would the Meta node receive updates to the Hummock stats and then write the changes to rw_table_stats?

For table creation, I assume there's an existing mechanism for creating other RW system tables that can be leveraged to ensure the rw_table_stats table is always created when a new database is created. It looks like Postgres has a template that it copies when new databases are created.

How do we handle adding rw_table_stats to existing databases? What happens if this change is applied to a system that does not have rw_table_stats tables? We would then have a situation where someone runs pg_table_size (or the index/MV variants), we attempt to query rw_table_stats, and it doesn't exist. Is there an existing protocol for adding new system tables so this situation is handled well?

After reading through the code more carefully, I think my questions above are not relevant. From what I understand, rw_table_stats would not be a real table but, essentially, a virtual table that, when queried, routes to a function returning a set of dynamically generated rows. So, in this case, that function would take the Hummock stats data pushed out from the Meta node, convert it into the schema of rw_table_stats, and return it as a Vec<OwnedRow>, similar to here.

Note, just for the sake of clarity: whenever I refer to pg_table_size, I mean all of the functions we want to add that provide stats about different database objects: tables, indexes, etc.

@erichgess
Contributor

@hzxa21 so we would update the HummockSnapshot PB message to include HummockVersionStats data:

Something like this:

// We will have two epoch after decouple
message HummockSnapshot {
  // Epoch with checkpoint, we will read durable data with it.
  uint64 committed_epoch = 1;
  // Epoch without checkpoint, we will read real-time data with it. But it may be rolled back.
  uint64 current_epoch = 2;

  HummockVersionStats stats = 3;  // Add stats to the Snapshot message
}

This is using the existing PB HummockVersionStats type.

Or would it be better to add a new event type to SubscribeResponse?

Then, for frontend nodes, I would update handle_hummock_snapshot_notification to update both the local epoch and a new table-stats value.

The Compute node's observer has no logic to handle HummockSnapshot events, so I would need to add that logic and add a local object to store the most recent Hummock stats.

@hzxa21
Collaborator

hzxa21 commented Mar 29, 2023

Or would it be better to add a new event type to SubscribeResponse?

IMO, adding a new event type seems clearer, since only a subset of observers are interested in this event.

@erichgess
Contributor

erichgess commented Apr 1, 2023

@hzxa21

I'm working through the implementation of this feature, and I have a couple of questions.

For storing the stats data on frontend nodes, I'm considering which struct should be responsible for owning the HummockVersionStats data sent from the Meta node. There seem to be two options: store a shared reference to the table stats in FrontendEnv and SystemCatalogReaderImpl, or store the table stats in the Catalog type (which is shared between FrontendEnv and SystemCatalogReaderImpl). Since this information is used to construct a system table, it makes sense to me for it to be owned by Catalog, but I'm still a new developer here, so I don't know what the preferences are.

For the Compute nodes, I've been reading through the code under ./src/compute, but I don't see an obvious place for the stats to be stored, and I don't see how the stats would be injected into a query in the Compute node code either. The only thing that occurs to me is to create a new "Source" executor type in batch and streaming which, when created, takes a reference to the local HummockVersionStats (or its wrapper) and then emits that data as the rows of a table.

@fuyufjh
Contributor

fuyufjh commented Apr 3, 2023

For storing the stats data on frontend nodes, I'm considering which struct should be responsible for owning the HummockVersionStats data sent from the Meta node. There seem to be two options: store a shared reference to the table stats in FrontendEnv and SystemCatalogReaderImpl, or store the table stats in the Catalog type (which is shared between FrontendEnv and SystemCatalogReaderImpl). Since this information is used to construct a system table, it makes sense to me for it to be owned by Catalog, but I'm still a new developer here, so I don't know what the preferences are.

Agree with you. Either place looks fine to me, and I slightly prefer Catalog as well.

For the Compute nodes, I've been reading through the code under ./src/compute, but I don't see an obvious place for the stats to be stored, and I don't see how the stats would be injected into a query in the Compute node code either. The only thing that occurs to me is to create a new "Source" executor type in batch and streaming which, when created, takes a reference to the local HummockVersionStats (or its wrapper) and then emits that data as the rows of a table.

Sounds like you were planning to read these stats on Compute nodes? Currently, we only read these metadata-related things on the Frontend node. For example:

  1. If users query any tables in pg_catalog, the query will run in local execution mode, with a special executor SysCatalogReaderImpl that reads the meta info and returns it as rows. See also src/frontend/src/catalog/system_catalog/pg_catalog/mod.rs.
  2. If users call metadata-related functions like pg_table_is_visible(), the function call is replaced with its result in the Binder stage. See also src/frontend/src/binder/expr/function.rs.

Assuming you follow option 2, you might need to add a reference to HummockVersionStats to the Binder's context. I didn't go through the details, so feel free to correct me.
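
A quick sketch of what option 2 could look like for pg_table_size, assuming the Binder can reach a name -> id mapping from the catalog and the cached stats (all names here are illustrative, not the actual Binder API):

use std::collections::HashMap;

pub struct TableStats {
    pub total_key_size: i64,
    pub total_value_size: i64,
}

// The bound expression becomes a plain integer literal instead of a
// function call.
pub enum BoundExpr {
    Literal(i64),
    // ... other variants elided
}

// Bind-time folding: resolve the varchar literal to a table id via the
// catalog, then read the size out of the stats cache.
pub fn bind_pg_table_size(
    table_name: &str,
    name_to_id: &HashMap<String, u32>,
    stats: &HashMap<u32, TableStats>,
) -> Option<BoundExpr> {
    let id = name_to_id.get(table_name)?;
    let s = stats.get(id)?;
    Some(BoundExpr::Literal(s.total_key_size + s.total_value_size))
}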

@erichgess
Contributor

Sounds like you were planning to read these stats on Compute nodes? Currently, we only read these metadata-related things on the Frontend node.

@hzxa21 mentioned that we'd want to push the data to the compute nodes, so I'd assumed that was because some queries would go to the compute nodes. If queries against a system table never go to a compute node, then I don't know why we'd need to push stats data to them, but I would love to learn.

Executing only on the Frontend would simplify this task.

@xxchan
Member

xxchan commented Apr 4, 2023

To summarize the discussion on Slack: https://risingwave-community.slack.com/archives/C03BW6YSBPB/p1680634594673689

Supporting general queries via a system table would be hard. What's really hard is the identifier-resolution part (i.e., the varchar::regclass cast).

So we'd better KISS first: support only varchar literals and evaluate entirely on the Frontend.

@h3n4l

h3n4l commented Apr 5, 2023

sometimes the user may not have access to the Grafana dashboard.

Just curious. Why would it happen? When will a user think it's a must to query metrics through SQL?

Some database tools, like Bytebase, gather statistics by executing queries such as SELECT pg_table_size(...). In fact, it's annoying when a database that claims to be PostgreSQL-compatible doesn't support this, and it takes a lot of time to figure out how to implement the equivalent on that database. : \

@xxchan
Member

xxchan commented Apr 5, 2023

@h3n4l Thanks for mentioning this. I think it's somewhat unavoidable, since supporting such admin features (in a PostgreSQL-compatible way) is quite difficult 🥲. AFAIK CockroachDB hasn't supported this for years either.

BTW, would you mind sharing an example of how Bytebase uses such queries, and what kinds of features they implement?

@xxchan xxchan removed good first issue Good for newcomers help wanted Issues that need help from contributors labels Apr 5, 2023
@erichgess
Contributor

erichgess commented Apr 5, 2023

@h3n4l would you happen to know how Bytebase queries that data?

I'm curious because my proposal was to break this ticket into two parts:

  1. We support explicit queries like select pg_table_size('foo'); to get the size of a table. This design would only support a varchar literal as the argument for pg_table_size. (I've already got support for this implemented and will open a PR for feedback shortly.) (Edit: PG also supports passing in the object ID of the table to get its size, so I'll make sure we support that too; e.g. select pg_table_size(1001) would return the size of the table with object ID 1001. See the sketch after this list.)
  2. Expand pg_table_size so that it can take a reference as an argument, not just varchar literals. This would support queries like SELECT pg_table_size(table_name) FROM important_tables;.

Depending on how Bytebase works, part 1 may be sufficient for it to be fully functional.
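
Here is a small sketch of the part-1 argument handling. The types are hypothetical; in PG the parameter is a regclass, which coerces from both a name and an OID:

use std::collections::HashMap;

// pg_table_size accepts either a table name (varchar literal) or the
// object ID itself (integer literal); both resolve to the same table id.
pub enum SizeArg<'a> {
    Name(&'a str),
    Oid(u32),
}

pub fn resolve_size_arg(arg: SizeArg<'_>, name_to_id: &HashMap<String, u32>) -> Option<u32> {
    match arg {
        SizeArg::Name(name) => name_to_id.get(name).copied(),
        SizeArg::Oid(oid) => Some(oid),
    }
}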

@h3n4l

h3n4l commented Apr 5, 2023

@h3n4l Thanks for mentioning this. I think it's somewhat unavoidable, since supporting such admin features (in a PostgreSQL-compatible way) is quite difficult 🥲. AFAIK CockroachDB hasn't supported this for years either.

BTW, would you mind sharing an example of how Bytebase uses such queries, and what kinds of features they implement?

Hi @xxchan, I strongly agree with "I think it's somewhat unavoidable, since supporting such admin features (in a PostgreSQL-compatible way) is quite difficult 🥲." I think it's just the result of seeing things from different roles: I develop tools, so I want databases to be as compatible as possible, but if I were building a database like RisingWave, these management features would often not be the highest priority when resources are limited. : )

BTW, would you mind sharing an example of how Bytebase uses such queries, and what kinds of features they implement?

For now, Bytebase just displays this simple information, like the following:
[screenshot: Bytebase database overview showing per-table sizes]

https://demo.bytebase.com/db/shop-7015#overview

Missing this information doesn't matter much, but you know, people always want interfaces to be aligned (i.e., whatever other databases have, yours had better have too).

Anyway, RisingWave is a great product; there's no doubt about it.

@h3n4l

h3n4l commented Apr 5, 2023

Hi @erichgess, if you want to know the technical details, take a look at this part of the code that Bytebase uses for PostgreSQL. Part 1 seems like it would work well, but I'm not sure whether additional work is required to go from regclass to oid.
https://sourcegraph.com/github.com/bytebase/bytebase/-/blob/backend/plugin/db/pg/sync.go?L316

@erichgess
Contributor

Hi @erichgess, if you want to know the technical details, take a look at this part of the code that Bytebase uses for PostgreSQL. Part 1 seems like it would work well, but I'm not sure whether additional work is required to go from regclass to oid. https://sourcegraph.com/github.com/bytebase/bytebase/-/blob/backend/plugin/db/pg/sync.go?L316

Yes, it looks like this would require the full implementation from step 2 rather than the simpler implementation from step 1.

@erichgess
Contributor

Added a draft PR here: #9013

@erichgess
Contributor

erichgess commented Apr 7, 2023

@hzxa21 @lmatz pg_relation_size appears to report the sizes of internal PG storage structures (relation forks) that are not tracked in HummockVersionStats (and may simply not exist within RW) (see: here and here).

We could make it an alias for pg_table_size, but given that its design specifically returns data that may not even make sense for RW, should it be implemented at all?

Also, do we want to implement pg_total_relation_size, which is the sum of the table size and the size of all indexes on the table?
