feat(frontend): Size of Object Functions (pg_table_size, pg_relation_size, pg_indexes_size) [Draft] #9013

Merged
70 commits merged into risingwavelabs:main from erichgess:7766-size-of-db-objects
Jul 3, 2023

Contributor

@erichgess erichgess commented Apr 6, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Draft PR

This Draft PR only implements pg_table_size. The other two functions can be implemented easily using the same code, but I wanted to get feedback on the current implementation while completing them.

This PR implements the functions pg_table_size, pg_relation_size, and pg_indexes_size for literal value arguments. This PR does not implement support for using references (such as a column name) as the argument for these functions.

In this PR, the above functions are computed entirely on the Frontend nodes by using the local Catalog to convert the function argument to a TableID. That TableID is then used to look up the stats for the table, index, or relation within a local copy of the HummockVersionStats, which are collected by the Meta nodes. To compute the size of an object, its total_key_size and total_value_size values are added together, and the sum is returned by the function.
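
The size computation described above can be sketched in a few lines of Rust. This is a minimal illustration, not the code from this PR: `TableStats`, `VersionStats`, and `object_size` are hypothetical stand-ins for the real HummockVersionStats types, and only the two fields named in the description are modelled.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one object's entry in the stats snapshot.
/// Only the two fields named in the PR description are modelled.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct TableStats {
    pub total_key_size: i64,
    pub total_value_size: i64,
}

/// Hypothetical stand-in for the Frontend's local copy of the stats,
/// keyed by table ID.
#[derive(Debug, Default)]
pub struct VersionStats {
    pub table_stats: HashMap<u32, TableStats>,
}

impl VersionStats {
    /// Size of an object = key bytes + value bytes.
    /// Returns `None` when no stats exist for the given ID.
    pub fn object_size(&self, table_id: u32) -> Option<i64> {
        self.table_stats
            .get(&table_id)
            .map(|s| s.total_key_size + s.total_value_size)
    }
}
```

With an entry whose key size is 20 and value size is 31, `object_size` returns `Some(51)`; an ID with no entry yields `None`, which leaves the "not found" behaviour up to the caller.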

The functions pg_table_size, pg_relation_size, and pg_indexes_size are implemented within the Frontend node as part of the Binder type. These functions can only take a literal integer value or a literal varchar. If an integer is given as the argument, it is treated as an Object ID and the function simply attempts to find an entry in HummockVersionStats with the same ID. If a varchar is given, the function uses the ObjectName parser to convert the value of the varchar into an ObjectName. The ObjectName is then used to look up the TableId so that the associated stats can be found. By using the Parser, any valid format for an object name in PG SQL can be used as the value of the varchar literal (e.g. '"my table"', 'public.foo', and 'public."my table"' are all valid arguments). If the object is found in HummockVersionStats, the total size of the object (keys + values) is returned.
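
To make the two argument paths concrete, here is a rough Rust sketch of the dispatch. It is an illustration under loud assumptions: `Literal`, `ObjectRef`, `split_object_name`, and `resolve_arg` are all hypothetical names, and the hand-rolled splitter only handles dots and double quotes. The PR itself reuses the real SQL parser precisely so that no such parallel parser has to be maintained.

```rust
/// Hypothetical model of the literal passed to a pg_*_size function.
pub enum Literal<'a> {
    Int(u32),
    Varchar(&'a str),
}

/// Hypothetical result of resolving the argument.
#[derive(Debug, PartialEq)]
pub enum ObjectRef {
    Id(u32),
    Name { schema: Option<String>, name: String },
}

/// Simplified splitter (illustration only, NOT the SQL parser used by the PR).
/// Handles `.` separators, double-quoted parts, and `""` as an escaped quote.
pub fn split_object_name(input: &str) -> Result<Vec<String>, String> {
    let mut parts = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            // A doubled quote inside a quoted identifier is a literal quote.
            '"' if in_quotes && chars.peek() == Some(&'"') => {
                chars.next();
                current.push('"');
            }
            '"' => in_quotes = !in_quotes,
            '.' if !in_quotes => parts.push(std::mem::take(&mut current)),
            // Unquoted identifiers are folded to lower case, as in Postgres.
            c if !in_quotes => current.push(c.to_ascii_lowercase()),
            c => current.push(c),
        }
    }
    if in_quotes {
        return Err(format!("unterminated quoted identifier in {:?}", input));
    }
    parts.push(current);
    if parts.iter().any(|p| p.is_empty()) {
        return Err(format!("empty identifier in {:?}", input));
    }
    Ok(parts)
}

/// Integers are treated as object IDs; varchars are split into schema and name.
pub fn resolve_arg(lit: Literal<'_>) -> Result<ObjectRef, String> {
    match lit {
        Literal::Int(id) => Ok(ObjectRef::Id(id)),
        Literal::Varchar(s) => {
            let mut parts = split_object_name(s)?;
            match parts.len() {
                1 => Ok(ObjectRef::Name { schema: None, name: parts.remove(0) }),
                2 => {
                    let name = parts.remove(1);
                    let schema = Some(parts.remove(0));
                    Ok(ObjectRef::Name { schema, name })
                }
                n => Err(format!("expected 1 or 2 name parts, got {}", n)),
            }
        }
    }
}
```

For example, `resolve_arg(Literal::Varchar("public.\"my table\""))` yields a name with schema `public` and table `my table`, while `resolve_arg(Literal::Int(1001))` yields `ObjectRef::Id(1001)`.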

A "virtual" table, rw_table_stats, is implemented. It contains the table stats data and acts as an interface between the query execution engine and the HummockVersionStats data pushed from the Meta node. Calls to pg_*_size functions are converted to queries on this table and are executed by the query engine in local mode.

In order for the Frontend nodes to have a local copy of the HummockVersionStats, a new Meta node notification event called HummockStats was added, which the Meta nodes use to send table stat updates to Frontend and Compute nodes. These events are generated whenever tables are compacted or an epoch is committed.
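
The notification flow can be pictured roughly as follows. This is a sketch under assumptions rather than the actual observer code: `StatsSnapshot`, `LocalStats`, and `on_notification` are hypothetical names, and it assumes each HummockStats event carries a complete snapshot that replaces the previous one, which the description above does not specify.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical payload of one stats notification: object ID -> total size.
/// Assumption: each event carries a full snapshot, not a delta.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct StatsSnapshot {
    pub sizes: HashMap<u32, i64>,
}

/// Hypothetical local copy held by a Frontend node, shared between the
/// notification handler (writer) and query binding (readers).
#[derive(Clone, Default)]
pub struct LocalStats(Arc<RwLock<StatsSnapshot>>);

impl LocalStats {
    /// Called when a stats notification arrives from the Meta node:
    /// replaces the local snapshot wholesale.
    pub fn on_notification(&self, incoming: StatsSnapshot) {
        *self.0.write().unwrap() = incoming;
    }

    /// Called while evaluating a pg_*_size function.
    pub fn size_of(&self, object_id: u32) -> Option<i64> {
        self.0.read().unwrap().sizes.get(&object_id).copied()
    }
}
```

Because readers only ever see the most recently delivered snapshot, a size query issued before the next compaction or epoch commit may return a slightly stale value.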

Checklist For Contributors

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features; see "Sqlsmith: Sql feature generation" #7934.)
  • I have demonstrated that backward compatibility is not broken by breaking changes and created issues to track deprecated features to be removed in the future. (Please refer to the issue)
  • All checks passed in ./risedev check (or alias, ./risedev c)

Checklist For Reviewers

  • I have requested macro/micro-benchmarks as this PR can affect performance substantially, and the results are shown.

Documentation

Click here for Documentation

Types of user-facing changes

Please keep the types that apply to your changes, and remove the others.

  • SQL commands, functions, and operators

Release note

This PR adds two Postgres functions: pg_table_size and pg_indexes_size. It also expands the domain of values that the ::regclass cast can be applied to. It now works with object names that include their parent schema (e.g., 'public.test'::regclass), and it also works with integers (when applied to an integer, it simply resolves to the integer value itself, mirroring Postgres).

  • pg_table_size: this function will return the amount of space, in total, taken up by the specified table. If given the name of an index it will return the total size of that index.
  • pg_indexes_size: this will return the total space taken up by all indexes on a given table. If given the name of an index, it will return 0. If you want the size of a specific index you can use pg_table_size.
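
The difference between the two functions can be modelled with a small Rust sketch. It is illustrative only: `Kind`, `SizeModel`, and the method bodies are hypothetical and do not mirror the Frontend implementation. Unknown objects are mapped to `None` because the release note does not say what happens in that case.

```rust
use std::collections::HashMap;

/// Hypothetical catalog entry: a table knows its indexes; an index has none.
pub enum Kind {
    Table { index_ids: Vec<u32> },
    Index,
}

/// Hypothetical model combining catalog information with per-object sizes.
pub struct SizeModel {
    pub kinds: HashMap<u32, Kind>,
    pub sizes: HashMap<u32, i64>,
}

impl SizeModel {
    /// Size of the named object itself, whether it is a table or an index.
    pub fn pg_table_size(&self, id: u32) -> Option<i64> {
        self.sizes.get(&id).copied()
    }

    /// Sum of the sizes of all indexes on a table; 0 when given an index.
    pub fn pg_indexes_size(&self, id: u32) -> Option<i64> {
        match self.kinds.get(&id)? {
            Kind::Table { index_ids } => Some(
                index_ids
                    .iter()
                    .filter_map(|i| self.sizes.get(i).copied())
                    .sum(),
            ),
            Kind::Index => Some(0),
        }
    }
}
```

With a table of size 51 that has a single index of size 43, `pg_table_size` returns 51 for the table and 43 for the index, while `pg_indexes_size` returns 43 for the table and 0 for the index.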

src/frontend/src/binder/select.rs (outdated)
src/meta/src/hummock/manager/mod.rs (outdated)
Comment on lines 522 to 526
// We use the full parser here because this function needs to accept every legal way
// of identifying a table in PG SQL as a valid value for the varchar
// literal. For example: 'foo', 'public.foo', '"my table"', and
// '"my schema".foo' must all work as values passed pg_table_size.
let mut tokenizer = Tokenizer::new(name);
Collaborator

It looks a little bit weird to me to use the parser in the binder. cc @xiangjinwu Is this a good practice? Any suggestions?

Contributor

Yes, it is weird, but also somewhat reasonable because we are doing introspection here. PG even has a SQL function parse_ident, but it looks similar to SplitIdentifierString. My suggestion would be to extract this to a utility function outside the binder, and still use the parser inside to avoid the burden of maintaining a duplicate implementation. This does sound like cheating and makes no difference at run time. 😂

Contributor Author

I thought about writing something custom for this, but we need to support any valid representation of a table which would mean building a parallel parser for just table names that are passed as varchar literals and, critically, having to always remember that any change to the SQL parser would have to be reflected in the parallel parser. I couldn't come up with a good reason that would be worth that maintenance load.

Putting into a helper function makes sense to me: what would be the best place to put it?

src/frontend/src/binder/select.rs (outdated)
@erichgess erichgess changed the title from "7766 Size of Objects (pg_table_size, pg_relation_size, pg_indexes_size) [Draft]" to "feat(frontend): Size of Object Functions (pg_table_size, pg_relation_size, pg_indexes_size) [Draft]" on Apr 6, 2023
let (schema_name, table_name) =
Self::resolve_schema_qualified_name(&self.db_name, object_name)?;

self.bind_table(schema_name.as_deref(), &table_name, None)
Contributor Author

Should I bind the table whose size is being looked up? I did this because bind_table was the existing function on Binder that would provide the TableId needed to query rw_table_stats. But binding the table from the pg_table_size argument strikes me as less than safe; instead, I should probably use a function that looks up the Table information but does not bind the table to the query compiler context.

@erichgess
Contributor Author

@yezizp2012 @hzxa21 is there a way to have the e2e test script ci/scripts/e2e-test-parallel-in-memory.sh skip a specific test? The tests for table and index size fail when run "in memory" because, since it's in-memory, there's no space taken up by the table and all calls to pg_table_size and pg_indexes_size return 0.

This is assuming that e2e_test/batch/catalog is the best place to put tests for pg_table_size. There doesn't appear to be a more appropriate place.

@lmatz
Contributor

lmatz commented May 16, 2023

Hi @erichgess, sorry for the late response.

After we implement risinglightdb/sqllogictest-rs#177, then we can skip the test and merge the PR soon.

@lmatz
Contributor

lmatz commented Jun 9, 2023

risinglightdb/sqllogictest-rs#179 is merged. Once there is a new release of sqllogictest-rs, we can modify the test cases accordingly.

@TennyZhuang TennyZhuang self-requested a review June 28, 2023 01:38
@TennyZhuang
Collaborator

TennyZhuang commented Jun 28, 2023

It seems that a new version of sqllogictest has been released, but this PR has been forgotten.

@erichgess Do you have time to resolve the conflicts? Or I can help you.
@wangrunji0408 @lmatz Can you help push for this PR to be merged? e.g. sqllogictest related issues.

@lmatz
Contributor

lmatz commented Jun 28, 2023

It seems that a new version of sqllogictest has been released, but this PR has been forgotten.

@erichgess Do you have time to resolve the conflicts? Or I can help you. @wangrunji0408 @lmatz Can you help push for this PR to be merged? e.g. sqllogictest related issues.

I can, but I don't know how to push changes into erichgess:7766-size-of-db-objects, is it even possible?

@TennyZhuang
Collaborator

I can, but I don't know how to push changes into erichgess:7766-size-of-db-objects, is it even possible?

It's allowed by default. https://stackoverflow.com/questions/63341296/github-pull-request-allow-edits-by-maintainers

@lmatz
Contributor

lmatz commented Jun 30, 2023

dev=> create table t (v1 int);
CREATE_TABLE
dev=> insert into t values (3);
INSERT 0 1
dev=> SELECT pg_table_size('t');
 pg_table_size 
---------------
            51
(1 row)

dev=> SELECT pg_indexes_size('t');
 pg_indexes_size 
-----------------
               0
(1 row)

dev=> create index t_idx on t (v1);
CREATE_INDEX
dev=> flush;
FLUSH
dev=> select pg_indexes_size('t');
 pg_indexes_size 
-----------------
              43
(1 row)

Conflicts solved

Contributor

@lmatz lmatz left a comment

LGTM(pass the tests)

@hzxa21 @xiangjinwu @xxchan @TennyZhuang

@lmatz lmatz requested a review from yezizp2012 July 3, 2023 07:06
@lmatz lmatz added user-facing-changes Contains changes that are visible to users type/feature labels Jul 3, 2023
Contributor

@yezizp2012 yezizp2012 left a comment

LGTM!

@lmatz lmatz added this pull request to the merge queue Jul 3, 2023
Merged via the queue into risingwavelabs:main with commit 5de8c44 Jul 3, 2023
40 of 41 checks passed
Labels
type/feature user-facing-changes Contains changes that are visible to users

7 participants