Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

column_family: Make toppartitions queries more generic #7864

Closed

Conversation

StarostaGit
Copy link
Contributor

@StarostaGit StarostaGit commented Jan 4, 2021

Right now toppartitions can only be invoked on one column family at a time.
This change introduces a natural extension to this functionality,
allowing to specify a list of families.

We provide three ways for filtering in the query parameter "name_list":
1. A specific column family to include in the form "ks:cf"
2. A keyspace, telling the server to include all column families in it.
Specified by omitting the cf name, i.e. "ks:"
3. All column families, which is represented by an empty list
The list can include any amount of one or both of the 1. and 2. option.

Fixes #4520

@StarostaGit
Copy link
Contributor Author

StarostaGit commented Jan 4, 2021

This requires a change in the nodetool repo as well to wholly fix the #4520 issue (which is coming and I'll link it here)

@psarna
Copy link
Contributor

psarna commented Jan 7, 2021

From my quick test it looks like this change isn't backward compatible, i.e. it's no longer possible to pass a single column family name to this interface. Is it hard to maintain compatibility by allowing both a single name and a list? This change would become way more user friendly that way. From my quick local tests it looks like parameters for this API are not validated anyway, so it should be easy enough to keep accepting both name, which translates to a single column family, and name_list, which accepts a list.

@StarostaGit
Copy link
Contributor Author

The problem with backwards compatibility is that previously name was included in the path and that prevented from passing an empty list (which right now means 'all families'). If I just add the name as another query parameter, next to the name_list it's still not gonna be compatible as far as I understand, because the parameter is now named. And if I leave the name as part of the path then something will always have to be put in there, which is gonna be counterintuitive.
Is passing a single name in form of a list rather than a singular field that much worse?

I can rework this, but it's always gonna be awkward if we don't want the name to be a list - because something will always have to be put in as that name

@psarna
Copy link
Contributor

psarna commented Jan 7, 2021

So, can't you just leave the old API (/column_family/toppartitions/{name}) intact, and add the new way, which operates on /column_family/toppartitions and takes the name_list as an argument? That's even better, since all old scripts would work just fine, without any adjustments needed.

@StarostaGit
Copy link
Contributor Author

You're absolutely right, for some reason I forgot that they can coexist

@StarostaGit
Copy link
Contributor Author

Addressed the feedback and also added an information from which column family the listed partition is, as with many CFs we can have partitions with exact same names

@psarna
Copy link
Contributor

psarna commented Jan 7, 2021

Looks good to me, but @amnonh should probably take a look as the API owner.

@avikivity
Copy link
Member

Nothing as a shorthand for everything is bad practice. Nothing should stand for nothing, not everything.

@@ -663,8 +711,8 @@
"parameters":[
{
"name":"name",
"description":"The column family name in keyspace:name format",
"required":true,
"description":"The column family name in keyspace:name format. Omitting the name (i.e. 'keyspace:') will run the query on the whole keyspace",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct, the path parameter should be required, if the name is omitted, you would fall to your new implementation that does not have a query name parameter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out, forgot to change it back to required

apilog.info("toppartitions query: #names={} duration={} list_size={} capacity={}",
names.size(), duration.param, list_size.param, capacity.param);

return seastar::do_with(db::toppartitions_query(ctx.db, std::move(names), duration.value, list_size, capacity), [&ctx](auto& q) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we unified the implamentation for the specific and non specific CF?
if not anyting else, you can always call the general with the specific names, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not exactly sure what you mean, can you elaborate a bit more?
Are you referring to the distinction between toppartitions and toppartitions_generic? Or something else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it seems that there will be two implementation, one for a specific table and one for multiple tables

@amnonh
Copy link
Contributor

amnonh commented Jan 7, 2021

I would have loved to see #7797 maybe you can do add that?

@StarostaGit
Copy link
Contributor Author

StarostaGit commented Jan 11, 2021

Nothing as a shorthand for everything is bad practice. Nothing should stand for nothing, not everything.

It depends how you phrase it - "all unless specified" is not uncommon and is the same thing. I can change the description to reflect that more.

I can also replace that with something like an "ALL" keyword, but that seems to me like overcomplication. Do you have a simpler solution in mind @avikivity?

@StarostaGit
Copy link
Contributor Author

I would have loved to see #7797 maybe you can do add that?

Not in this issue, but I could take a look at that once I'm done with current toppartitions and nodetool stuff I'm working on.

@avikivity
Copy link
Member

Nothing as a shorthand for everything is bad practice. Nothing should stand for nothing, not everything.

It depends how you phrase it - "all unless specified" is not uncommon and is the same thing. I can change the description to reflect that more.

I can also replace that with something like an "ALL" keyword, but that seems to me like overcomplication. Do you have a simpler solution in mind @avikivity?

If it's an optional parameter, then if it's not included it can mean all. But an empty value shouldn't mean all.

However, I have some memory of the same pattern used in other APIs. If that's correct, then we can follow the existing pattern.

@amnonh
Copy link
Contributor

amnonh commented Jan 12, 2021

Nothing as a shorthand for everything is bad practice. Nothing should stand for nothing, not everything.

It depends how you phrase it - "all unless specified" is not uncommon and is the same thing. I can change the description to reflect that more.
I can also replace that with something like an "ALL" keyword, but that seems to me like overcomplication. Do you have a simpler solution in mind @avikivity?

If it's an optional parameter, then if it's not included it can mean all. But an empty value shouldn't mean all.

However, I have some memory of the same pattern used in other APIs. If that's correct, then we can follow the existing pattern.

the operation documentation is just confusing, it's a path parameter, so it's mandatory, if it's not there then it's a different url (i.e different end point).
It's common to have this kind of two endpoints:
/resource
/resource/{specific}
Where the first one is for everything and the second one is for a specific resource.

@avikivity
Copy link
Member

I mean that we already have some violations of "empty should mean nothing" in our REST API. Is this correct?

@amnonh
Copy link
Contributor

amnonh commented Jan 12, 2021

I mean that we already have some violations of "empty should mean nothing" in our REST API. Is this correct?

I'm not sure I follow, having to end points which one is more specific like:
/employee
/employee/{name}

Is a common RESTFull definition, but those are two different endpoints, so the documentation of the more specific one, should not say leave empty for everything, even if it happen to be the case, it's a mandatory parameter, if you leave it empty, it will not be the same endpoint.

@StarostaGit
Copy link
Contributor Author

Ok, I restructured the code a bit. Now, not providing the cf list means to query "all", while providing an empty one is exactly that (empty list, so no families are getting included).
Also, extracted the common code between the two endpoints to a shared lambda, which is what you wanted @amnonh if I understood correctly

@StarostaGit
Copy link
Contributor Author

As per my discussion with @amnonh, I moved the generic toppartitions endpoint to storage_service

@avikivity
Copy link
Member

@amnonh please review again

api/api-doc/column_family.json Outdated Show resolved Hide resolved
api/api-doc/storage_service.json Outdated Show resolved Hide resolved
api/api-doc/storage_service.json Outdated Show resolved Hide resolved
api/api-doc/storage_service.json Outdated Show resolved Hide resolved
@@ -59,11 +59,11 @@ void data_listeners::on_write(const schema_ptr& s, const frozen_mutation& m) {

toppartitions_item_key::operator sstring() const {
std::ostringstream oss;
oss << key.key().with_schema(*schema);
oss << "(" << schema->ks_name() << ":" << schema->cf_name() << ") " << key.key().with_schema(*schema);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it change current behaviour? it would now add ks/table for users who call nodetool toppartitions for a specific table

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the goal, as we can now have partitions from different tables that have the same names - it's good to have a way to differentiate between them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is true for someone that uses the new functionality, what about backwards compatibility for users who use the old functionality?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed it to include backwards compatibility

if (zis) {
zis->_top_k_read.append(toppartitions_item_key{s, dk});

for (const auto& [ks, cf] : _families.value()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we don't expect it to be too big, because you'll do this loop for any operation and you don't have that quick bale-out in the original code where you compare the ks/table and return imidiately

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be big, but I'll change it to a map as you pointed out. It's a good point.

@@ -155,15 +154,14 @@ public:

class toppartitions_query {
distributed<database>& _xdb;
sstring _ks;
sstring _cf;
std::optional<std::vector<std::tuple<sstring, sstring>>> _families;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wny making a vector optional?
BTW if it was unorderd_map you could have do a quick bale-out in the implementation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's optional to reflect a difference between an empty set of filters and no filters. The former will match against no tables, while the latter will match against all of them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and yeah, changing to a hashmap is a good point

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't get it, if you know there are none, why using it in the first place? or is it something that is always there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user can omit that parameter or they can input an empty list - they mean different things and have to be handled accordingly. Once we pass that information down to the toppartitions query object and all the filtering happens, we need to be able to tell which of the described situations happened, because they give different results.
Using std::optional seemed like the most logical way to do it, as we're literally describing an "optional" parameter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly the two option:

  1. the user passes an empty container, which means they want everything
  2. the user didn't pass anything, so they want nothing - so do nothing, at the topest level, just return imidiatly with an empty result

Even if you mean the logic is the other way around, I don't see why you need to implement the empty case, if a user request something that by definition returns nothing, return imidiatly with nohting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the optionals and check for empty requests before instantiating the query, as asked

Right now toppartitions can only be invoked on one column family at a time.
This change introduces a natural extension to this functionality,
allowing to specify a list of families.

We provide three ways for filtering in the query parameter "name_list":
    1. A specific column family to include in the form "ks:cf"
    2. A keyspace, telling the server to include all column families in it.
       Specified by omitting the cf name, i.e. "ks:"
    3. All column families, which is represented by an empty list
The list can include any amount of one or both of the 1. and 2. option.

Fixes scylladb#4520
@StarostaGit
Copy link
Contributor Author

Addressed all the above feedback

"paramType":"query"
},
{
"name":"keyspace_filters",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then it should be just keyspace (or ks)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but why? Many keyspaces can be provided, so a singular here seems weird

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you tested it? I don't think we support array of query parameters

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a nitpick, but from looking at the code this is a comma seperated string, the API currently does not enforce it, but it's better be correct anyhow, but I'll not fight over it

],
"parameters":[
{
"name":"table_filters",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then it should be table or cf, if you are adding keyspace then it's only table on not keyspace:table

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there can be multiples of both, you can specify a couple of tables, which are identified by "ks:cf" and a couple of whole keyspaces. For example, you can request two tables: "ks:t, ks:simple" and one keyspace "system". The result would be toppartitions for the two tables in ks keyspace and all tables in the system keyspace

}

// when the query is empty return immediately
if (filters_provided && table_filters.empty() && keyspace_filters.empty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can that be? someone either table_filters was true or keyspace_filters was true or filter_pfovided was false

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filters_provided means that there were some filters included in the request, ergo the request is not the default one running on all tables
Then, the only situation where we can safely return without running the query is if some filters were included (e.g. the optional query parameters were provided), but both of them were empty

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still not completely understand, they added at least one of the query parameters (keyspace or table), but they added it without a value?
I'm not even sure if we support it, did you check that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes
For simplicity, let's pretend there is no keyspace_filters parameter, only the table ones. Now, one of the two situations can happen:
1. The user doesn't use the table_filters option -> the option is not in the params
2. The user provides the 'table_filters' option -> the option is in the params and contains a list of table names

In the first case, the query runs on all tables and keyspaces as this is the default
In the second case, the query runs only on the tables provided in the 'table_filters' option. A list can be empty, of course, and that means the query runs on a empty set - therefore, it gives empty results.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the code, my question was about the query parameter, it does not hurt to add the check like you did, I wasn't sure you can pass an empty parameter, so it looked redundant.

db/data_listeners.hh Show resolved Hide resolved
@StarostaGit
Copy link
Contributor Author

@amnonh ping

1 similar comment
@StarostaGit
Copy link
Contributor Author

@amnonh ping

@haaawk
Copy link
Contributor

haaawk commented Mar 22, 2021

@slivne is there anyone else than @amnonh that can review this?

@slivne slivne requested a review from amnonh March 24, 2021 12:44
}

// when the query is empty return immediately
if (filters_provided && table_filters.empty() && keyspace_filters.empty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the code, my question was about the query parameter, it does not hurt to add the check like you did, I wasn't sure you can pass an empty parameter, so it looked redundant.

"paramType":"query"
},
{
"name":"keyspace_filters",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a nitpick, but from looking at the code this is a comma seperated string, the API currently does not enforce it, but it's better be correct anyhow, but I'll not fight over it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Request for enhancement nodetool toppartitions - need to be more general
5 participants