column_family: Make toppartitions queries more generic #7864

StarostaGit · 2021-01-04T14:01:01Z

Right now toppartitions can only be invoked on one column family at a time.
This change introduces a natural extension to this functionality,
allowing a list of families to be specified.

We provide three ways for filtering in the query parameter "name_list":
1. A specific column family to include in the form "ks:cf"
2. A keyspace, telling the server to include all column families in it.
Specified by omitting the cf name, i.e. "ks:"
3. All column families, which is represented by an empty list
The list can include any number of entries of forms 1 and 2, in any combination.
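
The entry forms above could be parsed roughly as follows. This is a minimal sketch, not the code from this PR: `family_filter`, `parse_filter` and `parse_name_list` are hypothetical names, and a comma-separated encoding of the list is an assumption.

```cpp
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical representation of one parsed "name_list" entry.
struct family_filter {
    std::string keyspace;
    std::optional<std::string> table; // std::nullopt means "every table in the keyspace"
};

// Parses a single entry of the form "ks:cf" or "ks:".
inline family_filter parse_filter(std::string_view entry) {
    const auto colon = entry.find(':');
    if (colon == std::string_view::npos || colon == 0) {
        throw std::invalid_argument("expected \"ks:cf\" or \"ks:\"");
    }
    family_filter f{std::string(entry.substr(0, colon)), std::nullopt};
    if (colon + 1 < entry.size()) {
        f.table = std::string(entry.substr(colon + 1));
    }
    return f;
}

// Assumption: entries are comma-separated. An empty input yields an empty list.
inline std::vector<family_filter> parse_name_list(std::string_view list) {
    std::vector<family_filter> out;
    while (!list.empty()) {
        const auto comma = list.find(',');
        out.push_back(parse_filter(list.substr(0, comma)));
        if (comma == std::string_view::npos) {
            break;
        }
        list.remove_prefix(comma + 1);
    }
    return out;
}
```

Whitespace around entries is not trimmed here; a real handler would likely need to decide how to treat it.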

Fixes #4520

StarostaGit · 2021-01-04T14:08:44Z

This requires a change in the nodetool repo as well to wholly fix the #4520 issue (which is coming and I'll link it here)

psarna · 2021-01-07T10:32:24Z

From my quick test it looks like this change isn't backward compatible, i.e. it's no longer possible to pass a single column family name to this interface. Is it hard to maintain compatibility by allowing both a single name and a list? This change would become way more user friendly that way. From my quick local tests it looks like parameters for this API are not validated anyway, so it should be easy enough to keep accepting both name, which translates to a single column family, and name_list, which accepts a list.

StarostaGit · 2021-01-07T11:20:38Z

The problem with backwards compatibility is that previously name was included in the path, and that prevented passing an empty list (which right now means 'all families'). If I just add the name as another query parameter, next to the name_list, it's still not gonna be compatible as far as I understand, because the parameter is now named. And if I leave the name as part of the path then something will always have to be put in there, which is gonna be counterintuitive.
Is passing a single name in form of a list rather than a singular field that much worse?

I can rework this, but it's always gonna be awkward if we don't want the name to be a list - because something will always have to be put in as that name

psarna · 2021-01-07T11:55:07Z

So, can't you just leave the old API (/column_family/toppartitions/{name}) intact, and add the new way, which operates on /column_family/toppartitions and takes the name_list as an argument? That's even better, since all old scripts would work just fine, without any adjustments needed.

StarostaGit · 2021-01-07T12:03:16Z

You're absolutely right, for some reason I forgot that they can coexist

StarostaGit · 2021-01-07T13:05:36Z

Addressed the feedback and also added information about which column family each listed partition comes from, since with many CFs we can have partitions with the exact same names

psarna · 2021-01-07T13:07:09Z

Looks good to me, but @amnonh should probably take a look as the API owner.

avikivity · 2021-01-07T13:12:38Z

Nothing as a shorthand for everything is bad practice. Nothing should stand for nothing, not everything.

amnonh · 2021-01-07T13:18:03Z

api/api-doc/column_family.json

@@ -663,8 +711,8 @@
               "parameters":[
                  {
                     "name":"name",
-                     "description":"The column family name in keyspace:name format",
-                     "required":true,
+                     "description":"The column family name in keyspace:name format. Omitting the name (i.e. 'keyspace:') will run the query on the whole keyspace",


This is not correct, the path parameter should be required, if the name is omitted, you would fall to your new implementation that does not have a query name parameter

Thanks for pointing that out, forgot to change it back to required

amnonh · 2021-01-07T13:23:18Z

api/column_family.cc

+        apilog.info("toppartitions query: #names={} duration={} list_size={} capacity={}",
+            names.size(), duration.param, list_size.param, capacity.param);
+
+        return seastar::do_with(db::toppartitions_query(ctx.db, std::move(names), duration.value, list_size, capacity), [&ctx](auto& q) {


Can't we unify the implementation for the specific and non-specific CF?
If nothing else, you can always call the general one with the specific names, right?

not exactly sure what you mean, can you elaborate a bit more?
Are you referring to the distinction between toppartitions and toppartitions_generic? Or something else?

yes, it seems that there will be two implementations, one for a specific table and one for multiple tables

amnonh · 2021-01-07T13:24:43Z

I would have loved to see #7797; maybe you can add that?

StarostaGit · 2021-01-11T10:45:57Z

> Nothing as a shorthand for everything is bad practice. Nothing should stand for nothing, not everything.

It depends how you phrase it - "all unless specified" is not uncommon and is the same thing. I can change the description to reflect that more.

I can also replace that with something like an "ALL" keyword, but that seems to me like overcomplication. Do you have a simpler solution in mind @avikivity?

StarostaGit · 2021-01-11T10:53:47Z

> I would have loved to see #7797; maybe you can add that?

Not in this issue, but I could take a look at that once I'm done with current toppartitions and nodetool stuff I'm working on.

avikivity · 2021-01-12T11:22:23Z

> Nothing as a shorthand for everything is bad practice. Nothing should stand for nothing, not everything.
>
> It depends how you phrase it - "all unless specified" is not uncommon and is the same thing. I can change the description to reflect that more.
>
> I can also replace that with something like an "ALL" keyword, but that seems to me like overcomplication. Do you have a simpler solution in mind @avikivity?

If it's an optional parameter, then if it's not included it can mean all. But an empty value shouldn't mean all.

However, I have some memory of the same pattern used in other APIs. If that's correct, then we can follow the existing pattern.

amnonh · 2021-01-12T11:35:19Z

> Nothing as a shorthand for everything is bad practice. Nothing should stand for nothing, not everything.
>
> It depends how you phrase it - "all unless specified" is not uncommon and is the same thing. I can change the description to reflect that more.
> I can also replace that with something like an "ALL" keyword, but that seems to me like overcomplication. Do you have a simpler solution in mind @avikivity?
>
> If it's an optional parameter, then if it's not included it can mean all. But an empty value shouldn't mean all.
>
> However, I have some memory of the same pattern used in other APIs. If that's correct, then we can follow the existing pattern.

The operation documentation is just confusing: it's a path parameter, so it's mandatory. If it's not there, then it's a different URL (i.e. a different endpoint).
It's common to have this pair of endpoints:
/resource
/resource/{specific}
Where the first one is for everything and the second one is for a specific resource.

avikivity · 2021-01-12T13:29:01Z

I mean that we already have some violations of "empty should mean nothing" in our REST API. Is this correct?

amnonh · 2021-01-12T14:31:55Z

> I mean that we already have some violations of "empty should mean nothing" in our REST API. Is this correct?

I'm not sure I follow. Having two endpoints, one of which is more specific, like:
/employee
/employee/{name}

is a common RESTful definition, but those are two different endpoints, so the documentation of the more specific one should not say "leave empty for everything". Even if that happens to be the case, it's a mandatory parameter; if you leave it empty, it will not be the same endpoint.
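
To illustrate the point about the two endpoints being distinct, here is a small, self-contained sketch of how a router might tell them apart. It is not how the actual HTTP server in this project dispatches requests; `route_match` and `match_toppartitions` are hypothetical names, and the base path is the one mentioned earlier in this thread.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical result of matching a request path.
struct route_match {
    bool generic;                    // true for the collection endpoint
    std::optional<std::string> name; // set only for the specific endpoint
};

// Distinguishes ".../toppartitions" from ".../toppartitions/{name}".
// An empty {name} does not fall back to the generic endpoint; it matches nothing.
inline std::optional<route_match> match_toppartitions(std::string_view path) {
    constexpr std::string_view base = "/column_family/toppartitions";
    if (path == base) {
        return route_match{true, std::nullopt};
    }
    if (path.size() > base.size() + 1 &&
        path.substr(0, base.size()) == base &&
        path[base.size()] == '/') {
        return route_match{false, std::string(path.substr(base.size() + 1))};
    }
    return std::nullopt;
}
```

With this shape, the documentation of the specific endpoint never needs to describe an "empty means all" case, because an empty name is simply not routed to it.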

StarostaGit · 2021-01-13T17:56:31Z

Ok, I restructured the code a bit. Now, not providing the cf list means to query "all", while providing an empty one is exactly that (empty list, so no families are getting included).
Also, extracted the common code between the two endpoints to a shared lambda, which is what you wanted @amnonh if I understood correctly

StarostaGit · 2021-01-20T14:55:37Z

As per my discussion with @amnonh, I moved the generic toppartitions endpoint to storage_service

avikivity · 2021-01-21T08:30:06Z

@amnonh please review again

amnonh · 2021-01-21T08:46:07Z

db/data_listeners.cc

@@ -59,11 +59,11 @@ void data_listeners::on_write(const schema_ptr& s, const frozen_mutation& m) {

 toppartitions_item_key::operator sstring() const {
    std::ostringstream oss;
-    oss << key.key().with_schema(*schema);
+    oss << "(" << schema->ks_name() << ":" << schema->cf_name() << ") " << key.key().with_schema(*schema);


wouldn't it change current behaviour? it would now add ks/table for users who call nodetool toppartitions for a specific table

That was the goal, as we can now have partitions from different tables that have the same names - it's good to have a way to differentiate between them.

this is true for someone that uses the new functionality, what about backwards compatibility for users who use the old functionality?

changed it to include backwards compatibility
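
The thread does not say how backwards compatibility was kept. One possible shape, shown purely as an assumption, is to add the "(ks:cf) " prefix only when the caller asks for it; `format_item_key` and its `include_table_name` flag are hypothetical.

```cpp
#include <sstream>
#include <string>

// Formats a partition key for the results list.
// Assumption: a boolean flag selects between the old output (key only) and the
// new output (key prefixed with its keyspace and table).
inline std::string format_item_key(const std::string& ks, const std::string& cf,
                                   const std::string& partition_key,
                                   bool include_table_name) {
    std::ostringstream oss;
    if (include_table_name) {
        oss << "(" << ks << ":" << cf << ") ";
    }
    oss << partition_key;
    return oss.str();
}
```

Under this sketch the single-table endpoint would pass `false` and keep its existing output, while the generic endpoint would pass `true`.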

amnonh · 2021-01-21T08:54:47Z

db/data_listeners.cc

-        if (zis) {
-            zis->_top_k_read.append(toppartitions_item_key{s, dk});
+
+    for (const auto& [ks, cf] : _families.value()) {


I guess we don't expect it to be too big, because you'll do this loop for any operation and you don't have that quick bail-out in the original code where you compare the ks/table and return immediately

It shouldn't be big, but I'll change it to a map as you pointed out. It's a good point.
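
A rough sketch of the quick bail-out being discussed, assuming the tracked tables are kept in a hash set keyed by (keyspace, table). The names `table_id`, `table_id_hash`, `table_set` and `is_tracked` are hypothetical and not taken from the patch.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <utility>

using table_id = std::pair<std::string, std::string>; // (keyspace, table)

// std::pair has no std::hash specialization, so combine the two string hashes.
struct table_id_hash {
    std::size_t operator()(const table_id& t) const noexcept {
        const std::size_t h1 = std::hash<std::string>{}(t.first);
        const std::size_t h2 = std::hash<std::string>{}(t.second);
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }
};

using table_set = std::unordered_set<table_id, table_id_hash>;

// Average constant-time membership test, replacing a linear scan over a vector.
inline bool is_tracked(const table_set& families, const std::string& ks,
                       const std::string& cf) {
    return families.count(table_id{ks, cf}) != 0;
}
```

A listener callback could call `is_tracked` first and return immediately when it is false, which restores the early exit the single-table code had.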

amnonh · 2021-01-21T08:57:30Z

db/data_listeners.hh

@@ -155,15 +154,14 @@ public:

 class toppartitions_query {
    distributed<database>& _xdb;
-    sstring _ks;
-    sstring _cf;
+    std::optional<std::vector<std::tuple<sstring, sstring>>> _families;


Why make a vector optional?
BTW, if it were an unordered_map you could do a quick bail-out in the implementation

It's optional to reflect a difference between an empty set of filters and no filters. The former will match against no tables, while the latter will match against all of them

and yeah, changing to a hashmap is a good point

I still don't get it, if you know there are none, why using it in the first place? or is it something that is always there?

The user can omit that parameter or they can input an empty list - they mean different things and have to be handled accordingly. Once we pass that information down to the toppartitions query object and all the filtering happens, we need to be able to tell which of the described situations happened, because they give different results.
Using std::optional seemed like the most logical way to do it, as we're literally describing an "optional" parameter
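
The distinction described here can be written down as a small sketch. `query_scope` and `scope_of` are hypothetical names used only for illustration, and the element type is simplified to a plain string.

```cpp
#include <optional>
#include <string>
#include <vector>

enum class query_scope { all_tables, nothing, listed_tables };

// An absent parameter selects everything; a present but empty list selects nothing.
inline query_scope scope_of(const std::optional<std::vector<std::string>>& cf_list) {
    if (!cf_list) {
        return query_scope::all_tables;
    }
    return cf_list->empty() ? query_scope::nothing : query_scope::listed_tables;
}
```

The `nothing` case is the one that can be answered at the API layer without ever constructing a query object.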

If I understand correctly, there are two options:

the user passes an empty container, which means they want everything

the user didn't pass anything, so they want nothing - so do nothing; at the topmost level, just return immediately with an empty result

Even if you mean the logic is the other way around, I don't see why you need to implement the empty case: if a user requests something that by definition returns nothing, return immediately with nothing.

I removed the optionals and check for empty requests before instantiating the query, as asked


StarostaGit · 2021-01-21T14:46:21Z

Addressed all the above feedback

amnonh · 2021-01-21T15:28:38Z

api/api-doc/storage_service.json

+                     "paramType":"query"
+                  },
+                  {
+                     "name":"keyspace_filters",


then it should be just keyspace (or ks)

but why? Many keyspaces can be provided, so a singular here seems weird

Did you test it? I don't think we support arrays of query parameters

It's a nitpick, but from looking at the code this is a comma-separated string. The API currently does not enforce it, but it's better to be correct anyhow. I won't fight over it, though.

amnonh · 2021-01-21T15:30:10Z

api/api-doc/storage_service.json

+               ],
+               "parameters":[
+                  {
+                     "name":"table_filters",


Then it should be table or cf; if you are adding keyspace, then it's only table, not keyspace:table

there can be multiples of both, you can specify a couple of tables, which are identified by "ks:cf" and a couple of whole keyspaces. For example, you can request two tables: "ks:t, ks:simple" and one keyspace "system". The result would be toppartitions for the two tables in ks keyspace and all tables in the system keyspace
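
The example above can be expressed as a small matching sketch. The `filters` struct and `matches` function are hypothetical; only the two parameter names and the example values come from this thread.

```cpp
#include <set>
#include <string>
#include <utility>

// Hypothetical parsed form of the two query parameters.
struct filters {
    std::set<std::pair<std::string, std::string>> tables; // from table_filters, given as "ks:cf"
    std::set<std::string> keyspaces;                       // from keyspace_filters
};

// A table is included if its whole keyspace was requested or it was listed explicitly.
inline bool matches(const filters& f, const std::string& ks, const std::string& cf) {
    return f.keyspaces.count(ks) != 0 ||
           f.tables.count(std::make_pair(ks, cf)) != 0;
}
```

With tables `ks:t` and `ks:simple` plus keyspace `system`, this accepts every table in `system` but only the two named tables in `ks`.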

amnonh · 2021-01-21T15:34:59Z

api/storage_service.cc

+        }
+
+        // when the query is empty return immediately
+        if (filters_provided && table_filters.empty() && keyspace_filters.empty()) {


How can that be? Either table_filters was true, or keyspace_filters was true, or filters_provided was false

filters_provided means that there were some filters included in the request, ergo the request is not the default one running on all tables
Then, the only situation where we can safely return without running the query is if some filters were included (e.g. the optional query parameters were provided), but both of them were empty

I still don't completely understand: they added at least one of the query parameters (keyspace or table), but they added it without a value?
I'm not even sure if we support it, did you check that?

yes
For simplicity, let's pretend there is no keyspace_filters parameter, only the table ones. Now, one of the two situations can happen:
1. The user doesn't use the table_filters option -> the option is not in the params
2. The user provides the 'table_filters' option -> the option is in the params and contains a list of table names

In the first case, the query runs on all tables and keyspaces as this is the default
In the second case, the query runs only on the tables provided in the 'table_filters' option. A list can be empty, of course, and that means the query runs on an empty set - therefore, it gives empty results.
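
The early-return condition can be sketched in isolation like this. The `params` map is a hypothetical stand-in for the request's query parameters, `split_list` and `is_empty_query` are made-up helpers, and the comma-separated encoding is an assumption based on the review comments.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the query parameters of a request.
using params = std::map<std::string, std::string>;

// Splits a comma-separated value; an empty value yields an empty list.
inline std::vector<std::string> split_list(const std::string& value) {
    std::vector<std::string> out;
    std::string::size_type start = 0;
    while (start < value.size()) {
        auto comma = value.find(',', start);
        if (comma == std::string::npos) {
            comma = value.size();
        }
        if (comma > start) {
            out.push_back(value.substr(start, comma - start));
        }
        start = comma + 1;
    }
    return out;
}

// True when at least one filter parameter was supplied but every supplied list
// is empty, so an empty result can be returned without running the query.
inline bool is_empty_query(const params& p) {
    const auto t = p.find("table_filters");
    const auto k = p.find("keyspace_filters");
    const bool filters_provided = t != p.end() || k != p.end();
    const bool tables_empty = t == p.end() || split_list(t->second).empty();
    const bool keyspaces_empty = k == p.end() || split_list(k->second).empty();
    return filters_provided && tables_empty && keyspaces_empty;
}
```

A request with no filter parameters at all is not an empty query under this sketch; it is the default that runs on all tables.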

I understand the code, my question was about the query parameter, it does not hurt to add the check like you did, I wasn't sure you can pass an empty parameter, so it looked redundant.

StarostaGit · 2021-03-10T10:59:07Z

@amnonh ping

StarostaGit · 2021-03-22T10:13:24Z

@amnonh ping

haaawk · 2021-03-22T13:21:09Z

@slivne is there anyone else than @amnonh that can review this?

amnonh · 2021-03-24T14:39:08Z

api/storage_service.cc

+        }
+
+        // when the query is empty return immediately
+        if (filters_provided && table_filters.empty() && keyspace_filters.empty()) {


I understand the code, my question was about the query parameter, it does not hurt to add the check like you did, I wasn't sure you can pass an empty parameter, so it looked redundant.

amnonh · 2021-03-24T14:41:46Z

api/api-doc/storage_service.json

+                     "paramType":"query"
+                  },
+                  {
+                     "name":"keyspace_filters",


It's a nitpick, but from looking at the code this is a comma-separated string. The API currently does not enforce it, but it's better to be correct anyhow. I won't fight over it, though.

StarostaGit force-pushed the generic_cardinality_api branch from 1df3c52 to 9df0511 Compare January 7, 2021 13:04

amnonh reviewed Jan 7, 2021

StarostaGit force-pushed the generic_cardinality_api branch from 9df0511 to 5dacf02 Compare January 11, 2021 12:05

StarostaGit force-pushed the generic_cardinality_api branch from 5dacf02 to 301db48 Compare January 13, 2021 17:51

StarostaGit force-pushed the generic_cardinality_api branch from 301db48 to 4949e5c Compare January 20, 2021 14:53

StarostaGit force-pushed the generic_cardinality_api branch from 4949e5c to 01c5873 Compare January 20, 2021 15:18

amnonh suggested changes Jan 21, 2021

StarostaGit force-pushed the generic_cardinality_api branch from 01c5873 to ae63d26 Compare January 21, 2021 14:44

amnonh reviewed Jan 21, 2021

amnonh suggested changes Jan 21, 2021

StarostaGit mentioned this pull request Feb 15, 2021

storage_service: Add a generic toppartitions endpoint scylladb/scylla-jmx#157

Merged

slivne requested a review from amnonh March 24, 2021 12:44

amnonh approved these changes Mar 24, 2021

scylladb-promoter closed this in c1daf2b Mar 24, 2021
