Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

count(col) returns incorrect values if col is NULL #14198

Closed
avikivity opened this issue Jun 9, 2023 · 3 comments
Closed

count(col) returns incorrect values if col is NULL #14198

avikivity opened this issue Jun 9, 2023 · 3 comments
Milestone

Comments

@avikivity
Copy link
Member

avikivity commented Jun 9, 2023

count(col) (unlike count(*) or count(1)) is supposed not to count rows where col is NULL. However, it does, if the column has a collection type:

cqlsh> create table ks1.tab (id int PRIMARY KEY , i int, l frozen<list<int>>);                                                                         
cqlsh> insert into ks1.tab (id, i) values (1, 1);                                                                                                  
cqlsh> insert into ks1.tab (id, l) values (2, [2]);
cqlsh> select * from ks1.tab;

 id | i    | l
----+------+------
  1 |    1 | null
  2 | null |  [2]

(2 rows)

cqlsh> select count(i) from ks1.tab;

 system.count(i)
-----------------
               1

(1 rows)

cqlsh> select count(l) from ks1.tab;

 count
-------
     2

(1 rows)

Checked with 5.2.1. 5.0.5 known to be good.

@avikivity
Copy link
Member Author

Likely due to

    } else if (name.has_keyspace()
                ? name == COUNT_NAME
                : name.name == COUNT_NAME.name) {
        auto arg_types = get_arguments(COUNT_NAME.name);
        if (arg_types.size() != 1) {
            throw std::runtime_error("count() function requires only 1 argument");
        }

        auto& arg = arg_types[0];
        return aggregate_fcts::make_
            return aggregate_fcts::make_count_rows_function();
        }

which converts COUNT_NAME to COUNT_ROWS_NAME.

@avikivity avikivity changed the title count(col count(col) returns incorrect values if col is NULL Jun 9, 2023
@avikivity
Copy link
Member Author

In 5.0 it works by accident. count(frozen<list<int>>) happens to match the signature of count(blob) because of the weakly_assignable thing.

avikivity added a commit to avikivity/scylladb that referenced this issue Jun 9, 2023
count(col), unlike count(*), does not count rows for which col is NULL.
However, if col's data type is not a scalar (e.g. a collection, tuple,
or user-defined type) it behaves like count(*), counting NULLs too.

The cause is that get_dynamic_aggregate() converts count() to
the count(*) version. It works for scalars because get_dynamic_aggregate()
intentionally fails to match scalar arguments, and functions::get() then
matches the arguments against the pre-declared count functions.

As we can only pre-declare count(scalar) (there's an infinite number
of non-scalar types), we change the approach to be the same as min/max:
we make count() a generic function. In fact count(col) is much better
as a generic function, as it only examines its input to see if it is
NULL.

A unit test is added.

Fixes scylladb#14198.
avikivity added a commit to avikivity/scylladb that referenced this issue Jun 10, 2023
count(col), unlike count(*), does not count rows for which col is NULL.
However, if col's data type is not a scalar (e.g. a collection, tuple,
or user-defined type) it behaves like count(*), counting NULLs too.

The cause is that get_dynamic_aggregate() converts count() to
the count(*) version. It works for scalars because get_dynamic_aggregate()
intentionally fails to match scalar arguments, and functions::get() then
matches the arguments against the pre-declared count functions.

As we can only pre-declare count(scalar) (there's an infinite number
of non-scalar types), we change the approach to be the same as min/max:
we make count() a generic function. In fact count(col) is much better
as a generic function, as it only examines its input to see if it is
NULL.

A unit test is added.

Fixes scylladb#14198.
avikivity added a commit to avikivity/scylladb that referenced this issue Jun 11, 2023
count(col), unlike count(*), does not count rows for which col is NULL.
However, if col's data type is not a scalar (e.g. a collection, tuple,
or user-defined type) it behaves like count(*), counting NULLs too.

The cause is that get_dynamic_aggregate() converts count() to
the count(*) version. It works for scalars because get_dynamic_aggregate()
intentionally fails to match scalar arguments, and functions::get() then
matches the arguments against the pre-declared count functions.

As we can only pre-declare count(scalar) (there's an infinite number
of non-scalar types), we change the approach to be the same as min/max:
we make count() a generic function. In fact count(col) is much better
as a generic function, as it only examines its input to see if it is
NULL.

A unit test is added. It passes with Cassandra as well.

Fixes scylladb#14198.
avikivity added a commit to avikivity/scylladb that referenced this issue Jun 12, 2023
count(col), unlike count(*), does not count rows for which col is NULL.
However, if col's data type is not a scalar (e.g. a collection, tuple,
or user-defined type) it behaves like count(*), counting NULLs too.

The cause is that get_dynamic_aggregate() converts count() to
the count(*) version. It works for scalars because get_dynamic_aggregate()
intentionally fails to match scalar arguments, and functions::get() then
matches the arguments against the pre-declared count functions.

As we can only pre-declare count(scalar) (there's an infinite number
of non-scalar types), we change the approach to be the same as min/max:
we make count() a generic function. In fact count(col) is much better
as a generic function, as it only examines its input to see if it is
NULL.

A unit test is added. It passes with Cassandra as well.

Fixes scylladb#14198.
@DoronArazii DoronArazii added this to the 5.4 milestone Jun 20, 2023
@avikivity
Copy link
Member Author

5.1 is also broken.

This is difficult to backport due to all the refactoring, but perhaps not difficult to rework.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants