add aggregate functions IntersectionsMax and IntersectionsMaxPos #2012

furmur · 2018-03-08T13:33:00Z

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

new aggregate functions:

IntersectionsMax(start_column ,end_column)
IntersectionsMaxPos(start_column ,end_column)

returns maximum count of the intersected intervals defined by start_column and end_column values

if start_column value is NULL or 0 then interval will be skipped
if end_column value is NULL or 0 then interval is considered to have no end (interval (3,0) in example below)

IntersectionsMaxPos in addition returns position where the maximum of intersected intervals found

typical application for this functions (and why they were implemented) is to count maximum active calls for the certain time frame by stored calls detailed records with timestamps for the calls start/end

example:

:) CREATE DATABASE test;
:) CREATE TABLE test(start Integer, end Integer) engine = Memory;
:) INSERT INTO test(start,end) VALUES (1,3),(2,7),(3,0),(4,7),(5,8);

intervals in the table after the insert:

1 2 3 4 5 6 7 8 9
------------------>
1---3
  2---------7
    3-------------
      4-----7
        5-----8
------------------>
1 2 3 3 4 4 4 2 1  //intersections count for each point

functions output:

:) SELECT IntersectionsMax(start,end) FROM test;

SELECT IntersectionsMax(start, end)
FROM test

┌─IntersectionsMax(start, end)─┐
│                            4 │
└──────────────────────────────┘

1 rows in set. Elapsed: 0.003 sec.

:) SELECT IntersectionsMaxPos(start,end) FROM test;

SELECT IntersectionsMaxPos(start, end)
FROM test 

┌─IntersectionsMaxPos(start, end)─┐
│                               4 │
│                               5 │
└─────────────────────────────────┘

2 rows in set. Elapsed: 0.003 sec.

alexey-milovidov · 2018-03-14T03:17:03Z

Few points:

IntersectionsMax

Currently we name our functions in camelCase, not CamelCase.
Also I think that maxIntersections sounds better than intersectionsMax.

if start_column value is NULL

NULL values are skipped before being passed to aggregate functions.

if start_column value is NULL or 0 then interval will be skipped
if end_column value is NULL or 0 then interval is considered to have no end (interval (3,0) in example below)

It looks very specific to reserve zero value as something special...
Also I don't like assymetry, but I understand the business logic.
Maybe we can totally remove the support for half-open intervals?

Motivation:

it's not possible to have calls/sessions with infinite duration;
you can pass some very big value as end of interval instead;

example:
:) CREATE DATABASE test;
:) CREATE TABLE test(start Integer, end Integer) engine = Memory;

BTW, this example is incorrect, because you don't use database test in second query.
CREATE TABLE test.test will be correct.

SELECT IntersectionsMaxPos(start, end)
FROM test
┌─IntersectionsMaxPos(start, end)─┐
│ 4 │
│ 5 │
└─────────────────────────────────┘

Don't understand how the aggregate function can return two rows.

alexey-milovidov · 2018-03-14T03:25:13Z

Missing documentation in code (.h file).
Missing test.

alexey-milovidov · 2018-03-14T03:26:44Z

    String getName() const override
    {
        return "IntersectionsMax";
    }

this is wrong according that you have two different functions.

alexey-milovidov · 2018-03-14T03:28:14Z

The function IntersectionsMaxPos should be named maxIntersectionsPosition according the fact that we have position function, not pos.

alexey-milovidov · 2018-03-14T03:29:26Z

The user could expect that the function will work for floats and signed numbers.

alexey-milovidov · 2018-03-14T03:30:56Z

    void insertResultInto(ConstAggregateDataPtr place, IColumn & to) const override
    {
        auto & ret = static_cast<ColumnUInt64 &>(to).getData();
        ret.push_back(data(place).max());
        if (return_position)
            ret.push_back(data(place).max_pos());
    }

That's absolutely incorrect: you cannot insert two values in a column as a result.

alexey-milovidov · 2018-03-14T03:42:12Z

Missing performance test.
Missing algorithm description inside the code.

…ed sorting #2012

alexey-milovidov · 2018-03-14T05:32:46Z

I've slightly changed the algorithm. Please take a look.

furmur · 2018-03-15T13:49:47Z

thank you for such a lot of fixes.

i see that algorithm was completely changed.
i used rb-tree to keep array sorted and to really aggregate data instead of adding points to the plain array.
your algorithm sorts accumulated array right in the insertResultInto and simply counts points. it's more straightforward but is also correct.

i'm not sure which approach is better, so i would wait till build in the master will be fixed and compare them.
if my approach would show better results then i'll try to explain it for others, otherwise let's leave your implementation.

alexey-milovidov · 2018-03-15T23:18:51Z

My algorithm will eat more memory if there are many intervals that start/end at the same second.
It's expected, for example, if we accumulating data for one day into single aggregation state, and we have more than about 100 000 intervals per day. That's quite natural.

We can simply improve it later by adding "compress" function that will sort an array and accumulate duplicate values (or do batch insertion into already sorted part of array); and we can call this function when array becomes large (and also before serialization of the state to save network traffic).

Also my algorithm will be worse if you have small amount of duplicated intervals. For example, millions of rows with just a few unique intervals.

Using plain array should be better for the following reasons:

if you sort array at once, it is more efficient than using tree (incremental sorting);
the memory for std::map is allocated separately (one allocation per node, about 24 bytes) and these allocations are not accounted by MemoryTracker (memory limits for query won't work);
BTW, we can switch to radix sort (see RadixSort) for big enough arrays.

furmur and others added 2 commits March 8, 2018 15:23

add aggregate functions IntersectionsMax and IntersectionsMaxPos

8cb5733

Update AggregateFunctionIntersectionsMax.cpp

cbab87f

alexey-milovidov added a commit that referenced this pull request Mar 14, 2018

clang-format #2012

b4f1ecb

alexey-milovidov added a commit that referenced this pull request Mar 14, 2018

Adaptation #2012

688d277

alexey-milovidov self-assigned this Mar 14, 2018

alexey-milovidov merged commit cbab87f into ClickHouse:master Mar 14, 2018

alexey-milovidov added a commit that referenced this pull request Mar 14, 2018

Fixed handling of NaNs in aggregate functions that use comparison bas…

f08940c

…ed sorting #2012

filimonov mentioned this pull request Dec 28, 2020

Add function runningConcurrency() #18508

Merged

den-crane mentioned this pull request Dec 26, 2022

IntersectionsMax / IntersectionsMaxPos are not documented #44619

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add aggregate functions IntersectionsMax and IntersectionsMaxPos #2012

add aggregate functions IntersectionsMax and IntersectionsMaxPos #2012

furmur commented Mar 8, 2018

alexey-milovidov commented Mar 14, 2018 •

edited

Loading

alexey-milovidov commented Mar 14, 2018

alexey-milovidov commented Mar 14, 2018 •

edited

Loading

alexey-milovidov commented Mar 14, 2018

alexey-milovidov commented Mar 14, 2018

alexey-milovidov commented Mar 14, 2018

alexey-milovidov commented Mar 14, 2018

alexey-milovidov commented Mar 14, 2018

furmur commented Mar 15, 2018

alexey-milovidov commented Mar 15, 2018 •

edited

Loading

add aggregate functions IntersectionsMax and IntersectionsMaxPos #2012

add aggregate functions IntersectionsMax and IntersectionsMaxPos #2012

Conversation

furmur commented Mar 8, 2018

alexey-milovidov commented Mar 14, 2018 • edited Loading

alexey-milovidov commented Mar 14, 2018

alexey-milovidov commented Mar 14, 2018 • edited Loading

alexey-milovidov commented Mar 14, 2018

alexey-milovidov commented Mar 14, 2018

alexey-milovidov commented Mar 14, 2018

alexey-milovidov commented Mar 14, 2018

alexey-milovidov commented Mar 14, 2018

furmur commented Mar 15, 2018

alexey-milovidov commented Mar 15, 2018 • edited Loading

alexey-milovidov commented Mar 14, 2018 •

edited

Loading

alexey-milovidov commented Mar 14, 2018 •

edited

Loading

alexey-milovidov commented Mar 15, 2018 •

edited

Loading