Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Window functions #1469

Closed
calgs opened this issue Nov 9, 2017 · 18 comments
Closed

Window functions #1469

calgs opened this issue Nov 9, 2017 · 18 comments

Comments

@calgs
Copy link

calgs commented Nov 9, 2017

We have been very encouraged by Clickhouse. However, as we are trying to port all of our existing scripts to Clickhouse, we are running into few roadblocks. For example: CUMULATIVE SUM or RUNNING TOTAL. We are trying to find an equivalent of Window Functions e.g. SUM(SALES) OVER (PARTITION BY PRODUCT ORDER BY SALES)

Is there a way to get Cumulative Sum or Running Total. Any inputs or guidance is much appreciated. Thanks!

@filimonov
Copy link
Contributor

filimonov commented Nov 14, 2017

Window functions are not supported currently, but there are chances that they will appear in the future, in some cases you can find workarounds. Possible directions: parametric aggregate functions, higher-order functions, subselects, and LIMIT BY statements.

For running total, there is a related function
https://clickhouse.yandex/docs/en/functions/other_functions.html#runningdifference-x

Latest release (v1.1.54310-stable) add support for a function runningIncome (not documented yet)
which should work just as runningDifference , but should calculate sum instead of difference. Check that commit: #1427

@kszucs
Copy link
Contributor

kszucs commented Jan 11, 2018

With the following versions:

ClickHouse client version 1.1.54310
Connected to ClickHouse server version 1.1.54310.

It seems like both runningIncome and runningDifference are implementing difference operation.
The implementation implicates the same.

:) select number, runningDifference(number), runningIncome(number) from system.numbers limit 10;

SELECT
    number,
    runningDifference(number),
    runningIncome(number)
FROM system.numbers
LIMIT 10

┌─number─┬─runningDifference(number)─┬─runningIncome(number)─┐
│      000 │
│      111 │
│      211 │
│      311 │
│      411 │
│      511 │
│      611 │
│      711 │
│      811 │
│      911 │
└────────┴───────────────────────────┴───────────────────────┘

10 rows in set. Elapsed: 0.009 sec.

cc @filimonov

@ztlpn
Copy link
Contributor

ztlpn commented Jan 11, 2018

@kszucs You are right. runningIncome() is a pretty misleading name for this function, so in the next release it will be renamed to runningDifferenceStartingWithFirstValue(). For the first row it outputs the input value instead of 0:

:) select number + 123 as x, runningDifference(x), runningDifferenceStartingWithFirstValue(x) from system.numbers limit 5;

SELECT
    number + 123 AS x,
    runningDifference(x),
    runningDifferenceStartingWithFirstValue(x)
FROM system.numbers
LIMIT 5

┌───x─┬─runningDifference(plus(number, 123))─┬─runningDifferenceStartingWithFirstValue(plus(number, 123))─┐
│ 123 │                                    0 │                                                        123 │
│ 124 │                                    1 │                                                          1 │
│ 125 │                                    1 │                                                          1 │
│ 126 │                                    1 │                                                          1 │
│ 127 │                                    1 │                                                          1 │
└─────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────┘

5 rows in set. Elapsed: 0.001 sec.

You can use the runningAccumulate() function that cumulatively merges aggregate function states to calculate running totals. Here is an example (calculating cumulative count of rows by date in a table):

:) SELECT date FROM t

SELECT date
FROM t

┌───────date─┐
│ 2018-01-08 │
│ 2018-01-08 │
│ 2018-01-09 │
│ 2018-01-09 │
│ 2018-01-09 │
│ 2018-01-10 │
└────────────┘

6 rows in set. Elapsed: 0.001 sec.

:) SELECT date, finalizeAggregation(state), runningAccumulate(state) FROM (SELECT date, countState() AS state FROM t GROUP BY date ORDER BY date)

SELECT
    date,
    finalizeAggregation(state),
    runningAccumulate(state)
FROM
(
    SELECT
        date,
        countState() AS state
    FROM t
    GROUP BY date
    ORDER BY date ASC
)

┌───────date─┬─finalizeAggregation(state)─┬─runningAccumulate(state)─┐
│ 2018-01-08 │                          2 │                        2 │
│ 2018-01-09 │                          3 │                        5 │
│ 2018-01-10 │                          1 │                        6 │
└────────────┴────────────────────────────┴──────────────────────────┘

3 rows in set. Elapsed: 0.002 sec.

@kszucs
Copy link
Contributor

kszucs commented Jan 11, 2018

Hey @ztlpn! Thanks for the clarification!

Currently I'm implementing ClickHouse backend for https://github.com/ibis-project/ibis which is a match made in heaven except for the missing window functionality in CH which I'd really like to use.
Any chance for a preliminary implementation for that in the near future?

Meanwhile could You suggest an alternative way (without a join) to compute e.g. zscore?

With window support It would like something like this:

select column - avg(column) over (partition by key) from t

AFAIK currently this requires a join:

select column - avg_column
from t left join (
  select key, avg(column) as avg_column
  from t
  group by key
) _
using (key)

@ztlpn
Copy link
Contributor

ztlpn commented Jan 11, 2018

Cool! Of course window functions support is a must-have feature but it is a pretty big task and currently there is no definite timeline. You can get pretty far with arrays and ARRAY JOINs though. Your query becomes

select arrayJoin(values) - avg_value from (
  select avg(value) as avg_value, groupArray(value) as values
  from t group by key)

@inkrement
Copy link

@ztlpn is it somehow possible to use runningDifference in combination with a group by? For example to calculate some lag-value for each unit.

@tuedelue
Copy link

I am also looking for a solution to this very problem: I would like to get running accumulates grouped by a certain key.
Any ideas?

@zhangguiyu
Copy link

zhangguiyu commented Apr 19, 2018

My experience is xxxState (sumState, countState) will only work across 1-dimension, if you group using more than 1 columns, then the runningAccumulate will get messed up (not in sequence).

@tuedelue
Copy link

tuedelue commented May 9, 2018

Just as a followup: I have seen in recent changelogs, that there is the arrayCumSum method, which can be used for getting at least the accumulate values. (Edit: You of course may use any other array operation, the following describes sort of a grouping workaround)

I am not sure whether this performs well enough on large scale, but as for testing, the following query seems to work:

select
	user_key,
	theday,
	theevents,
	thecumsum
from (
	select
		user_key,
		days,
		events,
		arrayCumSum(events) as cumsum -- and calculate the cumulative values
	from (
		select -- put this into arrays
			user_key,
			groupArray(day) as days,
			groupArray(events) as events
		from (
			select -- selecting counts on a daily basis for each user
				user_key,
				toDate(date_key) as day,
				count(*) as events
			from
				fact_interaction
			group by
				user_key,
				day
			order by
				day
		)
		group by
			user_key
	)
)
array Join -- this is the 'tricky' part transposing the arrays back into rows.
	cumsum AS thecumsum,
	events AS theevents,
	days as theday

@zhangguiyu
Copy link

Wow! thanks for the heads-up! I will try this out.

@cameronbraid
Copy link

My experience is xxxState (sumState, countState) will only work across 1-dimension, if you group using more than 1 columns, then the runningAccumulate will get messed up (not in sequence).

I have noticed this too.

Is there any way to make runningAccumulate work with more than 1 group by column ?

@pramitchoudhary
Copy link

Is there a plan on when the window function will be part of a stable release tentatively?

@filimonov filimonov added st-hold We've paused the work on issue for some reason sql-compatibility labels Aug 28, 2019
@shcheglovnd
Copy link
Contributor

Another way to make cumulative sum:

SELECT
    number_1,
    arraySum(arraySlice(numbers, 1, i)) AS cum_sum
FROM
(
    SELECT groupArray(number) AS numbers
    FROM
    (
        SELECT number
        FROM system.numbers
        LIMIT 10
    )
)
ARRAY JOIN
    numbers AS number_1,
    arrayEnumerate(numbers) AS i

┌─number_1─┬─cum_sum─┐
│        0 │       0 │
│        1 │       1 │
│        2 │       3 │
│        3 │       6 │
│        4 │      10 │
│        5 │      15 │
│        6 │      21 │
│        7 │      28 │
│        8 │      36 │
│        9 │      45 │
└──────────┴─────────┘

@andrejserafim
Copy link

This is a pretty long issue. Let me post my findings and how we made this work.

For us the problem turned out to be the PARTITION BY toYYYYMM(time) which we of course found out only after being in production for 1 month :) We'd get breaks in runningDifference on the month boundary.

After many experiments it turns out that sorting by non-partition key first pushes CH into providing us with a single set of data. Providing a rather complex example with a group by on a key:

select bucket, sum(aggValue) as aggValue from (
    select
        key,
        toDateTime(
            toStartOfInterval(
                time,
                INTERVAL 9999999999 SECOND
            )
        ) as bucket,
        sumIf(runningDifference(value) as diff, diff > 0) as aggValue
    from (select key, time, value from table
    where key in (....)
    order by key, time) t
    group by key, bucket
)
group by bucket
order by bucket

If someone can spot any limitation here, we'd be glad to hear your feedback. And I hope this saves someone else a day or two.

@l1t1
Copy link

l1t1 commented Oct 18, 2020

those functions are hard to understand, i prefer the standard window functions (#5132).
as i know, both actian vector and sqlite realized them.

@SkyGra
Copy link

SkyGra commented Oct 20, 2020

when the window function will be supported in Clickhouse?
I hope to see the function early!

1 similar comment
@kiwimg
Copy link

kiwimg commented Nov 11, 2020

when the window function will be supported in Clickhouse?
I hope to see the function early!

@akuzm akuzm added comp-window-functions and removed st-hold We've paused the work on issue for some reason labels Feb 4, 2021
@alexey-milovidov
Copy link
Member

It's supported under experimental flag in the recent releases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests