Draft: ✨ add a first working draft of auto_aggregate selection #33

Xasin · 2023-06-03T15:37:02Z

This PR will add one new file which will create new functions to allow for the execution of an automatic downsampling of data for TimescaleDB hypertables and continuous aggregates.

Functionality of the downsampling method is documented in the utils/aggregate_selector.sql file.
There are no dependencies with other systems, and no internal state of TimescaleDB is modified, as such there should be little conflict with pre-existing systems.

Please confirm that the naming of the functions is appropriate for their use case, as I was not sure about existing naming conventions.
Please also detail whether or not functions need to be documented in the docs/ folder, or if in-source documentation as provided is sufficient.

…egate selection

jonatas · 2023-06-07T15:29:19Z

Hi @Xasin, very good addition! So happy to see it!

Would you mind adding some example of this in action?

Probably improving the README with some instructions would be good for test, review and future users.

Xasin · 2023-06-07T20:49:02Z

On it!
I'll copy the documentation and add an example section for this function in the docs/ folder, and add a link to the main README

Xasin · 2023-06-07T22:11:53Z

Preliminary docs entry and a link from the README to the docs file added.
I'd appreciate some feedback on the description I am using for the function. I think it's a bit hard for me to understand if its purpose and functionality comes across properly, partly because I know the internals of the function quite well.

utils/aggregate_selector.sql

d-sooter · 2023-06-16T06:54:07Z

utils/aggregate_selector.sql

+	aggregate_choices JSONB, 
+	groupby_clause TEXT, filter_query TEXT,
+	hypertable_schema TEXT DEFAULT 'public',
+	time_column TEXT DEFAULT 'time')


On thought, would it make sense to add maybe a debug flag and do a raise notice on the actual query that is run?
That way the user could just grab it and run explain analyze and see what is going on. The function ends up hiding that.

Yes, I've noticed that it's rather tricky to figure out what is happening inside a function...

I suppose I could reformat the query a bit to first generate a string and then optionally raise that string as debug flag if a value is set. It's not too much of a change and could definitely really help...

Thanks!

If you have a moment, try the new pushed change.
I hope it works right

Signed-off-by: Xasin <davidbailey.2889@gmail.com>

jonatas · 2023-06-19T18:38:32Z

utils/auto_downsample.sql

+		RAISE NOTICE 'Generated query output:'
+		RAISE NOTICE query_construct


It seems there's a bug trying to load this lines:

jonatasdp@MacBook-Pro-3 ~/c/t/t/utils (Xasin/master)> psql playground -f auto_downsample.sql Expanded display is used automatically. Border style is 2. Line style is unicode. CREATE FUNCTION CREATE FUNCTION CREATE FUNCTION CREATE FUNCTION psql:auto_downsample.sql:215: ERROR: syntax error at or near "RAISE" LINE 47: RAISE NOTICE query_construct ^ ```

utils/auto_downsample.sql

jonatas

I was trying to test it with the following hypertable:

\d ticks
                  Table "toolkit_experimental.ticks"
┌────────┬──────────────────────────┬───────────┬──────────┬─────────┐
│ Column │           Type           │ Collation │ Nullable │ Default │
├────────┼──────────────────────────┼───────────┼──────────┼─────────┤
│ time   │ timestamp with time zone │           │ not null │         │
│ symbol │ text                     │           │          │         │
│ price  │ numeric                  │           │          │         │
│ volume │ double precision         │           │          │         │
└────────┴──────────────────────────┴───────────┴──────────┴─────────┘
Indexes:
    "ticks_time_idx" btree ("time" DESC)
Triggers:
    ts_insert_blocker BEFORE INSERT ON ticks FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.insert_blocker()
Number of child tables: 1 (Use \d+ to list them.)

Now, creating the auto downsample:

SELECT *
FROM auto_downsample(
                                'ticks',
                                INTERVAL '10m',
                                $aggregate_options$
                                        [
                                                {"with_columns": ["price"],  "aggregate":"avg(price) AS value"}
                                        ]
                                $aggregate_options$,
                                'symbol',
                                $where_clause$
                                                                WHERE time BETWEEN NOW()-INTERVAL'30d' AND NOW()-INTERVAL'10d'
                                $where_clause$)

                                AS (time TIMESTAMP, symbol text, value DOUBLE PRECISION);

I got the following error:

 Using parameter set <NULL>
ERROR:  No aggregator given!
HINT:  Supply a "aggregate" field in the JSON aggregate object
CONTEXT:  PL/pgSQL function auto_downsample(text,interval,jsonb,text,text,text,text,boolean) line 21 at RAISE

Why can't I get it with a single aggregates? Maybe I'm missing something or my context is not suitable for this 🤔

Co-authored-by: Jônatas Davi Paganini <jonatasdp@gmail.com> Signed-off-by: Xasin <davidbailey.2889@gmail.com>

…ield

Xasin · 2023-06-19T19:34:19Z

@jonatas I think the problem there is that you are using a different schema than the "public" schema. The schema name can be given via the hypertable_schema named parameter.

I am assuming "public" is the default schema, but perhaps there is a better way, e.g. using the currently "active" schema, if such a thing exists?

What caught me off guard is that it's raising the wrong error. I thought I had a guard in check that specifically detects when it can't find a fitting table and give an appropriate error message, but it seems that it didn't work.
I tweaked it so it now correctly says when it can't find the hypertable, the error message looks like this:

ERROR:  No fitting hypertable or aggregate found for given columns and table "public"."kernel"!
HINT:  Make sure the 'with_columns' field of the aggregate options is correct, and that `hypertable` and `hypertable_schema` are correct!

It seems to work fine on a Hypertable with no continuous aggregates for me.

jonatas · 2023-06-19T19:45:04Z

Thanks for the fix @Xasin.

Just trying it again:

CREATE TABLE ticks
( time TIMESTAMP NOT NULL,
    symbol varchar,
    price decimal,
    volume int);
SELECT create_hypertable('ticks', 'time');

And then:

SELECT *
FROM auto_downsample(
                                'ticks',
                                INTERVAL '10m',
                                $aggregate_options$
                                        [
                                                {"with_columns": ["price"],  "aggregate":"avg(price) AS value"}
                                        ]
                                $aggregate_options$,
                                'symbol',
                                $where_clause$
                                                                WHERE time BETWEEN NOW()-INTERVAL'30d' AND NOW()-INTERVAL'10d'
                                $where_clause$)

                                AS (time TIMESTAMP, symbol text, value DOUBLE PRECISION);

Now, after reloading I have the following error:

ERROR:  function auto_downsample(unknown, interval, unknown, unknown, unknown) does not exist
LINE 2: FROM auto_downsample(
             ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

Xasin · 2023-06-19T19:52:35Z

@jonatas very strange, I am getting an error as well, but it's quite different.

When I run your commands 1:1 in my psql prompt (on a PostgreSQL 13.11 server), the error message is the following:

NOTICE:  Using parameter set {"aggregate": "avg(price) AS value", "table_name": "ticks", "table_schema": "public", "with_columns": ["price"], "table_interval": "00:00:00", "interval_matched": false}
ERROR:  structure of query does not match function result type
DETAIL:  Returned type character varying does not match expected type text in column 2.
CONTEXT:  PL/pgSQL function auto_downsample(text,interval,jsonb,text,text,text,text,boolean) line 42 at RETURN QUERY

But after fixing those by replacing the AS () with the proper types (i.e. CHARACTER VARYING and NUMERIC) it actually works fine for me.
Maybe different PostgreSQL versions pull the AS ( ) statement into the function definition itself and throw a less understandable error on return type missmatch?

jonatas · 2023-06-20T13:44:25Z

I'm on Postgresql 14.7. Maybe we could leave a complete example running in the examples, just to allow people to try to replicate it and then adapt to their needs. Can you share the changes you did just to make sure I'm following all the replaces in the AS ( ) that you mentioned?

Xasin · 2023-06-20T14:18:57Z

@jonatas the edited "AS" clause has to have the exact same types as the return types of the query, so it looks like:
AS (time TIMESTAMP, symbol character varying, value numeric);
for the case you gave me. varchar and text are different, and the aggregation functions seem to return numeric.

I also confirmed that I have no other typedef of the function that might be hiding the issue, so I'm sure I only have the auto_downsample function version of the provided SQL file.

I'll update a fully working example in a moment.

jonatas

Thanks for all your help here! very good to have this helper around 🚀 🙇

Tested manually ✅

Xasin · 2023-06-20T19:48:48Z

I'm really happy to help!
It might come in quite handy at some of my work soon, which is very exciting. In any case, I am more than happy to give back to such a helpful too.

I don't think there's anything else from me left to change, so unless there's something else that's needed I think this can be merged in.

jonatas · 2023-06-21T22:15:43Z

Amazing! Thanks again! Adding Chris to have a second 👀 and if @d-sooter has any feedback let us know before we merge it.

nimbit-software · 2023-06-25T14:27:47Z

So i played around a bit with the function and i love it. And will be using it :) One issue i had is that i join my hypertable/Cagg on a metadata table. And since i dont know which table will be selected i cant really perform my join.

For my situation i just added a fixed alias to the table so that it can be referenced if needed in a join but im not sure if it would make sense to allow an alias to be passed for the table or to have a fixed alias

Xasin · 2023-06-26T08:09:23Z

@nimbit-software hm... That is an interesting problem. Would it help if I aliased the selected table to a common name inside the internal query, so that it's consistently referenceable for JOINs?

For now, can you try something for me?
The way the WHERE clause query is injected into the SQL it's appended directly after the FROM clause, so perhaps if you start the filter clause with a AS table_name, the AS will be appended directly to the FROM table to give it an alias?

EDIT: Maybe that's what you already did.
I think internally assigning a fixed table alias is a cleaner solution.

d-sooter · 2023-06-26T10:22:32Z

so basically this is what i did.

i set a static alias "timeseries" and then i can perform my join. I can imagine cases where people would want to pass in their own alias, but then we have another input parameter and its starts getting a bit full.

query_construct := format($qry$ SELECT time_bucket(%L, %I) AS time, %s, %s FROM %I.%I as timeseries %s GROUP BY 1, %s ORDER BY 1 $qry$, data_interval, time_column, groupby_clause, aggregator_column, selected_parameters->>'table_schema', selected_parameters->>'table_name', filter_query, groupby_clause);

we could check the table name and if there is a space in it then split it up and set the alias.

CREATE OR REPLACE FUNCTION auto_downsample(hypertable TEXT, data_interval INTERVAL,
	aggregate_choices JSONB, 
	groupby_clause TEXT, filter_query TEXT,
	hypertable_schema TEXT DEFAULT 'public',
	time_column TEXT DEFAULT 'time',
	debug_query BOOLEAN DEFAULT FALSE)
RETURNS SETOF RECORD
LANGUAGE plpgsql
STABLE
AS $BODY$
DECLARE
	selected_parameters jsonb;
	aggregator_column TEXT;
	query_construct TEXT;
	hypertable_internal text;
  	hypertable_alias text;
  	parts text[];
BEGIN
	  IF hypertable ~* '\s+AS\s+' THEN
    	parts := regexp_split_to_array(hypertable, '\s+AS\s+', 'i');
	  ELSE
		parts := regexp_split_to_array(hypertable, '\s+', 'i');
	  END IF;

	  -- Extract the table name and alias
	  hypertable_internal := parts[1];
	  hypertable_alias := parts[array_length(parts, 1)];



	SELECT aggregate_choice(hypertable_internal, data_interval, aggregate_choices, hypertable_schema) INTO selected_parameters;

	IF selected_parameters IS NULL THEN
		RAISE EXCEPTION $$No fitting hypertable or aggregate found for given columns and table "%"."%"!$$,
			hypertable_schema, hypertable_internal
			USING HINT = $$Make sure the 'with_columns' field of the aggregate options is correct, and that `hypertable` and `hypertable_schema` are correct!$$
		RETURN;
	END IF;

	RAISE NOTICE 'Using parameter set %', selected_parameters;

	aggregator_column := selected_parameters->>'aggregate';

	IF aggregator_column IS NULL THEN
		RAISE EXCEPTION 'No aggregator given!' USING HINT = 'Supply a "aggregate" field in the JSON aggregate object'
		RETURN;
	END IF;

	query_construct := format($qry$
			SELECT time_bucket(%L, %I) AS time, %s, %s
			FROM %I.%I as %s
			%s
			GROUP BY 1, %s
			ORDER BY 1
		$qry$, data_interval, time_column, groupby_clause, aggregator_column,
		selected_parameters->>'table_schema', selected_parameters->>'table_name', hypertable_alias,
		filter_query,
		groupby_clause);

	IF debug_query THEN
		RAISE NOTICE 'Generated query output: %', query_construct;
	END IF;

	RETURN QUERY EXECUTE query_construct;
END;
$BODY$;

Xasin · 2023-06-26T14:51:23Z

@d-sooter that doesn't sound too bad.
"timeseries" should be OK, but it almost sounds like a potential internal keyword?

It doesn't actually matter how many parameters the function has. PLPGSQL functions support named parameters, where individual parameters can be accessed through e.g. some_function(parameter_a => 10, parameter_c => 32)
This keeps it fairly readable even with higher parameter counts, and if there is a "sane default" that most people will use (like how the default schema is public and the default time column name is time), it's not that bad to add more arguments.

I don't feel comfortable with the RegExp solution, it can get a bit messy and harder to understand than an explicit parameter like that.

nimbit-software · 2023-07-12T12:17:37Z

I had one last idea. I already implemented it in my system and it works good. That is gapfill

I added a parameter to the function

bucket_function TEXT DEFAULT 'time_bucket',

and then just updated the query

          query_construct := format($qry$
                        SELECT %I(%L, %I) AS time, %s, %s
                        FROM %I.%I AS %I
                        %s
                        GROUP BY 1, %s
                        ORDER BY 1
                    $qry$, bucket_function, data_interval, time_column, groupby_clause, aggregator_column,
                    selected_parameters->>'table_schema', selected_parameters->>'table_name',
                    table_alias,
                    filter_query,
                    groupby_clause);

Since the where is passed it it makes it easy to use the interpolation or locf functions as well.

I tested it and it works like a charm

feat(auto_aggregation): ✨ add a first working draft of automatic aggr…

76ce709

…egate selection

docs: 📝 add docs for the ts_autoagg function

d41aec6

jonatas reviewed Jun 12, 2023

View reviewed changes

utils/aggregate_selector.sql Outdated Show resolved Hide resolved

d-sooter reviewed Jun 16, 2023

View reviewed changes

Xasin added 3 commits June 17, 2023 11:49

refactor: change name to auto_downsample

27af11c

Signed-off-by: Xasin <davidbailey.2889@gmail.com>

docs: update docs naming to auto_downsample

62f1a7b

refactor: get filename in line with function name

6269ca9

Xasin marked this pull request as ready for review June 17, 2023 09:54

feat: add internal query debug output

ba0e56c

jonatas reviewed Jun 19, 2023

View reviewed changes

utils/auto_downsample.sql Outdated Show resolved Hide resolved

jonatas reviewed Jun 19, 2023

View reviewed changes

Xasin and others added 2 commits June 19, 2023 21:16

fix: properly format RAISE NOTICE statement

37c38b8

Co-authored-by: Jônatas Davi Paganini <jonatasdp@gmail.com> Signed-off-by: Xasin <davidbailey.2889@gmail.com>

fix: 🐛 correctly detect missing hypertable versus missing aggregate f…

9010c85

…ield

docs: 📝 add working minimal example to auto_downsample docs

78e77a7

jonatas approved these changes Jun 20, 2023

View reviewed changes

jonatas requested a review from noctarius June 21, 2023 22:14

Xasin added 2 commits June 28, 2023 07:46

feat: add configurable table alias name to auto_downsample FROM

2ffc34a

docs: update documentation for the new table_alias parameter

fd415b9

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Draft: ✨ add a first working draft of auto_aggregate selection #33

Draft: ✨ add a first working draft of auto_aggregate selection #33

Xasin commented Jun 3, 2023 •

edited

jonatas commented Jun 7, 2023

Xasin commented Jun 7, 2023

Xasin commented Jun 7, 2023

d-sooter Jun 16, 2023

Xasin Jun 17, 2023

Xasin Jun 17, 2023

jonatas Jun 19, 2023

jonatas left a comment

Xasin commented Jun 19, 2023

jonatas commented Jun 19, 2023 •

edited

Xasin commented Jun 19, 2023

jonatas commented Jun 20, 2023

Xasin commented Jun 20, 2023

jonatas left a comment •

edited

Xasin commented Jun 20, 2023

jonatas commented Jun 21, 2023

nimbit-software commented Jun 25, 2023

Xasin commented Jun 26, 2023 •

edited

d-sooter commented Jun 26, 2023 •

edited

Xasin commented Jun 26, 2023 •

edited

nimbit-software commented Jul 12, 2023

		RAISE NOTICE 'Generated query output:'
		RAISE NOTICE query_construct

Draft: ✨ add a first working draft of auto_aggregate selection #33

Are you sure you want to change the base?

Draft: ✨ add a first working draft of auto_aggregate selection #33

Conversation

Xasin commented Jun 3, 2023 • edited

jonatas commented Jun 7, 2023

Xasin commented Jun 7, 2023

Xasin commented Jun 7, 2023

d-sooter Jun 16, 2023

Choose a reason for hiding this comment

Xasin Jun 17, 2023

Choose a reason for hiding this comment

Xasin Jun 17, 2023

Choose a reason for hiding this comment

jonatas Jun 19, 2023

Choose a reason for hiding this comment

jonatas left a comment

Choose a reason for hiding this comment

Xasin commented Jun 19, 2023

jonatas commented Jun 19, 2023 • edited

Xasin commented Jun 19, 2023

jonatas commented Jun 20, 2023

Xasin commented Jun 20, 2023

jonatas left a comment • edited

Choose a reason for hiding this comment

Xasin commented Jun 20, 2023

jonatas commented Jun 21, 2023

nimbit-software commented Jun 25, 2023

Xasin commented Jun 26, 2023 • edited

d-sooter commented Jun 26, 2023 • edited

Xasin commented Jun 26, 2023 • edited

nimbit-software commented Jul 12, 2023

Xasin commented Jun 3, 2023 •

edited

jonatas commented Jun 19, 2023 •

edited

jonatas left a comment •

edited

Xasin commented Jun 26, 2023 •

edited

d-sooter commented Jun 26, 2023 •

edited

Xasin commented Jun 26, 2023 •

edited