New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add job execution stats to telemetry #4824
Conversation
24e8c0f
to
f4505a5
Compare
TODO
TOFIX
|
7358aae
to
fc64761
Compare
a10e16b
to
ac0f1c1
Compare
src/telemetry/telemetry.c
Outdated
"SELECT (" | ||
" case WHEN q.proc_schema = \'_timescaledb_internal\' AND q.proc_name = \'policy_retention\' then q.proc_name::text " | ||
" WHEN q.proc_schema = \'_timescaledb_internal\' AND q.proc_name = \'policy_compression\' then q.proc_name::text " | ||
" WHEN q.proc_schema = \'_timescaledb_internal\' AND q.proc_name = \'policy_reorder\' then q.proc_name::text " | ||
" WHEN q.proc_schema = \'_timescaledb_internal\' AND q.proc_name = \'policy_refresh_continuous_aggregate\' then q.proc_name::text " | ||
" ELSE \'user_defined_action\'::text " | ||
" end" | ||
") as job_type, sum(total_runs)::bigint as total_runs, sum(total_successes)::bigint as " | ||
"total_successes," | ||
"sum(total_failures)::bigint as total_failures, sum(total_crashes)::bigint as " | ||
"total_crashes," | ||
"sum(total_duration) as total_duration, sum(total_duration_failures) as total_duration_failures, max(consecutive_failures)::int as " | ||
"max_consecutive_failures, max(consecutive_crashes)::int as max_consecutive_crashes " | ||
"FROM " | ||
"(select j.proc_schema, j.proc_name, s.total_runs, s.total_successes, s.total_failures," | ||
"s.total_crashes, s.total_duration, s.total_duration_failures, s.consecutive_crashes, s.consecutive_failures " | ||
"FROM " | ||
"_timescaledb_internal.bgw_job_stat s join _timescaledb_config.bgw_job j on j.id = " | ||
"s.job_id) q " | ||
"GROUP BY job_type"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"SELECT (" | |
" case WHEN q.proc_schema = \'_timescaledb_internal\' AND q.proc_name = \'policy_retention\' then q.proc_name::text " | |
" WHEN q.proc_schema = \'_timescaledb_internal\' AND q.proc_name = \'policy_compression\' then q.proc_name::text " | |
" WHEN q.proc_schema = \'_timescaledb_internal\' AND q.proc_name = \'policy_reorder\' then q.proc_name::text " | |
" WHEN q.proc_schema = \'_timescaledb_internal\' AND q.proc_name = \'policy_refresh_continuous_aggregate\' then q.proc_name::text " | |
" ELSE \'user_defined_action\'::text " | |
" end" | |
") as job_type, sum(total_runs)::bigint as total_runs, sum(total_successes)::bigint as " | |
"total_successes," | |
"sum(total_failures)::bigint as total_failures, sum(total_crashes)::bigint as " | |
"total_crashes," | |
"sum(total_duration) as total_duration, sum(total_duration_failures) as total_duration_failures, max(consecutive_failures)::int as " | |
"max_consecutive_failures, max(consecutive_crashes)::int as max_consecutive_crashes " | |
"FROM " | |
"(select j.proc_schema, j.proc_name, s.total_runs, s.total_successes, s.total_failures," | |
"s.total_crashes, s.total_duration, s.total_duration_failures, s.consecutive_crashes, s.consecutive_failures " | |
"FROM " | |
"_timescaledb_internal.bgw_job_stat s join _timescaledb_config.bgw_job j on j.id = " | |
"s.job_id) q " | |
"GROUP BY job_type"; | |
"SELECT " | |
" CASE " | |
" WHEN j.proc_schema = \'_timescaledb_internal\' AND j.proc_name ~ \'^policy_(retention|compression|reorder|refresh_continuous_aggregate)$\' THEN j.proc_name::text " | |
" ELSE \'user_defined_action\'::text " | |
" END AS job_type, " | |
" sum(s.total_runs)::bigint AS total_runs, " | |
" sum(s.total_successes)::bigint AS total_successes, " | |
" sum(s.total_failures)::bigint AS total_failures, " | |
" sum(s.total_crashes)::bigint AS total_crashes, " | |
" sum(s.total_duration) AS total_duration, " | |
" sum(s.total_duration_failures) AS total_duration_failures, " | |
" max(s.consecutive_failures)::int AS max_consecutive_failures, " | |
" max(s.consecutive_crashes)::int AS max_consecutive_crashes " | |
"FROM " | |
" _timescaledb_internal.bgw_job_stat s " | |
" JOIN _timescaledb_config.bgw_job j ON j.id = s.job_id " | |
"GROUP BY " | |
" job_type " |
I didn't test but just for better code readability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
beautiful, thank you! Changed it as you suggest
350a088
to
49003cd
Compare
Codecov Report
@@ Coverage Diff @@
## main #4824 +/- ##
===========================================
+ Coverage 63.05% 89.52% +26.46%
===========================================
Files 226 226
Lines 44692 50818 +6126
===========================================
+ Hits 28182 45496 +17314
+ Misses 16510 5322 -11188
Continue to review full report at Codecov.
|
27d0b42
to
8b25087
Compare
8b25087
to
fa41abe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should fully schema-qualify everything in your SPI queries.
You can use pgspot like so to check for unqualified references:
pgspot <<<"SELECT (CASE WHEN j.proc_schema = '_timescaledb_internal' AND j.proc_name ~ '^policy_(retention|compression|reorder|refresh_continuous_aggregate)$' THEN j.proc_name::text ELSE 'user_defined_action'::text end) as job_type, sum(total_runs)::bigint as total_runs, sum(total_successes)::bigint as total_successes, sum(total_failures)::bigint as total_failures, sum(total_crashes)::bigint as total_crashes, sum(total_duration) as total_duration, sum(total_duration_failures) as total_duration_failures, max(consecutive_failures)::int as max_consecutive_failures, max(consecutive_crashes)::int as max_consecutive_crashes FROM _timescaledb_internal.bgw_job_stat s JOIN _timescaledb_config.bgw_job j on j.id = s.job_id GROUP BY job_type"
PS016: Unqualified function call: sum at line 2
PS016: Unqualified function call: sum at line 2
PS001: Unqualified operator: '=' in j.id = s.job_id at line 2
PS017: Unqualified object reference: text in CAST('user_defined_action' AS text) at line 2
PS016: Unqualified function call: sum at line 2
PS016: Unqualified function call: sum at line 2
PS016: Unqualified function call: sum at line 2
PS016: Unqualified function call: sum at line 2
PS016: Unqualified function call: max at line 2
PS016: Unqualified function call: max at line 2
PS017: Unqualified object reference: text in CAST(j.proc_name AS text) at line 2
PS001: Unqualified operator: '=' in j.proc_schema = '_timescaledb_internal' at line 2
PS001: Unqualified operator: '~' in j.proc_name ~ '^policy_(retention|compression|reorder|refresh_continuous_aggregate)$' at line 2
Errors: 2 Warnings: 11 Unknown: 0
d25fb70
to
88f5b80
Compare
Pasting some performance details for the query that groups by sqlerrcode:
The table
The query takes 3.5 sec to execute and here's the query plan:
Results:
|
46f9bd3
to
bffe462
Compare
6b1d9d5
to
9ac87e4
Compare
970eb3b
to
0745da2
Compare
MemoryContext old_context = CurrentMemoryContext, spi_context; | ||
|
||
const char *command_string = | ||
"SELECT " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cant we do SET search_path TO pg_catalog, pg_temp
here and then run the query. Would make the query clearer and also safeguard against future non-qualified references in the query
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had tried that suggestion initially but ran into some issues with the tests so did it like this eventually. I see your point though, I will try to do SET search_path
instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
-[ RECORD 1 ]-----------+-------------------------------- | ||
job_id | 1000 | ||
last_start | Fri Dec 31 16:00:00.05 1999 PST | ||
last_finish | Fri Dec 31 16:00:00.05 1999 PST |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wont this introduce flakyness on slower ci environments when the job cant be finished in those 0.05 seconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, because in those tests we use a virtual timer that advances when mock_wait is called, and the setting for mock_wait in this case is WAIT_ON_JOB
(
timescaledb/test/src/bgw/timer_mock.c
Line 53 in 2475c1b
case WAIT_ON_JOB: |
The handle on which we wait is the handle of the last registered job, in this case 1000, so the job will complete first before advancing the timer.
edcd9d6
to
8da38a9
Compare
8da38a9
to
87e9ad4
Compare
This patch adds two new fields to the telemetry report, `stats_by_job_type` and `errors_by_sqlerrcode`. Both report results grouped by job type (different types of policies or user defined action). The patch also adds a new field to the `bgw_job_stats` table, `total_duration_errors` to separate the duration of the failed runs from the duration of successful ones.
87e9ad4
to
0e6ee72
Compare
No description provided.