Add telemetry about replication #4412
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##             main    #4412      +/-   ##
==========================================
+ Coverage   90.86%   91.17%   +0.30%
==========================================
  Files         220      220
  Lines       40543    39761     -782
==========================================
- Hits        36839    36251     -588
+ Misses       3704     3510     -194
```

Continue to review full report at Codecov.
src/telemetry/telemetry.c
Outdated
```c
res = SPI_execute("SELECT count(pid) from pg_catalog.pg_stat_get_wal_senders() WHERE pid is "
                  "not null",
```
`COUNT(<column>)` usually counts only non-NULL values, so I don't think you need the `not null` filter.
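Since the discussion turns on `count(<column>)` vs `count(*)` NULL semantics, here is a minimal sketch of the difference using Python's built-in sqlite3 module (SQLite's aggregate semantics match PostgreSQL's on this point); the table name and values are invented for illustration:

```python
import sqlite3

# In-memory table standing in for pg_stat_get_wal_senders(): two live
# senders plus one row whose pid is NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE senders (pid INTEGER)")
conn.executemany("INSERT INTO senders VALUES (?)", [(101,), (102,), (None,)])

count_star = conn.execute("SELECT count(*) FROM senders").fetchone()[0]
count_pid = conn.execute("SELECT count(pid) FROM senders").fetchone()[0]
count_filtered = conn.execute(
    "SELECT count(*) FROM senders WHERE pid IS NOT NULL"
).fetchone()[0]

print(count_star)      # 3: counts every row
print(count_pid)       # 2: NULL pids skipped implicitly
print(count_filtered)  # 2: same result, with the filter spelled out
```

So the query shapes discussed in this thread agree on the result; they differ only in how explicitly they exclude NULL pids.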
Suggested change:

```diff
-res = SPI_execute("SELECT count(pid) from pg_catalog.pg_stat_get_wal_senders() WHERE pid is "
-                  "not null",
+res = SPI_execute("SELECT count(pid) from pg_catalog.pg_stat_get_wal_senders()",
```
Is this query actually giving us the thing we are looking for? Can't there be many different kinds of WAL senders, e.g., for backups, archiving, logical replication, etc.? Do we want all of those, or do we want to try to categorize them, or at least determine which ones are actual standby servers?
@mkindahl hmm, that was originally a holdover from a bool_and(), but looking at it now I think I like that it's explicit ("we're counting all the rows where pid is not null") rather than counting pids and missing the NULL ones by oversight. If I were to change this, I might prefer:

```sql
SELECT count(*) FROM pg_catalog.pg_stat_get_wal_senders() WHERE pid is not null
```
@erimatnor unfortunately there doesn't really seem to be a great way to tell them apart at this layer¹, though fortunately we should generally have the replicas in telemetry as well to correlate with. I can rename the JSON field to `num_wal_senders` if you prefer, to make what this is sending clearer.

¹For instance, pg_stat_replication is also just returning all the rows from this function.
src/telemetry/telemetry.c
Outdated
```c
res = SPI_execute("SELECT count(pid) > 0 from pg_catalog.pg_stat_get_wal_receiver() WHERE pid "
                  "is not null",
```
See above.
Suggested change:

```diff
-res = SPI_execute("SELECT count(pid) > 0 from pg_catalog.pg_stat_get_wal_receiver() WHERE pid "
-                  "is not null",
+res = SPI_execute("SELECT count(pid) > 0 from pg_catalog.pg_stat_get_wal_receiver()",
```
I would like to understand better whether this is actually giving us the stats we want; I believe there can be many different kinds of WAL senders. See inline comments.

There are also no new tests to check that these queries actually work and give us what we want. I suggest adding a TAP test that checks telemetry after setting up a replica. There are already similar TAP tests for streaming replication, so it should be straightforward.
src/telemetry/telemetry.c
Outdated
```c
res = SPI_execute("SELECT count(pid) > 0 from pg_catalog.pg_stat_get_wal_receiver() WHERE pid "
                  "is not null",
                  true /* read_only */,
                  0 /* count */);
```
For consistency:
```diff
-res = SPI_execute("SELECT count(pid) > 0 from pg_catalog.pg_stat_get_wal_receiver() WHERE pid "
-                  "is not null",
-                  true /* read_only */,
-                  0 /* count */);
+res = SPI_execute("SELECT count(pid) > 0 FROM pg_catalog.pg_stat_get_wal_receiver() WHERE pid "
+                  "IS NOT NULL",
+                  true /* read_only */,
+                  0 /* count */);
```
Note also that the `res` return value is not checked here.
src/telemetry/telemetry.c
Outdated
```diff
@@ -455,6 +499,7 @@ add_function_call_telemetry(JsonbParseState *state)
 #define REQ_RELS_DISTRIBUTED_HYPERTABLES_DN "distributed_hypertables_data_node"
 #define REQ_RELS_CONTINUOUS_AGGS "continuous_aggregates"
 #define REQ_FUNCTIONS_USED "functions_used"
+#define REQ_IN_RECOVERY "in_recovery"
```
Doesn't seem to be used.
```diff
@@ -445,6 +449,46 @@ add_function_call_telemetry(JsonbParseState *state)
 	pushJsonbValue(&state, WJB_END_OBJECT, NULL);
 }

+static void
+add_replication_telemetry(JsonbParseState *state)
```
I suggest separating the functionality of gathering the stats from updating the JSON, instead of putting it all in one function: for instance, define a struct with the fields we'd like to gather and then put that in the JSON, or at least read the data into fields first and always add the info at the end.

As it stands, there are several places where this function can exit early without adding any JSON fields, or adding only some of them. Is this what we want?
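A minimal sketch of this gather-then-serialize shape, in Python for brevity: the field names mirror the PR's JSON keys, while `gather_stats` is a hypothetical stand-in for the SPI queries.

```python
import json
from dataclasses import dataclass

@dataclass
class ReplicationStats:
    """Fields are gathered first and serialized later, so no early exit
    can leave a half-built JSON object."""
    num_wal_senders: int = 0
    is_wal_receiver: bool = False

def gather_stats() -> ReplicationStats:
    stats = ReplicationStats()
    # Stand-ins for the two SPI queries; a failed query would simply
    # leave the defaults in place instead of returning early.
    stats.num_wal_senders = 1
    stats.is_wal_receiver = False
    return stats

def add_replication_telemetry(report: dict) -> None:
    stats = gather_stats()
    # All fields are added together at the end, so the output shape is fixed.
    report["replication"] = {
        "num_wal_senders": stats.num_wal_senders,
        "is_wal_receiver": stats.is_wal_receiver,
    }

report = {}
add_replication_telemetry(report)
print(json.dumps(report, sort_keys=True))
```

The point of the struct is that every code path produces the same set of keys, which keeps the telemetry schema stable for downstream consumers.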
src/telemetry/telemetry.c
Outdated
```c
#define REQ_NUM_REPLICAS "num_replicas"
#define REQ_IS_REPLICA "is_replica"
```
A couple of issues:
- I think the term "replica" is actually very ambiguous. Is it a standby, async/sync standby, logical replica, etc.? I suggest coming up with something more specific, like `num_standbys`, if that's what we are looking for.
- Since previous refactorings of the telemetry tried to move away from the flat JSON, I think it could make sense to group this information in a sub-object. Otherwise, these fields might not end up next to each other, and the JSON will eventually become unwieldy if we keep adding flat fields. The information will be harder to read and parse as related fields will not be grouped together.
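To illustrate the grouping suggestion, here is a short sketch of the two JSON shapes in Python; the key names follow this discussion, and the neighboring report fields are invented for illustration:

```python
import json

# Flat shape: replication fields can drift apart as other keys are added.
flat_report = {"num_replicas": 1, "functions_used": {}, "is_replica": False}

# Grouped shape: one "replication" sub-object keeps related fields together.
grouped_report = {
    "functions_used": {},
    "replication": {"num_wal_senders": 1, "is_wal_receiver": False},
}

# Consumers can pull the whole replication section in one step:
print(json.dumps(grouped_report["replication"], sort_keys=True))
# {"is_wal_receiver": false, "num_wal_senders": 1}
```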
Force-pushed from 97ab772 to d502a0f
src/telemetry/replication.c
Outdated
```c
	SPI_finish();

	return info;
}
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a newline.
```perl
	"SELECT get_telemetry_report()->'replication'->>'num_wal_senders'");
is($result, qq(1), 'number of wal senders on primary');

$result = $node_primary->safe_psql('postgres',
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice testing
This commit adds telemetry about replication status to our telemetry gatherer. It adds a new sub-object `replication` containing two fields:
- `is_wal_receiver` is a boolean which is true if-and-only-if the current cluster has a `wal_receiver`.
- `num_wal_senders` is the number of `wal_senders` that the current cluster has.
Force-pushed from d502a0f to a32a694
This release adds major new features since the 2.7.2 release. We deem it moderate priority for upgrading.

This release includes these noteworthy features:
* time_bucket now supports bucketing by month, year and timezone
* Improve performance of bulk SELECT and COPY for distributed hypertables
* 1 step CAgg policy management
* Migrate Continuous Aggregates to the new format

**Features**
* timescale#4188 Use COPY protocol in row-by-row fetcher
* timescale#4307 Mark partialize_agg as parallel safe
* timescale#4380 Enable chunk exclusion for space dimensions in UPDATE/DELETE
* timescale#4384 Add schedule_interval to policies
* timescale#4390 Faster lookup of chunks by point
* timescale#4393 Support intervals with day component when constifying now()
* timescale#4397 Support intervals with month component when constifying now()
* timescale#4405 Support ON CONFLICT ON CONSTRAINT for hypertables
* timescale#4412 Add telemetry about replication
* timescale#4415 Drop remote data when detaching data node
* timescale#4416 Handle TRUNCATE TABLE on chunks
* timescale#4425 Add parameter check_config to alter_job
* timescale#4430 Create index on Continuous Aggregates
* timescale#4439 Allow ORDER BY on continuous aggregates
* timescale#4443 Add stateful partition mappings
* timescale#4484 Use non-blocking data node connections for COPY
* timescale#4495 Support add_dimension() with existing data
* timescale#4502 Add chunks to baserel cache on chunk exclusion
* timescale#4545 Add hypertable distributed argument and defaults
* timescale#4552 Migrate Continuous Aggregates to the new format
* timescale#4556 Add runtime exclusion for hypertables
* timescale#4561 Change get_git_commit to return full commit hash
* timescale#4563 1 step CAgg policy management
* timescale#4641 Allow bucketing by month, year, century in time_bucket and time_bucket_gapfill
* timescale#4642 Add timezone support to time_bucket

**Bugfixes**
* timescale#4359 Create composite index on segmentby columns
* timescale#4374 Remove constified now() constraints from plan
* timescale#4416 Handle TRUNCATE TABLE on chunks
* timescale#4478 Synchronize chunk cache sizes
* timescale#4486 Adding boolean column with default value doesn't work on compressed table
* timescale#4512 Fix unaligned pointer access
* timescale#4519 Throw better error message on incompatible row fetcher settings
* timescale#4549 Fix dump_meta_data for windows
* timescale#4553 Fix timescaledb_post_restore GUC handling
* timescale#4573 Load TSL library on compressed_data_out call
* timescale#4575 Fix use of `get_partition_hash` and `get_partition_for_key` inside an IMMUTABLE function
* timescale#4577 Fix segfaults in compression code with corrupt data
* timescale#4580 Handle default privileges on CAggs properly
* timescale#4582 Fix assertion in GRANT .. ON ALL TABLES IN SCHEMA
* timescale#4583 Fix partitioning functions
* timescale#4589 Fix rename for distributed hypertable
* timescale#4601 Reset compression sequence when group resets
* timescale#4611 Fix a potential OOM when loading large data sets into a hypertable
* timescale#4624 Fix heap buffer overflow
* timescale#4627 Fix telemetry initialization
* timescale#4631 Ensure TSL library is loaded on database upgrades
* timescale#4646 Fix time_bucket_ng origin handling
* timescale#4647 Fix the error "SubPlan found with no parent plan" that occurred if using joins in RETURNING clause

**Thanks**
* @AlmiS for reporting error on `get_partition_hash` executed inside an IMMUTABLE function
* @Creatation for reporting an issue with renaming hypertables
* @janko for reporting an issue when adding bool column with default value to compressed hypertable
* @jayadevanm for reporting error of TRUNCATE TABLE on compressed chunk
* @michaelkitson for reporting permission errors using default privileges on Continuous Aggregates
* @mwahlhuetter for reporting error in joins in RETURNING clause
* @ninjaltd and @mrksngl for reporting a potential OOM when loading large data sets into a hypertable
* @PBudmark for reporting an issue with dump_meta_data.sql on Windows
* @ssmoss for reporting an issue with time_bucket_ng origin handling