sql: don't throw errors for skipped auto stats jobs #149538

mw5h · 2025-07-04T17:55:31Z

Previously, auto stats jobs would throw errors and increase failed jobs
counters if they attempted to start while a stats collection was already
in progress on the table. For large clusters with
'sql.stats.automatic_job_check_before_creating_job.enabled' set to true,
this could create quite a few failed jobs. These failed jobs don't seem
to cause any performance issues, but they clutter logs, potentially
obscuring real problems and alarming customers, who then file tickets
with support to figure out why their jobs are failing.

This patch:

* refactors the autostats checks to reduce code duplication.
* swallows the error for concurrent auto stats creation, logging at
  INFO level instead.
* changes the create stats jobs test so that it no longer expects these
  job creations to fail and instead expects the stats to not be
  collected.
* fixes a bug in the create stats jobs test that would cause it to hang
  instead of exiting on error.
* adds a cluster setting,
  sql.stats.error_on_concurrent_create_stats.enabled, which controls
  this new behavior. By default the old behavior is maintained.

Fixes: #148413
Release note (ops change): CockroachDB now has a cluster setting,
sql.stats.error_on_concurrent_create_stats.enabled, which modifies how
it reacts to concurrent auto stats jobs. The default, true, maintains
the previous behavior. Setting this to false will cause the concurrent
auto stats job to be skipped with just a log entry and no increased
error counters.

yuzefovich

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball)

pkg/sql/create_stats.go line 195 at r1 (raw file):

	}); err != nil {
		if job != nil {
			if errors.Is(err, stats.ConcurrentCreateStatsError) {

I think this check needs to come before the if job != nil condition, since job is only populated in CreateStartableJobWithTxn, which we won't reach when we hit the concurrent stats error.
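
To illustrate the ordering concern, here is a minimal standalone sketch, not the actual create_stats.go code. The names errConcurrentCreateStats, errUnrelated, job, and handleSetupErr are hypothetical stand-ins (the real sentinel is stats.ConcurrentCreateStatsError). The point is only that the errors.Is check runs before, and independently of, the nil check on the job.

```go
package main

import (
	"errors"
	"fmt"
)

// errConcurrentCreateStats is a hypothetical stand-in for
// stats.ConcurrentCreateStatsError.
var errConcurrentCreateStats = errors.New("another CREATE STATISTICS job is already running")

// errUnrelated is a hypothetical unrelated failure used for the demo.
var errUnrelated = errors.New("unrelated setup failure")

// job is a hypothetical stand-in for the startable job handle.
type job struct{ cleanedUp bool }

// handleSetupErr classifies an error from the job-setup transaction.
// The sentinel check comes first, so it still fires when j is nil,
// which is the case when the job was never created.
func handleSetupErr(err error, j *job) (skipped bool, retErr error) {
	if err == nil {
		return false, nil
	}
	if errors.Is(err, errConcurrentCreateStats) {
		return true, nil
	}
	if j != nil {
		// Only a job that was actually created needs cleanup.
		j.cleanedUp = true
	}
	return false, err
}

func main() {
	// The concurrent-stats error arrives wrapped and with a nil job.
	wrapped := fmt.Errorf("creating stats job: %w", errConcurrentCreateStats)
	skipped, err := handleSetupErr(wrapped, nil)
	fmt.Println(skipped, err)

	// Any other error is returned, and an existing job is cleaned up.
	j := &job{}
	skipped, err = handleSetupErr(errUnrelated, j)
	fmt.Println(skipped, err, j.cleanedUp)
}
```

If the sentinel check were nested inside the job != nil branch, the first call above would fall through and return the error.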

pkg/sql/stats/create_stats_job_test.go line 488 at r1 (raw file):

	// Allow both auto partial stat jobs to complete.
	close(allowRequest)

Do we not need to close this channel here to let the auto stats jobs through?

If this is driven by a desire to ensure that the allowRequest channel is closed in all scenarios (including erroneous ones where the test fails but would previously have stayed blocked), in 4f252fa I had a similar problem and worked around it with a "conditional defer".
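
For reference, a rough sketch of what a "conditional defer" could look like. This is an assumption about the general pattern, not a copy of what 4f252fa does, and runScenario and its failEarly parameter are made up for illustration. The channel is closed explicitly at the point the scenario needs it, and a deferred function closes it only if that point was never reached.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// runScenario simulates a test body that gates a background worker on
// allowRequest. failEarly simulates a failure before the explicit close.
func runScenario(failEarly bool) (closedByDefer bool, err error) {
	allowRequest := make(chan struct{})

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		<-allowRequest // blocked until the channel is closed
	}()

	closed := false
	// Conditional defer: close the channel on exit only if the normal
	// path has not already done so, then wait for the worker to finish.
	defer func() {
		if !closed {
			close(allowRequest)
			closedByDefer = true
		}
		wg.Wait()
	}()

	if failEarly {
		return false, errors.New("simulated early failure")
	}

	// Normal path: release the worker exactly where the scenario needs it.
	close(allowRequest)
	closed = true
	return false, nil
}

func main() {
	fmt.Println(runScenario(false))
	fmt.Println(runScenario(true))
}
```

Without the deferred close, the failEarly path would leave the worker blocked forever, which is the kind of hang the test fix is trying to avoid.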

pkg/sql/stats/create_stats_job_test.go line 304 at r1 (raw file):

	// Allow the running full stat job and the new full and partial stat jobs to complete.
	defer close(allowRequest)
	// Don't block the on the autostats jobs.

nit: "the on".

pkg/sql/stats/create_stats_job_test.go line 476 at r1 (raw file):

	// Attempt to start a simultaneous auto partial stat run on the same table.
	// It should fail.
	_, err := conn.Exec(`CREATE STATISTICS __auto_partial__ FROM d.t1 USING EXTREMES`)

nit: maybe worth extracting the logic to start a job and verify that its result is present in table_statistics into a helper, that is shared between different tests? It could also be made to support both full and partial.

mw5h · 2025-07-08T20:15:17Z

RFAL. I also added a cluster setting so that we can backport to 24.3.

DrewKimball

Reviewed 2 of 2 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mw5h)

yuzefovich

Thanks!

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @mw5h)

-- commits line 27 at r2:
Double checking that we plan to backport with error still returned by default through 24.3, and then we'll have a separate patch that will change the cluster setting default to false to swallow the error by default on master and 25.3 only?

pkg/sql/stats/create_stats_job_test.go line 300 at r2 (raw file):

	setTableID(descpb.InvalidID)

	// Attempt to start automatic full and partial stats runs. Both should fail.

nit: adjust the comments to say "Both should fail when the cluster setting to return an error is enabled.", or just remove these "should fail" sentences.

pkg/sql/create_stats.go line 221 at r2 (raw file):

				log.Warningf(ctx, "failed to delete job: %v", delErr)
			}
			return nil

Should we check the value of the cluster setting here, and only return nil if the setting is false, otherwise we fall through to return err below?
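
Something along these lines, as a self-contained sketch rather than the real code. Here errConcurrentCreateStats, resolveConcurrentErr, and the errorOnConcurrent parameter are hypothetical; errorOnConcurrent stands in for the current value of sql.stats.error_on_concurrent_create_stats.enabled, and logf stands in for the INFO-level log call.

```go
package main

import (
	"errors"
	"fmt"
)

// errConcurrentCreateStats is a hypothetical stand-in for
// stats.ConcurrentCreateStatsError.
var errConcurrentCreateStats = errors.New("another CREATE STATISTICS job is already running")

// resolveConcurrentErr decides what the caller should return for err.
// The error is swallowed (and logged) only when it is the concurrent-stats
// sentinel AND the setting is false; every other case falls through.
func resolveConcurrentErr(err error, errorOnConcurrent bool, logf func(format string, args ...any)) error {
	if errors.Is(err, errConcurrentCreateStats) && !errorOnConcurrent {
		logf("skipping auto stats job: %v", err)
		return nil
	}
	return err
}

func main() {
	logf := func(format string, args ...any) {
		fmt.Printf(format+"\n", args...)
	}

	// Setting true (the default in this PR): the error is still returned.
	fmt.Println(resolveConcurrentErr(errConcurrentCreateStats, true, logf))

	// Setting false: the error is swallowed and only logged.
	fmt.Println(resolveConcurrentErr(errConcurrentCreateStats, false, logf))
}
```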

michae2

Nice!

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @mw5h)

mw5h · 2025-07-08T22:51:14Z

-- commits line 27 at r2:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Double checking that we plan to backport with error still returned by default through 24.3, and then we'll have a separate patch that will change the cluster setting default to false to swallow the error by default on master and 25.3 only?

That's correct. It seemed like a cleaner state transition than putting the patch in with it set to false and then modifying the backport.

mw5h

Reviewable status: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @DrewKimball, @michae2, and @yuzefovich)

pkg/sql/create_stats.go line 221 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Should we check the value of the cluster setting here, and only return nil if the setting is false, otherwise we fall through to return err below?

Yup, good catch, missed a spot.

mw5h requested review from a team as code owners July 4, 2025 17:55

mw5h requested review from DrewKimball and removed request for a team July 4, 2025 17:55

mw5h force-pushed the stats-dirty-concurrency branch from 07c640e to 51d132b Compare July 7, 2025 16:21

yuzefovich reviewed Jul 7, 2025

mw5h force-pushed the stats-dirty-concurrency branch from 51d132b to 59f2b02 Compare July 8, 2025 20:13

DrewKimball approved these changes Jul 8, 2025

yuzefovich approved these changes Jul 8, 2025

michae2 approved these changes Jul 8, 2025

mw5h force-pushed the stats-dirty-concurrency branch from 59f2b02 to e3ade40 Compare July 8, 2025 23:04

mw5h commented Jul 8, 2025

mw5h added the backport-24.3.x Flags PRs that need to be backported to 24.3 label Jul 8, 2025
