Skip to content

sql: don't throw errors for skipped auto stats jobs #149538

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mw5h
Copy link
Contributor

@mw5h mw5h commented Jul 4, 2025

Previously, auto stats jobs would throw errors and increase failed jobs
counters if they attempted to start while a stats collection was already
in progress on the table. For large clusters with
'sql.stats.automatic_job_check_before_creating_job.enabled' set to true,
this could create quite a few failed jobs. These failed jobs don't seem
to cause any performance issues, but they clutter logs, potentially
obscuring real problems and alarming customers, who then file tickets
with support to figure out why their jobs are failing.

This patch:

  • refactors the autostats checks to reduce code duplication.
  • swallows the error for concurrent auto stats creation, logging at
    INFO level instead.
  • changes the create stats jobs test so that it no longer expects these
    jobs creations to fail and instead expects the stats to not be
    collected.
  • fixes a bug in the create stats jobs test that would cause it to hang
    instead of exiting on error.
  • adds a cluster setting,
    sql.stats.error_on_concurrent_create_stats.enabled, which controls
    this new behavior. By default the old behavior is maintained.

Fixes: #148413
Release note (ops change): CockroachDB now has a cluster setting,
sql.stats.error_on_concurrent_create_stats.enabled, which modifies how
it reacts to concurrent auto stats jobs. The default, true, maintains
the previous behavior. Setting this to false will cause the concurrent
auto stats job to be skipped with just a log entry and no increased
error counters.

@mw5h mw5h requested review from a team as code owners July 4, 2025 17:55
@mw5h mw5h requested review from DrewKimball and removed request for a team July 4, 2025 17:55
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@mw5h mw5h force-pushed the stats-dirty-concurrency branch from 07c640e to 51d132b Compare July 7, 2025 16:21
Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball)


pkg/sql/create_stats.go line 195 at r1 (raw file):

	}); err != nil {
		if job != nil {
			if errors.Is(err, stats.ConcurrentCreateStatsError) {

I think this check needs to be before if job != nil condition since job is only populated in CreateStartableJobWithTxn, which we won't reach when we hit the concurrent stats error.


pkg/sql/stats/create_stats_job_test.go line 488 at r1 (raw file):

	// Allow both auto partial stat jobs to complete.
	close(allowRequest)

Do we not need to close this channel here to let the auto stats jobs through?

If this is driven by desire to ensure that allowRequest is channel is closed in all scenarios (including erroneous where the test fails but it'd previously be blocked), in 4f252fa I had a similar problem and went around it by "conditional defer".


pkg/sql/stats/create_stats_job_test.go line 304 at r1 (raw file):

	// Allow the running full stat job and the new full and partial stat jobs to complete.
	defer close(allowRequest)
	// Don't block the on the autostats jobs.

nit: "the on".


pkg/sql/stats/create_stats_job_test.go line 476 at r1 (raw file):

	// Attempt to start a simultaneous auto partial stat run on the same table.
	// It should fail.
	_, err := conn.Exec(`CREATE STATISTICS __auto_partial__ FROM d.t1 USING EXTREMES`)

nit: maybe worth extracting the logic to start a job and verify that its result is present in table_statistics into a helper, that is shared between different tests? It could also be made to support both full and partial.

@mw5h mw5h force-pushed the stats-dirty-concurrency branch from 51d132b to 59f2b02 Compare July 8, 2025 20:13
@mw5h
Copy link
Contributor Author

mw5h commented Jul 8, 2025

RFAL. I also added a cluster setting so that we can backport to 24.3.

Copy link
Collaborator

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mw5h)

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Thanks!

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @mw5h)


-- commits line 27 at r2:
Double checking that we plan to backport with error still returned by default through 24.3, and then we'll have a separate patch that will change the cluster setting default to false to swallow the error by default on master and 25.3 only?


pkg/sql/stats/create_stats_job_test.go line 300 at r2 (raw file):

	setTableID(descpb.InvalidID)

	// Attempt to start automatic full and partial stats runs. Both should fail.

nit: adjust the comments that "Both should fail when the cluster setting to return an error is enabled.", or just remove these "should fail" sentences.


pkg/sql/create_stats.go line 221 at r2 (raw file):

				log.Warningf(ctx, "failed to delete job: %v", delErr)
			}
			return nil

Should we check the value of the cluster setting here, and only return nil if the setting is false, otherwise we fall through to return err below?

Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Nice!

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 3 of 0 LGTMs obtained (waiting on @mw5h)

@mw5h
Copy link
Contributor Author

mw5h commented Jul 8, 2025

-- commits line 27 at r2:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Double checking that we plan to backport with error still returned by default through 24.3, and then we'll have a separate patch that will change the cluster setting default to false to swallow the error by default on master and 25.3 only?

That's correct. It seemed like a cleaner state transition than putting the patch in with it set to false and then modifying the backport.

Previously, auto stats jobs would throw errors and increase failed jobs
counters if they attempted to start while a stats collection was already
in progress on the table. For large clusters with
'sql.stats.automatic_job_check_before_creating_job.enabled' set to true,
this could create quite a few failed jobs. These failed jobs don't seem
to cause any performance issues, but they clutter logs, potentially
obscuring real problems and alarming customers, who then file tickets
with support to figure out why their jobs are failing.

This patch:
*  refactors the autostats checks to reduce code duplication.
*  swallows the error for concurrent auto stats creation, logging at
   INFO level instead.
*  changes the create stats jobs test so that it no longer expects these
   jobs creations to fail and instead expects the stats to not be
   collected.
*  fixes a bug in the create stats jobs test that would cause it to hang
   instead of exiting on error.
*  adds a cluster setting,
   sql.stats.error_on_concurrent_create_stats.enabled, which controls
   this new behavior. By default the old behavior is maintained.

Fixes: cockroachdb#148413
Release note (ops change): CockroachDB now has a cluster setting,
sql.stats.error_on_concurrent_create_stats.enabled, which modifies how
it reacts to concurrent auto stats jobs. The default, true, maintains
the previous behavior. Setting this to false will cause the concurrent
auto stats job to be skipped with just a log entry and no increased
error counters.
@mw5h mw5h force-pushed the stats-dirty-concurrency branch from 59f2b02 to e3ade40 Compare July 8, 2025 23:04
Copy link
Contributor Author

@mw5h mw5h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @DrewKimball, @michae2, and @yuzefovich)


pkg/sql/create_stats.go line 221 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Should we check the value of the cluster setting here, and only return nil if the setting is false, otherwise we fall through to return err below?

Yup, good catch, missed a spot.

@mw5h mw5h added the backport-24.3.x Flags PRs that need to be backported to 24.3 label Jul 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-24.3.x Flags PRs that need to be backported to 24.3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make "another CREATE STATICS job is already running" errors less urgent
5 participants