-
Notifications
You must be signed in to change notification settings - Fork 3.9k
sql: don't throw errors for skipped auto stats jobs #149538
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
07c640e
to
51d132b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball)
pkg/sql/create_stats.go
line 195 at r1 (raw file):
}); err != nil { if job != nil { if errors.Is(err, stats.ConcurrentCreateStatsError) {
I think this check needs to be before if job != nil
condition since job
is only populated in CreateStartableJobWithTxn
, which we won't reach when we hit the concurrent stats error.
pkg/sql/stats/create_stats_job_test.go
line 488 at r1 (raw file):
// Allow both auto partial stat jobs to complete. close(allowRequest)
Do we not need to close this channel here to let the auto stats jobs through?
If this is driven by desire to ensure that allowRequest
is channel is closed in all scenarios (including erroneous where the test fails but it'd previously be blocked), in 4f252fa I had a similar problem and went around it by "conditional defer".
pkg/sql/stats/create_stats_job_test.go
line 304 at r1 (raw file):
// Allow the running full stat job and the new full and partial stat jobs to complete. defer close(allowRequest) // Don't block the on the autostats jobs.
nit: "the on".
pkg/sql/stats/create_stats_job_test.go
line 476 at r1 (raw file):
// Attempt to start a simultaneous auto partial stat run on the same table. // It should fail. _, err := conn.Exec(`CREATE STATISTICS __auto_partial__ FROM d.t1 USING EXTREMES`)
nit: maybe worth extracting the logic to start a job and verify that its result is present in table_statistics
into a helper, that is shared between different tests? It could also be made to support both full and partial.
51d132b
to
59f2b02
Compare
RFAL. I also added a cluster setting so that we can backport to 24.3. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @mw5h)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status:complete! 2 of 0 LGTMs obtained (waiting on @mw5h)
-- commits
line 27 at r2:
Double checking that we plan to backport with error still returned by default through 24.3, and then we'll have a separate patch that will change the cluster setting default to false
to swallow the error by default on master and 25.3 only?
pkg/sql/stats/create_stats_job_test.go
line 300 at r2 (raw file):
setTableID(descpb.InvalidID) // Attempt to start automatic full and partial stats runs. Both should fail.
nit: adjust the comments that "Both should fail when the cluster setting to return an error is enabled.", or just remove these "should fail" sentences.
pkg/sql/create_stats.go
line 221 at r2 (raw file):
log.Warningf(ctx, "failed to delete job: %v", delErr) } return nil
Should we check the value of the cluster setting here, and only return nil
if the setting is false
, otherwise we fall through to return err
below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status:complete! 3 of 0 LGTMs obtained (waiting on @mw5h)
Previously, yuzefovich (Yahor Yuzefovich) wrote…
That's correct. It seemed like a cleaner state transition than putting the patch in with it set to false and then modifying the backport. |
Previously, auto stats jobs would throw errors and increase failed jobs counters if they attempted to start while a stats collection was already in progress on the table. For large clusters with 'sql.stats.automatic_job_check_before_creating_job.enabled' set to true, this could create quite a few failed jobs. These failed jobs don't seem to cause any performance issues, but they clutter logs, potentially obscuring real problems and alarming customers, who then file tickets with support to figure out why their jobs are failing. This patch: * refactors the autostats checks to reduce code duplication. * swallows the error for concurrent auto stats creation, logging at INFO level instead. * changes the create stats jobs test so that it no longer expects these jobs creations to fail and instead expects the stats to not be collected. * fixes a bug in the create stats jobs test that would cause it to hang instead of exiting on error. * adds a cluster setting, sql.stats.error_on_concurrent_create_stats.enabled, which controls this new behavior. By default the old behavior is maintained. Fixes: cockroachdb#148413 Release note (ops change): CockroachDB now has a cluster setting, sql.stats.error_on_concurrent_create_stats.enabled, which modifies how it reacts to concurrent auto stats jobs. The default, true, maintains the previous behavior. Setting this to false will cause the concurrent auto stats job to be skipped with just a log entry and no increased error counters.
59f2b02
to
e3ade40
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @DrewKimball, @michae2, and @yuzefovich)
pkg/sql/create_stats.go
line 221 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Should we check the value of the cluster setting here, and only
return nil
if the setting isfalse
, otherwise we fall through toreturn err
below?
Yup, good catch, missed a spot.
Previously, auto stats jobs would throw errors and increase failed jobs
counters if they attempted to start while a stats collection was already
in progress on the table. For large clusters with
'sql.stats.automatic_job_check_before_creating_job.enabled' set to true,
this could create quite a few failed jobs. These failed jobs don't seem
to cause any performance issues, but they clutter logs, potentially
obscuring real problems and alarming customers, who then file tickets
with support to figure out why their jobs are failing.
This patch:
INFO level instead.
jobs creations to fail and instead expects the stats to not be
collected.
instead of exiting on error.
sql.stats.error_on_concurrent_create_stats.enabled, which controls
this new behavior. By default the old behavior is maintained.
Fixes: #148413
Release note (ops change): CockroachDB now has a cluster setting,
sql.stats.error_on_concurrent_create_stats.enabled, which modifies how
it reacts to concurrent auto stats jobs. The default, true, maintains
the previous behavior. Setting this to false will cause the concurrent
auto stats job to be skipped with just a log entry and no increased
error counters.