SchemaEngine: Ensure GetTableForPos returns table schema for "current" position by default #15912

mattlord · 2024-05-10T00:18:55Z

Description

In SchemaEngine's GetTableForPos function, when the historian is disabled we would not find a table schema for the given GTID position, so we would return the "current" table schema from the cache. The problem is that the cache can be quite stale (up to 30 minutes by default, see --queryserver-config-schema-reload-time). Similarly, if you wanted the current schema for a table that was created recently (within the last 30 minutes by default), GetTableForPos would return a table not found error because the table was not yet in the cache. So if you are doing a fair amount of schema changes/migrations, this function's results can be out of date and block subsequent schema changes/migrations.

This PR changes GetTableForPos so that it returns the table schema for the "current" position. If we do have the table in our cache, we refresh the schema for that table from the database to be sure that it's up to date (and we update the entry in the cache). If we do not have the table in our cache, we refresh the schema cache from the database in full. This also serves as a way to initialize the cache if that hasn't been done yet (e.g. in a recently started vttablet process).

Refreshing the schema cache is an expensive operation, which is why --queryserver-config-schema-reload-time defaults to 30m. But this change only affects VReplication, as the only users of GetTableForPos() are vstreamers. And we only reload the full schema cache there if the requested table is not already in the cache, which should be rare: it would mean you're trying to replicate a table that doesn't exist (we get the table schema during the planning step). This effectively turns the table schema cache, for VReplication, into a read-through cache.
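To make the read-through behavior described above concrete, here is a minimal Go sketch. It is not the code from this PR: `tableSchema`, `schemaSource`, `schemaCache`, `getCurrent` and `fakeDB` are all hypothetical names invented for illustration, and locking is omitted for brevity. It only shows the two paths: refresh a single cached entry, or reload everything when the table is not cached.

```go
package main

import (
	"errors"
	"fmt"
)

// tableSchema is a hypothetical, simplified stand-in for a cached table definition.
type tableSchema struct {
	Name    string
	Columns []string
}

// schemaSource is a hypothetical abstraction over the database.
type schemaSource interface {
	LoadTable(name string) (tableSchema, error)
	LoadAll() (map[string]tableSchema, error)
}

var errNoSuchTable = errors.New("table not found")

// schemaCache is a hypothetical cache; a real one would also need a mutex.
type schemaCache struct {
	src    schemaSource
	tables map[string]tableSchema
}

// getCurrent returns the current schema for a table, reading through to the source.
func (c *schemaCache) getCurrent(name string) (tableSchema, error) {
	if _, ok := c.tables[name]; ok {
		// Cached: refresh only this table so the entry cannot be stale.
		st, err := c.src.LoadTable(name)
		if err != nil {
			return tableSchema{}, err
		}
		c.tables[name] = st
		return st, nil
	}
	// Not cached: the table may be new or the cache uninitialized, so reload in full.
	all, err := c.src.LoadAll()
	if err != nil {
		return tableSchema{}, err
	}
	c.tables = all
	if st, ok := c.tables[name]; ok {
		return st, nil
	}
	return tableSchema{}, fmt.Errorf("%w: %s", errNoSuchTable, name)
}

// fakeDB is an in-memory source that counts full reloads.
type fakeDB struct {
	tables      map[string]tableSchema
	fullReloads int
}

func (db *fakeDB) LoadTable(name string) (tableSchema, error) {
	st, ok := db.tables[name]
	if !ok {
		return tableSchema{}, fmt.Errorf("%w: %s", errNoSuchTable, name)
	}
	return st, nil
}

func (db *fakeDB) LoadAll() (map[string]tableSchema, error) {
	db.fullReloads++
	out := make(map[string]tableSchema, len(db.tables))
	for k, v := range db.tables {
		out[k] = v
	}
	return out, nil
}

func main() {
	db := &fakeDB{tables: map[string]tableSchema{"t1": {Name: "t1", Columns: []string{"id"}}}}
	c := &schemaCache{src: db, tables: map[string]tableSchema{}}

	st, err := c.getCurrent("t1") // cold cache: triggers one full reload
	fmt.Println(st.Columns, err, db.fullReloads)

	// Simulate a schema change, then look the table up again.
	db.tables["t1"] = tableSchema{Name: "t1", Columns: []string{"id", "name"}}
	st, err = c.getCurrent("t1") // cached: single-table refresh, no full reload
	fmt.Println(st.Columns, err, db.fullReloads)
}
```

In this sketch the expensive full reload only happens on a cache miss, which matches the cost argument made in the description.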

This aims to be a more complete fix for: #9832

Related Issue(s)

Checklist

"Backport to:" labels have been added if this change should be back-ported to release branches
If this change is to be back-ported to previous releases, a justification is included in the PR description
Tests were added or are not required
Did the new or modified tests pass consistently locally and on CI?
Documentation was added or is not required

… a position This would typically mean that the historian is disabled, so we should return the "current" schema at that time. This means that we need to reload our cache before we return the current table schema. Signed-off-by: Matt Lord <mattalord@gmail.com>

vitess-bot · 2024-05-10T00:18:57Z

Fixup tests Signed-off-by: Matt Lord <mattalord@gmail.com>

codecov · 2024-05-10T14:06:12Z

Codecov Report

Attention: Patch coverage is 87.17949%, with 5 lines in your changes missing coverage. Please review.

Project coverage is 68.43%. Comparing base (0353ad4) to head (8282f55).
Report is 5 commits behind head on main.

Files	Patch %	Lines
go/vt/vttablet/tabletserver/schema/engine.go	81.48%	5 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #15912      +/-   ##
==========================================
- Coverage   68.45%   68.43%   -0.02%     
==========================================
  Files        1559     1559              
  Lines      196825   196856      +31     
==========================================
- Hits       134736   134726      -10     
- Misses      62089    62130      +41


Signed-off-by: Matt Lord <mattalord@gmail.com>

shlomi-noach

Loving this!

go/vt/vttablet/tabletserver/schema/engine.go

shlomi-noach · 2024-05-12T06:44:25Z

go/vt/vttablet/tabletserver/schema/engine.go

+// database (updating the cache entry). If the table is not found in the cache, it will
+// reload the cache from the database in case the table was created after the last schema
+// reload or the cache has not yet been initialized. This function makes the schema
+// cache a read-through cache for VReplication purposes.


This is fantastic. Asking to check: is there a legitimate scenario where the code would be asked again and again for a table that does not exist yet, and which would cause the cache to reload too much (as in creating excessive load on the server / locking / whatever)?

Let's say that you start a MoveTables workflow and mention a table that doesn't exist... the workflow will go into the error state. And IF the error is considered ephemeral, we'll retry again in 5 seconds by default. And repeat.

I'll think this through and do some testing to see if it's a potential issue in practice and if so, how to better deal with it. As a final backstop, we do have --vreplication_max_time_to_retry_on_error although that's disabled by default.

The NoSuchTable error is considered unrecoverable and thus we do not retry:

vitess/go/vt/vttablet/tabletmanager/vreplication/controller.go

Lines 262 to 277 in eb22cfb

	// If this is a MySQL error that we know needs manual intervention or
	// it's a FAILED_PRECONDITION vterror, OR we cannot identify this as
	// non-recoverable BUT it has persisted beyond the retry limit
	// (maxTimeToRetryError). In addition, we cannot restart a workflow
	// started with AtomicCopy which has _any_ error.
	if (err != nil && vr.WorkflowSubType == int32(binlogdatapb.VReplicationWorkflowSubType_AtomicCopy)) ||
		isUnrecoverableError(err) ||
		!ct.lastWorkflowError.ShouldRetry() {

		if errSetState := vr.setState(binlogdatapb.VReplicationWorkflowState_Error, err.Error()); errSetState != nil {
			log.Errorf("INTERNAL: unable to setState() in controller: %v. Could not set error text to: %v.", errSetState, err)
			return err // yes, err and not errSetState.
		}
		log.Errorf("vreplication stream %d going into error state due to %+v", ct.id, err)
		return nil // this will cause vreplicate to quit the workflow
	}

vitess/go/vt/vttablet/tabletmanager/vreplication/utils.go

Line 195 in eb22cfb

sqlerror.ERNoSuchTable,

So the potentially problematic scenario that came to my mind should not be an issue.
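The conclusion of this thread can be illustrated with a small Go sketch of an error classifier. It is not the Vitess implementation: `mysqlError`, `unrecoverableCodes` and `shouldRetry` are hypothetical names, and the set of codes is an assumption chosen for the example. The numeric values are the standard MySQL error numbers (1146 for ER_NO_SUCH_TABLE, 1213 for ER_LOCK_DEADLOCK).

```go
package main

import (
	"errors"
	"fmt"
)

// Standard MySQL error numbers used in this illustration.
const (
	erLockDeadlock = 1213 // ER_LOCK_DEADLOCK
	erNoSuchTable  = 1146 // ER_NO_SUCH_TABLE
)

// mysqlError is a hypothetical error type carrying a MySQL error number.
type mysqlError struct {
	Code int
	Msg  string
}

func (e *mysqlError) Error() string {
	return fmt.Sprintf("%s (errno %d)", e.Msg, e.Code)
}

// unrecoverableCodes is an assumed set of errors that need manual intervention.
var unrecoverableCodes = map[int]bool{
	erNoSuchTable: true,
}

// shouldRetry reports whether a failed stream should be retried.
// Unrecoverable errors stop the retry loop, so a missing table cannot
// cause repeated schema cache reloads.
func shouldRetry(err error) bool {
	if err == nil {
		return false // nothing failed, nothing to retry
	}
	var me *mysqlError
	if errors.As(err, &me) && unrecoverableCodes[me.Code] {
		return false
	}
	return true
}

func main() {
	missing := fmt.Errorf("planning stream: %w", &mysqlError{Code: erNoSuchTable, Msg: "table doesn't exist"})
	deadlock := &mysqlError{Code: erLockDeadlock, Msg: "deadlock found"}
	fmt.Println(shouldRetry(missing), shouldRetry(deadlock)) // prints "false true"
}
```

Because the missing-table error ends the retry loop on the first attempt in this sketch, the full cache reload on a miss runs at most once per failed workflow start.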

shlomi-noach · 2024-05-12T06:45:51Z

go/vt/vttablet/tabletserver/vstreamer/rowstreamer.go

-		// In the future, we will reduce this operation to reading a single table rather than the entire schema.
-		rs.se.ReloadAt(context.Background(), replication.Position{})
-		st, err = rs.se.GetTableForPos(fromTable, "")
-	}


I remember adding this after an exhausting debugging session. Very glad to see this removed!

shlomi-noach · 2024-05-12T06:48:34Z

go/vt/vttablet/tabletserver/schema/engine.go

+			return nil, err
+		}
+		if st, ok := se.tables[tableNameStr]; ok {
+			return newMinimalTable(st), nil


Signed-off-by: Matt Lord <mattalord@gmail.com>

rohit-nayak-ps

Nice work!

mattlord added Type: Bug Component: VReplication Component: Online DDL Online DDL (vitess/native/gh-ost/pt-osc) labels May 10, 2024

mattlord removed NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request labels May 10, 2024

github-actions bot added this to the v20.0.0 milestone May 10, 2024

Only reload when table is not in cache

7414217

Fixup tests Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord force-pushed the vrepl_onlineddl_schema branch from 2fa4e55 to 7414217 Compare May 10, 2024 13:34

mattlord removed the NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work label May 10, 2024

Improve func and remove FIXME

8655975

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord changed the title ~~SchemaEngine: GetTableForPos return latest schema when none found for a position~~ SchemaEngine: GetTableForPos reload schema when table not found May 10, 2024

Improve function and eliminate now unnecessary reloads

cd3d27e

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord removed the NeedsBackportReason If backport labels have been applied to a PR, a justification is required label May 10, 2024

Merge remote-tracking branch 'origin/main' into vrepl_onlineddl_schema

858c0ad

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord changed the title ~~SchemaEngine: GetTableForPos reload schema when table not found~~ SchemaEngine: Ensure GetTableForPos returns table schema for "current" position by default May 10, 2024

mattlord force-pushed the vrepl_onlineddl_schema branch from 30047ca to 26adaff Compare May 10, 2024 18:25

More test fixes

52438dd

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord force-pushed the vrepl_onlineddl_schema branch from 26adaff to 52438dd Compare May 10, 2024 19:58

shlomi-noach approved these changes May 12, 2024

View reviewed changes

mattlord added 3 commits May 13, 2024 16:04

Add dedicated unit test

194b84c

Signed-off-by: Matt Lord <mattalord@gmail.com>

Correct PK handling

efd373f

Signed-off-by: Matt Lord <mattalord@gmail.com>

Minor changes after self review

8282f55

Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord marked this pull request as ready for review May 13, 2024 22:15

mattlord requested review from harshit-gangal, systay, rohit-nayak-ps and deepthi as code owners May 13, 2024 22:15

rohit-nayak-ps approved these changes May 18, 2024

View reviewed changes

mattlord merged commit 3b09eb2 into vitessio:main May 20, 2024
93 checks passed

mattlord deleted the vrepl_onlineddl_schema branch May 20, 2024 21:14

mattlord mentioned this pull request Jan 10, 2025

Bug Report: infinite loop for "schema engine altered" #17458

Open
