-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SchemaEngine: Ensure GetTableForPos returns table schema for "current" position by default #15912
Conversation
… a position This would typically mean that the historian is disabled, so we should return the "current" schema at that time. This means that we need to reload our cache before we return the current table schema. Signed-off-by: Matt Lord <mattalord@gmail.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Fixup tests Signed-off-by: Matt Lord <mattalord@gmail.com>
2fa4e55
to
7414217
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #15912 +/- ##
==========================================
- Coverage 68.45% 68.43% -0.02%
==========================================
Files 1559 1559
Lines 196825 196856 +31
==========================================
- Hits 134736 134726 -10
- Misses 62089 62130 +41 ☔ View full report in Codecov by Sentry. |
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
30047ca
to
26adaff
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
26adaff
to
52438dd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Loving this!
// database (updating the cache entry). If the table is not found in the cache, it will | ||
// reload the cache from the database in case the table was created after the last schema | ||
// reload or the cache has not yet been initialized. This function makes the schema | ||
// cache a read-through cache for VReplication purposes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fantastic. Asking to check: is there a legitimate scenario where the code would be asked again and again for a table that does not exist yet, and which would cause the cache to reload too much (as in creating excessive load on the server / locking / whatever)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's say that you start a MoveTables workflow and mention a table that doesn't exist.... the workflow will go into the error state. And IF the error is considered ephemeral we'll retry again in 5 seconds by default. And repeat.
I'll think this through and do some testing to see if it's a potential issue in practice and if so, how to better deal with it. As a final backstop, we do have --vreplication_max_time_to_retry_on_error
although that's disabled by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The NoSuchTable
error is considered unRecoverable and thus we do not retry:
vitess/go/vt/vttablet/tabletmanager/vreplication/controller.go
Lines 262 to 277 in eb22cfb
// If this is a MySQL error that we know needs manual intervention or | |
// it's a FAILED_PRECONDITION vterror, OR we cannot identify this as | |
// non-recoverable BUT it has persisted beyond the retry limit | |
// (maxTimeToRetryError). In addition, we cannot restart a workflow | |
// started with AtomicCopy which has _any_ error. | |
if (err != nil && vr.WorkflowSubType == int32(binlogdatapb.VReplicationWorkflowSubType_AtomicCopy)) || | |
isUnrecoverableError(err) || | |
!ct.lastWorkflowError.ShouldRetry() { | |
if errSetState := vr.setState(binlogdatapb.VReplicationWorkflowState_Error, err.Error()); errSetState != nil { | |
log.Errorf("INTERNAL: unable to setState() in controller: %v. Could not set error text to: %v.", errSetState, err) | |
return err // yes, err and not errSetState. | |
} | |
log.Errorf("vreplication stream %d going into error state due to %+v", ct.id, err) | |
return nil // this will cause vreplicate to quit the workflow | |
} |
sqlerror.ERNoSuchTable, |
So the potentially problematic scenario that came to my mind should not be an issue.
// In the future, we will reduce this operation to reading a single table rather than the entire schema. | ||
rs.se.ReloadAt(context.Background(), replication.Position{}) | ||
st, err = rs.se.GetTableForPos(fromTable, "") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember adding this after an exhausting debugging session. Very glad to see this removed!
return nil, err | ||
} | ||
if st, ok := se.tables[tableNameStr]; ok { | ||
return newMinimalTable(st), nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
… schema for "current" position by default (#5218) * cherry pick of 15912 * kick CI Signed-off-by: Matt Lord <mattalord@gmail.com> * Update new workflows for vitess-private Signed-off-by: Matt Lord <mattalord@gmail.com> * Adjust Static Checks workflow's skip check for private Signed-off-by: Matt Lord <mattalord@gmail.com> --------- Signed-off-by: Matt Lord <mattalord@gmail.com> Co-authored-by: Matt Lord <mattalord@gmail.com>
Description
In
SchemaEngine
'sGetTableForPos
function, we would not find a table schema for a given GTID position when the historian is disabled and we would then return the "current" table schema from the cache. The problem there is that the cache can be quite stale (up to 30 minutes by default, see--queryserver-config-schema-reload-time
). Similarly, if you want to get the current schema for a table that was recently created (within up to the last 30 minutes by default) thenGetTableForPos
would return atable not found
error as it wasn't yet in the cache. So if you're e.g. doing a fair amount of schema changes/migrations then this function's results can be out of date and block subsequent schema changes/migrations.This PR changes
GetTableForPos
so that it ensures we get the table schema for the "current" position. If we do have the table in our cache, then we refresh the schema for that table from the database to be sure that it's up to date (and we update the entry in the cache). If we do not have the table in our cache then we refresh the schema cache from the database in full. This then also serves as a way to initialize the cache if it hasn't yet been done (recently startedvttablet
process).Refreshing the schema cache is an expensive operation, which is why
--queryserver-config-schema-reload-time
defaults to 30m. But this change only affects VReplication as the only users ofGetTableForPos()
are vstreamers. And we only reload the schema cache there if the requested table is not already in the cache and it should be rare that you're trying to replicate a table that doesn't exist (we get the table schema during the planning step). This effectively turns the table schema cache — for VReplication — into a read-through cache.This aims to be a more complete fix for: #9832
Related Issue(s)
Checklist