Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SchemaEngine: Ensure GetTableForPos returns table schema for "current" position by default #15912

Merged
merged 9 commits into from
May 20, 2024

Conversation

mattlord
Copy link
Contributor

@mattlord mattlord commented May 10, 2024

Description

In SchemaEngine's GetTableForPos function, we would not find a table schema for a given GTID position when the historian is disabled and we would then return the "current" table schema from the cache. The problem there is that the cache can be quite stale (up to 30 minutes by default, see --queryserver-config-schema-reload-time ). Similarly, if you want to get the current schema for a table that was recently created (within up to the last 30 minutes by default) then GetTableForPos would return a table not found error as it wasn't yet in the cache. So if you're e.g. doing a fair amount of schema changes/migrations then this function's results can be out of date and block subsequent schema changes/migrations.

This PR changes GetTableForPos so that it ensures we get the table schema for the "current" position. If we do have the table in our cache, then we refresh the schema for that table from the database to be sure that it's up to date (and we update the entry in the cache). If we do not have the table in our cache then we refresh the schema cache from the database in full. This then also serves as a way to initialize the cache if it hasn't yet been done (recently started vttablet process).

Refreshing the schema cache is an expensive operation, which is why --queryserver-config-schema-reload-time defaults to 30m. But this change only affects VReplication as the only users of GetTableForPos() are vstreamers. And we only reload the schema cache there if the requested table is not already in the cache and it should be rare that you're trying to replicate a table that doesn't exist (we get the table schema during the planning step). This effectively turns the table schema cache — for VReplication — into a read-through cache.

This aims to be a more complete fix for: #9832

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

… a position

This would typically mean that the historian is disabled, so we should return
the "current" schema at that time. This means that we need to reload
our cache before we return the current table schema.

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord added Type: Bug Component: VReplication Component: Online DDL Online DDL (vitess/native/gh-ost/pt-osc) labels May 10, 2024
Copy link
Contributor

vitess-bot bot commented May 10, 2024

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels May 10, 2024
@mattlord mattlord removed NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request labels May 10, 2024
@github-actions github-actions bot added this to the v20.0.0 milestone May 10, 2024
Fixup tests

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord removed the NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work label May 10, 2024
Copy link

codecov bot commented May 10, 2024

Codecov Report

Attention: Patch coverage is 87.17949% with 5 lines in your changes are missing coverage. Please review.

Project coverage is 68.43%. Comparing base (0353ad4) to head (8282f55).
Report is 5 commits behind head on main.

Files Patch % Lines
go/vt/vttablet/tabletserver/schema/engine.go 81.48% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15912      +/-   ##
==========================================
- Coverage   68.45%   68.43%   -0.02%     
==========================================
  Files        1559     1559              
  Lines      196825   196856      +31     
==========================================
- Hits       134736   134726      -10     
- Misses      62089    62130      +41     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord changed the title SchemaEngine: GetTableForPos return latest schema when none found for a position SchemaEngine: GetTableForPos reload schema when table not found May 10, 2024
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord removed the NeedsBackportReason If backport labels have been applied to a PR, a justification is required label May 10, 2024
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord changed the title SchemaEngine: GetTableForPos reload schema when table not found SchemaEngine: Ensure GetTableForPos returns table schema for "current" position by default May 10, 2024
Signed-off-by: Matt Lord <mattalord@gmail.com>
Copy link
Contributor

@shlomi-noach shlomi-noach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loving this!

go/vt/vttablet/tabletserver/schema/engine.go Outdated Show resolved Hide resolved
// database (updating the cache entry). If the table is not found in the cache, it will
// reload the cache from the database in case the table was created after the last schema
// reload or the cache has not yet been initialized. This function makes the schema
// cache a read-through cache for VReplication purposes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic. Asking to check: is there a legitimate scenario where the code would be asked again and again for a table that does not exist yet, and which would cause the cache to reload too much (as in creating excessive load on the server / locking / whatever)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say that you start a MoveTables workflow and mention a table that doesn't exist.... the workflow will go into the error state. And IF the error is considered ephemeral we'll retry again in 5 seconds by default. And repeat.

I'll think this through and do some testing to see if it's a potential issue in practice and if so, how to better deal with it. As a final backstop, we do have --vreplication_max_time_to_retry_on_error although that's disabled by default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NoSuchTable error is considered unRecoverable and thus we do not retry:

// If this is a MySQL error that we know needs manual intervention or
// it's a FAILED_PRECONDITION vterror, OR we cannot identify this as
// non-recoverable BUT it has persisted beyond the retry limit
// (maxTimeToRetryError). In addition, we cannot restart a workflow
// started with AtomicCopy which has _any_ error.
if (err != nil && vr.WorkflowSubType == int32(binlogdatapb.VReplicationWorkflowSubType_AtomicCopy)) ||
isUnrecoverableError(err) ||
!ct.lastWorkflowError.ShouldRetry() {
if errSetState := vr.setState(binlogdatapb.VReplicationWorkflowState_Error, err.Error()); errSetState != nil {
log.Errorf("INTERNAL: unable to setState() in controller: %v. Could not set error text to: %v.", errSetState, err)
return err // yes, err and not errSetState.
}
log.Errorf("vreplication stream %d going into error state due to %+v", ct.id, err)
return nil // this will cause vreplicate to quit the workflow
}

So the potentially problematic scenario that came to my mind should not be an issue.

// In the future, we will reduce this operation to reading a single table rather than the entire schema.
rs.se.ReloadAt(context.Background(), replication.Position{})
st, err = rs.se.GetTableForPos(fromTable, "")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember adding this after an exhausting debugging session. Very glad to see this removed!

return nil, err
}
if st, ok := se.tables[tableNameStr]; ok {
return newMinimalTable(st), nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord marked this pull request as ready for review May 13, 2024 22:15
Copy link
Contributor

@rohit-nayak-ps rohit-nayak-ps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

@mattlord mattlord merged commit 3b09eb2 into vitessio:main May 20, 2024
93 checks passed
@mattlord mattlord deleted the vrepl_onlineddl_schema branch May 20, 2024 21:14
systay pushed a commit that referenced this pull request Jul 22, 2024
… schema for "current" position by default (#5218)

* cherry pick of 15912

* kick CI

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Update new workflows for vitess-private

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Adjust Static Checks workflow's skip check for private

Signed-off-by: Matt Lord <mattalord@gmail.com>

---------

Signed-off-by: Matt Lord <mattalord@gmail.com>
Co-authored-by: Matt Lord <mattalord@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants