Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vttablet/tabletmanager: add isInSRVKeyspace/isShardServing #7929

Merged
merged 2 commits into from
Apr 28, 2021

Conversation

guidoiaquinti
Copy link
Member

Signed-off-by: Guido Iaquinti giaquinti@slack-corp.com

Description

Add a new metric in vttablet to keep track if the keyspace partition it belongs to is serving or not. This metric can be useful for external systems to automatically enable/disable alerts.

In order to don't add additional watches to the topology service (as we don't want those to linearly scale with the number of vttablets), we refresh the serving keyspace view in TabletManager only when RefreshFromTopoInfo is called.

Note: this PR is still a WIP but I'll appreciate an early feedback.

Related Issue(s)

N/A

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

N/A

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

Signed-off-by: Guido Iaquinti <giaquinti@slack-corp.com>
@@ -115,6 +118,7 @@ func init() {
statsTabletType = stats.NewString("TabletType")
statsTabletTypeCount = stats.NewCountersWithSingleLabel("TabletTypeCount", "Number of times the tablet changed to the labeled type", "type")
statsBackupIsRunning = stats.NewGaugesWithMultiLabels("BackupIsRunning", "Whether a backup is running", []string{"mode"})
statsIsInSRVKeyspace = stats.NewGauge("IsInSRVKeyspace", "Whether the vttablet is in the serving keyspace (1 = true / 0 = false)")
Copy link
Member

@deepthi deepthi Apr 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be IsInSrvKeyspace otherwise the prometheus metric might turn out to be vttablet_is_in_s_r_v_keyspace

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be IsInSrvKeyspace otherwise the prometheus metric might turn out to be vttablet_is_in_s_r_v_keyspace

Actually I think it is rendered correctly:

# HELP vttablet_is_in_srv_keyspace Whether the vttablet is in the serving keyspace (1 = true / 0 = false)
# TYPE vttablet_is_in_srv_keyspace gauge
vttablet_is_in_srv_keyspace 0

Copy link
Member

@deepthi deepthi Apr 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL :)
It will still be nice to rename it to be consistent with how we use SrvKeyspace everywhere else in the codebase.

@guidoiaquinti
Copy link
Member Author

guidoiaquinti commented Apr 23, 2021

I found out that with this implementation the topology info is not propagated to the tablet manager state after operations like SwitchReads/SwitchWrites are completed. I need to take a better look at this.

Signed-off-by: Guido Iaquinti <giaquinti@slack-corp.com>
@guidoiaquinti
Copy link
Member Author

guidoiaquinti commented Apr 27, 2021

I found out that with this implementation the topology info is not propagated to the tablet manager state after operations like SwitchReads/SwitchWrites are completed. I need to take a better look at this.

I think I found the underlying issue: in few places of the codebase we call MigrateServedType that "removes/adds shards from srvKeyspace when migrating a served type" but then we never refresh the tablet manager state.

I think the right approach here to don't end up with stale info in the tablet manager state is to make sure we call RefreshTabletsByShard every time the service keyspace view changes (despite this will add additional topo lookup).

Copy link
Member

@rafael rafael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guido this looks great! I added a small comment, but I think that could be part of a different PR.

Going to merge this one as is.

@@ -154,6 +155,158 @@ func TestStateTabletControls(t *testing.T) {
assert.False(t, qsc.IsServing())
}

func TestStateIsShardServingisInSrvKeyspace(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to be super throughout with the testing, we should also assert that the gauge ends up with the expected values when we call func (ts *tmState) Open(). That's one more place where updateLocked will be called and we expect the gauge to be updated.

Adding this test, will make sure that we will detect a regression if the Open method is refactored and no longer calls updateLocked.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense, let me add that. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants