Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VReplication: add table for logging stream errors, state changes and key workflow steps #7831

Merged
merged 9 commits into from
Apr 27, 2021

Conversation

rohit-nayak-ps
Copy link
Contributor

@rohit-nayak-ps rohit-nayak-ps commented Apr 12, 2021

Description

This PR adds a new table to the _vt sidecar database. The _vt.vreplication_log table will record certain types of events. To start with we record

  • stream creation along with stream parameters
  • stream errors
  • state changes
  • key workflow steps (currently start and end of the copy phase) (Todo)
CREATE TABLE IF NOT EXISTS _vt.vreplication_log (
	id BIGINT(20) AUTO_INCREMENT,
	vrepl_id INT NOT NULL,
	type VARBINARY(256) NOT NULL,
	state VARBINARY(100) NOT NULL,
	created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
	updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
	message text NOT NULL,
	count BIGINT(20) NOT NULL DEFAULT 1,
	PRIMARY KEY (id))

The value of the message will usually be a string. In the case of stream inserts it records all the parameters including the binlogsource field.

Before inserting a new row we check if the latest message matches the state and message of the new message and if it does then it increments the count. This prevents any repeated errors from spamming the table.

Tests

When I tried adding the queries related to the vreplication_log to expect them during other vreplication tests it lead to a huge number of updates to existing tests and also created race conditions and other flakiness. For now we ignore these queries
during the workflow and test the log insertion code separately.

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Impacted Areas in Vitess

Components that this PR will affect:

  • VReplication
  • VTAdmin

@rohit-nayak-ps rohit-nayak-ps changed the title Do Not Merge: VReplication: add log table VReplication: add table for logging stream errors, state changes and key workflow steps Apr 12, 2021
@rohit-nayak-ps
Copy link
Contributor Author

rohit-nayak-ps commented Apr 12, 2021

Sample output from local example

mysql> select * from _vt.vreplication_log where vrepl_id = 1 order by id asc \G
*************************** 1. row ***************************
        id: 1
  vrepl_id: 1
      type: Stream Created
     state: Stopped
created_at: 2021-04-13 21:26:36
updated_at: 2021-04-13 21:26:36
   message: {"cell":"","db_name":"vt_customer","id":"1","max_replication_lag":"9223372036854775807","max_tps":"9223372036854775807","pos":"","source":"keyspace:\"commerce\" shard:\"0\" filter:\u003crules:\u003cmatch:\"customer\" filter:\"select * from customer\" \u003e rules:\u003cmatch:\"corder\" filter:\"select * from corder\" \u003e \u003e ","state":"Stopped","tablet_types":"","time_updated":"1618341996","transaction_timestamp":"0","workflow":"commerce2customer"}
     count: 1
*************************** 2. row ***************************
        id: 2
  vrepl_id: 1
      type: State Changed
     state: Running
created_at: 2021-04-13 21:26:36
updated_at: 2021-04-13 21:26:36
   message: 
     count: 1
*************************** 3. row ***************************
        id: 3
  vrepl_id: 1
      type: State Changed
     state: Copying
created_at: 2021-04-13 21:26:36
updated_at: 2021-04-13 21:26:36
   message: 
     count: 1
*************************** 4. row ***************************
        id: 4
  vrepl_id: 1
      type: Started Copy Phase
     state: Copying
created_at: 2021-04-13 21:26:36
updated_at: 2021-04-13 21:26:36
   message: Copy phase started for 2 table(s)
     count: 1
*************************** 5. row ***************************
        id: 5
  vrepl_id: 1
      type: Ended Copy Phase
     state: Copying
created_at: 2021-04-13 21:26:37
updated_at: 2021-04-13 21:26:37
   message: Copy phase completed at gtid fdbc7d6b-9c8d-11eb-b544-04ed332e05c2:1-43
     count: 1
*************************** 6. row ***************************
        id: 6
  vrepl_id: 1
      type: State Changed
     state: Running
created_at: 2021-04-13 21:26:37
updated_at: 2021-04-13 21:26:37
   message: 
     count: 1
*************************** 7. row ***************************
        id: 7
  vrepl_id: 1
      type: State Changed
     state: Stopped
created_at: 2021-04-13 21:26:39
updated_at: 2021-04-13 21:26:39
   message: 
     count: 2
*************************** 8. row ***************************
        id: 8
  vrepl_id: 1
      type: Stream Deleted
     state: 
created_at: 2021-04-13 21:26:39
updated_at: 2021-04-13 21:26:39
   message: 
     count: 1
8 rows in set (0.00 sec)

@rohit-nayak-ps rohit-nayak-ps force-pushed the rn-vr-log2 branch 3 times, most recently from f41805b to d865b2f Compare April 15, 2021 08:33
@rohit-nayak-ps rohit-nayak-ps requested review from rafael, ajm188 and a team April 15, 2021 09:20
@rohit-nayak-ps rohit-nayak-ps marked this pull request as ready for review April 15, 2021 09:21
Copy link
Contributor

@ajm188 ajm188 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks amazing!! I'm very excited to try this out ❤️

// LogStateChange is used when the state of the stream changes
LogStateChange = "State Changed"

// TODO: LogError is not used atm. Currently irrecoverable errors, resumable errors and informational messages
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I'm not sure if you have any ideas about to do this more advanced error detection/categorization, but I'm happy to brainstorm with you if you want!

)

const (
createVReplicationLog = `CREATE TABLE IF NOT EXISTS _vt.vreplication_log (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our experiments, we were seeing the _vt.vreplication_log table created on the tablet but inaccessible from VReplicationExec:

vitess@abe8006adf7f:/vt/local$ vtctlclient VreplicationExec zone1-0000000400 "select * from _vt.vreplication_log"
VreplicationExec Error: rpc error: code = Unknown desc = TabletManager.VReplicationExec on zone1-0000000400 error: invalid table name: _vt.vreplication_log: invalid table name: _vt.vreplication_log
E0427 15:05:50.761692    9477 main.go:72] remote error: rpc error: code = Unknown desc = TabletManager.VReplicationExec on zone1-0000000400 error: invalid table name: _vt.vreplication_log: invalid table name: _vt.vreplication_log

@ajm188 found that the switch statements in controller_plan.go return "invalid table name" error as they've not been updated to include the _vt.vreplication_log table. We were able to patch it like this (and, we only updated one of the four switch statements for brevity):

diff --git a/go/vt/vttablet/tabletmanager/vreplication/controller_plan.go b/go/vt/vttablet/tabletmanager/vreplication/controller_plan.go
index ed42cf1839..c7ac3d45d4 100644
--- a/go/vt/vttablet/tabletmanager/vreplication/controller_plan.go
+++ b/go/vt/vttablet/tabletmanager/vreplication/controller_plan.go
@@ -133,6 +133,7 @@ func buildUpdatePlan(upd *sqlparser.Update) (*controllerPlan, error) {
                }, nil
        case vreplicationTableName:
                // no-op
+
        default:
                return nil, fmt.Errorf("invalid table name: %v", sqlparser.String(upd.TableExprs))
        }
@@ -222,7 +223,7 @@ func buildDeletePlan(del *sqlparser.Delete) (*controllerPlan, error) {
 
 func buildSelectPlan(sel *sqlparser.Select) (*controllerPlan, error) {
        switch sqlparser.String(sel.From) {
-       case vreplicationTableName, reshardingJournalTableName, copyStateTableName:
+       case vreplicationTableName, reshardingJournalTableName, copyStateTableName, vreplicationLogTableName:
                return &controllerPlan{
                        opcode: selectQuery,
                }, nil
diff --git a/go/vt/vttablet/tabletmanager/vreplication/utils.go b/go/vt/vttablet/tabletmanager/vreplication/utils.go
index dd89ea3bd1..9f542b8d15 100644
--- a/go/vt/vttablet/tabletmanager/vreplication/utils.go
+++ b/go/vt/vttablet/tabletmanager/vreplication/utils.go
@@ -26,7 +26,8 @@ import (
 )
 
 const (
-       createVReplicationLog = `CREATE TABLE IF NOT EXISTS _vt.vreplication_log (
+       vreplicationLogTableName = "_vt.vreplication_log"
+       createVReplicationLog    = `CREATE TABLE IF NOT EXISTS _vt.vreplication_log (
                id BIGINT(20) AUTO_INCREMENT,
                vrepl_id INT NOT NULL,
                type VARBINARY(256) NOT NULL,

... and then see the rows via VReplicationExec:

vitess@61befc15be72:/vt/local$ vtctlclient VreplicationExec zone1-0000000400 "select count(*) from _vt.vreplication_log"
+----------+
| count(*) |
+----------+
|        6 |
+----------+

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, hadn't thought of VReplicationExec for this table. I think we should add it only for Select and possibly Delete, right? Insert/Update shouldn't be allowed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, even the delete is hardcoded in multiple places for the vreplication table and will require a little more time and tests :( I am going to just make the same changes you made so it works with Select for now. Delete can be implemented in a follow-up PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updating Select for now sounds great!!! 🎉 This will be amazing to have regardless!

// Enum values for type column of _vt.vreplication_log

// LogStreamCreate is used when a row in _vt.vreplication is inserted via VReplicationExec
LogStreamCreate = "Stream Created"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible/reasonable/idiomatic to make a proto enum for this? That way we also get a TypeScript enum generated for us. (Otherwise, I'll have to duplicate these strings on the front-end at some point.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had punted on the enum: I think we discussed that this would be implemented as part of the vtadmin API around these logs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revisiting at VTAdmin time makes sense to me!

…icationExec and setMessage/State

Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Copy link
Contributor

@doeg doeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you again!!! 🎉 🎉 🎉 Can't wait to show you the end result in the UI!

@rohit-nayak-ps rohit-nayak-ps merged commit 01fb7e5 into vitessio:master Apr 27, 2021
@rohit-nayak-ps rohit-nayak-ps deleted the rn-vr-log2 branch April 27, 2021 21:38
@systay systay added Component: VReplication Type: Enhancement Logical improvement (somewhere between a bug and feature) labels Apr 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: VReplication Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants