Summary:
We observed the following scenario, which left one of the child tablet peers with high follower lag and never receiving updates:
1. Parent tablet peers 1-3 accepted (but did not apply) a SPLIT_OP (op_id: 1.4).
2. Parent tablet peers 1-3 accepted (but did not apply) a CHANGE_CONFIG_OP (op_id: 1.5) to add a fourth peer.
3. Parent tablet peers 1-3 started applying the SPLIT_OP (1.4) using the committed Raft config with 3 peers.
4. RBS for parent tablet peer 4 started and tablet metadata (`tablet_data_state == TABLET_DATA_READY`) is downloaded.
5. Parent tablet peers 1-3 completed applying the SPLIT_OP (1.4).
6. Parent tablet peers 1-3 applied CHANGE_CONFIG_OP (1.5) and now have committed Raft config with 4 peers.
7. RBS for parent tablet peer 4 downloaded WAL files which have CHANGE_CONFIG_OP (1.5) as committed.
8. Parent tablet peer 4 performed local bootstrap and replayed SPLIT_OP (1.4) as part of it. Because `tablet_data_state` was `TABLET_DATA_READY` rather than `TABLET_DATA_SPLIT_COMPLETED`, the replay applied the SPLIT_OP. During that apply it used the last known committed Raft config (with 4 peers) instead of the config that was committed before the SPLIT_OP.
After that, the fourth tablet peer is not part of the tablet Raft group and therefore does not receive consensus updates from the leader.
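The divergence in the steps above can be illustrated with a toy model. This is only a sketch, not the actual tablet server code: `ParentPeer`, `apply_split` and `simulate` are hypothetical names, and the only behavior modeled is that a child tablet takes whatever Raft config the parent peer considers committed at the moment the SPLIT_OP is applied.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Hypothetical peer names standing in for parent tablet peers 1-4.
THREE_PEERS = frozenset({"peer1", "peer2", "peer3"})
FOUR_PEERS = THREE_PEERS | {"peer4"}


@dataclass
class ParentPeer:
    name: str
    committed_config: FrozenSet[str]  # last committed Raft config known to this peer
    child_config: Optional[FrozenSet[str]] = None

    def apply_split(self) -> None:
        # Assumption of this sketch: the child inherits the config that is
        # committed on this peer at apply time.
        self.child_config = self.committed_config


def simulate():
    # Steps 3-5: peers 1-3 apply SPLIT_OP (1.4) with the 3-peer config.
    peers = [ParentPeer(name, THREE_PEERS) for name in sorted(THREE_PEERS)]
    for peer in peers:
        peer.apply_split()
    # Step 6: peers 1-3 apply CHANGE_CONFIG_OP (1.5).
    for peer in peers:
        peer.committed_config = FOUR_PEERS
    # Steps 7-8: peer 4 bootstraps from a WAL in which 1.5 is already
    # committed, then replays SPLIT_OP (1.4) with the 4-peer config.
    peer4 = ParentPeer("peer4", FOUR_PEERS)
    peer4.apply_split()
    return peers, peer4


peers, peer4 = simulate()
# Children on peers 1-3 do not list peer4; the child on peer4 does.
print(peers[0].child_config == peer4.child_config)
```

In this model the child configs on peers 1-3 and on peer 4 differ, which matches the mismatch described above.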
The fix disallows Raft config membership changes (while still allowing leadership/role changes) on the parent tablet LEADER once it has received a SPLIT_OP and has not yet applied it. This ensures that all peers of the split child tablets have the same Raft config.
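The guard described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in this change: `ChangeType`, `MEMBERSHIP_CHANGES` and `can_start_config_change` are hypothetical names, and the split state is reduced to two booleans.

```python
from enum import Enum


class ChangeType(Enum):
    # Hypothetical change kinds used only for this sketch.
    ADD_PEER = 1
    REMOVE_PEER = 2
    CHANGE_ROLE = 3


# Changes that alter which peers belong to the Raft group.
MEMBERSHIP_CHANGES = {ChangeType.ADD_PEER, ChangeType.REMOVE_PEER}


def can_start_config_change(change_type: ChangeType,
                            split_op_received: bool,
                            split_op_applied: bool) -> bool:
    """Return False for membership changes while a SPLIT_OP is pending on the leader."""
    split_pending = split_op_received and not split_op_applied
    if split_pending and change_type in MEMBERSHIP_CHANGES:
        return False
    # Role changes, and any change when no split is pending, are still allowed.
    return True


# A leader that has received but not yet applied a SPLIT_OP:
print(can_start_config_change(ChangeType.ADD_PEER, True, False))
print(can_start_config_change(ChangeType.CHANGE_ROLE, True, False))
```

With such a check in place, the CHANGE_CONFIG_OP from step 2 would be rejected until the SPLIT_OP has been applied, so every peer applies the split with the same committed config.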
A similar issue could potentially occur when doing RBS from a follower. Created https://github.com/yugabyte/yugabyte-db/issues/27056 to investigate and fix it if it turns out to be a real issue.
Jira: DB-16021
Original commit: 771385dc6768a131c52df896e86de6baf46d4350 / D43526
Test Plan: TabletSplitITest.SplitWithParentTabletMove
Reviewers: asrivastava, bkolagani
Reviewed By: bkolagani
Subscribers: ybase
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D43988