Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DocDB] Support tracing UpdateConsensus calls #19417

Closed
1 task done
Tracked by #19626
amitanandaiyer opened this issue Oct 5, 2023 · 1 comment
Closed
1 task done
Tracked by #19626

[DocDB] Support tracing UpdateConsensus calls #19417

amitanandaiyer opened this issue Oct 5, 2023 · 1 comment
Assignees

Comments

@amitanandaiyer
Copy link
Contributor

amitanandaiyer commented Oct 5, 2023

Jira Link: DB-8207

Description

Support tracing UpdateConsensus calls

Issue Type

kind/enhancement

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@amitanandaiyer amitanandaiyer added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Oct 5, 2023
@amitanandaiyer amitanandaiyer self-assigned this Oct 5, 2023
@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue and removed status/awaiting-triage Issue awaiting triage labels Oct 5, 2023
amitanandaiyer added a commit that referenced this issue Oct 14, 2023
Summary:
1) Support tracing UpdateConsensus api
2) define collect_update_consensus_traces This flag needs to be enabled to collect the traces for UpdateConsensus from the remote follower.
3) Add a trace msg when appended to the local log

The trace output looks like

```
[ts-1] I1005 11:29:53.233863 1840902144 async_rpc.cc:192] Write(tablet: 4904acc47aac4f89a828e23987afef25, num_ops: 1, num_attempts: 1, txn: 36b494ed-2d70-425e-97de-037b3b69ccb6, subtxn: [none]) took 14174us. Trace:
[ts-1] 1005 11:29:53.219773 (+     0us) async_rpc.cc:550] WriteRpc initiated
[ts-1] 1005 11:29:53.219786 (+    13us) async_rpc.cc:202] SendRpc() called.
[ts-1] 1005 11:29:53.219787 (+     1us) tablet_rpc.cc:113] SelectTabletServer()
[ts-1] 1005 11:29:53.219794 (+     7us) async_rpc.cc:626] SendRpcToTserver
[ts-1] 1005 11:29:53.219837 (+    43us) async_rpc.cc:631] RpcDispatched Asynchronously
[ts-1] 1005 11:29:53.221422 (+  1585us) tablet_rpc.cc:320] Done(OK)
[ts-1] 1005 11:29:53.221430 (+     8us) async_rpc.cc:456] ProcessResponseFromTserver(OK)
[ts-1] 1005 11:29:53.221431 (+     1us) async_rpc.cc:458] Received from server:
[ts-1]  BEGIN
[ts-1] 1005 11:29:53.219815 (+     0us) inbound_call.cc:76] Created InboundCall
[ts-1] 1005 11:29:53.219828 (+    13us) service_pool.cc:182] Inserting onto call queue
[ts-1] 1005 11:29:53.219855 (+    27us) service_pool.cc:260] Handling call Write
[ts-1] 1005 11:29:53.219868 (+    13us) tablet_service.cc:1958] Start Write
[ts-1] 1005 11:29:53.219909 (+    41us) tablet.cc:1883] AcquireLocksAndPerformDocOperations
[ts-1] 1005 11:29:53.219910 (+     1us) tablet.cc:3405] Acquiring write permit
[ts-1] 1005 11:29:53.219915 (+     5us) tablet.cc:3406] Acquiring write permit done
[ts-1] 1005 11:29:53.220031 (+   116us) shared_lock_manager.cc:321] Locking a batch of 2 keys
[ts-1] 1005 11:29:53.220042 (+    11us) shared_lock_manager.cc:337] Acquired a lock batch of 2 keys
[ts-1] 1005 11:29:53.220053 (+    11us) conflict_resolution.cc:1305] ResolveTransactionConflicts
[ts-1] 1005 11:29:53.220149 (+    96us) conflict_resolution.cc:337] No conflicts.
[ts-1] 1005 11:29:53.220247 (+    98us) write_query.cc:599] TransactionalConflictsResolved
[ts-1] 1005 11:29:53.220252 (+     5us) conflict_resolution.cc:1331] resolver->Resolve done
[ts-1] ..  Related trace:
[ts-1] ..  1005 11:29:53.220235 (+     0us) operation_driver.cc:175] ExecuteAsync
[ts-1] ..  1005 11:29:53.220269 (+    34us) operation_driver.cc:235] Begin PrepareAndStart
[ts-1] ..  1005 11:29:53.220413 (+   144us) log_cache.cc:273] Appended up to 2 the local log
[ts-1] ..  1005 11:29:53.220453 (+    40us) raft_consensus.cc:1396] Marking majority replicated up to 1.1
[ts-1] ..  1005 11:29:53.221135 (+   682us) consensus_peers.cc:511] Received response from 231be5482e9344b09245b3c6f1ff77e8:
[ts-1] BEGIN UpdateConsensus
[ts-1] 1005 11:29:53.220607 (+     0us) inbound_call.cc:76] Created InboundCall
[ts-1] 1005 11:29:53.220623 (+    16us) service_pool.cc:182] Inserting onto call queue
[ts-1] 1005 11:29:53.220647 (+    24us) service_pool.cc:260] Handling call UpdateConsensus
[ts-1] 1005 11:29:53.220719 (+    72us) raft_consensus.cc:1942] Updating replica for 1 ops
[ts-1] 1005 11:29:53.220736 (+    17us) raft_consensus.cc:2053] Early marking committed up to 0.0
[ts-1] 1005 11:29:53.220737 (+     1us) raft_consensus.cc:2061] Triggering prepare for 1 ops
[ts-1] 1005 11:29:53.220830 (+    93us) raft_consensus.cc:2233] Updating last received op as 1.2
[ts-1] 1005 11:29:53.220833 (+     3us) raft_consensus.cc:2246] Marking committed up to 1.1
[ts-1] 1005 11:29:53.220834 (+     1us) raft_consensus.cc:2251] Filling consensus response to leader.
[ts-1] 1005 11:29:53.220835 (+     1us) raft_consensus.cc:2033] UpdateReplica() finished
[ts-1] 1005 11:29:53.220837 (+     2us) raft_consensus.cc:2169] Waiting on the replicates to finish logging
[ts-1] 1005 11:29:53.220899 (+    62us) log_cache.cc:273] Appended up to 2 the local log
[ts-1] 1005 11:29:53.220909 (+    10us) raft_consensus.cc:2194] Finished waiting on the replicates to finish logging
[ts-1] ..  Related trace:
[ts-1] ..  1005 11:29:53.220774 (+     0us) operation_driver.cc:175] ExecuteAsync
[ts-1] ..  1005 11:29:53.220776 (+     2us) operation_driver.cc:235] Begin PrepareAndStart
[ts-1]
[ts-1] END UpdateConsensus
[ts-1] ..  1005 11:29:53.221184 (+    49us) raft_consensus.cc:1396] Marking majority replicated up to 1.2
[ts-1] ..  1005 11:29:53.221187 (+     3us) replica_state.cc:883] Begin ApplyPendingOperationsUnlocked
[ts-1] ..  1005 11:29:53.221190 (+     3us) operation_driver.cc:318] Begin ReplicationFinished() call.
[ts-1] ..  1005 11:29:53.221192 (+     2us) write_operation.cc:77] APPLY: Starting
[ts-1] ..  1005 11:29:53.221296 (+   104us) write_operation.cc:95] FINISH: making edits visible
[ts-1] ..  1005 11:29:53.221307 (+    11us) shared_lock_manager.cc:361] Unlocking a batch of 2 keys
[ts-1] ..  1005 11:29:53.221316 (+     9us) tablet_service.cc:366] Write completing with status OK
[ts-1] ..  ..  Related trace:
[ts-1] ..  ..  1005 11:29:53.220409 (+     0us) outbound_call.cc:232] yb.consensus.ConsensusService.UpdateConsensus.
[ts-1] ..  ..  1005 11:29:53.220444 (+    35us) reactor.cc:938] Scheduled.
[ts-1] ..  ..  1005 11:29:53.220478 (+    34us) outbound_call.cc:573] Queued.
[ts-1] ..  ..  1005 11:29:53.220501 (+    23us) outbound_call.cc:583] Call Sent.
[ts-1] ..  ..  1005 11:29:53.221111 (+   610us) outbound_call.cc:525] Response received.
[ts-1] ..  ..  1005 11:29:53.221124 (+    13us) outbound_call.cc:429] Callback will be called asynchronously.
[ts-1] ..  ..  Related trace:
[ts-1] ..  ..  1005 11:29:53.221153 (+     0us) outbound_call.cc:232] yb.consensus.ConsensusService.UpdateConsensus.
[ts-1] ..  ..  1005 11:29:53.221190 (+    37us) reactor.cc:938] Scheduled.
[ts-1] ..  ..  1005 11:29:53.221214 (+    24us) outbound_call.cc:573] Queued.
[ts-1] ..  ..  1005 11:29:53.221231 (+    17us) outbound_call.cc:583] Call Sent.
[ts-1]  END.
[ts-1] ..  Related trace:
[ts-1] ..  1005 11:29:53.219800 (+     0us) outbound_call.cc:232] yb.tserver.TabletServerService.Write.
[ts-1] ..  1005 11:29:53.219809 (+     9us) local_call.cc:42] LocalOutboundCall
[ts-1] ..  1005 11:29:53.219811 (+     2us) outbound_call.cc:573] Queued.
[ts-1] ..  1005 11:29:53.219812 (+     1us) outbound_call.cc:583] Call Sent.
[ts-1] ..  1005 11:29:53.221417 (+  1605us) outbound_call.cc:429] Callback will be called asynchronously.
```

Upgrade/Rollback safety:
    Should be ok to upgrade/rollback. Only new/optional fields are added in this change. Protobuf's default handling of ignoring the fields (at an older server) works fine. The trace buffer will be returned only if both leader and the follower have this change. Otherwise, the `request.trace_requested` will be ignored.
Jira: DB-8207

Test Plan:
```
export YB_EXTRA_TSERVER_FLAGS="--collect_end_to_end_traces=true --ybclient_print_trace_every_n=1 --collect_update_consensus_traces=true"
ybd --cxx-test cassandra_cpp_driver-test --gtest_filter CppCassandraDriverTest.TransactionalWrite

```

Reviewers: rthallam, esheng, arybochkin

Reviewed By: arybochkin

Subscribers: ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D29092
amitanandaiyer added a commit that referenced this issue Oct 18, 2023
Summary:
Original commit: 56f165c / D29092
1) Support tracing UpdateConsensus api
2) define collect_update_consensus_traces This flag needs to be enabled to collect the traces for UpdateConsensus from the remote follower.
3) Add a trace msg when appended to the local log

The trace output looks like

```
[ts-1] I1005 11:29:53.233863 1840902144 async_rpc.cc:192] Write(tablet: 4904acc47aac4f89a828e23987afef25, num_ops: 1, num_attempts: 1, txn: 36b494ed-2d70-425e-97de-037b3b69ccb6, subtxn: [none]) took 14174us. Trace:
[ts-1] 1005 11:29:53.219773 (+     0us) async_rpc.cc:550] WriteRpc initiated
[ts-1] 1005 11:29:53.219786 (+    13us) async_rpc.cc:202] SendRpc() called.
[ts-1] 1005 11:29:53.219787 (+     1us) tablet_rpc.cc:113] SelectTabletServer()
[ts-1] 1005 11:29:53.219794 (+     7us) async_rpc.cc:626] SendRpcToTserver
[ts-1] 1005 11:29:53.219837 (+    43us) async_rpc.cc:631] RpcDispatched Asynchronously
[ts-1] 1005 11:29:53.221422 (+  1585us) tablet_rpc.cc:320] Done(OK)
[ts-1] 1005 11:29:53.221430 (+     8us) async_rpc.cc:456] ProcessResponseFromTserver(OK)
[ts-1] 1005 11:29:53.221431 (+     1us) async_rpc.cc:458] Received from server:
[ts-1]  BEGIN
[ts-1] 1005 11:29:53.219815 (+     0us) inbound_call.cc:76] Created InboundCall
[ts-1] 1005 11:29:53.219828 (+    13us) service_pool.cc:182] Inserting onto call queue
[ts-1] 1005 11:29:53.219855 (+    27us) service_pool.cc:260] Handling call Write
[ts-1] 1005 11:29:53.219868 (+    13us) tablet_service.cc:1958] Start Write
[ts-1] 1005 11:29:53.219909 (+    41us) tablet.cc:1883] AcquireLocksAndPerformDocOperations
[ts-1] 1005 11:29:53.219910 (+     1us) tablet.cc:3405] Acquiring write permit
[ts-1] 1005 11:29:53.219915 (+     5us) tablet.cc:3406] Acquiring write permit done
[ts-1] 1005 11:29:53.220031 (+   116us) shared_lock_manager.cc:321] Locking a batch of 2 keys
[ts-1] 1005 11:29:53.220042 (+    11us) shared_lock_manager.cc:337] Acquired a lock batch of 2 keys
[ts-1] 1005 11:29:53.220053 (+    11us) conflict_resolution.cc:1305] ResolveTransactionConflicts
[ts-1] 1005 11:29:53.220149 (+    96us) conflict_resolution.cc:337] No conflicts.
[ts-1] 1005 11:29:53.220247 (+    98us) write_query.cc:599] TransactionalConflictsResolved
[ts-1] 1005 11:29:53.220252 (+     5us) conflict_resolution.cc:1331] resolver->Resolve done
[ts-1] ..  Related trace:
[ts-1] ..  1005 11:29:53.220235 (+     0us) operation_driver.cc:175] ExecuteAsync
[ts-1] ..  1005 11:29:53.220269 (+    34us) operation_driver.cc:235] Begin PrepareAndStart
[ts-1] ..  1005 11:29:53.220413 (+   144us) log_cache.cc:273] Appended up to 2 the local log
[ts-1] ..  1005 11:29:53.220453 (+    40us) raft_consensus.cc:1396] Marking majority replicated up to 1.1
[ts-1] ..  1005 11:29:53.221135 (+   682us) consensus_peers.cc:511] Received response from 231be5482e9344b09245b3c6f1ff77e8:
[ts-1] BEGIN UpdateConsensus
[ts-1] 1005 11:29:53.220607 (+     0us) inbound_call.cc:76] Created InboundCall
[ts-1] 1005 11:29:53.220623 (+    16us) service_pool.cc:182] Inserting onto call queue
[ts-1] 1005 11:29:53.220647 (+    24us) service_pool.cc:260] Handling call UpdateConsensus
[ts-1] 1005 11:29:53.220719 (+    72us) raft_consensus.cc:1942] Updating replica for 1 ops
[ts-1] 1005 11:29:53.220736 (+    17us) raft_consensus.cc:2053] Early marking committed up to 0.0
[ts-1] 1005 11:29:53.220737 (+     1us) raft_consensus.cc:2061] Triggering prepare for 1 ops
[ts-1] 1005 11:29:53.220830 (+    93us) raft_consensus.cc:2233] Updating last received op as 1.2
[ts-1] 1005 11:29:53.220833 (+     3us) raft_consensus.cc:2246] Marking committed up to 1.1
[ts-1] 1005 11:29:53.220834 (+     1us) raft_consensus.cc:2251] Filling consensus response to leader.
[ts-1] 1005 11:29:53.220835 (+     1us) raft_consensus.cc:2033] UpdateReplica() finished
[ts-1] 1005 11:29:53.220837 (+     2us) raft_consensus.cc:2169] Waiting on the replicates to finish logging
[ts-1] 1005 11:29:53.220899 (+    62us) log_cache.cc:273] Appended up to 2 the local log
[ts-1] 1005 11:29:53.220909 (+    10us) raft_consensus.cc:2194] Finished waiting on the replicates to finish logging
[ts-1] ..  Related trace:
[ts-1] ..  1005 11:29:53.220774 (+     0us) operation_driver.cc:175] ExecuteAsync
[ts-1] ..  1005 11:29:53.220776 (+     2us) operation_driver.cc:235] Begin PrepareAndStart
[ts-1]
[ts-1] END UpdateConsensus
[ts-1] ..  1005 11:29:53.221184 (+    49us) raft_consensus.cc:1396] Marking majority replicated up to 1.2
[ts-1] ..  1005 11:29:53.221187 (+     3us) replica_state.cc:883] Begin ApplyPendingOperationsUnlocked
[ts-1] ..  1005 11:29:53.221190 (+     3us) operation_driver.cc:318] Begin ReplicationFinished() call.
[ts-1] ..  1005 11:29:53.221192 (+     2us) write_operation.cc:77] APPLY: Starting
[ts-1] ..  1005 11:29:53.221296 (+   104us) write_operation.cc:95] FINISH: making edits visible
[ts-1] ..  1005 11:29:53.221307 (+    11us) shared_lock_manager.cc:361] Unlocking a batch of 2 keys
[ts-1] ..  1005 11:29:53.221316 (+     9us) tablet_service.cc:366] Write completing with status OK
[ts-1] ..  ..  Related trace:
[ts-1] ..  ..  1005 11:29:53.220409 (+     0us) outbound_call.cc:232] yb.consensus.ConsensusService.UpdateConsensus.
[ts-1] ..  ..  1005 11:29:53.220444 (+    35us) reactor.cc:938] Scheduled.
[ts-1] ..  ..  1005 11:29:53.220478 (+    34us) outbound_call.cc:573] Queued.
[ts-1] ..  ..  1005 11:29:53.220501 (+    23us) outbound_call.cc:583] Call Sent.
[ts-1] ..  ..  1005 11:29:53.221111 (+   610us) outbound_call.cc:525] Response received.
[ts-1] ..  ..  1005 11:29:53.221124 (+    13us) outbound_call.cc:429] Callback will be called asynchronously.
[ts-1] ..  ..  Related trace:
[ts-1] ..  ..  1005 11:29:53.221153 (+     0us) outbound_call.cc:232] yb.consensus.ConsensusService.UpdateConsensus.
[ts-1] ..  ..  1005 11:29:53.221190 (+    37us) reactor.cc:938] Scheduled.
[ts-1] ..  ..  1005 11:29:53.221214 (+    24us) outbound_call.cc:573] Queued.
[ts-1] ..  ..  1005 11:29:53.221231 (+    17us) outbound_call.cc:583] Call Sent.
[ts-1]  END.
[ts-1] ..  Related trace:
[ts-1] ..  1005 11:29:53.219800 (+     0us) outbound_call.cc:232] yb.tserver.TabletServerService.Write.
[ts-1] ..  1005 11:29:53.219809 (+     9us) local_call.cc:42] LocalOutboundCall
[ts-1] ..  1005 11:29:53.219811 (+     2us) outbound_call.cc:573] Queued.
[ts-1] ..  1005 11:29:53.219812 (+     1us) outbound_call.cc:583] Call Sent.
[ts-1] ..  1005 11:29:53.221417 (+  1605us) outbound_call.cc:429] Callback will be called asynchronously.
```

Upgrade/Rollback safety:
    Should be ok to upgrade/rollback. Only new/optional fields are added in this change. Protobuf's default handling of ignoring the fields (at an older server) works fine. The trace buffer will be returned only if both leader and the follower have this change. Otherwise, the `request.trace_requested` will be ignored.
Jira: DB-8207

Test Plan:
```
export YB_EXTRA_TSERVER_FLAGS="--collect_end_to_end_traces=true --ybclient_print_trace_every_n=1 --collect_update_consensus_traces=true"
ybd --cxx-test cassandra_cpp_driver-test --gtest_filter CppCassandraDriverTest.TransactionalWrite

```

Reviewers: rthallam, esheng, arybochkin

Reviewed By: arybochkin

Subscribers: bogdan, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29414
@rthallamko3
Copy link
Contributor

Consider backporting all the way to 2.14 after a weeks worth of bake time.

@yugabyte-ci yugabyte-ci added priority/high High Priority and removed priority/medium Medium priority issue labels Nov 15, 2023
amitanandaiyer added a commit that referenced this issue Dec 20, 2023
Summary:
Original commit: 56f165c / D29092
1) Support tracing UpdateConsensus api
2) define collect_update_consensus_traces This flag needs to be enabled to collect the traces for UpdateConsensus from the remote follower.
3) Add a trace msg when appended to the local log

The trace output looks like

```
[ts-1] I1005 11:29:53.233863 1840902144 async_rpc.cc:192] Write(tablet: 4904acc47aac4f89a828e23987afef25, num_ops: 1, num_attempts: 1, txn: 36b494ed-2d70-425e-97de-037b3b69ccb6, subtxn: [none]) took 14174us. Trace:
[ts-1] 1005 11:29:53.219773 (+     0us) async_rpc.cc:550] WriteRpc initiated
[ts-1] 1005 11:29:53.219786 (+    13us) async_rpc.cc:202] SendRpc() called.
[ts-1] 1005 11:29:53.219787 (+     1us) tablet_rpc.cc:113] SelectTabletServer()
[ts-1] 1005 11:29:53.219794 (+     7us) async_rpc.cc:626] SendRpcToTserver
[ts-1] 1005 11:29:53.219837 (+    43us) async_rpc.cc:631] RpcDispatched Asynchronously
[ts-1] 1005 11:29:53.221422 (+  1585us) tablet_rpc.cc:320] Done(OK)
[ts-1] 1005 11:29:53.221430 (+     8us) async_rpc.cc:456] ProcessResponseFromTserver(OK)
[ts-1] 1005 11:29:53.221431 (+     1us) async_rpc.cc:458] Received from server:
[ts-1]  BEGIN
[ts-1] 1005 11:29:53.219815 (+     0us) inbound_call.cc:76] Created InboundCall
[ts-1] 1005 11:29:53.219828 (+    13us) service_pool.cc:182] Inserting onto call queue
[ts-1] 1005 11:29:53.219855 (+    27us) service_pool.cc:260] Handling call Write
[ts-1] 1005 11:29:53.219868 (+    13us) tablet_service.cc:1958] Start Write
[ts-1] 1005 11:29:53.219909 (+    41us) tablet.cc:1883] AcquireLocksAndPerformDocOperations
[ts-1] 1005 11:29:53.219910 (+     1us) tablet.cc:3405] Acquiring write permit
[ts-1] 1005 11:29:53.219915 (+     5us) tablet.cc:3406] Acquiring write permit done
[ts-1] 1005 11:29:53.220031 (+   116us) shared_lock_manager.cc:321] Locking a batch of 2 keys
[ts-1] 1005 11:29:53.220042 (+    11us) shared_lock_manager.cc:337] Acquired a lock batch of 2 keys
[ts-1] 1005 11:29:53.220053 (+    11us) conflict_resolution.cc:1305] ResolveTransactionConflicts
[ts-1] 1005 11:29:53.220149 (+    96us) conflict_resolution.cc:337] No conflicts.
[ts-1] 1005 11:29:53.220247 (+    98us) write_query.cc:599] TransactionalConflictsResolved
[ts-1] 1005 11:29:53.220252 (+     5us) conflict_resolution.cc:1331] resolver->Resolve done
[ts-1] ..  Related trace:
[ts-1] ..  1005 11:29:53.220235 (+     0us) operation_driver.cc:175] ExecuteAsync
[ts-1] ..  1005 11:29:53.220269 (+    34us) operation_driver.cc:235] Begin PrepareAndStart
[ts-1] ..  1005 11:29:53.220413 (+   144us) log_cache.cc:273] Appended up to 2 the local log
[ts-1] ..  1005 11:29:53.220453 (+    40us) raft_consensus.cc:1396] Marking majority replicated up to 1.1
[ts-1] ..  1005 11:29:53.221135 (+   682us) consensus_peers.cc:511] Received response from 231be5482e9344b09245b3c6f1ff77e8:
[ts-1] BEGIN UpdateConsensus
[ts-1] 1005 11:29:53.220607 (+     0us) inbound_call.cc:76] Created InboundCall
[ts-1] 1005 11:29:53.220623 (+    16us) service_pool.cc:182] Inserting onto call queue
[ts-1] 1005 11:29:53.220647 (+    24us) service_pool.cc:260] Handling call UpdateConsensus
[ts-1] 1005 11:29:53.220719 (+    72us) raft_consensus.cc:1942] Updating replica for 1 ops
[ts-1] 1005 11:29:53.220736 (+    17us) raft_consensus.cc:2053] Early marking committed up to 0.0
[ts-1] 1005 11:29:53.220737 (+     1us) raft_consensus.cc:2061] Triggering prepare for 1 ops
[ts-1] 1005 11:29:53.220830 (+    93us) raft_consensus.cc:2233] Updating last received op as 1.2
[ts-1] 1005 11:29:53.220833 (+     3us) raft_consensus.cc:2246] Marking committed up to 1.1
[ts-1] 1005 11:29:53.220834 (+     1us) raft_consensus.cc:2251] Filling consensus response to leader.
[ts-1] 1005 11:29:53.220835 (+     1us) raft_consensus.cc:2033] UpdateReplica() finished
[ts-1] 1005 11:29:53.220837 (+     2us) raft_consensus.cc:2169] Waiting on the replicates to finish logging
[ts-1] 1005 11:29:53.220899 (+    62us) log_cache.cc:273] Appended up to 2 the local log
[ts-1] 1005 11:29:53.220909 (+    10us) raft_consensus.cc:2194] Finished waiting on the replicates to finish logging
[ts-1] ..  Related trace:
[ts-1] ..  1005 11:29:53.220774 (+     0us) operation_driver.cc:175] ExecuteAsync
[ts-1] ..  1005 11:29:53.220776 (+     2us) operation_driver.cc:235] Begin PrepareAndStart
[ts-1]
[ts-1] END UpdateConsensus
[ts-1] ..  1005 11:29:53.221184 (+    49us) raft_consensus.cc:1396] Marking majority replicated up to 1.2
[ts-1] ..  1005 11:29:53.221187 (+     3us) replica_state.cc:883] Begin ApplyPendingOperationsUnlocked
[ts-1] ..  1005 11:29:53.221190 (+     3us) operation_driver.cc:318] Begin ReplicationFinished() call.
[ts-1] ..  1005 11:29:53.221192 (+     2us) write_operation.cc:77] APPLY: Starting
[ts-1] ..  1005 11:29:53.221296 (+   104us) write_operation.cc:95] FINISH: making edits visible
[ts-1] ..  1005 11:29:53.221307 (+    11us) shared_lock_manager.cc:361] Unlocking a batch of 2 keys
[ts-1] ..  1005 11:29:53.221316 (+     9us) tablet_service.cc:366] Write completing with status OK
[ts-1] ..  ..  Related trace:
[ts-1] ..  ..  1005 11:29:53.220409 (+     0us) outbound_call.cc:232] yb.consensus.ConsensusService.UpdateConsensus.
[ts-1] ..  ..  1005 11:29:53.220444 (+    35us) reactor.cc:938] Scheduled.
[ts-1] ..  ..  1005 11:29:53.220478 (+    34us) outbound_call.cc:573] Queued.
[ts-1] ..  ..  1005 11:29:53.220501 (+    23us) outbound_call.cc:583] Call Sent.
[ts-1] ..  ..  1005 11:29:53.221111 (+   610us) outbound_call.cc:525] Response received.
[ts-1] ..  ..  1005 11:29:53.221124 (+    13us) outbound_call.cc:429] Callback will be called asynchronously.
[ts-1] ..  ..  Related trace:
[ts-1] ..  ..  1005 11:29:53.221153 (+     0us) outbound_call.cc:232] yb.consensus.ConsensusService.UpdateConsensus.
[ts-1] ..  ..  1005 11:29:53.221190 (+    37us) reactor.cc:938] Scheduled.
[ts-1] ..  ..  1005 11:29:53.221214 (+    24us) outbound_call.cc:573] Queued.
[ts-1] ..  ..  1005 11:29:53.221231 (+    17us) outbound_call.cc:583] Call Sent.
[ts-1]  END.
[ts-1] ..  Related trace:
[ts-1] ..  1005 11:29:53.219800 (+     0us) outbound_call.cc:232] yb.tserver.TabletServerService.Write.
[ts-1] ..  1005 11:29:53.219809 (+     9us) local_call.cc:42] LocalOutboundCall
[ts-1] ..  1005 11:29:53.219811 (+     2us) outbound_call.cc:573] Queued.
[ts-1] ..  1005 11:29:53.219812 (+     1us) outbound_call.cc:583] Call Sent.
[ts-1] ..  1005 11:29:53.221417 (+  1605us) outbound_call.cc:429] Callback will be called asynchronously.
```

Upgrade/Rollback safety:
    Should be ok to upgrade/rollback. Only new/optional fields are added in this change. Protobuf's default handling of ignoring the fields (at an older server) works fine. The trace buffer will be returned only if both leader and the follower have this change. Otherwise, the `request.trace_requested` will be ignored.
Jira: DB-8207

Test Plan:
```
export YB_EXTRA_TSERVER_FLAGS="--collect_end_to_end_traces=true --ybclient_print_trace_every_n=1 --collect_update_consensus_traces=true"
ybd --cxx-test cassandra_cpp_driver-test --gtest_filter CppCassandraDriverTest.TransactionalWrite

```

Reviewers: rthallam, esheng, arybochkin

Reviewed By: rthallam

Subscribers: bogdan, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30191
amitanandaiyer added a commit that referenced this issue Dec 21, 2023
Summary:
Original commit: 56f165c / D29092
1) Support tracing UpdateConsensus api
2) define collect_update_consensus_traces This flag needs to be enabled to collect the traces for UpdateConsensus from the remote follower.
3) Add a trace msg when appended to the local log

The trace output looks like

```
[ts-1] I1005 11:29:53.233863 1840902144 async_rpc.cc:192] Write(tablet: 4904acc47aac4f89a828e23987afef25, num_ops: 1, num_attempts: 1, txn: 36b494ed-2d70-425e-97de-037b3b69ccb6, subtxn: [none]) took 14174us. Trace:
[ts-1] 1005 11:29:53.219773 (+     0us) async_rpc.cc:550] WriteRpc initiated
[ts-1] 1005 11:29:53.219786 (+    13us) async_rpc.cc:202] SendRpc() called.
[ts-1] 1005 11:29:53.219787 (+     1us) tablet_rpc.cc:113] SelectTabletServer()
[ts-1] 1005 11:29:53.219794 (+     7us) async_rpc.cc:626] SendRpcToTserver
[ts-1] 1005 11:29:53.219837 (+    43us) async_rpc.cc:631] RpcDispatched Asynchronously
[ts-1] 1005 11:29:53.221422 (+  1585us) tablet_rpc.cc:320] Done(OK)
[ts-1] 1005 11:29:53.221430 (+     8us) async_rpc.cc:456] ProcessResponseFromTserver(OK)
[ts-1] 1005 11:29:53.221431 (+     1us) async_rpc.cc:458] Received from server:
[ts-1]  BEGIN
[ts-1] 1005 11:29:53.219815 (+     0us) inbound_call.cc:76] Created InboundCall
[ts-1] 1005 11:29:53.219828 (+    13us) service_pool.cc:182] Inserting onto call queue
[ts-1] 1005 11:29:53.219855 (+    27us) service_pool.cc:260] Handling call Write
[ts-1] 1005 11:29:53.219868 (+    13us) tablet_service.cc:1958] Start Write
[ts-1] 1005 11:29:53.219909 (+    41us) tablet.cc:1883] AcquireLocksAndPerformDocOperations
[ts-1] 1005 11:29:53.219910 (+     1us) tablet.cc:3405] Acquiring write permit
[ts-1] 1005 11:29:53.219915 (+     5us) tablet.cc:3406] Acquiring write permit done
[ts-1] 1005 11:29:53.220031 (+   116us) shared_lock_manager.cc:321] Locking a batch of 2 keys
[ts-1] 1005 11:29:53.220042 (+    11us) shared_lock_manager.cc:337] Acquired a lock batch of 2 keys
[ts-1] 1005 11:29:53.220053 (+    11us) conflict_resolution.cc:1305] ResolveTransactionConflicts
[ts-1] 1005 11:29:53.220149 (+    96us) conflict_resolution.cc:337] No conflicts.
[ts-1] 1005 11:29:53.220247 (+    98us) write_query.cc:599] TransactionalConflictsResolved
[ts-1] 1005 11:29:53.220252 (+     5us) conflict_resolution.cc:1331] resolver->Resolve done
[ts-1] ..  Related trace:
[ts-1] ..  1005 11:29:53.220235 (+     0us) operation_driver.cc:175] ExecuteAsync
[ts-1] ..  1005 11:29:53.220269 (+    34us) operation_driver.cc:235] Begin PrepareAndStart
[ts-1] ..  1005 11:29:53.220413 (+   144us) log_cache.cc:273] Appended up to 2 the local log
[ts-1] ..  1005 11:29:53.220453 (+    40us) raft_consensus.cc:1396] Marking majority replicated up to 1.1
[ts-1] ..  1005 11:29:53.221135 (+   682us) consensus_peers.cc:511] Received response from 231be5482e9344b09245b3c6f1ff77e8:
[ts-1] BEGIN UpdateConsensus
[ts-1] 1005 11:29:53.220607 (+     0us) inbound_call.cc:76] Created InboundCall
[ts-1] 1005 11:29:53.220623 (+    16us) service_pool.cc:182] Inserting onto call queue
[ts-1] 1005 11:29:53.220647 (+    24us) service_pool.cc:260] Handling call UpdateConsensus
[ts-1] 1005 11:29:53.220719 (+    72us) raft_consensus.cc:1942] Updating replica for 1 ops
[ts-1] 1005 11:29:53.220736 (+    17us) raft_consensus.cc:2053] Early marking committed up to 0.0
[ts-1] 1005 11:29:53.220737 (+     1us) raft_consensus.cc:2061] Triggering prepare for 1 ops
[ts-1] 1005 11:29:53.220830 (+    93us) raft_consensus.cc:2233] Updating last received op as 1.2
[ts-1] 1005 11:29:53.220833 (+     3us) raft_consensus.cc:2246] Marking committed up to 1.1
[ts-1] 1005 11:29:53.220834 (+     1us) raft_consensus.cc:2251] Filling consensus response to leader.
[ts-1] 1005 11:29:53.220835 (+     1us) raft_consensus.cc:2033] UpdateReplica() finished
[ts-1] 1005 11:29:53.220837 (+     2us) raft_consensus.cc:2169] Waiting on the replicates to finish logging
[ts-1] 1005 11:29:53.220899 (+    62us) log_cache.cc:273] Appended up to 2 the local log
[ts-1] 1005 11:29:53.220909 (+    10us) raft_consensus.cc:2194] Finished waiting on the replicates to finish logging
[ts-1] ..  Related trace:
[ts-1] ..  1005 11:29:53.220774 (+     0us) operation_driver.cc:175] ExecuteAsync
[ts-1] ..  1005 11:29:53.220776 (+     2us) operation_driver.cc:235] Begin PrepareAndStart
[ts-1]
[ts-1] END UpdateConsensus
[ts-1] ..  1005 11:29:53.221184 (+    49us) raft_consensus.cc:1396] Marking majority replicated up to 1.2
[ts-1] ..  1005 11:29:53.221187 (+     3us) replica_state.cc:883] Begin ApplyPendingOperationsUnlocked
[ts-1] ..  1005 11:29:53.221190 (+     3us) operation_driver.cc:318] Begin ReplicationFinished() call.
[ts-1] ..  1005 11:29:53.221192 (+     2us) write_operation.cc:77] APPLY: Starting
[ts-1] ..  1005 11:29:53.221296 (+   104us) write_operation.cc:95] FINISH: making edits visible
[ts-1] ..  1005 11:29:53.221307 (+    11us) shared_lock_manager.cc:361] Unlocking a batch of 2 keys
[ts-1] ..  1005 11:29:53.221316 (+     9us) tablet_service.cc:366] Write completing with status OK
[ts-1] ..  ..  Related trace:
[ts-1] ..  ..  1005 11:29:53.220409 (+     0us) outbound_call.cc:232] yb.consensus.ConsensusService.UpdateConsensus.
[ts-1] ..  ..  1005 11:29:53.220444 (+    35us) reactor.cc:938] Scheduled.
[ts-1] ..  ..  1005 11:29:53.220478 (+    34us) outbound_call.cc:573] Queued.
[ts-1] ..  ..  1005 11:29:53.220501 (+    23us) outbound_call.cc:583] Call Sent.
[ts-1] ..  ..  1005 11:29:53.221111 (+   610us) outbound_call.cc:525] Response received.
[ts-1] ..  ..  1005 11:29:53.221124 (+    13us) outbound_call.cc:429] Callback will be called asynchronously.
[ts-1] ..  ..  Related trace:
[ts-1] ..  ..  1005 11:29:53.221153 (+     0us) outbound_call.cc:232] yb.consensus.ConsensusService.UpdateConsensus.
[ts-1] ..  ..  1005 11:29:53.221190 (+    37us) reactor.cc:938] Scheduled.
[ts-1] ..  ..  1005 11:29:53.221214 (+    24us) outbound_call.cc:573] Queued.
[ts-1] ..  ..  1005 11:29:53.221231 (+    17us) outbound_call.cc:583] Call Sent.
[ts-1]  END.
[ts-1] ..  Related trace:
[ts-1] ..  1005 11:29:53.219800 (+     0us) outbound_call.cc:232] yb.tserver.TabletServerService.Write.
[ts-1] ..  1005 11:29:53.219809 (+     9us) local_call.cc:42] LocalOutboundCall
[ts-1] ..  1005 11:29:53.219811 (+     2us) outbound_call.cc:573] Queued.
[ts-1] ..  1005 11:29:53.219812 (+     1us) outbound_call.cc:583] Call Sent.
[ts-1] ..  1005 11:29:53.221417 (+  1605us) outbound_call.cc:429] Callback will be called asynchronously.
```

Upgrade/Rollback safety:
    Should be ok to upgrade/rollback. Only new/optional fields are added in this change. Protobuf's default handling of ignoring the fields (at an older server) works fine. The trace buffer will be returned only if both leader and the follower have this change. Otherwise, the `request.trace_requested` will be ignored.
Jira: DB-8207

Changes from what's on master:
Changes to make things compile w/o LW protos

Test Plan:
```
export YB_EXTRA_TSERVER_FLAGS="--collect_end_to_end_traces=true --ybclient_print_trace_every_n=1 --collect_update_consensus_traces=true"
ybd --cxx-test cassandra_cpp_driver-test --gtest_filter CppCassandraDriverTest.TransactionalWrite

```

Reviewers: rthallam, esheng, arybochkin

Reviewed By: rthallam

Subscribers: ybase, bogdan

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30192
amitanandaiyer added a commit that referenced this issue Dec 21, 2023
Summary:
Original commit: 56f165c / D29092

1) Support tracing UpdateConsensus api
2) define collect_update_consensus_traces This flag needs to be enabled to collect the traces for UpdateConsensus from the remote follower.
3) Add a trace msg when appended to the local log

The trace output looks like

```
[ts-1] I1005 11:29:53.233863 1840902144 async_rpc.cc:192] Write(tablet: 4904acc47aac4f89a828e23987afef25, num_ops: 1, num_attempts: 1, txn: 36b494ed-2d70-425e-97de-037b3b69ccb6, subtxn: [none]) took 14174us. Trace:
[ts-1] 1005 11:29:53.219773 (+     0us) async_rpc.cc:550] WriteRpc initiated
[ts-1] 1005 11:29:53.219786 (+    13us) async_rpc.cc:202] SendRpc() called.
[ts-1] 1005 11:29:53.219787 (+     1us) tablet_rpc.cc:113] SelectTabletServer()
[ts-1] 1005 11:29:53.219794 (+     7us) async_rpc.cc:626] SendRpcToTserver
[ts-1] 1005 11:29:53.219837 (+    43us) async_rpc.cc:631] RpcDispatched Asynchronously
[ts-1] 1005 11:29:53.221422 (+  1585us) tablet_rpc.cc:320] Done(OK)
[ts-1] 1005 11:29:53.221430 (+     8us) async_rpc.cc:456] ProcessResponseFromTserver(OK)
[ts-1] 1005 11:29:53.221431 (+     1us) async_rpc.cc:458] Received from server:
[ts-1]  BEGIN
[ts-1] 1005 11:29:53.219815 (+     0us) inbound_call.cc:76] Created InboundCall
[ts-1] 1005 11:29:53.219828 (+    13us) service_pool.cc:182] Inserting onto call queue
[ts-1] 1005 11:29:53.219855 (+    27us) service_pool.cc:260] Handling call Write
[ts-1] 1005 11:29:53.219868 (+    13us) tablet_service.cc:1958] Start Write
[ts-1] 1005 11:29:53.219909 (+    41us) tablet.cc:1883] AcquireLocksAndPerformDocOperations
[ts-1] 1005 11:29:53.219910 (+     1us) tablet.cc:3405] Acquiring write permit
[ts-1] 1005 11:29:53.219915 (+     5us) tablet.cc:3406] Acquiring write permit done
[ts-1] 1005 11:29:53.220031 (+   116us) shared_lock_manager.cc:321] Locking a batch of 2 keys
[ts-1] 1005 11:29:53.220042 (+    11us) shared_lock_manager.cc:337] Acquired a lock batch of 2 keys
[ts-1] 1005 11:29:53.220053 (+    11us) conflict_resolution.cc:1305] ResolveTransactionConflicts
[ts-1] 1005 11:29:53.220149 (+    96us) conflict_resolution.cc:337] No conflicts.
[ts-1] 1005 11:29:53.220247 (+    98us) write_query.cc:599] TransactionalConflictsResolved
[ts-1] 1005 11:29:53.220252 (+     5us) conflict_resolution.cc:1331] resolver->Resolve done
[ts-1] ..  Related trace:
[ts-1] ..  1005 11:29:53.220235 (+     0us) operation_driver.cc:175] ExecuteAsync
[ts-1] ..  1005 11:29:53.220269 (+    34us) operation_driver.cc:235] Begin PrepareAndStart
[ts-1] ..  1005 11:29:53.220413 (+   144us) log_cache.cc:273] Appended up to 2 the local log
[ts-1] ..  1005 11:29:53.220453 (+    40us) raft_consensus.cc:1396] Marking majority replicated up to 1.1
[ts-1] ..  1005 11:29:53.221135 (+   682us) consensus_peers.cc:511] Received response from 231be5482e9344b09245b3c6f1ff77e8:
[ts-1] BEGIN UpdateConsensus
[ts-1] 1005 11:29:53.220607 (+     0us) inbound_call.cc:76] Created InboundCall
[ts-1] 1005 11:29:53.220623 (+    16us) service_pool.cc:182] Inserting onto call queue
[ts-1] 1005 11:29:53.220647 (+    24us) service_pool.cc:260] Handling call UpdateConsensus
[ts-1] 1005 11:29:53.220719 (+    72us) raft_consensus.cc:1942] Updating replica for 1 ops
[ts-1] 1005 11:29:53.220736 (+    17us) raft_consensus.cc:2053] Early marking committed up to 0.0
[ts-1] 1005 11:29:53.220737 (+     1us) raft_consensus.cc:2061] Triggering prepare for 1 ops
[ts-1] 1005 11:29:53.220830 (+    93us) raft_consensus.cc:2233] Updating last received op as 1.2
[ts-1] 1005 11:29:53.220833 (+     3us) raft_consensus.cc:2246] Marking committed up to 1.1
[ts-1] 1005 11:29:53.220834 (+     1us) raft_consensus.cc:2251] Filling consensus response to leader.
[ts-1] 1005 11:29:53.220835 (+     1us) raft_consensus.cc:2033] UpdateReplica() finished
[ts-1] 1005 11:29:53.220837 (+     2us) raft_consensus.cc:2169] Waiting on the replicates to finish logging
[ts-1] 1005 11:29:53.220899 (+    62us) log_cache.cc:273] Appended up to 2 the local log
[ts-1] 1005 11:29:53.220909 (+    10us) raft_consensus.cc:2194] Finished waiting on the replicates to finish logging
[ts-1] ..  Related trace:
[ts-1] ..  1005 11:29:53.220774 (+     0us) operation_driver.cc:175] ExecuteAsync
[ts-1] ..  1005 11:29:53.220776 (+     2us) operation_driver.cc:235] Begin PrepareAndStart
[ts-1]
[ts-1] END UpdateConsensus
[ts-1] ..  1005 11:29:53.221184 (+    49us) raft_consensus.cc:1396] Marking majority replicated up to 1.2
[ts-1] ..  1005 11:29:53.221187 (+     3us) replica_state.cc:883] Begin ApplyPendingOperationsUnlocked
[ts-1] ..  1005 11:29:53.221190 (+     3us) operation_driver.cc:318] Begin ReplicationFinished() call.
[ts-1] ..  1005 11:29:53.221192 (+     2us) write_operation.cc:77] APPLY: Starting
[ts-1] ..  1005 11:29:53.221296 (+   104us) write_operation.cc:95] FINISH: making edits visible
[ts-1] ..  1005 11:29:53.221307 (+    11us) shared_lock_manager.cc:361] Unlocking a batch of 2 keys
[ts-1] ..  1005 11:29:53.221316 (+     9us) tablet_service.cc:366] Write completing with status OK
[ts-1] ..  ..  Related trace:
[ts-1] ..  ..  1005 11:29:53.220409 (+     0us) outbound_call.cc:232] yb.consensus.ConsensusService.UpdateConsensus.
[ts-1] ..  ..  1005 11:29:53.220444 (+    35us) reactor.cc:938] Scheduled.
[ts-1] ..  ..  1005 11:29:53.220478 (+    34us) outbound_call.cc:573] Queued.
[ts-1] ..  ..  1005 11:29:53.220501 (+    23us) outbound_call.cc:583] Call Sent.
[ts-1] ..  ..  1005 11:29:53.221111 (+   610us) outbound_call.cc:525] Response received.
[ts-1] ..  ..  1005 11:29:53.221124 (+    13us) outbound_call.cc:429] Callback will be called asynchronously.
[ts-1] ..  ..  Related trace:
[ts-1] ..  ..  1005 11:29:53.221153 (+     0us) outbound_call.cc:232] yb.consensus.ConsensusService.UpdateConsensus.
[ts-1] ..  ..  1005 11:29:53.221190 (+    37us) reactor.cc:938] Scheduled.
[ts-1] ..  ..  1005 11:29:53.221214 (+    24us) outbound_call.cc:573] Queued.
[ts-1] ..  ..  1005 11:29:53.221231 (+    17us) outbound_call.cc:583] Call Sent.
[ts-1]  END.
[ts-1] ..  Related trace:
[ts-1] ..  1005 11:29:53.219800 (+     0us) outbound_call.cc:232] yb.tserver.TabletServerService.Write.
[ts-1] ..  1005 11:29:53.219809 (+     9us) local_call.cc:42] LocalOutboundCall
[ts-1] ..  1005 11:29:53.219811 (+     2us) outbound_call.cc:573] Queued.
[ts-1] ..  1005 11:29:53.219812 (+     1us) outbound_call.cc:583] Call Sent.
[ts-1] ..  1005 11:29:53.221417 (+  1605us) outbound_call.cc:429] Callback will be called asynchronously.
```

Upgrade/Rollback safety:
    Should be ok to upgrade/rollback. Only new/optional fields are added in this change. Protobuf's default handling of ignoring the fields (at an older server) works fine. The trace buffer will be returned only if both leader and the follower have this change. Otherwise, the `request.trace_requested` will be ignored.
Jira: DB-8207

do not use _RUNTIME_ macro in 2.14. No LW proto

Test Plan:
```
export YB_EXTRA_TSERVER_FLAGS="--collect_end_to_end_traces=true --ybclient_print_trace_every_n=1 --collect_update_consensus_traces=true"
ybd --cxx-test cassandra_cpp_driver-test --gtest_filter CppCassandraDriverTest.TransactionalWrite

```

Reviewers: rthallam, esheng, arybochkin

Reviewed By: rthallam

Subscribers: bogdan, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D30193
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants