New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DocDB] Support tracing UpdateConsensus calls #19417
Labels
2.14 Backport Required
2.16 Backport Required
2.18 Backport Required
2.20 Backport Required
area/docdb
YugabyteDB core features
kind/enhancement
This is an enhancement of an existing feature
priority/high
High Priority
Comments
amitanandaiyer
added
area/docdb
YugabyteDB core features
status/awaiting-triage
Issue awaiting triage
labels
Oct 5, 2023
yugabyte-ci
added
kind/enhancement
This is an enhancement of an existing feature
priority/medium
Medium priority issue
and removed
status/awaiting-triage
Issue awaiting triage
labels
Oct 5, 2023
amitanandaiyer
added a commit
that referenced
this issue
Oct 14, 2023
Summary: 1) Support tracing UpdateConsensus api 2) define collect_update_consensus_traces This flag needs to be enabled to collect the traces for UpdateConsensus from the remote follower. 3) Add a trace msg when appended to the local log The trace output looks like ``` [ts-1] I1005 11:29:53.233863 1840902144 async_rpc.cc:192] Write(tablet: 4904acc47aac4f89a828e23987afef25, num_ops: 1, num_attempts: 1, txn: 36b494ed-2d70-425e-97de-037b3b69ccb6, subtxn: [none]) took 14174us. Trace: [ts-1] 1005 11:29:53.219773 (+ 0us) async_rpc.cc:550] WriteRpc initiated [ts-1] 1005 11:29:53.219786 (+ 13us) async_rpc.cc:202] SendRpc() called. [ts-1] 1005 11:29:53.219787 (+ 1us) tablet_rpc.cc:113] SelectTabletServer() [ts-1] 1005 11:29:53.219794 (+ 7us) async_rpc.cc:626] SendRpcToTserver [ts-1] 1005 11:29:53.219837 (+ 43us) async_rpc.cc:631] RpcDispatched Asynchronously [ts-1] 1005 11:29:53.221422 (+ 1585us) tablet_rpc.cc:320] Done(OK) [ts-1] 1005 11:29:53.221430 (+ 8us) async_rpc.cc:456] ProcessResponseFromTserver(OK) [ts-1] 1005 11:29:53.221431 (+ 1us) async_rpc.cc:458] Received from server: [ts-1] BEGIN [ts-1] 1005 11:29:53.219815 (+ 0us) inbound_call.cc:76] Created InboundCall [ts-1] 1005 11:29:53.219828 (+ 13us) service_pool.cc:182] Inserting onto call queue [ts-1] 1005 11:29:53.219855 (+ 27us) service_pool.cc:260] Handling call Write [ts-1] 1005 11:29:53.219868 (+ 13us) tablet_service.cc:1958] Start Write [ts-1] 1005 11:29:53.219909 (+ 41us) tablet.cc:1883] AcquireLocksAndPerformDocOperations [ts-1] 1005 11:29:53.219910 (+ 1us) tablet.cc:3405] Acquiring write permit [ts-1] 1005 11:29:53.219915 (+ 5us) tablet.cc:3406] Acquiring write permit done [ts-1] 1005 11:29:53.220031 (+ 116us) shared_lock_manager.cc:321] Locking a batch of 2 keys [ts-1] 1005 11:29:53.220042 (+ 11us) shared_lock_manager.cc:337] Acquired a lock batch of 2 keys [ts-1] 1005 11:29:53.220053 (+ 11us) conflict_resolution.cc:1305] ResolveTransactionConflicts [ts-1] 1005 11:29:53.220149 (+ 96us) conflict_resolution.cc:337] No conflicts. [ts-1] 1005 11:29:53.220247 (+ 98us) write_query.cc:599] TransactionalConflictsResolved [ts-1] 1005 11:29:53.220252 (+ 5us) conflict_resolution.cc:1331] resolver->Resolve done [ts-1] .. Related trace: [ts-1] .. 1005 11:29:53.220235 (+ 0us) operation_driver.cc:175] ExecuteAsync [ts-1] .. 1005 11:29:53.220269 (+ 34us) operation_driver.cc:235] Begin PrepareAndStart [ts-1] .. 1005 11:29:53.220413 (+ 144us) log_cache.cc:273] Appended up to 2 the local log [ts-1] .. 1005 11:29:53.220453 (+ 40us) raft_consensus.cc:1396] Marking majority replicated up to 1.1 [ts-1] .. 1005 11:29:53.221135 (+ 682us) consensus_peers.cc:511] Received response from 231be5482e9344b09245b3c6f1ff77e8: [ts-1] BEGIN UpdateConsensus [ts-1] 1005 11:29:53.220607 (+ 0us) inbound_call.cc:76] Created InboundCall [ts-1] 1005 11:29:53.220623 (+ 16us) service_pool.cc:182] Inserting onto call queue [ts-1] 1005 11:29:53.220647 (+ 24us) service_pool.cc:260] Handling call UpdateConsensus [ts-1] 1005 11:29:53.220719 (+ 72us) raft_consensus.cc:1942] Updating replica for 1 ops [ts-1] 1005 11:29:53.220736 (+ 17us) raft_consensus.cc:2053] Early marking committed up to 0.0 [ts-1] 1005 11:29:53.220737 (+ 1us) raft_consensus.cc:2061] Triggering prepare for 1 ops [ts-1] 1005 11:29:53.220830 (+ 93us) raft_consensus.cc:2233] Updating last received op as 1.2 [ts-1] 1005 11:29:53.220833 (+ 3us) raft_consensus.cc:2246] Marking committed up to 1.1 [ts-1] 1005 11:29:53.220834 (+ 1us) raft_consensus.cc:2251] Filling consensus response to leader. [ts-1] 1005 11:29:53.220835 (+ 1us) raft_consensus.cc:2033] UpdateReplica() finished [ts-1] 1005 11:29:53.220837 (+ 2us) raft_consensus.cc:2169] Waiting on the replicates to finish logging [ts-1] 1005 11:29:53.220899 (+ 62us) log_cache.cc:273] Appended up to 2 the local log [ts-1] 1005 11:29:53.220909 (+ 10us) raft_consensus.cc:2194] Finished waiting on the replicates to finish logging [ts-1] .. Related trace: [ts-1] .. 1005 11:29:53.220774 (+ 0us) operation_driver.cc:175] ExecuteAsync [ts-1] .. 1005 11:29:53.220776 (+ 2us) operation_driver.cc:235] Begin PrepareAndStart [ts-1] [ts-1] END UpdateConsensus [ts-1] .. 1005 11:29:53.221184 (+ 49us) raft_consensus.cc:1396] Marking majority replicated up to 1.2 [ts-1] .. 1005 11:29:53.221187 (+ 3us) replica_state.cc:883] Begin ApplyPendingOperationsUnlocked [ts-1] .. 1005 11:29:53.221190 (+ 3us) operation_driver.cc:318] Begin ReplicationFinished() call. [ts-1] .. 1005 11:29:53.221192 (+ 2us) write_operation.cc:77] APPLY: Starting [ts-1] .. 1005 11:29:53.221296 (+ 104us) write_operation.cc:95] FINISH: making edits visible [ts-1] .. 1005 11:29:53.221307 (+ 11us) shared_lock_manager.cc:361] Unlocking a batch of 2 keys [ts-1] .. 1005 11:29:53.221316 (+ 9us) tablet_service.cc:366] Write completing with status OK [ts-1] .. .. Related trace: [ts-1] .. .. 1005 11:29:53.220409 (+ 0us) outbound_call.cc:232] yb.consensus.ConsensusService.UpdateConsensus. [ts-1] .. .. 1005 11:29:53.220444 (+ 35us) reactor.cc:938] Scheduled. [ts-1] .. .. 1005 11:29:53.220478 (+ 34us) outbound_call.cc:573] Queued. [ts-1] .. .. 1005 11:29:53.220501 (+ 23us) outbound_call.cc:583] Call Sent. [ts-1] .. .. 1005 11:29:53.221111 (+ 610us) outbound_call.cc:525] Response received. [ts-1] .. .. 1005 11:29:53.221124 (+ 13us) outbound_call.cc:429] Callback will be called asynchronously. [ts-1] .. .. Related trace: [ts-1] .. .. 1005 11:29:53.221153 (+ 0us) outbound_call.cc:232] yb.consensus.ConsensusService.UpdateConsensus. [ts-1] .. .. 1005 11:29:53.221190 (+ 37us) reactor.cc:938] Scheduled. [ts-1] .. .. 1005 11:29:53.221214 (+ 24us) outbound_call.cc:573] Queued. [ts-1] .. .. 1005 11:29:53.221231 (+ 17us) outbound_call.cc:583] Call Sent. [ts-1] END. [ts-1] .. Related trace: [ts-1] .. 1005 11:29:53.219800 (+ 0us) outbound_call.cc:232] yb.tserver.TabletServerService.Write. [ts-1] .. 1005 11:29:53.219809 (+ 9us) local_call.cc:42] LocalOutboundCall [ts-1] .. 1005 11:29:53.219811 (+ 2us) outbound_call.cc:573] Queued. [ts-1] .. 1005 11:29:53.219812 (+ 1us) outbound_call.cc:583] Call Sent. [ts-1] .. 1005 11:29:53.221417 (+ 1605us) outbound_call.cc:429] Callback will be called asynchronously. ``` Upgrade/Rollback safety: Should be ok to upgrade/rollback. Only new/optional fields are added in this change. Protobuf's default handling of ignoring the fields (at an older server) works fine. The trace buffer will be returned only if both leader and the follower have this change. Otherwise, the `request.trace_requested` will be ignored. Jira: DB-8207 Test Plan: ``` export YB_EXTRA_TSERVER_FLAGS="--collect_end_to_end_traces=true --ybclient_print_trace_every_n=1 --collect_update_consensus_traces=true" ybd --cxx-test cassandra_cpp_driver-test --gtest_filter CppCassandraDriverTest.TransactionalWrite ``` Reviewers: rthallam, esheng, arybochkin Reviewed By: arybochkin Subscribers: ybase, bogdan Differential Revision: https://phorge.dev.yugabyte.com/D29092
amitanandaiyer
added a commit
that referenced
this issue
Oct 18, 2023
Summary: Original commit: 56f165c / D29092 1) Support tracing UpdateConsensus api 2) define collect_update_consensus_traces This flag needs to be enabled to collect the traces for UpdateConsensus from the remote follower. 3) Add a trace msg when appended to the local log The trace output looks like ``` [ts-1] I1005 11:29:53.233863 1840902144 async_rpc.cc:192] Write(tablet: 4904acc47aac4f89a828e23987afef25, num_ops: 1, num_attempts: 1, txn: 36b494ed-2d70-425e-97de-037b3b69ccb6, subtxn: [none]) took 14174us. Trace: [ts-1] 1005 11:29:53.219773 (+ 0us) async_rpc.cc:550] WriteRpc initiated [ts-1] 1005 11:29:53.219786 (+ 13us) async_rpc.cc:202] SendRpc() called. [ts-1] 1005 11:29:53.219787 (+ 1us) tablet_rpc.cc:113] SelectTabletServer() [ts-1] 1005 11:29:53.219794 (+ 7us) async_rpc.cc:626] SendRpcToTserver [ts-1] 1005 11:29:53.219837 (+ 43us) async_rpc.cc:631] RpcDispatched Asynchronously [ts-1] 1005 11:29:53.221422 (+ 1585us) tablet_rpc.cc:320] Done(OK) [ts-1] 1005 11:29:53.221430 (+ 8us) async_rpc.cc:456] ProcessResponseFromTserver(OK) [ts-1] 1005 11:29:53.221431 (+ 1us) async_rpc.cc:458] Received from server: [ts-1] BEGIN [ts-1] 1005 11:29:53.219815 (+ 0us) inbound_call.cc:76] Created InboundCall [ts-1] 1005 11:29:53.219828 (+ 13us) service_pool.cc:182] Inserting onto call queue [ts-1] 1005 11:29:53.219855 (+ 27us) service_pool.cc:260] Handling call Write [ts-1] 1005 11:29:53.219868 (+ 13us) tablet_service.cc:1958] Start Write [ts-1] 1005 11:29:53.219909 (+ 41us) tablet.cc:1883] AcquireLocksAndPerformDocOperations [ts-1] 1005 11:29:53.219910 (+ 1us) tablet.cc:3405] Acquiring write permit [ts-1] 1005 11:29:53.219915 (+ 5us) tablet.cc:3406] Acquiring write permit done [ts-1] 1005 11:29:53.220031 (+ 116us) shared_lock_manager.cc:321] Locking a batch of 2 keys [ts-1] 1005 11:29:53.220042 (+ 11us) shared_lock_manager.cc:337] Acquired a lock batch of 2 keys [ts-1] 1005 11:29:53.220053 (+ 11us) conflict_resolution.cc:1305] ResolveTransactionConflicts [ts-1] 1005 11:29:53.220149 (+ 96us) conflict_resolution.cc:337] No conflicts. [ts-1] 1005 11:29:53.220247 (+ 98us) write_query.cc:599] TransactionalConflictsResolved [ts-1] 1005 11:29:53.220252 (+ 5us) conflict_resolution.cc:1331] resolver->Resolve done [ts-1] .. Related trace: [ts-1] .. 1005 11:29:53.220235 (+ 0us) operation_driver.cc:175] ExecuteAsync [ts-1] .. 1005 11:29:53.220269 (+ 34us) operation_driver.cc:235] Begin PrepareAndStart [ts-1] .. 1005 11:29:53.220413 (+ 144us) log_cache.cc:273] Appended up to 2 the local log [ts-1] .. 1005 11:29:53.220453 (+ 40us) raft_consensus.cc:1396] Marking majority replicated up to 1.1 [ts-1] .. 1005 11:29:53.221135 (+ 682us) consensus_peers.cc:511] Received response from 231be5482e9344b09245b3c6f1ff77e8: [ts-1] BEGIN UpdateConsensus [ts-1] 1005 11:29:53.220607 (+ 0us) inbound_call.cc:76] Created InboundCall [ts-1] 1005 11:29:53.220623 (+ 16us) service_pool.cc:182] Inserting onto call queue [ts-1] 1005 11:29:53.220647 (+ 24us) service_pool.cc:260] Handling call UpdateConsensus [ts-1] 1005 11:29:53.220719 (+ 72us) raft_consensus.cc:1942] Updating replica for 1 ops [ts-1] 1005 11:29:53.220736 (+ 17us) raft_consensus.cc:2053] Early marking committed up to 0.0 [ts-1] 1005 11:29:53.220737 (+ 1us) raft_consensus.cc:2061] Triggering prepare for 1 ops [ts-1] 1005 11:29:53.220830 (+ 93us) raft_consensus.cc:2233] Updating last received op as 1.2 [ts-1] 1005 11:29:53.220833 (+ 3us) raft_consensus.cc:2246] Marking committed up to 1.1 [ts-1] 1005 11:29:53.220834 (+ 1us) raft_consensus.cc:2251] Filling consensus response to leader. [ts-1] 1005 11:29:53.220835 (+ 1us) raft_consensus.cc:2033] UpdateReplica() finished [ts-1] 1005 11:29:53.220837 (+ 2us) raft_consensus.cc:2169] Waiting on the replicates to finish logging [ts-1] 1005 11:29:53.220899 (+ 62us) log_cache.cc:273] Appended up to 2 the local log [ts-1] 1005 11:29:53.220909 (+ 10us) raft_consensus.cc:2194] Finished waiting on the replicates to finish logging [ts-1] .. Related trace: [ts-1] .. 1005 11:29:53.220774 (+ 0us) operation_driver.cc:175] ExecuteAsync [ts-1] .. 1005 11:29:53.220776 (+ 2us) operation_driver.cc:235] Begin PrepareAndStart [ts-1] [ts-1] END UpdateConsensus [ts-1] .. 1005 11:29:53.221184 (+ 49us) raft_consensus.cc:1396] Marking majority replicated up to 1.2 [ts-1] .. 1005 11:29:53.221187 (+ 3us) replica_state.cc:883] Begin ApplyPendingOperationsUnlocked [ts-1] .. 1005 11:29:53.221190 (+ 3us) operation_driver.cc:318] Begin ReplicationFinished() call. [ts-1] .. 1005 11:29:53.221192 (+ 2us) write_operation.cc:77] APPLY: Starting [ts-1] .. 1005 11:29:53.221296 (+ 104us) write_operation.cc:95] FINISH: making edits visible [ts-1] .. 1005 11:29:53.221307 (+ 11us) shared_lock_manager.cc:361] Unlocking a batch of 2 keys [ts-1] .. 1005 11:29:53.221316 (+ 9us) tablet_service.cc:366] Write completing with status OK [ts-1] .. .. Related trace: [ts-1] .. .. 1005 11:29:53.220409 (+ 0us) outbound_call.cc:232] yb.consensus.ConsensusService.UpdateConsensus. [ts-1] .. .. 1005 11:29:53.220444 (+ 35us) reactor.cc:938] Scheduled. [ts-1] .. .. 1005 11:29:53.220478 (+ 34us) outbound_call.cc:573] Queued. [ts-1] .. .. 1005 11:29:53.220501 (+ 23us) outbound_call.cc:583] Call Sent. [ts-1] .. .. 1005 11:29:53.221111 (+ 610us) outbound_call.cc:525] Response received. [ts-1] .. .. 1005 11:29:53.221124 (+ 13us) outbound_call.cc:429] Callback will be called asynchronously. [ts-1] .. .. Related trace: [ts-1] .. .. 1005 11:29:53.221153 (+ 0us) outbound_call.cc:232] yb.consensus.ConsensusService.UpdateConsensus. [ts-1] .. .. 1005 11:29:53.221190 (+ 37us) reactor.cc:938] Scheduled. [ts-1] .. .. 1005 11:29:53.221214 (+ 24us) outbound_call.cc:573] Queued. [ts-1] .. .. 1005 11:29:53.221231 (+ 17us) outbound_call.cc:583] Call Sent. [ts-1] END. [ts-1] .. Related trace: [ts-1] .. 1005 11:29:53.219800 (+ 0us) outbound_call.cc:232] yb.tserver.TabletServerService.Write. [ts-1] .. 1005 11:29:53.219809 (+ 9us) local_call.cc:42] LocalOutboundCall [ts-1] .. 1005 11:29:53.219811 (+ 2us) outbound_call.cc:573] Queued. [ts-1] .. 1005 11:29:53.219812 (+ 1us) outbound_call.cc:583] Call Sent. [ts-1] .. 1005 11:29:53.221417 (+ 1605us) outbound_call.cc:429] Callback will be called asynchronously. ``` Upgrade/Rollback safety: Should be ok to upgrade/rollback. Only new/optional fields are added in this change. Protobuf's default handling of ignoring the fields (at an older server) works fine. The trace buffer will be returned only if both leader and the follower have this change. Otherwise, the `request.trace_requested` will be ignored. Jira: DB-8207 Test Plan: ``` export YB_EXTRA_TSERVER_FLAGS="--collect_end_to_end_traces=true --ybclient_print_trace_every_n=1 --collect_update_consensus_traces=true" ybd --cxx-test cassandra_cpp_driver-test --gtest_filter CppCassandraDriverTest.TransactionalWrite ``` Reviewers: rthallam, esheng, arybochkin Reviewed By: arybochkin Subscribers: bogdan, ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D29414
Consider backporting all the way to 2.14 after a weeks worth of bake time. |
yugabyte-ci
added
priority/high
High Priority
and removed
priority/medium
Medium priority issue
labels
Nov 15, 2023
amitanandaiyer
added a commit
that referenced
this issue
Dec 20, 2023
Summary: Original commit: 56f165c / D29092 1) Support tracing UpdateConsensus api 2) define collect_update_consensus_traces This flag needs to be enabled to collect the traces for UpdateConsensus from the remote follower. 3) Add a trace msg when appended to the local log The trace output looks like ``` [ts-1] I1005 11:29:53.233863 1840902144 async_rpc.cc:192] Write(tablet: 4904acc47aac4f89a828e23987afef25, num_ops: 1, num_attempts: 1, txn: 36b494ed-2d70-425e-97de-037b3b69ccb6, subtxn: [none]) took 14174us. Trace: [ts-1] 1005 11:29:53.219773 (+ 0us) async_rpc.cc:550] WriteRpc initiated [ts-1] 1005 11:29:53.219786 (+ 13us) async_rpc.cc:202] SendRpc() called. [ts-1] 1005 11:29:53.219787 (+ 1us) tablet_rpc.cc:113] SelectTabletServer() [ts-1] 1005 11:29:53.219794 (+ 7us) async_rpc.cc:626] SendRpcToTserver [ts-1] 1005 11:29:53.219837 (+ 43us) async_rpc.cc:631] RpcDispatched Asynchronously [ts-1] 1005 11:29:53.221422 (+ 1585us) tablet_rpc.cc:320] Done(OK) [ts-1] 1005 11:29:53.221430 (+ 8us) async_rpc.cc:456] ProcessResponseFromTserver(OK) [ts-1] 1005 11:29:53.221431 (+ 1us) async_rpc.cc:458] Received from server: [ts-1] BEGIN [ts-1] 1005 11:29:53.219815 (+ 0us) inbound_call.cc:76] Created InboundCall [ts-1] 1005 11:29:53.219828 (+ 13us) service_pool.cc:182] Inserting onto call queue [ts-1] 1005 11:29:53.219855 (+ 27us) service_pool.cc:260] Handling call Write [ts-1] 1005 11:29:53.219868 (+ 13us) tablet_service.cc:1958] Start Write [ts-1] 1005 11:29:53.219909 (+ 41us) tablet.cc:1883] AcquireLocksAndPerformDocOperations [ts-1] 1005 11:29:53.219910 (+ 1us) tablet.cc:3405] Acquiring write permit [ts-1] 1005 11:29:53.219915 (+ 5us) tablet.cc:3406] Acquiring write permit done [ts-1] 1005 11:29:53.220031 (+ 116us) shared_lock_manager.cc:321] Locking a batch of 2 keys [ts-1] 1005 11:29:53.220042 (+ 11us) shared_lock_manager.cc:337] Acquired a lock batch of 2 keys [ts-1] 1005 11:29:53.220053 (+ 11us) conflict_resolution.cc:1305] ResolveTransactionConflicts [ts-1] 1005 11:29:53.220149 (+ 96us) conflict_resolution.cc:337] No conflicts. [ts-1] 1005 11:29:53.220247 (+ 98us) write_query.cc:599] TransactionalConflictsResolved [ts-1] 1005 11:29:53.220252 (+ 5us) conflict_resolution.cc:1331] resolver->Resolve done [ts-1] .. Related trace: [ts-1] .. 1005 11:29:53.220235 (+ 0us) operation_driver.cc:175] ExecuteAsync [ts-1] .. 1005 11:29:53.220269 (+ 34us) operation_driver.cc:235] Begin PrepareAndStart [ts-1] .. 1005 11:29:53.220413 (+ 144us) log_cache.cc:273] Appended up to 2 the local log [ts-1] .. 1005 11:29:53.220453 (+ 40us) raft_consensus.cc:1396] Marking majority replicated up to 1.1 [ts-1] .. 1005 11:29:53.221135 (+ 682us) consensus_peers.cc:511] Received response from 231be5482e9344b09245b3c6f1ff77e8: [ts-1] BEGIN UpdateConsensus [ts-1] 1005 11:29:53.220607 (+ 0us) inbound_call.cc:76] Created InboundCall [ts-1] 1005 11:29:53.220623 (+ 16us) service_pool.cc:182] Inserting onto call queue [ts-1] 1005 11:29:53.220647 (+ 24us) service_pool.cc:260] Handling call UpdateConsensus [ts-1] 1005 11:29:53.220719 (+ 72us) raft_consensus.cc:1942] Updating replica for 1 ops [ts-1] 1005 11:29:53.220736 (+ 17us) raft_consensus.cc:2053] Early marking committed up to 0.0 [ts-1] 1005 11:29:53.220737 (+ 1us) raft_consensus.cc:2061] Triggering prepare for 1 ops [ts-1] 1005 11:29:53.220830 (+ 93us) raft_consensus.cc:2233] Updating last received op as 1.2 [ts-1] 1005 11:29:53.220833 (+ 3us) raft_consensus.cc:2246] Marking committed up to 1.1 [ts-1] 1005 11:29:53.220834 (+ 1us) raft_consensus.cc:2251] Filling consensus response to leader. [ts-1] 1005 11:29:53.220835 (+ 1us) raft_consensus.cc:2033] UpdateReplica() finished [ts-1] 1005 11:29:53.220837 (+ 2us) raft_consensus.cc:2169] Waiting on the replicates to finish logging [ts-1] 1005 11:29:53.220899 (+ 62us) log_cache.cc:273] Appended up to 2 the local log [ts-1] 1005 11:29:53.220909 (+ 10us) raft_consensus.cc:2194] Finished waiting on the replicates to finish logging [ts-1] .. Related trace: [ts-1] .. 1005 11:29:53.220774 (+ 0us) operation_driver.cc:175] ExecuteAsync [ts-1] .. 1005 11:29:53.220776 (+ 2us) operation_driver.cc:235] Begin PrepareAndStart [ts-1] [ts-1] END UpdateConsensus [ts-1] .. 1005 11:29:53.221184 (+ 49us) raft_consensus.cc:1396] Marking majority replicated up to 1.2 [ts-1] .. 1005 11:29:53.221187 (+ 3us) replica_state.cc:883] Begin ApplyPendingOperationsUnlocked [ts-1] .. 1005 11:29:53.221190 (+ 3us) operation_driver.cc:318] Begin ReplicationFinished() call. [ts-1] .. 1005 11:29:53.221192 (+ 2us) write_operation.cc:77] APPLY: Starting [ts-1] .. 1005 11:29:53.221296 (+ 104us) write_operation.cc:95] FINISH: making edits visible [ts-1] .. 1005 11:29:53.221307 (+ 11us) shared_lock_manager.cc:361] Unlocking a batch of 2 keys [ts-1] .. 1005 11:29:53.221316 (+ 9us) tablet_service.cc:366] Write completing with status OK [ts-1] .. .. Related trace: [ts-1] .. .. 1005 11:29:53.220409 (+ 0us) outbound_call.cc:232] yb.consensus.ConsensusService.UpdateConsensus. [ts-1] .. .. 1005 11:29:53.220444 (+ 35us) reactor.cc:938] Scheduled. [ts-1] .. .. 1005 11:29:53.220478 (+ 34us) outbound_call.cc:573] Queued. [ts-1] .. .. 1005 11:29:53.220501 (+ 23us) outbound_call.cc:583] Call Sent. [ts-1] .. .. 1005 11:29:53.221111 (+ 610us) outbound_call.cc:525] Response received. [ts-1] .. .. 1005 11:29:53.221124 (+ 13us) outbound_call.cc:429] Callback will be called asynchronously. [ts-1] .. .. Related trace: [ts-1] .. .. 1005 11:29:53.221153 (+ 0us) outbound_call.cc:232] yb.consensus.ConsensusService.UpdateConsensus. [ts-1] .. .. 1005 11:29:53.221190 (+ 37us) reactor.cc:938] Scheduled. [ts-1] .. .. 1005 11:29:53.221214 (+ 24us) outbound_call.cc:573] Queued. [ts-1] .. .. 1005 11:29:53.221231 (+ 17us) outbound_call.cc:583] Call Sent. [ts-1] END. [ts-1] .. Related trace: [ts-1] .. 1005 11:29:53.219800 (+ 0us) outbound_call.cc:232] yb.tserver.TabletServerService.Write. [ts-1] .. 1005 11:29:53.219809 (+ 9us) local_call.cc:42] LocalOutboundCall [ts-1] .. 1005 11:29:53.219811 (+ 2us) outbound_call.cc:573] Queued. [ts-1] .. 1005 11:29:53.219812 (+ 1us) outbound_call.cc:583] Call Sent. [ts-1] .. 1005 11:29:53.221417 (+ 1605us) outbound_call.cc:429] Callback will be called asynchronously. ``` Upgrade/Rollback safety: Should be ok to upgrade/rollback. Only new/optional fields are added in this change. Protobuf's default handling of ignoring the fields (at an older server) works fine. The trace buffer will be returned only if both leader and the follower have this change. Otherwise, the `request.trace_requested` will be ignored. Jira: DB-8207 Test Plan: ``` export YB_EXTRA_TSERVER_FLAGS="--collect_end_to_end_traces=true --ybclient_print_trace_every_n=1 --collect_update_consensus_traces=true" ybd --cxx-test cassandra_cpp_driver-test --gtest_filter CppCassandraDriverTest.TransactionalWrite ``` Reviewers: rthallam, esheng, arybochkin Reviewed By: rthallam Subscribers: bogdan, ybase Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D30191
amitanandaiyer
added a commit
that referenced
this issue
Dec 21, 2023
Summary: Original commit: 56f165c / D29092 1) Support tracing UpdateConsensus api 2) define collect_update_consensus_traces This flag needs to be enabled to collect the traces for UpdateConsensus from the remote follower. 3) Add a trace msg when appended to the local log The trace output looks like ``` [ts-1] I1005 11:29:53.233863 1840902144 async_rpc.cc:192] Write(tablet: 4904acc47aac4f89a828e23987afef25, num_ops: 1, num_attempts: 1, txn: 36b494ed-2d70-425e-97de-037b3b69ccb6, subtxn: [none]) took 14174us. Trace: [ts-1] 1005 11:29:53.219773 (+ 0us) async_rpc.cc:550] WriteRpc initiated [ts-1] 1005 11:29:53.219786 (+ 13us) async_rpc.cc:202] SendRpc() called. [ts-1] 1005 11:29:53.219787 (+ 1us) tablet_rpc.cc:113] SelectTabletServer() [ts-1] 1005 11:29:53.219794 (+ 7us) async_rpc.cc:626] SendRpcToTserver [ts-1] 1005 11:29:53.219837 (+ 43us) async_rpc.cc:631] RpcDispatched Asynchronously [ts-1] 1005 11:29:53.221422 (+ 1585us) tablet_rpc.cc:320] Done(OK) [ts-1] 1005 11:29:53.221430 (+ 8us) async_rpc.cc:456] ProcessResponseFromTserver(OK) [ts-1] 1005 11:29:53.221431 (+ 1us) async_rpc.cc:458] Received from server: [ts-1] BEGIN [ts-1] 1005 11:29:53.219815 (+ 0us) inbound_call.cc:76] Created InboundCall [ts-1] 1005 11:29:53.219828 (+ 13us) service_pool.cc:182] Inserting onto call queue [ts-1] 1005 11:29:53.219855 (+ 27us) service_pool.cc:260] Handling call Write [ts-1] 1005 11:29:53.219868 (+ 13us) tablet_service.cc:1958] Start Write [ts-1] 1005 11:29:53.219909 (+ 41us) tablet.cc:1883] AcquireLocksAndPerformDocOperations [ts-1] 1005 11:29:53.219910 (+ 1us) tablet.cc:3405] Acquiring write permit [ts-1] 1005 11:29:53.219915 (+ 5us) tablet.cc:3406] Acquiring write permit done [ts-1] 1005 11:29:53.220031 (+ 116us) shared_lock_manager.cc:321] Locking a batch of 2 keys [ts-1] 1005 11:29:53.220042 (+ 11us) shared_lock_manager.cc:337] Acquired a lock batch of 2 keys [ts-1] 1005 11:29:53.220053 (+ 11us) conflict_resolution.cc:1305] ResolveTransactionConflicts [ts-1] 1005 11:29:53.220149 (+ 96us) conflict_resolution.cc:337] No conflicts. [ts-1] 1005 11:29:53.220247 (+ 98us) write_query.cc:599] TransactionalConflictsResolved [ts-1] 1005 11:29:53.220252 (+ 5us) conflict_resolution.cc:1331] resolver->Resolve done [ts-1] .. Related trace: [ts-1] .. 1005 11:29:53.220235 (+ 0us) operation_driver.cc:175] ExecuteAsync [ts-1] .. 1005 11:29:53.220269 (+ 34us) operation_driver.cc:235] Begin PrepareAndStart [ts-1] .. 1005 11:29:53.220413 (+ 144us) log_cache.cc:273] Appended up to 2 the local log [ts-1] .. 1005 11:29:53.220453 (+ 40us) raft_consensus.cc:1396] Marking majority replicated up to 1.1 [ts-1] .. 1005 11:29:53.221135 (+ 682us) consensus_peers.cc:511] Received response from 231be5482e9344b09245b3c6f1ff77e8: [ts-1] BEGIN UpdateConsensus [ts-1] 1005 11:29:53.220607 (+ 0us) inbound_call.cc:76] Created InboundCall [ts-1] 1005 11:29:53.220623 (+ 16us) service_pool.cc:182] Inserting onto call queue [ts-1] 1005 11:29:53.220647 (+ 24us) service_pool.cc:260] Handling call UpdateConsensus [ts-1] 1005 11:29:53.220719 (+ 72us) raft_consensus.cc:1942] Updating replica for 1 ops [ts-1] 1005 11:29:53.220736 (+ 17us) raft_consensus.cc:2053] Early marking committed up to 0.0 [ts-1] 1005 11:29:53.220737 (+ 1us) raft_consensus.cc:2061] Triggering prepare for 1 ops [ts-1] 1005 11:29:53.220830 (+ 93us) raft_consensus.cc:2233] Updating last received op as 1.2 [ts-1] 1005 11:29:53.220833 (+ 3us) raft_consensus.cc:2246] Marking committed up to 1.1 [ts-1] 1005 11:29:53.220834 (+ 1us) raft_consensus.cc:2251] Filling consensus response to leader. [ts-1] 1005 11:29:53.220835 (+ 1us) raft_consensus.cc:2033] UpdateReplica() finished [ts-1] 1005 11:29:53.220837 (+ 2us) raft_consensus.cc:2169] Waiting on the replicates to finish logging [ts-1] 1005 11:29:53.220899 (+ 62us) log_cache.cc:273] Appended up to 2 the local log [ts-1] 1005 11:29:53.220909 (+ 10us) raft_consensus.cc:2194] Finished waiting on the replicates to finish logging [ts-1] .. Related trace: [ts-1] .. 1005 11:29:53.220774 (+ 0us) operation_driver.cc:175] ExecuteAsync [ts-1] .. 1005 11:29:53.220776 (+ 2us) operation_driver.cc:235] Begin PrepareAndStart [ts-1] [ts-1] END UpdateConsensus [ts-1] .. 1005 11:29:53.221184 (+ 49us) raft_consensus.cc:1396] Marking majority replicated up to 1.2 [ts-1] .. 1005 11:29:53.221187 (+ 3us) replica_state.cc:883] Begin ApplyPendingOperationsUnlocked [ts-1] .. 1005 11:29:53.221190 (+ 3us) operation_driver.cc:318] Begin ReplicationFinished() call. [ts-1] .. 1005 11:29:53.221192 (+ 2us) write_operation.cc:77] APPLY: Starting [ts-1] .. 1005 11:29:53.221296 (+ 104us) write_operation.cc:95] FINISH: making edits visible [ts-1] .. 1005 11:29:53.221307 (+ 11us) shared_lock_manager.cc:361] Unlocking a batch of 2 keys [ts-1] .. 1005 11:29:53.221316 (+ 9us) tablet_service.cc:366] Write completing with status OK [ts-1] .. .. Related trace: [ts-1] .. .. 1005 11:29:53.220409 (+ 0us) outbound_call.cc:232] yb.consensus.ConsensusService.UpdateConsensus. [ts-1] .. .. 1005 11:29:53.220444 (+ 35us) reactor.cc:938] Scheduled. [ts-1] .. .. 1005 11:29:53.220478 (+ 34us) outbound_call.cc:573] Queued. [ts-1] .. .. 1005 11:29:53.220501 (+ 23us) outbound_call.cc:583] Call Sent. [ts-1] .. .. 1005 11:29:53.221111 (+ 610us) outbound_call.cc:525] Response received. [ts-1] .. .. 1005 11:29:53.221124 (+ 13us) outbound_call.cc:429] Callback will be called asynchronously. [ts-1] .. .. Related trace: [ts-1] .. .. 1005 11:29:53.221153 (+ 0us) outbound_call.cc:232] yb.consensus.ConsensusService.UpdateConsensus. [ts-1] .. .. 1005 11:29:53.221190 (+ 37us) reactor.cc:938] Scheduled. [ts-1] .. .. 1005 11:29:53.221214 (+ 24us) outbound_call.cc:573] Queued. [ts-1] .. .. 1005 11:29:53.221231 (+ 17us) outbound_call.cc:583] Call Sent. [ts-1] END. [ts-1] .. Related trace: [ts-1] .. 1005 11:29:53.219800 (+ 0us) outbound_call.cc:232] yb.tserver.TabletServerService.Write. [ts-1] .. 1005 11:29:53.219809 (+ 9us) local_call.cc:42] LocalOutboundCall [ts-1] .. 1005 11:29:53.219811 (+ 2us) outbound_call.cc:573] Queued. [ts-1] .. 1005 11:29:53.219812 (+ 1us) outbound_call.cc:583] Call Sent. [ts-1] .. 1005 11:29:53.221417 (+ 1605us) outbound_call.cc:429] Callback will be called asynchronously. ``` Upgrade/Rollback safety: Should be ok to upgrade/rollback. Only new/optional fields are added in this change. Protobuf's default handling of ignoring the fields (at an older server) works fine. The trace buffer will be returned only if both leader and the follower have this change. Otherwise, the `request.trace_requested` will be ignored. Jira: DB-8207 Changes from what's on master: Changes to make things compile w/o LW protos Test Plan: ``` export YB_EXTRA_TSERVER_FLAGS="--collect_end_to_end_traces=true --ybclient_print_trace_every_n=1 --collect_update_consensus_traces=true" ybd --cxx-test cassandra_cpp_driver-test --gtest_filter CppCassandraDriverTest.TransactionalWrite ``` Reviewers: rthallam, esheng, arybochkin Reviewed By: rthallam Subscribers: ybase, bogdan Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D30192
amitanandaiyer
added a commit
that referenced
this issue
Dec 21, 2023
Summary: Original commit: 56f165c / D29092 1) Support tracing UpdateConsensus api 2) define collect_update_consensus_traces This flag needs to be enabled to collect the traces for UpdateConsensus from the remote follower. 3) Add a trace msg when appended to the local log The trace output looks like ``` [ts-1] I1005 11:29:53.233863 1840902144 async_rpc.cc:192] Write(tablet: 4904acc47aac4f89a828e23987afef25, num_ops: 1, num_attempts: 1, txn: 36b494ed-2d70-425e-97de-037b3b69ccb6, subtxn: [none]) took 14174us. Trace: [ts-1] 1005 11:29:53.219773 (+ 0us) async_rpc.cc:550] WriteRpc initiated [ts-1] 1005 11:29:53.219786 (+ 13us) async_rpc.cc:202] SendRpc() called. [ts-1] 1005 11:29:53.219787 (+ 1us) tablet_rpc.cc:113] SelectTabletServer() [ts-1] 1005 11:29:53.219794 (+ 7us) async_rpc.cc:626] SendRpcToTserver [ts-1] 1005 11:29:53.219837 (+ 43us) async_rpc.cc:631] RpcDispatched Asynchronously [ts-1] 1005 11:29:53.221422 (+ 1585us) tablet_rpc.cc:320] Done(OK) [ts-1] 1005 11:29:53.221430 (+ 8us) async_rpc.cc:456] ProcessResponseFromTserver(OK) [ts-1] 1005 11:29:53.221431 (+ 1us) async_rpc.cc:458] Received from server: [ts-1] BEGIN [ts-1] 1005 11:29:53.219815 (+ 0us) inbound_call.cc:76] Created InboundCall [ts-1] 1005 11:29:53.219828 (+ 13us) service_pool.cc:182] Inserting onto call queue [ts-1] 1005 11:29:53.219855 (+ 27us) service_pool.cc:260] Handling call Write [ts-1] 1005 11:29:53.219868 (+ 13us) tablet_service.cc:1958] Start Write [ts-1] 1005 11:29:53.219909 (+ 41us) tablet.cc:1883] AcquireLocksAndPerformDocOperations [ts-1] 1005 11:29:53.219910 (+ 1us) tablet.cc:3405] Acquiring write permit [ts-1] 1005 11:29:53.219915 (+ 5us) tablet.cc:3406] Acquiring write permit done [ts-1] 1005 11:29:53.220031 (+ 116us) shared_lock_manager.cc:321] Locking a batch of 2 keys [ts-1] 1005 11:29:53.220042 (+ 11us) shared_lock_manager.cc:337] Acquired a lock batch of 2 keys [ts-1] 1005 11:29:53.220053 (+ 11us) conflict_resolution.cc:1305] ResolveTransactionConflicts [ts-1] 1005 11:29:53.220149 (+ 96us) conflict_resolution.cc:337] No conflicts. [ts-1] 1005 11:29:53.220247 (+ 98us) write_query.cc:599] TransactionalConflictsResolved [ts-1] 1005 11:29:53.220252 (+ 5us) conflict_resolution.cc:1331] resolver->Resolve done [ts-1] .. Related trace: [ts-1] .. 1005 11:29:53.220235 (+ 0us) operation_driver.cc:175] ExecuteAsync [ts-1] .. 1005 11:29:53.220269 (+ 34us) operation_driver.cc:235] Begin PrepareAndStart [ts-1] .. 1005 11:29:53.220413 (+ 144us) log_cache.cc:273] Appended up to 2 the local log [ts-1] .. 1005 11:29:53.220453 (+ 40us) raft_consensus.cc:1396] Marking majority replicated up to 1.1 [ts-1] .. 1005 11:29:53.221135 (+ 682us) consensus_peers.cc:511] Received response from 231be5482e9344b09245b3c6f1ff77e8: [ts-1] BEGIN UpdateConsensus [ts-1] 1005 11:29:53.220607 (+ 0us) inbound_call.cc:76] Created InboundCall [ts-1] 1005 11:29:53.220623 (+ 16us) service_pool.cc:182] Inserting onto call queue [ts-1] 1005 11:29:53.220647 (+ 24us) service_pool.cc:260] Handling call UpdateConsensus [ts-1] 1005 11:29:53.220719 (+ 72us) raft_consensus.cc:1942] Updating replica for 1 ops [ts-1] 1005 11:29:53.220736 (+ 17us) raft_consensus.cc:2053] Early marking committed up to 0.0 [ts-1] 1005 11:29:53.220737 (+ 1us) raft_consensus.cc:2061] Triggering prepare for 1 ops [ts-1] 1005 11:29:53.220830 (+ 93us) raft_consensus.cc:2233] Updating last received op as 1.2 [ts-1] 1005 11:29:53.220833 (+ 3us) raft_consensus.cc:2246] Marking committed up to 1.1 [ts-1] 1005 11:29:53.220834 (+ 1us) raft_consensus.cc:2251] Filling consensus response to leader. [ts-1] 1005 11:29:53.220835 (+ 1us) raft_consensus.cc:2033] UpdateReplica() finished [ts-1] 1005 11:29:53.220837 (+ 2us) raft_consensus.cc:2169] Waiting on the replicates to finish logging [ts-1] 1005 11:29:53.220899 (+ 62us) log_cache.cc:273] Appended up to 2 the local log [ts-1] 1005 11:29:53.220909 (+ 10us) raft_consensus.cc:2194] Finished waiting on the replicates to finish logging [ts-1] .. Related trace: [ts-1] .. 1005 11:29:53.220774 (+ 0us) operation_driver.cc:175] ExecuteAsync [ts-1] .. 1005 11:29:53.220776 (+ 2us) operation_driver.cc:235] Begin PrepareAndStart [ts-1] [ts-1] END UpdateConsensus [ts-1] .. 1005 11:29:53.221184 (+ 49us) raft_consensus.cc:1396] Marking majority replicated up to 1.2 [ts-1] .. 1005 11:29:53.221187 (+ 3us) replica_state.cc:883] Begin ApplyPendingOperationsUnlocked [ts-1] .. 1005 11:29:53.221190 (+ 3us) operation_driver.cc:318] Begin ReplicationFinished() call. [ts-1] .. 1005 11:29:53.221192 (+ 2us) write_operation.cc:77] APPLY: Starting [ts-1] .. 1005 11:29:53.221296 (+ 104us) write_operation.cc:95] FINISH: making edits visible [ts-1] .. 1005 11:29:53.221307 (+ 11us) shared_lock_manager.cc:361] Unlocking a batch of 2 keys [ts-1] .. 1005 11:29:53.221316 (+ 9us) tablet_service.cc:366] Write completing with status OK [ts-1] .. .. Related trace: [ts-1] .. .. 1005 11:29:53.220409 (+ 0us) outbound_call.cc:232] yb.consensus.ConsensusService.UpdateConsensus. [ts-1] .. .. 1005 11:29:53.220444 (+ 35us) reactor.cc:938] Scheduled. [ts-1] .. .. 1005 11:29:53.220478 (+ 34us) outbound_call.cc:573] Queued. [ts-1] .. .. 1005 11:29:53.220501 (+ 23us) outbound_call.cc:583] Call Sent. [ts-1] .. .. 1005 11:29:53.221111 (+ 610us) outbound_call.cc:525] Response received. [ts-1] .. .. 1005 11:29:53.221124 (+ 13us) outbound_call.cc:429] Callback will be called asynchronously. [ts-1] .. .. Related trace: [ts-1] .. .. 1005 11:29:53.221153 (+ 0us) outbound_call.cc:232] yb.consensus.ConsensusService.UpdateConsensus. [ts-1] .. .. 1005 11:29:53.221190 (+ 37us) reactor.cc:938] Scheduled. [ts-1] .. .. 1005 11:29:53.221214 (+ 24us) outbound_call.cc:573] Queued. [ts-1] .. .. 1005 11:29:53.221231 (+ 17us) outbound_call.cc:583] Call Sent. [ts-1] END. [ts-1] .. Related trace: [ts-1] .. 1005 11:29:53.219800 (+ 0us) outbound_call.cc:232] yb.tserver.TabletServerService.Write. [ts-1] .. 1005 11:29:53.219809 (+ 9us) local_call.cc:42] LocalOutboundCall [ts-1] .. 1005 11:29:53.219811 (+ 2us) outbound_call.cc:573] Queued. [ts-1] .. 1005 11:29:53.219812 (+ 1us) outbound_call.cc:583] Call Sent. [ts-1] .. 1005 11:29:53.221417 (+ 1605us) outbound_call.cc:429] Callback will be called asynchronously. ``` Upgrade/Rollback safety: Should be ok to upgrade/rollback. Only new/optional fields are added in this change. Protobuf's default handling of ignoring the fields (at an older server) works fine. The trace buffer will be returned only if both leader and the follower have this change. Otherwise, the `request.trace_requested` will be ignored. Jira: DB-8207 do not use _RUNTIME_ macro in 2.14. No LW proto Test Plan: ``` export YB_EXTRA_TSERVER_FLAGS="--collect_end_to_end_traces=true --ybclient_print_trace_every_n=1 --collect_update_consensus_traces=true" ybd --cxx-test cassandra_cpp_driver-test --gtest_filter CppCassandraDriverTest.TransactionalWrite ``` Reviewers: rthallam, esheng, arybochkin Reviewed By: rthallam Subscribers: bogdan, ybase Differential Revision: https://phorge.dev.yugabyte.com/D30193
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
2.14 Backport Required
2.16 Backport Required
2.18 Backport Required
2.20 Backport Required
area/docdb
YugabyteDB core features
kind/enhancement
This is an enhancement of an existing feature
priority/high
High Priority
Jira Link: DB-8207
Description
Support tracing UpdateConsensus calls
Issue Type
kind/enhancement
Warning: Please confirm that this issue does not contain any sensitive information
The text was updated successfully, but these errors were encountered: