Skip to content

2.25.0.0-b302

@basavaraj29 basavaraj29 tagged this 14 Nov 05:42
Summary:
In one of the customer environments, we saw a case where RBS threads get stuck on the client side with the following stack
```
<tr><td>rpc_tp_TabletServer-high-pri_13-231</td><td>6.940s</td><td>0.000s</td><td>0.000s</td><td rowspan="43"><pre>    @     0x7ffff6bafffc  __GI_nanosleep
    @          0x3b52cca  yb::RateLimiter::UpdateTimeSlotSizeAndMaybeSleep()
    @          0x37dad42  yb::tserver::RemoteBootstrapFileDownloader::DownloadFile<>()
    @          0x37d9f00  yb::tserver::RemoteBootstrapFileDownloader::DownloadFile()
    @          0x37d5dfe  yb::tserver::RemoteBootstrapClient::FetchAll()
    @          0x384f845  yb::tserver::TSTabletManager::StartRemoteBootstrap()
    @          0x3837b65  yb::tserver::ConsensusServiceImpl::StartRemoteBootstrap()
    @          0x2ee5482  std::__1::__function::__func<>::operator()()
    @          0x2eec415  yb::consensus::ConsensusServiceIf::Handle()
    @          0x3584537  yb::rpc::ServicePoolImpl::Handle()
    @          0x34d2ec4  yb::rpc::InboundCall::InboundCallTask::Run()
    @          0x359351c  yb::rpc::(anonymous namespace)::Worker::Execute()
    @          0x3b805d3  yb::Thread::SuperviseThread()
    @     0x7ffff66de693  start_thread
    @     0x7ffff6be041c  __clone
```

```
<tr><td>rpc_tp_TabletServer-high-pri_377-12804</td><td>979.670s</td><td>0.000s</td><td>0.000s</td><td rowspan="1"><pre>    @     0x7ffff6bafffc  __GI_nanosleep
    @          0x3b52cca  yb::RateLimiter::UpdateTimeSlotSizeAndMaybeSleep()
```

This diff enhances the visibility for any unexpected rate limiter sleep times that could have creeped in either due to math overflow/or other clock issues. Note that we don't see any possible math overflow issues though.
Jira: DB-13252

Test Plan:
Jenkins

Manually tested the logging part, can see the following.
RBS source side logs
```
I1109 14:06:45.567804 1885761536 rate_limiter.cc:111]  Current iteration stats: { start: 3903865.228s end: 3903865.228s data_size: 135993 target_rate: 134217728 time_slot_ms: 100 sleep_time_ms: 1 } Last few iteration stats: []
I1109 14:06:45.569423 1893216256 rate_limiter.cc:111]  Current iteration stats: { start: 3903865.229s end: 3903865.231s data_size: 16 target_rate: 134217728 time_slot_ms: 100 sleep_time_ms: 0 } Last few iteration stats: [{ start:
3903865.228s end: 3903865.228s data_size: 135993 target_rate: 134217728 time_slot_ms: 100 sleep_time_ms: 1 }]
I1109 14:06:45.570024 1892069376 rate_limiter.cc:111]  Current iteration stats: { start: 3903865.231s end: 3903865.231s data_size: 475 target_rate: 134217728 time_slot_ms: 200 sleep_time_ms: 0 } Last few iteration stats: [{ start:
3903865.228s end: 3903865.228s data_size: 135993 target_rate: 134217728 time_slot_ms: 100 sleep_time_ms: 1 }, { start: 3903865.229s end: 3903865.231s data_size: 16 target_rate: 134217728 time_slot_ms: 100 sleep_time_ms: 0 }]
I1109 14:06:45.773761 1895510016 rate_limiter.cc:111]  Current iteration stats: { start: 3903865.231s end: 3903865.237s data_size: 26843545 target_rate: 134217728 time_slot_ms: 200 sleep_time_ms: 194 } Last few iteration stats: [{ start:
3903865.228s end: 3903865.228s data_size: 135993 target_rate: 134217728 time_slot_ms: 100 sleep_time_ms: 1 }, { start: 3903865.229s end: 3903865.231s data_size: 16 target_rate: 134217728 time_slot_ms: 100 sleep_time_ms: 0 }, { start:
3903865.231s end: 3903865.231s data_size: 475 target_rate: 134217728 time_slot_ms: 200 sleep_time_ms: 0 }]
...
 ```

RBS client side logs. Note that the client side starts one rate limiter per file, while the server side the same rate limiter throughout the rbs session.
```
I1109 14:06:45.568485 1914236928 rate_limiter.cc:111]  Current iteration stats: { start: 3903865.228s end: 3903865.230s data_size: 136012 target_rate: 268435456 time_slot_ms: 100 sleep_time_ms: 0 } Last few iteration stats: []
I1109 14:06:45.569579 1914236928 rate_limiter.cc:111]  Current iteration stats: { start: 3903865.231s end: 3903865.231s data_size: 29 target_rate: 268435456 time_slot_ms: 100 sleep_time_ms: 0 } Last few iteration stats: []
I1109 14:06:45.570200 1914236928 rate_limiter.cc:111]  Current iteration stats: { start: 3903865.231s end: 3903865.232s data_size: 491 target_rate: 268435456 time_slot_ms: 100 sleep_time_ms: 0 } Last few iteration stats: []
I1109 14:06:45.866606 1914236928 rate_limiter.cc:111]  Current iteration stats: { start: 3903865.232s end: 3903865.528s data_size: 26843567 target_rate: 268435456 time_slot_ms: 100 sleep_time_ms: 0 } Last few iteration stats: []
I1109 14:06:45.959517 1914236928 rate_limiter.cc:111]  Current iteration stats: { start: 3903865.612s end: 3903865.621s data_size: 2275725 target_rate: 268435456 time_slot_ms: 200 sleep_time_ms: 0 } Last few iteration stats: [{ start:
3903865.232s end: 3903865.528s data_size: 26843567 target_rate: 268435456 time_slot_ms: 100 sleep_time_ms: 0 }]
```

Reviewers: amitanand, rthallam, slingam

Reviewed By: amitanand

Subscribers: svc_phabricator, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D39857
Assets 2
Loading