[Jepsen] During network partitions, CQL requests never time out #822

aphyr · 2019-01-30T21:08:12Z

During a network partition, YugaByte CE 1.1.10.0 does not appear to return immediate failures to client requests; instead, requests appear to time out after 10+ seconds. This behavior could pose problems for production clients: during a fault, increasing latencies by 1-2 orders of magnitude could tie up work queues or available concurrency on nodes, causing cascading failures. High latencies for operations on partitioned shards might also delay or starve operations on shards with a healthy majority component available, reducing goodput. Finally, timeouts are indeterminate results, which force clients to deal with increased ambiguity--did these requests succeed or fail? Clients might retry ambiguous failures multiple times, overloading a struggling cluster.

This plot shows the latency of client operations (writes, in this case) during a network partition isolating 2 nodes from 3 others in a 5-node Jepsen cluster. Time flows horizontally; latency is plotted vertically. Yellow operations are indeterminate results (e.g. timeouts), and pink operations are known failures. The grey regions indicate when a network partition was in effect.

This client has a 10-second timeout in effect, so we know YB's server latency is at least 10 seconds under these circumstances--it might be higher.

YugaByte could mitigate these issues by returning definite failure messages to clients immediately when a leader's lease has expired, no leader has been recently reachable, no leader is known, and so on, but continuing to make requests for a leader in the background. Once communication has been re-established, requests may flow again.
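To make the fail-fast suggestion concrete, here is a minimal sketch of the admission check a node could run before forwarding a request. Everything in it is hypothetical: `LeaderView`, `Admit`, and the `max_silence` threshold are illustrative names that do not come from YugaByte's codebase, and the background leader search is left out entirely.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical snapshot of what a node currently knows about a tablet's leader.
struct LeaderView {
  bool leader_known;               // has any leader been discovered?
  Clock::time_point lease_expiry;  // when the known leader's lease runs out
  Clock::time_point last_contact;  // last successful exchange with the leader
};

enum class Admission { kForward, kFailFast };

// Decide whether to forward a request or reject it right away with a
// definite failure. Leader discovery would continue in the background
// (not shown), so later calls see a fresher LeaderView.
Admission Admit(const LeaderView& v, Clock::time_point now,
                std::chrono::milliseconds max_silence) {
  assert(max_silence.count() >= 0);
  if (!v.leader_known) return Admission::kFailFast;        // no leader is known
  if (now >= v.lease_expiry) return Admission::kFailFast;  // lease has expired
  if (now - v.last_contact > max_silence) {
    return Admission::kFailFast;                           // leader not recently reachable
  }
  return Admission::kForward;
}
```

A rejected request gives the client a definite failure rather than an indeterminate timeout, so it can retry or shed load without guessing whether the write was applied.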

aphyr · 2019-02-01T23:56:41Z

A bit more data here: I dug into the Cassandra client timeouts, and discovered that these 12-second timeouts were client-imposed. In fact, Yugabyte's CQL layer refuses to time out any client request--even with up to 500 seconds of network partitions, leader elections, node failures, etc.

This looks to be because the CQL layer specifies MonoTime::Max() for timeouts: https://github.com/YugaByte/yugabyte-db/blob/0274a1c229b8e358d096c7010931402123345de2/src/yb/yql/cql/cqlserver/cql_rpc.cc#L296
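As an illustration of why a maximal deadline never fires, the sketch below uses `std::chrono` as a stand-in for YugaByte's `MonoTime`; `Expired` and `FiniteDeadline` are hypothetical helpers, and the 60-second figure in the usage note is only an example value.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// A deadline has passed once the current time reaches it.
bool Expired(Clock::time_point deadline, Clock::time_point now) {
  return now >= deadline;
}

// A finite deadline derived from the request's start time and a timeout.
Clock::time_point FiniteDeadline(Clock::time_point start,
                                 std::chrono::milliseconds timeout) {
  assert(timeout.count() >= 0);
  return start + timeout;
}
```

With `Clock::time_point::max()` as the deadline, `Expired` stays false 500 seconds into a partition; with `FiniteDeadline(start, std::chrono::milliseconds(60000))` it becomes true after one minute.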

kmuthukk · 2019-02-02T01:13:24Z

Hi @robertpang, @spolitov

While the request between YQL and TServer (at the ybclient layer) does time out based on client_read_write_timeout_ms, @m-iancu noticed that NeedsRestart() returning true seems to keep the request being retried forever.

https://github.com/YugaByte/yugabyte-db/blob/0274a1c229b8e358d096c7010931402123345de2/src/yb/yql/cql/ql/exec/executor.cc#L1966

Thoughts?

Here's a simple repro that @amitanandaiyer had:

Create a 3-node yb-ctl cluster, then run the following in cqlsh:

create keyspace x; 
CREATE TABLE x.test (hk int, pk1 int, pk2 int, payload int, PRIMARY KEY ((hk), pk1, pk2));
insert into x.test (hk, pk1, pk2, payload) values (1, 2, 3, 4);
select * from x.test;

Now kill 2 tservers (ideally ts2 and ts3, so that cqlsh is still talking to ts1) and issue the insert again:

insert into x.test (hk, pk1, pk2, payload) values (1, 2, 3, 4);

This will hang forever, which can be verified at 127.0.0.1:12000/rpcz. cqlsh will time out, but the request keeps showing up in /rpcz until the dead tablet servers are brought back.

kmuthukk · 2019-02-02T16:03:15Z

@robertpang

Given that there are retries in the layer between YQL and TServer, and timeout enforced by the client_read_write_timeout_ms gflag, in the executor layer's NeedsRestart logic:

https://github.com/YugaByte/yugabyte-db/blob/8c921ebe73988be3778f0c9029fa8b8567cfd4ba/src/yb/yql/cql/ql/exec/executor.cc#L1206

do we need to restart on a timeout again?

Could we change the NeedsRestart() logic from:

return s.IsTryAgain() || s.IsExpired() || s.IsTimedOut();

to:

return s.IsTryAgain() || s.IsExpired();

?
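The effect of the proposed change can be shown with a small self-contained model. The `Status` struct here is a hypothetical stand-in exposing only the three predicates quoted above, not YugaByte's real status class, and the `cap` argument exists only so that the demonstration terminates.

```cpp
#include <cassert>

// Hypothetical stand-in for the status type used by the executor.
enum class Code { kOk, kTryAgain, kExpired, kTimedOut };

struct Status {
  Code code;
  bool IsTryAgain() const { return code == Code::kTryAgain; }
  bool IsExpired() const { return code == Code::kExpired; }
  bool IsTimedOut() const { return code == Code::kTimedOut; }
};

// Current logic: a timeout triggers yet another restart.
bool NeedsRestartOld(const Status& s) {
  return s.IsTryAgain() || s.IsExpired() || s.IsTimedOut();
}

// Proposed logic: a timeout is final at the executor layer.
bool NeedsRestartNew(const Status& s) {
  return s.IsTryAgain() || s.IsExpired();
}

// Counts how often a request that always times out gets executed,
// stopping after `cap` executions so the old logic cannot loop forever.
template <typename Pred>
int ExecutionsUntilDone(Pred needs_restart, int cap) {
  assert(cap > 0);
  const Status always_timeout{Code::kTimedOut};
  int runs = 0;
  do {
    ++runs;  // one execution of the request, which times out
  } while (needs_restart(always_timeout) && runs < cap);
  return runs;
}
```

In this model, `ExecutionsUntilDone(NeedsRestartOld, cap)` always runs into the cap, while `ExecutionsUntilDone(NeedsRestartNew, cap)` stops after a single execution.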

…ut retries already happen at ybclient layer

Summary: Even when a client request (e.g., from cqlsh) times out, the request is retried forever in the ql/exec/executor.cc layer, and it keeps being retried in the system (until, e.g., the partition heals).

Timeout-related restarts are already handled at the ybclient RPC layer (used for YQL to TServer communication). Each request is already tried and retried for an overall `client_read_write_timeout_ms` amount of time (60s default), with each individual RPC's default timeout being `retryable_rpc_single_call_timeout_ms` (default 2500ms).

Given the above, the NeedsRestart() logic in the executor layer shouldn't also need to consider timeouts as a reason for restart.

Test Plan: Look for test failures. Added new test.

Reviewers: kannan, sergei, mihnea, robert

Reviewed By: robert

Differential Revision: https://phabricator.dev.yugabyte.com/D6097
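For a rough sense of the retry budget those two defaults imply, the sketch below works through the arithmetic. It is a simplification that assumes every call uses its full per-call timeout with no backoff between attempts; `MaxAttempts` and `NextCallTimeout` are hypothetical helpers, and the actual ybclient retry policy may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using std::chrono::milliseconds;

// Upper bound on attempts when each one runs for its full per-call timeout.
int MaxAttempts(milliseconds overall, milliseconds per_call) {
  assert(per_call.count() > 0);
  // Ceiling division: a final attempt may be cut short by the overall deadline.
  return static_cast<int>((overall.count() + per_call.count() - 1) /
                          per_call.count());
}

// Timeout for the next attempt: the per-call timeout, clamped to whatever
// remains of the overall budget (zero once the budget is spent).
milliseconds NextCallTimeout(milliseconds elapsed, milliseconds overall,
                             milliseconds per_call) {
  if (elapsed >= overall) return milliseconds(0);
  return std::min(per_call, overall - elapsed);
}
```

With the stated defaults, `MaxAttempts(milliseconds(60000), milliseconds(2500))` gives 24, so under these assumptions a request is attempted at most 24 times at the ybclient layer before the overall timeout is reported upward.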

kmuthukk · 2019-02-06T16:36:57Z

Fixed in 630955b

kmuthukk added the kind/bug This issue is a bug label Feb 2, 2019

kmuthukk added this to To do in Jepsen Testing via automation Feb 2, 2019

kmuthukk added this to To Do in YBase features via automation Feb 2, 2019

kmuthukk assigned spolitov and robertpang Feb 2, 2019

kmuthukk added kind/enhancement This is an enhancement of an existing feature and removed kind/bug This issue is a bug labels Feb 2, 2019

aphyr changed the title ~~Requests time out during network partitions, instead of failing fast~~ During network partitions, CQL requests never time out Feb 6, 2019

kmuthukk assigned amitanandaiyer and unassigned spolitov Feb 6, 2019

kmuthukk closed this as completed Feb 6, 2019

YBase features automation moved this from To Do to Done Feb 6, 2019

Jepsen Testing automation moved this from To do to Done Feb 6, 2019

mbautin changed the title ~~During network partitions, CQL requests never time out~~ [Jepsen] During network partitions, CQL requests never time out Feb 7, 2019

kmuthukk mentioned this issue Mar 19, 2019

[YCQL] - A write heavy workload has stalled the cluster #1025

Closed

yugabyte-ci added the community/request Issues created by external users label Jul 17, 2019

yugabyte-ci unassigned robertpang Jul 17, 2019
