
ReadTimeoutException on the client while performing multiple Add/Drop column commands over a long period of time #15312

temichus opened this issue Sep 7, 2023 · 65 comments

@temichus
Contributor

temichus commented Sep 7, 2023

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

New longevity test that performs only the new add/drop column nemesis. PR: scylladb/scylla-cluster-tests#6530

longevity test config:
test_duration: 260

prepare_write_cmd:  "cassandra-stress write cl=QUORUM n=20971520 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=80 -pop seq=1..20971520 -col 'n=FIXED(10) size=FIXED(512)' -log interval=5"

stress_cmd: ["cassandra-stress write cl=QUORUM duration=180m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=40 -pop 'dist=uniform(1..20971520)' -col 'n=FIXED(10) size=FIXED(512)' -log interval=5",
             "cassandra-stress read  cl=QUORUM duration=180m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=40 -pop 'dist=uniform(1..20971520)' -col 'n=FIXED(10) size=FIXED(512)' -log interval=5"
             ]

n_db_nodes: 3
n_loaders: 2
n_monitor_nodes: 1
round_robin: true

instance_type_db: 'i4i.xlarge'

nemesis_class_name: 'ParallelAddDropColumnMonkey'
nemesis_interval: 5
nemesis_filter_seeds: false

seeds_num: 1

user_prefix: 'longevity-ParallelAddDropColumnMonkey-3h'

Impact

After about 2 hours of running, seastar_io_queue_delay is more than 1s and the client gets the error:

java.io.IOException: Operation x10 on key(s) [50333439344b4d385030]: Error executing: (ReadTimeoutException): Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded). In case this was generated during read repair, the consistency level is not representative of the actual consistency.

image
According to @gleb-cloudius: "It should be a million times smaller."

How frequently does it reproduce?

  • In 5 runs on i4i.xlarge nodes: 4 failed, 1 passed.
    seastar_io_queue_delay in the passed run (4bc31d25-f6c4-4c39-95f1-b0f5e95db8b9):
    image

  • On i3en.large nodes: 100% of runs failed (according to @gleb-cloudius: "The instance has small memory (16G) so it probably flushes memtables a lot.")

Comments from email correspondence with developers

@xemul:
"There's a known issue found in the scheduler (seastar#1641, and a work
around discussion at seastar#1766) so they could be better. Starvation
time is too large, but it can be due to that issue."

Installation details

Kernel Version: 5.15.0-1043-aws
Scylla version (or git commit hash): 5.4.0~dev-20230904.9a3d57256a60 with build-id 4c60ca8c4afe862855242a77b77b35a377e373dc

Cluster size: 3 nodes (i4i.xlarge)

Scylla Nodes used in this run:

  • longevity-ParallelAddDropColumnMonk-db-node-527c69a1-3 (54.246.205.180 | 10.4.0.244) (shards: 4)
  • longevity-ParallelAddDropColumnMonk-db-node-527c69a1-2 (34.246.193.241 | 10.4.1.30) (shards: 4)
  • longevity-ParallelAddDropColumnMonk-db-node-527c69a1-1 (34.243.195.118 | 10.4.2.208) (shards: 4)

OS / Image: ami-09a5684fbf0955c84 (aws: undefined_region)

Test: longevity-schema-changes-3h-test
Test id: 527c69a1-d03d-42ba-9b38-eba7b4e782c8
Test name: scylla-staging/artsiom_mishuta/longevity-schema-changes-3h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 527c69a1-d03d-42ba-9b38-eba7b4e782c8
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 527c69a1-d03d-42ba-9b38-eba7b4e782c8

Logs:

Jenkins job URL
Argus

@mykaul
Contributor

mykaul commented Sep 7, 2023

@xemul - can you please take a look?

@xemul
Contributor

xemul commented Sep 7, 2023

io-properties:

disks:
- mountpoint: /var/lib/scylla
  read_bandwidth: 763580096
  read_iops: 109954
  write_bandwidth: 561926784
  write_iops: 61008

So the disk can do roughly {700MB/s, 100k IOPS} of reads and {500MB/s, 60k IOPS} of writes.

Commitlog consumes 200MB/s / 2k IOPS (writes)
Compaction consumes 500MB/s / 5k IOPS (reads + writes); we can assume it splits roughly half and half
Memtable flush consumes 200MB/s / 2k IOPS (writes)
Statement (query) consumes 50MB/s / 10k IOPS (reads)
Summary: 650MB/s / 6.5k IOPS of writes and 300MB/s / 12.5k IOPS of reads

That's at the edge of the disk's abilities; it just cannot (well -- not allowed to by the scheduler) do more.

@xemul
Contributor

xemul commented Sep 7, 2023

scylladb/seastar#1641 still applies, though, and it does make things worse

@xemul
Contributor

xemul commented Sep 7, 2023

Some nodes consume all the available CPU
image

@xemul
Contributor

xemul commented Sep 7, 2023

Also, streaming is causing trouble by "stalling" CPUs with quota violations, which is bad; it makes the IO scheduler's throttling notably worse
image
but it doesn't do much IO according to its IO metrics (it's almost idling)
image

@xemul
Contributor

xemul commented Sep 7, 2023

And the last worrisome thing is IO starvation times

image
image

They are pretty big; with such huge queue length values it should've been much smaller. But it can be due to scylladb/seastar#1641.

@kostja
Contributor

kostja commented Sep 21, 2023

@temichus please re-try with the latest master; the problem should be addressed by Kamil's fix for schema versioning.

@kbr-scylla
Contributor

@temichus also please start Scylla with --logger-log-level migration_manager=trace in the next run

@temichus
Contributor Author

temichus commented Sep 25, 2023

@kostja @kbr-scylla

The issue was reproduced (5 runs, 4 failures).
As requested, the test was run with append_scylla_args: --logger-log-level migration_manager=trace

Installation details

Kernel Version: 5.15.0-1045-aws
Scylla version (or git commit hash): 5.4.0~dev-20230921.a56a4b6226e6 with build-id 616f734e7c7fb5e3ee8898792b3c415d2574a132

Cluster size: 3 nodes (i4i.xlarge)

Scylla Nodes used in this run:

  • longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-3 (34.245.135.223 | 10.4.3.118) (shards: 4)
  • longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-2 (3.250.193.98 | 10.4.2.192) (shards: 4)
  • longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1 (52.209.146.47 | 10.4.3.22) (shards: 4)

OS / Image: ami-00f051bf1c684c01a (aws: undefined_region)

Test: longevity-schema-changes-3h-test
Test id: 4c9412b4-9dc5-443c-bb43-883f990f9283
Test name: scylla-staging/artsiom_mishuta/longevity-schema-changes-3h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 4c9412b4-9dc5-443c-bb43-883f990f9283
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 4c9412b4-9dc5-443c-bb43-883f990f9283

Logs:

Jenkins job URL
Argus

@kostja
Contributor

kostja commented Sep 25, 2023

@kbr-scylla

@kbr-scylla
Contributor

I'm looking at the logs.

One thing I noticed is that you don't provide original Scylla node logs with millisecond resolution timestamps like this:

INFO  2023-09-26 14:28:28,315 [shard 0:stre] schema_tables - Schema version changed to 322b57d2-5c68-11ee-cd0e-cb7b52c37710

The milliseconds would be useful for my investigation here, TBH... but maybe I'll find something out without them.

@kbr-scylla
Contributor

Why don't the metrics have read timeout data?
Screenshot from 2023-09-26 15-21-08
cc @amnonh

@gleb-cloudius
Contributor

Why don't the metrics have read timeout data? Screenshot from 2023-09-26 15-21-08 cc @amnonh

Maybe it's because we no longer send zero metrics?

@kbr-scylla
Contributor

They should be nonzero though, after all we're getting ReadTimeoutExceptions.
Also, for example, hint metrics look like this:
Screenshot from 2023-09-26 15-26-41

@kbr-scylla
Contributor

I don't have read timeout metrics, but I have coordinator read p99 latency metrics. Sometimes those reach 2s. These events don't seem to correlate with schema read barriers. For example, around 13:48:15 - 13:48:30, on 10.4.3.118 shard 1:
Screenshot from 2023-09-26 15-54-19

Around this time I didn't see any "Performing read barrier" / "Mark schema ... as synced" logs on this node. The closest barrier messages are:

2023-09-25T13:40:34+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-3    !DEBUG | scylla[5362]:  [shard 2:stat] migration_manager - Mark schema 1a573eb0-5ba9-11ee-a9cb-13ebead80e4f as synced

before, and

2023-09-25T13:50:28+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-3    !DEBUG | scylla[5362]:  [shard 1:stat] migration_manager - Performing raft read barrier because schema is not synced, version: 7c605500-5baa-11ee-a07d-a112f1832725

after.

I only looked at a few such data points, though.

Also, it seems that 2s is for some reason a favorite value for this metric. Why 2s in particular? Is it somehow related to the write timeout setting, which is 2s? But we're looking at reads here, which have a 5s timeout setting. Maybe it's read repair, and it's using 2s for the internal write?

@kbr-scylla
Contributor

I don't have read timeout metrics

Also the stressor app doesn't provide timestamps when the timeouts happened. Very convenient.

@kbr-scylla
Contributor

OK, there's this one event where read p99 latency jumps to >5s on 10.4.3.22, so above the timeout threshold:
Screenshot from 2023-09-26 16-03-22

The data points seem to be gathered every ~15 seconds... If I understand correctly, these high-latency reads were observed anywhere between 15:07:00 and 15:07:30.

And indeed, we can see some read barriers finishing at 15:07:00:

2023-09-25T15:07:00+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1    !DEBUG | scylla[5299]:  [shard 2:stat] migration_manager - Mark schema 2c75cd30-5bb5-11ee-b4f0-8218bc6e52ca as synced

and there is a batch of Performing raft read barrier messages before this batch of Mark schema..., but that batch starts at 15:06:58, so 2 seconds earlier. So even if some barriers took 1 or 2 seconds, they don't explain the 5s latency.

@kbr-scylla
Contributor

Lots of messages like these around this time:

2023-09-25T15:07:00+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1  !WARNING | scylla[5299]:  [shard 0:stat] querier - Read 15 live rows and 2473 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)
2023-09-25T15:07:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1    !DEBUG | scylla[5299]:  [shard 0:stre] migration_manager - Applying schema mutations from 10.4.2.192:0
2023-09-25T15:07:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1  !WARNING | scylla[5299]:  [shard 0:stre] querier - Read 15 live rows and 2473 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)
2023-09-25T15:07:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1  !WARNING | scylla[5299]:  [shard 0:stre] querier - Read 15 live rows and 2473 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)
2023-09-25T15:07:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1  !WARNING | scylla[5299]:  [shard 0:stre] querier - Read 14 live rows and 2474 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)
2023-09-25T15:07:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1  !WARNING | scylla[5299]:  [shard 0:stre] querier - Read 14 live rows and 2474 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)

looks like these reads are happening during schema merging. cc @tgrabiec

@kbr-scylla
Contributor

kbr-scylla commented Sep 26, 2023

and there is a batch of Performing raft read barrier messages before this batch of Mark schema..., but that batch starts at 15:06:58, so 2 seconds earlier. So even if some barriers took 1 or 2 seconds, they don't explain the 5s latency.

That's only on the coordinator node though, and the read requires a quorum. Perhaps the two other nodes are waiting on barriers.

Indeed I found some very long running barriers on 10.4.3.118 during this time:

2023-09-25T15:06:44+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-3    !DEBUG | scylla[5362]:  [shard 0:stat] migration_manager - Performing raft read barrier because schema is not synced, version: 23b4be40-5bb5-11ee-a07d-a112f1832725

...

2023-09-25T15:06:59+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-3    !DEBUG | scylla[5362]:  [shard 0:stat] migration_manager - Mark schema 23b4be40-5bb5-11ee-a07d-a112f1832725 as synced

That's 15 seconds to do a read barrier!

I didn't find such a long barrier on the third node (10.4.2.192), so theoretically we should have a QUORUM... But who knows; the situation is complex when we consider speculative reads. Maybe the metric used to decide on speculative reads was already high, we picked 10.4.3.118 for the read, and by the time we decided to do the speculative read to the other replica it was too late.
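For reference, the coordinator-side speculative reads mentioned above are governed by the table's speculative_retry option. A minimal sketch of tuning it (the keyspace/table names are the cassandra-stress defaults and are used here only for illustration):

-- hypothetical example: have the coordinator issue a speculative read to another
-- replica once the original request exceeds the table's p99 latency estimate
ALTER TABLE keyspace1.standard1 WITH speculative_retry = '99.0PERCENTILE';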

@kbr-scylla
Contributor

A read barrier latency metric could be useful perhaps...

@mykaul
Contributor

mykaul commented Sep 26, 2023

That's 15 seconds to do a read barrier!

Do we want to print in the log how long it took to sync the schema? (instead / in addition to a metric? I think a log entry is more useful?)

@gleb-cloudius
Contributor

Try looking at the read_quorum and read_quorum_reply metrics. Maybe there is something interesting there.

@kbr-scylla
Contributor

Screenshot from 2023-09-26 17-00-11

I used sum(rate(scylla_raft_messages_received{message_type="read_quorum", instance=~"[[node]]|^$", shard=~"[[shard]]|$^"}[1m])) etc.
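A per-node breakdown of the same counters can usually be had by grouping on the instance label; a sketch, assuming the label names used by the default Scylla monitoring dashboards:

sum by (instance) (rate(scylla_raft_messages_received{message_type="read_quorum"}[1m]))
sum by (instance) (rate(scylla_raft_messages_received{message_type="read_quorum_reply"}[1m]))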

@kbr-scylla
Contributor

I'm trying to build the metric so it shows a value for every node, but apparently I can't figure out how to do that in Grafana.

@kbr-scylla
Contributor

Screenshot from 2023-09-26 17-05-23
The node sending read_quorum and receiving read_quorum_reply is .22

@kbr-scylla
Contributor

Around 15:07:00

Screenshot from 2023-09-26 17-07-09

the spike is at 15:07:00

@tgrabiec
Contributor

Lots of messages like these around this time:

2023-09-25T15:07:00+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1  !WARNING | scylla[5299]:  [shard 0:stat] querier - Read 15 live rows and 2473 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)
2023-09-25T15:07:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1    !DEBUG | scylla[5299]:  [shard 0:stre] migration_manager - Applying schema mutations from 10.4.2.192:0
2023-09-25T15:07:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1  !WARNING | scylla[5299]:  [shard 0:stre] querier - Read 15 live rows and 2473 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)
2023-09-25T15:07:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1  !WARNING | scylla[5299]:  [shard 0:stre] querier - Read 15 live rows and 2473 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)
2023-09-25T15:07:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1  !WARNING | scylla[5299]:  [shard 0:stre] querier - Read 14 live rows and 2474 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)
2023-09-25T15:07:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1  !WARNING | scylla[5299]:  [shard 0:stre] querier - Read 14 live rows and 2474 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)

looks like these reads are happening during schema merging. cc @tgrabiec

That is sort of expected for such a workload.

We could improve it by expiring tombstones sooner with Raft: by using tombstone GC mode = repair and treating state machine catch-up as repair.
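For context, the tombstone GC mode mentioned here is exposed as a per-table option; a minimal sketch of the user-facing form (table name is hypothetical). Applying it to the schema tables and treating Raft state-machine catch-up as repair would be internal changes, not something set via CQL:

-- illustrative only: tombstones become eligible for purging once the data has been repaired
ALTER TABLE keyspace1.standard1 WITH tombstone_gc = {'mode': 'repair'};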

@mykaul
Contributor

mykaul commented Sep 27, 2023

We need to deal with this. After all there are databases out there which sync every write, and they give reasonable performance. I don't think it's acceptable that Scylla only gives reasonable performance in commitlog periodic mode.

I've never heard of such databases. Everyone has their own tricks to overcome the 'when and how frequently do I sync' problem.

@kbr-scylla
Contributor

Are you saying that no databases give actual durability guarantees?
By "sync every write" I didn't mean there's a 1-1 correspondence between writes and syncs, but that every write is covered by a sync before the database ACKs it. Tricks like batching multiple writes into a single sync are OK as long as we make sure we don't ACK before syncing to disk (AFAIR Scylla also batches).
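For reference, the commitlog sync behaviour being discussed is controlled by the commitlog_sync options in scylla.yaml; a rough sketch (values are the usual defaults, not this test's configuration):

# batch mode: a write is acked only after the commitlog segment is synced to disk
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2

# periodic mode (alternative): ack immediately, sync in the background every period
# commitlog_sync: periodic
# commitlog_sync_period_in_ms: 10000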

@gleb-cloudius
Contributor

I wonder if we have per commitlog metrics.

@xemul looked at related metrics for earlier run #15312 (comment)

I know we have commitlog metrics, of course. But now we have two commitlogs, so my question is whether we can see metrics for each one separately. The one for statements may starve while the one for the raft/schema/local tables is fine.

@mykaul
Contributor

mykaul commented Sep 27, 2023

Are you saying that no databases give actual durability guarantees? By sync every write I didn't mean there's 1-1 correspondence between writes and syncs, but that every write is covered by a sync before the database ACKs. Tricks like batching multiple writes into a single sync are ok as long as we make sure that we don't ack before syncing to disk (AFAIR Scylla is also batching).

Example - https://docs.portworx.com/portworx-enterprise/concepts/io-profiles#the-db_remote-profile

@kbr-scylla
Contributor

I restored monitor stack again, but it looks like there is no data?
https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/630/console
http://44.204.248.220:3000/

@gleb-cloudius
Contributor

I restored monitor stack again, but it looks like there is no data? https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/630/console http://44.204.248.220:3000/

There is data for 23/09/06

@kbr-scylla
Contributor

Damn it, I restored the monitor from the original post.
Well, it's good enough too.

@kbr-scylla
Contributor

So here's an interesting turn of events

Apparently the schema commitlog is running in a separate scheduling group: the default scheduling group, called "main". The regular commitlog uses this code:

db::commitlog::config db::commitlog::config::from_db_config(const db::config& cfg, seastar::scheduling_group sg, size_t shard_available_memory) {
    config c;

    c.sched_group = std::move(sg);
future<>
database::init_commitlog() {
    if (_commitlog) {
        return make_ready_future<>();
    }

    return db::commitlog::create_commitlog(db::commitlog::config::from_db_config(_cfg, _dbcfg.commitlog_scheduling_group, _dbcfg.available_memory)).then([this](db::commitlog&& log) {

but the schema commitlog uses this:

void database::maybe_init_schema_commitlog() {
    assert(this_shard_id() == 0);

    if (!_feat.schema_commitlog && !_cfg.force_schema_commit_log()) {
        dblog.info("Not using schema commit log.");
        _listeners.push_back(_feat.schema_commitlog.when_enabled([] {
            dblog.warn("All nodes can now switch to use the schema commit log. Restart is needed for this to take effect.");
        }));
        _uses_schema_commitlog = false;
        return;
    }

    dblog.info("Using schema commit log.");
    _uses_schema_commitlog = true;

    db::commitlog::config c;
    c.commit_log_location = _cfg.schema_commitlog_directory();
    c.fname_prefix = db::schema_tables::COMMITLOG_FILENAME_PREFIX;
    c.metrics_category_name = "schema-commitlog";
    c.commitlog_total_space_in_mb = 2 * _cfg.schema_commitlog_segment_size_in_mb();
    c.commitlog_segment_size_in_mb = _cfg.schema_commitlog_segment_size_in_mb();
    c.mode = db::commitlog::sync_mode::BATCH;
    c.extensions = &_cfg.extensions();
    c.use_o_dsync = _cfg.commitlog_use_o_dsync();
    c.allow_going_over_size_limit = true; // for lower latency

    _schema_commitlog = std::make_unique<db::commitlog>(db::commitlog::create_commitlog(std::move(c)).get0());

c.sched_group is never set, so we're using the default one.
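If the intent were for the schema commitlog to run in the same scheduling group as the regular commitlog, the config would presumably need something along these lines; this is a hypothetical sketch based on the snippets above, not an actual patch:

    db::commitlog::config c;
    // ... existing schema-commitlog settings as above ...
    // hypothetical: route schema commitlog work into the commitlog scheduling group,
    // the way database::init_commitlog() does for the regular commitlog
    c.sched_group = _dbcfg.commitlog_scheduling_group;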

And from the metrics, the default scheduling group seems to be doing fine?
Screenshot from 2023-09-27 16-05-26

@kbr-scylla
Contributor

We need those raft=trace logs

@kbr-scylla
Contributor

According to @xemul, it could still be that we're waiting for the scheduler (scylladb/seastar#1641), but I don't know how to determine if that's the case.

@kbr-scylla
Contributor

I opened #15566

@temichus
Contributor Author

temichus commented Sep 27, 2023

migration_manager=trace:group0_state_machine=trace:raft=trace

@temichus sorry, it was supposed to be migration_manager=trace:group0_raft_sm=trace:raft=trace

@kbr-scylla

here it is:

run with append_scylla_args: --logger-log-level migration_manager=trace:group0_raft_sm=trace:raft=trace

Installation details

Kernel Version: 5.15.0-1045-aws
Scylla version (or git commit hash): 5.4.0~dev-20230927.0f22e8d196af with build-id 2c911e6e2b12c7d0c19f67f76711e0c1adfea3cb

Cluster size: 3 nodes (i4i.xlarge)

Scylla Nodes used in this run:

  • longevity-ParallelAddDropColumnMonk-db-node-fa6fb8ac-3 (34.244.177.244 | 10.4.2.20) (shards: 4)
  • longevity-ParallelAddDropColumnMonk-db-node-fa6fb8ac-2 (54.246.25.124 | 10.4.3.103) (shards: 4)
  • longevity-ParallelAddDropColumnMonk-db-node-fa6fb8ac-1 (34.244.157.242 | 10.4.2.3) (shards: 4)

OS / Image: ami-0c25786faf310fa10 (aws: undefined_region)

Test: longevity-schema-changes-3h-test
Test id: fa6fb8ac-a93b-4d4f-ba35-ea317dccb504
Test name: scylla-staging/artsiom_mishuta/longevity-schema-changes-3h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor fa6fb8ac-a93b-4d4f-ba35-ea317dccb504
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs fa6fb8ac-a93b-4d4f-ba35-ea317dccb504

Logs:

Jenkins job URL
Argus

@kbr-scylla
Contributor

During my investigation I identified multiple performance issues with group 0 read barriers (which indeed seem to be causing these timeouts) and created issues to track them more easily; it should be possible to tackle them one by one:
#15620
#15621
#15622

and those are probably not all the problems...

@kbr-scylla
Contributor

xemul added a commit to xemul/scylla that referenced this issue Oct 19, 2023
* seastar bab1625c...17183ed4 (73):
  > thread_pool: Reference reactor, not point to
  > sstring: inherit publicly from string_view formatter
  > circleci: use conditional steps
  > weak_ptr: include used header
  > build: disable the -Wunused-* warnings for checkheaders
  > resource: move variable into smaller lexical scope
  > resource: use structured binding when appropriate
  > httpd: Added server and client addresses to request structure
  > io_queue: do not dereference moved-away shared pointer
  > treewide: explicitly define ctor and assignment operator
  > memory: use `err` for the error string
  > doc: Add document describing all the math behind IO scheduler
  > io_queue: Add flow-rate based self slowdown backlink
  > io_queue: Make main throttler uncapped
  > io_queue: Add queue-wide metrics
  > io_queue: Introduce "flow monitor"
  > io_queue: Count total number of dispatched and completed requests so far
  > io_queue: Introduce io_group::io_latency_goal()
  > tests: test the vector overload for when_all_succeed
  > core: add a vector overload to when_all_succeed
  > loop: Fix iterator_range_estimate_vector_capacity for random iters
  > loop: Add test for iterator_range_estimate_vector_capacity
  > core/posix return old behaviour using non-portable pthread_attr_setaffinity_np when present
  > memory: s/throw()/noexcept/
  > build: enable -Wdeprecated compiler option
  > reactor: mark kernel_completion's dtor protected
  > tests: always wait for promise
  > http, json, net: define-generated copy ctor for polymorphic types
  > treewide: do not define constexpr static out-of-line
  > reactor: do not define dtor of kernel_completion
  > http/exception: stop using dynamic exception specification
  > metrics: replace vector with deque
  > metrics: change metadata vector to deque
  > utils/backtrace.hh: make simple_backtrace formattable
  > reactor: Unfriend disk_config_params
  > reactor: Move add_to_flush_poller() to internal namespace
  > reactor: Unfriend a bunch of sched group template calls
  > rpc_test: Test rpc send glitches
  > net: Implement batch flush support for existing sockets
  > iostream: Configure batch flushes if sink can do it
  > net: Added remote address accessors
  > circleci: update the image to CircleCI "standard" image
  > build: do not add header check target if no headers to check
  > build: pass target name to seastar_check_self_contained
  > build: detect glibc features using CMake
  > build: extract bits checking libc into CheckLibc.cmake
  > http/exception: add formatter for httpd::base_exception
  > http/client: Mark write_body() const
  > http/client: Introduce request::_bytes_written
  > http/client: Mark maybe_wait_for_continue() const
  > http/client: Mark send_request_head() const
  > http/client: Detach setup_request()
  > http/api_docs: copy in api_docs's copy constructor
  > script: do not inherit from object
  > scripts: addr2line: change StdinBacktraceIterator to a function
  > scripts: addr2line: use yield instead defining a class
  > tests: skip tests that require backtrace if execinfo.h is not found
  > backtrace: check for existence of execinfo.h
  > core: use ino_t and off_t as glibc sets these to 64bit if 64bit api is used
  > core: add sleep_abortable instantiation for manual_clock
  > tls: Return EPIPE exception when writing to shutdown socket
  > http/client: Don't cache connection if server advertises it
  > http/client: Mark connection as "keep in cache"
  > core: fix strerror_r usage from glibc extension
  > reactor: access sigevent.sigev_notify_thread_id with a macro
  > posix: use pthread_setaffinity_np instead of pthread_attr_setaffinity_np
  > reactor: replace __mode_t with mode_t
  > reactor: change sys/poll.h to posix poll.h
  > rpc: Add unit test for per-domain metrics
  > rpc: Report client connections metrics
  > rpc: Count dead client stats
  > rpc: Add seastar::rpc::metrics
  > rpc: Make public queues length getters

io-scheduler fixes
refs: scylladb#15312
refs: scylladb#11805

http client fixes
refs: scylladb#13736
refs: scylladb#15509

rpc fixes
refs: scylladb#15462
xemul added a commit to xemul/scylla that referenced this issue Oct 19, 2023
(same seastar update commit message as above)
avikivity pushed a commit that referenced this issue Oct 19, 2023
(same seastar update commit message as above)
Closes #15774
kostja added this to the 6.0 milestone Dec 15, 2023
@temichus
Contributor Author

temichus commented Dec 18, 2023

@kostja, @kbr-scylla
Retested on scylla_version: 5.5.0~dev-0.20231214.37868e5fdc62 with build-id 5b37b2b68d2bf07185a01ee0bcf85d2388c9eebf
test_id: a6f057a9-624c-4f83-ad9d-cfc29736d1b7
Issue reproduced.

seastar_io_queue_delay is even higher:
image

but cassandra-stress no longer reports the IOException; it now only reports:

com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during SIMPLE write query at consistency QUORUM (2 replica were required but only 1 acknowledged the write)

@kbr-scylla
Contributor

Thanks @temichus.

I'm aware that it's still failing. There are still some optimizations that have to be done inside migration_manager::get_schema_for_write. But right now I'm prioritizing the delivery of Raft topology over optimizing schema changes, so the latter thread is stuck.

The biggest issue was fixed (#15622) so it shouldn't have a large impact in real-world scenarios, where schema changes are rare.

@kbr-scylla
Contributor

@kostja I can tell you right now that it won't be done for 6.0.

@kostja
Contributor

kostja commented Dec 21, 2023

@kostja I can tell you right now that it won't be done for 6.0.

The part we wanted for 6.0 is in, so thank you.

kostja modified the milestones: 6.0, 6.1 Dec 21, 2023