
ReadTimeoutException on the client while performing multiple Add/Drop column commands over a long period of time #15312

temichus opened this issue Sep 7, 2023 · 65 comments

@temichus
Contributor

temichus commented Sep 7, 2023

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

New longevity test that performs only the new add/drop column nemesis. PR: scylladb/scylla-cluster-tests#6530

longevity test config:
test_duration: 260

prepare_write_cmd:  "cassandra-stress write cl=QUORUM n=20971520 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=80 -pop seq=1..20971520 -col 'n=FIXED(10) size=FIXED(512)' -log interval=5"

stress_cmd: ["cassandra-stress write cl=QUORUM duration=180m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=40 -pop 'dist=uniform(1..20971520)' -col 'n=FIXED(10) size=FIXED(512)' -log interval=5",
             "cassandra-stress read  cl=QUORUM duration=180m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=40 -pop 'dist=uniform(1..20971520)' -col 'n=FIXED(10) size=FIXED(512)' -log interval=5"
             ]

n_db_nodes: 3
n_loaders: 2
n_monitor_nodes: 1
round_robin: true

instance_type_db: 'i4i.xlarge'

nemesis_class_name: 'ParallelAddDropColumnMonkey'
nemesis_interval: 5
nemesis_filter_seeds: false

seeds_num: 1

user_prefix: 'longevity-ParallelAddDropColumnMonkey-3h'

Impact

After about 2 hours of running, seastar_io_queue_delay is more than 1s and the client gets the error:

java.io.IOException: Operation x10 on key(s) [50333439344b4d385030]: Error executing: (ReadTimeoutException): Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded). In case this was generated during read repair, the consistency level is not representative of the actual consistency.

image
According to @gleb-cloudius: "It should be a million times smaller."

How frequently does it reproduce?

  • In 5 runs on i4i.xlarge nodes: 4 failed, 1 passed.
    seastar_io_queue_delay in the passed run (4bc31d25-f6c4-4c39-95f1-b0f5e95db8b9):
    image

  • On i3en.large nodes: 100% of runs failed (according to @gleb-cloudius: "The instance has small memory (16G) so it probably flushes memtables a lot.")

Comments from email correspondence with developers

@xemul:
"There's a known issue found in the scheduler (seastar#1641, and a work
around discussion at seastar#1766) so they could be better. Starvation
time is too large, but it can be due to that issue."

Installation details

Kernel Version: 5.15.0-1043-aws
Scylla version (or git commit hash): 5.4.0~dev-20230904.9a3d57256a60 with build-id 4c60ca8c4afe862855242a77b77b35a377e373dc

Cluster size: 3 nodes (i4i.xlarge)

Scylla Nodes used in this run:

  • longevity-ParallelAddDropColumnMonk-db-node-527c69a1-3 (54.246.205.180 | 10.4.0.244) (shards: 4)
  • longevity-ParallelAddDropColumnMonk-db-node-527c69a1-2 (34.246.193.241 | 10.4.1.30) (shards: 4)
  • longevity-ParallelAddDropColumnMonk-db-node-527c69a1-1 (34.243.195.118 | 10.4.2.208) (shards: 4)

OS / Image: ami-09a5684fbf0955c84 (aws: undefined_region)

Test: longevity-schema-changes-3h-test
Test id: 527c69a1-d03d-42ba-9b38-eba7b4e782c8
Test name: scylla-staging/artsiom_mishuta/longevity-schema-changes-3h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 527c69a1-d03d-42ba-9b38-eba7b4e782c8
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 527c69a1-d03d-42ba-9b38-eba7b4e782c8

Logs:

Jenkins job URL
Argus

@mykaul
Contributor

mykaul commented Sep 7, 2023

@xemul - can you please take a look?

@xemul
Contributor

xemul commented Sep 7, 2023

io-properties:

disks:
- mountpoint: /var/lib/scylla
  read_bandwidth: 763580096
  read_iops: 109954
  write_bandwidth: 561926784
  write_iops: 61008

So the disk can do roughly {700MB/s, 100k IOPS} of reads and {500MB/s, 60k IOPS} of writes.

Commitlog consumes 200MB/s / 2k IOPS (writes)
Compaction consumes 500MB/s / 5k IOPS (reads + writes); we can assume it splits roughly half and half
Memtable flush consumes 200MB/s / 2k IOPS (writes)
Statement (query) consumes 50MB/s / 10k IOPS (reads)
Summary: 650MB/s / 6.5k IOPS of writes and 300MB/s / 12.5k IOPS of reads

That's at the edge of the disk's abilities; it just cannot (well -- not allowed to by the scheduler) do more.

@xemul
Contributor

xemul commented Sep 7, 2023

scylladb/seastar#1641 still applies, though, and it does make things worse

@xemul
Contributor

xemul commented Sep 7, 2023

Some nodes consume all the available CPU
image

@xemul
Contributor

xemul commented Sep 7, 2023

Also, streaming is causing trouble by "stalling" CPUs with quota violations, which is bad; it makes the IO scheduler's throttling notably worse
image
but it doesn't do much IO according to its IO metrics (it's almost idling)
image

@xemul
Contributor

xemul commented Sep 7, 2023

And the last worrisome thing is IO starvation times

image
image

They are pretty big; with such huge queue length values it should've been much smaller. But it can be due to scylladb/seastar#1641.

@kostja
Contributor

kostja commented Sep 21, 2023

@temichus please re-try with the latest master; the problem should be addressed by Kamil's fix for schema versioning.

@kbr-scylla
Contributor

@temichus also please start Scylla with --logger-log-level migration_manager=trace in the next run

@temichus
Contributor Author

temichus commented Sep 25, 2023

@kostja @kbr-scylla

The issue was reproduced (5 runs, 4 failures).
As requested, the test was run with append_scylla_args: --logger-log-level migration_manager=trace

Installation details

Kernel Version: 5.15.0-1045-aws
Scylla version (or git commit hash): 5.4.0~dev-20230921.a56a4b6226e6 with build-id 616f734e7c7fb5e3ee8898792b3c415d2574a132

Cluster size: 3 nodes (i4i.xlarge)

Scylla Nodes used in this run:

  • longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-3 (34.245.135.223 | 10.4.3.118) (shards: 4)
  • longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-2 (3.250.193.98 | 10.4.2.192) (shards: 4)
  • longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1 (52.209.146.47 | 10.4.3.22) (shards: 4)

OS / Image: ami-00f051bf1c684c01a (aws: undefined_region)

Test: longevity-schema-changes-3h-test
Test id: 4c9412b4-9dc5-443c-bb43-883f990f9283
Test name: scylla-staging/artsiom_mishuta/longevity-schema-changes-3h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 4c9412b4-9dc5-443c-bb43-883f990f9283
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 4c9412b4-9dc5-443c-bb43-883f990f9283

Logs:

Jenkins job URL
Argus

@kostja
Contributor

kostja commented Sep 25, 2023

@kbr-scylla

@kbr-scylla
Contributor

I'm looking at the logs.

One thing I noticed is that you don't provide original Scylla node logs with millisecond resolution timestamps like this:

INFO  2023-09-26 14:28:28,315 [shard 0:stre] schema_tables - Schema version changed to 322b57d2-5c68-11ee-cd0e-cb7b52c37710

The milliseconds would be useful for my investigation here, TBH... but maybe I'll find something out without them.

@kbr-scylla
Contributor

Why don't the metrics have read timeout data?
Screenshot from 2023-09-26 15-21-08
cc @amnonh

@gleb-cloudius
Contributor

Why don't the metrics have read timeout data? Screenshot from 2023-09-26 15-21-08 cc @amnonh

Maybe it's because we no longer send zero metrics?

@kbr-scylla
Contributor

They should be nonzero though, after all we're getting ReadTimeoutExceptions.
Also, for example, hint metrics look like this:
Screenshot from 2023-09-26 15-26-41

@kbr-scylla
Contributor

I don't have read timeout metrics, but I have coordinator read p99 latency metrics. Sometimes those reach 2s. These events don't seem to correlate with schema read barriers. For example, around 13:48:15 - 13:48:30, on 10.4.3.118 shard 1:
Screenshot from 2023-09-26 15-54-19

Around this time I didn't see any "Performing read barrier" / "Mark schema ... as synced" logs on this node. The closest barrier messages are:

2023-09-25T13:40:34+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-3    !DEBUG | scylla[5362]:  [shard 2:stat] migration_manager - Mark schema 1a573eb0-5ba9-11ee-a9cb-13ebead80e4f as synced

before, and

2023-09-25T13:50:28+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-3    !DEBUG | scylla[5362]:  [shard 1:stat] migration_manager - Performing raft read barrier because schema is not synced, version: 7c605500-5baa-11ee-a07d-a112f1832725

after.

I only looked at a few such data points, though.

Also, it seems that 2s is for some reason a favorite value for this metric. Why 2s in particular? Is it somehow related to the write timeout setting, which is 2s? But we're looking at reads here, which have a 5s timeout setting. Maybe it's read repair, and it's using 2s for the internal write?

@kbr-scylla
Contributor

I don't have read timeout metrics

Also the stressor app doesn't provide timestamps when the timeouts happened. Very convenient.

@kbr-scylla
Contributor

OK, there's this one event where read p99 latency jumps to >5s on 10.4.3.22, so above the timeout threshold:
Screenshot from 2023-09-26 16-03-22

The data points seem to be gathered every ~15 seconds... If I understand correctly, these high-latency reads were observed anywhere between 15:07:00 and 15:07:30.

And indeed, we can see some read barriers finishing at 15:07:00:

2023-09-25T15:07:00+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1    !DEBUG | scylla[5299]:  [shard 2:stat] migration_manager - Mark schema 2c75cd30-5bb5-11ee-b4f0-8218bc6e52ca as synced

and there is a batch of Performing raft read barrier messages before this batch of Mark schema..., but that batch starts at 15:06:58, so 2 seconds earlier. So even if some barriers took 1 or 2 seconds, they don't explain the 5s latency.

@kbr-scylla
Contributor

Lots of messages like these around this time:

2023-09-25T15:07:00+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1  !WARNING | scylla[5299]:  [shard 0:stat] querier - Read 15 live rows and 2473 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)
2023-09-25T15:07:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1    !DEBUG | scylla[5299]:  [shard 0:stre] migration_manager - Applying schema mutations from 10.4.2.192:0
2023-09-25T15:07:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1  !WARNING | scylla[5299]:  [shard 0:stre] querier - Read 15 live rows and 2473 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)
2023-09-25T15:07:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1  !WARNING | scylla[5299]:  [shard 0:stre] querier - Read 15 live rows and 2473 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)
2023-09-25T15:07:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1  !WARNING | scylla[5299]:  [shard 0:stre] querier - Read 14 live rows and 2474 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)
2023-09-25T15:07:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1  !WARNING | scylla[5299]:  [shard 0:stre] querier - Read 14 live rows and 2474 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)

looks like these reads are happening during schema merging. cc @tgrabiec

@kbr-scylla
Contributor

kbr-scylla commented Sep 26, 2023

and there is a batch of Performing raft read barrier messages before this batch of Mark schema..., but that batch starts at 15:06:58, so 2 seconds earlier. So even if some barriers took 1 or 2 seconds, they don't explain the 5s latency.

That's only on the coordinator node though, and the read requires a quorum. Perhaps the two other nodes are waiting on barriers.

Indeed I found some very long running barriers on 10.4.3.118 during this time:

2023-09-25T15:06:44+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-3    !DEBUG | scylla[5362]:  [shard 0:stat] migration_manager - Performing raft read barrier because schema is not synced, version: 23b4be40-5bb5-11ee-a07d-a112f1832725

...

2023-09-25T15:06:59+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-3    !DEBUG | scylla[5362]:  [shard 0:stat] migration_manager - Mark schema 23b4be40-5bb5-11ee-a07d-a112f1832725 as synced

That's 15 seconds to do a read barrier!

I didn't find such a long barrier on the third node (10.4.2.192), so theoretically we should have a QUORUM... But who knows; the situation is complex when we consider speculative reads. Maybe the metric used to decide on speculative reads was already high, we picked 10.4.3.118 for the read, and by the time we decided to do the speculative read to the other replica it was too late.
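For reference, the coordinator-side speculative reads mentioned above are governed by the table's speculative_retry option. A minimal sketch of tuning it (the keyspace/table names are the cassandra-stress defaults and are used here only for illustration):

-- hypothetical example: have the coordinator issue a speculative read to another
-- replica once the original request exceeds the table's p99 latency estimate
ALTER TABLE keyspace1.standard1 WITH speculative_retry = '99.0PERCENTILE';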

@kbr-scylla
Contributor

A read barrier latency metric could be useful perhaps...

@mykaul
Contributor

mykaul commented Sep 26, 2023

That's 15 seconds to do a read barrier!

Do we want to print in the log how long it took to sync the schema? (instead / in addition to a metric? I think a log entry is more useful?)

@gleb-cloudius
Contributor

Try looking at the read_quorum and read_quorum_reply metrics. Maybe there is something interesting there.

@kbr-scylla
Contributor

Screenshot from 2023-09-26 17-00-11

I used sum(rate(scylla_raft_messages_received{message_type="read_quorum", instance=~"[[node]]|^$", shard=~"[[shard]]|$^"}[1m])) etc.
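A per-node breakdown of the same counters can usually be had by grouping on the instance label; a sketch, assuming the label names used by the default Scylla monitoring dashboards:

sum by (instance) (rate(scylla_raft_messages_received{message_type="read_quorum"}[1m]))
sum by (instance) (rate(scylla_raft_messages_received{message_type="read_quorum_reply"}[1m]))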

@kbr-scylla
Contributor

I'm trying to build the metric so it shows a value for every node, but apparently I can't figure out how to do that in Grafana.

@kbr-scylla
Contributor

Screenshot from 2023-09-26 17-05-23
The node sending read_quorum and receiving read_quorum_reply is .22

@kbr-scylla
Contributor

Around 15:07:00

Screenshot from 2023-09-26 17-07-09

the spike is at 15:07:00

@tgrabiec
Contributor

Lots of messages like these around this time:

2023-09-25T15:07:00+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1  !WARNING | scylla[5299]:  [shard 0:stat] querier - Read 15 live rows and 2473 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)
2023-09-25T15:07:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1    !DEBUG | scylla[5299]:  [shard 0:stre] migration_manager - Applying schema mutations from 10.4.2.192:0
2023-09-25T15:07:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1  !WARNING | scylla[5299]:  [shard 0:stre] querier - Read 15 live rows and 2473 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)
2023-09-25T15:07:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1  !WARNING | scylla[5299]:  [shard 0:stre] querier - Read 15 live rows and 2473 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)
2023-09-25T15:07:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1  !WARNING | scylla[5299]:  [shard 0:stre] querier - Read 14 live rows and 2474 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)
2023-09-25T15:07:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-4c9412b4-1  !WARNING | scylla[5299]:  [shard 0:stre] querier - Read 14 live rows and 2474 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)

looks like these reads are happening during schema merging. cc @tgrabiec

That is sort of expected for such a workload.

We could improve it by expiring tombstones sooner with Raft: by using tombstone GC mode = repair and treating state machine catch-up as repair.
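For context, the tombstone GC mode mentioned here is exposed as a per-table option; a minimal sketch of the user-facing form (table name is hypothetical). Applying it to the schema tables and treating Raft state-machine catch-up as repair would be internal changes, not something set via CQL:

-- illustrative only: tombstones become eligible for purging once the data has been repaired
ALTER TABLE keyspace1.standard1 WITH tombstone_gc = {'mode': 'repair'};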

@mykaul
Contributor

mykaul commented Sep 27, 2023

We need to deal with this. After all there are databases out there which sync every write, and they give reasonable performance. I don't think it's acceptable that Scylla only gives reasonable performance in commitlog periodic mode.

I've never heard of such databases. Everyone has their own tricks to overcome the 'when and how frequently do I sync' problem.

@kbr-scylla
Contributor

Are you saying that no databases give actual durability guarantees?
By "sync every write" I didn't mean there's a 1-1 correspondence between writes and syncs, but that every write is covered by a sync before the database ACKs it. Tricks like batching multiple writes into a single sync are OK as long as we make sure we don't ACK before syncing to disk (AFAIR Scylla also batches).
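For reference, the commitlog sync behaviour being discussed is controlled by the commitlog_sync options in scylla.yaml; a rough sketch (values are the usual defaults, not this test's configuration):

# batch mode: a write is acked only after the commitlog segment is synced to disk
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2

# periodic mode (alternative): ack immediately, sync in the background every period
# commitlog_sync: periodic
# commitlog_sync_period_in_ms: 10000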

@gleb-cloudius
Contributor

I wonder if we have per commitlog metrics.

@xemul looked at related metrics for earlier run #15312 (comment)

I know we have commitlog metrics, of course. But now we have two commitlogs, so my question is whether we can see metrics for each one separately. The one for statements may starve while the one for the raft/schema/local tables is fine.

@mykaul
Contributor

mykaul commented Sep 27, 2023

Are you saying that no databases give actual durability guarantees? By sync every write I didn't mean there's 1-1 correspondence between writes and syncs, but that every write is covered by a sync before the database ACKs. Tricks like batching multiple writes into a single sync are ok as long as we make sure that we don't ack before syncing to disk (AFAIR Scylla is also batching).

Example - https://docs.portworx.com/portworx-enterprise/concepts/io-profiles#the-db_remote-profile

@kbr-scylla
Contributor

I restored monitor stack again, but it looks like there is no data?
https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/630/console
http://44.204.248.220:3000/

@gleb-cloudius
Contributor

I restored monitor stack again, but it looks like there is no data? https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/630/console http://44.204.248.220:3000/

There is data for 23/09/06

@kbr-scylla
Contributor

Damn it, I restored the monitor from the original post.
Well, it's good enough too.

@kbr-scylla
Contributor

So here's an interesting turn of events

Apparently the schema commitlog is running in a separate scheduling group: the default scheduling group, called "main". The regular commitlog uses this code:

db::commitlog::config db::commitlog::config::from_db_config(const db::config& cfg, seastar::scheduling_group sg, size_t shard_available_memory) {
    config c;

    c.sched_group = std::move(sg);
future<>
database::init_commitlog() {
    if (_commitlog) {
        return make_ready_future<>();
    }

    return db::commitlog::create_commitlog(db::commitlog::config::from_db_config(_cfg, _dbcfg.commitlog_scheduling_group, _dbcfg.available_memory)).then([this](db::commitlog&& log) {

but the schema commitlog uses this:

void database::maybe_init_schema_commitlog() {
    assert(this_shard_id() == 0);

    if (!_feat.schema_commitlog && !_cfg.force_schema_commit_log()) {
        dblog.info("Not using schema commit log.");
        _listeners.push_back(_feat.schema_commitlog.when_enabled([] {
            dblog.warn("All nodes can now switch to use the schema commit log. Restart is needed for this to take effect.");
        }));
        _uses_schema_commitlog = false;
        return;
    }

    dblog.info("Using schema commit log.");
    _uses_schema_commitlog = true;

    db::commitlog::config c;
    c.commit_log_location = _cfg.schema_commitlog_directory();
    c.fname_prefix = db::schema_tables::COMMITLOG_FILENAME_PREFIX;
    c.metrics_category_name = "schema-commitlog";
    c.commitlog_total_space_in_mb = 2 * _cfg.schema_commitlog_segment_size_in_mb();
    c.commitlog_segment_size_in_mb = _cfg.schema_commitlog_segment_size_in_mb();
    c.mode = db::commitlog::sync_mode::BATCH;
    c.extensions = &_cfg.extensions();
    c.use_o_dsync = _cfg.commitlog_use_o_dsync();
    c.allow_going_over_size_limit = true; // for lower latency

    _schema_commitlog = std::make_unique<db::commitlog>(db::commitlog::create_commitlog(std::move(c)).get0());

c.sched_group is never set, so we're using the default one.
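If the intent were for the schema commitlog to run in the same scheduling group as the regular commitlog, the config would presumably need something along these lines; this is a hypothetical sketch based on the snippets above, not an actual patch:

    db::commitlog::config c;
    // ... existing schema-commitlog settings as above ...
    // hypothetical: route schema commitlog work into the commitlog scheduling group,
    // the way database::init_commitlog() does for the regular commitlog
    c.sched_group = _dbcfg.commitlog_scheduling_group;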

And from the metrics, the default scheduling group seems to be doing fine?
Screenshot from 2023-09-27 16-05-26

@kbr-scylla
Contributor

We need those raft=trace logs

@kbr-scylla
Contributor

According to @xemul, it could still be that we're waiting for the scheduler (scylladb/seastar#1641), but I don't know how to determine if that's the case.

@kbr-scylla
Contributor

I opened #15566

@temichus
Contributor Author

temichus commented Sep 27, 2023

migration_manager=trace:group0_state_machine=trace:raft=trace

@temichus sorry, it was supposed to be migration_manager=trace:group0_raft_sm=trace:raft=trace

@kbr-scylla

here it is:

run with append_scylla_args: --logger-log-level migration_manager=trace:group0_raft_sm=trace:raft=trace

Installation details

Kernel Version: 5.15.0-1045-aws
Scylla version (or git commit hash): 5.4.0~dev-20230927.0f22e8d196af with build-id 2c911e6e2b12c7d0c19f67f76711e0c1adfea3cb

Cluster size: 3 nodes (i4i.xlarge)

Scylla Nodes used in this run:

  • longevity-ParallelAddDropColumnMonk-db-node-fa6fb8ac-3 (34.244.177.244 | 10.4.2.20) (shards: 4)
  • longevity-ParallelAddDropColumnMonk-db-node-fa6fb8ac-2 (54.246.25.124 | 10.4.3.103) (shards: 4)
  • longevity-ParallelAddDropColumnMonk-db-node-fa6fb8ac-1 (34.244.157.242 | 10.4.2.3) (shards: 4)

OS / Image: ami-0c25786faf310fa10 (aws: undefined_region)

Test: longevity-schema-changes-3h-test
Test id: fa6fb8ac-a93b-4d4f-ba35-ea317dccb504
Test name: scylla-staging/artsiom_mishuta/longevity-schema-changes-3h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor fa6fb8ac-a93b-4d4f-ba35-ea317dccb504
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs fa6fb8ac-a93b-4d4f-ba35-ea317dccb504

Logs:

Jenkins job URL
Argus

@kbr-scylla
Contributor

During my investigation I identified multiple performance issues with group 0 read barriers (which indeed seem to be causing these timeouts) and created issues to track them more easily; it should be possible to tackle them one by one:
#15620
#15621
#15622

and those are probably not all the problems...

@kbr-scylla
Contributor

xemul added a commit to xemul/scylla that referenced this issue Oct 19, 2023
* seastar bab1625c...17183ed4 (73):
  > thread_pool: Reference reactor, not point to
  > sstring: inherit publicly from string_view formatter
  > circleci: use conditional steps
  > weak_ptr: include used header
  > build: disable the -Wunused-* warnings for checkheaders
  > resource: move variable into smaller lexical scope
  > resource: use structured binding when appropriate
  > httpd: Added server and client addresses to request structure
  > io_queue: do not dereference moved-away shared pointer
  > treewide: explicitly define ctor and assignment operator
  > memory: use `err` for the error string
  > doc: Add document describing all the math behind IO scheduler
  > io_queue: Add flow-rate based self slowdown backlink
  > io_queue: Make main throttler uncapped
  > io_queue: Add queue-wide metrics
  > io_queue: Introduce "flow monitor"
  > io_queue: Count total number of dispatched and completed requests so far
  > io_queue: Introduce io_group::io_latency_goal()
  > tests: test the vector overload for when_all_succeed
  > core: add a vector overload to when_all_succeed
  > loop: Fix iterator_range_estimate_vector_capacity for random iters
  > loop: Add test for iterator_range_estimate_vector_capacity
  > core/posix return old behaviour using non-portable pthread_attr_setaffinity_np when present
  > memory: s/throw()/noexcept/
  > build: enable -Wdeprecated compiler option
  > reactor: mark kernel_completion's dtor protected
  > tests: always wait for promise
  > http, json, net: define-generated copy ctor for polymorphic types
  > treewide: do not define constexpr static out-of-line
  > reactor: do not define dtor of kernel_completion
  > http/exception: stop using dynamic exception specification
  > metrics: replace vector with deque
  > metrics: change metadata vector to deque
  > utils/backtrace.hh: make simple_backtrace formattable
  > reactor: Unfriend disk_config_params
  > reactor: Move add_to_flush_poller() to internal namespace
  > reactor: Unfriend a bunch of sched group template calls
  > rpc_test: Test rpc send glitches
  > net: Implement batch flush support for existing sockets
  > iostream: Configure batch flushes if sink can do it
  > net: Added remote address accessors
  > circleci: update the image to CircleCI "standard" image
  > build: do not add header check target if no headers to check
  > build: pass target name to seastar_check_self_contained
  > build: detect glibc features using CMake
  > build: extract bits checking libc into CheckLibc.cmake
  > http/exception: add formatter for httpd::base_exception
  > http/client: Mark write_body() const
  > http/client: Introduce request::_bytes_written
  > http/client: Mark maybe_wait_for_continue() const
  > http/client: Mark send_request_head() const
  > http/client: Detach setup_request()
  > http/api_docs: copy in api_docs's copy constructor
  > script: do not inherit from object
  > scripts: addr2line: change StdinBacktraceIterator to a function
  > scripts: addr2line: use yield instead defining a class
  > tests: skip tests that require backtrace if execinfo.h is not found
  > backtrace: check for existence of execinfo.h
  > core: use ino_t and off_t as glibc sets these to 64bit if 64bit api is used
  > core: add sleep_abortable instantiation for manual_clock
  > tls: Return EPIPE exception when writing to shutdown socket
  > http/client: Don't cache connection if server advertises it
  > http/client: Mark connection as "keep in cache"
  > core: fix strerror_r usage from glibc extension
  > reactor: access sigevent.sigev_notify_thread_id with a macro
  > posix: use pthread_setaffinity_np instead of pthread_attr_setaffinity_np
  > reactor: replace __mode_t with mode_t
  > reactor: change sys/poll.h to posix poll.h
  > rpc: Add unit test for per-domain metrics
  > rpc: Report client connections metrics
  > rpc: Count dead client stats
  > rpc: Add seastar::rpc::metrics
  > rpc: Make public queues length getters

io-scheduler fixes
refs: scylladb#15312
refs: scylladb#11805

http client fixes
refs: scylladb#13736
refs: scylladb#15509

rpc fixes
refs: scylladb#15462
xemul added a commit to xemul/scylla that referenced this issue Oct 19, 2023
(same seastar update commit message as above)
avikivity pushed a commit that referenced this issue Oct 19, 2023
(same seastar update commit message as above)
Closes #15774
kostja added this to the 6.0 milestone Dec 15, 2023
@temichus
Contributor Author

temichus commented Dec 18, 2023

@kostja, @kbr-scylla
Retested on scylla_version: 5.5.0~dev-0.20231214.37868e5fdc62 with build-id 5b37b2b68d2bf07185a01ee0bcf85d2388c9eebf
test_id: a6f057a9-624c-4f83-ad9d-cfc29736d1b7
Issue reproduced.

seastar_io_queue_delay is even higher:
image

but cassandra-stress no longer reports the IOException; it now only reports:

com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during SIMPLE write query at consistency QUORUM (2 replica were required but only 1 acknowledged the write)

@kbr-scylla
Contributor

Thanks @temichus.

I'm aware that it's still failing. There are still some optimizations that have to be done inside migration_manager::get_schema_for_write. But right now I'm prioritizing the delivery of Raft topology over optimizing schema changes, so the latter thread is stuck.

The biggest issue was fixed (#15622) so it shouldn't have a large impact in real-world scenarios, where schema changes are rare.

@kbr-scylla
Contributor

@kostja I can tell you right now that it won't be done for 6.0.

@kostja
Contributor

kostja commented Dec 21, 2023

@kostja I can tell you right now that it won't be done for 6.0.

The part we wanted for 6.0 is in, so thank you.

kostja modified the milestones: 6.0, 6.1 Dec 21, 2023