LWT - Data population takes significantly longer than before #9331

Closed
roydahan opened this issue Sep 13, 2021 · 260 comments

@roydahan

roydahan commented Sep 13, 2021

Installation details
Kernel version: 5.4.0-1035-aws
Scylla version (or git commit hash): 4.5.rc7-0.20210906.edead1caf
Cluster size: 4 nodes (i3.4xlarge)
Scylla running with shards number (live nodes):
longevity-lwt-500G-3d-4-5-db-node-2447519d-1 (63.33.190.129 | 10.0.0.59): 14 shards
longevity-lwt-500G-3d-4-5-db-node-2447519d-2 (34.243.2.11 | 10.0.3.58): 14 shards
longevity-lwt-500G-3d-4-5-db-node-2447519d-3 (54.74.79.228 | 10.0.1.88): 14 shards
longevity-lwt-500G-3d-4-5-db-node-2447519d-4 (54.216.64.60 | 10.0.3.78): 14 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0e0abb5a45374673b (aws: eu-west-1)

Test: longevity-lwt-500G-3d-test
Test name: longevity_lwt_test.LWTLongevityTest.test_lwt_longevity
Test config file(s):

Issue description

====================================
This long LWT longevity test tries to write 400,000,000 keys (about 500 GB of data) using a c-s "user profile" (see below).
The tool is run from 3 different loaders, each writing one third of the dataset.
Writing the entire dataset should take around 8 hours (based on previous runs of this test).

In this case, the data population was unable to complete during the entire test (3 days).
The write throughput decreases steadily during the first few hours until it reaches a very low op/s rate.

====================================

Live monitor: http://34.252.214.230:3000/d/dGxHpZSnz/longevity-lwt-500g-3d-test-scylla-per-server-metrics-nemesis-master?orgId=1&from=1631032607724&to=1631169036853
(Restore Monitor Stack command: $ hydra investigate show-monitor 2447519d-8e50-4bcc-9f14-c9046cb649f1)
screenshot: https://snapshot.raintank.io/dashboard/snapshot/yLJrkc9UvaQGg1fUTDB5IReSWskowZ3v

A monitor for 4.4 run: http://44.193.79.222:3000/d/KWzIAbI7k/longevity-lwt-500g-3d-test-scylla-per-server-metrics-nemesis-master?orgId=1&from=1615474723261&to=1615524652826
(Restore Monitor Stack command: $ hydra investigate show-monitor 79499b7f-2c8f-45ad-8292-3f137c85e025)
Screenshot: https://cloudius-jenkins-test.s3.amazonaws.com/79499b7f-2c8f-45ad-8292-3f137c85e025/20210312_030749/grafana-screenshot-longevity-lwt-500g-3d-test-scylla-per-server-metrics-nemesis-20210312_031129-longevity-lwt-500G-3d-4-4-monitor-node-79499b7f-1.png

Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/2447519d-8e50-4bcc-9f14-c9046cb649f1/20210910_191248/grafana-screenshot-longevity-lwt-500g-3d-test-scylla-per-server-metrics-nemesis-20210910_191711-longevity-lwt-500G-3d-4-5-monitor-node-2447519d-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/2447519d-8e50-4bcc-9f14-c9046cb649f1/20210910_192216/db-cluster-2447519d.tar.gz
events - https://cloudius-jenkins-test.s3.amazonaws.com/2447519d-8e50-4bcc-9f14-c9046cb649f1/20210910_192216/events.log.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/2447519d-8e50-4bcc-9f14-c9046cb649f1/20210910_192216/loader-set-2447519d.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/2447519d-8e50-4bcc-9f14-c9046cb649f1/20210910_192216/monitor-set-2447519d.tar.gz
events - https://cloudius-jenkins-test.s3.amazonaws.com/2447519d-8e50-4bcc-9f14-c9046cb649f1/20210910_192216/raw_events.log.tar.gz

Show all stored logs command: $ hydra investigate show-logs 2447519d-8e50-4bcc-9f14-c9046cb649f1

Jenkins job URL

The load is defined by:

prepare_write_cmd: ["cassandra-stress user profile=/tmp/c-s_lwt_big_data.yaml n=133333333 ops'(insert_query=1)' cl=QUORUM -mode native cql3 -rate threads=100 -pop seq=1..133333333",
                    "cassandra-stress user profile=/tmp/c-s_lwt_big_data.yaml n=133333333 ops'(insert_query=1)' cl=QUORUM -mode native cql3 -rate threads=100 -pop seq=133333334..266666666",
                    "cassandra-stress user profile=/tmp/c-s_lwt_big_data.yaml n=133333333 ops'(insert_query=1)' cl=QUORUM -mode native cql3 -rate threads=100 -pop seq=266666667..400000000"
                   ]
where c-s_lwt_big_data.yaml is:

# LWT test: create and update data using LWT.
### DML ###

# Keyspace Name
keyspace: cqlstress_lwt_example

# The CQL for creating a keyspace (optional if it already exists)
keyspace_definition: |
  CREATE KEYSPACE IF NOT EXISTS cqlstress_lwt_example WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

# Table name
table: blogposts

# The CQL for creating a table you wish to stress (optional if it already exists)
table_definition: |
  CREATE TABLE IF NOT EXISTS blogposts (
        domain int,
        published_date int,
        lwt_indicator int,
        url blob,
        url1 blob,
        url2 blob,
        url3 blob,
        url4 blob,
        author text,
        title text,
        c ascii,
        d varchar,
        e boolean,
        f inet,
        g int,
        h bigint,
        char ascii,
        vchar varchar,
        finished boolean,
        osirisjson blob,
        ip inet,
        size int,
        imagebase bigint,
        small_num smallint,
        tiny_num tinyint,
        vint varint,
        decimal_num decimal,
        double_num double,
        float_num float,
        start_date date,
        start_time time,
        check_date timestamp,
        pass_date timestamp,
        upload_time timeuuid,
        user_uid uuid,
        name_set set<text>,
        name_list list<text>,
        PRIMARY KEY(domain, published_date)
  ) WITH compaction = { 'class':'LeveledCompactionStrategy' }
    AND comment='A table to hold blog posts'

extra_definitions:
  - create MATERIALIZED VIEW blogposts_update_one_column_lwt_indicator as select domain, lwt_indicator, author from blogposts where domain is not null and published_date is not null and lwt_indicator >=-2000 and lwt_indicator < 0 PRIMARY KEY(lwt_indicator, domain, published_date);
  - create MATERIALIZED VIEW blogposts_update_one_column_lwt_indicator_after_update as select lwt_indicator, author from blogposts where domain is not null and published_date is not null and lwt_indicator = 30000000 PRIMARY KEY(lwt_indicator, domain, published_date);
  - create MATERIALIZED VIEW blogposts_update_2_columns_lwt_indicator as select lwt_indicator, author from blogposts where domain is not null and published_date is not null and lwt_indicator > 0 and lwt_indicator <= 2000 PRIMARY KEY(lwt_indicator, domain, published_date);
  - create MATERIALIZED VIEW blogposts_update_2_columns_lwt_indicator_after_update as select lwt_indicator, author from blogposts where domain is not null and published_date is not null and lwt_indicator = 20000000 PRIMARY KEY(lwt_indicator, domain, published_date);
  - create MATERIALIZED VIEW blogposts_not_updated_lwt_indicator as select lwt_indicator, author from blogposts where domain is not null and published_date is not null and lwt_indicator > 1000000 and lwt_indicator < 20000000 PRIMARY KEY(lwt_indicator, domain, published_date);
  - create MATERIALIZED VIEW blogposts_deletions as select lwt_indicator, author from blogposts where domain is not null and published_date is not null and lwt_indicator > 10000 and lwt_indicator < 100000 PRIMARY KEY(lwt_indicator, domain, published_date);

### Column Distribution Specifications ###

columnspec:
  - name: domain
    population: seq(1..500000000)  #500M possible domains to pick from

  - name: published_date
    cluster: uniform(1..2000)         #under each domain we will have max 2000 posts

  - name: lwt_indicator
    population: seq(1..1001000)

  - name: url
    size: fixed(200)

  - name: url1
    size: fixed(200)

  - name: url2
    size: fixed(200)

  - name: url3
    size: fixed(200)

  - name: url4
    size: fixed(200)

  - name: title                  #titles shouldn't go beyond 20 chars
    size: gaussian(10..20)

  - name: author
    size: uniform(5..20)         #author names should be short

### Batch Ratio Distribution Specifications ###

insert:
  partitions: fixed(1)            # Our partition key is the domain so only insert one per batch

  select:    fixed(1)/2000        # We have 2000 posts per domain so 1/2000 will allow 1 post per batch

  batchtype: UNLOGGED             # Unlogged batches


#
# A list of queries you wish to run against the schema
#
queries:
   insert_query:
      cql: insert into blogposts(domain, published_date, lwt_indicator,  url, url1, url2, url3, url4, author, title, c, d, e, f, g, h, char, vchar, finished, osirisjson, ip, size, imagebase, small_num, tiny_num, vint, decimal_num, double_num, float_num, start_date, start_time, check_date, pass_date, upload_time, user_uid, name_set, name_list) values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?) if not exists
      fields: samerow
   select:
      cql: select * from blogposts where domain = ? LIMIT 1
      fields: samerow
   lwt_update_one_column:
      cql: update blogposts set lwt_indicator = 30000000 where domain = ? and published_date = ? if lwt_indicator >=-2000 and lwt_indicator < 0
      fields: samerow
   lwt_update_two_columns:
      cql: update blogposts set lwt_indicator = 20000000, author = 'text' where domain = ? and published_date = ? if lwt_indicator > 0 and lwt_indicator <= 2000 and author != 'text'
      fields: samerow
   lwt_deletes:
     cql: delete from blogposts where domain = ? and published_date = ? if lwt_indicator > 10000 and lwt_indicator < 100000
     fields: samerow
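
For reference, one of the three loader invocations above can be run by hand roughly as follows (the -node contact points are illustrative, taken from the node list in the installation details; SCT normally supplies them):

```
# One of the three population loaders, run manually for reference.
# The -node list is an assumption taken from the node IPs in the description above.
cassandra-stress user profile=/tmp/c-s_lwt_big_data.yaml n=133333333 ops'(insert_query=1)' \
  cl=QUORUM -mode native cql3 -rate threads=100 -pop seq=1..133333333 \
  -node 10.0.0.59,10.0.3.58,10.0.1.88,10.0.3.78
```
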
@roydahan
Author

One observation we had from the logs is a huge amount of semaphore_timeouts, as described in #7779 (comment)

@avikivity
Member

/cc @kostja @elcallio

What commit worked? Probably related to reverting commitlog improvements.

@slivne
Contributor

slivne commented Sep 13, 2021

there are no writes blocked on commitlog

Screenshot from 2021-09-13 17-52-44

@roydahan
Author

I updated the link to the 4.4 monitor in the original description.

@slivne
Contributor

slivne commented Sep 14, 2021

@avikivity ping - please look at the 4.4 / 4.5 monitors; do you see anything pointing to the possible issue?

@slivne
Contributor

slivne commented Sep 14, 2021

@roydahan - let's run on 4.4.4 (the latest 4.4 version) and see if it reproduces there.

@bhalevy
Member

bhalevy commented Sep 14, 2021

Looking at the logs, the timeouts are happening mostly on node-1.

db-cluster-2447519d$ grep -c -- '-1 .*Operation timed out' timeouts.log 
905743
db-cluster-2447519d$ grep -c -- '-[234] .*Operation timed out' timeouts.log 
13332

@bhalevy
Member

bhalevy commented Sep 14, 2021

Here's an example of a burst of errors from the log:

2021-09-07T17:44:23+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !WARNING | scylla:  [shard 5] storage_proxy - Failed to apply mutation from 10.0.1.88#5: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
2021-09-07T17:44:23+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] rpc - client 10.0.1.88:53163 msg_id 605543:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
2021-09-07T17:44:23+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] rpc - client 10.0.1.88:53163 msg_id 605542:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
2021-09-07T17:44:23+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !WARNING | scylla:  [shard 5] storage_proxy - Failed to apply mutation from 10.0.1.88#5: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
2021-09-07T17:44:23+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !WARNING | scylla: message repeated 2 times: [  [shard 5] storage_proxy - Failed to apply mutation from 10.0.1.88#5: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)]
2021-09-07T17:44:24+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] rpc - client 10.0.1.88:53163 msg_id 605979:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
2021-09-07T17:44:24+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] rpc - client 10.0.1.88:53163 msg_id 605982:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
...
2021-09-07T17:44:57+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !WARNING | scylla:  [shard 5] storage_proxy - Failed to apply mutation from 10.0.1.88#5: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
2021-09-07T17:44:57+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !WARNING | scylla:  [shard 5] storage_proxy - Failed to apply mutation from 10.0.1.88#5: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
2021-09-07T17:44:57+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !WARNING | scylla:  [shard 5] storage_proxy - Failed to apply mutation from 10.0.3.58#5: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
2021-09-07T17:44:58+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] compaction - [Compact system.paxos b978bf40-1002-11ec-ba6b-caa0a3b95357] Compacted 3 sstables to [/var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-117-big-Data.db:level=0, ]. 23MB to 21MB (~91% o
f original) in 254486ms = 85kB/s. ~420608 total partitions merged to 420425.
2021-09-07T17:44:58+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !WARNING | scylla:  [shard 5] storage_proxy - Failed to apply mutation from 10.0.3.58#5: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
2021-09-07T17:44:58+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] compaction - [Compact system.paxos 5195a270-1003-11ec-ba6b-caa0a3b95357] Compacting [/var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-355-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data
/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-327-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-285-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-243-big-Data.db:level=0:origin=memtable, /var/
lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-341-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-229-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-397-big-Data.db:level=0:origin=
memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-299-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-313-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-257-big-Data.db:
level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-271-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-411-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-2
15-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-201-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-383-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053e
f614bb7c2d/md-369-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-453-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-439-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c
2fd0a34108c053ef614bb7c2d/md-425-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-187-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-159-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/syste
m/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-173-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-145-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-131-big-Data.db:level=0:origin=memtable, /var/lib/sc
ylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-117-big-Data.db:level=0:origin=compaction, ]
2021-09-07T17:44:58+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] rpc - client 10.0.3.58:64909 msg_id 615650:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
2021-09-07T17:44:58+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !WARNING | scylla:  [shard 5] storage_proxy - Failed to apply mutation from 10.0.3.58#5: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
2021-09-07T17:44:58+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | sshd[50277]: Accepted publickey for scyllaadm from 10.0.3.77 port 48506 ssh2: RSA SHA256:1YmYWh7roz4Y3FSY7N8CetmQNlK/K5FeBPITC8KF0u8
2021-09-07T17:44:58+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !WARNING | scylla:  [shard 5] storage_proxy - Failed to apply mutation from 10.0.3.58#5: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
2021-09-07T17:44:58+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] rpc - client 10.0.3.58:64909 msg_id 615658:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
...
2021-09-07T17:45:02+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] compaction - [Compact cqlstress_lwt_example.blogposts_deletions 50e60180-1003-11ec-ba6b-caa0a3b95357] Compacted 4 sstables to [/var/lib/scylla/data/cqlstress_lwt_example/blogposts_deletions-b15e6d10100111ec8481ca9da3b95357/md-257-big-Data.db:level=0, ]. 501kB to 240kB (~48% of original) in 4202ms = 57kB/s. ~2816 total partitions merged to 2415.
2021-09-07T17:45:02+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] compaction - [Compact cqlstress_lwt_example.blogposts_deletions 5405b810-1003-11ec-ba6b-caa0a3b95357] Compacting [/var/lib/scylla/data/cqlstress_lwt_example/blogposts_deletions-b15e6d10100111ec8481ca9da3b95357/md-285-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts_deletions-b15e6d10100111ec8481ca9da3b95357/md-271-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts_deletions-b15e6d10100111ec8481ca9da3b95357/md-257-big-Data.db:level=0:origin=compaction, ]
2021-09-07T17:45:02+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] rpc - client 10.0.3.78:53149 msg_id 641064:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
2021-09-07T17:45:02+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] rpc - client 10.0.3.78:53149 msg_id 641066:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
2021-09-07T17:45:02+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] rpc - client 10.0.3.78:53149 msg_id 641068:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
2021-09-07T17:45:02+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !WARNING | scylla:  [shard 5] storage_proxy - Failed to apply mutation from 10.0.3.78#5: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
2021-09-07T17:45:03+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !WARNING | scylla: message repeated 2 times: [  [shard 5] storage_proxy - Failed to apply mutation from 10.0.3.78#5: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)]
2021-09-07T17:45:03+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] rpc - client 10.0.3.78:53149 msg_id 641965:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
2021-09-07T17:45:03+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] rpc - client 10.0.3.78:53149 msg_id 641972:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
2021-09-07T17:45:03+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !WARNING | scylla:  [shard 5] storage_proxy - Failed to apply mutation from 10.0.3.78#5: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.)
2021-09-07T17:45:05+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | sshd[50358]: Accepted publickey for scyllaadm from 10.0.3.77 port 48530 ssh2: RSA SHA256:1YmYWh7roz4Y3FSY7N8CetmQNlK/K5FeBPITC8KF0u8
2021-09-07T17:45:05+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | sshd[50358]: pam_unix(sshd:session): session opened for user scyllaadm by (uid=0)
2021-09-07T17:45:05+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | systemd-logind[666]: New session 421 of user scyllaadm.
2021-09-07T17:45:05+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | systemd[1]: Started Session 421 of user scyllaadm.
2021-09-07T17:45:06+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] compaction - [Compact cqlstress_lwt_example.blogposts_deletions 5405b810-1003-11ec-ba6b-caa0a3b95357] Compacted 3 sstables to [/var/lib/scylla/data/cqlstress_lwt_example/blogposts_deletions-b15e6d10100111ec8481ca9da3b95357/md-299-big-Data.db:level=0, ]. 409kB to 241kB (~59% of original) in 4170ms = 57kB/s. ~2688 total partitions merged to 2431.
2021-09-07T17:45:07+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] compaction - [Compact cqlstress_lwt_example.blogposts_update_one_column_lwt_indicator 56e928f0-1003-11ec-ba6b-caa0a3b95357] Compacting [/var/lib/scylla/data/cqlstress_lwt_example/blogposts_update_one_column_lwt_indicator-aeec5ba0100111ec923a54b3294ab14a/md-229-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts_update_one_column_lwt_indicator-aeec5ba0100111ec923a54b3294ab14a/md-61-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts_update_one_column_lwt_indicator-aeec5ba0100111ec923a54b3294ab14a/md-215-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts_update_one_column_lwt_indicator-aeec5ba0100111ec923a54b3294ab14a/md-89-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts_update_one_column_lwt_indicator-aeec5ba0100111ec923a54b3294ab14a/md-145-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts_update_one_column_lwt_indicator-aeec5ba0100111ec923a54b3294ab14a/md-103-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts_update_one_column_lwt_indicator-aeec5ba0100111ec923a54b3294ab14a/md-75-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts_update_one_column_lwt_indicator-aeec5ba0100111ec923a54b3294ab14a/md-131-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts_update_one_column_lwt_indicator-aeec5ba0100111ec923a54b3294ab14a/md-47-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts_update_one_column_lwt_indicator-aeec5ba0100111ec923a54b3294ab14a/md-117-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts_update_one_column_lwt_indicator-aeec5ba0100111ec923a54b3294ab14a/md-201-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts_update_one_column_lwt_indicator-aeec5ba0100111ec923a54b3294ab14a/md-187-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts_update_one_column_lwt_indicator-aeec5ba0100111ec923a54b3294ab14a/md-173-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts_update_one_column_lwt_indicator-aeec5ba0100111ec923a54b3294ab14a/md-159-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts_update_one_column_lwt_indicator-aeec5ba0100111ec923a54b3294ab14a/md-33-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts_update_one_column_lwt_indicator-aeec5ba0100111ec923a54b3294ab14a/md-19-big-Data.db:level=0:origin=memtable, ]
2021-09-07T17:45:08+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | sshd[50358]: pam_unix(sshd:session): session closed for user scyllaadm
2021-09-07T17:45:08+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | systemd[1]: session-421.scope: Succeeded.
2021-09-07T17:45:08+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | systemd-logind[666]: Session 421 logged out. Waiting for processes to exit.
2021-09-07T17:45:08+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | systemd-logind[666]: Removed session 421.
2021-09-07T17:45:08+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 4] compaction - [Compact system.paxos 2be77c60-1003-11ec-87a5-ca9ba3b95357] Compacted 2 sstables to [/var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-214-big-Data.db:level=0, ]. 29MB to 29MB (~98% of original) in 73422ms = 401kB/s. ~565760 total partitions merged to 565550.
2021-09-07T17:45:11+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] compaction - [Compact cqlstress_lwt_example.blogposts_update_one_column_lwt_indicator 56e928f0-1003-11ec-ba6b-caa0a3b95357] Compacted 16 sstables to [/var/lib/scylla/data/cqlstress_lwt_example/blogposts_update_one_column_lwt_indicator-aeec5ba0100111ec923a54b3294ab14a/md-243-big-Data.db:level=0, ]. 619kB to 140kB (~22% of original) in 3008ms = 46kB/s. ~2048 total partitions merged to 107.
2021-09-07T17:45:11+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] compaction - [Compact cqlstress_lwt_example.blogposts_update_2_columns_lwt_indicator 59593e90-1003-11ec-ba6b-caa0a3b95357] Compacting [/var/lib/scylla/data/cqlstress_lwt_example/blogposts_update_2_columns_lwt_indicator-af5de220100111ec8e161c3028c316f0/md-61-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts_update_2_columns_lwt_indicator-af5de220100111ec8e161c3028c316f0/md-47-big-Data.db:level=0:origin=compaction, ]
2021-09-07T17:45:13+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | sshd[50438]: Accepted publickey for scyllaadm from 10.0.3.77 port 48552 ssh2: RSA SHA256:1YmYWh7roz4Y3FSY7N8CetmQNlK/K5FeBPITC8KF0u8
2021-09-07T17:45:13+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | sshd[50438]: pam_unix(sshd:session): session opened for user scyllaadm by (uid=0)
2021-09-07T17:45:13+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | systemd-logind[666]: New session 422 of user scyllaadm.
2021-09-07T17:45:13+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | systemd[1]: Started Session 422 of user scyllaadm.
2021-09-07T17:45:14+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] compaction - [Compact cqlstress_lwt_example.blogposts_update_2_columns_lwt_indicator 59593e90-1003-11ec-ba6b-caa0a3b95357] Compacted 2 sstables to [/var/lib/scylla/data/cqlstress_lwt_example/blogposts_update_2_columns_lwt_indicator-af5de220100111ec8e161c3028c316f0/md-75-big-Data.db:level=0, ]. 99kB to 93kB (~94% of original) in 2875ms = 32kB/s. ~256 total partitions merged to 51.
2021-09-07T17:45:15+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 5] compaction - [Compact system.paxos 5ba79b60-1003-11ec-ba6b-caa0a3b95357] Compacting [/var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-495-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-467-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/system/paxos-b7b7f0c2fd0a34108c053ef614bb7c2d/md-509-big-Data.db:level=0:origin=memtable, ]
2021-09-07T17:45:15+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | sshd[50438]: pam_unix(sshd:session): session closed for user scyllaadm
2021-09-07T17:45:15+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | systemd[1]: session-422.scope: Succeeded.
2021-09-07T17:45:15+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | systemd-logind[666]: Session 422 logged out. Waiting for processes to exit.
2021-09-07T17:45:15+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | systemd-logind[666]: Removed session 422.
2021-09-07T17:45:20+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 0] compaction - [Compact cqlstress_lwt_example.blogposts 8142f230-1002-11ec-80ac-ca9ca3b95357] Compacted 4 sstables to [/var/lib/scylla/data/cqlstress_lwt_example/blogposts-adacafb0100111ec80a556cf069bba9f/md-98-big-Data.db:level=1, /var/lib/scylla/data/cqlstress_lwt_example/blogposts-adacafb0100111ec80a556cf069bba9f/md-140-big-Data.db:level=1, /var/lib/scylla/data/cqlstress_lwt_example/blogposts-adacafb0100111ec80a556cf069bba9f/md-182-big-Data.db:level=1, ]. 467MB to 468MB (~100% of original) in 371664ms = 1MB/s. ~318336 total partitions merged to 317990.
2021-09-07T17:45:20+00:00  longevity-lwt-500G-3d-4-5-db-node-2447519d-1 !INFO    | scylla:  [shard 0] compaction - [Compact cqlstress_lwt_example.blogposts 5ed3ae00-1003-11ec-80ac-ca9ca3b95357] Compacting [/var/lib/scylla/data/cqlstress_lwt_example/blogposts-adacafb0100111ec80a556cf069bba9f/md-126-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts-adacafb0100111ec80a556cf069bba9f/md-154-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts-adacafb0100111ec80a556cf069bba9f/md-168-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts-adacafb0100111ec80a556cf069bba9f/md-112-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/cqlstress_lwt_example/blogposts-adacafb0100111ec80a556cf069bba9f/md-182-big-Data.db:level=1:origin=compaction, /var/lib/scylla/data/cqlstress_lwt_example/blogposts-adacafb0100111ec80a556cf069bba9f/md-140-big-Data.db:level=1:origin=compaction, /var/lib/scylla/data/cqlstress_lwt_example/blogposts-adacafb0100111ec80a556cf069bba9f/md-98-big-Data.db:level=1:origin=compaction, ]

@kostja @elcallio @raphaelsc I wonder if LWT might be creating lots of small sstables, causing the read timeouts until they are compacted.
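
A quick way to check on a node whether many small sstables are accumulating (a rough sketch; the paths assume the default data-directory layout seen in the compaction messages above):

```
# Count sstables and total size per table directory.
for dir in /var/lib/scylla/data/system/paxos-* \
           /var/lib/scylla/data/cqlstress_lwt_example/blogposts-*; do
    echo "$dir: $(ls "$dir"/md-*-big-Data.db 2>/dev/null | wc -l) sstables, $(du -sh "$dir" | cut -f1)"
done
```
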

@roydahan
Author

The test on 4.4.4 looks good: the throughput is normal, the population phase finished within several hours, and nemesis started to run.

monitor: http://54.75.44.177:3000/d/ly9LCfSnk/longevity-lwt-500g-3d-test-on-4-4-scylla-per-server-metrics-nemesis-master?orgId=1&refresh=30s&from=now-24h&to=now

@slivne
Contributor

slivne commented Sep 15, 2021

If we are generating more sstables, it should be visible in the counters for both sstables and memtable flushes. Let's check this out.
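
One way to eyeball this per table from a node (a sketch; the field names assume the usual nodetool tablestats output and may differ between versions):

```
# Sketch: sstable counts and memtable switch (flush) counters per table.
nodetool tablestats cqlstress_lwt_example.blogposts system.paxos \
  | grep -Ei 'Table:|SSTable count|Memtable switch count'
```
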

@avikivity
Member

there are no writes blocked on commitlog

Screenshot from 2021-09-13 17-52-44

Probably the counter is lying. With LWT all writes should be blocked on commitlog.

avikivity pushed a commit that referenced this issue Sep 19, 2021
This exception is ignored by design, but if it's left unhandled,
it generates `Exceptional future ignored` warnings, like the following.

Also, ignore f2 if f1 failed since we return early in this case.

```
[shard 5] seastar - Exceptional future ignored: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem), backtrace: 0x431689e 0x4316d40 0x43170e8 0x3f35486 0x218d14a 0x3f8002f 0x3f81217 0x3f9f868 0x3f4b76a /opt/scylladb/libreloc/libpthread.so.0+0x93f8 /opt/scylladb/libreloc/libc.so.6+0x101902#012
N7seastar12continuationINS_8internal22promise_base_with_typeISt7variantIJN5utils4UUIDEN7service5paxos7promiseEEEEEZZZZNS7_11paxos_state7prepareEN7tracing15trace_state_ptrENS_13lw_shared_ptrIK6schemaEERKN5query12read_commandERK13partition_keyS5_bNSI_16digest_algorithmENSt6chrono10time_pointINS_12lowres_clockENSQ_8durationIlSt5ratioILl1ELl1000EEEEEEENK3$_0clEvENUlvE_clEvENKUlSB_E_clESB_EUlT_E_ZNS_6futureISt5tupleIJNS13_IvEENS13_IS14_IJNSE_INSI_6resultEEE17cache_temperatureEEEEEEE14then_impl_nrvoIS12_NS13_IS9_EEEET0_OS11_EUlOSA_RS12_ONS_12future_stateIS1B_EEE_S1B_EE#012
seastar::continuation<seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::finally_body<seastar::with_semaphore<seastar::semaphore_default_exception_factory, seastar::lowres_clock, service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#1}>(seastar::basic_semaphore<seastar::semaphore_default_exception_factory, seastar::lowres_clock>&, unsigned long, seastar::lowres_clock::duration, std::result_of&&)::{lambda(seastar::basic_semaphore)#1}::operator()<seastar::semaphore_units<seastar::semaphore_default_exception_factory, seastar::lowres_clock> >(seastar::basic_semaphore)::{lambda()#1}, false>, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::then_wrapped_nrvo<seastar::future<std::variant<utils::UUID, service::paxos::promise> >, seastar::semaphore_units<seastar::semaphore_default_exception_factory, seastar::lowres_clock> >(seastar::future<std::variant<utils::UUID, service::paxos::promise> >&&)::{lambda(seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >&&, seastar::semaphore_units<seastar::semaphore_default_exception_factory, seastar::lowres_clock>&, seastar::future_state<std::variant<utils::UUID, service::paxos::promise> >&&)#1}, std::variant<utils::UUID, service::paxos::promise> >#12
seastar::continuation<seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::finally_body<service::paxos::paxos_state::key_lock_map::with_locked_key<service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#1}>(dht::token const&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >, std::result_of)::{lambda()#1}, false>, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::then_wrapped_nrvo<seastar::future<std::variant<utils::UUID, service::paxos::promise> >, {lambda()#1}>({lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >&&, {lambda()#1}&, seastar::future_state<std::variant<utils::UUID, service::paxos::promise> >&&)#1}, std::variant<utils::UUID, service::paxos::promise> >#12
seastar::continuation<seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::finally_body<service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#2}, false>, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::then_wrapped_nrvo<seastar::future<std::variant<utils::UUID, service::paxos::promise> >, service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#2}>(service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#2}&&)::{lambda(seastar::internal::promise_base_with_type<std::variant<utils::UUID, service::paxos::promise> >&&, service::paxos::paxos_state::prepare(tracing::trace_state_ptr, seastar::lw_shared_ptr<schema const>, query::read_command const&, partition_key const&, utils::UUID, bool, query::digest_algorithm, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >)::$_0::operator()() const::{lambda()#2}&, seastar::future_state<std::variant<utils::UUID, service::paxos::promise> >&&)#1}, std::variant<utils::UUID, service::paxos::promise> >#12
seastar::continuation<seastar::internal::promise_base_with_type<seastar::foreign_ptr<std::unique_ptr<std::variant<utils::UUID, service::paxos::promise>, std::default_delete<std::variant<utils::UUID, service::paxos::promise> > > > >, service::storage_proxy::init_messaging_service()::$_51::operator()(seastar::rpc::client_info const&, seastar::rpc::opt_time_point, query::read_command, partition_key, utils::UUID, bool, query::digest_algorithm, std::optional<tracing::trace_info>) const::{lambda(seastar::lw_shared_ptr<schema const>)#1}::operator()(seastar::lw_shared_ptr<schema const>)::{lambda()#1}::operator()() const::{lambda(std::variant<utils::UUID, service::paxos::promise>)#1}, seastar::future<std::variant<utils::UUID, service::paxos::promise> >::then_impl_nrvo<{lambda()#1}, {lambda()#1}<seastar::foreign_ptr<std::unique_ptr<std::variant<utils::UUID, service::paxos::promise>, std::default_delete<std::variant<utils::UUID, service::paxos::promise> > > > > >({lambda()#1}&&)::{lambda(seastar::internal::promise_base_with_type<seastar::foreign_ptr<std::unique_ptr<std::variant<utils::UUID, service::paxos::promise>, std::default_delete<std::variant<utils::UUID, service::paxos::promise> > > > >&&, {lambda()#1}&, seastar::future_state<std::variant<utils::UUID, service::paxos::promise> >&&)#1}, std::variant<utils::UUID, service::paxos::promise> >#12
seastar::continuation<seastar::internal::promise_base_with_type<seastar::foreign_ptr<std::unique_ptr<std::variant<utils::UUID, service::paxos::promise>, std::default_delete<std::variant<utils::UUID, service::paxos::promise> > > > >, seastar::future<seastar::foreign_ptr<std::unique_ptr<std::variant<utils::UUID, service::paxos::promise>, std::default_delete<std::variant<utils::UUID, service::paxos::promise> > > > >::finally_body<seastar::smp::submit_to<service::storage_proxy::init_messaging_service()::$_51::operator()(seastar::rpc::client_info const&, seastar::rpc::opt_time_point, query::read_command, partition_key, utils::UUID, bool, query::digest_algorithm, std::optional<tracing::trace_info>) const::{lambda(seastar::lw_shared_ptr<schema const>)#1}::operator()(seastar::lw_shared_ptr<schema const>)::{lambda()#1}>(unsigned int, se
```

Refs #7779
Refs #9331

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210919053007.13960-1-bhalevy@scylladb.com>
@bhalevy
Member

bhalevy commented Sep 19, 2021

@kostja please look into this issue asap (it's blocking 4.5)
and assign someone to get to the bottom of it.

@slivne
Contributor

slivne commented Sep 19, 2021

@elcallio can you please also look at the metrics - can you see anything suggesting writes are blocked on the commitlog in some form? The run is with the commitlog hard limit disabled.

@slivne
Contributor

slivne commented Sep 19, 2021

@kostja / @elcallio - to make it easier to find

@elcallio
Contributor

@slivne - I can't see anything from the first monitor, and the second is dead (?). In any case, we never show any commitlog stalls in these - I am starting to wonder whether the grafana counter is wired up properly.
I can't see any significant changes between 4.4 and the current 4.5 version, other than 5fcc206 where we added multi-entry support. Also ab55a1b (partial flush/threshold) and improved (?) pre-alloc 48ca01c.

Note that the partial flush could be a reason for perhaps smaller sstables being written (initially).

@slivne
Contributor

slivne commented Sep 20, 2021 via email

@elcallio
Contributor

@slivne - I don't think fixing monitors really helps any. As I said, the CL blocked/write counters never seem to contain anything, which makes me think they are either not read properly or not presented properly (or don't lend themselves to presenting).
I have no theories, esp. since no one seems to be very certain what it is that blocks...

@slivne
Contributor

slivne commented Sep 21, 2021 via email

@bentsi
Contributor

bentsi commented Sep 21, 2021

Which monitors do you need?
this one (4.5.rc7) is working:
http://34.252.214.230:3000/d/dGxHpZSnz/longevity-lwt-500g-3d-test-scylla-per-server-metrics-nemesis-master?orgId=1&from=1631032607724&to=1631169036853
the monitor with 4.4.4 is not; I will check whether I can bring it up.

@elcallio
Contributor

If I read the metric correctly, we only had "many" memtable flushes in the beginning of the run (first 3-4h or so). Way before any of the log messages above... Not sure if that means anything though...

@kostja
Contributor

kostja commented Sep 21, 2021

I've been running this for a few hours locally with --commitlog-use-hard-size-limit=true; did not observe an unusual performance drop. Now running without the option.

@kostja
Contributor

kostja commented Sep 22, 2021

Left the test running overnight with --commitlog-use-hard-size-limit=false, did not observe a performance drop. 131GB of data in blogposts is stored in ~6000 files on disk, which seems to be adequate (~400 files were used for 6GB)

@bhalevy
Member

bhalevy commented Sep 22, 2021

@kostja just to make sure, are you testing master or 4.5?

@kostja
Contributor

kostja commented Sep 22, 2021

master

@fgelcer

fgelcer commented Sep 22, 2021

master

the issue was found and reported against 4.5

@kostja
Contributor

kostja commented Sep 22, 2021

@fgelcer I'm sorry but this is not the issue. The issue is not found and I am trying to find it. An issue contains a test case that I as an engineer can easily reproduce. A bunch of monitors left around after a 3-day run help me improve my telepathy skills but are a very inefficient way to nail down the problem.
If there is a strong reason to believe that the issue was fixed in master, please share it.

@roydahan
Author

Left the test running overnight with --commitlog-use-hard-size-limit=false, did not observe a performance drop. 131GB of data in blogposts is stored in ~6000 files on disk, which seems to be adequate (~400 files were used for 6GB)

@kostja I don't know how you ran it and for how long, but the test should write 500GB of data in 8 hours.
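
For reference, rough arithmetic for the rate implied by those numbers (not measured, just the expectation):

```
# 400,000,000 keys in ~8 hours -> ~13.9k writes/s aggregate, ~4.6k writes/s per loader
echo $(( 400000000 / (8 * 3600) ))      # ~13888 ops/s across the 3 loaders
echo $(( 400000000 / (8 * 3600) / 3 ))  # ~4629 ops/s per loader
```
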

@slivne
Contributor

slivne commented Sep 23, 2021

@kostja - QA has a specific use case that fails on a specific setup using a specific version - the issue reproduces in that environment and use case - let's use that to first understand where the issue is, and then try to minimize / change / reproduce on a smaller setup, etc.

The bug exists and it reproduces.

@eliransin
Contributor

I can't seem to get resolved lines out of that AMI... The scylla-packaged python addr2line is missing, and trying to use scripts with local (apt) packages fails due to version mismatches (?). Even trying to use native elfutils on the scylla exec just dies... Is there debug info installed into the AMI?

yes

@elcallio
Contributor

elcallio commented May 3, 2022

Unfortunately, I could still not get any info out of the image, neither from the booted-up Ubuntu nor by copying the scylla executable to another machine. No idea why the Ubuntu addr2line fails, but it does...

@eliransin
Contributor

I will try to help with this.

@eliransin
Contributor

The stack trace for the reactor stall in #9331 (comment):

seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) + 210 in section .text of /opt/scylladb/libexec/scylla
seastar::internal::cpu_stall_detector::generate_trace() + 288 in section .text of /opt/scylladb/libexec/scylla
seastar::reactor::block_notifier(int) + 80 in section .text of /opt/scylladb/libexec/scylla
No symbol matches 0x7f01168fda1f.
logalloc::region_impl::free(void*, unsigned long) + 719 in section .text of /opt/scylladb/libexec/scylla
bplus::tree<unsigned long, cached_file::cached_page, cached_file::page_idx_less_comparator, 12ul, (bplus::key_search)0, (bplus::with_debug)0>::iterator::erase(cached_file::page_idx_less_comparator) + 1692 in section .text of /opt/scylladb/libexec/scylla
cached_file::cached_page::on_evicted() + 116 in section .text of /opt/scylladb/libexec/scylla
std::_Function_handler<seastar::memory::reclaiming_result (), cache_tracker::cache_tracker(mutation_application_stats&, seastar::bool_class<register_metrics_tag>)::$_0>::_M_invoke(std::_Any_data const&) + 146 in section .text of /opt/scylladb/libexec/scylla
logalloc::region_impl::evict_some() + 77 in section .text of /opt/scylladb/libexec/scylla
logalloc::reclaim_from_evictable(logalloc::region_impl&, unsigned long, seastar::bool_class<is_preemptible_tag>) + 1144 in section .text of /opt/scylladb/libexec/scylla
logalloc::tracker::impl::compact_and_evict_locked(unsigned long, unsigned long, seastar::bool_class<is_preemptible_tag>) + 5523 in section .text of /opt/scylladb/libexec/scylla
logalloc::segment_pool::allocate_segment(unsigned long) + 395 in section .text of /opt/scylladb/libexec/scylla
logalloc::allocating_section::reserve() + 119 in section .text of /opt/scylladb/libexec/scylla
cached_file::get_page_ptr(unsigned long, unsigned long, seastar::io_priority_class const&, tracing::trace_state_ptr)::{lambda(seastar::temporary_buffer<char>&&)#1}::operator()(seastar::temporary_buffer<char>&&) + 188 in section .text of /opt/scylladb/libexec/scylla
seastar::continuation<seastar::internal::promise_base_with_type<std::unique_ptr<cached_file::cached_page, cached_file::cached_page::cached_page_del> >, cached_file::get_page_ptr(unsigned long, unsigned long, seastar::io_priority_class const&, tracing::trace_state_ptr)::{lambda(seastar::temporary_buffer<char>&&)#1}, seastar::future<seastar::temporary_buffer<char> >::then_impl_nrvo<{lambda(seastar::temporary_buffer<char>&&)#1}, cached_file::get_page_ptr(unsigned long, unsigned long, seastar::io_priority_class const&, tracing::trace_state_ptr)::{lambda(seastar::temporary_buffer<char>&&)#1}<std::unique_ptr<cached_file::cached_page, cached_file::cached_page::cached_page_del> > >({lambda(seastar::temporary_buffer<char>&&)#1}&&)::{lambda(seastar::internal::promise_base_with_type<std::unique_ptr<cached_file::cached_page, cached_file::cached_page::cached_page_del> >&&, {lambda(seastar::temporary_buffer<char>&&)#1}&, seastar::future_state<seastar::temporary_buffer<char> >&&)#1}, seastar::temporary_buffer<char> >::run_and_dispose() + 100 in section .text of /opt/scylladb/libexec/scylla
seastar::reactor::run_some_tasks() + 340 in section .text of /opt/scylladb/libexec/scylla
seastar::reactor::do_run() + 3303 in section .text of /opt/scylladb/libexec/scylla
std::_Function_handler<void (), seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::$_98>::_M_invoke(std::_Any_data const&) + 1397 in section .text of /opt/scylladb/libexec/scylla
seastar::posix_thread::start_routine(void*) + 10 in section .text of /opt/scylladb/libexec/scylla
No symbol matches 0x9298.
No symbol matches 0x100352.

@elcallio would you analyze and determine whether it is commitlog (more specifically, hard limit) related or not?
Some technicalities, in case there are more and different stack traces to analyze:
I produced this using gdb on an instance with one of the images listed in #9331 (comment); those AMIs were built on my machine and are simply 4.6.rc5 with the hard limit on by default.
The way I generated this stack trace:

  1. Start an instance with one of the AMIs listed (region dependent); I used an i3.xlarge.
  2. SSH into the machine and stop all running scylla processes (not mandatory).
  3. sudo apt-get update
  4. sudo apt-get install build-essential gdb
  5. Create a file in the home dir with one line per address which says: info symbol 0x??????? - I called it scylla_stack.
  6. gdb /opt/scylladb/libexec/scylla - this is going to take a while, at least a few minutes, to load the symbols.
  7. In the gdb prompt: source ~/scylla_stack
  8. You will get the output above.

For completeness and so you can do a one to one match for the addresses, the scylla_stack file content:

info symbol 0x3db1022
info symbol 0x3dafcd0
info symbol 0x3db0f30
info symbol 0x7f01168fda1f
info symbol 0x1409dff
info symbol 0x164a25c
info symbol 0x1649b94
info symbol 0x134b1b2
info symbol 0x13dd21d
info symbol 0x13e9278
info symbol 0x13e4e03
info symbol 0x13d46fb
info symbol 0x13f2e97
info symbol 0x17454ac
info symbol 0x1748bc4
info symbol 0x3dc3a54
info symbol 0x3dc4e47
info symbol 0x3de2475
info symbol 0x3d9c08a
info symbol 0x9298
info symbol 0x100352
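
A quick way to generate such a scylla_stack file from a list of raw backtrace addresses (a sketch; assumes the addresses are saved in addrs.txt, one per line):

```
# Turn a list of backtrace addresses (one per line) into gdb "info symbol" commands.
sed 's/^/info symbol /' addrs.txt > ~/scylla_stack
# then, inside gdb:
#   (gdb) source ~/scylla_stack
```
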

@eliransin
Contributor

@elcallio BTW I am leaving the machine on for a few hours in case you want to experiment with it.
You can log into my gdb session without the hassle of waiting for the symbols to load with:
ssh -i ~/.ssh/scylla-qa-ec2 centos@52.215.81.66
then run: tmux a
Let me know when I can put the instance down, if not asked otherwise I will kill it towards the end of the day.

@elcallio
Contributor

FWIW, the trace looks wholly unrelated to anything commit log.

@eliransin
Contributor

@roydahan the stall doesn't seem to come from the commitlog, or more specifically from the hard limit.
What are the next steps to try and push this forward?

@roydahan
Author

If you think it's ready, I think we can merge.
If you know of some related areas you would like us to test, we can do that.

benipeled pushed a commit to benipeled/scylla that referenced this issue May 31, 2022
There will be unbounded growth of pending tasks if they are submitted
faster than they are retired. That can potentially happen if memtables
are frequently flushed too early. It was observed that this unbounded
growth caused task queue violations, as the queue gets filled
with tons of tasks being re-evaluated. By avoiding duplication in
the pending task list for a given table T, growth is no longer unbounded
and consequently re-evaluation is no longer aggressive.

Refs scylladb#9331.

Scylla ent:
Refs scylladb#2111.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210930125718.41243-1-raphaelsc@scylladb.com>
(cherry picked from commit 52302c3)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220215162147.61628-1-raphaelsc@scylladb.com>
@eliransin
Contributor

eliransin commented Jun 13, 2022

I am doing a comparative test now:
I reran the AMI and got a similar number of stalls (~420 stalls).
Now I am rerunning the original version (4.6.rc5) to see whether those stalls exist there. If the same stalls exist, we can proceed with
enabling the feature; if not, we will need to find out what triggers those stalls.
Jenkins job link to a clean 4.6.rc5 run:
https://jenkins.scylladb.com/view/scylla-4.6/job/scylla-4.6/job/SCT-Reproducers/job/Issue-9331/job/longevity-cdc-400gb-3d/5

@eliransin
Contributor

After running a build without the hard limit, I found a similar number of stalls.
I used the following search pattern:
find -name "messages.log" | xargs cat | grep "Reactor stalled" | wc -l
For the latest run with the hard limit on I got:
475
And for the latest run with the hard limit off, which is official 4.6.rc5, I got:
570

All in all it is comparable.

Histograms of the stalls look similar, and the mean stall time is also similar (+/- 10%):

(screenshot: histograms of stall durations)
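
For reference, one possible way to extract the stall durations behind such a histogram/mean comparison (a sketch, assuming the usual Seastar "Reactor stalled for <N> ms" message format):

```
# Collect stall durations (ms) from all node logs and print count and mean.
find . -name "messages.log" | xargs cat \
  | grep -o 'Reactor stalled for [0-9]* ms' \
  | awk '{ sum += $4; n++ } END { printf "stalls=%d mean=%.1f ms\n", n, sum/n }'
```
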

@eliransin
Contributor

@aleksbykov the last thing we are left to do, according to @roydahan's comment #9331 (comment), is to run a write throughput test and
see how it is impacted.
What we have already done: made sure that the insert time is not too long compared to the non-hard-limit version, made
sure that coredumps are not in the commitlog area, and analyzed the observed stalls to see whether they are hard-limit related.
In order to enable the feature, I will go over the other hard limit bugs and make sure that either there are none or, if there are,
identify any extremely problematic workloads that we should additionally check in order to qualify it.

@kostja
Contributor

kostja commented Jan 13, 2023

@eliransin can this be closed?

@DoronArazii DoronArazii added this to the 5.x milestone Jan 16, 2023
@kostja kostja added area/admission control tests/longevity Issue detected during longevity labels Jul 18, 2023
@kostja kostja removed their assignment Jul 18, 2023
@DoronArazii

@eliransin / @wmitros what are we tracking here ATM?

avikivity added a commit that referenced this issue Oct 26, 2023
…fault' from Eliran Sinvani

This miniset completes the prerequisites for enabling the commitlog hard limit by default.
Namely, start flushing and evacuating segments halfway to the limit in order to never hit it under normal circumstances.
It is worth mentioning that hitting the limit is an exceptional condition whose root cause needs to be resolved; however,
once we do hit the limit, the performance impact inflicted as a result of this enforcement is irrelevant.

Tests: unit tests.
          LWT write test (#9331)
Whitebox testing was performed by @wmitros; the test aimed at putting as much pressure as possible on the commitlog segments by using a write pattern that rewrites the partitions in the memtable, keeping it at ~85% occupancy so that the dirty memory manager will not kick in. The test compared 3 configurations:
1. The default configuration
2. Hard limit on (without changing the flush threshold)
3. The changes in this PR applied.
The last exhibited the "best" behavior in terms of metrics; its graphs were the flattest and least jagged of the three.

Closes #10974

* github.com:scylladb/scylladb:
  commitlog: enforce commitlog size hard limit by default
  commitlog: set flush threshold to half of the  limit size
  commitlog: unfold flush threshold assignment

Fixes #9625
avikivity added a commit that referenced this issue Oct 30, 2023
@eliransin eliransin assigned eliransin and unassigned elcallio Oct 31, 2023
@mykaul
Contributor

mykaul commented Nov 19, 2023

@eliransin can this be closed?

@eliransin / @kostja ?

@nyh
Contributor

nyh commented Nov 19, 2023

I think the creator of this issue (@roydahan), or one of the people who made the 257 comments in this issue and were ever able to reproduce it (are there such people?), needs to say whether this issue can still be reproduced. If it can't be reproduced, we should close it. If it still can be reproduced, it's still a bug and definitely shouldn't be closed!

@kostja
Contributor

kostja commented Nov 19, 2023

@mykaul I agree with @nyh's comment. There has never been a clear success criterion for this issue. The description and the title are by now vastly obsolete. We used it to track the commit log hard limit work for a while, so it hangs on the commit log hard limit still not being a default. I don't think that is the proper way to use the issue database.

So I will take the liberty of closing this; whatever issues are still out there after 4 years deserve to be tracked in a new ticket.

@kostja kostja closed this as completed Nov 19, 2023
@bhalevy
Member

bhalevy commented Nov 19, 2023

after 4 years

2 + epsilon. let's not exaggerate.

avikivity added a commit that referenced this issue Nov 29, 2023