
cassandra-stress starts to timeout and fail after ~20min with GCE local disks #7341

Closed
fruch opened this issue Oct 6, 2020 · 89 comments
Labels: symptom/performance Issues causing performance problems
@fruch
Contributor

fruch commented Oct 6, 2020

Installation details
Scylla version (or git commit hash): 4.1.7 (4.1.7-0.20200918.2251a1c577)
Cluster size: 4
OS (RHEL/CentOS/Ubuntu/AWS AMI): Centos8/Ubuntu1804

Hardware details (for performance issues)
Platform (physical/VM/cloud instance type/docker): GCE (n1-highmem-16)
Hardware: sockets= cores= hyperthreading= memory=
Disks: (SSD/HDD, count) Nvme 8/6 disks

Summary

We are trying to fill up a GCE setup with 1 TB of data, for testing backup to Google Storage.

This is the stress command we are using:

cassandra-stress write cl=QUORUM n=1100200300 -schema 'replication(factor=3) compaction(strategy=LeveledCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=150 -col 'size=FIXED(200) n=FIXED(5)' -pop seq=1..1100200300

After ~20 minutes we start getting timeouts that fail the load with the following error:

com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during SIMPLE write query at consistency QUORUM (2 replica were required but only 1 acknowledged the write)

Logs (CentOS 8)

Further things we tried

We also tried this with Ubuntu 18.04, suspecting it might be a kernel-related issue, but got similar results.

Talking with @tarzanek, we looked into io_properties.yaml; he suggested lowering the values, to let Scylla throttle I/O a bit.
This is the output of iotune, while measuring:

disks:
  - mountpoint: /var/lib/scylla
    read_iops: 721717
    read_bandwidth: 2949972224
    write_iops: 400594
    write_bandwidth: 1504807040

And this is the io_properties.yaml we overwrite it with (these values are not part of master; they are calculated based on the number of local disks):

disks:
  - mountpoint: /var/lib/scylla
    read_iops: 680000
    read_bandwidth: 2778726400
    write_iops: 360000
    write_bandwidth: 1468006400

After applying the suggested io_properties.yaml, we are still facing these issues.
Monitor: http://35.196.159.20:3000/d/Z3dCVz5Mk/scylla-per-server-metrics-nemesis-master?orgId=1&from=1601970238340&to=1601972849329
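For reference, this is roughly how we apply the override on a node (a minimal sketch; it assumes the default /etc/scylla.d/io_properties.yaml location that scylla_io_setup writes to, and that a restart is acceptable):

# overwrite the measured values with the lower, hand-picked ones
cat > /etc/scylla.d/io_properties.yaml <<'EOF'
disks:
  - mountpoint: /var/lib/scylla
    read_iops: 680000
    read_bandwidth: 2778726400
    write_iops: 360000
    write_bandwidth: 1468006400
EOF
# restart so the new limits are picked up by the I/O scheduler
systemctl restart scylla-server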

@fruch fruch added the symptom/performance label Oct 6, 2020
@tarzanek
Contributor

tarzanek commented Oct 6, 2020

So we have the reason: the nodes run with mq and share CPUs with the OS, so NIC traffic and softirqs (SIs) interfere.

BTW, neither of these tuning steps runs by default:

Do you want to enable Network Interface Card (NIC) and disk(s) optimization?
Yes - optimize the NIC queue and disks settings. Selecting Yes greatly improves performance. No - skip this step.
[yes/NO]
Do you want to set the CPU scaling governor to Performance level on boot?
Yes - sets the CPU scaling governor to performance level. No - skip this step.
[YES/no]
This computer doesn't supported CPU scaling configuration.

@fruch
Contributor Author

fruch commented Oct 6, 2020

so we have the reason, the nodes run with mq and share cpus with OS, so NIC traffic and SIs interfere

btw. neither of the tuning runs by default:

Do you want to enable Network Interface Card (NIC) and disk(s) optimization?
Yes - optimize the NIC queue and disks settings. Selecting Yes greatly improves performance. No - skip this step.
[yes/NO]
Do you want to set the CPU scaling governor to Performance level on boot?
Yes - sets the CPU scaling governor to performance level. No - skip this step.
[YES/no]
This computer doesn't supported CPU scaling configuration.

--setup-nic-and-disks: seems like we need to add this flag to scylla_setup; we weren't passing it in SCT so far.
We'll try again.

As for CPU scaling, I don't seem to be able to turn it on. I'm not even sure we can do that on GCE; I didn't find any information about it.

@fruch
Contributor Author

fruch commented Oct 6, 2020

Talking offline with @vladzcloudius, he mentioned this issue:
scylladb/seastar#729

and this rule of thumb, based on the number of CPUs, which might work better:

0-4: mq
5-8: sq
9-: sq_split

I'll give it a try too
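A quick way to preview what that change would do before applying it (a sketch; it assumes perftune.py's --mode option accepts sq_split and that eth0 is the data NIC):

# print the tuning commands for sq_split without applying anything
/opt/scylladb/scripts/perftune.py --dry-run --tune net --nic eth0 --mode sq_split
# print the CPU mask Scylla itself should be restricted to in that mode
/opt/scylladb/scripts/perftune.py --tune net --nic eth0 --mode sq_split --get-cpu-mask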

@tarzanek
Contributor

tarzanek commented Oct 6, 2020

Hmm, and looking at my comment, I meant to say it's about cpuset, not the CPU scaling governor.

So basically /etc/scylla.d/cpuset.conf is empty, while for 16 vCPUs it should set the correct pairs:
https://github.com/scylladb/scylla/blob/master/dist/common/scripts/scylla_cpuset_setup
is called from
https://github.com/scylladb/scylla/blob/master/dist/common/scripts/scylla_sysconfig_setup#L81
so the fix needs to happen in perftune when it is called with --get-cpu-mask:

https://github.com/scylladb/seastar/blob/master/scripts/perftune.py#L1355
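For 16 vCPUs in sq_split mode, the expectation would be something like the following (a sketch; it assumes the OS/NIC get the first hyperthread pair, i.e. CPUs 0 and 8 on this topology):

# /etc/scylla.d/cpuset.conf (expected content, not what we currently get)
CPUSET="--cpuset 1-7,9-15"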

@tarzanek
Contributor

tarzanek commented Oct 6, 2020

On your GCP nodes, that command gives:

# /opt/scylladb/scripts/perftune.py --tune net --nic eth0 --get-cpu-mask
Traceback (most recent call last):
  File "/opt/scylladb/scripts/libexec/perftune.py", line 1326, in <module>
    args.cpu_mask = run_hwloc_calc(['all'])
  File "/opt/scylladb/scripts/libexec/perftune.py", line 58, in run_hwloc_calc
    return run_read_only_command(['hwloc-calc'] + prog_args).rstrip()
  File "/opt/scylladb/scripts/libexec/perftune.py", line 46, in run_read_only_command
    return __run_one_command(prog_args, stderr=stderr, check=check)
  File "/opt/scylladb/scripts/libexec/perftune.py", line 30, in __run_one_command
    proc = subprocess.Popen(prog_args, stdout = subprocess.PIPE, stderr = stderr)
  File "/opt/scylladb/python3/lib64/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/opt/scylladb/python3/lib64/python3.7/subprocess.py", line 1551, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'hwloc-calc': 'hwloc-calc'

@tarzanek
Contributor

tarzanek commented Oct 6, 2020

After yum install hwloc and rerunning scylla_setup I get:

cat /etc/scylla.d/cpuset.conf
# DO NO EDIT
# This file should be automatically configure by scylla_cpuset_setup
#
# CPUSET="--cpuset 0 --smp 1"
CPUSET="--cpuset 0-15 "

which is also not what we want.
The question here is which CPUs the perftune NIC tuning locks the RX queues to.

@tarzanek
Contributor

tarzanek commented Oct 6, 2020

dry run of perftune:

[root@mgr-backup-1tb-manager--db-node-8782ad1e-0-4 ~]# /opt/scylladb/scripts/perftune.py --dry-run --tune net --nic eth0
# irqbalance is not running
# Setting a physical interface eth0...
# Distributing all IRQs
echo 00000001 > /proc/irq/53/smp_affinity
echo 00000001 > /proc/irq/72/smp_affinity
echo 00000001 > /proc/irq/73/smp_affinity
echo 00000100 > /proc/irq/46/smp_affinity
echo 00000100 > /proc/irq/48/smp_affinity
echo 00000002 > /proc/irq/55/smp_affinity
echo 00000002 > /proc/irq/69/smp_affinity
echo 00000200 > /proc/irq/65/smp_affinity
echo 00000200 > /proc/irq/70/smp_affinity
echo 00000004 > /proc/irq/66/smp_affinity
echo 00000004 > /proc/irq/76/smp_affinity
echo 00000400 > /proc/irq/59/smp_affinity
echo 00000400 > /proc/irq/58/smp_affinity
echo 00000008 > /proc/irq/52/smp_affinity
echo 00000008 > /proc/irq/50/smp_affinity
echo 00000800 > /proc/irq/78/smp_affinity
echo 00000800 > /proc/irq/60/smp_affinity
echo 00000010 > /proc/irq/54/smp_affinity
echo 00000010 > /proc/irq/75/smp_affinity
echo 00001000 > /proc/irq/77/smp_affinity
echo 00001000 > /proc/irq/49/smp_affinity
echo 00000020 > /proc/irq/47/smp_affinity
echo 00000020 > /proc/irq/61/smp_affinity
echo 00002000 > /proc/irq/51/smp_affinity
echo 00002000 > /proc/irq/71/smp_affinity
echo 00000040 > /proc/irq/62/smp_affinity
echo 00000040 > /proc/irq/56/smp_affinity
echo 00004000 > /proc/irq/57/smp_affinity
echo 00004000 > /proc/irq/64/smp_affinity
echo 00000080 > /proc/irq/63/smp_affinity
echo 00000080 > /proc/irq/74/smp_affinity
echo 00008000 > /proc/irq/67/smp_affinity
echo 00008000 > /proc/irq/68/smp_affinity
echo 0000ffff > /sys/class/net/eth0/queues/rx-13/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-9/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-11/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-7/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-5/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-3/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-1/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-14/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-12/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-8/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-10/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-6/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-4/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-2/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-15/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-0/rps_cpus
sysctl net.core.rps_sock_flow_entries
# Setting net.core.rps_sock_flow_entries to 32768
sysctl -w net.core.rps_sock_flow_entries=32768
echo 2048 > /sys/class/net/eth0/queues/rx-13/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-9/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-11/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-7/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-5/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-3/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-1/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-14/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-12/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-8/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-10/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-6/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-4/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-2/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-15/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
# Enable ntuple filtering HW offload for eth0...
ethtool -K eth0 ntuple on
echo 00000001 > /sys/class/net/eth0/queues/tx-6/xps_cpus
echo 00000100 > /sys/class/net/eth0/queues/tx-12/xps_cpus
echo 00000002 > /sys/class/net/eth0/queues/tx-4/xps_cpus
echo 00000200 > /sys/class/net/eth0/queues/tx-10/xps_cpus
echo 00000004 > /sys/class/net/eth0/queues/tx-2/xps_cpus
echo 00000400 > /sys/class/net/eth0/queues/tx-0/xps_cpus
echo 00000008 > /sys/class/net/eth0/queues/tx-9/xps_cpus
echo 00000800 > /sys/class/net/eth0/queues/tx-15/xps_cpus
echo 00000010 > /sys/class/net/eth0/queues/tx-7/xps_cpus
echo 00001000 > /sys/class/net/eth0/queues/tx-13/xps_cpus
echo 00000020 > /sys/class/net/eth0/queues/tx-5/xps_cpus
echo 00002000 > /sys/class/net/eth0/queues/tx-11/xps_cpus
echo 00000040 > /sys/class/net/eth0/queues/tx-3/xps_cpus
echo 00004000 > /sys/class/net/eth0/queues/tx-1/xps_cpus
echo 00000080 > /sys/class/net/eth0/queues/tx-8/xps_cpus
echo 00008000 > /sys/class/net/eth0/queues/tx-14/xps_cpus
echo 4096 > /proc/sys/net/core/somaxconn
echo 4096 > /proc/sys/net/ipv4/tcp_max_syn_backlog

So we just lock RPS to all CPUs.
OK, the best thing would be to check the status of the machine when the timeouts happen, to see whether the above is a good decision or not.
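Something like this could be left running on a node to catch the state at the moment the timeouts start (a sketch; it assumes the sysstat package is installed for mpstat/iostat):

# per-CPU utilization, including %irq and %soft (softirq), once per second
mpstat -P ALL 1
# per-device utilization, queue size and await for the local NVMe drives
iostat -x 1
# softirq counters per CPU; the NET_RX/NET_TX rows are the interesting ones
watch -n1 cat /proc/softirqs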

@vladzcloudius
Contributor

As wrong as the configuration might have been, @fruch, I don't think that's the root cause of timeouts.

The reason is that you are pushing the rate to the maximum (since you don't set the rate target in the c-s command line) which gets the I/O to its limit pretty fast (see the amount of background writes).

On top of that LCS kicks in and starts consuming a lot of I/O budget (rates get as high as 1GB/sec in the compaction class):

image

Which adds insult to injury.

I think you need to pace c-s down here, or not get offended by a few timeouts:

image

@fruch
Contributor Author

fruch commented Oct 6, 2020

As wrong as the configuration might have been, @fruch, I don't think that's the root cause of timeouts.

The reason is that you are pushing the rate to the maximum (since you don't set the rate target in the c-s command line) which gets the I/O to its limit pretty fast (see the amount of background writes).

On top of that LCS kicks in and starts consuming a lot of I/O budget (rates get as high as 1GB/sec in the compaction class):

1.2 GB/s is much better than before; we failed at ~600 MB/s.
@vladzcloudius I'm not sure I'm following the reasoning: why could we throw the same amount of data at an i3.4xlarge in AWS and have Scylla throttle it quite nicely, without needing to throttle from the c-s side?
Are the local disks in GCE that slow?

Also, I don't know if I can call these "a few timeouts":
image

@tarzanek
this command isn't working; I think it's not using the correct PATH and the relocatable Python. From within scylla_prepare I think it does work:

/opt/scylladb/scripts/perftune.py --tune net --nic eth0 --get-cpu-mask

@tarzanek
Contributor

tarzanek commented Oct 6, 2020

I opened #7350, since running the tools used to work and now it doesn't; scripts will break, and automation scripts will too.

@fruch
Contributor Author

fruch commented Oct 7, 2020

@tarzanek @vladzcloudius

I've been playing around with it yesterday.

I've changed the command to limit the load from the c-s end:

cassandra-stress write cl=QUORUM n=1100200300 -schema keyspace=keyspace1 'replication(factor=3)' -port jmx=6868 -mode cql3 native -rate threads=64 throttle=40000/s -col 'size=FIXED(200) n=FIXED(5)' -pop seq=1..1100200300

One surge of timeouts was seen at the beginning, but after that it has been stable for a while:
http://34.75.108.171:3000/d/eJdUlZ5Mk/scylla-per-server-metrics-nemesis-master?orgId=1&refresh=30s&from=now-1h&to=now

Again, compared to runs on our AMIs in AWS, 40K is around half of what I remember we can get.

I downloaded the monitor to take a peek at our runs of 4.1 with 1 TB:
hydra investigate show-monitor 5db61478-f573-4aa4-99fe-dd5e2ec41b7

http://3.216.132.220:3000/d/manager-2-2/scylla-manager-metrics?orgId=1&refresh=30s
image

Latency is 15-40 ms, but Scylla seems to be throttling it quite well, out of the box.

Is the bottleneck just in a different place? I.e., in AWS it seems to be the CPU, and in GCE it's the disk?

@fruch
Contributor Author

fruch commented Oct 7, 2020

It was holding for ~3 hours, but reached ~360 GB on each node and then started failing.

@slivne
Contributor

slivne commented Oct 7, 2020 via email

@tarzanek
Contributor

tarzanek commented Oct 7, 2020

Softirqs (SIs) after using sq_split:

top - 18:31:27 up  1:42,  1 user,  load average: 6.95, 8.36, 8.79
Tasks: 576 total,   2 running, 574 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni, 70.8 id,  0.0 wa,  4.6 hi, 24.6 si,  0.0 st
%Cpu1  : 30.9 us, 20.6 sy,  0.0 ni, 41.2 id,  0.0 wa,  1.5 hi,  5.9 si,  0.0 st
%Cpu2  : 29.2 us, 21.5 sy,  0.0 ni, 41.5 id,  0.0 wa,  1.5 hi,  6.2 si,  0.0 st
%Cpu3  : 28.4 us, 23.9 sy,  0.0 ni, 40.3 id,  0.0 wa,  1.5 hi,  6.0 si,  0.0 st
%Cpu4  : 28.1 us, 20.3 sy,  0.0 ni, 45.3 id,  0.0 wa,  1.6 hi,  4.7 si,  0.0 st
%Cpu5  : 29.2 us, 20.0 sy,  0.0 ni, 44.6 id,  0.0 wa,  1.5 hi,  4.6 si,  0.0 st
%Cpu6  : 27.9 us, 25.0 sy,  0.0 ni, 41.2 id,  0.0 wa,  1.5 hi,  4.4 si,  0.0 st
%Cpu7  : 31.8 us, 18.2 sy,  0.0 ni, 42.4 id,  0.0 wa,  1.5 hi,  6.1 si,  0.0 st
%Cpu8  :  1.5 us,  3.1 sy,  0.0 ni, 76.9 id,  0.0 wa,  3.1 hi, 15.4 si,  0.0 st
%Cpu9  : 29.7 us, 20.3 sy,  0.0 ni, 43.8 id,  0.0 wa,  1.6 hi,  4.7 si,  0.0 st
%Cpu10 : 30.2 us, 20.6 sy,  0.0 ni, 41.3 id,  0.0 wa,  1.6 hi,  6.3 si,  0.0 st
%Cpu11 : 26.6 us, 21.9 sy,  0.0 ni, 43.8 id,  0.0 wa,  1.6 hi,  6.2 si,  0.0 st
%Cpu12 : 30.3 us, 22.7 sy,  0.0 ni, 40.9 id,  0.0 wa,  1.5 hi,  4.5 si,  0.0 st
%Cpu13 : 29.2 us, 21.5 sy,  0.0 ni, 43.1 id,  0.0 wa,  1.5 hi,  4.6 si,  0.0 st
%Cpu14 : 29.2 us, 21.5 sy,  0.0 ni, 43.1 id,  0.0 wa,  1.5 hi,  4.6 si,  0.0 st
%Cpu15 : 30.8 us, 21.5 sy,  0.0 ni, 41.5 id,  0.0 wa,  1.5 hi,  4.6 si,  0.0 st
MiB Mem : 104478.7 total,   2688.7 free,  97823.3 used,   3966.6 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   5675.7 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  48534 scylla    20   0   16.0t  94.6g  58696 R 692.4  92.7 565:45.12 scylla
    622 root       0 -20       0      0      0 I   1.5   0.0   0:02.26 kworker/9:1H-kblockd
    832 root      20   0  109412  11196   8432 S   1.5   0.0   0:00.57 systemd-udevd
  74425 root      20   0   65080   5300   4080 R   1.5   0.0   0:00.05 top
      1 root      20   0  245168  14152   9220 S   0.0   0.0   0:11.84 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.06 kthreadd
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp

So far most writes were retried; almost no timeouts.
Throttled at 40k; most probably 50k, 60k or even 70k might work. Threads might be decreased to 4 x 14 = 56 to get optimum usage.
(And if, on the loader, you also do CPU/NIC offloading in a similar way (perftune can be used) and set affinity for c-s to the leftover CPUs, the latencies on the client will be even better; see the sketch below.)
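A minimal sketch of that loader-side tuning (assuming perftune.py is available on the loader, eth0 is its NIC, and the CPU list comes out the same as on the DB nodes; in practice the list should be taken from --get-cpu-mask on that box):

# tune the loader NIC the same way as on the DB nodes
sudo /opt/scylladb/scripts/perftune.py --tune net --nic eth0
# keep cassandra-stress off the CPUs now handling NIC interrupts
taskset -c 1-7,9-15 cassandra-stress write cl=QUORUM ... -rate threads=56 throttle=40000/s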

@fruch
Contributor Author

fruch commented Oct 8, 2020

Update
http://35.231.33.180:3000/d/SfffwMcMk/scylla-per-server-metrics-nemesis-master?orgId=1&from=1602088311871&to=1602112341050

After patching the cpuset to 1-7,9-15, a run throttled at 40K ops by c-s lasted ~5.5 hours and then failed on a timeout "storm".
It filled up ~600 GB on each node in that time.

For comparison we had one run without throttling; it lasted ~1.5 hours and did get to 60K-70K ops, as @tarzanek estimated:
http://35.231.218.186:3000/d/ieTTV75Gz/scylla-per-server-metrics-nemesis-master?orgId=1&from=1602103316079&to=1602113512352

In both cases compaction was starting to rise when the "storm" started.

A note on c-s: it ignores timeouts and retries 10 times; it fails when a request fails 10 times in a row. In our tests we rarely ignore all errors in c-s.

@tarzanek
Contributor

tarzanek commented Oct 8, 2020

So the next step is using just 56 threads, and eventually trying to increase the limit to 50k.

The only other optimization would be the kernel:
check and compare the kernel in the current CentOS 8 you use against some Google-blessed kernels (Ubuntu 18 or Ubuntu 20).
That said, in the above the SIs weren't that high, so the spike was either created by the number of threads or by some spike with busy neighbours?

@slivne
Contributor

slivne commented Oct 8, 2020

@fruch which kernel version? We know that the GCE image is not using the CentOS 8-provided one; we had to change it to get better support for the local drives.

@slivne
Contributor

slivne commented Oct 8, 2020

@bentsi ^^

@fruch
Contributor Author

fruch commented Oct 8, 2020

@fruch which kernel version - we know that the gce image is not using the centos8 provided one - we had to change it to get a better support for the local drives

4.18.0-193.14.2.el8_2.x86_64

@bentsi do you think we could build a GCE image based on 4.1.7? (Is the one you have based on master?)

@avikivity
Member

@fruch which kernel version - we know that the gce image is not using the centos8 provided one - we had to change it to get a better support for the local drives

4.18.0-193.14.2.el8_2.x86_64

@bentsi you think we could build a gce image based on 4.1.7 (is the one you have is base on master ?)

That kernel contains the google nvme fixes.

@fruch
Contributor Author

fruch commented Oct 8, 2020

@avi @slivne
Do you need anything from those clusters? Anything else to check on them, or can I kill them?

Should I try Ubuntu 18 or Ubuntu 20? Or try out the GCE image with the ml-kernel? (Or both, or all 3 options 😭)

One more question:
For the backup part we were actually aiming to test, should we:

  1. use NullCompaction for the ingest of the data (or nodetool disablecompaction, if available); see the sketch below
    or
  2. ignore all errors in c-s and keep going, meaning we might have a bit less data than planned?
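For option 1, a sketch of the two ways to do it (assuming the keyspace1/standard1 schema that c-s creates; whether NullCompactionStrategy is accepted depends on the Scylla version):

# switch the table to a no-op compaction strategy for the ingest...
cqlsh -e "ALTER TABLE keyspace1.standard1 WITH compaction = {'class': 'NullCompactionStrategy'};"
# ...or simply pause automatic compaction on the keyspace
nodetool disableautocompaction keyspace1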

@vladzcloudius
Contributor

vladzcloudius commented Oct 8, 2020

@fruch If you see timeouts - this means you are overloading the cluster.
In this case you should reduce the rate - not the concurrency which is already pretty low.

The system is able to pull out as much as it can. AWS and GCE are not the same.

You see timeouts later in the process because of heavier compactions: you use LCS, which has heavy write amplification.

Unless you are trying to estimate the performance of LCS on GCE, you should take this into account and apply lower rates.

@fruch
Contributor Author

fruch commented Oct 8, 2020

I took the initial image we have for GCE (after we fixed a few SCT issues):
https://www.googleapis.com/compute/v1/projects/skilled-adapter-452/global/images/scylla-666-development-0-20200816-7e01ae089e1

It has kernel 5.8.2-1.el8.elrepo.x86_64,

but it seems to be much, much worse:

http://104.196.186.175:3000/d/CHwBAIcGz/scylla-per-server-metrics-nemesis-master?orgId=1&refresh=30s&from=now-1h&to=now

The new monitor (branch-3.5) is cooler, but we get only ~8K writes/s:
image

This one doesn't have the patches for the hardcoded io_properties.yaml values, nor the patches we did to change the cpuset based on sq_split as suggested before.

@fruch
Contributor Author

fruch commented Oct 8, 2020

@fruch If you see timeouts - this means you are overloading the cluster.
In this case you should reduce the rate - not the concurrency which is already pretty low.

The system is able to pull out as much as it can. AWS and GCE are not the same.

You see timeout later in the process because of heavier compactions - you use LCS, which has a heavy write amplification.

Unless you try to estimate the performance of LCS on GCE you should take this into an account and apply lower rates.

I've ditched LCS; it helped a bit, but compaction still takes a toll.
I guess I don't have too much to compare to. Should we compare to the results we are getting in AWS, or at least not expect to get something close?

@fruch
Contributor Author

fruch commented Oct 11, 2020

Update:
@bentsi will build me a new GCE image based on latest master to try again, since the one we had didn't have very good results (we only got to 8K requests a second).

Trying out Ubuntu 20.04, I ran into this small issue: #7383
Now trying Ubuntu 18.04 (kernel 5.4.0-1025-gcp), with a patch to use sq_split as suggested by @vladzcloudius and @tarzanek.

@fruch
Contributor Author

fruch commented Oct 12, 2020

I tried one approach of stopping compaction altogether using NullCompaction, and it seems like it's much more stable that way.

It's almost done writing the 1 TB:
http://34.73.252.148:3000/d/CGwxZ1cMz/scylla-per-server-metrics-nemesis-master?orgId=1&from=now-6h&to=now&refresh=30s

I'm writing from 2 loaders with an 80K rate limit, and writes to disk are at 130-190 MB/s. (When compaction was on, we got to almost 1.2 GB/s of writes to disk each time the timeouts started.)
image

Something in Scylla's disk limiting/throttling isn't working as we expect (and used to see in AWS),
or the write_bandwidth: 1468006400 we hardcode based on the iotune runs is too high?

@tarzanek
Contributor

The hardcoded values you take from the patch I sent should be what Google guarantees
(and I verified with multiple iotune runs and lowered them to match Google),
so if THOSE values are too high, then this is a Google SLA breach ...

@slivne
Contributor

slivne commented Oct 15, 2020

It's not the first time we have issues with Google local drives / drivers.

Let's try to verify whether the bandwidth shown holds for a long duration (not just the iotune duration) and whether there is a bug.

@roydahan / @avikivity which tool should we use to verify that the local disks provide the guarantees over a long duration (a couple of days)?

@slivne slivne added the GCE label Oct 15, 2020
@vladzcloudius
Contributor

Switching nvme write_cache to write-through made the write bandwidth stable at ~1.6GB/s all the time.

That's interesting. 5.11 kernel, right, @xemul?
I'll prepare a perftune.py option to disable write_cache.
@tarzanek I'll ping you when it's ready.
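For anyone who wants to try it by hand before the perftune.py option lands, a sketch (run as root; it assumes the data disks are the nvme* block devices, and the sysfs setting does not survive a reboot):

# show the current cache mode of every local NVMe disk
grep . /sys/block/nvme*/queue/write_cache
# switch them all to write-through
for dev in /sys/block/nvme*/queue/write_cache; do
    echo "write through" > "$dev"
done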

@xemul
Contributor

xemul commented Mar 18, 2021

@vladzcloudius , no this time it was ubuntu 20.04 default one.

@vladzcloudius
Contributor

vladzcloudius commented Mar 18, 2021

@vladzcloudius , no this time it was ubuntu 20.04 default one.

And do we have a "rate hole" with write_cache with this kernel, @xemul ?

@vladzcloudius
Contributor

Switching nvme write_cache to write-through made the write bandwidth stable at ~1.6GB/s all the time.

That's interesting. 5.11 kernel, right, @xemul?
I'll prepare a perftune.py option to disable write_cache.
@tarzanek I'll ping you when it's ready.

scylladb/seastar#881
@xemul @avikivity @fruch @tarzanek FYI ^

@xemul
Contributor

xemul commented Mar 19, 2021

And do we have a "rate hole" with write_cache with this kernel, @xemul ?

@vladzcloudius, yes. I've seen it on pretty much every kernel except for some ancient one, where the performance was just bad (this "hole" spanned all the time).

@avikivity
Member

I get

Going to run './sct.py investigate show-monitor 36169f91-5b9e-4435-8fe9-8b714195d863'...
Error: mkdir `secrets`: Permission denied: OCI permission denied

I use podman instead of docker.

You should definitely ask for podman support in SCT to be on the roadmap.

Please add podman support in SCT to the roadmap.

@ChrisHampu

ChrisHampu commented Mar 24, 2021

I was asked to share this from the Scylla Slack. Chiming in because I ran into timeout issues similar to those experienced in this thread, which may have been solved by the write-through config change.

Scylla version: 4.2.3 (due to #8032)
Kernel: 5.4.0-1030-gke #32~18.04.1-Ubuntu SMP Tue Nov 17 16:43:22 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Previous setup
Instance: 6x n2-custom-10-24576 on GKE with 2 local SCSI SSDs with default cache config
After pushing ~30k write op/s over an extended period, I would see severe timeouts and latency spikes at high frequency, like so:
Screenshot_2021-03-17 Advanced - Grafana(4)

New setup
Instance: 6x n2-highmem-8 on GKE with 8 local SCSI SSDs with write through cache
After 12+ hours of constant writes pushing 60k write op/s I captured not a single timeout or spike in average write latency.
image (3)

The n2-highmem-8 GKE instance wasn't detected and was tuned to the following:
read_iops: 720319, read_bandwidth: 2568414720, write_iops: 400339, write_bandwidth: 1639328512
I did not manage to save the tuning from the custom instance.

While multiple variables changed between the setups (CPU, memory, number of disks), so it's not exactly apples to apples, it's worth investigating further whether the write cache made the major difference.
Also note that cassandra-stress wasn't used here, in favor of our own custom workload, so this benchmark is purely anecdotal. The workload was effectively the same between runs, with 30-45 distributed workers pushing small 1 KB writes.

@slivne slivne assigned penberg and unassigned xemul Apr 14, 2021
@slivne
Contributor

slivne commented Apr 14, 2021

The seastar patch was merged, so there is now a way to disable the write-back cache, but we need to add the setting when creating the devices in the GCE image.

@penberg
Contributor

penberg commented Apr 14, 2021

I am trying to understand the scope of what's still needed. We need to pass --write-back-cache=false to perftune.py on GCE, right? Which instances? Where's the right place to wire this up? @syuu1228 @fruch @tarzanek

@vladzcloudius
Contributor

I am trying to understand the scope of what's still needed. We need to pass --write-back-cache=false to perftune.py on GCE, right?

Correct, @penberg
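Presumably something along these lines (a sketch; it assumes the --write-back-cache option from scylladb/seastar#881 and the standard data directory):

# disable the write-back cache on the disks backing the Scylla data directory
/opt/scylladb/scripts/perftune.py --tune disks --dir /var/lib/scylla --write-back-cache=false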

@slivne
Contributor

slivne commented Apr 19, 2021

@penberg ping

penberg added a commit to penberg/scylla that referenced this issue Apr 21, 2021
This adds support for disabling writeback cache by adding a new
DISABLE_WRITEBACK_CACHE option to "scylla-server" sysconfig file, which
makes the "scylla_prepare" script (that is run before Scylla starts up)
call perftune.py with appropriate parameters. Also add a
"--disable-writeback-cache" option to "scylla_sysconfig_setup", which
can be called by scylla-machine image scripts, for example.

Refs: scylladb#7341
penberg added a commit to scylladb/scylla-machine-image that referenced this issue Apr 21, 2021
As outlined in scylladb/scylladb#7341, we need to disable writeback cache
on GCE for better performance.
avikivity pushed a commit that referenced this issue Apr 21, 2021
This adds support for disabling writeback cache by adding a new
DISABLE_WRITEBACK_CACHE option to "scylla-server" sysconfig file, which
makes the "scylla_prepare" script (that is run before Scylla starts up)
call perftune.py with appropriate parameters. Also add a
"--disable-writeback-cache" option to "scylla_sysconfig_setup", which
can be called by scylla-machine image scripts, for example.

Refs: #7341
Tests: dtest (next-gating)

Closes #8526
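Based on the commit message, a sketch of how an image script could turn this on (the variable value and sysconfig path are assumptions; the merged commit is authoritative):

# let the helper write the sysconfig entry...
/opt/scylladb/scripts/scylla_sysconfig_setup --disable-writeback-cache
# ...which ends up as something like this, picked up by scylla_prepare before Scylla starts
grep DISABLE_WRITEBACK_CACHE /etc/sysconfig/scylla-server
# DISABLE_WRITEBACK_CACHE=yes   (value assumed)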
@tarzanek
Contributor

Just on the testing side of this (since I have the graphs of patched and unpatched systems in front of me):
the customer runs a JanusGraph Spark job every X minutes, and the drop in disk read speed (10 GB/s) with the write-back cache disabled is only visible when they write; picture attached. (I think we even go higher than the max speed of 9,360 MB/s, so I'm wondering whether io_properties is even being obeyed ;-) )

n2_highmem_64_with24_nvmes_nowritebackcache

Without the patch the read maximum seems to be stuck around 7.5 GB/s, so this was very good sleuthing, Pavel, thank you!

bentsi pushed a commit to scylladb/scylla-machine-image that referenced this issue Apr 25, 2021
As outlined in scylladb/scylladb#7341, we need to disable writeback cache
on GCE for better performance.
@roydahan

roydahan commented May 20, 2021

Last run on master looks much better.

https://snapshot.raintank.io/dashboard/snapshot/Yyepci6dFyB1XPxbwWad6GbH4uNBHgS7

However, some stress runs are still failing; investigating why.

@roydahan

The failing stress commands are not related.
I think we can close this issue.

@slivne FYI

@slivne slivne closed this as completed May 27, 2021
denesb pushed a commit to denesb/scylla that referenced this issue Oct 20, 2021
This adds support for disabling writeback cache by adding a new
DISABLE_WRITEBACK_CACHE option to "scylla-server" sysconfig file, which
makes the "scylla_prepare" script (that is run before Scylla starts up)
call perftune.py with appropriate parameters. Also add a
"--disable-writeback-cache" option to "scylla_sysconfig_setup", which
can be called by scylla-machine image scripts, for example.

Refs: scylladb#7341
Tests: dtest (next-gating)

Closes scylladb#8526

(cherry picked from commit 0ddbed2)

Fixes scylladb#1784.