
cassandra-stress starts to timeout and fail after ~20min with GCE local disks #7341

Closed
fruch opened this issue Oct 6, 2020 · 89 comments
Labels: symptom/performance Issues causing performance problems
@fruch
Contributor

fruch commented Oct 6, 2020

Installation details
Scylla version (or git commit hash): 4.1.7 (4.1.7-0.20200918.2251a1c577)
Cluster size: 4
OS (RHEL/CentOS/Ubuntu/AWS AMI): Centos8/Ubuntu1804

Hardware details (for performance issues)
Platform (physical/VM/cloud instance type/docker): GCE (n1-highmem-16)
Hardware: sockets= cores= hyperthreading= memory=
Disks: (SSD/HDD, count) Nvme 8/6 disks

Summary

We are trying to fill up a GCE setup with 1 TB of data, for testing backup to Google Storage.

This is the stress command we are using:

cassandra-stress write cl=QUORUM n=1100200300 -schema 'replication(factor=3) compaction(strategy=LeveledCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=150 -col 'size=FIXED(200) n=FIXED(5)' -pop seq=1..1100200300

After ~20 minutes we start getting timeouts that fail the load with the following error:

com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during SIMPLE write query at consistency QUORUM (2 replica were required but only 1 acknowledged the write)

Logs (CentOS 8)

Further things we tried

We also tried this with Ubuntu 18.04, suspecting it might be a kernel-related issue, but got similar results.

Talking with @tarzanek, we looked into io_properties.yaml; he suggested lowering the values, to let Scylla throttle I/O a bit.
This is the output of iotune, while measuring:

disks:
  - mountpoint: /var/lib/scylla
    read_iops: 721717
    read_bandwidth: 2949972224
    write_iops: 400594
    write_bandwidth: 1504807040

And this is the io_properties.yaml we overwrite it with (these values are not part of master; they are calculated based on the number of local disks):

disks:
  - mountpoint: /var/lib/scylla
    read_iops: 680000
    read_bandwidth: 2778726400
    write_iops: 360000
    write_bandwidth: 1468006400

After applying the suggested io_properties.yaml, we are still facing these issues.
Monitor: http://35.196.159.20:3000/d/Z3dCVz5Mk/scylla-per-server-metrics-nemesis-master?orgId=1&from=1601970238340&to=1601972849329
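For reference, this is roughly how we apply the override on a node (a minimal sketch; it assumes the default /etc/scylla.d/io_properties.yaml location that scylla_io_setup writes to, and that a restart is acceptable):

# overwrite the measured values with the lower, hand-picked ones
cat > /etc/scylla.d/io_properties.yaml <<'EOF'
disks:
  - mountpoint: /var/lib/scylla
    read_iops: 680000
    read_bandwidth: 2778726400
    write_iops: 360000
    write_bandwidth: 1468006400
EOF
# restart so the new limits are picked up by the I/O scheduler
systemctl restart scylla-server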

@fruch fruch added the symptom/performance label Oct 6, 2020
@tarzanek
Contributor

tarzanek commented Oct 6, 2020

So we have the reason: the nodes run with mq and share CPUs with the OS, so NIC traffic and softirqs (SIs) interfere.

BTW, neither of these tuning steps runs by default:

Do you want to enable Network Interface Card (NIC) and disk(s) optimization?
Yes - optimize the NIC queue and disks settings. Selecting Yes greatly improves performance. No - skip this step.
[yes/NO]
Do you want to set the CPU scaling governor to Performance level on boot?
Yes - sets the CPU scaling governor to performance level. No - skip this step.
[YES/no]
This computer doesn't supported CPU scaling configuration.

@fruch
Contributor Author

fruch commented Oct 6, 2020

so we have the reason, the nodes run with mq and share cpus with OS, so NIC traffic and SIs interfere

btw. neither of the tuning runs by default:

Do you want to enable Network Interface Card (NIC) and disk(s) optimization?
Yes - optimize the NIC queue and disks settings. Selecting Yes greatly improves performance. No - skip this step.
[yes/NO]
Do you want to set the CPU scaling governor to Performance level on boot?
Yes - sets the CPU scaling governor to performance level. No - skip this step.
[YES/no]
This computer doesn't supported CPU scaling configuration.

--setup-nic-and-disks: seems like we need to add this flag to scylla_setup; we weren't passing it in SCT so far.
We'll try again.

As for CPU scaling, I don't seem to be able to turn it on. I'm not even sure we can do that on GCE; I didn't find any information about it.

@fruch
Contributor Author

fruch commented Oct 6, 2020

Talking offline with @vladzcloudius, he mentioned this issue:
scylladb/seastar#729

and this rule of thumb, based on the number of CPUs, which might work better:

0-4: mq
5-8: sq
9-: sq_split

I'll give it a try too
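A quick way to preview what that change would do before applying it (a sketch; it assumes perftune.py's --mode option accepts sq_split and that eth0 is the data NIC):

# print the tuning commands for sq_split without applying anything
/opt/scylladb/scripts/perftune.py --dry-run --tune net --nic eth0 --mode sq_split
# print the CPU mask Scylla itself should be restricted to in that mode
/opt/scylladb/scripts/perftune.py --tune net --nic eth0 --mode sq_split --get-cpu-mask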

@tarzanek
Contributor

tarzanek commented Oct 6, 2020

Hmm, and looking at my comment, I meant to say it's about cpuset, not the CPU scaling governor.

So basically /etc/scylla.d/cpuset.conf is empty, while for 16 vCPUs it should set the correct pairs:
https://github.com/scylladb/scylla/blob/master/dist/common/scripts/scylla_cpuset_setup
is called from
https://github.com/scylladb/scylla/blob/master/dist/common/scripts/scylla_sysconfig_setup#L81
so the fix needs to happen in perftune when it is called with --get-cpu-mask:

https://github.com/scylladb/seastar/blob/master/scripts/perftune.py#L1355
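For 16 vCPUs in sq_split mode, the expectation would be something like the following (a sketch; it assumes the OS/NIC get the first hyperthread pair, i.e. CPUs 0 and 8 on this topology):

# /etc/scylla.d/cpuset.conf (expected content, not what we currently get)
CPUSET="--cpuset 1-7,9-15"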

@tarzanek
Contributor

tarzanek commented Oct 6, 2020

On your GCP nodes, that command gives:

# /opt/scylladb/scripts/perftune.py --tune net --nic eth0 --get-cpu-mask
Traceback (most recent call last):
  File "/opt/scylladb/scripts/libexec/perftune.py", line 1326, in <module>
    args.cpu_mask = run_hwloc_calc(['all'])
  File "/opt/scylladb/scripts/libexec/perftune.py", line 58, in run_hwloc_calc
    return run_read_only_command(['hwloc-calc'] + prog_args).rstrip()
  File "/opt/scylladb/scripts/libexec/perftune.py", line 46, in run_read_only_command
    return __run_one_command(prog_args, stderr=stderr, check=check)
  File "/opt/scylladb/scripts/libexec/perftune.py", line 30, in __run_one_command
    proc = subprocess.Popen(prog_args, stdout = subprocess.PIPE, stderr = stderr)
  File "/opt/scylladb/python3/lib64/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/opt/scylladb/python3/lib64/python3.7/subprocess.py", line 1551, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'hwloc-calc': 'hwloc-calc'

@tarzanek
Contributor

tarzanek commented Oct 6, 2020

After yum install hwloc and rerunning scylla_setup I get:

cat /etc/scylla.d/cpuset.conf
# DO NO EDIT
# This file should be automatically configure by scylla_cpuset_setup
#
# CPUSET="--cpuset 0 --smp 1"
CPUSET="--cpuset 0-15 "

which is also not what we want.
The question here is which CPUs the perftune NIC tuning locks the RX queues to.

@tarzanek
Contributor

tarzanek commented Oct 6, 2020

dry run of perftune:

[root@mgr-backup-1tb-manager--db-node-8782ad1e-0-4 ~]# /opt/scylladb/scripts/perftune.py --dry-run --tune net --nic eth0
# irqbalance is not running
# Setting a physical interface eth0...
# Distributing all IRQs
echo 00000001 > /proc/irq/53/smp_affinity
echo 00000001 > /proc/irq/72/smp_affinity
echo 00000001 > /proc/irq/73/smp_affinity
echo 00000100 > /proc/irq/46/smp_affinity
echo 00000100 > /proc/irq/48/smp_affinity
echo 00000002 > /proc/irq/55/smp_affinity
echo 00000002 > /proc/irq/69/smp_affinity
echo 00000200 > /proc/irq/65/smp_affinity
echo 00000200 > /proc/irq/70/smp_affinity
echo 00000004 > /proc/irq/66/smp_affinity
echo 00000004 > /proc/irq/76/smp_affinity
echo 00000400 > /proc/irq/59/smp_affinity
echo 00000400 > /proc/irq/58/smp_affinity
echo 00000008 > /proc/irq/52/smp_affinity
echo 00000008 > /proc/irq/50/smp_affinity
echo 00000800 > /proc/irq/78/smp_affinity
echo 00000800 > /proc/irq/60/smp_affinity
echo 00000010 > /proc/irq/54/smp_affinity
echo 00000010 > /proc/irq/75/smp_affinity
echo 00001000 > /proc/irq/77/smp_affinity
echo 00001000 > /proc/irq/49/smp_affinity
echo 00000020 > /proc/irq/47/smp_affinity
echo 00000020 > /proc/irq/61/smp_affinity
echo 00002000 > /proc/irq/51/smp_affinity
echo 00002000 > /proc/irq/71/smp_affinity
echo 00000040 > /proc/irq/62/smp_affinity
echo 00000040 > /proc/irq/56/smp_affinity
echo 00004000 > /proc/irq/57/smp_affinity
echo 00004000 > /proc/irq/64/smp_affinity
echo 00000080 > /proc/irq/63/smp_affinity
echo 00000080 > /proc/irq/74/smp_affinity
echo 00008000 > /proc/irq/67/smp_affinity
echo 00008000 > /proc/irq/68/smp_affinity
echo 0000ffff > /sys/class/net/eth0/queues/rx-13/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-9/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-11/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-7/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-5/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-3/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-1/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-14/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-12/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-8/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-10/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-6/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-4/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-2/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-15/rps_cpus
echo 0000ffff > /sys/class/net/eth0/queues/rx-0/rps_cpus
sysctl net.core.rps_sock_flow_entries
# Setting net.core.rps_sock_flow_entries to 32768
sysctl -w net.core.rps_sock_flow_entries=32768
echo 2048 > /sys/class/net/eth0/queues/rx-13/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-9/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-11/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-7/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-5/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-3/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-1/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-14/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-12/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-8/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-10/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-6/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-4/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-2/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-15/rps_flow_cnt
echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
# Enable ntuple filtering HW offload for eth0...
ethtool -K eth0 ntuple on
echo 00000001 > /sys/class/net/eth0/queues/tx-6/xps_cpus
echo 00000100 > /sys/class/net/eth0/queues/tx-12/xps_cpus
echo 00000002 > /sys/class/net/eth0/queues/tx-4/xps_cpus
echo 00000200 > /sys/class/net/eth0/queues/tx-10/xps_cpus
echo 00000004 > /sys/class/net/eth0/queues/tx-2/xps_cpus
echo 00000400 > /sys/class/net/eth0/queues/tx-0/xps_cpus
echo 00000008 > /sys/class/net/eth0/queues/tx-9/xps_cpus
echo 00000800 > /sys/class/net/eth0/queues/tx-15/xps_cpus
echo 00000010 > /sys/class/net/eth0/queues/tx-7/xps_cpus
echo 00001000 > /sys/class/net/eth0/queues/tx-13/xps_cpus
echo 00000020 > /sys/class/net/eth0/queues/tx-5/xps_cpus
echo 00002000 > /sys/class/net/eth0/queues/tx-11/xps_cpus
echo 00000040 > /sys/class/net/eth0/queues/tx-3/xps_cpus
echo 00004000 > /sys/class/net/eth0/queues/tx-1/xps_cpus
echo 00000080 > /sys/class/net/eth0/queues/tx-8/xps_cpus
echo 00008000 > /sys/class/net/eth0/queues/tx-14/xps_cpus
echo 4096 > /proc/sys/net/core/somaxconn
echo 4096 > /proc/sys/net/ipv4/tcp_max_syn_backlog

So we just lock RPS to all CPUs.
OK, the best thing would be to check the status of the machine when the timeouts happen, to see whether the above is a good decision or not.
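Something like this could be left running on a node to catch the state at the moment the timeouts start (a sketch; it assumes the sysstat package is installed for mpstat/iostat):

# per-CPU utilization, including %irq and %soft (softirq), once per second
mpstat -P ALL 1
# per-device utilization, queue size and await for the local NVMe drives
iostat -x 1
# softirq counters per CPU; the NET_RX/NET_TX rows are the interesting ones
watch -n1 cat /proc/softirqs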

@vladzcloudius
Contributor

As wrong as the configuration might have been, @fruch, I don't think that's the root cause of timeouts.

The reason is that you are pushing the rate to the maximum (since you don't set the rate target in the c-s command line) which gets the I/O to its limit pretty fast (see the amount of background writes).

On top of that LCS kicks in and starts consuming a lot of I/O budget (rates get as high as 1GB/sec in the compaction class):

image

Which adds insult to injury.

I think you need to pace c-s down here, or not get offended by a few timeouts:

image

@fruch
Contributor Author

fruch commented Oct 6, 2020

As wrong as the configuration might have been, @fruch, I don't think that's the root cause of timeouts.

The reason is that you are pushing the rate to the maximum (since you don't set the rate target in the c-s command line) which gets the I/O to its limit pretty fast (see the amount of background writes).

On top of that LCS kicks in and starts consuming a lot of I/O budget (rates get as high as 1GB/sec in the compaction class):

1.2 GB/s is much better than before; we failed at ~600 MB/s.
@vladzcloudius I'm not sure I'm following the reasoning: why could we throw the same amount of data at an i3.4xlarge in AWS and have Scylla throttle it quite nicely, without needing to throttle from the c-s side?
Are the local disks in GCE that slow?

Also, I don't know if I can call these "a few timeouts":
image

@tarzanek
this command isn't working; I think it's not using the correct PATH and the relocatable Python. From within scylla_prepare I think it does work:

/opt/scylladb/scripts/perftune.py --tune net --nic eth0 --get-cpu-mask

@tarzanek
Contributor

tarzanek commented Oct 6, 2020

I opened #7350, since running the tools used to work and now it doesn't; scripts will break, and automation scripts will too.

@fruch
Contributor Author

fruch commented Oct 7, 2020

@tarzanek @vladzcloudius

I've been playing around with it yesterday.

I've changed the command to limit the load from the c-s end:

cassandra-stress write cl=QUORUM n=1100200300 -schema keyspace=keyspace1 'replication(factor=3)' -port jmx=6868 -mode cql3 native -rate threads=64 throttle=40000/s -col 'size=FIXED(200) n=FIXED(5)' -pop seq=1..1100200300

One surge of timeouts was seen at the beginning, but after that it has been stable for a while:
http://34.75.108.171:3000/d/eJdUlZ5Mk/scylla-per-server-metrics-nemesis-master?orgId=1&refresh=30s&from=now-1h&to=now

Again, compared to runs on our AMIs in AWS, 40K is around half of what I remember we can get.

I downloaded the monitor to take a peek at our runs of 4.1 with 1 TB:
hydra investigate show-monitor 5db61478-f573-4aa4-99fe-dd5e2ec41b7

http://3.216.132.220:3000/d/manager-2-2/scylla-manager-metrics?orgId=1&refresh=30s
image

Latency is 15-40 ms, but Scylla seems to be throttling it quite well, out of the box.

Is the bottleneck just in a different place? I.e., in AWS it seems to be the CPU, and in GCE it's the disk?

@fruch
Contributor Author

fruch commented Oct 7, 2020

It was holding for ~3 hours, but reached ~360 GB on each node and then started failing.

@slivne
Contributor

slivne commented Oct 7, 2020 via email

@tarzanek
Contributor

tarzanek commented Oct 7, 2020

Softirqs (SIs) after using sq_split:

top - 18:31:27 up  1:42,  1 user,  load average: 6.95, 8.36, 8.79
Tasks: 576 total,   2 running, 574 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni, 70.8 id,  0.0 wa,  4.6 hi, 24.6 si,  0.0 st
%Cpu1  : 30.9 us, 20.6 sy,  0.0 ni, 41.2 id,  0.0 wa,  1.5 hi,  5.9 si,  0.0 st
%Cpu2  : 29.2 us, 21.5 sy,  0.0 ni, 41.5 id,  0.0 wa,  1.5 hi,  6.2 si,  0.0 st
%Cpu3  : 28.4 us, 23.9 sy,  0.0 ni, 40.3 id,  0.0 wa,  1.5 hi,  6.0 si,  0.0 st
%Cpu4  : 28.1 us, 20.3 sy,  0.0 ni, 45.3 id,  0.0 wa,  1.6 hi,  4.7 si,  0.0 st
%Cpu5  : 29.2 us, 20.0 sy,  0.0 ni, 44.6 id,  0.0 wa,  1.5 hi,  4.6 si,  0.0 st
%Cpu6  : 27.9 us, 25.0 sy,  0.0 ni, 41.2 id,  0.0 wa,  1.5 hi,  4.4 si,  0.0 st
%Cpu7  : 31.8 us, 18.2 sy,  0.0 ni, 42.4 id,  0.0 wa,  1.5 hi,  6.1 si,  0.0 st
%Cpu8  :  1.5 us,  3.1 sy,  0.0 ni, 76.9 id,  0.0 wa,  3.1 hi, 15.4 si,  0.0 st
%Cpu9  : 29.7 us, 20.3 sy,  0.0 ni, 43.8 id,  0.0 wa,  1.6 hi,  4.7 si,  0.0 st
%Cpu10 : 30.2 us, 20.6 sy,  0.0 ni, 41.3 id,  0.0 wa,  1.6 hi,  6.3 si,  0.0 st
%Cpu11 : 26.6 us, 21.9 sy,  0.0 ni, 43.8 id,  0.0 wa,  1.6 hi,  6.2 si,  0.0 st
%Cpu12 : 30.3 us, 22.7 sy,  0.0 ni, 40.9 id,  0.0 wa,  1.5 hi,  4.5 si,  0.0 st
%Cpu13 : 29.2 us, 21.5 sy,  0.0 ni, 43.1 id,  0.0 wa,  1.5 hi,  4.6 si,  0.0 st
%Cpu14 : 29.2 us, 21.5 sy,  0.0 ni, 43.1 id,  0.0 wa,  1.5 hi,  4.6 si,  0.0 st
%Cpu15 : 30.8 us, 21.5 sy,  0.0 ni, 41.5 id,  0.0 wa,  1.5 hi,  4.6 si,  0.0 st
MiB Mem : 104478.7 total,   2688.7 free,  97823.3 used,   3966.6 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   5675.7 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  48534 scylla    20   0   16.0t  94.6g  58696 R 692.4  92.7 565:45.12 scylla
    622 root       0 -20       0      0      0 I   1.5   0.0   0:02.26 kworker/9:1H-kblockd
    832 root      20   0  109412  11196   8432 S   1.5   0.0   0:00.57 systemd-udevd
  74425 root      20   0   65080   5300   4080 R   1.5   0.0   0:00.05 top
      1 root      20   0  245168  14152   9220 S   0.0   0.0   0:11.84 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.06 kthreadd
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp

So far most writes were retried; almost no timeouts.
Throttled at 40k; most probably 50k, 60k or even 70k might work. Threads might be decreased to 4 x 14 = 56 to get optimum usage.
(And if, on the loader, you also do CPU/NIC offloading in a similar way (perftune can be used) and set affinity for c-s to the leftover CPUs, the latencies on the client will be even better; see the sketch below.)
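A minimal sketch of that loader-side tuning (assuming perftune.py is available on the loader, eth0 is its NIC, and the CPU list comes out the same as on the DB nodes; in practice the list should be taken from --get-cpu-mask on that box):

# tune the loader NIC the same way as on the DB nodes
sudo /opt/scylladb/scripts/perftune.py --tune net --nic eth0
# keep cassandra-stress off the CPUs now handling NIC interrupts
taskset -c 1-7,9-15 cassandra-stress write cl=QUORUM ... -rate threads=56 throttle=40000/s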

@fruch
Contributor Author

fruch commented Oct 8, 2020

Update
http://35.231.33.180:3000/d/SfffwMcMk/scylla-per-server-metrics-nemesis-master?orgId=1&from=1602088311871&to=1602112341050

After patching the cpuset to 1-7,9-15, a run throttled at 40K ops by c-s lasted ~5.5 hours and then failed on a timeout "storm".
It filled up ~600 GB on each node in that time.

For comparison we had one run without throttling; it lasted ~1.5 hours and did get to 60K-70K ops, as @tarzanek estimated:
http://35.231.218.186:3000/d/ieTTV75Gz/scylla-per-server-metrics-nemesis-master?orgId=1&from=1602103316079&to=1602113512352

In both cases compaction was starting to rise when the "storm" started.

A note on c-s: it ignores timeouts and retries 10 times; it fails when a request fails 10 times in a row. In our tests we rarely ignore all errors in c-s.

@tarzanek
Contributor

tarzanek commented Oct 8, 2020

So the next step is using just 56 threads, and eventually trying to increase the limit to 50k.

The only other optimization would be the kernel:
check and compare the kernel in the current CentOS 8 you use against some Google-blessed kernels (Ubuntu 18 or Ubuntu 20).
That said, in the above the SIs weren't that high, so the spike was either created by the number of threads or by some spike with busy neighbours?

@slivne
Contributor

slivne commented Oct 8, 2020

@fruch which kernel version? We know that the GCE image is not using the CentOS 8-provided one; we had to change it to get better support for the local drives.

@slivne
Contributor

slivne commented Oct 8, 2020

@bentsi ^^

@fruch
Contributor Author

fruch commented Oct 8, 2020

@fruch which kernel version - we know that the gce image is not using the centos8 provided one - we had to change it to get a better support for the local drives

4.18.0-193.14.2.el8_2.x86_64

@bentsi do you think we could build a GCE image based on 4.1.7? (Is the one you have based on master?)

@avikivity
Member

@fruch which kernel version - we know that the gce image is not using the centos8 provided one - we had to change it to get a better support for the local drives

4.18.0-193.14.2.el8_2.x86_64

@bentsi you think we could build a gce image based on 4.1.7 (is the one you have is base on master ?)

That kernel contains the google nvme fixes.

@fruch
Contributor Author

fruch commented Oct 8, 2020

@avi @slivne
Do you need anything from those clusters? Anything else to check on them, or can I kill them?

Should I try Ubuntu 18 or Ubuntu 20? Or try out the GCE image with the ml-kernel? (Or both, or all 3 options 😭)

One more question:
For the backup part we were actually aiming to test, should we:

  1. use NullCompaction for the ingest of the data (or nodetool disablecompaction, if available); see the sketch below
    or
  2. ignore all errors in c-s and keep going, meaning we might have a bit less data than planned?
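For option 1, a sketch of the two ways to do it (assuming the keyspace1/standard1 schema that c-s creates; whether NullCompactionStrategy is accepted depends on the Scylla version):

# switch the table to a no-op compaction strategy for the ingest...
cqlsh -e "ALTER TABLE keyspace1.standard1 WITH compaction = {'class': 'NullCompactionStrategy'};"
# ...or simply pause automatic compaction on the keyspace
nodetool disableautocompaction keyspace1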

@vladzcloudius
Contributor

vladzcloudius commented Oct 8, 2020

@fruch If you see timeouts - this means you are overloading the cluster.
In this case you should reduce the rate - not the concurrency which is already pretty low.

The system is able to pull out as much as it can. AWS and GCE are not the same.

You see timeouts later in the process because of heavier compactions: you use LCS, which has heavy write amplification.

Unless you are trying to estimate the performance of LCS on GCE, you should take this into account and apply lower rates.

@fruch
Contributor Author

fruch commented Oct 8, 2020

I took the initial image we have for GCE (after we fixed a few SCT issues):
https://www.googleapis.com/compute/v1/projects/skilled-adapter-452/global/images/scylla-666-development-0-20200816-7e01ae089e1

It has kernel 5.8.2-1.el8.elrepo.x86_64,

but it seems to be much, much worse:

http://104.196.186.175:3000/d/CHwBAIcGz/scylla-per-server-metrics-nemesis-master?orgId=1&refresh=30s&from=now-1h&to=now

The new monitor (branch-3.5) is cooler, but we get only ~8K writes/s:
image

This one doesn't have the patches for the hardcoded io_properties.yaml values, nor the patches we did to change the cpuset based on sq_split as suggested before.

@fruch
Contributor Author

fruch commented Oct 8, 2020

@fruch If you see timeouts - this means you are overloading the cluster.
In this case you should reduce the rate - not the concurrency which is already pretty low.

The system is able to pull out as much as it can. AWS and GCE are not the same.

You see timeout later in the process because of heavier compactions - you use LCS, which has a heavy write amplification.

Unless you try to estimate the performance of LCS on GCE you should take this into an account and apply lower rates.

I've ditched LCS; it helped a bit, but compaction still takes a toll.
I guess I don't have too much to compare to. Should we compare to the results we are getting in AWS, or at least not expect to get something close?

@fruch
Contributor Author

fruch commented Oct 11, 2020

Update:
@bentsi will build me a new GCE image based on latest master to try again, since the one we had didn't have very good results (we only got to 8K requests a second).

Trying out Ubuntu 20.04, I ran into this small issue: #7383
Now trying Ubuntu 18.04 (kernel 5.4.0-1025-gcp), with a patch to use sq_split as suggested by @vladzcloudius and @tarzanek.

@fruch
Contributor Author

fruch commented Oct 12, 2020

I tried one approach of stopping compaction altogether using NullCompaction, and it seems like it's much more stable that way.

It's almost done writing the 1 TB:
http://34.73.252.148:3000/d/CGwxZ1cMz/scylla-per-server-metrics-nemesis-master?orgId=1&from=now-6h&to=now&refresh=30s

I'm writing from 2 loaders with an 80K rate limit, and writes to disk are at 130-190 MB/s. (When compaction was on, we got to almost 1.2 GB/s of writes to disk each time the timeouts started.)
image

Something in Scylla's disk limiting/throttling isn't working as we expect (and used to see in AWS),
or the write_bandwidth: 1468006400 we hardcode based on the iotune runs is too high?

@tarzanek
Contributor

The hardcoded values you take from the patch I sent should be what Google guarantees
(and I verified with multiple iotune runs and lowered them to match Google),
so if THOSE values are too high, then this is a Google SLA breach ...

@slivne
Contributor

slivne commented Oct 15, 2020

It's not the first time we have issues with Google local drives / drivers.

Let's try to verify whether the bandwidth shown holds for a long duration (not just the iotune duration) and whether there is a bug.

@roydahan / @avikivity which tool should we use to verify that the local disks provide the guarantees over a long duration (a couple of days)?

@slivne slivne added the GCE label Oct 15, 2020
@vladzcloudius
Contributor

Switching nvme write_cache to write-through made the write bandwidth stable at ~1.6GB/s all the time.

That's interesting. 5.11 kernel, right, @xemul?
I'll prepare a perftune.py option to disable write_cache.
@tarzanek I'll ping you when it's ready.
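For anyone who wants to try it by hand before the perftune.py option lands, a sketch (run as root; it assumes the data disks are the nvme* block devices, and the sysfs setting does not survive a reboot):

# show the current cache mode of every local NVMe disk
grep . /sys/block/nvme*/queue/write_cache
# switch them all to write-through
for dev in /sys/block/nvme*/queue/write_cache; do
    echo "write through" > "$dev"
done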

@xemul
Contributor

xemul commented Mar 18, 2021

@vladzcloudius , no this time it was ubuntu 20.04 default one.

@vladzcloudius
Contributor

vladzcloudius commented Mar 18, 2021

@vladzcloudius , no this time it was ubuntu 20.04 default one.

And do we have a "rate hole" with write_cache with this kernel, @xemul ?

@vladzcloudius
Contributor

Switching nvme write_cache to write-through made the write bandwidth stable at ~1.6GB/s all the time.

That's interesting. 5.11 kernel, right, @xemul?
I'll prepare a perftune.py option to disable write_cache.
@tarzanek I'll ping you when it's ready.

scylladb/seastar#881
@xemul @avikivity @fruch @tarzanek FYI ^

@xemul
Contributor

xemul commented Mar 19, 2021

And do we have a "rate hole" with write_cache with this kernel, @xemul ?

@vladzcloudius, yes. I've seen it on pretty much every kernel except for some ancient one, where the performance was just bad (this "hole" spanned all the time).

@avikivity
Member

I get

Going to run './sct.py investigate show-monitor 36169f91-5b9e-4435-8fe9-8b714195d863'...
Error: mkdir `secrets`: Permission denied: OCI permission denied

I use podman instead of docker.

You should definitely ask for podman support in SCT to be on the roadmap.

Please add podman support in SCT to the roadmap.

@ChrisHampu

ChrisHampu commented Mar 24, 2021

I was asked to share this from the Scylla Slack. Chiming in because I ran into timeout issues similar to those experienced in this thread, which may have been solved by the write-through config change.

Scylla version: 4.2.3 (due to #8032)
Kernel: 5.4.0-1030-gke #32~18.04.1-Ubuntu SMP Tue Nov 17 16:43:22 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Previous setup
Instance: 6x n2-custom-10-24576 on GKE with 2 local SCSI SSDs with default cache config
After pushing ~30k write op/s over an extended period, I would see severe timeouts and latency spikes at high frequency, like so:
Screenshot_2021-03-17 Advanced - Grafana(4)

New setup
Instance: 6x n2-highmem-8 on GKE with 8 local SCSI SSDs with write through cache
After 12+ hours of constant writes pushing 60k write op/s I captured not a single timeout or spike in average write latency.
image (3)

The n2-highmem-8 GKE instance wasn't detected and was tuned to the following:
read_iops: 720319, read_bandwidth: 2568414720, write_iops: 400339, write_bandwidth: 1639328512
I did not manage to save the tuning from the custom instance.

While multiple variables changed between the setups (CPU, memory, number of disks), so it's not exactly apples to apples, it's worth investigating further whether the write cache made the major difference.
Also note that cassandra-stress wasn't used here, in favor of our own custom workload, so this benchmark is purely anecdotal. The workload was effectively the same between runs, with 30-45 distributed workers pushing small 1 KB writes.

@slivne slivne assigned penberg and unassigned xemul Apr 14, 2021
@slivne
Contributor

slivne commented Apr 14, 2021

The seastar patch was merged, so there is now a way to disable the write-back cache, but we need to add the setting when creating the devices in the GCE image.

@penberg
Contributor

penberg commented Apr 14, 2021

I am trying to understand the scope of what's still needed. We need to pass --write-back-cache=false to perftune.py on GCE, right? Which instances? Where's the right place to wire this up? @syuu1228 @fruch @tarzanek

@vladzcloudius
Contributor

I am trying to understand the scope of what's still needed. We need to pass --write-back-cache=false to perftune.py on GCE, right?

Correct, @penberg
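Presumably something along these lines (a sketch; it assumes the --write-back-cache option from scylladb/seastar#881 and the standard data directory):

# disable the write-back cache on the disks backing the Scylla data directory
/opt/scylladb/scripts/perftune.py --tune disks --dir /var/lib/scylla --write-back-cache=false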

@slivne
Contributor

slivne commented Apr 19, 2021

@penberg ping

penberg added a commit to penberg/scylla that referenced this issue Apr 21, 2021
This adds support for disabling writeback cache by adding a new
DISABLE_WRITEBACK_CACHE option to "scylla-server" sysconfig file, which
makes the "scylla_prepare" script (that is run before Scylla starts up)
call perftune.py with appropriate parameters. Also add a
"--disable-writeback-cache" option to "scylla_sysconfig_setup", which
can be called by scylla-machine image scripts, for example.

Refs: scylladb#7341
penberg added a commit to scylladb/scylla-machine-image that referenced this issue Apr 21, 2021
As outlined in scylladb/scylladb#7341, we need to disable writeback cache
on GCE for better performance.
avikivity pushed a commit that referenced this issue Apr 21, 2021
This adds support for disabling writeback cache by adding a new
DISABLE_WRITEBACK_CACHE option to "scylla-server" sysconfig file, which
makes the "scylla_prepare" script (that is run before Scylla starts up)
call perftune.py with appropriate parameters. Also add a
"--disable-writeback-cache" option to "scylla_sysconfig_setup", which
can be called by scylla-machine image scripts, for example.

Refs: #7341
Tests: dtest (next-gating)

Closes #8526
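Based on the commit message, a sketch of how an image script could turn this on (the variable value and sysconfig path are assumptions; the merged commit is authoritative):

# let the helper write the sysconfig entry...
/opt/scylladb/scripts/scylla_sysconfig_setup --disable-writeback-cache
# ...which ends up as something like this, picked up by scylla_prepare before Scylla starts
grep DISABLE_WRITEBACK_CACHE /etc/sysconfig/scylla-server
# DISABLE_WRITEBACK_CACHE=yes   (value assumed)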
@tarzanek
Contributor

Just on the testing side of this (since I have the graphs of patched and unpatched systems in front of me):
the customer runs a JanusGraph Spark job every X minutes, and the drop in disk read speed (10 GB/s) with the write-back cache disabled is only visible when they write; picture attached. (I think we even go higher than the max speed of 9,360 MB/s, so I'm wondering whether io_properties is even being obeyed ;-) )

n2_highmem_64_with24_nvmes_nowritebackcache

Without the patch the read maximum seems to be stuck around 7.5 GB/s, so this was very good sleuthing, Pavel, thank you!

bentsi pushed a commit to scylladb/scylla-machine-image that referenced this issue Apr 25, 2021
As outlined in scylladb/scylladb#7341, we need to disable writeback cache
on GCE for better performance.
@roydahan

roydahan commented May 20, 2021

Last run on master looks much better.

https://snapshot.raintank.io/dashboard/snapshot/Yyepci6dFyB1XPxbwWad6GbH4uNBHgS7

However, some stress runs are still failing; investigating why.

@roydahan

The failing stress commands are not related.
I think we can close this issue.

@slivne FYI

@slivne slivne closed this as completed May 27, 2021
denesb pushed a commit to denesb/scylla that referenced this issue Oct 20, 2021
This adds support for disabling writeback cache by adding a new
DISABLE_WRITEBACK_CACHE option to "scylla-server" sysconfig file, which
makes the "scylla_prepare" script (that is run before Scylla starts up)
call perftune.py with appropriate parameters. Also add a
"--disable-writeback-cache" option to "scylla_sysconfig_setup", which
can be called by scylla-machine image scripts, for example.

Refs: scylladb#7341
Tests: dtest (next-gating)

Closes scylladb#8526

(cherry picked from commit 0ddbed2)

Fixes scylladb#1784.