
ena NIC msi_irq is empty on ubuntu kernel 5.19 - causing a severe throughput degradation for i3.4xlarge (and above?) #13560

Closed
aleksbykov opened this issue Apr 18, 2023 · 63 comments
Labels
P1 Urgent status/regression status/release blocker Preventing from a release to be promoted symptom/performance Issues causing performance problems

@aleksbykov
Contributor

Installation details
Scylla version (or git commit hash): 5.3.0~dev-0.20230415.1da02706ddb8 with build-id f7ac5cd90e63ace5065c583d6d1d9c381f39b5c2
Cluster size: 3
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0501eb17c8c79b6d2

Performance latency test jobs:

The performance latency 1TB test's prepare step populates a dataset of size 1TB with the following c-s commands:

["cassandra-stress write no-warmup cl=ALL n=250000000 -schema 'replication(factor=3)' -mode cql3 native -rate threads=200 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=1..250000000",
 "cassandra-stress write no-warmup cl=ALL n=250000000 -schema 'replication(factor=3)' -mode cql3 native -rate threads=200 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=250000001..500000000",
"cassandra-stress write no-warmup cl=ALL n=250000000 -schema 'replication(factor=3)' -mode cql3 native -rate threads=200 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=500000001..750000000",
 "cassandra-stress write no-warmup cl=ALL n=250000000 -schema 'replication(factor=3)' -mode cql3 native -rate threads=200 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=750000001..1000000000"]

All cassandra-stress commands run with cl=ALL.
About 30 minutes after the prepare stress commands started, all nodes began reporting many reactor stalls of about 6-20 ms, along with lsa-time records in the log. Decoded reactor stalls attached: reactor_stalls_decoded_nodes.zip
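
(For reference, a minimal sketch of how such a decode is produced, assuming seastar's stall-analyser script; the flags are copied from the decoder's own "Command line options" line quoted later in this thread, and the paths are illustrative:)

# Feed the node's log and the matching scylla executable (for symbol
# resolution) to the stall analyser; bottom-up prints callees first.
./seastar/scripts/stall-analyser.py \
    --address-threshold=0x100000000 \
    --executable=scylla/libexec/scylla \
    --direction=bottom-up \
    --branch-threshold=0.05 \
    --file=reactor_stalls_node1.log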

on monitoring: http://3.237.101.163:3000/d/sZoKwKP4k/scylla-enterprise-perf-regression-latency-shard-aware-1tb-test-scylla-per-server-metrics-nemesis-master?orgId=1&from=1681736652253&to=1681773381114
we see the following problems for compactions:
Screenshot from 2023-04-18 16-43-46
vs. the latest successful run, for Scylla 5.3.0~dev-0.20230316.5705df77a155:
Screenshot from 2023-04-18 16-47-37

and the following for 'Writes currently blocked on dirty':
Screenshot from 2023-04-18 16-44-15
vs. the latest successful run, for Scylla 5.3.0~dev-0.20230316.5705df77a155:
Screenshot from 2023-04-18 16-48-52

Latest successful run is for scylla: 5.3.0~dev-0.20230316.5705df77a155
job: https://jenkins.scylladb.com/view/New%20Performance%20Jobs/job/scylla-master/job/scylla-master-perf-regression-latency-shard-aware-1TB-test/14

The issue started appearing from 5.3.0~dev-0.20230331.160c184d0b0f with build-id 13d781a3205d092514f6642c9787a566aba7c110.
Latest successful run: 5.3.0~dev-0.20230316.5705df77a155

Possible Scylla commit that brought the issue: 472b155

DB logs: https://cloudius-jenkins-test.s3.amazonaws.com/540d1d32-9700-4c1c-aecc-168260285dd3/20230417_225030/db-cluster-540d1d32.tar.gz

@mykaul mykaul added symptom/performance Issues causing performance problems status/regression labels Apr 18, 2023
@roydahan

@raphaelsc / @bhalevy my suspicion is that it's something related to compactions; it looks like the compactions are not "steady" during the entire write, and the effect is severe.
Was there any change in master over the last month that may explain this?

@aleksbykov let's try to bisect it to a smaller range.
We already know the last successful run and the first one that failed (not this run).
Let's find a build between them and re-run with it.

@bhalevy
Member

bhalevy commented Apr 18, 2023

all nodes began reporting many reactor stalls of about 6-20 ms, along with lsa-time records in the log. Decoded reactor stalls attached: reactor_stalls_decoded_nodes.zip

The reactor stalls are relatively short. For example, on node1:

Processed 2486 stalls lasting a total of 15126 milliseconds.
min=6 avg=6.1 median=6 p95=7 p99=8 p999=9 max=13

This graph is printed in bottom-up order, where callees are printed first.
Use --direction=top-down to print callers first.

[level#index/out_of pct%] below denotes:
  level  - nesting level in the graph
  index  - index of node among to its siblings
  out_of - number of siblings
  pct    - percentage of total stall time of this call relative to its siblings

Command line options: --address-threshold=0x100000000 --executable=/home/abykov/tmp/performance/540d1d32-9700-4c1c-aecc-168260285dd3/decode_reactor_stalls/scylla/libexec/scylla --direction=bottom-up --branch-threshold=0.05 --file=../reactor_stalls_node1.log

++[0#1/1 100%] addr=0x542c633 total=15126 count=2486 avg=6:
|              seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}> at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:60
|              (inlined by) seastar::backtrace_buffer::append_backtrace_oneline at ./build/release/seastar/./seastar/src/core/reactor.cc:797
|              (inlined by) seastar::print_with_backtrace at ./build/release/seastar/./seastar/src/core/reactor.cc:816
++           - addr=0x542b9e0:
|              seastar::internal::cpu_stall_detector::generate_trace at ./build/release/seastar/./seastar/src/core/reactor.cc:1386
++           - addr=0x542ce0f:
|              seastar::internal::cpu_stall_detector::maybe_report at ./build/release/seastar/./seastar/src/core/reactor.cc:1123
|              (inlined by) seastar::internal::cpu_stall_detector::on_signal at ./build/release/seastar/./seastar/src/core/reactor.cc:1143
|              (inlined by) seastar::reactor::block_notifier at ./build/release/seastar/./seastar/src/core/reactor.cc:1423
++           - addr=0x3cb1f: ?? ??:0
  ++[1#1/669 7%] addr=0x1e23ba5 total=1052 count=173 avg=6:
  |              utils::uleb64_decode_forwards<void (&)(char const*, unsigned long), void (&)(char const*, unsigned long)> at ././utils/vle.hh:105
  |              (inlined by) logalloc::region_impl::object_descriptor::decode_forwards at ./utils/logalloc.cc:1619
  |              (inlined by) logalloc::region_impl::for_each_live<logalloc::region_impl::compact_segment_locked(logalloc::segment*, logalloc::segment_descriptor&)::{lambda(logalloc::region_impl::object_descriptor const*, void*, unsigned long)#1}> at ./utils/logalloc.cc:1717
  |              (inlined by) logalloc::region_impl::compact_segment_locked at ./utils/logalloc.cc:1873
  | ++[2#1/3 64%] addr=0x1e1a56b total=671 count=110 avg=6:
  | |             logalloc::region_impl::compact at ./utils/logalloc.cc:2213
  | |             (inlined by) logalloc::reclaim_from_evictable at ./utils/logalloc.cc:2532
  | | ++[3#1/1 100%] addr=0x1e18824 total=6973 count=1145 avg=6:
  | | |              logalloc::tracker::impl::compact_and_evict_locked at ./utils/logalloc.cc:2691
  | |   ++[4#1/1 100%] addr=0x1e203f0 total=9103 count=1497 avg=6:
  | |   |              logalloc::tracker::impl::compact_and_evict at ./utils/logalloc.cc:2618
  | |   |              (inlined by) logalloc::segment_pool::allocate_segment at ./utils/logalloc.cc:1279
  | |   |              (inlined by) logalloc::segment_pool::refill_emergency_reserve at ./utils/logalloc.cc:1292
  | |   |              (inlined by) logalloc::allocating_section::reserve at ./utils/logalloc.cc:2871
  | |     ++[5#1/8 35%] addr=0x1df131d total=4295 count=705 avg=6:
  | |     |             logalloc::allocating_section::with_reserve<logalloc::allocating_section::operator()<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&)::$_21>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&)::$_21)::{lambda()#1}::operator()() const::{lambda()#2}::operator()()::{lambda()#3}::operator()() const::{lambda()#1}>(logalloc::region&, row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&)::$_21>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&)::$_21)::{lambda()#1}::operator()() const::{lambda()#2}::operator()()::{lambda()#3}::operator()() const::{lambda()#1}&&)::{lambda()#1}> at ././utils/logalloc.hh:469
  | |     |             (inlined by) logalloc::allocating_section::operator()<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&)::$_21>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&)::$_21)::{lambda()#1}::operator()() const::{lambda()#2}::operator()()::{lambda()#3}::operator()() const::{lambda()#1}> at ././utils/logalloc.hh:520
  | |     |             (inlined by) operator() at ./row_cache.cc:971
  | |     |             (inlined by) with_allocator<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&)::$_21>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&)::$_21)::{lambda()#1}::operator()() const::{lambda()#2}::operator()()::{lambda()#3}> at ././utils/allocation_strategy.hh:313
  | |     |             (inlined by) operator() at ./row_cache.cc:961
  | |     |             (inlined by) std::__invoke_impl<void, row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&)::$_21>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&)::$_21)::{lambda()#1}::operator()() const::{lambda()#2}> at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/invoke.h:61
  | |     |             (inlined by) std::__invoke<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&)::$_21>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&)::$_21)::{lambda()#1}::operator()() const::{lambda()#2}> at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/invoke.h:96
  | |     |             (inlined by) std::__apply_impl<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&)::$_21>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&)::$_21)::{lambda()#1}::operator()() const::{lambda()#2}, std::tuple<>> at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/tuple:1852
  | |     |             (inlined by) std::apply<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&)::$_21>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&)::$_21)::{lambda()#1}::operator()() const::{lambda()#2}, std::tuple<> > at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/tuple:1863
  | |     |             (inlined by) seastar::futurize<void>::apply<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&)::$_21>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&)::$_21)::{lambda()#1}::operator()() const::{lambda()#2}> at ././seastar/include/seastar/core/future.hh:1963
  | |     |             (inlined by) operator() at ././seastar/include/seastar/core/thread.hh:258
  | |     | ++[6#1/1 100%] addr=0x56b7576 total=4301 count=706 avg=6:
  | |     | |              seastar::noncopyable_function<void ()>::operator() at ./build/release/seastar/./seastar/include/seastar/util/noncopyable_function.hh:210
  | |     | |              (inlined by) seastar::thread_context::main at ./build/release/seastar/./seastar/src/core/thread.cc:299
  | |     |+[5#2/8 30%] addr=0x1ad2213 total=3660 count=601 avg=6:
  | |     |             logalloc::allocating_section::with_reserve<logalloc::allocating_section::operator()<replica::memtable::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const> const&, db::rp_handle&&)::$_9::operator()() const::{lambda()#1}>(logalloc::region&, replica::memtable::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const> const&, db::rp_handle&&)::$_9::operator()() const::{lambda()#1}&&)::{lambda()#1}> at ././utils/logalloc.hh:469
  | |     |             (inlined by) logalloc::allocating_section::operator()<replica::memtable::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const> const&, db::rp_handle&&)::$_9::operator()() const::{lambda()#1}> at ././utils/logalloc.hh:520
  | |     |             (inlined by) operator() at ./replica/memtable.cc:794
  | |     |             (inlined by) with_allocator<replica::memtable::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const> const&, db::rp_handle&&)::$_9> at ././utils/allocation_strategy.hh:313
  | |     |             (inlined by) replica::memtable::apply at ./replica/memtable.cc:793
  | |     |+          - addr=0x1a0efda:
  | |     |             replica::table::do_apply<frozen_mutation const&, seastar::lw_shared_ptr<schema const>&> at ./replica/table.cc:2267
  | |     |             (inlined by) operator() at ./replica/table.cc:2290
  | |     |             (inlined by) seastar::futurize<void>::invoke<replica::table::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const>, db::rp_handle&&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::$_43&> at ././seastar/include/seastar/core/future.hh:1996
  | |     | ++[6#1/2 96%] addr=0x1a0f4a3 total=3727 count=612 avg=6:
  | |     | |             replica::dirty_memory_manager_logalloc::region_group::concrete_allocating_function<replica::table::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const>, db::rp_handle&&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::$_43>::allocate at ./replica/dirty_memory_manager.hh:118
  | |     | ++          - addr=0x1b2068d:
  | |     | |             replica::dirty_memory_manager_logalloc::region_group::execute_one at ./replica/dirty_memory_manager.cc:106
  | |     | |             (inlined by) replica::dirty_memory_manager_logalloc::region_group::release_queued_allocations at ./replica/dirty_memory_manager.cc:117

And the stalls look similar to #13160
cc @tgrabiec @michoecho

@michoecho
Contributor

The cause of everything else is that something is wrong specifically with shard 0. Somehow it needs more CPU to do the same job as the other shards:

image

@michoecho
Contributor

It doesn't appear to be just an infinite loop on shard 0 or something, because the extra load happens in both the write phase and the read phase of the test, but not in between.

@michoecho
Contributor

I'll check if it happens locally. If not, it's probably a test setup issue. If yes, I'll bisect it.

@michoecho
Contributor

It doesn't happen on my PC with 1da0270.
I'll also try with the AMI.

@michoecho
Contributor

michoecho commented Apr 18, 2023

This is a kernel and/or i3.4xlarge regression (edit: I didn't test other instance types). The NIC doesn't report its IRQ numbers for some reason (/sys/class/net/eth0/device/msi_irqs/ is empty), so perftune can't distribute networking IRQs properly (it doesn't do anything about them), and by default they concentrate on the shard 0 CPU. (Don't ask me why).
A manual echo 00010001 | sudo tee /proc/irq/*/smp_affinity fixes the performance problem.

I don't know what we can do about this.
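
(For reference, a sketch of the workaround plus a quick check that it took effect; eth0 and the 00010001 mask come from the comment above and may need adjusting per instance type:)

# Pin every IRQ to CPUs 0 and 16 (hex bitmask 00010001); writes to IRQs
# that don't allow affinity changes will fail harmlessly.
echo 00010001 | sudo tee /proc/irq/*/smp_affinity
# Verify: the per-CPU counters in /proc/interrupts for the NIC's queues
# should now grow only in the columns for those CPUs.
grep eth0 /proc/interrupts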

@michoecho
Contributor

michoecho commented Apr 18, 2023

my suspicion is that it's something related to compactions; it looks like the compactions are not "steady" during the entire write, and the effect is severe.

@roydahan Compactions being bursty is an effect, not a cause. Since shards other than 0 are very underloaded, they are able to devote most of their CPU to compactions (normally they would devote a small but smoothly growing fraction of their time), and complete them in a fast burst.

@vladzcloudius
Contributor

vladzcloudius commented Apr 18, 2023

This is a kernel and/or i3.4xlarge regression (edit: I didn't test other instance types). The NIC doesn't report its IRQ numbers for some reason (/sys/class/net/eth0/device/msi_irqs/ is empty), so perftune can't distribute networking IRQs properly (it doesn't do anything about them), and by default they concentrate on the shard 0 CPU. (Don't ask me why). A manual echo 00010001 | sudo tee /proc/irq/*/smp_affinity fixes the performance problem.

I don't know what we can do about this.

@aleksbykov @michoecho please, send the following info from the node where you saw the above:

  • ethtool -i eth0
  • uname -a

I strongly doubt there is any kernel issue here.
Much more likely, Enhanced Networking is not enabled for this VM and a fully virtual NIC device is used.

Based on the perftune output, no IRQs were detected, meaning the NIC in question wasn't exposing MSI-X, MSI, or INT#x vectors to the guest OS.
That is only possible with fully virtualized devices, to the best of my knowledge.

@michoecho
Contributor

@aleksbykov @michoecho please, send the following info from the node where you saw the above:

ethtool -i eth0

driver: ena
version: 5.19.0-1022-aws
firmware-version:
expansion-rom-version:
bus-info: 0000:00:03.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

uname -a

Linux ip-10-0-1-194 5.19.0-1022-aws #23~22.04.1-Ubuntu SMP Fri Mar 17 15:38:24 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

@vladzcloudius
Contributor

Looks legit. How can I get access to that VM?

@michoecho
Contributor

michoecho commented Apr 18, 2023

Looks legit. How can I get access to that VM?

It's not some particular VM. Just launch ami-0501eb17c8c79b6d2 (us-east-1) on i3.4xlarge.

Edit: mine is already shut down, so I can't give you access to it and save you the effort.

@vladzcloudius
Contributor

vladzcloudius commented Apr 18, 2023

my suspicion is that it's something related to compactions; it looks like the compactions are not "steady" during the entire write, and the effect is severe.

@roydahan Compactions being bursty is an effect, not a cause. Since shards other than 0 are very underloaded, they are able to devote most of their CPU to compactions (normally they would devote a small but smoothly growing fraction of their time), and complete them in a fast burst.

This statement is incorrect in general, @michoecho.
Compactions will never use a "small smoothly growing fraction of CPU" - they will always use every CPU or other resource available as long as they run. The same applies to any running context in seastar. Only if some running context is IO-bound can you see some "free" CPU while it runs. However, this is almost never the case with compactions, which are mainly CPU-bound. That's why, when compactions are running on a specific shard, its "load" is always at 100%.

The actual amount of CPU out of those 100% used by compactions depends on the other running contexts and their relative amount of shares compared to compaction's.

@michoecho
Contributor

my suspicion is that it's something related to compactions; it looks like the compactions are not "steady" during the entire write, and the effect is severe.

@roydahan Compactions being bursty is an effect, not a cause. Since shards other than 0 are very underloaded, they are able to devote most of their CPU to compactions (normally they would devote a small but smoothly growing fraction of their time), and complete them in a fast burst.

This statement is incorrect in general, @michoecho. Compactions will never use a "small smoothly growing fraction of CPU" - they will always use every CPU or other resource available as long as they run. The same applies to any running context in seastar. Only if some running context is IO-bound can you see some "free" CPU while it runs. However, this is almost never the case with compactions, which are mainly CPU-bound. That's why, when compactions are running on a specific shard, its "load" is always at 100%.

The actual amount of CPU out of those 100% used by compactions depends on the other running contexts and their relative amount of shares compared to compaction's.

Yes. By "normally" I meant "in a regular run of this particular test", not in general.

@mykaul mykaul changed the title Severed degradation in write throughput Severe degradation in write throughput Apr 19, 2023
@mykaul mykaul added the triage/master Looking for assignee label Apr 19, 2023
@mykaul
Contributor

mykaul commented Apr 19, 2023

Linux ip-10-0-1-194 5.19.0-1022-aws #23~22.04.1-Ubuntu SMP Fri Mar 17 15:38:24 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Didn't we agree to remain on 5.15? (https://www.omgubuntu.co.uk/2022/01/ubuntu-22-04-lts-will-use-linux-5-15-kernel ) - how did you get 5.19?

@aleksbykov
Contributor Author

aleksbykov commented Apr 19, 2023

Latest successful run is with: Scylla version 5.3.0~dev-0.20230325.e8fb718e4ad4 with build-id 6eed28a1ac2addc02aceea60af4d6ee4acd56955 PASSED ami-078e6867d914fbfb0

Linux perf-regression-latency-ubuntu-db-node-363ae036-1 5.15.0-1031-aws #35-Ubuntu SMP Fri Feb 10 02:07:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

scyllaadm@perf-regression-latency-ubuntu-db-node-363ae036-1:~$ ls /sys/class/net/eth0/device/msi_irqs/
210  211  212  213  214  215  216  217  218

scyllaadm@perf-regression-latency-ubuntu-db-node-363ae036-1:~$  ethtool -i eth0
driver: ena
version: 5.15.0-1031-aws
firmware-version: 
expansion-rom-version: 
bus-info: 0000:00:03.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

First failed run is with: Scylla version 5.3.0~dev-0.20230328.c7131a05741d with build-id 6358d7ada913b1dfc96849ddb519b7a243afe0bd FAILED ami-020e718640eafe444

Between them we don't have an AMI.

For Scylla version 5.3.0~dev-0.20230328.c7131a05741d with build-id 6358d7ada913b1dfc96849ddb519b7a243afe0bd
the instance has:
an empty /sys/class/net/eth0/device/msi_irqs/
scyllaadm@perf-regression-latency-ubuntu-db-node-a292366c-1:~$ ethtool -i eth0

driver: ena
version: 5.19.0-1022-aws
firmware-version: 
expansion-rom-version: 
bus-info: 0000:00:03.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

scyllaadm@perf-regression-latency-ubuntu-db-node-a292366c-1:~$ uname -a

Linux perf-regression-latency-ubuntu-db-node-a292366c-1 5.19.0-1022-aws #23~22.04.1-Ubuntu SMP Fri Mar 17 15:38:24 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

@mykaul
Contributor

mykaul commented Apr 19, 2023

@yaronkaikov - the Ubuntu 22.04 LTS kernel should be 5.15, with an optional 5.19 for HW enablement - but we've never moved to 5.19 explicitly - how come the AMI uses it?

@mykaul
Contributor

mykaul commented Apr 19, 2023

@vladzcloudius - do we have any tests for perftune that can catch such issues?

@yaronkaikov
Contributor

@yaronkaikov - the Ubuntu 22.04 LTS kernel should be 5.15, with an optional 5.19 for HW enablement - but we've never moved to 5.19 explicitly - how come the AMI uses it?

Since we are using the latest image available, we don't pin the kernel version.

@michoecho
Contributor

michoecho commented Apr 19, 2023

@mykaul I guess our AMIs use the linux-aws apt package, which is currently 5.19, instead of linux-aws-lts, which is 5.15.
Edit: it's called linux-aws-lts-22.04, not just linux-aws-lts.
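
(A quick way to check which kernel flavour a node has installed - a sketch using the package names mentioned above, not the actual AMI build code:)

# linux-aws tracks the rolling kernel (5.19 at the time of this issue);
# linux-aws-lts-22.04 stays on the 5.15 GA kernel.
dpkg -l | grep -E '^ii +linux-aws'
# Switch an image to the LTS flavour:
sudo apt-get update && sudo apt-get install -y linux-aws-lts-22.04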

@mykaul
Contributor

mykaul commented Apr 19, 2023

@mykaul I guess our AMIs use the linux-aws apt package, which is currently 5.19, instead of linux-aws-lts, which is 5.15.

@yaronkaikov - this might be the issue. We were supposed to keep using LTS.

@michoecho
Contributor

@mykaul I guess our AMIs use the linux-aws apt package, which is currently 5.19, instead of linux-aws-lts, which is 5.15.

@yaronkaikov - this might be the issue. We were supposed to keep using LTS.

I just verified that the current linux-aws-lts-22.04 (5.15.0-1034-aws) is not affected by the problem. It reports IRQs as expected.

@avikivity
Member

Still, we need to report the problem, so it gets fixed before we're forced to move to a newer kernel. Also users may be using that newer kernel.

@DoronArazii DoronArazii added P1 Urgent status/release blocker Preventing from a release to be promoted labels Apr 19, 2023
@vladzcloudius
Contributor

Tested a smaller i4i instance (2xl) - still works:

scyllaadm@ip-10-99-17-182:~$ uname -a
Linux ip-10-99-17-182 5.19.0-1022-aws #23~22.04.1-Ubuntu SMP Fri Mar 17 15:38:24 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
scyllaadm@ip-10-99-17-182:~$ sudo ls -al /sys/devices/pci0000:00/0000:00:05.0/msi_irqs/
total 0
drwxr-xr-x 2 root root    0 Apr 26 14:11 .
drwxr-xr-x 6 root root    0 Apr 26 14:10 ..
-r--r--r-- 1 root root 4096 Apr 26 14:11 36
-r--r--r-- 1 root root 4096 Apr 26 14:11 37
-r--r--r-- 1 root root 4096 Apr 26 14:11 38
-r--r--r-- 1 root root 4096 Apr 26 14:11 39
-r--r--r-- 1 root root 4096 Apr 26 14:11 40
-r--r--r-- 1 root root 4096 Apr 26 14:11 41
-r--r--r-- 1 root root 4096 Apr 26 14:11 42
-r--r--r-- 1 root root 4096 Apr 26 14:11 43
-r--r--r-- 1 root root 4096 Apr 26 14:11 44

@yaronkaikov
Contributor

Closed with scylladb/scylla-machine-image#443

@mykaul
Contributor

mykaul commented Apr 27, 2023

Closed with scylladb/scylla-machine-image#443

It's not closed, in the sense that we do need/want to support that 5.19 kernel.

@mykaul
Contributor

mykaul commented Apr 27, 2023

I posted some analysis on amzn/amzn-drivers#268. In gist, what they wrote makes very little sense. I think they messed something up in this particular kernel assembly.

So perhaps it is an Ubuntu issue (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2016991 ) after all?

@yaronkaikov yaronkaikov removed their assignment Apr 27, 2023
@vladzcloudius
Contributor

vladzcloudius commented Apr 27, 2023

I posted some analysis on amzn/amzn-drivers#268. In gist, what they wrote makes very little sense. I think they messed something up in this particular kernel assembly.

So perhaps it is an Ubuntu issue (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2016991 ) after all?

It's definitely not an Ubuntu issue. If it were, it would not work on i4i or with the older (Amazon!) kernel.

It's quite obviously an Amazon kernel bug. They (Amazon) have to fix it.
We, on the other hand, should not use their kernels with this bug. In this context that means rolling back, in our AMIs, to an older kernel without the issue.

@yaronkaikov @mykaul

@vladzcloudius vladzcloudius reopened this Apr 27, 2023
@vladzcloudius
Contributor

Closed with scylladb/scylla-machine-image#443

@yaronkaikov @mykaul Are you sure it's safe to not use Amazon's kernels that AFAIU come with Amazon's vanilla Ubuntu AMIs? Are you sure Ubuntu's LTS kernels are certified by Amazon?

@mykaul
Contributor

mykaul commented Apr 27, 2023

Closed with scylladb/scylla-machine-image#443

@yaronkaikov @mykaul Are you sure it's safe to not use Amazon's kernels that AFAIU come with Amazon's vanilla Ubuntu AMIs? Are you sure Ubuntu's LTS kernels are certified by Amazon?

I think it's safe. I also slightly prefer having a single kernel version (as much as possible) across cloud providers (you could also ask about serverless - we haven't decided yet - https://github.com/scylladb/serverless-issues/issues/11 )

I could not find a 'certified by Amazon' statement, but I assume that's because Ubuntu Pro might be.

@vladzcloudius
Contributor

vladzcloudius commented Apr 27, 2023

See here:
https://aws.amazon.com/blogs/aws/use-your-own-kernel-with-amazon-ec2/

They confirm that "Ubuntu EC2 Variant" would work reliably.

Also see this: https://ubuntu.com/blog/introducing-the-ubuntu-aws-rolling-kernel-2

Bottom line, vanilla LTS kernels are not safe in EC2 AMIs.

There are many reasons for that: there are a few AWS-specific device drivers and Xen-related bits that are only relevant for EC2 users. And hence (I guess) they are maintained much faster in this kernel stream than in the mainline LTS stream.

@michoecho
Contributor

Closed with scylladb/scylla-machine-image#443

@yaronkaikov @mykaul Are you sure it's safe to not use Amazon's kernels that AFAIU come with Amazon's vanilla Ubuntu AMIs? Are you sure Ubuntu's LTS kernels are certified by Amazon?

Where is your question coming from? Isn't the PR you linked using Amazon's LTS kernel (linux-aws-lts-22.04), not Ubuntu's vanilla LTS kernel?

@vladzcloudius
Contributor

Closed with scylladb/scylla-machine-image#443

@yaronkaikov @mykaul Are you sure it's safe to not use Amazon's kernels that AFAIU come with Amazon's vanilla Ubuntu AMIs? Are you sure Ubuntu's LTS kernels are certified by Amazon?

Where is your question coming from? Isn't the PR you linked using Amazon's LTS kernel (linux-aws-lts-22.04), not Ubuntu's vanilla LTS kernel?

It comes from me missing this fact. ;) I assumed a vanilla LTS was used. Thanks for pointing this out, @michoecho.

@yaronkaikov have we verified that the latest aws-lts kernel doesn't suffer from the same bug?

@michoecho
Contributor

michoecho commented Apr 27, 2023

Closed with scylladb/scylla-machine-image#443

@yaronkaikov @mykaul Are you sure it's safe to not use Amazon's kernels that AFAIU come with Amazon's vanilla Ubuntu AMIs? Are you sure Ubuntu's LTS kernels are certified by Amazon?

Where is your question coming from? Isn't the PR you linked using Amazon's LTS kernel (linux-aws-lts-22.04), not Ubuntu's vanilla LTS kernel?

It comes from me missing this fact. ;) I assumed a vanilla LTS was used. Thanks for pointing this out, @michoecho.

@yaronkaikov have we verified that the latest aws-lts kernel doesn't suffer from the same bug?

I have (#13560 (comment)), but a sanity check from someone else would be appreciated.

@vladzcloudius
Contributor

Closed with scylladb/scylla-machine-image#443

@yaronkaikov @mykaul Are you sure it's safe to not use Amazon's kernels that AFAIU come with Amazon's vanilla Ubuntu AMIs? Are you sure Ubuntu's LTS kernels are certified by Amazon?

Where is your question coming from? Isn't the PR you linked using Amazon's LTS kernel (linux-aws-lts-22.04), not Ubuntu's vanilla LTS kernel?

It comes from me missing this fact. ;) I assumed a vanilla LTS was used. Thanks for pointing this out, @michoecho.
@yaronkaikov have we verified that the latest aws-lts kernel doesn't suffer from the same bug?

I have (#13560 (comment)), but a sanity check from someone else would be appreciated.

First of all - great!
However, this GH issue has 5.19 in its title, while the kernel you've checked is 5.15. So I'm still not sure we want to close this GH issue yet.

@yaronkaikov
Contributor

Closed with scylladb/scylla-machine-image#443

@yaronkaikov @mykaul Are you sure it's safe to not use Amazon's kernels that AFAIU come with Amazon's vanilla Ubuntu AMIs? Are you sure Ubuntu's LTS kernels are certified by Amazon?

Where is your question coming from? Isn't the PR you linked using Amazon's LTS kernel (linux-aws-lts-22.04), not Ubuntu's vanilla LTS kernel?

It comes from me missing this fact. ;) I assumed a vanilla LTS was used. Thanks for pointing this out, @michoecho.

@yaronkaikov have we verified that the latest aws-lts kernel doesn't suffer from the same bug?

I didn't; it's part of rc5, so QA will verify it.

@mykaul
Contributor

mykaul commented May 1, 2023

This specific issue was fixed with 5.2.0-rc5, by ensuring we use kernel 5.15 which doesn't suffer from it (and is truly LTS, btw). There's a separate issue (in seastar, and elsewhere) to track the Linux kernel regression which should be fixed. Therefore, closing as completed.

@mykaul mykaul closed this as completed May 1, 2023
@vladzcloudius
Contributor

This specific issue was fixed with 5.2.0-rc5, by ensuring we use kernel 5.15 which doesn't suffer from it (and is truly LTS, btw). There's a separate issue (in seastar, and elsewhere) to track the Linux kernel regression which should be fixed. Therefore, closing as completed.

@mykaul there is nothing to fix in seastar in this context.
A proper place to track AMI kernel related issues is the AMI-related repo (scylla-machine-image ?)

cc @syuu1228

@mykaul
Contributor

mykaul commented May 1, 2023

This specific issue was fixed with 5.2.0-rc5, by ensuring we use kernel 5.15 which doesn't suffer from it (and is truly LTS, btw). There's a separate issue (in seastar, and elsewhere) to track the Linux kernel regression which should be fixed. Therefore, closing as completed.

@mykaul there is nothing to fix in seastar in this context. A proper place to track AMI kernel related issues is the AMI-related repo (scylla-machine-image ?)

cc @syuu1228

@vladzcloudius
Contributor

vladzcloudius commented May 1, 2023

This specific issue was fixed with 5.2.0-rc5, by ensuring we use kernel 5.15 which doesn't suffer from it (and is truly LTS, btw). There's a separate issue (in seastar, and elsewhere) to track the Linux kernel regression which should be fixed. Therefore, closing as completed.

@mykaul there is nothing to fix in seastar in this context. A proper place to track AMI kernel related issues is the AMI-related repo (scylla-machine-image ?)

* I thought we'd want perftune to handle the case where we get an empty list.

No, we don't. It's a bogus state of the kernel from our perspective.

* The issue itself is already fixed via [(ami,azure):Install latest LTS kernel during image build  scylla-machine-image#443](https://github.com/scylladb/scylla-machine-image/pull/443) and [(gcp):Install latest LTS kernel during image build scylla-machine-image#446](https://github.com/scylladb/scylla-machine-image/pull/446)

I know (hence not re-opening this one). But what you probably want is to have some tracking for the next time we upgrade the kernel in the AMI.

BTW, upgrading to the "latest LTS" is a risk too, as I explained some time ago.
What you want to do is upgrade to a new version in a controlled manner, after you have tested that new kernel in QA - in order to avoid issues like this one.

And to do this you need to pin the kernel.
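
(A sketch of one way to do that pinning with apt, assuming the Ubuntu meta-package names discussed above:)

# Hold the kernel meta-package so image rebuilds and unattended upgrades
# keep the QA-tested version until it is deliberately unpinned.
sudo apt-mark hold linux-aws-lts-22.04
apt-mark showhold   # confirm the hold is in place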

And when you upgrade you must make sure to upgrade SC installations to that kernel too.

Tomer recently performed an investigation of kernel versions in SC, and the picture is nothing but terrible.

We really have to take this under control, and upgrading to the "latest LTS" every time we build a new AMI is not helping us get there...

cc @tomer-sandler @syuu1228


@roydahan roydahan reopened this May 2, 2023
@roydahan

roydahan commented May 2, 2023

I reopened the issue because in GCP we still have kernel 5.19, and we need to verify, for several instance types, whether there is an issue or not.

@vladzcloudius we need your help on how to test it and identify, in the easiest way, where the IRQs are set correctly and where they are not.

@mykaul
Contributor

mykaul commented May 2, 2023

I reopened the issue because in GCP we still have kernel 5.19, and we need to verify, for several instance types, whether there is an issue or not.

@vladzcloudius we need your help on how to test it and identify, in the easiest way, where the IRQs are set correctly and where they are not.

The issue does not exist on GCP simply because the ENA driver is AWS-specific.
To see if the issue is there or not, you can run the performance tests and see if core 0 and such are way too overloaded - just as you've seen in the original issue. I suggest closing this one.

@michoecho
Contributor

To see if the issue is there or not, you can run the performance tests and see if core 0 and such are way too overloaded - just as you've seen in the original issue

The overload of shard 0 was an accidental effect, not a direct and predictable result of the bug. So checking for the bug via black box performance tests is a bad idea. (But of course if the issue is AWS-specific then there is no reason to do anything about it for GCP).

@mykaul
Contributor

mykaul commented May 2, 2023

To see if the issue is there or not, you can run the performance tests and see if core 0 and such are way too overloaded - just as you've seen in the original issue

The overload of shard 0 was an accidental effect, not a direct and predictable result of the bug. So checking for the bug via black box performance tests is a bad idea. (But of course if the issue is AWS-specific then there is no reason to do anything about it for GCP).

Of course - we can look at the interrupt mapping, which was supposed to happen but did not.
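
(For reference, a sketch of such a direct check, assuming the NIC is eth0; each column in /proc/interrupts is a per-CPU counter, so a single hot column means the IRQs were not spread:)

# Per-CPU delivery counts for the NIC's IRQs; with perftune working they
# should grow only on the CPUs reserved for networking.
grep eth0 /proc/interrupts
# The affinity masks perftune should have written:
for irq in $(ls /sys/class/net/eth0/device/msi_irqs/ 2>/dev/null); do
    printf 'IRQ %s -> %s\n' "$irq" "$(cat /proc/irq/$irq/smp_affinity)"
done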

@vladzcloudius
Contributor

vladzcloudius commented May 2, 2023

I reopened the issue because in GCP we still have kernel 5.19, and we need to verify, for several instance types, whether there is an issue or not.

@vladzcloudius we need your help on how to test it and identify, in the easiest way, where the IRQs are set correctly and where they are not.

This particular issue (a kernel bug) shows up as empty content of the
/sys/class/net/<NIC name>/device/msi_irqs/ directory.

When the kernel is working correctly, both in AWS and in GCP, this directory has to contain files named after the IRQs of the corresponding NIC, like I showed here: #13560 (comment)

I believe all you need to verify in the context of this GH issue is that the above holds.
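
(A minimal check along these lines, assuming the NIC is eth0; adjust the name as needed:)

# The bug manifests as an empty msi_irqs directory under the NIC's PCI device.
NIC=eth0
if [ -z "$(ls -A /sys/class/net/$NIC/device/msi_irqs/ 2>/dev/null)" ]; then
    echo "BUG: $NIC exposes no MSI IRQs - perftune cannot distribute them"
else
    ls /sys/class/net/$NIC/device/msi_irqs/
fi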

As for other perftune.py testing - let's have a separate context for that (a GH issue?) and I'll be happy to help.

@roydahan

@fruch
Contributor

fruch commented May 3, 2023

@vponomaryov verified that the IRQs are showing up on GCP n1-highmem-8, n1-highmem-16 and n1-highmem-32:
https://github.com/scylladb/qa-tasks/issues/1145#issuecomment-1533079418

@roydahan roydahan closed this as completed May 4, 2023